I am astounded that people would question citing an arXiv paper. If you get an idea from somewhere else, you cite it. This doesn't even have to do with the archive. I have on multiple occasions read papers where the author wrote a proof and then cited "private communications" with another mathematician as the source. In a business built on ideas, you should always cite the source. Anything else is extremely dishonest.
edit: To be clear, I am personally entirely okay with not citing papers containing ideas you were unaware of during your own formulation (though I think if you become aware of it, you should probably point out that it was previously independently discovered by someone else). Your paper may still have merit even if its idea isn't "new" (especially if the first paper is shit, as is often the case).
edit2: I personally don't see much wrong with Yoav Goldberg's blog post linked in this blog post. It's refreshing to hear his honest opinions out loud. As a graduate student I always lost a little sanity each time I read a paper with "great ideas" but terrible follow-through (i.e. explanation and proof of those ideas). I personally think that clarity of exposition is at least as important as (if not more important than) the novelty of an idea. However, you should still cite the sources of your ideas. Feel free to point out the source's flaws, but cite them nonetheless.
It just occurred to me that by quickly posting early in this thread and then continuing to develop my argument in further edits, I'm essentially guilty of flag-planting. The irony of doing so in this discussion (albeit subconsciously) compels me to point out my own hypocrisy. :)
I guess it's a good reminder that many of these habits aren't purposeful. Though I'd hope that people would stop and think a bit more than I just did before uploading papers to the archive (or publishing elsewhere). But I guess impatience afflicts us all...
At least in biology, a lot of journals have been cracking down on "personal/private communication" citations lately -- the point of a citation isn't just to give credit, but to tell the reader where to find the info, and chatting with your friend isn't a realistic option for that. If it is unpublished, and is important enough to be cited, then you should make the friend a co-author and include the info in your paper. That is a different case than an arxiv paper, though.
Of course the info needs to be included in the paper if it cannot be found elsewhere. I didn't want to imply otherwise. Whether the friend needs to be a co-author is of course up to them and certainly depends on the size of the contribution, but that is beside the point of both citing the source of a certain idea and making sure the reader is able to verify its validity.
In fact, I personally lean towards often including arguments even if they can be found in sources you cite. If you can provide a much clearer argument than the source (and it doesn't detract from your own work), you should include the improved one in your work. If it is a minor detail that is both worth citing as well as easy to include, then I think you should include it to avoid requiring the reader to hop around from paper to paper to get an understanding of your work. You should make your work as accessible to the reader as possible.
I'd argue that the latter purpose---allowing the reader to recover the original sources and arguments to your thinking---is more important than the 'giving credit' one. If a conversation or email exchange with someone was indeed critical to the development of your ideas, then you offer them co-authorship, not a "citation" to something not only unpublished but undocumented. If it was only incidental to the thinking, then no citation is necessary.
I will note that many a graduate supervisor or lab administrator has been offered (or simply taken) co-authorship for a far more scant contribution than a key conversation.
Five years ago (during my PhD) I would not have been "allowed" to cite a paper which had not been peer-reviewed and published in the proceedings of a recognised conference. Papers on arXiv were specifically blacklisted. Mentioning Wikipedia (even off the record, in personal meetings) would have been the start of a fast road to the bottom.
This dissonance seems to come from the fact that citing has two purposes
1. Assigning credit to where you heard about something.
2. Giving a claim (the best possible, and sufficiently strong) support.
From the credit perspective you should cite wherever you heard something, even if it was alleyway graffiti. From the support perspective, you should take the idea you read on the wall, look it up to see if someone credible has said the same thing, and then cite them; and if not, maybe not mention it at all.
When publishing a paper/book with a certain publisher there is an interaction between the prestige of the publisher and the prestige of the author:
An author publishing with prestigious publishers is considered prestigious; a publisher that publishes prestigious authors is considered prestigious.
Since publishers are commercial entities, their prestige is an asset they want to protect.
So there are other incentives at play when it comes to why citing ArXiv is currently not en vogue.
I always find it refreshing to see how pre-print driven research communities like physics operate in comparison.
"If you haven't climbed the ivory tower you don't get to speak in the ivory tower... Also, we in the ivory tower don't listen to those not in the ivory tower. Less we ourselves are cast down from our perched position."
Even during my masters I always felt that it was not done, or at least questionable, to cite wikipedia. However, wikipedia is often an excellent source for a first read on a new topic. The next step should then be to read wikipedia's sources, and slowly expand your view on the subject.
However, sometimes it happens that you read first about a topic on wikipedia, and worked something out based on the information you found there, before you had time to consult wikipedia's sources. In this case you definitely should cite wikipedia.
> However, sometimes it happens that you read first about a topic on wikipedia, and worked something out based on the information you found there, before you had time to consult wikipedia's sources. In this case you definitely should cite wikipedia.
I was advised to always cite the primary source, even if I learnt of something from a secondary source, e.g. a lit review or Wikipedia. The reason being that if the reader wants to follow up on it, it's the quickest path. I'd say it's also just good practice to credit the original authors for their work.
Citing Wikipedia also puts the reader in the uncomfortable position of either taking your word for something, or having to go through Wikipedia's sources themselves to verify something.
> Citing Wikipedia also puts the reader in the uncomfortable position of either taking your word for something, or having to go through Wikipedia's sources themselves to verify something.
The problem is that Wikipedia is a secondary source. If I cite Wikipedia, you need to go through the reference list at the end of the article to find the primary source for the fact that I cited. If I had instead cited that source directly you wouldn't need to go through this extra step.
No, I couldn't. In academia, when publishing a technical paper, only final/published papers count. I doubt that this was a "rule" enforced only at my University or even just in my country. I'd love to hear from others where citing arXiv was at least allowed, but I doubt that that would have been the "standard" way of citing in technical papers.
I must add that this was such a pain for me, as I found several relevant articles on arXiv and I could download and read them. I can't say the same for articles found on Elsevier or ACM, where the relevant articles were mostly in the journals to which my University did not have access...
This strikes me as the wrong way to think about citations.
Citations to a finalized version of a published, peer-reviewed article are "best", both in terms of assigning credit (this is what the authors are supposed to be producing) and as a pointer to more information for the reader (the article has been reviewed[0], it won't change, and there's a stable location for it). Work that isn't peer reviewed shouldn't be outright banned or ignored, but the citation should carry a lot less weight. It hasn't been reviewed, it's subject to change, etc. Since these are essentially someone's musings on a topic, when you cite a paper to "prove something" (e.g., you write "The work of XYZ et al. (2017) shows that <some confound> is not a problem"), people will give it correspondingly less weight.
There is a long tradition, predating arXiv by decades, of citing technical reports or "white papers". These are usually written up like a journal article, but might be difficult to publish (all negative results) or contain more details than a typical journal publication would allow. If there is a "journal" version and a "tech report" version, it would probably be better to cite the journal version, but I would be shocked if someone actively objected to including a tech report.
(In some disciplines, the white papers are also the only thing available. The World Bank and Federal Reserve, for example, often release white papers containing their own data. They rarely bother to publish them in a journal though).
I couldn't resist writing that so I read the article looking for something substantive to include as a sop to my conscience.
He didn't contextualize or establish the existence of the problem to my satisfaction. A naive reader would assume that he is arguing against academics who feel literally entitled to plagiarize pre-print publications. I don't buy it. I suspect that the actual debate he's engaged in is better characterized by the following three quotes:
"…many authors are peeved, pricked, piqued, and provoked by requests from reviewers that they cite papers which are only published on the arXiv preprint"
"Any time that our work follows … ideas from other people, and when we can reasonably be expected to be aware of this, we ought to cite the related work."
"If similar work comes to our attention during a proper literature review, we ought to cite it."
To which the counter-argument would be the following, from his closing passage:
"Many reviewers are abusing the system and asking for ridiculous comparison to recently-posted preprint papers…"
This acknowledgement comes too late to be given any useful answer, even though one can easily see it to be the core of a real problem; the reviewers, after all, are serving as gatekeepers to publication, and if one should not put "too much faith in … the overworked cohort of peer reviewers, roughly 30% of whom typically fail to even comprehend the basic outline of the paper," then it is probably extremely frustrating when someone insists that you cite papers that you haven't read in your bibliography as 'related literature', let alone papers you have read and dismissed as insufficiently important to refer your reader to, let alone papers that were clearly written by the reviewer's pet pony in crayon on a stable wall before being photographed and uploaded to the arXiv as uncompressed IMG files.
When I started my PhD I was even told not to cite books, regardless of their footprint in the field. They are/were regarded as some sort of "common knowledge". Many ideas come from such works, but one cannot cite them and must, instead, cite a previous work (which surely got its (base) ideas from the same books...).
What I'm saying here is that the "academic code of conduct" is a bit outdated.
I think this is alright as well. Many things do become "common knowledge". The book probably didn't come up with the original idea/argument anyway.
Of course tangential to the citing of novel ideas is the citing of material that helps explain your work. I think it's certainly a good idea to cite a source of common knowledge if that source does an especially good job of presenting that knowledge.
As with all things, this is a judgment call. Just try to be honest...
Again, it depends on the purpose of the citation. I believe no-one would have a problem with e.g.
"To solve this discretized Poisson equation we use the BiCGStab method \cite{vanLeersPaperAboutBiCGStab}, which is an iterative Krylov method; see \cite{SaadsTextbook} for a general introduction."
exactly. all this talk of reputation, authority, credit, etc.
the bibliography is an important part of your work. it's there to help an interested reader understand your work more fully by placing it in context, providing access to a more exhaustive discussion of the finer points, and assisting in their studies of related topics.
I don't think the issue is the act of citation itself, more that the citation is meant to provide a level of assurance that the idea you're citing is true. If you cite a paper in Science, that carries with it the knowledge that some clever people have looked over it and can't see any obvious issues with the work. If you cite a napkin sketch, that doesn't carry that weight and pushes the onus onto the reader to verify the cited claim themselves. In a paper with 100 citations that's not really a practical position to be in.
I'm not saying it's necessarily a good system (plenty of shit makes it into journals after all), but I can understand why an author would be hesitant to cite lots of non-peer-reviewed sources in a paper of theirs. Having said that, I guess that's really just a sign that you're on shaky ground if you're reliant on dodgy sources!
> If you cite a paper in Science, that carries with it the knowledge that some clever people have looked over it and can't see any obvious issues with the work. If you cite a napkin sketch, that doesn't carry that weight and pushes the onus onto the reader to verify the cited claim themselves.
I think no matter the source, you (as the author) have the final responsibility of citing correct work. You can (reasonably) choose to only cite Nature because it is a "safer" choice, but you can also cite other sources, though you should of course take care to vet that source well yourself. As you point out, a lot of shit makes it into journals and in fact many journals are themselves shit (and outright frauds), so adding arXiv as a source doesn't really fundamentally change anything.
As a side note, I believe that many papers cite way _too_ many papers. Unless you use or expand upon a paper's work, I think you shouldn't be citing it (except possibly as general background knowledge). I just can't understand how you can write a paper that explicitly does so with 100 previous works (a book, sure, but not a paper). Then again, my citation philosophy goes against that of many others (including my former advisor). Many think you should cite basically anything tangentially related (especially the works of the academic king makers). For me this is no longer important since I am no longer an academic researcher. I have the freedom to pontificate on the subject without any career worries. :)
I disagree with your statement about papers in Science. Pretty much every paper I read in Science has basic invalidating errors that would have been caught if it had more review. Papers in Science are fast-tracked to be published as soon as possible so the competitors don't get published in Nature first.
I'm not saying that it necessarily shouldn't apply everywhere else, but it's certainly a fact that in those industries the business is _not_ the idea (at least at that grand scale). The business is the timing and follow-through. Abstract ideas are a dime a dozen in most industries. The idea of providing an improved search engine, or simplifying financial transactions is relatively trivial (for each of those companies there were certainly hundreds of failed ones that all had the same grand idea). (This triviality is in contrast to the specific method--for example, PageRank--that was used in the case of Google.)
As a side note, why do you believe that it doesn't apply outside those industries? I'm sure if you asked Brin and Page, they would say they were inspired by (the limitations of) previous search engines. Ditto for Stripe and Paypal and Facebook and Myspace.
Academic ideas are (hopefully) much more specialized than these broad ideas. In business they are more comparable to patents, and for patents prior art is of course extremely important (enough so to invalidate your legal claim to your idea!). In contrast to patents, however, there aren't strong legal mechanisms to enforce priority of ideas, and that's why being honest and open about it (community policing of these ideals) is especially important. If academics were to entirely stop caring about citing their sources, academic research would probably totally cease to function.
Agree with this: if you take an idea from somewhere, cite it. There’s no cost to you personally or professionally for doing so, and you’re giving credit where it is due.
>A large number of seminal works have never been published. The greatest mathematics paper of our lifetimes remains unpublished.
Author here: Grisha Perelman's proof of the Poincaré conjecture has never been (by him, to my knowledge) submitted to or published in any journal. He decided, as is his right, that he could not care less about the professional community of publishing mathematicians or their protocols. Does not invalidate his achievement.
I should note, since I am not and do not expect to be the level of mathematician that Perelman is, I have not actually read his proof. So I defer to other superior mathematicians for this assessment and come by it as hearsay. :)
I think the logic that his work is unpublished is itself wrong: he published/posted his work for the world to see on arXiv. He has very strong opinions about the current scientific publishing status quo, hence he did not go through the usual route of submitting to a journal, but he did publish/post it on arXiv. Who are we to decide that posting your work in a blog post, on arXiv, etc. does not constitute publishing?
It becomes a problem if the people with the money want it. Grant proposals usually require you to list your important and relevant "published" papers.
There are actually two aspects to "published". One is archival, so people can expect to access the work decades later (whether they have to pay for that is another discussion). The second aspect is peer review, aka quality control.
Personally, I once submitted a paper to a workshop. After submission, peer-review, and acceptance the workshop committee decided that they will not publish proceedings. I could have submitted the paper elsewhere, which I find weird. Instead I published it as a techreport. However, it is now unusable for proposals, because a techreport is "not published" even if it is properly archived and went through peer review.
> However, it is now unusable for proposals, because a techreport is "not published" even if it is properly archived and went through peer review.
IMO, the definition of 'published' is a huge issue. I have always read published as archived, peer-reviewed research. Your paper is both, but remains in a 'not published' state, which I think is wrong and hinders future research.
Hey Zack, are there any intermediate/advanced "mathematics for machine learning" books you'd recommend? I find the classic recommendations are not exhaustive enough to cover the kind of math recent papers have started getting into.
You should keep in mind that probably none of those machine-learning researchers has studied only math specific to that domain, so their papers are likely to include whatever math they have a background in, plus any new techniques they had to learn to get their results.
That said, everything I saw in the papers you linked was linear algebra, calculus or probability theory plus the usual smattering of background notation and set theory.
Once you have a solid background in those areas, it is likely more productive to look up the specific concepts mentioned in a paper (such as the Kullback-Leibler divergence or the Bellman equation), because by then you are probably too deep in the woods to find one resource that adequately covers all those different directions.
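For what it's worth, those two examples are compact enough to just write down; a quick sketch in standard notation (my own rendering, so double-check against whatever paper you're actually reading):

    % Kullback-Leibler divergence between discrete distributions P and Q
    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

    % Bellman optimality equation: value of state s under reward R,
    % discount factor \gamma and transition probabilities P(s' | s, a)
    V^{*}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big]

Writing the formula down is the easy part; the surrounding theory (e.g. why the KL divergence is not a metric, or when the Bellman equation has a unique fixed point) is what usually sends you back to a textbook or the cited prior work.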
That's mostly linear algebra, probability theory and calculus. You're going to have a difficult time self-studying all of that if you haven't had much exposure to it.
Books are probably a less efficient method of learning the mathematics if you have targeted subjects you want to learn about. They're typically suited to introductions and breadth-wise coverage of fields, but once you get higher up, "linear algebra" (for example) can get fuzzy with things like abstract algebra. That means you'll end up with several tome-like books to work through which can be productive, but it'll take a while and you'll need to map the material to the applications you're interested in on your own. It's more efficient to develop a good baseline of understanding about a broad subject area, learn the foundational theorems, then move on to the specific areas you need to learn. This is typically doable if you've developed the requisite mathematical maturity overall and if you have learned the "essentials."
Practically speaking: maybe pick up foundation texts like Strang's (linear algebra), Spivak's (calculus) and Ross' (probability theory). You're going to want a solid foundation in analysis before moving on to higher order probability theory, so drill down on that after you do a refresher on the calculus. From there you should attempt to read each paper (even if you struggle a lot), take notes on what confuses you or doesn't make sense, read the prior art on those topics and then come back to it.
I don't particularly read machine learning papers often, but I read mathematical cryptographic ones very often (at least once per day I find myself in a new one). It's not typical that I read a research paper introducing a novel primitive or construction where I follow the math immediately on a single pass, and I often come across things I need to read about first. From a thirty thousand foot view the math for both of these subjects is broadly similar in rough topical surface area, so I think this methodology for academic reading is fairly applicable to most subjects that involve a lot of mathematics understanding.
Basically: don't approach learning the heavy math with a monolithic, brute-force approach as if you were in university. That's a slog and it's demotivating. Learn the minimum foundation for each area you need, then proceed to more advanced topics as you need them.
For calculus books, why Spivak over Swokowski? Can you compare and contrast and suggest why someone might prefer one over the other? I don't have a preference myself, but it would be good to understand the differences.
The linear algebra is obviously key, and I wish we'd done more of that in my advanced high school classes instead of elementary analysis.
Thanks a lot for your comment. I do have exposure to all the three topics. I self-studied with Strang's MIT OCW course in high school, took calculus and probability in high school and undergrad. So, I'm not really looking for big introductory books for two reasons, I don't really have time to go through big books, and since I already have some exposure, it becomes hard to find new things to learn from such introductory books. So, I was looking for something more concise which efficiently covers such mathematics.
EDIT: I think the main topic missing from my background is this so-called "analysis". I never formally studied it. Is there a more efficient way to study analysis than Spivak's, for someone who has a decent background otherwise?
Analysis is basically "really rigorous calculus". Basic analysis courses are also usually where you learn to do proofs.
(To some reasonable generality "calculus" stands for "rules of manipulation", while analysis is the mathematical theory of calculus. So I can teach you stochastic calculus in a couple of two-hour sessions but understanding what the hell is going on (stochastic analysis) requires measure theory, some functional analysis and much courage)
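A concrete way to see the difference (my own toy illustration, not from any particular textbook): calculus hands you the manipulation rule, analysis pins down what the rule actually means.

    % calculus: the rule of manipulation
    \frac{d}{dx}\, x^{2} = 2x

    % analysis: what "derivative" and "limit" actually mean
    f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h},
    \qquad
    \lim_{h \to 0} g(h) = L
    \;\iff\;
    \forall \varepsilon > 0\ \exists \delta > 0 :\; 0 < |h| < \delta \implies |g(h) - L| < \varepsilon

Getting comfortable producing (and reading) those epsilon-delta arguments is most of what a first analysis course is about, which is also why it doubles as a proof-writing course.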
I don't know why, but people always seem to forget that optimization is an important topic in machine learning that requires study. Boyd's book is the canonical source (and free online). If you want to get some functional analysis background at the same time, you can look at Optimization by Vector Space Methods. It's an older book but it is still worth a read and provides more theoretical foundations than Boyd.
> I don't particularly read machine learning papers often, but I read mathematical cryptographic ones very often (at least once per day I find myself in a new one).
Where do you find new ones? I'd like to get into this.
The IACR eprint archive, which is essentially arXiv for cryptography. Essentially everything worth reading in cryptography is either in the IACR eprint archive or a conference proceeding. All conference proceedings from the IACR conferences can be read online for a fairly cheap membership fee. More often than not everything is cross-posted to the eprint archive even if it's published in a journal (which there's basically one: The Journal of Cryptology) or a conference.
I tend to view arXiv as mainly an aggregated repository of documents that would ordinarily be tech reports. It is accepted practice to cite tech reports when the paper author is aware of them and they are relevant. I don't really see how an arXiv paper should be any different.
Seems like this is a case of "whatever you measure will be gamed": counting citations is an important part of how academics are evaluated at work, so we get flag-planting behavior to maximize that metric with minimal effort.
There's a similar issue in journal publishing: counting "published works" without regard for where leads to journals that will publish literally anything for cash.
Yes, this is a thing. The existence of a citation format isn't a blanket endorsement of its use in all ways in a scientific work. For example, in linguistics I could cite a specific usage example from a (non peer-reviewed) webpage. On the other hand I wouldn't want to cite a (non-peer reviewed) webpage that outlines a specific linguistic theory.
The main problem is giant publication latency. Typically there is at least a 6-month delay between submitting a paper, getting it reviewed, and it actually being published with a proper DOI etc. These days 6 months is a loooong time.
I wish there was some way to generate all relevant bib information as soon as a paper gets accepted, which could then be added to arXiv immediately. This would allow folks to distinguish between peer-reviewed papers and those which are submitted only for flag planting.
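Until something like that exists, the distinction at least shows up in the BibTeX entry itself; a rough sketch of the two cases (the author, title, identifiers and DOI below are placeholders, not a real paper):

    % preprint only: all you can point to is the arXiv eprint identifier
    @misc{doe2017exampleArxiv,
      author        = {Doe, Jane},
      title         = {An Example Result},
      year          = {2017},
      eprint        = {1701.00001},
      archivePrefix = {arXiv},
      primaryClass  = {cs.LG},
      note          = {Preprint}
    }

    % after acceptance: the venue and DOI signal that it went through peer review
    @inproceedings{doe2017example,
      author    = {Doe, Jane},
      title     = {An Example Result},
      booktitle = {Proceedings of Some Conference},
      year      = {2017},
      doi       = {10.0000/placeholder}
    }

The eprint/archivePrefix fields are a common convention for arXiv references, so a reader (or a reviewer) can at least tell at a glance which kind of source they're looking at.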
I don't see how this is even a question. A paper "published" to arXiv is published, in the more general sense of the word. Just because it isn't "journal published" doesn't change anything.
Publishing in a (reputable) journal generally means some degree of peer review. While noisy, this process generally means that really outrageous methodological errors or theoretical claims get weeded out. For a good paper, it means that other researchers have pressed them on specific aspects of the work, which often produces stronger work (new methods, better baselines, clearer argumentation, clearer math).
I adore arXiv but still believe it's a preprint. In my field (cognitive science) it would be great if we had more methods to sidestep Elsevier and the other commercial publishers and have an open stack with rigorous peer review (PLoS being the main way currently).
> While noisy, this process generally means that really outrageous methodological errors or theoretical claims get weeded out. For a good paper, it means that other researchers have pressed them on specific aspects of the work, which often produces stronger work (new methods, better baselines, clearer argumentation, clearer math).
I see that as all true, but irrelevant in this context. If you source material from a pre-print on arXiv, then you should cite it. Seems totally obvious to me. Of course you would prefer the final, published paper if it's available. But that wasn't the question at hand.
And even with all that said... I would argue that in some fields (cs / ml / etc.) we're getting close to a point where arXiv itself is becoming almost a parallel publishing mechanism where people cite/publish completely within the arXiv realm, with less regard for "traditional" journals and what-not in general. Especially when you factor in papers from researchers who come from industry, as opposed to academia, and care less about some of the normal trappings of academic publishing.
> I adore arXiv but still believe it's a preprint.
Of course it's a pre-print. I didn't contend otherwise. I'm just saying that, from my perspective, it's obvious that you should cite a pre-print if it's relevant.
I will allow though, that norms probably vary from field to field, and as a non-academic, my take is likely different from, say, somebody who is deeply immersed in academia, pursuing tenure, etc.
> While noisy, this process generally means that really outrageous methodological errors or theoretical claims get weeded out.
There is a fair amount of evidence that this isn't true. In general, most statistics in scientific research aren't done by statisticians, and there are whole classes of methodological errors that are regularly not caught because the "peers" have the same lack of statistical education as the people whose papers they are reviewing.
It's a bit tougher when a paper hasn't been peer reviewed, although not too different from a good paper published in a low quality or unknown conference. I think you should cite any idea you pick up from a paper when the said paper has some of the following qualities:
- novelty
- technical correctness
- clarity
- good experimental evaluation
Novelty is the most important: if without the citation your paper looks like the original idea, this is plagiarism.
If the paper has big shortcomings that your paper addresses, it is fair to give yourself the credit you deserve of course, but it doesn't harm to cite the other paper; in fact it gives a way to give some sort of peer review: In [1], Foo et al. attempted to explore <subject> but the experiments were inconclusive/the technique sucked compared to the state of the art/they didn't explain how they did it... In this paper we did this and that and it gives us awesome results (said in a nicer way).
"If you built on it, cite it", is necessary but does not suffice. Most papers have a related work section that describes all work similar to your paper. In hot areas, such as deep learning, some of the related work may have been done in parallel with your work, so it did not inform it. If this related work is uncited on arxiv, you find it when you are about done, and it had no influence on your work, do you cite? Reviewers sometimes demand this.
I've been told a rule of thumb is to cite a related unrefereed arXiv paper if it has been cited six or more times, the justification being that once it has some citations it is somewhat well known.
Of course you cite it. You want to help the reader find the related work. It doesn't matter whether it "had influence" on your work (a fuzzy criterion). If you aren't happy that you are doing the same work as others, then find a more original problem.
It definitely does not matter how many citations the other paper has. The point isn't to avoid getting caught, it is to inform the reader. Your citation is more useful the less well known the cited paper is.
How is "flag-planting" even relevant to this ? You only cite what you use and you should not be using vapid hollow flag-planting sources from _anywhere_. Someone needs a fresher course on academia 101. Is it the reviewers ?
The issue is that some people (appear to be) submitting very preliminary, and arguably low quality work to arXiv to stake a claim to some area. Once that work is "out there", people who were making a more serious effort to do things more carefully are obligated to cite the original work, presumably as part of the related work/background/etc.
I can see how this would be maddening, particularly if you started before the flag-planting paper was even written.
> Yes, of course. Any time that our work follows, copies, or borrows ideas from other people, and when we can reasonably be expected to be aware of this, we ought to cite the related work.
> We should not have to cite nonsense. Many reviewers are abusing the system and asking for ridiculous comparison to recently-posted preprint papers. Bald-faced flag-planting should not be rewarded. And we should not be faulted by reviewers for failing to compare against 2-week old algorithms that may or may not work.
So what position is the author advocating? Citing or not citing?
Include the next sentence for the first quote: Yes, of course. Any time that our work follows, copies, or borrows ideas from other people, and when we can reasonably be expected to be aware of this, we ought to cite the related work.
Something being on arXiv or in a blog post or … doesn't excuse not citing it if it influenced your work. It's important to document where your ideas and data come from, both to give credit to the author and to allow others to evaluate what you base your claims on for themselves.
The second quote is about stuff that didn't influence your work. While you are expected to keep up with and document related developments, forcing authors to constantly update references to new, not yet properly evaluated work just because it makes some related claim doesn't make sense.
> The second quote is about stuff that didn't influence your work. While you are expected to keep up with and document related developments, forcing authors to constantly update references to new, not yet properly evaluated work just because it makes some related claim doesn't make sense.
It doesn't say that though, it just says that you shouldn't have to cite nonsense on arXiv. This implies that you can read a paper, implement something similar, and afterwards decide that the paper was nonsense, didn't influence your work, and shouldn't be cited.
> Many reviewers are abusing the system and asking for ridiculous comparison to recently-posted preprint papers. Bald-faced flag-planting should not be rewarded. And we should not be faulted by reviewers for failing to compare against 2-week old algorithms that may or may not work.
The context seems pretty clear to me. Your example clearly is covered by the first case: if it influenced your work, you cite it, even if it is "nonsense". You can't just "decide" something didn't have influence if it had. (These rules do not prevent cheating, they are guides for people acting ethically)
The second rule is to prevent the opposite case: you shouldn't be forced to create the impression your work is based on or just a mere repeat of someone else's who "had the idea first" when they have no good claim to that, or that it is inferior to something that hasn't been shown to be actually better.
I don't get the distinction. Nobody expects you to cite something which you didn't read, or which didn't influence your work.
The question here is whether you should cite something which you did read, and does relate to your work, even if it's a shitty flag-planting paper.
If you think it's shitty, then you can cite and dismiss it in a sentence. You can dismiss 30 papers in a single sentence if you like. There's no requirement to wax lyrical for 3 paragraphs about a paper just because it was first. But it strikes me as dishonest to advocate that sole researchers become arbiters of a paper's merit, citing or not citing it at their personal discretion.
Also, if their paper was published first then that's the only claim necessary to demonstrate that they were first out with the idea.
The point is exactly that it happens that reviewers demand comparisons with other work you haven't read yet. And while missing an established publication that influences your findings is a fault on your part and totally fair critique, "missing" something that didn't exist when your work happened is obviously not something you can control.
> Also, if their paper was published first then that's the only claim necessary to demonstrate that they were first out with the idea.
To quote myself: impression your work is based on or just a mere repeat of someone else's, not just being first. Ideally, everyone looking at you referencing it would take note that it was published months after you started work and your work was independent (or even earlier), but that easily gets lost.
Should you be encouraged to throw out every idea and snippet to arXiv just so you can claim "FIRST!" in case it turns out to be useful/true, over "competing" works that spent more effort on quality and verification and are now in peer-review forced to reference you as the pioneer (even if you maybe had the idea months later, but rushed it out and got lucky with it holding up)? That's what the "flagplanting" is about.
It's clear enough to me. In a well written article, he's advocating a middle way.
On the one hand, if you borrow an idea from another person or source, then you should cite it. Just because it's only on the arXiv doesn't give you a pass not to.
On the other hand, "flag-planting" articles are not something that you borrow from so you don't have to cite them - and you shouldn't as the practice should not be rewarded.
As CogitoCogito points out, it's routine to cite completely unpublished material such as private communications.
Conversely, if a flag-planting article somehow makes it into a very prestigious journal, then you can still ignore it.
So really, the publishing status is only an initial filter, and the potential source should always be judged on its merits.
But this implies that you could re-implement/modify/publish something from a paper and just not cite it because you deem it to be nonsense. That's surely dishonest. If you've read something which closely relates to your work then it's your responsibility to make this clear, whether you like the paper or not.
If it doesn't relate to your work, or you haven't read it, then obviously it needn't be cited.
I did, I don't get it though. I think it's dishonest to read something and decide yourself that it's not worth mentioning, even if your work relates to it in some manner.
To take an example, read the article linked to in the OP's post. The author describes how terrible two papers are (implying that they're not worth citing), only for one paper's author, and other researchers, to come on and tell him why he's wrong about his interpretation and understanding. This leads to him retracting his claim that it wasn't worthy of merit.
So immediately you have an example of a sole researcher deeming himself to be the only judge of merit necessary, only to be wrong. His judgment has a 50% failure rate already, and that's with him cherry-picking 'bad' papers.
> So immediately you have an example of a sole researcher deeming himself to be the only judge of merit necessary, only to be wrong.
If you have chosen to publish on arXiv, then you have chosen to step out of the peer-review route. The author of the linked article did not deem himself to be the only judge of merit necessary, his position as sole reviewer came about through the decision of the papers' authors to publish on arXiv, and it seems the article's author would have preferred it if the papers had been well-reviewed before publication. You are not advocating for arXiv papers to be immunized from evaluation, are you?
It seems that we are rediscovering why the peer-review process, with all its flaws, was created in the first place. Complex problems rarely have simple solutions.
I must be missing some nuance of the argument here. If it is nonsense then why is it in his paper? If you are citing poor quality sources then that tells us something about your paper, and to not do so would be dishonest. We're not seriously saying that you should be trawling for ideas similar to your own and then citing them, just that where you have used other work you must give credit.
There is a massive disjunct between citation in the humanities and citation in science. People seem to have forgotten this. Ideas are two a penny; they literally do not matter at all and should not be cited. The person who introduces a concept to science deserves no credit whatsoever. What deserves credit is the provision of evidence or proof.
What? Leibniz/Newton introduced the concept of calculus; shouldn't we credit them with this (amazing) concept just because it wasn't formalized very well until Weierstrass came along?
In my view, Leibniz / Newton did enough heavy lifting to deserve credit for developing the field. This is way different than just blurting out a "concept" without even knowing if it's workable.
I'm in agreement that ideas are a dime a dozen. Sure, it's necessary to cite the first known mention of an idea, and there are situations where the first mention of an idea is important such as in the patent system.
"Planting" happens in my world all the time. Unfortunately, managers give a lot more importance to "ideas" than they are really worth, because they over-value their own interventions in general. Somebody will blurt out an idea in a meeting, wait until someone else has developed it, and then rush in to take credit. If a manager does this, it's a blow to morale. I have my own rule of thumb, which is "show your work" from math class. Just writing down the answer doesn't get you full credit.
A couple of historical examples: The ancient Greeks are credited with the atomic theory, but they had no concept of even turning it into a serious hypothesis. Lots of ideas are anticipated in science fiction, but do those authors really deserve credit?
I agree, both Newton and Leibniz worked out chunks of method and demonstrated their utility. Interestingly, Newton's alchemical approach to publication is rather similar to an arXivist's... bits and snippets! In the end we reference the formal publication (if we are developing fundamental changes to calculus, which would be f*ing impressive... or writing about science!)
> Somebody will blurt out an idea in a meeting, wait until someone else has developed it, and then rush in to take credit.
I see what you mean, but this is mostly office politics, which is not very related to the academic citation process. If an author makes a conjecture that stimulates further work (even if it's just on the ArXiv), he deserves to be cited.
> Lots of ideas are anticipated in science fiction, but do those authors really deserve credit?
Well, I guess it depends on how much the idea is fleshed out. Lucian was probably the first author to conceive of space travel, but it was just "people land on the moon" (still pretty far out for his times, though). Meanwhile, Asimov's three laws of robotics depict a reasonable control scheme for e.g. an autonomous vehicle, so if they are somehow implemented their author should be credited.
Write a good paper and cite, cite, cite! If your citations are ever demonstrated to be incorrect or fraudulent other researchers can continue work to disprove the citations and work on correcting the errors.
Good article. We want to incentivize researchers to share useful ideas in a timely manner, and allow others to trace the genealogy of these ideas to stimulate their thinking and avoid dead ends. A healthy citation culture helps to build the shoulders of giants.
What if somebody independently came to similar or weaker results, and only then read about the earlier research, which possibly has more results?
Given the amount of information available, it could often be the case of independent research into something which is known and available for some time. If one only learns about similar - and possibly greater - results after making one's own, and wants to talk about the work done - should one cite other, possibly earlier, works?
Please set up a redirect to the canonical URL. This is helpful to third party systems that interact with your site, such as search engines and content sharing sites. It also eliminates the issue you raised.
Redirect what to the canonical URL? The entire tag? Why, just to fix this HN submission? The proper solution here is to shoot an email to hn@ycombinator.com (which I've done).
Redirect the tag to the particular article that it currently references. Only show users the "correct" url that you want them to use in the future (history, bookmarks, sharing, etc).
The tag [0] doesn't currently reference the article, it references a collection of articles. But in this instance there is currently only one article in the collection.
When there are two articles with the tag, they will both be shown on the 'publishing' page, in their entirety, newest first. See [1] for an example.
Redirecting would be akin to Google redirecting you to the current top result (like I'm feeling lucky, but for all searches)
Bruh, the tag is a page containing a list of articles that are tagged with that tag. Redirecting to the latest post with the tag would defeat the entire point of having that page.
My masters thesis is based on a single arXiv paper. I didn't know that arXiv had this questionable reputation, but it sure explains a lot.
The project is to make an FPGA implementation of a technique presented in an arXiv paper. The paper had some big gaps, so I had to spend a lot of time researching the technique, and I had very little time to spend on the actual implementation.
arXiv is just a repository for papers to make them available before they are published/peer-reviewed. Just because a paper is on arXiv does not make it bad, but it doesn't mean that it is high quality either. Typically the people in the field will know which papers are important.
It's not bad, there are just some gaps. I expect a follow-up paper which explains everything thoroughly. I feel like my masters thesis should have started after that paper.
There are two kinds of "cite" here. Citing a non-academic source is different from citing a (published) paper; you should cite anything that precedes your work in the second sense, whereas the first is only obligatory for work that you actually took something from. If you work on calculus you're obliged to cite Leibniz even if you didn't read him, but you are not obliged to cite Newton's unpublished work unless you read it. Unpublished arXiv papers fall in the latter category.
> Unpublished arXiv papers fall in the latter category.
In CS that almost certainly isn't true. I'm most familiar with the NLP field, but there, if you have some kind of embedding of your words/tokens/sentences/something you cite https://arxiv.org/pdf/1301.3781.pdf (Word2Vec, Mikolov).
That paper says there is a follow up paper published at NIPS2013, but I don't think I've ever seen that published.
The field just moves too fast to wait for conferences anymore.