Suppose we had an index of snippets, meaning you've parsed them and are able to search isomorphically. So, e.g. variable names are not significant. Some techniques discussed[1].
If we then ran that against source repos, we could get update notifications for copypasta'd code.
"In file F at line L, it looks like you used some code from SO at revision R. In revision R', it's been corrected."
SO copypasta is better than NPM, because no one can change the code snippet to steal bitcoins once you've copied it into your code base. It's much more secure than a mutable database.
> Just look at the Left-pad thing, or the event-stream thing.
Those prove that we could see the problem. Brokenness doesn't go away when you grab a snippet of code or reinvent the wheel, you're simply unaware of how much of it is buggy or broken.
What do you mean by this? As far as I understand, NPM provides access to packages, not snippets, and as far as I know it doesn't provide a way to search the code in those packages, let alone isomorphically.
A lot of npm packages aren't longer than a typical stackoverflow answer, and they get used everywhere, to the point where installing a dozen packages can lead to tens of thousands of sub-packages being installed.
At that point, the packages are essentially "indexed snippets" of code.
> I qualitatively analyzed the top 50 clones in that list and was able to identify the source (or at least a source) of the snippets in most of the cases.
No, it doesn't. It has the exact same rounding bug described in the article.
The article even mentions:
> FWIW, all 22 answers posted, including the ones using Apache Commons and Android libraries, had this bug (or a variation of it) at the time of writing.
In the end, the number will be displayed to one decimal place, e.g. "1.1 MB" or similar, due to the "%.1f" format specifier.
If the input is 999999 bytes, the loop will see that it is less than 1 MB and so will format the number 999.999 into "%.1f kB". When this number is rounded to one decimal place as part of that formatting, it rounds up to "1000.0 kB". This is the wrong output.
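To make that concrete, here is a minimal C sketch of the pattern being described. It is not the code from the article; the function name `naive_format`, the SI (power of 1000) thresholds and the unit list are all my own assumptions for illustration.

```c
#include <stdio.h>

/* Hypothetical naive formatter: choose the unit first, then let "%.1f" round. */
static void naive_format(unsigned long long bytes, char *out, size_t out_size)
{
    static const char *units[] = { "B", "kB", "MB", "GB", "TB" };
    double value = (double)bytes;
    int unit = 0;

    /* Unit selection looks at the unrounded value... */
    while (value >= 1000.0 && unit < 4) {
        value /= 1000.0;
        unit++;
    }

    /* ...so the rounding done by "%.1f" can still push 999.95 and above
       up to "1000.0" after the unit has already been fixed. */
    if (unit == 0)
        snprintf(out, out_size, "%llu B", bytes);
    else
        snprintf(out, out_size, "%.1f %s", value, units[unit]);
}
```

Calling `naive_format(999999, buf, sizeof buf)` leaves "1000.0 kB" in `buf`, while 1500 comes out as "1.5 kB" as expected.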
There's a bit of a catch-22 because ideally, you would be able to do the rounding first, and then see what units need applying, but you can't do the rounding until you know what the divisor will be.
The article solves this by first manually determining what the cut-off should be (it will be some number like "...999500...") but personally I'd probably just decide to round to significant figures instead so that you can cleanly separate the "rounding" and "unit selection" steps.
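A rough sketch of what I mean by rounding to significant figures first, in C. This is my own variant, not the article's fix: the names `round_sig3` and `format_sig3`, the choice of three significant figures, and the SI unit list are assumptions, and the output format differs from the "%.1f" style discussed above.

```c
#include <stdio.h>

/* Step 1: round to 3 significant figures using integer arithmetic only. */
static unsigned long long round_sig3(unsigned long long n)
{
    unsigned long long scale = 1;
    while (n / scale >= 1000)          /* shrink until at most 3 digits remain */
        scale *= 10;

    unsigned long long q = n / scale;
    unsigned long long r = n % scale;
    if (scale > 1 && r >= scale / 2)   /* round half up */
        q++;
    return q * scale;
}

/* Step 2: pick the unit for the already rounded number, then print it. */
static void format_sig3(unsigned long long bytes, char *out, size_t out_size)
{
    static const char *units[] = { "B", "kB", "MB", "GB", "TB", "PB", "EB" };
    unsigned long long n = round_sig3(bytes);
    unsigned long long divisor = 1;
    int unit = 0;

    while (n / divisor >= 1000 && unit < 6) {
        divisor *= 1000;
        unit++;
    }

    if (unit == 0) {
        snprintf(out, out_size, "%llu B", n);
    } else {
        double v = (double)n / (double)divisor;
        /* n has at most 3 significant digits, so these precisions only
           print what is already there and do not round again. */
        int decimals = v < 10.0 ? 2 : (v < 100.0 ? 1 : 0);
        snprintf(out, out_size, "%.*f %s", decimals, v, units[unit]);
    }
}
```

With this, 999999 is rounded to 1000000 before any unit is chosen, so it prints as "1.00 MB", and 999499 prints as "999 kB".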
Oh, that makes sense, although I'm not sure "1000.0 kB" is strictly wrong in this case.
With the loop at least it's easy to adjust the thresholds if that is desirable, although a comment will be necessary to explain why you are making the cutoffs in weird places.
Interestingly enough in the past I've used a loop similar to this:
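static const char suffix[] = { ' ', 'k', 'm', 'g', 't' };
int magnitude = 0;
/* value is a double holding the byte count; stop at the last suffix */
while ( value >= 1000 && magnitude < 4 )
{
    value /= 1000;
    magnitude++;
}
printf("%.1f%cB", value, suffix[magnitude]);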
Which is bad because it's subject to repeated rounding from the division, but avoids the problem you described.
You run into the same problem if you are writing something like the C 'itoa' function (integer to ascii); if you want to write the digits out front to back you need to know what divisor to use for the leading digit so you need to either look it up in a table or take the log.
Taking the log is a lot slower than the table lookup, I found that out the hard way.
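For illustration, a minimal C sketch of the table-lookup variant for unsigned 32-bit values. The names `powers_of_ten`, `digit_count` and `utoa_forward` are hypothetical, and this is not the code I actually benchmarked.

```c
#include <stdint.h>

/* Powers of ten that fit in 32 bits; index i holds 10^i. */
static const uint32_t powers_of_ten[] = {
    1u, 10u, 100u, 1000u, 10000u, 100000u,
    1000000u, 10000000u, 100000000u, 1000000000u
};

/* Number of decimal digits in v, found by scanning the table (no log10). */
static int digit_count(uint32_t v)
{
    int digits = 1;
    while (digits < 10 && v >= powers_of_ten[digits])
        digits++;
    return digits;
}

/* Writes v front to back; buf needs room for 10 digits plus the terminator. */
static void utoa_forward(uint32_t v, char *buf)
{
    for (int i = digit_count(v) - 1; i >= 0; i--) {
        uint32_t d = v / powers_of_ten[i];   /* leading digit at this position */
        *buf++ = (char)('0' + d);
        v -= d * powers_of_ten[i];
    }
    *buf = '\0';
}
```

The table gives both the digit count and the divisor for the leading digit, which is what the front-to-back order needs.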
People convert so many integers to ascii and it is shocking how slow ascii <-> binary numeric conversions are compared to binary numeric operations, so it's not a matter of "premature optimizations".
Now you can write an itoa which generates the digits from back to front and not have to worry about copying the results because you return a pointer to the middle of the result buffer but then memory management gets more complex...
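A minimal sketch of that back-to-front approach in C, for unsigned 32-bit values only. The name `utoa_backward` and the `UTOA_BUF_SIZE` constant are my own for the example.

```c
#include <stdint.h>

#define UTOA_BUF_SIZE 11  /* 10 digits for a 32-bit value plus '\0' */

/* Fills buf from the end and returns a pointer to the first digit,
   which usually lies somewhere in the middle of buf. */
static char *utoa_backward(uint32_t v, char buf[UTOA_BUF_SIZE])
{
    char *p = buf + UTOA_BUF_SIZE - 1;
    *p = '\0';
    do {
        *--p = (char)('0' + v % 10);   /* least significant digit first */
        v /= 10;
    } while (v != 0);
    return p;
}
```

The memory-management wrinkle is visible here: the caller owns `buf`, but the string starts at the returned pointer, so the two must be kept apart (for example, only `buf` may ever be passed to `free`).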
I feel like this would be a good argument in favour of small scoped packages like we sometimes see on npm. Often enough, a seemingly trivial code snippet like this turns out to be not so trivial after all.
edit:
The point being that you lose all connection with a snippet after you copy+paste it. I can clearly see benefits when you centralize its development, make use of the collective mind to harden it, and get notified about possible updates whenever an edge-case is found.
More languages could adopt that idea, and a good StackOverflow answer would include those tests in the snippet. StackOverflow might even automatically run the tests and add a passing/failing badge!
What I would really like is for unit tests to become full-fledged features of a language. Any object can contain a Test method (which would be static); this method contains all the unit test code for that object. Select "Run tests" from your compiler, and it compiles everything and goes through calling any Test methods it found, but the main entry point is never called. A release compile doesn't link the Test methods, nor any method marked [Test] (support functions only needed for testing).
> I feel like this would be a good argument in favour for small scoped packages like we sometimes see on npm.
Rather, I think this is an argument that this kind of functionality should be in the standard library; perhaps in the equivalent of `*printf` for each language.
There are no issues with languages, packages, or whatnot.
Random code snippets from the internet are obviously completely unsafe. There is therefore basic "due diligence" to apply when considering using one such snippet:
1. Very carefully read the code to understand it.
2. Test it (corner cases/threshold values are the trivial things to test for such a piece of code doing conversions)
In general I do not copy-paste code snippets. I use them as examples of how to perform a task or how to use an API, then I write my own code.
This also avoids IP issues.
Then it's never possible to use any package/module/plugin anywhere. I get the danger, but I'd rather have convenience than write every function from scratch.
There is a very big, and obvious, difference between using a plugin published by a well-known source, and a random code snippet posted by a random person.
This does not tell you anything about the source of the snippet and, almost by definition, people who blindly copy snippets from SO are likely not experts in the field.
On the other hand, when I download and use Openssl (for example) I am reasonably confident that the code was developed and scrutinised by people who know what they are doing.
No, absolutely not! I wholeheartedly despise npm. Whenever I try to install a small node app to try it out, npm literally creates tens of thousands of directories, and that's not okay for any reason! This is a risk worth taking.
In what language is it any different? I tried to help edit the Rust docs. It downloaded > 100 packages and thousands of files. I tried to use some command via brew, and it downloaded 15+ dependencies, each derived from 100s or 1000s of files.
I agree I don't like the risk but is npm more risky?
C and C++, which haven’t made the decision to bundle a package manager with a programming language (which is dubious IMO because they are almost completely unrelated concerns), and for which you’re normally supposed to get dependencies from your curated, maintained, OS-provided repositories.
NPM creates a node_modules folder and fills it with the libraries that your app has specified. Each of those libraries then has its own node_modules folder, into which NPM installs that library's dependencies, and this happens recursively, which is absolutely crazy. The directory structure is A-5.0/B-2.0, C-3.0/B-2.0, D-3.0/B-2.0, which leads to B-2.0 being duplicated three times even though it is the same version. Almost every other package manager uses a completely different strategy. First of all, every package gets a globally unique identifier (in NPM, package identifiers are relative to the node_modules folder, of which there are many). Usually it is the name of the package, and if a package manager needs to support multiple versions of a library within the same program, it just adds the version to the identifier. This means that if you need B-2.0, it would be stored in node_modules/B/2.0/ and libraries A, C and D would all use that single copy.
The NPM community is well known for their one-liner packages with most of the work done in the dependencies. If that one-liner has 5 dependencies and your project transitively depends on it 5 times through react or something, then you end up creating 5 times as many files as are needed. It's very easy for a trivial application that uses NPM to have a million files in the node_modules folder.
I forget that everything I copy from SO, and everything I post there, is under a CC BY-SA license. That SA is "share-alike" and I don't think people really understand what that means. From Wikipedia's article on it:
"These licences have been described pejoratively as viral licences, because the inclusion of copyleft material in a larger work typically requires the entire work to be made copyleft."
Now how much code uses something copied from SO? And I wonder how copyright even applies to "code snippets"?
The important point here is that there's no minimum length for code to be copyrightable. It simply needs to be original and at least minimally creative. Since at least thousands of other developers have found the snippet to be useful enough to directly borrow rather than writing an equivalent, it sure looks copyrightable to me.
Thanks for that link! I found this part especially interesting:
"In particular, the laws stress that it is a programmer’s expression of some functionality that may be protected by copyright, and not the functionality itself. If code embodies the only way (or one of very few ways) to express its underlying functionality, that code will be considered unoriginal because the expression is inseparable from the functionality. Similarly, if a program’s expression is dictated entirely by practical or technical considerations, or other external constraints, it will also be considered unoriginal."
Sounds like a case that at least some snippets aren't copyrightable.
I don't understand how the above principle can distinguish anything, at all.
You could reasonably argue that every piece of code is completely and only expressing functionality, because it's all inherently directing the computer to do stuff. So only comments would be protected.
On the other hand, you could instead argue that every piece of code can be translated into another language, and in fact is, whether interpreted or compiled, so the source code is exclusively expression only as the functionality is never tied to it.
But it doesn't seem to me to make any sense to say that some part or aspect is expression and another is functionality. It's all or nothing.
That's interesting, comparing it to plagiarism, reminds me of when I was shamed when I was like 8 for rewriting a paragraph from a book in my own words for an essay. At least when I was that age, that was totally considered plagiarism (at least by my parents). It was crushing to find that even though I'd worked really hard on paraphrasing each sentence, it didn't count and I'd missed the whole point.
I wonder what standards colleges and research journals have now.
For what it's worth, in the common user journey through SO (Google SRP to SO question page, scroll to answers), where does SO ever tell you about the license or the requirements of complying? If they simply rely on people knowing that generally there are licenses and such and they should look into it, then it's hardly a surprise almost no one complies. I've worked with many people, developers and not, who probably couldn't even rattle off a few common licenses.
I'm not a professional developer, but I've copied snippets from SO before. I've always included the answer URL in a comment next to it though. But mostly because if I ever had issues with it, I wanted to know where I got it and on the off chance anyone else looks at my code, I wouldn't want them thinking I wrote code I didn't write.
It’s basically the only time I would ever put a URL in a comment.
It’s a testament to SO that those URLs have still worked when I clicked on them years later and often provide valuable context to some esoteric bit of code.
The rest of your code? IANAL but I don’t think it’s viral like the GPL.
The way I understand it, if you fix a bug in an SO code snippet, then in theory you owe that bug fix back to the crowd.
EDIT: The more I read about this the more cringeworthy it becomes. CC BY-SA is not designed for source code. The language in the actual license is extremely imprecise for trying to reason about using modular bits of CC licensed code in a complex system.
“Adapted Material” is defined simply as;
Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material
... and ...
in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor.
Is my 100,000 LOC application “derived from” or “based upon” a function which converts a hex string to a byte[]? That’s a question that can only be decided in a courtroom, and to my knowledge CC BY-SA has never been litigated in the context of source code.
My inclination is that the terms “derived from” and “based upon” must necessarily carry a stronger meaning than, for example, “incorporates” or other similar terms which would not imply a central shared function or feature.
I think they should change it to MIT or BSD or CC0. I've put in my profile that all code I post is CC0. I don't want or need credit for explaining something and 5 to 50 lines of code. Sometimes the code is longer, but it's usually because it's setup for the actual important part. I don't really understand why people feel they need CC-BY-SA.
I really doubt that you could defend the copyright to a snippet like this in court. Saying it's CC BY-SA is all well and good, but if it's unenforceable in reality, it's meaningless.
I recently had the strange experience of writing a Promise.allSettled implementation in Node.js, but it wasn't quite working. I checked Stackoverflow and found an implementation that was almost line for line what I'd written.
I would imagine that's pretty common, especially for small helper functions for a very specific task. There might really only be one real logical way of doing it. How many different ways could you really implement leftpad?
I wonder what percentage of bugs in general are from people not understanding how floating points work. I think property based testing (QuickCheck) should be used whenever floats are involved. Nobody ever seems to get them right.
Would QuickCheck know to try 999999 as input? Most of the possible inputs would give correct results, and those that don't, aren't very 'special', such as 0,1,-1,MAX_INT, and so on.
Don't many QuickCheck-inspired libraries have special cases to ensure they generate common numbers like those? I could be misremembering, but I would have sworn I read that in the documentation of the last library I looked at (whose name escapes me).
Assuming you're talking about publishing a library instead of a snippet in the first place, the main difference would be that one puts the onus on every client and the other on you.
The best way to avoid confusion is for everybody to stop using powers of 2 instead of powers of 10 for K, M, etc. The meanings of Kilo, Mega, etc. were well defined before computers were invented. It was a mistake to steal those terms and use them for something different.
Actually, SO's obsession that the answers should contain code ready to copy/paste is flawed. They'd rather give them fish instead of teaching them how to fish.
When you're reading the documentation for a software library, do you prefer if it includes examples for common usage or do you prefer a long list of API methods?
Have you ever used the "examples" section of a man page or do you prefer to scroll through the (sometimes long) list of parameters?
In my opinion, code examples are an important tool for teaching real-life usage patterns.
On the other hand, often the people reading the answers are expert fishermen, but simply don't want to waste their time on trivial fish. If one wants to learn to fish, there are already many other resources.
[1]: https://wiki.haskell.org/Hoogle#Theoretical_Foundations