Hacker News

Suppose we had an index of snippets, meaning they've been parsed and can be searched isomorphically, so that e.g. variable names are not significant. Some techniques are discussed in [1].
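
To make "variable names are not significant" concrete, here is a minimal sketch, assuming the snippets are Python and using only the standard `ast` module. The names `fingerprint` and `_Canonicalize` are hypothetical, and this is just one simple way to do it, not the techniques from [1]: every user-chosen identifier is renamed to a positional placeholder before hashing, so two snippets that differ only in naming hash the same.

```python
import ast
import builtins
import hashlib

# Builtins such as len or sum carry meaning, so they are left untouched.
_BUILTINS = frozenset(dir(builtins))


class _Canonicalize(ast.NodeTransformer):
    """Rename user-chosen identifiers to v0, v1, ... in order of first use."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        if node.id not in _BUILTINS:
            node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node


def fingerprint(source: str) -> str:
    """Hash of the snippet's syntax tree with identifiers canonicalized."""
    tree = _Canonicalize().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```

With this, `def add(a, b): return a + b` and `def plus(x, y): return x + y` get the same fingerprint, while changing `+` to `-` gives a different one. Real clone detection would also need to handle reordered statements, partial matches and so on, which this does not attempt.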

If we then ran that against source repos, we could get update notifications for copypasta'd code.

"In file F at line L, it looks like you used some code from SO at revision R. In revision R', it's been corrected."
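
A rough sketch of how such a notification could be produced, assuming some index already maps a snippet fingerprint to the revision it came from and to the revision that corrected it, if any. `SnippetRevision`, `notify` and the fingerprint strings are all hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SnippetRevision:
    source: str                        # where the snippet came from, e.g. "SO"
    revision: int                      # revision R that was copied
    corrected_in: Optional[int] = None  # revision R' with the fix, if one exists


def notify(file: str, line: int, fingerprint: str,
           index: Dict[str, SnippetRevision]) -> Optional[str]:
    """Return a notification if the matched snippet has since been corrected."""
    rev = index.get(fingerprint)
    if rev is None or rev.corrected_in is None:
        return None  # unknown snippet, or nothing newer to report
    return (
        f"In file {file} at line {line}, it looks like you used some code "
        f"from {rev.source} at revision {rev.revision}. "
        f"In revision {rev.corrected_in}, it's been corrected."
    )
```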

[1]: https://wiki.haskell.org/Hoogle#Theoretical_Foundations




We essentially have that: they're stored in NPM, and it's horrible.

It turns out that when you can package snippets, you use so many that you can't possibly keep track of and audit them all.

Just look at the Left-pad thing, or the event-stream thing.


SO copypasta is better than NPM, because no one can change the code snippet to steal bitcoins once you've copied it into your code base. It's much more secure than a mutable database.


> Just look at the Left-pad thing, or the event-stream thing.

Those prove that we could see the problem. Brokenness doesn't go away when you grab a snippet of code or reinvent the wheel; you're simply unaware of how much of it is buggy or broken.


What do you mean by this? As far as I understand, NPM provides access to packages, not snippets, and as far as I know doesn't provide a way to search the code in those packages, let alone isomorphically.


A lot of npm packages are no longer than a typical Stack Overflow answer, and they get used everywhere, to the point where installing a dozen packages can lead to tens of thousands of sub-packages being installed.

At that point, the packages are essentially "indexed snippets" of code.


There’s going to be a massive number of false positives:

“I see you used ‘for i in...’ and that copies this SO question about iteration...”


Agreed, you'd definitely need a mechanism to mitigate false positives.

One technique would be to try and define what constitutes "trivial" code.

Another would be to prioritize sources. Documentation from standard or major third party libraries should take precedence over SO.

Another would be a feedback mechanism. If repo authors vote a particular snippet up or down, after a threshold it could be excluded from matching.

Or you could opt-in by means of a comment, though this might make it useless.
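
A minimal sketch of how the first three ideas could fit together, assuming Python snippets. Everything here is made up for illustration: the "trivial" test is just an AST node count, and `MIN_NODES`, `VOTE_THRESHOLD` and the `SOURCE_PRIORITY` labels are arbitrary placeholders that would need tuning.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional

MIN_NODES = 20       # assumption: fewer AST nodes than this counts as "trivial"
VOTE_THRESHOLD = -3  # assumption: at or below this score, stop matching the snippet
SOURCE_PRIORITY = {"stdlib-docs": 0, "library-docs": 1, "stackoverflow": 2}


@dataclass
class Candidate:
    source: str    # one of the SOURCE_PRIORITY keys, or anything else
    code: str      # the indexed snippet that matched
    votes: int = 0  # net up/down votes from repo authors


def is_trivial(code: str, min_nodes: int = MIN_NODES) -> bool:
    """Treat very small syntax trees (a bare loop, a one-liner) as trivial."""
    return sum(1 for _ in ast.walk(ast.parse(code))) < min_nodes


def best_match(candidates: List[Candidate]) -> Optional[Candidate]:
    """Drop trivial or voted-down snippets, then prefer the highest-priority source."""
    kept = [
        c for c in candidates
        if not is_trivial(c.code) and c.votes > VOTE_THRESHOLD
    ]
    if not kept:
        return None
    # Unknown sources sort after every known one.
    return min(kept, key=lambda c: SOURCE_PRIORITY.get(c.source, len(SOURCE_PRIORITY)))
```

Under these assumptions a bare `for i in range(10): print(i)` is dropped as trivial, and when the same non-trivial snippet appears both in standard documentation and on SO, the documentation match is the one reported.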


There has been a bit of research on this[1]:

> I qualitatively analyzed the top 50 clones in that list and was able to identify the source (or at least a source) of the snippets in most of the cases.

[1]: https://meta.stackoverflow.com/questions/375761/how-to-handl...



