Hacker News

Are the datasets all legit? For instance, this looks like a quarterly scrape of Reddit in full:

http://academictorrents.com/details/85a5bd50e4c365f8df70240f...




I'm pretty sure all the Twitter datasets violate Twitter's ToS.


On a quick pass of the Twitter datasets, they all seem to conform to Twitter's developer Terms.


Like the requirement that you delete from your datasets any tweets that have been deleted on Twitter?


As far as I could tell, none of them actually contain tweets (e.g. any JSON), just IDs, and mostly user IDs at that.


>Are the datasets all legit?

I mean, academia has destroyed the scientific method, turning it into:

Who needs a PhD and what does your Professor want to prove true?

I've started to ONLY trust industry.


Because industry is dedicated to finding truth?


Not all; some are non-free and commercially licensed.


What makes them "non-legit"?


Seems like you can’t upload unless you have an account registered with an academic email address.


The form I saw didn't ask for an academic email address. Just an email address.


Legit how?


Legit as in not subject to firms potentially coming after them because they're distributing their data. (I've no idea what Reddit's terms are but I wouldn't be surprised if they had an issue or two with a dump of their historical data being available for download free of charge on a 3rd party website.)


Mostly. This repository is for data hoarders and archivists. They don't necessarily care whether it is legit or legal. The goal is to harvest as much quality data as possible.



