Stanford Large Network Dataset Collection (stanford.edu)
74 points by betolink on June 4, 2015 | 10 comments



One note: SNAP at Stanford is funded by NSF grants through DARPA SMISC, a research group in the DoD that wants to learn how to get better at influencing social media groups online for propaganda.

(Strategic Communication is the DoD term, well one of them, for propaganda)

http://www.theguardian.com/world/2014/jul/08/darpa-social-ne...

https://www.fbo.gov/index?s=opportunity&mode=form&id=972cbc8...


If you're looking for even larger graph datasets, the team at WebDataCommons[1] extracted hyperlink graphs from Common Crawl[2]. They're available at both page and domain levels of granularity.

The page-level hyperlink graphs cover 3.5 billion web pages connected by 128 billion hyperlinks for 2012, and 1.7 billion web pages connected by 64 billion hyperlinks for 2014.

[1]: http://webdatacommons.org/hyperlinkgraph/

[2]: http://commoncrawl.org/


Sad to see the BeerAdvocate and RateBeer datasets were removed before I could grab them.

https://snap.stanford.edu/data/web-BeerAdvocate.html

https://snap.stanford.edu/data/web-RateBeer.html



All of these datasets seem to be some kind of unweighted graph with no additional information (except community information in some cases).

Does anyone know where one can find richer network datasets, i.e. graphs in which the vertices have some attributes?


Not sure I agree. For example there's a patents set:

https://snap.stanford.edu/data/cit-Patents.html

with included patent classification data. You can also join the WikiTalk set:

https://snap.stanford.edu/data/wiki-Talk.html

against Wikipedia data. You can obtain more node attributes for these (and others) easily by joining against public sets.
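
To make the "join against public sets" idea concrete, here is a minimal sketch in plain Python. It assumes a SNAP-style edge list (comment lines starting with `#`, then whitespace-separated node ID pairs) and a separate CSV of node attributes. The `node_id` and `category` column names and the tiny inline inputs are made up for illustration; real attribute files will have their own layout.

```python
import csv
import io


def read_edges(lines):
    """Parse a SNAP-style edge list: skip '#' comments, read 'src dst' pairs."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def read_attributes(lines, key):
    """Read a CSV attribute table into {node_id: row_dict}."""
    return {int(row[key]): row for row in csv.DictReader(lines)}


def join_edges(edges, attrs):
    """Attach attribute rows to both endpoints; None where a node has no row."""
    return [(src, dst, attrs.get(src), attrs.get(dst)) for src, dst in edges]


# Tiny made-up inputs standing in for the real downloads (hypothetical data).
edge_text = io.StringIO("# FromNodeId\tToNodeId\n1\t2\n2\t3\n")
attr_text = io.StringIO("node_id,category\n1,A\n2,B\n")

edges = read_edges(edge_text)
attrs = read_attributes(attr_text, key="node_id")
joined = join_edges(edges, attrs)

for src, dst, src_attrs, dst_attrs in joined:
    print(src, dst, src_attrs, dst_attrs)
```

For the real files you would pass open file handles instead of the `io.StringIO` stand-ins. Nodes with no matching attribute row come back as `None`, so gaps in the public data stay visible rather than silently dropping edges.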



Are there any large datasets out there representing n-partite networks? So instead of people connecting w/ people, I'd see e.g. edges between developers and languages, or products and users, and so on.


I'm just getting into machine learning so I'm looking forward to practicing on these datasets. Thanks for sharing.


The availability of real-world signed network datasets is really great. I've used the Stanford Large Network Dataset Collection in the past to test predictive accuracy in reputation systems. (Looks like they added a new dataset to the "signed" category: Wikipedia requests for adminship.)



