Yahoo open sources Anthelion web crawler for parsing structured data (github.com/yahoo)
159 points by fangwang on Dec 16, 2015 | 9 comments



Just a couple of minor clarifications:

- Anthelion is a plugin for Apache Nutch. Nutch is the web crawler part, and it has been open source for a long time.

- As far as I can tell, Anthelion parses structured data (microformats, microdata, RDFa...), but that's not the most interesting bit. The online classifier of pages and the scoring of newly discovered links look like the really important piece.

- Actually, the other two parts of the plugin are modifications of existing Nutch plugins.

All in all, I can't wait to have some time to see it working.
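
To make the "online classifier plus link scoring" idea above concrete, here is a minimal sketch of a focused-crawl frontier. It is not Anthelion's code: the class names, the URL-token features, and the logistic-regression update are all assumptions chosen for illustration, and the real plugin may use different features and a different learner.

```python
import heapq
import math
import re


class OnlineLinkScorer:
    """Hypothetical online logistic-regression model over URL tokens."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}

    @staticmethod
    def features(url):
        # Split the URL into lowercase alphanumeric tokens.
        return set(re.findall(r"[a-z0-9]+", url.lower()))

    def score(self, url):
        # Estimated probability that the page carries structured data.
        z = sum(self.weights.get(t, 0.0) for t in self.features(url))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, url, has_structured_data):
        # One stochastic-gradient step on the log loss, applied per fetched page.
        error = (1.0 if has_structured_data else 0.0) - self.score(url)
        for t in self.features(url):
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error


class Frontier:
    """Priority queue of unvisited URLs, best-scoring first."""

    def __init__(self, scorer):
        self.scorer = scorer
        self.heap = []
        self.counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url):
        # heapq is a min-heap, so the score is negated.
        # The score is fixed at push time; this sketch does not re-score later.
        heapq.heappush(self.heap, (-self.scorer.score(url), self.counter, url))
        self.counter += 1

    def pop(self):
        return heapq.heappop(self.heap)[2]


scorer = OnlineLinkScorer()
# Feedback from two already-fetched pages (example.com URLs are placeholders).
scorer.update("http://example.com/product/123", True)
scorer.update("http://example.com/about", False)

frontier = Frontier(scorer)
frontier.push("http://example.com/contact")
frontier.push("http://example.com/product/456")
print(frontier.pop())
```

The point of the design is that every fetched page immediately feeds back into the model, so links resembling pages that had structured data get crawled sooner.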


Just curious. Wasn't Yahoo using Google and Bing for search? Is this crawler being used at Yahoo internally?


If I recall correctly, Yahoo had a couple of other offerings / search tools. They had something called YQL, which enabled you to treat specific, public (supported) websites, like Craigslist for example, as a SQL table. SELECT * from <house_for_rent> etc, etc, etc.

I have no idea if this is what was used here, but I know this is an example where I'm sure they didn't outsource the search work to Google.
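
For anyone unfamiliar with the "website as a SQL table" idea, here is a rough local illustration. It is not YQL and does not call any Yahoo service: the rows are made-up stand-ins for scraped listings, and the `house_for_rent` table and its columns are assumptions borrowed from the placeholder above.

```python
import sqlite3

# Hypothetical rows standing in for listings pulled from a public site.
listings = [
    ("2BR house near park", "Austin", 1800),
    ("Studio downtown", "Austin", 1200),
    ("3BR house with yard", "Dallas", 2100),
]

# Load the listings into an in-memory table so they can be queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE house_for_rent (title TEXT, city TEXT, rent INTEGER)")
conn.executemany("INSERT INTO house_for_rent VALUES (?, ?, ?)", listings)

rows = conn.execute(
    "SELECT title, rent FROM house_for_rent WHERE city = ? ORDER BY rent",
    ("Austin",),
).fetchall()
print(rows)
```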


You are correct. Formerly, Yahoo used its own spider, called Slurp. Currently, Yahoo is using Bing to power its search results. Yahoo has agreed that Bing will provide the majority of its search results. They have just struck a deal with Google to start including Google search results as well. In the future, we should see 51% Bing results and 49% Google results.


It looks like Wikipedia agrees with you: Originally, "Yahoo Search" referred to a Yahoo-provided interface that sent queries to a searchable index of pages supplemented with its directory of websites. The results were presented to the user under the Yahoo! brand. Originally, none of the actual web crawling and data housing was done by Yahoo! itself. In 2001, the searchable index was powered by Inktomi and later was powered by Google until 2004, when Yahoo! Search became independent. On July 29, 2009, Microsoft and Yahoo! announced a deal in which Bing would henceforth power Yahoo! Search.

I have no idea if this is still the case.


Once upon a time in a land long forgotten, before becoming a holding company for its Alibaba investment, Yahoo was actually doing interesting stuff. You could say Yahoo was Google before Google was Google.

They did things like the Yahoo web page hyperlink connectivity graph, and tons of other research.


Remember, Yahoo acquired Inktomi, and effectively AltaVista & AllTheWeb.


I note that the package namespace is under com.yahoo.research. So this is probably from Yahoo Research.


I want to build a specialized search site only for cars. Think of it as indeed.com for cars. Is this the magic sauce I've been waiting for?



