- Athelion is a plugin for Apache Nutch. Nutch is the web crawler part, and it has been open source for a long time.
- As far as I can tell, Athelion parses structured data (microformats, microdata, RDFa...), but that's not the most interesting bit. The online classifier of pages and the scoring of newly discovered links look like the really important piece.
- Actually, the other two parts of the plugin are modifications of existing Nutch plugins.
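To make the "online classifier plus link scoring" idea a bit more concrete, here is a minimal sketch of how a focused crawler could work in general. This is not Athelion's or Nutch's actual code: the class names, the bag-of-words logistic model, and the 50/50 mix of page score and anchor-text score are all assumptions made up for illustration.

```python
import heapq
import math
from collections import defaultdict


class OnlineClassifier:
    """Tiny online logistic-regression model over bag-of-words tokens (illustrative only)."""

    def __init__(self, learning_rate=0.5):
        self.weights = defaultdict(float)
        self.learning_rate = learning_rate

    def predict(self, tokens):
        # Sum the weights of the distinct tokens and squash to a 0..1 relevance score.
        z = sum(self.weights[t] for t in set(tokens))
        z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, tokens, label):
        # One stochastic-gradient step per labeled page (label is 1 or 0).
        error = label - self.predict(tokens)
        for t in set(tokens):
            self.weights[t] += self.learning_rate * error


class Frontier:
    """Priority queue of URLs still to fetch, highest score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


def score_outlinks(classifier, page_tokens, outlinks, frontier):
    """Score each (url, anchor_tokens) pair found on a page and queue it.

    Assumption: a link's score is the average of the parent page's relevance
    and the relevance of its anchor text.
    """
    page_score = classifier.predict(page_tokens)
    for url, anchor_tokens in outlinks:
        anchor_score = classifier.predict(anchor_tokens)
        frontier.add(url, 0.5 * page_score + 0.5 * anchor_score)
```

The point of the design is that the crawler keeps learning while it crawls (each fetched page can call `update`), and the frontier always hands back the link the current model considers most promising.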
All in all, I can't wait to have some time to see it working.
If I recall correctly, Yahoo had a couple of other offerings / search tools. They had something called YQL, which enabled you to treat specific, public (supported) websites, like Craigslist for example, as a SQL table: SELECT * FROM <house_for_rent>, etc.
I have no idea if this is what was used here, but I know this is one example where I'm sure they didn't outsource the search work to Google.
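For anyone who never used YQL, here is a rough sketch of what sending such a query looked like: the SQL-like statement was passed as a `q` parameter to a public HTTP endpoint. The endpoint below is the public one as I remember it (the service has since been retired, so this only builds the URL and does not fetch anything), and the table and column names are hypothetical placeholders, not real YQL tables.

```python
from urllib.parse import urlencode

# Public YQL endpoint as recalled; the service is retired, so nothing is fetched here.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(statement, fmt="json"):
    """Return the GET URL that would have carried a YQL statement."""
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


# Hypothetical table and column names, standing in for <house_for_rent> above.
statement = 'SELECT * FROM example.rentals WHERE city = "seattle"'
url = build_yql_url(statement)
```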
You are correct. Formerly, Yahoo used its own spider, Slurp. Currently, Yahoo is using Bing to power its search results. Yahoo has agreed that Bing will provide the majority of its search results. They have just struck a deal with Google to start including Google search results as well. In the future, we should see 51% Bing results and 49% Google results.
It looks like Wikipedia agrees with you: Originally, "Yahoo Search" referred to a Yahoo-provided interface that sent queries to a searchable index of pages supplemented with its directory of websites. The results were presented to the user under the Yahoo! brand. Originally, none of the actual web crawling and data housing was done by Yahoo! itself. In 2001, the searchable index was powered by Inktomi and later was powered by Google until 2004, when Yahoo! Search became independent. On July 29, 2009, Microsoft and Yahoo! announced a deal in which Bing would henceforth power Yahoo! Search.
Once upon a time in a land long forgotten, before it became a holding company for its Alibaba investment, Yahoo was actually doing interesting stuff. You could say Yahoo was Google before Google was Google.
They did things like the Yahoo webpage hyperlink connectivity graph, and tons of other research.