> The Windows operating system can dispatch different events to different window handlers so you can handle all asynchronous HTTP calls efficiently. For a very long time, people weren’t able to do this on Linux-based operating systems since the underlying socket library contained a potential bottleneck.
What? select()'s biggest issue is when you have lots of idle connections, which shouldn't be a problem when crawling (you can send more requests while waiting for responses). epoll() has been available since 2003. What bottleneck?
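To illustrate, here's a minimal sketch of event-driven fetching with Python's selectors module, which sits on top of epoll on Linux; the hosts and the bare HTTP/1.0 requests are placeholders, not a real crawler:

    # Sketch: event-driven fetching with the selectors module,
    # which uses epoll on Linux. Hosts below are placeholders.
    import selectors
    import socket

    URLS = [("example.com", "/"), ("example.org", "/")]

    sel = selectors.DefaultSelector()      # EpollSelector on Linux
    responses = {}

    for host, path in URLS:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        sock.connect_ex((host, 80))        # non-blocking connect
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        sel.register(sock, selectors.EVENT_WRITE, {"host": host, "request": request})
        responses[host] = b""

    while sel.get_map():
        events = sel.select(timeout=10)
        if not events:
            break                          # nothing became ready; bail out in this sketch
        for key, mask in events:
            sock, data = key.fileobj, key.data
            if mask & selectors.EVENT_WRITE:
                sock.sendall(data["request"])   # connection is up: send the request
                sel.modify(sock, selectors.EVENT_READ, data)
            elif mask & selectors.EVENT_READ:
                chunk = sock.recv(4096)
                if chunk:
                    responses[data["host"]] += chunk
                else:                      # server closed the connection: response complete
                    sel.unregister(sock)
                    sock.close()

    for host, body in responses.items():
        print(host, len(body), "bytes")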
Turns out crawlers spend a lot more (wall clock) time waiting for a complete response than they do requesting it. However, scheduling is a much (much, much, much) harder problem to deal with than async I/O, though it's not something many people here need to worry about.
The challenges of crawling at large scale still persist, as is evident from Bloomreach and many other companies building custom solutions because the available open source tools cannot handle the scale such products require. SQLBot aims to solve this problem.
The product is a few weeks from launch. If anyone is interested: http://www.amisalabs.com/AmisaSQLBot.html
mmmmmmmm, on http://en.wikipedia.org/wiki/BitTorrent it says "BitTorrent is one of the most common protocols for transferring large files, and peer-to-peer networks have been estimated to collectively account for approximately 43% to 70% of all Internet traffic (depending on geographical location)" so... 60% of requests are done by crawlers?
I wish there were more articles about determining the frequency at which one page should be crawled. Some pages never change, some change multiple times per minute, and we do not want to crawl them all equally often.
This is a problem I've researched fairly extensively the last few months. My ideal solution looks something like:
* Initial pull
* Secondary pulls x time later, where x doubles each time, up to a maximum value, y
y is the one that's tricky to define. For us, it's a value computed based on the frequency of update of similar URLs for that domain, the domain as a whole, similar content, and a few other bits and pieces. Essentially, our thinking is that if we can understand how alike any page is to another cluster of pages, we can use their average frequency of update to give reasonably likely initial values for x, and sensible thresholds for y. We also temper this with how much change there is, to determine whether the differences are something we care about.
Obviously, should the system notice that a page's change timings fall particularly far outside what it would expect given its assigned cohort, it's then able to start shifting which cohort it's compared against. An example would be a blog category page which updates so infrequently that it's particularly unusual, or a page with a lot of social feeds on it where there's constant flux.
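For what it's worth, a minimal sketch of that shape of scheduler; cohort_max_interval() and the constants are illustrative stand-ins, not our actual values:

    # Sketch of the doubling schedule described above (names are illustrative).
    # cohort_max_interval() stands in for however y is derived from similar
    # URLs, the domain as a whole, and similar content.
    from datetime import datetime, timedelta

    INITIAL_INTERVAL = timedelta(hours=1)            # a guess at the starting x

    def cohort_max_interval(url: str) -> timedelta:
        # Placeholder for the cohort-derived cap y.
        return timedelta(days=7)

    def schedule_next(url: str, now: datetime, interval: timedelta, meaningful_change: bool):
        """Return (next_crawl_time, new_interval)."""
        if meaningful_change:
            interval = INITIAL_INTERVAL                              # page is active again: tighten x
        else:
            interval = min(interval * 2, cohort_max_interval(url))   # double x, capped at y
        return now + interval, interval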
Works pretty well, but if anyone's got a better solution I'd love to hear of it.
That is why you have sitemaps. For new websites with no sitemap we could start with a default frequency, say, once a day. If a site changes more often than that, you can increase the frequency to multiple times a day and so on. We can also look at the type of website to determine a starting frequency: for example, an e-commerce website ought to be crawled more frequently because new products, reviews, and ratings could be added every few minutes, whereas a personal blog may stay the same for weeks at a time.
It's relatively easy. First you crawl a page every hour. If it changed, you halve the interval; if it hasn't, you double it. You set some limits, like once a minute to once a month. You can also adjust the multiplier and, instead of a factor of 2, use something like 1.2. This way you can adapt more precisely to the page's update rate.
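As a sketch, with the limits and factor suggested above:

    # Sketch of the halve-on-change / double-on-no-change schedule.
    from datetime import timedelta

    MIN_INTERVAL = timedelta(minutes=1)
    MAX_INTERVAL = timedelta(days=30)
    FACTOR = 2.0                      # or something gentler like 1.2

    def adjust_interval(interval: timedelta, page_changed: bool) -> timedelta:
        """Crawl more often if the page changed since last time, less often if not."""
        interval = interval / FACTOR if page_changed else interval * FACTOR
        return max(MIN_INTERVAL, min(interval, MAX_INTERVAL))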
Also, HTTP headers and sitemap.xml can tell you how often the pages change.
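On the headers side, a conditional GET lets you ask the server whether a page changed at all before re-fetching it; a sketch with a placeholder URL and Python's standard urllib:

    # Sketch: conditional GET using Last-Modified / ETag, so an unchanged page
    # costs almost nothing to re-check.
    import urllib.error
    import urllib.request

    def fetch_if_changed(url, last_modified=None, etag=None):
        """Return (changed, body, last_modified, etag)."""
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        if etag:
            req.add_header("If-None-Match", etag)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return (True, resp.read(),
                        resp.headers.get("Last-Modified"), resp.headers.get("ETag"))
        except urllib.error.HTTPError as err:
            if err.code == 304:          # Not Modified: skip this crawl cycle
                return False, None, last_modified, etag
            raise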
Once upon a time I wrote my thesis on building a web crawler. The (tiny) blog post with an embedded preview:
http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blo...
The PDF itself:
http://blog.marc-seeger.de/assets/papers/thesis_seeger-build...
It's mostly a "this is what I learned and the things I had to take into consideration" with a few "this is how you identify a CMS" bits sprinkled into it. These days I would probably change a thing or two, but people told me it's still an entertaining read. (Not a native speaker though, so the English might have some stylistic kinks)