Scraping things that don't want to be scraped is one of my favorite things to do. At work this is usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last 6 years or so, I don't have a need to do it all that often. Plus, with things like Selenium it's so easy to just run the page as is that I can't justify spending the time to figure out the undocumented API.
My favorite one implemented CSRF protection by polling an endpoint and adding the hashed data from that endpoint, plus a timestamp, to every request.
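For the curious, that dance looks roughly like this in Python (a minimal sketch; the endpoint URL and header names are hypothetical, since the real scheme was appliance-specific):

```python
import time
import requests

session = requests.Session()

def csrf_headers():
    # Hypothetical endpoint; the appliance served pre-hashed token data here.
    token = session.get("https://appliance.example/api/csrf-token").json()["token"]
    return {"X-Csrf-Token": token, "X-Timestamp": str(int(time.time()))}

# Poll for a fresh token and attach it, plus a timestamp, to every request.
resp = session.get("https://appliance.example/api/interfaces", headers=csrf_headers())
print(resp.status_code)
```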
When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.
To be fair, Selenium-style scraping can take a lot of time to set up if you aren't already familiar with the tooling, and the browser rendering APIs are unintuitive and sometimes flat-out broken.
Maybe it's because I'm using the Python bindings, but it took me about an hour to go from never having used it to having it do what I needed. I just messed around in a Jupyter notebook until I got what I needed working. Tab completion on live objects is your friend. The hardest part was figuring out where to download a headless browser from.
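For reference, the headless setup with the Python bindings is only a few lines (a minimal sketch, assuming Chrome and chromedriver are already installed; example.com is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # no visible browser window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.find_element(By.TAG_NAME, "h1").text)
driver.quit()
```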
Though I do prefer requests/bs4. I wrote a helper to generate a requests.Session object from a Selenium Browser object. I had something recently where the only thing I needed the JavaScript engine for was a login form that had changed. By doing it this way I didn't have to rewrite the whole thing. It still kind of bothers me that I didn't take the time to figure out how to do it without the headless browser, but it works fine, and I have other things to do.
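The cookie handoff in that helper is roughly this (a minimal sketch assuming the standard Selenium and requests APIs; the function name is mine):

```python
import requests

def session_from_driver(driver):
    """Copy cookies (and the UA) from a Selenium driver into a requests.Session."""
    s = requests.Session()
    for c in driver.get_cookies():
        s.cookies.set(c["name"], c["value"],
                      domain=c.get("domain"), path=c.get("path", "/"))
    s.headers["User-Agent"] = driver.execute_script("return navigator.userAgent")
    return s

# After driving the JS-heavy login with Selenium, do the rest with plain requests:
# session = session_from_driver(driver)
# html = session.get("https://example.com/protected/page").text
```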
You're absolutely right. It slipped my mind because I considered it more of a language-agnostic tool, and I organized the article around tools for each popular programming language. That said, I've added it to the post as a language-agnostic tool - thanks for the pointer!
It's funny, I never seem to hit these infamous Cloudflare captchas. The only impediment I encounter with Cloudflare is that they require plaintext SNI to read their blog, https://blog.cloudflare.com. Unlike almost all other Cloudflare-hosted sites, ESNI will not work there.
I have not had to deal with that, but I have idly thought that it might be easier to pipe the audio version into Google Assistant or something and see what it comes up with.
I remember a workmate having to deal with some difficult-to-scrape data at a previous job - the page randomly rendered with different mark-up (but the same appearance) to mitigate pulling out data using selectors. I think he got to the bottom of it eventually, but it made testing his work a pain.
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Yep, this seemed like an aggregate Google results page.
I was initially intrigued by the article and then realized it was a list of libraries the author found via Google. There were some notable omissions from the list and a bunch of weird stuff that feels unnecessary. I don't think the author has actually scraped a page before.
I agree with your conclusion, but in any discussion about web scraping it's probably a good idea to mention BeautifulSoup, given how popular it is (virtually a builtin in terms of how much it's used) and all the documentation available for it - a good starting point if perf is not going to be a concern.
Although I don't follow the need to have what appear to be two completely separate HTML-parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers, because lxml has _seen some shit_:
> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
requests-html is faster than bs4 using lxml. It's a wrapper over lxml. I built something similar years ago using a similar method; it was much faster than bs4, too.
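For anyone who hasn't seen it, basic usage looks something like this (a sketch; example.com is a placeholder):

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com")

# CSS selectors, backed by lxml under the hood
for link in r.html.find("a"):
    print(link.attrs.get("href"))
```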
Playwright is essentially a headless Chrome, Firefox, and WebKit browser with a nice API that's intended for automation/scraping. It's far heavier than something like curl, but it has all the capabilities of any browser you want (not just Chrome, as with Puppeteer) and makes stuff like interacting with JavaScript a breeze.
It's similar to Google's Puppeteer, but in my opinion it's much more pleasant and productive, even with Chrome. Microsoft's best developer tool IMO; it saves me tons of time.
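A minimal example with the Python bindings, for anyone curious (a sketch; example.com is a placeholder):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # or p.firefox / p.webkit
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    html = page.content()  # fully rendered DOM, ready for any HTML parser
    browser.close()
```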
If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:
I used this library to get familiar with Go. It is indeed very powerful, and it's really easy to create a scraper with it.
My main concerns, though, were about testing. What if you want to create tests to check whether your scraper still gets the data you want? Colly allows nested scraping and it's easy to implement, but you end up with all your logic in one big function, making it harder to test.
Did you find a solution to this? I'm considering switching to net/http + GoQuery only to have more freedom.
I have been scraping radio broadcast pages for a decade now. Started with (Ruby) scrapy, then nokogiri, then moved on to Go and its html package.
Currently I sport a mix of curl + grep + xsltproc + lambdasoup (OCaml) and am happy with it. It sounds like a mess but it's shallow, inspectable, changeable, and concise. http://purl.mro.name/recorder
Last year I needed some quick scraping, and I used a headless Chromium to render the web pages and print the HTML, then analyzed it with C#.
I don't remember exactly, but I think it was around 100 or 200 lines of code, so not exactly something that took long to write.
In fact, the most difficult thing was figuring out how to pass the right args to Chromium.
HTTP requests, HTML parsing, crawling, data extraction, wrapping complex browser APIs, etc. Nothing you couldn't do yourself, but like most frameworks, they abstract the messy details so you can get a scraper working quickly without having to cobble together a bunch of libraries or reinvent the wheel.
Just for one example: when you have to get a form and then submit it, with the CSRF protection that was in the form... of course you COULD write that yourself by printing HTML and then analyzing it with C# (which triggers more requests to Chromium, I guess), but you're probably going to wonder why you're reinventing the wheel when you want to be getting on to the domain-specific stuff.
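For comparison, doing it by hand with requests + BeautifulSoup looks roughly like this (a sketch; the URL and field names are hypothetical), which is exactly the boilerplate a framework hides:

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Fetch the form and pull out the hidden CSRF field.
page = session.get("https://example.com/login")
soup = BeautifulSoup(page.text, "html.parser")
token = soup.find("input", {"name": "csrf_token"})["value"]

# Submit the form, echoing the token back.
session.post("https://example.com/login", data={
    "username": "me",
    "password": "secret",
    "csrf_token": token,
})
```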
Throttling is a prime example. If you start loading multitudes of sites asynchronously, you'll have to add some delay, otherwise you run the risk of choking the server on misconfigured sites. I've DDoSed sites accidentally this way. You can of course build a framework on your own, and that's pretty much what every scraper does eventually, but it takes time and a lot of effort.
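A minimal per-host delay in asyncio looks something like this (a sketch using aiohttp; the two-second delay is arbitrary) - a real framework layers retries, backoff, robots.txt handling, etc. on top:

```python
import asyncio
import time
from urllib.parse import urlparse

import aiohttp

MIN_DELAY = 2.0   # seconds between requests to the same host (arbitrary)
next_slot = {}    # earliest allowed request time, per host

async def polite_fetch(session: aiohttp.ClientSession, url: str) -> str:
    host = urlparse(url).netloc
    now = time.monotonic()
    slot = max(next_slot.get(host, now), now)
    next_slot[host] = slot + MIN_DELAY  # reserve the next slot before awaiting
    await asyncio.sleep(slot - now)
    async with session.get(url) as resp:
        return await resp.text()
```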
I think another technique that should be talked about is intercepting network responses as they happen. The web in 2021 still has a whole lot of client-side rendering. For those sites, data is often loaded on the fly with separate network calls (usually with some sort of nonce or contextual key). Much of the hassle in web scraping can be avoided by listening for that specific response instead of parsing an artifact of the JSON->JS->HTML process.
I put together a toy site [0] recently that uses this approach for JIT price comparisons of events. When you click on an event, the backend navigates to requested ticket provider pages through a pool of Puppeteer instances and waits for JSON responses with pricing data.
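In Playwright's Python bindings the same idea looks roughly like this (a sketch; the URL and the "pricing" filter are hypothetical stand-ins) - Puppeteer's waitForResponse is the equivalent hook:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Wait for the XHR that carries the pricing JSON instead of parsing the DOM.
    with page.expect_response(lambda r: "pricing" in r.url and r.status == 200) as info:
        page.goto("https://tickets.example/event/123")
    data = info.value.json()
    print(data)
    browser.close()
```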
Slight aside: The most recent Cloudflare HCaptchas ask you to classify AI generated images. They don’t even look like a proper bike/truck/whatever (I don’t have an example handy).
I categorically refuse to do them when I'm browsing websites that use it. I find this new captcha utterly unacceptable.
It’s no “protection” at this point anymore. Websites are using it as an excuse to become even more user hostile. I am worried for the future of the web.
I've been working on a scraping project in Scrapy over the last month, using Selenium as well. My Python skills are mediocre (mostly a Java/Kotlin dev).
Not only has it been a blast to try out, but it was also surprisingly easy to set up.
I now have around 11 domains being scraped 4 times a day through a well-defined pipeline + ETL, which then pipes it into Firebase Firestore for consumption.
My SaaS requires some technical knowledge to use (calling a web API), which I suppose is why it's never in these lists.
Some of my customers are *very* large businesses. If you are looking at evading bot countermeasures, my product probably isn't the best for you, but for TCO nothing beats it.
> Crawl at off-peak traffic times. If a news service has most of its users present between 9 am and 10 pm – then it might be good to crawl around 11 pm or in the wee hours of the morning.
For sites where there is a peak usage time, it's probably obvious what that peak usage time is. A news service (their example) presumably primarily serves a country or a region - then off-peak traffic times are likely at night.
The Internet has no time zone, but its human users all do.
I moved from Selenium to Playwright. It has a pleasant API and things just work out of the box. I ran into odd problems with Selenium before, especially when waiting for a specific element: Selenium didn't register it, but I could see it load.
It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.
No, completely different use case. Selenium is browser automation. MechanicalSoup/Mechanize/RoboBrowser are not actually web browsers, and they have no JavaScript support either. They're Python libraries that simulate a web browser by doing GET requests, storing cookies across requests, filling in HTTP POST forms, etc.
The downside is that they don't work with websites that rely on JavaScript to load content. But if you're scraping a website like that, it might be easier and way, way faster to analyze the web requests using dev tools or mitmproxy, then automate those API calls instead of automating a browser.
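Once you spot the JSON call in the dev tools Network tab, the scraper often collapses to a few lines (a sketch; the endpoint and field names are hypothetical):

```python
import requests

resp = requests.get(
    "https://example.com/api/v2/products",  # endpoint copied from the Network tab
    params={"page": 1, "per_page": 50},
    headers={
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",         # some endpoints reject the default UA
    },
)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["name"], item["price"])
```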
I started reverse engineering web APIs years ago as a more efficient way to scrape. Unfortunately there are newer website builders (like Salesforce Lightning) that make it really hard to recreate an HTTP request due to the complexity of the parameters.
Interesting. I was about to start on some web automation, and so far I've had it hammered into my head that Selenium is the 'language of the internet' or something along those lines.
What would be a better solution, if you have any to recommend?
Puppeteer is frustrating to me. When I tried to use it I couldn't get it to click buttons, but I did get it to hover over the button, so I know I had the correct element in my code. Their click function just did nothing at all. I resorted to tabbing a certain number of times and hitting enter.
Nowadays it's more and more common for websites to have some kind of rate-limiting middleware, such as Rack::Attack for Ruby. It would be interesting to explore strategies for dealing with it.
Fundamentally, what most scrapers learn is that the more their scraper can behave like a human browsing the site, the less likely they are to get detected and blocked.
This does put limits on how quickly they can crawl, of course, but scrapers find ways around it, like changing IP and user agent (IP is probably the main one, because you can then pretend that you are multiple humans browsing the site normally).
Even changing IPs won't always work against an adversary with a global view of the Internet such as Cloudflare.
CF has a view of a significant chunk of Internet traffic across many sites and feeds that into some kind of heuristics/machine learning. Even if we assume that your behavior on the scraped website looks human-like, you may still get blocked or challenged because of your lack of traffic on other sites.
The IPs you'd get from a typical proxy service would only be used for bot activity and would've been classified as such a long time ago, and there's no "human activity" on it to compensate and muddy the waters so to speak.
The best solution is to use IPs with a chunk of legitimate residential traffic and keep scraping sessions constrained to their IPs - don't rotate your requests among all these IPs; instead, every IP should be its own instance of a human-like scraper, using its own user account, browser cookies, etc.
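In requests terms, that means one long-lived Session per proxy rather than a rotating pool (a sketch; the proxy URLs and UA string are placeholders):

```python
import requests

# Placeholder proxy pool - ideally residential IPs that also carry real traffic.
PROXIES = [
    "http://user:pass@203.0.113.10:8000",
    "http://user:pass@203.0.113.11:8000",
]

def make_identity(proxy_url):
    """One persistent 'human': fixed IP, fixed UA, its own cookie jar."""
    s = requests.Session()
    s.proxies = {"http": proxy_url, "https": proxy_url}
    s.headers["User-Agent"] = "Mozilla/5.0 (placeholder UA; keep it stable per identity)"
    return s

identities = [make_identity(p) for p in PROXIES]
# Keep each identity scraping its own slice of the site at a human-ish pace.
```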
You nailed it! I've also faced issues in the past with captchas and elaborate bot detection mechanisms. It would also be helpful to mention that there are automatic captcha solvers to bypass security once one is detected. I'm wondering if it's worthwhile to provide an addition to this post on how to improve the efficacy of scraping despite these roadblocks. The article is geared towards beginner scrapers who are just starting out, so maybe it would be overkill? What do you think?
Why not just lower the crawl rate? My search engine crawler visits tens of millions of documents in a week at a rate of 1 doc/second, spread across a few hundred different domains at the same time.
Going as low as 0.2 dps could easily be doable I think.
The switch from UA to browser fingerprinting makes it harder to scrape without being stopped.
Yes, at any time the UA could be ignored and clients could be fingerprinted, but now the UA is being made next to useless, so fingerprinting will now become the default everywhere.
I mean, I realize it probably isn't problematic, I was just wondering; but on the other hand it shouldn't be so difficult to follow the reasoning based on the context, I would think:
poster says - in order to be able to scrape effectively you should appear to be a real human, use different UAs etc.
So as this change happens, varying UAs becomes one less thing you can easily change to seem less suspicious, since a non-frozen UA would itself become a suspicious sign after some time.
I tried Python/BeautifulSoup and Node/Puppeteer recently. It may be because my Python is poor, but Puppeteer seemed more natural to me. Injecting functionality into a properly formed web page felt quite powerful and started me thinking about what you could do with it.
Self-promotion: I launched my no-code scraping cloud software on ProductHunt last month after a year of testing with beta users: https://www.producthunt.com/posts/browse-ai
I've not followed this space. When I did, there were a lot of questions concerning the legality of automated scraping. Have those legal issues been resolved?
As long as you're scraping publicly available data (i.e. not going behind a login) and avoiding copyrightable content and personal data, you should be mostly fine. This article should answer most of the web scraping legality questions: https://www.crawlnow.com/blog/is-web-scraping-legal
In my own experience Puppeteer is much better/more capable than Selenium, but the problem is that Puppeteer requires Node.js. Its Python wrapper, https://github.com/pyppeteer/pyppeteer, was not as good as Selenium when you'd like to use Python.
ProxyMan on macOS is quite awesome for this. It requires a bit of setup with certificates, etc., but once it's working you just fire up the target app on your phone and all the sweet, sweet API requests appear on your big screen. I've scraped two apps very successfully this way.
It's also fascinating to see how developers-who-aren't-me set up their APIs when they assume that nobody's looking.
I’ll chime in with mine: Skyscraper (Clojure) [0] builds on Enlive/Reaver (which in turn build on JSoup), but tries to address cross-cutting concerns like caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset, etc.
Surprised Woob[0] (formerly Weboob) isn't on the list. It's designed for specific tasks, such as getting transactions from your bank, events from different event pages, and much more.
Still using CasperJS and PhantomJS. Both have been deprecated for many years, but I cannot find any replacement.
Some of my scraping programs are running over 10 years without any issues.
I would expect it's roughly the same answers, just varying in the specifics:
* those which don't offer a _reasonable_ API, or (I would guess a larger subset) those which don't expose all the same information over their API
* those things which one wishes to preserve (yes, I'm aware that submitting them to the Internet Archive might achieve that goal)
* and then the subset of projects where it's just a fun challenge or the ubiquitous $other
As an example answer to your question, some sites are even offering bounties for scraped data, so one could scratch a technical itch and help data science at the same time:
I have a side-project where I display the schedule of the day of 100+ French radios, like you would for TV channels.
Scraping works great to get the data.
I don't like Node/JS, but I use it to do the scraping since I view that code as throwaway, full of edge cases and unreliable data/types, and I can't complain: a dynamic scripting language is great for that.
It tells you who is your governor, local/federal representative, senator and municipal president.
Each representative lives on a different website, so I wrote scrapers for each one.
Scraping saved untold lives this past spring when large healthcare providers (e.g. Walgreens & CVS) opted to hide their vaccination appointments behind redundant survey questions. This made it more difficult to quickly ascertain when an appointment slot would become available. The elderly were less likely to look more than once a day, delaying vaccines for those who needed them the most.
GoodRx built a scraping system that tapped into all the major providers. That's what a group of vaccine hunters in my state used to get appointments for folks who had tried but were unable to.
Building a side project using Python Scrapy to scrape podcast shows. I use it to search by title/description, etc., to find interesting podcasts, and also as a way to learn different tools and frameworks.
The large proxy providers operate in a sort of gray market. You pay for "residential" or "ISP" based IP addresses. In some instances these proxy connections are literally being tunneled through browser extensions running on a real world system somewhere (https://hola.org/ for instance)
Is there open source software that can extract the "content" part of a given page cleanly? I'm thinking about what the reader mode in browsers can do as an example, where the main content is somehow isolated and displayed.
You can use SGML (on which HTML is/was based) and my LGPL-licensed sgmljs package [1] for that, plus my SGML DTD grammar for HTML5. [2] describes common tasks in the preservation of Web content to give you a flavor, but you can customize what SGML does with your markup to death, really. In your case, you'll probably want to throw away divs and navs to get clean semantic HTML, which you can do using SGML link processes (= a pipeline of markup filters and transformations), but you could also convert the HTML into canonical markup (e.g. XML) and use Turing-complete XML processing tools such as XSLT, as described in the linked tutorial.
I believe the main library for reader mode is called Readability. I played around with a Python implementation a while back. Just pipe in your raw HTML as part of the process. It's good, but not flawless. If I remember correctly, it included some quotes and image text as part of the body for the site I tried it on.
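If that's the implementation I'm thinking of (readability-lxml on PyPI - an assumption on my part), usage is roughly:

```python
import requests
from readability import Document  # pip install readability-lxml

html = requests.get("https://example.com/some-article").text
doc = Document(html)
print(doc.title())    # extracted article title
print(doc.summary())  # cleaned-up HTML containing just the main content
```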