The State of Web Scraping in 2021 (mihaisplace.blog)
281 points by marvram on Oct 11, 2021 | 125 comments



Scraping things that don't want to be scraped is one of my favorite things to do. At work this is usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last 6 years or so, I don't have a need to do it all that often. Plus with things like Selenium it's so easy to just run the page as is that I can't justify spending the time to figure out the undocumented API.

My favorite one implemented CSRF protection by polling an endpoint and attaching the hashed data from that endpoint, plus a timestamp, to every request.

When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.


To be fair, Selenium-style scraping can take a lot of time to set up if you aren't already familiar with the tooling, and the browser rendering APIs are unintuitive and sometimes flat-out broken.


Maybe it's because I'm using the Python bindings, but it took me about an hour to go from never having used it to having it do what I needed. I just messed around in a Jupyter notebook until I got it working. Tab completion on live objects is your friend. The hardest part was figuring out where to download a headless browser from.

Though I do prefer requests/bs4. I wrote a helper to generate a requests.Session object from a Selenium Browser object. I had something recently where the only thing I needed the JavaScript engine for was a login form that changed, so by doing it this way I didn't have to rewrite the whole thing. It still kind of bothers me that I didn't take the time to figure out how to do it without the headless browser, but it works fine, and I have other things to do.
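
The helper is roughly this (a simplified sketch from memory, untested): copy the driver's cookies and user agent into a requests.Session, then drop the browser.

    import requests

    def session_from_driver(driver):
        # carry over cookies from the logged-in Selenium browser
        s = requests.Session()
        for c in driver.get_cookies():
            s.cookies.set(c["name"], c["value"],
                          domain=c.get("domain"), path=c.get("path", "/"))
        # matching the UA keeps the session from looking like a different client
        s.headers["User-Agent"] = driver.execute_script("return navigator.userAgent;")
        return s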


That's why things like Laravel's Dusk exist: to put a layer over that complex experience.


I was surprised not to see Selenium in this article. It is a common tool.


You're absolutely right. It slipped my mind because I consider it more of a language-agnostic tool, and I organized the article around the tools available for each popular programming language. That said, I've added it to the post as a language-agnostic tool - thanks for the pointer!


> Scraping things that don't want to be scraped

If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.
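
A minimal sketch of what I mean (untested; assumes Tesseract and a browser driver are installed): render the page in a real browser, screenshot it, and OCR the image.

    from selenium import webdriver
    from PIL import Image
    import pytesseract

    driver = webdriver.Chrome()
    driver.get("https://example.com")
    driver.save_screenshot("page.png")          # whatever actually got rendered
    text = pytesseract.image_to_string(Image.open("page.png"))
    print(text)
    driver.quit()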


Assuming that you eventually manage to load the page somehow. Which in some edge cases may entail simulating mouse movements and random delays.


Agreed. I use the ui.vision extension to simulate native mouse movements.


Have you tried on a page protected by cloudflare captcha?


It's funny, I never seem to hit these infamous Cloudflare captchas. The only impediment I encounter with Cloudflare is that they require plaintext SNI to read their blog, https://blog.cloudflare.com. Unlike with almost all other Cloudflare sites, ESNI will not work.


I have not had to deal with that, but I have idly thought that it might be easier to pipe the audio version into Google Assistant or something and see what it comes up with.


It seems to be no problem if you automate a real browser as opposed to a headless browser. I think they test for that.


A browser extension is probably an easier way to extract text than OCR (unless you're targeting a wide range of sites, I suppose).


I remember a workmate having to deal with some difficult to scrape data at a previous job - the page randomly rendered with different mark-up (but the same appearance) to mitigate pulling out data using selectors. I think he got to the bottom of it eventually but it made testing his work a pain.


Playwright's layout selectors might help the next time you encounter this.

https://playwright.dev/docs/selectors#selecting-elements-bas...


For Python, instead of BeautifulSoup I prefer to use selectolax which is 3-5 times faster.
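
For anyone curious, the API is pleasantly close to BeautifulSoup's CSS selectors (quick sketch, untested):

    from selectolax.parser import HTMLParser

    html = "<div><p class='title'>Hello</p><p>World</p></div>"
    tree = HTMLParser(html)
    for node in tree.css("p.title"):
        print(node.text())                      # -> Hello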

Also, I think very few people use MechanicalSoup nowadays. There are libraries that allow you to use headless Chrome, e.g. Playwright.

It looks like the author of the article just googled some libraries for each language and didn't research the topic.


> It looks like the author of the article just googled some libraries for each language and didn't research the topic

Yep, this seemed like an aggregate Google results page.

I was initially intrigued by the article and then realized it was a list of libraries the author found via Google. There were some notable omissions from the list and a bunch of weird stuff that feels unnecessary. I don't think the author has actually scraped a page before.


I agree with your conclusion, but in any discussion about web scraping it's probably a good idea to mention BeautifulSoup given how popular it is (virtually a builtin in terms of how much it's used) and given all the documentation available for it; it's a good starting point if performance is not going to be a concern.


Lazyweb link: https://github.com/rushter/selectolax

although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_

> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.

although its other dep seems much more cognizant of the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor

---

> It looks like the author of the article just googled some libraries for each language and didn't research the topic

Heh, oh, new to the Internet, are you? :-D


requests-html is faster than bs4 with lxml - it's a wrapper directly over lxml. I built something similar years ago using a similar method; it was much faster than bs4, too.
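
Roughly like this (untested sketch; the selector is made up):

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get("https://example.com")
    for link in r.html.find("a.article-link"):   # hypothetical selector
        print(link.text, link.attrs.get("href"))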


Another tip: there are a few browser extensions that can record your interactions and generate a Playwright script.

Here's one: https://chrome.google.com/webstore/detail/headless-recorder/...


If you don't want to install another extension, Playwright has built in support for recording.

npx playwright codegen wikipedia.org

https://playwright.dev/docs/next/codegen


I'm not familiar with "playwright", it doesn't seem to be mentioned in OP either.

When I google, I see it advertised as a "testing" tool.

Can I also use it for scraping? Where would I learn more about doing so?


Playwright is essentially a headless Chrome, Firefox, and WebKit browser with a nice API that's intended for automation/scraping. It's far heavier than something like curl, but it has all the capabilities of any browser you want (not just Chrome, as with Puppeteer) and makes stuff like interacting with JavaScript a breeze.

It's similar to Google's Puppeteer, but in my opinion much more pleasant and productive, even when sticking with Chrome. Microsoft's best developer tool IMO; it saves me tons of time.
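
A minimal Python example of what it looks like (sketch, untested):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.firefox.launch()            # or p.chromium / p.webkit
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        print(page.inner_text("h1"))
        browser.close()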


Playwright is similar to Puppeteer, but can drive different browsers, not only Chrome.


I just needed a service to reliably fetch raw pages that I can process in my own application and so far I've been happy with this: https://promptapi.com/marketplace/description/adv_scraper-ap...

$30 / month for 300K requests, rotating residential proxies, uses headless Chromium, etc.


If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:

https://gotripod.com/insights/super-simple-site-crawling-and...

[1] http://go-colly.org/


I used this library to get familiar with Go. It is indeed very powerful, and it makes it really easy to create a scraper.

My main concerns, though, were about testing. What if you want to create tests to check whether your scraper still gets the data you want? Colly allows nested scraping and it's easy to implement, but you end up with all your logic in one big function, making it harder to test.

Did you find a solution to this? I'm considering switching to net/http + GoQuery only to have more freedom.


Not yet but my plan was to just have a static HTML site which the tests could run against.


I have been scraping radio broadcast pages for a decade now. Started with (Ruby) scrapy, then Nokogiri, then moved on to Go and its html package.

Currently I sport a mix of curl + grep + xsltproc + lambdasoup (OCaml) and am happy with it. Sounds like a mess, but it is shallow, inspectable, changeable and concise. http://purl.mro.name/recorder


Last year I needed some quick scraping, and I used a headless Chromium to render the webpages and print the HTML, then analyzed it with C#.

I don't remember exactly, but I think it was around 100 or 200 LOC, so not exactly something that took long to write. In fact the most difficult thing was figuring out how to pass the right args to Chromium.

I wonder: what does a scraping framework offer?


> I wonder: what does a scraping framework offer?

HTTP requests, HTML parsing, crawling, data extraction, wrapping complex browser APIs etc. Nothing you couldn't do yourself, but like most frameworks, they abstract the messy details so you can get a scraper working quickly without having to cobble together a bunch of libraries or re-invent the wheel.
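
For a sense of how little glue code that leaves you to write, here's roughly what a minimal Scrapy spider looks like (sketch, untested; scheduling, retries and throttling come from the framework):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # follow pagination; Scrapy handles scheduling and dedup
            yield from response.follow_all(response.css("li.next a"), callback=self.parse)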


I see, thanks.


Just for one example, when you have to get a form, and then submit the form, with the CSRF protection that was in the form... of course you COULD write that yourself by printing HTML and then analyzing it with C# (which triggers more requests to chromium I guess), but you're probably going to wonder why you are reinventing the wheel when you want to be getting on to the domain-specific stuff.
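
In requests + bs4 terms the dance is roughly this (untested sketch; the field names are made up):

    import requests
    from bs4 import BeautifulSoup

    s = requests.Session()
    soup = BeautifulSoup(s.get("https://example.com/login").text, "html.parser")
    token = soup.find("input", {"name": "csrf_token"})["value"]   # hypothetical field name
    s.post("https://example.com/login",
           data={"csrf_token": token, "user": "me", "password": "secret"})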


Ah yes I see. Mine was read-only, so no need for complex stuff.


Throttling is a prime example. If you start loading multitudes of sites in asynchronous fashion, you'll have to add some delay, otherwise you run the risk of choking the server on misconfigured sites. I've DDoSed sites accidentally this way. You can of course build a framework on your own, and that's pretty much what every scraper does eventually, but it takes time and a lot of effort.
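
A sketch of the kind of self-throttling I mean (untested): cap concurrency and add a delay between requests.

    import asyncio
    import aiohttp

    async def fetch_all(urls, max_concurrent=5, delay=1.0):
        sem = asyncio.Semaphore(max_concurrent)
        async with aiohttp.ClientSession() as session:
            async def fetch(url):
                async with sem:
                    async with session.get(url) as resp:
                        body = await resp.text()
                    await asyncio.sleep(delay)   # be polite to the origin server
                    return body
            return await asyncio.gather(*(fetch(u) for u in urls))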


I think another technique that should be talked about is intercepting network responses as they happen. The web in 2021 still has a whole lot of client-side rendering. For those sites, data is often loaded on the fly with separate network calls (usually with some sort of nonce or contextual key). Much of the hassle in web scraping can be avoided by listening for that specific response instead of parsing an artifact of the JSON->JS->HTML process.
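
With Playwright's Python API the same idea looks roughly like this (sketch, untested; the endpoint and URL are made up):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # wait for the XHR that carries the data instead of parsing the DOM
        with page.expect_response(lambda r: "/api/prices" in r.url) as resp_info:
            page.goto("https://tickets.example.com/event/123")
        data = resp_info.value.json()
        browser.close()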

I put together a toy site [0] recently that uses this approach for JIT price comparisons of events. When you click on an event, the backend navigates to requested ticket provider pages through a pool of Puppeteer instances and waits for JSON responses with pricing data.

[0] https://www.wyzetickets.com


Cloudflare's protection is quite a b*tch to circumvent with any headless or Python library.


Slight aside: the most recent Cloudflare hCaptchas ask you to classify AI-generated images. They don't even look like a proper bike/truck/whatever (I don't have an example handy).

I categorically refuse to do them when I'm browsing websites that use it. I find this new captcha utterly unacceptable.

It's no "protection" at this point anymore. Websites are using it as an excuse to become even more user-hostile. I am worried for the future of the web.


https://news.ycombinator.com/item?id=28514998#28515629

> Cloudflare's bot protection mostly makes use of TLS fingerprinting, and thus pretty easy to bypass.

https://news.ycombinator.com/item?id=28251700 -> https://github.com/refraction-networking/utls

Disclaimer: haven't tried it.


It's a pain even when you aren't a bot. For a while there, Cloudflare's fingerprinting page would trigger Firefox on Linux to crash instantly.


With Node, I've had success with puppeteer-extra using puppeteer-extra-plugin-stealth.


I've been working on a scraping project in Scrapy over the last month, using Selenium as well. My Python skills are mediocre (mostly a Java/Kotlin dev).

Not only has it been a blast to try out, but it was also surprisingly easy to set up.

I now have around 11 domains being scraped 4 times a day through a well-defined pipeline + ETL, which then pipes it to Firebase Firestore for consumption.

Next step is to write the page on top of it.


Are you using Scrapy mainly for scraping, or do you do crawling, as well?


In my case I am only using it for direct scraping.


Self promotion: my SaaS is the lowest cost web scraping tool for high volume, and has been in business since 2016.

https://PhantomJsCloud.com

My SaaS requires some technical knowledge to use (calling a web API), which I suppose is why it's never in these lists.

Some of my customers are *very* large businesses. If you are looking at evading bot countermeasures, my product probably isn't the best for you, but for TCO nothing beats it.


Isn't PhantomJS deprecated and unmaintained?


Yes, bad naming on my part. While it does still support PhantomJS, the default is a Puppeteer backend.


Yep, according to PhantomJS' README, their "development is suspended until further notice".

It looks like phantomjscloud.com also supports Puppeteer.


For some time now - since 2016 I think (though someone briefly tried to revive it) - headless Chrome has done it faster and better.


OCaml's Lambda Soup (https://aantron.github.io/lambdasoup/) is an amazing library, especially for those who prefer functional programming.


> Crawl at off-peak traffic times. If a news service has most of its users present between 9 am and 10 pm – then it might be good to crawl around 11 pm or in the wee hours of the morning.

How do you know this if it is not your website?

Also, the internet has no time zone.


For sites where there is a peak usage time, it's probably obvious what that peak usage time is. A news service (their example) presumably primarily serves a country or a region - then off-peak traffic times are likely at night.

The Internet has no time zone, but its human users all do.


If you're scraping a popular website, Google Trends should be a pretty good proxy.


Why no mention of Selenium? Is it not cool anymore? I have never heard of MechanicalSoup: is it a Selenium replacement?


I moved from Selenium to Playwright. It has a pleasant API and things just work out of the box. I ran into odd problems with Selenium before, especially when waiting for a specific element: Selenium didn't register it, but I could see it load.

It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.

https://playwright.dev/


> is it selenium replacement

No, completely different use case. Selenium is browser automation. MechanicalSoup/Mechanize/RoboBrowser are not actually web browsers, and they have no JavaScript support either. They're Python libraries that can simulate a web browser by doing GET requests, storing cookies across requests, filling HTTP POST forms, etc.

The downside is that they don't work with websites which rely on JavaScript to load content. But if you're scraping a website like that, then it might be easier and way, way faster to analyze the web requests using dev tools or mitmproxy, then automate those API calls instead of automating a browser.
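
Once you've found the JSON endpoint in dev tools or mitmproxy, the scraper often collapses into a plain HTTP call (sketch; the endpoint and parameters are made up):

    import requests

    resp = requests.get(
        "https://example.com/api/v2/search",         # hypothetical endpoint
        params={"q": "widgets", "page": 1},
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    for item in resp.json()["results"]:
        print(item["title"])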


I started reverse engineering web APIs years ago as a more efficient way to scrape. Unfortunately there are new website builders (like Salesforce Lightning) that make it really hard to recreate an HTTP request due to the complexity of the parameters.


Selenium is famously unreliable, so a lot of people have been replacing it with headless chrome where they can.


I have been using selenium with chromedriver, I mistakenly thought those were basically the same thing.

Can you tell me more?


Interesting. I was about to start on some web automation and so far I've had hammered into my head that Selenium is the 'language of the internet' or something along those lines.

What would be a better solution, if you have any to recommend?


I'd suggest Puppeteer / Playwright. Both are great. IIRC the Puppeteer team largely moved to Playwright.


Puppeteer is frustrating to me. When I tried to use it I couldn't get it to click buttons, but I did get it to hover over the button, so I know I had the correct element in my code. Their click function just did nothing at all. I resorted to tabbing a certain number of times and hitting enter.


Thank you for the suggestions! I will check them out.


Nowadays it is more and more common for websites to have some kind of rate-limiting middleware, such as Rack::Attack for Ruby. It would be interesting to explore the strategies to deal with it.


Fundamentally, what most scrapers learn is that the more their scraper can behave like a human browsing the site, the less likely they are to get detected and blocked.

This does put limits on how quickly they can crawl, of course, but scrapers find ways around it, like changing IP and user agent (IP is probably the main one, because you can then pretend that you are multiple humans browsing the site normally).


Yeah, there are services that give you a range of IPs for a certain time.


Even changing IPs won't always work against an adversary with a global view of the Internet such as CloudFlare.

CF has a view on a significant chunk of internet traffic across many sites and feeds that into some kind of heuristics/machine learning. Even if we assume that your behavior on the scraped website looks human-like, you may still get blocked or challenged because of your lack of traffic on other sites.

The IPs you'd get from a typical proxy service would only be used for bot activity and would've been classified as such a long time ago, and there's no "human activity" on it to compensate and muddy the waters so to speak.

The best solution is to use IPs with a chunk of legitimate residential traffic, and keep scraping sessions constrained to their IPs - don't rotate your requests among all these IPs, instead every IP should be its own instance of a human-like scraper, using its own user account, browser cookies, etc.
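
In requests terms, the "one persona per IP" idea looks roughly like this (sketch, untested; the proxy URLs are placeholders):

    import requests

    PROXIES = [
        "http://user:pass@residential-proxy-1:8080",
        "http://user:pass@residential-proxy-2:8080",
    ]

    personas = []
    for proxy in PROXIES:
        s = requests.Session()                       # long-lived: keeps its own cookies
        s.proxies = {"http": proxy, "https": proxy}
        s.headers["User-Agent"] = "Mozilla/5.0 ..."  # pin one realistic UA per persona
        personas.append(s)
    # route all requests for a given "human" through the same session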


You nailed it! I've also faced issues in the past with captchas, and elaborate bot detection mechanisms. It would also be helpful to mention that there are automatic captcha solvers to bypass security once one is detected. I am wondering if it is worthwhile to provide an addition to this post on how to improve the efficacy of scraping despite these roadblocks. The article is geared towards beginner scrapers that are just starting out so maybe it would be overkill? What do you think?


Why not just lower the crawl rate? My search engine crawler visits tens of millions of documents in a week at a rate of 1 doc/second, but a few hundred different domains at the same time.

Going as low as 0.2 docs/second could easily be doable, I think.


Same as always - proxy farms, random popular UAs with random delays etc.


Sorry, what is a UA?


User agent (browser or some other web client).


User Agent


So will Google's freezing of the UA lead to less ability to web scrape for the non-big-company scrapers out there?


What? It’s just a text string in the header. How in the world would that possibly make it more difficult to scrape?

All Chrome is doing is no longer appending the current semver to the UA it sends.


The switch from UA to browser fingerprinting makes it harder to scrape without being stopped.

Yes, at any time the UA could be ignored and clients could be fingerprinted, but now the UA is being made next to useless, so fingerprinting will now become the default everywhere.


I mean, I realize it probably isn't problematic, just wondering, but on the other hand it shouldn't be so difficult to follow the reasoning based on the context, I would think:

The poster says: in order to be able to scrape effectively you should appear to be a real human, use different UAs, etc.

So as this change happens, varying UAs becomes one less thing you can easily do to seem less suspicious, as a non-frozen UA would itself become a suspicious sign after some time.

So a sort of side effect.


Sorry can you please elaborate what is Google doing?


Sorry, I thought it was a well-known thing here given the various discussions over the past year or so: https://groups.google.com/a/chromium.org/g/blink-dev/c/-2JIR...

On edit: so I'm thinking that as there will only be one UA floating around, then sure, older UAs can still exist, but those become progressively more suspicious.


You can easily spoof the UA


> It would be interesting to explore the strategies to deal with it.

The strategy to deal with it is to behave well when making requests so that you don't get rate limited.


Rotate through a bunch of proxies.


I tried Python/BeautifulSoup and Node/Puppeteer recently. It may be because my Python is poor, but Puppeteer seemed more natural to me. Injecting functionality into a properly formed web page felt quite powerful and started me thinking about what you could do with it.


For Java/Kotlin, HtmlUnit is generally pretty great.


But JavaScript is not properly executed in HtmlUnit: https://stackoverflow.com/questions/19646612/javascript-not-being-properly-executed-in-htmlunit


Pyppeteer is feature-complete and worth noting: https://github.com/pyppeteer/pyppeteer


Thanks! Updated the blog post to include it.



On the Ruby side both Nokogiri and Mechanize should be mentioned...


Good call! Will add them in the next version.


I've not followed this space. When I did, there were a lot of questions concerning the legality of automated scraping. Have those legal issues been resolved?


As long as you're scraping publicly available data (i.e. not going behind a login) and avoiding copyrightable content and personal data, you should be mostly fine. This article should answer most of the web scraping legality questions: https://www.crawlnow.com/blog/is-web-scraping-legal


The current article is better than nothing, but it missed https://playwright.dev


Thanks for the pointer! - Just included Playwright as a language agnostic tool.


I'm looking to roll my own Plaid-like service so I can download the CSV files from my bank account and credit card. Would Selenium be the way to go?


In my own experience Puppeteer is much better/more capable than Selenium, but the problem is that Puppeteer requires Node.js. Its Python wrapper, https://github.com/pyppeteer/pyppeteer, was not as good as Selenium when you want to use Python.


Does anyone have a resource for getting into app-based scraping, if the API is obfuscated or rate limited?


ProxyMan on MacOS is quite awesome for this. It requires a bit of setup with certificates, etc., but once it's working you just fire up the target app on your phone and all the sweet sweet API requests appear on your big screen. I've scraped two apps very successfully this way.

It's also fascinating to see how developers-who-aren't-me set up their APIs when they assume that nobody's looking.


Run an http debugger, or some proxy, and find the endpoints.


I’ll chime in with mine: Skyscraper (Clojure) [0] builds on Enlive/Reaver (which in turn build on JSoup), but tries to address cross-cutting concerns like caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset, etc.


Surprised Woob[0] (formerly Weboob) isn't on the list. It's designed for specific tasks, such as getting transactions from your bank, events from different event pages, and much more.

[0] https://woob.tech/


Great article. I've put online a Discord server for sharing knowledge about web scraping, if any of you want to join: https://discord.gg/fwqhqrWhHW


Still using CasperJS and PhantomJS. Both have been deprecated for many years, but I cannot find any replacement. Some of my scraping programs have been running for over 10 years without any issues.


What kind of stuff are people needing to scrape?


I would expect it's roughly the same answers, just varying in the specifics:

* those which don't offer a _reasonable_ API, or (I would guess a larger subset) those which don't expose all the same information over their API

* those things which one wishes to preserve (yes, I'm aware that submitting them to the Internet Archive might achieve that goal)

* and then the subset of projects where it's just a fun challenge or the ubiquitous $other

As an example answer to your question, some sites are even offering bounties for scraped data, so one could scratch a technical itch and help data science at the same time:

https://www.dolthub.com/repositories/pdap/datasets/bounties


I have a side project where I display the daily schedule of 100+ French radio stations, like you would for TV channels.

Scraping works great to get the data.

I don't like Node/JS, but I use it to do the scraping, as I view the code as trash - full of edge cases and unreliable data/types - and I can't complain: a dynamic scripting language is great for that.


I scrape multiple government sites to fill all the data for https://www.quienmerepresenta.com.mx/

It tells you who your governor, local/federal representative, senator and municipal president are. Each representative lives on a different website, so I wrote scrapers for each one.


Scraping saved untold lives this past spring when large healthcare providers (e.g. Walgreens & CVS) opted to hide their vaccination appointments behind redundant survey questions. This made it more difficult to quickly ascertain when an appointment slot would become available. The elderly were less likely to look more than once a day, delaying vaccines for those that needed them the most.

GoodRx built a scraping system that tapped into all the major providers. That's what a group of vaccine hunters in my state used to get appointments for folks who had tried but were unable to.


Building a side project using Python Scrapy to scrape podcast shows. I use it to search by title/description etc. to find interesting podcasts. Also as a way to learn different tools and frameworks.


Websites which change over time and don't provide a simpler way of getting an update (e.g. an RSS feed or a JSON API).



What's the best way to get around AWS/Azure/... IP range bans and VPN bans when scraping?


The large proxy providers operate in a sort of gray market. You pay for "residential" or "ISP" based IP addresses. In some instances these proxy connections are literally being tunneled through browser extensions running on a real world system somewhere (https://hola.org/ for instance)


There are other providers. I think big-time scrapers use residential IPs.


It's all fun and games until PerimeterX comes in.


Puppeteer/Playwright are the easiest for me


Has anyone tried to use Cypress for scraping?


I love web scraping, and I used to download many images through a series of scripts that crawled throughout a certain website.


Any recommendation for scraping behind Google social login?


Is there open source software that can extract the "content" part of a given page cleanly? I'm thinking about what the reader mode in browsers can do as an example, where the main content is somehow isolated and displayed.


You can use SGML (on which HTML is/was based) and my LGPL-licensed sgmljs package [1] for that, plus my SGML DTD grammar for HTML5. [2] describes common tasks in the preservation of Web content to give you a flavor, but you can customize what SGML does with your markup to death, really. In your case, you'll probably want to throw away divs and navs to get clean semantic HTML, which you can do using SGML link processes (= a pipeline of markup filters and transformations), but you could also convert HTML into canonical markup (e.g. XML) and use Turing-complete XML processing tools such as XSLT, as described in the linked tutorial.

[1]: http://sgmljs.net

[2]: http://sgmljs.net/docs/parsing-html-tutorial/parsing-html-tu...


I believe the main library for reader mode is called Readability. I played around with a Python implementation a while back; just pipe in your raw HTML as part of the process. It's good, but not flawless. If I remember correctly, it included some quotes and image text as part of the body for the site I tried it on.
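
The Python port I used was, I think, readability-lxml; it's roughly this (untested):

    import requests
    from readability import Document

    html = requests.get("https://example.com/article").text
    doc = Document(html)
    print(doc.title())
    print(doc.summary())    # the isolated "content" HTML, reader-mode style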


There's a PHP port of Readability and it works for some sites, for others not at all. Very far from perfect.



