The State of Web Scraping in 2021

dec0dedab0de · on Oct 11, 2021

Scraping things that don't want to be scraped is one of my favorite things to do. At work this is usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last 6 years or so, I don't have a need to do it all to often. Plus with things like selenium it's too easy to just run the page as is, and I can't justify spending the time to figuring out the undocumented API.

My favorite one implemented CSRF protections by polling an endpoint, and adding in the hashed data from that endpoint and a timestamp on every request.

When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, It makes me very sad that they're missing out.

tmpz22 · on Oct 11, 2021

To be fair selenium style scraping can take a lot of time to setup if you aren’t already familiar with the tooling, and the browser rendering apis are unintuitive and sometimes flat out broken.

dec0dedab0de · on Oct 11, 2021

Maybe it's because I'm using the python bindings, but it took me about an hour to go from never using it to having it do what I needed it to do. I just messed around in a jupyter notebook until I got what I needed working. Tab complete on live objects is your friend. The hardest part was figuring out where to download a headless browser from.

Though I do prefer requests/bs4. I wrote a helper to generate a requests.Session object from a selenium Browser object. I had something recently where the only thing I needed the javascript engine for was a login form that changed. So by doing it this way I didn't have to rewrite the whole thing. Still kind of bothers me I didn't take the time to figure out how to do it without the headless browser, but it works fine, and I have other things to do.

ipaddr · on Oct 11, 2021

That's why things like laravel's Dusk exists to put a layer over that complex experience.

chinchilla2020 · on Oct 11, 2021

I was surprised not to see selenium in this article. It is a common tool

marvram · on Oct 12, 2021

You're absolutely right. It slipped my mind as I considered it more of a language-agnostic tool, and I organized the article as the provisioning of tools for all popular programming languages. That said, I added it to the post as a language-agnostic tool - thanks for the pointer!

eastendguy · on Oct 11, 2021

> Scraping things that don't want to be scraped

If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.

elorant · on Oct 11, 2021

Assuming that you eventually manage to load the page somehow. Which in some edge cases may entail simulating mouse movements and random delays.

eastendguy · on Oct 12, 2021

Agreed. -> I use the ui.vision extension to simulate native mouse movements.

timwis · on Oct 11, 2021

Have you tried on a page protected by cloudflare captcha?

1vuio0pswjnm7 · on Oct 11, 2021

Its funny I never seem to hit these infamous Clouflare captchas. The only impediment I encounter with Cloudflare is they require plaintext SNI to read their blog, https://blog.cloudflare.com. Unlike almost all other Cloudflare, ESNI will not work.

dec0dedab0de · on Oct 12, 2021

I have not had to deal with that, but I have idly thought that it might be easier to pipe the audio version into google assistant or something, and see what it comes up with.

eastendguy · on Oct 12, 2021

It seems to be no problem if you automate a real browser as opposed to a headless browser. I think they test for that.

mkl · on Oct 11, 2021

A browser extension is probably an easier way to extract text than OCR (unless you're targeting a wide range of sites, I suppose).

jamesfinlayson · on Oct 11, 2021

I remember a workmate having to deal with some difficult to scrape data at a previous job - the page randomly rendered with different mark-up (but the same appearance) to mitigate pulling out data using selectors. I think he got to the bottom of it eventually but it made testing his work a pain.

no_time · on Oct 12, 2021

Playwright's layout selectors might help the next time you encounter this.

https://playwright.dev/docs/selectors#selecting-elements-bas...

f311a · on Oct 11, 2021

For Python, instead of BeautifulSoup I prefer to use selectolax which is 3-5 times faster.

Also, I think very few people use MechanicalSoup nowadays. There are libraries that allow you to use headless Chrome, e.g. Playwright.

It looks like the author of the article just googled some libraries for each language and didn't research the topic.

jacurtis · on Oct 11, 2021

> It looks like the author of the article just googled some libraries for each language and didn't research the topic

Yep, this seemed like an aggregate Google results page.

I was initially intrigued by the article and then realized it was a list of libraries the author found via Google. There were significantly notable omissions from this list and a bunch of weird stuff that feels unnecessary. I don't think the author has actually scraped a page before.

xnyan · on Oct 11, 2021

I agree with your conclusion, but in any discussion about web scraping it's probably a good idea to mention BeautifulSoup given how popular it is (virtually a builtin in terms how much it's used) and given all the documentation available for it, a good starting point if perf is not going to be a concern.

mdaniel · on Oct 11, 2021

Lazyweb link: https://github.com/rushter/selectolax

although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_

> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.

although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor

---

> It looks like the author of the article just googled some libraries for each language and didn't research the topic

Heh, oh, new to the Internet, are you? :-D

heavyset_go · on Oct 11, 2021

requests-html is faster than bs4 using lxml. It's a wrapper over lxml. I built something similar years ago using a similar method, it was much faster than bs4, too.

m_ke · on Oct 11, 2021

Another tip, there are a few browser extensions that can record your interactions and generate a playwright script.

Here's one: https://chrome.google.com/webstore/detail/headless-recorder/...

sidharthv · on Oct 11, 2021

If you don't want to install another extension, Playwright has built in support for recording.

npx playwright codegen wikipedia.org

https://playwright.dev/docs/next/codegen

jrochkind1 · on Oct 11, 2021

I'm not familiar with "playwright", it doesn't seem to be mentioned in OP either.

When I google, I see it advertised as a "testing" tool.

Can I also use it for scraping? Where would I learn more about doing so?

xnyan · on Oct 11, 2021

Playwright is essentially a headless chrome, firefox, and webkit browser with a nice API that's intended for automation/scraping. It's far more heavy than something like curl, but it has all the capabilities of any browser you want (not just chrome as with puppeteer) and makes stuff like interacting with javascript a breeze.

It's similar to Google's puppeteer, but in my opinion even with chrome much more pleasant and productive. Microsoft's best developer tool IMO, saves me tons of time.

rmetzler · on Oct 11, 2021

Playwright is similar to Puppeteer, but can use different browsers not only Chrome.

benzible · on Oct 11, 2021

I just needed a service to reliably fetch raw pages that I can process in my own application and so far I've been happy with this: https://promptapi.com/marketplace/description/adv_scraper-ap...

$30 / month for 300K requests, rotating residential proxies, uses headless Chromium, etc.

colinramsay · on Oct 11, 2021

If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:

https://gotripod.com/insights/super-simple-site-crawling-and...

[1] http://go-colly.org/

hivacruz · on Oct 11, 2021

I used this library to get familiar with Go. It is indeed very powerful and really easy to create a scraper.

My main concerns though were about testing. What if you want to create tests to check if your scraper still gets the data we want? Colly allows nested scraping and it's easy to implement but you have all your logic into one big function, making it harder to test.

Did you find a solution to this? I'm considering switching to net/http + GoQuery only to have more freedom.

colinramsay · on Oct 11, 2021

Not yet but my plan was to just have a static HTML site which the tests could run against.

mro_name · on Oct 11, 2021

I am scraping radio broadcast pages for a decade now. Started with (ruby) scrapy, then nokogiri, then moved on to go and their html package.

Currently sport a mix of curl + grep + xsltproc + lambdasoup (OCaml) and am happy with it. Sounds like a mess but is shallow, inspectable, changeable and concise. http://purl.mro.name/recorder

jmnicolas · on Oct 11, 2021

Last year I needed some quick scraping and I used a headless Chromium to render webpages and print the HTML then analyze it with C#.

I don't remember exactly, but I think it was around 100 or 200 loc, so not exactly something that took long to write. In fact the most difficult thing was to figure how to pass the right args to Chromium.

I wonder what does a scraping framework offer?

Veen · on Oct 11, 2021

> I wonder what does a scraping framework offer?

HTTP requests, HTML parsing, crawling, data extraction, wrapping complex browser APIs etc. Nothing you couldn't do yourself, but like most frameworks, they abstract the messy details so you can get a scraper working quickly without having to cobble together a bunch of libraries or re-invent the wheel.

jmnicolas · on Oct 11, 2021

I see thanks.

jrochkind1 · on Oct 11, 2021

Just for one example, when you have to get a form, and then submit the form, with the CSRF protection that was in the form... of course you COULD write that yourself by printing HTML and then analyzing it with C# (which triggers more requests to chromium I guess), but you're probably going to wonder why you are reinventing the wheel when you want to be getting on to the domain-specific stuff.

jmnicolas · on Oct 11, 2021

Ah yes I see. Mine was read-only, so no need for complex stuff.

elorant · on Oct 11, 2021

Throttling is a prime example. If you start loading multitudes of sites in asynchronous fashion you'll have to enter some delay otherwise you run the risk of choking the server in misconfigured sites. I've DDoSed sites accidentally this way. You can of course build a framework on your own, and that's pretty much what every scraper does eventually, but it takes time and a lot of effort.

krakengerry · on Oct 12, 2021

I think another technique that should be talked about is intercepting network responses as they happen. The web in 2021 still has a whole lot of client-side rendering. For those sites, data is often loaded on the fly with separate network calls (usually with some sort of nonce or contextual key). Much of the hassle in web scraping can be avoided by listening for that specific response instead of parsing an artifact of the JSON->JS->HTML process.

I put together a toy site [0] recently that uses this approach for JIT price comparisons of events. When you click on an event, the backend navigates to requested ticket provider pages through a pool of Puppeteer instances and waits for JSON responses with pricing data.

[0] https://www.wyzetickets.com

marban · on Oct 11, 2021

Cloudflare's protection is quite a b*tch to circumvent with any headless or python library.

omneity · on Oct 11, 2021

Slight aside: The most recent Cloudflare HCaptchas ask you to classify AI generated images. They don’t even look like a proper bike/truck/whatever (I don’t have an example handy).

I categorically refuse to do when I’m browsing websites using it. I find this new captcha utterly unacceptable.

It’s no “protection” at this point anymore. Websites are using it as an excuse to become even more user hostile. I am worried for the future of the web.

password4321 · on Oct 11, 2021

https://news.ycombinator.com/item?id=28514998#28515629

> Cloudflare's bot protection mostly makes use of TLS fingerprinting, and thus pretty easy to bypass.

https://news.ycombinator.com/item?id=28251700 -> https://github.com/refraction-networking/utls

Disclaimer: haven't tried it.

heavyset_go · on Oct 11, 2021

It's a pain even when you aren't a bot. For a while there, Cloudflare's fingerprinting page would trigger Firefox on Linux to crash instantly.

alphabet9000 · on Oct 11, 2021

with node, i've had success with puppeteer-extra using puppeteer-extra-plugin-stealth

MrDresden · on Oct 11, 2021

I've been working on a scraping project in Scrapy over the last month, using Selenium as well. My Python skills are mediocre (mostly a Java/Kotlin dev).

Not only has it been a blast to try out, but also surprisingly easy to setup.

I now have around 11 domains being scraped 4 times a day through a well defined pipeline + ETL then pipes it to Firebase Firestore for consumption.

Next step is to write the page on top of it.

heavyset_go · on Oct 11, 2021

Are you using Scrapy mainly for scraping, or do you do crawling, as well?

MrDresden · on Oct 12, 2021

In my case I am only using it for direct scraping.

novaleaf · on Oct 11, 2021

Self promotion: my SaaS is the lowest cost web scraping tool for high volume, and has been in business since 2016.

https://PhantomJsCloud.com

My SaaS requires some technical knowledge to use (call a web api) which I suppose is why it's not ever in these lists.

Some of my customers are *very* large businesses. If you are looking at evading bot countermeasures, my product isn't probably the best for you. but for TCO nothing beats it.

lloydatkinson · on Oct 11, 2021

Isn't phantomjs deprecated and unmaintained?

novaleaf · on Oct 11, 2021

yes, bad naming on my part. While it does support PhantomJs still, the default is a Puppeteer backend.

spiffytech · on Oct 11, 2021

Yep, according to PhantomJS' README, their "development is suspended until further notice".

It looks like phantomjscloud.com also supports Puppeteer.

jamesfinlayson · on Oct 11, 2021

For some time now - since 2016 I think (though someone briefly tried to revive it) - headless Chrome does it faster and better now.

toastal · on Oct 11, 2021

OCaml’s Lambda Soup (https://aantron.github.io/lambdasoup/) is a amazing library/, especially for those that prefer functional programming

amelius · on Oct 11, 2021

> Crawl at off-peak traffic times. If a news service has most of its users present between 9 am and 10 pm – then it might be good to crawl around 11 pm or in the wee hours of the morning.

How do you know this if it is not your website?

Also, the internet has no time zone.

chucky · on Oct 11, 2021

For sites where there is a peak usage time, it's probably obvious what that peak usage time is. A news service (their example) presumably primarily serves a country or a region - then off-peak traffic times are likely at night.

The Internet has no time zone, but its human users all do.

numeralls · on Oct 11, 2021

If your scraping a popular website Google Trends should be a pretty good proxy

gcatalfamo · on Oct 11, 2021

Why no mention of selenium? Is it not cool anymore? I have never heard of mechanicalsoup: is it selenium replacement?

duckmysick · on Oct 11, 2021

I moved from selenium to playwright. It has a pleasant API and things just works our of box. I ran into odd problems with selenium before, especially when waiting for a specific element. Selenium didn't register it, but I could see it load.

It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.

https://playwright.dev/

IceWreck · on Oct 11, 2021

> is it selenium replacement

No completely different use case. Selenium is browser automation. Mechanical soup/Mechanize/Robobrowser are not actually web browsers, they have no javascript support either. They're python libraries that can simulate a web browser but doing GET requests, storing cookies across requests, filling http POST forms, etc.

The downside is that they don't work with websites which rely on JavaScript to load content. But if you're scraping a website like that, then it might be easier and way way faster to analyze web requests using dev tools or mitmproxy, then automating those API calls instead of automating a browser.

gcatalfamo · on Oct 11, 2021

I started reverse engineering web apis years ago as a more efficient way to scrape. Unfortunately there are new website builders (like salesforce lightning) that make really hard to recreate an http request due to the complexity of the parameters

nicoburns · on Oct 11, 2021

Selenium is famously unreliable, so a lot of people have been replacing it with headless chrome where they can.

gcatalfamo · on Oct 11, 2021

I have been using selenium with chromedriver, I mistakenly thought those were basically the same thing.

Can you tell me more?

melomal · on Oct 11, 2021

Interesting. I was about to start on some web automation and so far I've had hammered into my head that Selenium is the 'language of the internet' or something along those lines.

What would be a better solution, if you have any to recommend?

xzel · on Oct 11, 2021

I'd suggest Puppeteer / Playwright. Both are great. Iirc the puppeteer team largely moved to playwright.

kjkjadksj · on Oct 11, 2021

Puppeteer is frustrating to me. When I tried to use it I couldn’t get it to click buttons, but I did get it to hover on the button so I know I had the correct element in my code. Their click function just did nothing at all. I resorted to tabbing a certain amount of times and hitting enter.

melomal · on Oct 12, 2021

Thank you for the suggestions! I will check them out.

juanse · on Oct 11, 2021

Nowadays is more and more common for websites to have some kind of rate limiting middleware such as rack attack for ruby. It would be interesting to explore the strategies to deal with it.

Ozzie_osman · on Oct 11, 2021

Fundamentally, what most scrapers learn is that the more their scraper can behave like a human browsing the site, the less likely they are to get detected and blocked.

This does put limits on how quickly they can crawl, of course, but scrapers find ways around it like changing ip and user agent (ip is probably the main one, bec you can then pretend that you are multiple humans browsing the site normally).

prox · on Oct 11, 2021

Yeah, there are services that give you a range of IPs for a certain time.

Nextgrid · on Oct 11, 2021

Even changing IPs won't always work against an adversary with a global view of the Internet such as CloudFlare.

CF has a view on a significant chunk of internet traffic across many sites and feeds that into some kind of heuristics/machine learning. Even if we assume that your behavior on the scraped website looks human-like, you may still get blocked or challenged because of your lack of traffic on other sites.

The IPs you'd get from a typical proxy service would only be used for bot activity and would've been classified as such a long time ago, and there's no "human activity" on it to compensate and muddy the waters so to speak.

The best solution is to use IPs with a chunk of legitimate residential traffic, and keep scraping sessions constrained to their IPs - don't rotate your requests among all these IPs, instead every IP should be its own instance of a human-like scraper, using its own user account, browser cookies, etc.

marvram · on Oct 12, 2021

You nailed it! I've also faced issues in the past with captchas, and elaborate bot detection mechanisms. It would also be helpful to mention that there are automatic captcha solvers to bypass security once one is detected. I am wondering if it is worthwhile to provide an addition to this post on how to improve the efficacy of scraping despite these roadblocks. The article is geared towards beginner scrapers that are just starting out so maybe it would be overkill? What do you think?

marginalia_nu · on Oct 11, 2021

Why not just lower the crawl rate? My search engine crawler visits tens of million documents in a week at a rate of 1 doc/second, but a few hundred different domains at the same time.

Going as low as 0.2 dps could easily be doable I think.

hdjjhhvvhga · on Oct 11, 2021

Same as always - proxy farms, random popular UAs with random delays etc.

juanse · on Oct 11, 2021

Sorry, what is a UA?

Veen · on Oct 11, 2021

User agent (browser or some other web client).

shmoogy · on Oct 11, 2021

User Agent

bryanrasmussen · on Oct 11, 2021

so will Google's freezing of the UA lead to less ability to web scrape for the non big company scrapers out there?

dreyfan · on Oct 11, 2021

What? It’s just a text string in the header. How in the world would that possibly make it more difficult to scrape?

All Chrome is doing is stop appending the current semver in the UA it sends.

heavyset_go · on Oct 11, 2021

The switch from UA to browser fingerprinting makes it harder to scrape without being stopped.

Yes, at any time the UA could be ignored and clients could be fingerprinted, but now the UA is being made next to useless, so fingerprinting will now become the default everywhere.

bryanrasmussen · on Oct 11, 2021

I mean I realize it probably isn't a problematic, just wondering, but on the other hand it shouldn't be so difficult to follow the reasoning based on the context I would think:

poster says - in order to be able to scrape effectively you should appear to be a real human, use different UAs etc.

So as this change happens different UAs become one less thing that you can easily change to seem less suspicious, as a non-frozen UA would then be a suspicious sign after some time.

So a sort of side effect.

vivekv · on Oct 11, 2021

Sorry can you please elaborate what is Google doing?

bryanrasmussen · on Oct 11, 2021

sorry, I thought it was a well known thing here given the various discussions over past year or so https://groups.google.com/a/chromium.org/g/blink-dev/c/-2JIR...

on edit: so I'm thinking as there will only be one UA floating around then, sure, older UAs can exist, but those become progressively more suspicious.

hdfinenwvdun · on Oct 11, 2021

You can easily spoof the UA

heavyset_go · on Oct 11, 2021

> It would be interesting to explore the strategies to deal with it.

The strategy to deal with it is to behave well when making requests so that you don't get rate limited.

stef25 · on Oct 11, 2021

Rotate through a bunch of proxies.

beardyw · on Oct 11, 2021

I tried Python/ BeautifulSoup and Node/Puppeteer recently. It may be because my Python is poor, but puppeteer seemed more natural to me. Injecting functionality into a properly formed web page felt quite powerful and started me thinking about what you could do with it.

Jenkins2000 · on Oct 11, 2021

For Java/Kotlin, HtmlUnit is generally pretty great.

anjingchi · on Oct 14, 2021

But JavaScript not being properly executed in HtmlUnit https://stackoverflow.com/questions/19646612/javascript-not- being-properly-executed-in-htmlunit

say_it_as_it_is · on Oct 11, 2021

Pyppetteer is feature complete and worth noting: https://github.com/pyppeteer/pyppeteer

marvram · on Oct 12, 2021

Thanks! Updated the blog post to include it.

ardalann · on Oct 13, 2021

self promotion: I launched my no-code scraping cloud software on ProductHunt last month after a year of testing with beta users: https://www.producthunt.com/posts/browse-ai

Here are a few comparisons if you're curious:

- https://www.browse.ai/vs/hexomatic

- https://www.browse.ai/vs/import-io

- https://www.browse.ai/vs/octoparse

- https://www.browse.ai/vs/oxylabs

- https://www.browse.ai/vs/parsehub

- https://www.browse.ai/vs/simplescraper

- https://www.browse.ai/vs/webscraper

- https://www.browse.ai/vs/zyte

abzug · on Oct 11, 2021

On the Ruby side both Nokogiri and Mechanize should be mentioned...

marvram · on Oct 12, 2021

Good call! ~ Will add them in the next version.

SubiculumCode · on Oct 11, 2021

I've not followed this space. When I did, there were a lot of questions concerning the legality of automated scraping. Have those legal issues been resolved?

hafizhamid · on Oct 20, 2021

As long as you're scraping publicly available data (i.e. not going behind a login) and avoiding copyrightable content and personal data, you should be mostly fine. This article should answer most of the web scraping legality questions: https://www.crawlnow.com/blog/is-web-scraping-legal

chaostheory · on Oct 12, 2021

The current article is better than nothing, but it missed https://playwright.dev

marvram · on Oct 12, 2021

Thanks for the pointer! - Just included Playwright as a language agnostic tool.

travisporter · on Oct 11, 2021

I’m looking to roll my own Plaid-like service so I can download the CSVfiles from my bank account and credit card. Would selenium be the way to go?

synergy20 · on Oct 11, 2021

In my own experience puppeteer is much better/capable than selenium but the problem is that puppeteer requires nodejs. its python-wrapper https://github.com/pyppeteer/pyppeteer was not as good as selenium when you like to use python.

Quessked73 · on Oct 11, 2021

Does anyone have a resource for getting into app-based scraping, if the API is obfuscated or rate limited?

nlh · on Oct 12, 2021

ProxyMan on MacOS is quite awesome for this. It requires a bit of setup with certificates, etc., but once it's working you just fire up the target app on your phone and all the sweet sweet API requests appear on your big screen. I've scraped two apps very successfully this way.

It's also fascinating to see how developers-who-aren't-me setup their APIs when they assume that nobody's looking.

elorant · on Oct 11, 2021

Run an http debugger, or some proxy, and find the endpoints.

nathell · on Oct 11, 2021

I’ll chime in with mine: Skyscraper (Clojure) [0] builds on Enlive/Reaver (which in turn build on JSoup), but tries to address cross-cutting concerns like caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset, etc.

gizdan · on Oct 11, 2021

Surprised Woob[0] (formerly Weboob) isn't on the list. It's designed for specific tasks, such as getting transactions from your bank, events from different event pages, and much.

[0] https://woob.tech/

PigiVinci83 · on Oct 12, 2021

Great article. I've put online a discord server for sharing knowledge about web scraping, if some of you wants to join https://discord.gg/fwqhqrWhHW

holoduke · on Oct 11, 2021

Still using casperjs and phantomjs. Both are deprecated for many years, but I cannot find any replacement. Some of my scraping programs are running over 10 years without any issues.

zlib · on Oct 11, 2021

What kind of stuff are people needing to scrape?

mdaniel · on Oct 11, 2021

I would expect it's roughly the same answers, just varying in the specifics:

* those which don't offer a _reasonable_ API, or (I would guess a larger subset) those which don't expose all the same information over their API

* those things which one wishes to preserve (yes, I'm aware that submitting them to the Internet Archive might achieve that goal)

* and then the subset of projects where it's just a fun challenge or the ubiquitous $other

As an example answer to your question, some sites are even offering bounties for scraped data, so one could scratch a technical itch and help data science at the same time:

https://www.dolthub.com/repositories/pdap/datasets/bounties

conradfr · on Oct 11, 2021

I have a side-project where I display the schedule of the day of 100+ French radios, like you would for TV channels.

Scraping works great to get the data.

I don't like node/js but I use it to do the scraping as I view the code as trash and full of edge cases and unreliable data / types and I can't complain, a dynamic scripting language is great for that.

hyeomans · on Oct 11, 2021

I scrape multiple government sites to fill all the data for https://www.quienmerepresenta.com.mx/

It tells you who is your governor, local/federal representative, senator and municipal president. Each representative lives on a different website so I wrote scrappers for each one.

ianhawes · on Oct 11, 2021

Scraping saved untold lives this past spring when large healthcare providers (i.e. Walgreens & CVS) opted to hide their vaccination appointments behind redundant survey questions. This made it more difficult to quickly ascertain when an appointment slot would become available. The elderly were less likely to look more than once a day, delaying vaccines for those that needed it the most.

GoodRX built a scraping system that tapped into all the major providers. Thats what a group of vaccine hunters in my state used to get appointments for folks that had tried but were unable to.

perlwle · on Oct 12, 2021

Building a side project using python scrapy to scrape podcast shows. I use it to search by title/description etc to find interesting podcasts. Also as a way to learn different tools and frameworks.

rmetzler · on Oct 11, 2021

Websites which change over time and don’t provide a simpler way of getting an update (e.g. an RSS feed or a JSON api).

selecsosi · on Oct 12, 2021

https://j-archive.com/

rafale · on Oct 11, 2021

What's the best way to get around AWS/Azure/... ip range ban and VPN ban when scrapping?

jmuguy · on Oct 11, 2021

The large proxy providers operate in a sort of gray market. You pay for "residential" or "ISP" based IP addresses. In some instances these proxy connections are literally being tunneled through browser extensions running on a real world system somewhere (https://hola.org/ for instance)

killingtime74 · on Oct 11, 2021

There are other providers. I think big time scrapers use residential IPs

jakearmitage · on Oct 11, 2021

It's all fun and games until PerimeterX comes in.

tough · on Oct 12, 2021

Puppeteer/Playwright are the easiest for me

anjingchi · on Oct 14, 2021

Anyone have tried to use Cypress for scraping?

frankfury · on Oct 12, 2021

I love web scraping, and I used to download many images through a series of scripts that crawled throughout a certain website.

lucasverra · on Oct 11, 2021

any recommendation to scrape behind google social login?

throwawaysea · on Oct 11, 2021

Is there open source software that can extract the "content" part of a given page cleanly? I'm thinking about what the reader mode in browsers can do as an example, where the main content is somehow isolated and displayed.

tannhaeuser · on Oct 11, 2021

You can use SGML (on which HTML is/was based) and my LGPL-licensed sgmljs package [1] for that, plus my SGML DTD grammar for HTML5. [2] describes common tasks in preservation of Web content to give you a flavor, but you can customize what SGML does with your markup to death really; in your case, you'll probably want to throw away divs and navs to get clean semantic HTML which you can do using SGML link processes (= pipeline of markup filters and transformations), but you could also convert HTML into canonical markup (eg XML) and use Turing-complete XML processing tools such as XSLT as described in the linked tutorial.

[1]: http://sgmljs.net

[2]: http://sgmljs.net/docs/parsing-html-tutorial/parsing-html-tu...

specproc · on Oct 11, 2021

I believe the main library for reader mode is called readability. I played around with a python implementation a while back. Just pipe in your raw html as part of the process. It's good, but not flawless. If I remember correctly, it included some quotes and image text as part of the body for the site I tried it on.

stef25 · on Oct 11, 2021

There's a PHP port of Readability and it works for some sites, for others not at all. Very far from perfect.