Web scraping with Python open knowledge (github.com/reanalytics-databoutique)
219 points by PigiVinci83 on May 27, 2022 | 74 comments



I recently joined a brilliant web scraping API company called ScrapFly, who provided me with the resources to create lots of open knowledge on web scraping that I always wanted to create!

So, to add to this list, here are my top 3 favorite articles that could expand OP's document:

0 - central guide to avoiding blocking - this one was a tough one because there's so much information: request headers, HTTP versions, TLS fingerprinting, JavaScript fingerprinting, etc. I spent almost a month working on these, and it was an amazing research experience that I could never have afforded before.

1, 2 - XPath and CSS selector introduction articles, where I built a widget into our articles that lets you test all the CSS and XPath selectors right there in the learning material.

3 - introduction to reverse engineering - a quick introduction to using browser devtools for web scraping: how to inspect the network and replicate it in your program. This is where I point all beginners, as understanding the browser really helps with understanding web scraping! (A small sketch of replaying such a request follows the links below.)

0 - https://scrapfly.io/blog/parsing-html-with-css/

1 - https://scrapfly.io/blog/parsing-html-with-xpath/

2 - https://scrapfly.io/blog/how-to-scrape-without-getting-block...

3 - https://scrapecrow.com/reverse-engineering-intro.html
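To give a taste of that reverse-engineering workflow, here's a minimal sketch of replaying a JSON XHR found in the browser's Network tab with plain requests; the endpoint, parameters and headers are hypothetical stand-ins for whatever your own devtools session shows.

    import requests

    # Hypothetical endpoint copied from the Network tab ("Copy as cURL" is handy here)
    url = "https://example.com/api/products"
    params = {"page": 1, "sort": "price"}

    # Replay the headers the browser sent so the backend treats this like the original XHR
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    data = response.json()  # the same JSON the page's frontend would have rendered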


Do you follow robots.txt or do you allow your customers to bypass restrictions placed by the website? How do you feel about bypassing that? I know it's not illegal but it's certainly not ethical either.


Could you expand on why you think it's ethically required to follow robots.txt instructions?

The primary argument _in favor_ of automation (e.g. web scraping) is that it would be unethical to hire hundreds of people to do menial, unfulfilling tasks like mindlessly clicking around a website and saving the pages when it could be done by a program, which is countless times more efficient for everyone involved (the website included) and safer.


The ethical argument has nothing to do with hiring hundreds of people to circumvent robots.txt requirements!

The whole point is to avoid unnecessary or excessive crawling by bots that are engineered with no concern for anything other than the owner's motivations, presumably financial in most cases.


Sure. At earlier jobs, I often got paged because of some bot running wild in crawling our web pages. Some pages are heavy and we don't expect them to be hit often, but once you crawl these pages indiscriminately (even if accidentally) that can bring a site down. There are also some pages whose underlying resources are billed in a pay-as-you-use model. Once again, heavy bot traffic ran up our bills.

Robots.txt allows site owners to restrict such pages from being crawled by bots. Services that allow people to circumvent the restrictions are being rude, to say the least. Many crawling services also use a farm of proxies that spoof their real identity with fake user agents to circumvent rate limiting etc. All of these "strategies" go far beyond basic automation and are quite shady in reality.


There's actually a difference between crawling and web scraping. Crawling discovers pages in a very loose manner by following all links, digesting them and producing more crawl tasks. Web scraping, on the other hand, is a more controlled environment where the rules are pretty strict, e.g. scrape `product-<product id>.html` links for product data, so web scrapers are very unlikely to stumble on some page randomly.
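As a rough sketch of that "controlled environment" point (the URLs and pattern below are made up), a scraper typically whitelists a strict URL pattern up front instead of following everything it finds:

    import re
    from urllib.parse import urljoin

    import requests
    from parsel import Selector

    # Hypothetical listing page and the only URL pattern the scraper cares about
    LISTING_URL = "https://example.com/catalog"
    PRODUCT_RE = re.compile(r"/product-\d+\.html$")

    html = requests.get(LISTING_URL).text
    links = Selector(text=html).css("a::attr(href)").getall()

    # Unlike a crawler, only follow links matching the strict product pattern
    product_urls = [urljoin(LISTING_URL, href) for href in links if PRODUCT_RE.search(href)]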

Also, unfortunately, robots.txt is rarely used to indicate non-crawlable endpoints these days; instead it's used as a way to withhold public data. Just take any random big website and look at their robots.txt file:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
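(As an aside, checking what a robots.txt actually allows for a given user agent is a couple of lines with the standard library; the site and agent names here are placeholders.)

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # With a Googlebot-only policy like the one above, a generic bot gets False here
    print(rp.can_fetch("MyScraper", "https://example.com/some-page.html"))
    print(rp.can_fetch("Googlebot", "https://example.com/some-page.html"))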


There is no clear “right” or “wrong” side here, although not honoring robots.txt is a bit frowned upon, comparable to someone telling you certain areas are off-limits, but you decide to go there anyway.

Having said that, if you’re in the web scraping business dealing with anti-scrape shields and whatnot, ignoring robots.txt is the least nefarious of them all.


My personal point of view (my opinion, not my company's) is that it's unethical to take advantage of all the benefits public data provides, which, let's be honest, are absolutely massive (search engine indexing, brand building, content previews, etc.), while avoiding the costs of having data public, like someone automating their browsing experience.

I mean, we've had a solution to web scrapers since the inception of web authentication, but the value of having the data public clearly outweighs the costs of having your data scraped, to the point where big corporations would rather take web scrapers all the way to the Ninth Circuit (the LinkedIn case) than shut down public access.

That being said, our understanding of information philosophy is still in its infancy, so it's hard to discuss ethics here. Generally, I'm in favor of hackers, individuals and decentralization over big corporations, and access to web scraping empowers the former and weakens the latter - so I'm rooting for the healthier, better version of the internet above all!


You make fair points about the negative side of the web in an unideal world.

But please stop framing your encouragement of toxic crawling practices as some sort of noble pursuit in a made-up fight against The Man.

Just own it as the "I'm-alright-Jack" approach it is; the honesty will make it a more respectable position intellectually, even if it remains unethical.


The robots.txt file at times makes it easier to scrape a target because of the info a website can reveal in it (e.g. allowing a specific bot to scrape everything). Their sitemaps are another gem.


Thank you, your comment is much appreciated.


Your first link is broken, it is not about blocking.


Sorry, I reordered the links and forgot to update the footnotes - you can probably guess which one is which by the link text itself, though :)


A work-in-progress guide about web scraping in Python, anti-bot software and techniques, and so on. Please feel free to share, and contribute with your own experience too.


The tab formatting seems like an odd (and rather unPythonic) addition. What's the intention there?


Same here. It feels out of place and unnecessary, and its rationalization is unconvincing. Considering this is Python, it's an outright weird suggestion.


Python allows indenting using tabs, so I don't understand why it's a weird decision.

In fact, they even stated their reasoning in the document. I don't see why anyone has to blindly follow PEP8, nor do I get why a 4-space indent has to be considered the standard.


> Python allows indenting using tabs, so I don't understand why it's a weird decision.

A standard is not a set of rules already enforced in a language, otherwise it would not be needed. It's rather a set of practical guidelines that a group of people agrees upon, with the purpose of making each other's lives easier. That's why indenting with tabs is weird.

> I don't see why anyone has to blindly follow PEP8

In the <tabs> vs <space> debate there's really no reason not to follow PEP8. The number of people (and editors, and tooling, etc.) that abide by it is quite large, and it seems to work well enough for most. The only reason that someone would even mention spaces as an inconvenience is that they can somehow perceive a difference when editing code, which may point to badly configured or outdated tooling rather than a faulty standard. Most people who code Python have their editor set to insert 4 spaces when the <tab> key is pressed and delete 4 consecutive spaces on <Backspace>. If instead you repeatedly press (4 times) the <space> bar or <Backspace> to insert or remove code indentation, you're doing it wrong.


I think tabs are handier exactly because they don't have a fixed width. You can adjust code easily to your readability preference without actually having to change the code. Some people like deep indents, others like them more shallow. Just modify the tab width to your personal preference.

Also, it avoids the issue of only 2 or 4 spaces being acceptable, which leaves 1, 3 or 5 spaces as incorrect combinations that lead to issues. With tabs there are no invalid combinations, though you can of course still have more or fewer indents than intended. But the hunting for that extra space you copied in is gone.

I just don't subscribe to the 'tabs are evil' narrative. I like that python supports them but I think it's really annoying that YAML doesn't.

The argument of "each editor does things differently" is also not really valid when you're going to need a special editor that can convert tabs to spaces and delete spaces in bunches to really work with it comfortably. It would have been much easier to just use an editor that handles tabs in the way that was needed. Either way you're going to want specific editor features.


[flagged]


Because it's not a tutorial on web scraping but a mix of what we suggest internally and what we've learned from our experience in this field over the years. For our codebase we prefer tabs over spaces, but I understand it's a subject of debates that have lasted decades :) Thanks for the point, though; I'll rephrase that topic in the guide.


It's odd to me that your apparent revenue stream is from scraping difficult-to-scrape sites and you're broadcasting the exact tactics you use to bypass anti-scraping systems. You're making your own life difficult by giving Cloudflare/PerimeterX/etc the information necessary to improve their tooling.

You also seem to advertise many of the sites/datasets you're scraping, which opens you up to litigation. Especially if they're employing anti-scraping tooling and you're brazenly bypassing those. It doesn't matter that it's legal in most jurisdictions of the world, you'll still have to handle cease and desists or potential lawsuits, which is a major cost and distraction.


> You also seem to advertise many of the sites/datasets you're scraping, which opens you up to litigation.

Is that a done deal now after the "LinkedIn vs HiQ" case: public information only holds copyright, but you can use the by-product as it suits you for new business?


The only clear outcome from the LinkedIn case, afaik, is that scraping of publicly accessible data is not a federal crime under the CFAA [1]. There are still plenty of other civil ways that someone can sue you to stop scraping their site: breach of contract, trespass to chattels, trademark infringement, etc. And they can do so over and over again until you're broke. OP is based in Italy anyway, so I have absolutely no clue what does and doesn't apply.

I'd like to point out that, while HiQ Labs "won" the case, that company is basically dead. The CEO and CTO are both working for other companies now. So I think the bigger takeaway is: don't get yourself sued while you're a tiny little startup.

[1] https://www.natlawreview.com/article/hiq-labs-v-linkedin


It's not a best practice, it's just a random thing your team does. It does make the team sound amateurish if it can't distinguish between meaningful best practices and just conventions the team happens to have.


I appreciate the inclusion of anti-bot software. As someone who builds plugins for enterprise apps (currently Airtable), I really want to build automated tests for my apps with Selenium, but keep getting foiled by anti-bot measures.

Can anyone recommend other resources for understanding anti-bot tech and their workarounds?


Anyone with a stake in bypassing anti-bot measures isn't going to share their tactics, since sharing them will lead to such workarounds being patched or mitigated, requiring them to research new bot detection workarounds.

Projects like cloudscraper[0] are often linked to point and say "look! they broke Cloudflare!", but CF and the rest of the industry have detections for tools like this, and instead of rolling out blocks for these tools, they give website owners tools like bot score[1] to manage their own risk level on a per-page basis.

0: https://github.com/VeNoMouS/cloudscraper

1: https://developers.cloudflare.com/bots/concepts/bot-score/


Yeah that's a shame but it makes sense.

Probably need to find an in with bot builders, if that's really a goal I have.


On the page about canvas fingerprinting[0], it only mentions Cloudflare. From what I can tell, reCaptcha v3 also uses canvas fingerprinting.[1]

[0] https://github.com/reanalytics-databoutique/webscraping-open...

[1] https://brianwjoe.com/2019/02/06/how-does-recaptcha-v3-work/


Thanks for sharing, I'll update the page soon.


I recently transitioned one of my scraping projects away from selenium to playwright and I must say that the developer experience is way better, in my opinion.

I also set it up to send me a Telegram message with the debug trace in case of errors in my pipeline, so that I have the entire scraping flow to analyze. That's pretty neat.
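In case it helps anyone replicating the notification part: the Telegram Bot API makes it a single HTTP call. A minimal sketch (the token, chat id and trace path are placeholders; how you produce the trace depends on your own pipeline):

    import requests

    TELEGRAM_TOKEN = "<your-bot-token>"  # placeholder
    CHAT_ID = "<your-chat-id>"           # placeholder

    def notify_failure(error: Exception, trace_path: str) -> None:
        # Send a short alert with a pointer to the saved trace for later inspection
        text = f"Scrape failed: {error}\nTrace saved at: {trace_path}"
        requests.post(
            f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage",
            data={"chat_id": CHAT_ID, "text": text},
            timeout=10,
        )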


Plug for https://commoncrawl.org/ if you need billions of pages but don't want to deal with scraping the web yourself.


It would thrill me if common crawl were updated with such frequency that it would allow new search engines to enter the market

I haven't dug into it enough to know if there's some technical reason it's not currently the case, or just lack of (interest|willpower)


I'd argue that one broad crawl every 2-3 months in addition to their updated-daily news crawl[0] should be good enough to make a rudimentary search engine.

[0] https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html


You're right about the "rudimentary" part, because I don't know how they do it but the major players have some not-kidding-around freshness:

https://www.google.com/search?hl=en&q=%22thrill%20me%22%20%2...

https://www.bing.com/search?q=%22thrill+me%22+%22common+craw... (and DDG similarly, because bing)

ed: I was curious if maybe HN publishes a sitemap, and it seems no. Then again, hnreplies knows about the HN API so maybe it's special-cased by the big crawlers https://github.com/ggerganov/hnreplies#hnreplies


I think it is as simple as an algorithm that rates how frequently new updates show up vs how many people visit a site.

If you rarely make updates to your site, Google crawls it infrequently and new things won't show up very quickly.

But if you do have frequent updates and lots of traffic, like any popular forum style site, you will get lots of crawler traffic. And I would bet the algo does the same for all endpoints on a site. So the "about us" page on a popular site probably ends up not being crawled nearly as much as the new threads page.


Are there subsets of Common Crawl anywhere for individual sites, e.g. YouTube?


You can query a subset for specific sites from Common Crawl itself.
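For example, the index server exposes a CDX-style API you can query per site; a rough sketch (the crawl id below is illustrative and will be stale, so pick a current one from index.commoncrawl.org):

    import json
    import requests

    CRAWL_ID = "CC-MAIN-2022-21"  # pick a current crawl id from index.commoncrawl.org

    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL_ID}-index",
        params={"url": "youtube.com/*", "output": "json", "limit": 20},
    )

    # One JSON record per line, each pointing at an offset inside a WARC archive
    records = [json.loads(line) for line in resp.text.splitlines() if line]
    for rec in records[:5]:
        print(rec["url"], rec["filename"])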


If you're the kind of person who wants "open data" (read as broadly as you like) and could get it in snapshots direct from the source without having to scrape, what would your ideal format be?

I know it's a very open ended question.


Alternatively, think how much traffic a site could save itself if it allowed subscribing to webhooks for new content, or had a functional sitemap.xml, to say nothing of the end-user experience of having fresh content.

I would speculate (based on anecdata) that a lot of the actual load placed upon sites is from the discovery phase -- what pages are there, and have any of them changed -- not so much "hit this one endpoint and unpack its data"
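A hedged sketch of what that saving looks like in practice: read the sitemap once and refetch only pages whose lastmod changed (assuming the site publishes lastmod at all; the URL below is hypothetical).

    import requests
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    root = ET.fromstring(requests.get(SITEMAP_URL).content)

    # Map of URL -> lastmod; compare against what you stored last run and fetch only the changes
    fresh = {
        url.findtext("sm:loc", namespaces=NS): url.findtext("sm:lastmod", namespaces=NS)
        for url in root.findall("sm:url", NS)
    }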


It's an interesting idea to have a sort of a "what's up?" endpoint that is just basic text and tells you what's fresh. At least for static websites, it's quite easy to do.


Probably RDF serialized as hextuples https://github.com/ontola/hextuples


Looks interesting. From that page I couldn't see what the 'graph' field relates to. Is it the identifier for a distinct named graph? It was blank in the examples.

Do you use it? What for?


It’s not well defined. I try to use it to represent where the data came from, I.e. dbpedia


Thanks for the question. I can speak to what we've encountered in these years of web scraping, and nothing beats an API with JSON, but I'm sure there are formats even friendlier to read.


The way that PushShift offers huge amounts of Reddit data in compressed JSON is very convenient. Check out files.pushshift.io


Good list, confused about the “tabs weighing less” bit. Isn’t that a preference left for the end-devs?

Another tip I’ve found is to check if the data is accessible on a mobile app and proxy it to see if there is a JSON API available.


Thanks for your reply; mobile data is a thing I need to add soon. Usually we check with Fiddler if there's an API underneath, but only for really problematic websites.


I was reading another thread about web scraping where someone mentioned CSS selectors being way quicker than XPath. I'm easy either way, but apart from a more powerful syntax, what other benefits are there?


In my experience, it's not that CSS selectors are "more powerful," but rather "more legible." XPath is for sure more powerful, but usually also has a lower signal-to-noise ratio:

    response.css("#the-id")
    # vs
    response.xpath("//*[@id='the-id']")
Thankfully, Scrapy (well, pedantically "parsel") allows mixing and matching, using the one which makes the most sense

    response.css(".someClass").xpath(".//*[starts-with(text(), 'Price')]")


CSS is nice because it's more readable than XPATH for longer queries, and is friendlier to newer programmers who didn't come up when XML was big.

XPATH is generally more powerful for really gnarly things and for backtracking, e.g. "show me the 3rd paragraph that's a sibling of the fourth div with id 'subhed' and contains the text 'starting'".


  > XPATH is generally more powerful...
That is a convincing argument if you can back it up with an XPATH expression.


Here’s an example of parsing some particularly annoying old school html. I’m not claiming it’s the _best_ way to do it, just that you can, and I’m not sure this one is doable with selectors. https://github.com/openstates/openstates-scrapers/blob/40246...


Well, the rest of their sentence summed it up pretty well; try and implement that example using CSS selectors

Hell, even "find id=subhead and _go up one element_" isn't possible in CSS because that's not a problem it was designed to solve
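For what it's worth, the gnarly query described upthread does collapse into one expression; a hedged sketch with lxml (the toy HTML is mine, and the exact predicate order depends on what "3rd" should mean):

    from lxml import html

    doc = html.fromstring(
        "<html><body>"
        '<div id="subhed">Subhead</div>'
        "<p>starting one</p><p>other</p><p>starting two</p><p>starting three</p>"
        "</body></html>"
    )

    # The 3rd <p> sibling of div id="subhed" that contains the text "starting"
    hits = doc.xpath('//div[@id="subhed"]/following-sibling::p[contains(., "starting")][3]')
    print(hits[0].text)  # -> starting three

    # And the "go up one element" case that CSS selectors can't express
    parent = doc.xpath('//*[@id="subhed"]/parent::*')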


I'm not sure about quicker. Doesn't Scrapy use elementpath, which converts a CSS query to an XPath under the hood, as there is no complete CSSOM available for Python? Likely because there is no modern standards-based Python DOM to operate on, doing it on an lxml tree is probably the best option. I find the main difference is that XPath can return an attribute value whereas CSS returns the node. You can use either from the terminal in my lib... https://github.com/byteface/domonic (as it uses elementpath like Scrapy)


Sorry, I meant cssselect... https://pypi.org/project/cssselect/ which converts to XPath.
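You can actually ask cssselect for the translation it produces, which is a nice way to compare the two syntaxes side by side (output shown as comments from memory, so treat it as approximate):

    from cssselect import GenericTranslator

    print(GenericTranslator().css_to_xpath("#the-id"))
    # roughly: descendant-or-self::*[@id = 'the-id']

    print(GenericTranslator().css_to_xpath("div.product > a"))
    # roughly: descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' product ')]/a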


In my experience, XPath selectors are easier to write but usually result in selectors that are less robust to DOM changes. It is possible to write reliable XPath selectors as well, but I often see XPath selectors breaking because of implicit assumptions about the DOM structure. I don't see this as often for CSS selectors since they encourage you to make more explicit assumptions.

This is in the context of test automation of modern web apps with a virtual DOM. I'm sure things might be different in other areas.


Having a large codebase like ours, we've found XPATH more readable, but I understand it's a personal feeling. We don't do high-frequency scraping, so the performance of CSS vs. XPATH was not a consideration. It's an interesting point I'd like to write more about, thanks for sharing.


It can be frustrating learning web scraping with Python when so many sites actively block scraping.


It's a sad reality that web scraping is quickly becoming accessible only to people who can afford the captcha/computing/dev resources.

Identifying scrapers is actually really easy, but it's not a binary decision. Anti-scraping systems usually keep a score compiled from a few measurements, so just applying some commonly known patches can improve your trust score significantly!

We recently published a blog series on all the things that can be done to avoid blocking [1]: request headers, proxies, TLS fingerprints, JS fingerprints, etc. But it's quite a bit of work to get there if you're new to web scraping - there's just so much information to get through, and it's growing every day.

1 - https://scrapfly.io/blog/how-to-scrape-without-getting-block...
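As one small, concrete step from that list (the header values below are illustrative, not magic), sending a coherent set of browser-like headers already gets past the most naive checks:

    import requests

    # A header set loosely mimicking a desktop Chrome; keep the values consistent with each other
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    session = requests.Session()
    session.headers.update(headers)
    response = session.get("https://example.com/")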


Have you noticed Selenium now opens Chrome in some kind of 'dev mode' that blocks access to cookies, so you have to faff around more? Makes you wonder if it's better to use pynput.


Any chance you could name a few of those sites? LinkedIn is the only example I know of.


The main perpetrator I've come across (in, I'll admit, my limited experience with scraping) was Google services. It seemed somewhat obvious to me when I initially played around with Beautiful Soup that the most valuable resources I could scrape were Google Search, followed by Gmail and Google Finance. I had no luck using BS with any of these services.

Largely this taught me to be creative with my data sources. For example, I built a virtual weather vane powered by a Raspberry Pi that would scrape my local airport's website to get wind direction data, then turn the vane via a servo to the correct direction. So my takeaway from this project was that scraping isn't as straightforward as one would think; there's more of an art to figuring out where to get the information you want.


Agreed. Most public websites are not trying to force visitors to use a certain web browser by selectively denying access. No public website should be the sole, exclusive source of its data, because if the data is public then by definition the data can be copied by any other website. As such, chances are the data can be found in multiple locations and at least one will not be trying to force the use of a certain web browser.

Am I correct that the examples listed here are (a) www.google.com, (b) mail.google.com and (c) www.google.com/finance/? I have no trouble extracting data from these examples.[FN1] I do not use a graphical web browser to make HTTP requests, nor do I use Python or BeautifulSoup. A cookie is required for mail.google.com, in lieu of a password, but the cookie can be saved and will work for years.

1. Of course, Google Web Search is crippled. Using a basic HTTP client, e.g., no cookies, Javascript, FLoC, etc., one cannot retrieve more than 250-300 results total. Searching "too fast" will draw a temporary IP block. This "search engine" is designed for advertising, not discovery. Advertisers compete for space at the top of the first page of results. Popular websites are prioritised, potentially making them even more popular. Websites that "rank"[FN2] too low in a search are not discoverable, as they have no value for advertising. An index of public websites and public data is treated as proprietary and secret. Google actively tries to prevent anyone from copying even a small portion of it.

2. Google makes it impossible to sort results by URL, date, or even the number of keyword/string hits in the page. Results are ordered according to a secret algorithm designed for advertising.


2. Or sort by <title>.


As the second sentence says, it's a cat and mouse game, so there's no incentive on either side of bot vs anti-bot to share information.


I'm sure no one here will share their secret sauce :)


Indeed. Personally I would want to give away more, but for now I'm mostly getting the red light here, for obvious reasons. What we could do better as a community, however, is share generic tooling.


For websites that require auth via Google Auth this is a non-starter. There's no way to bypass its bot detection.


Perhaps you shouldn't burn your hands on such targets anyway, even if it's technically possible; you get into murky water quickly. Requiring Google Auth means you need to log in, which makes it debatable whether or not that data is still public data.


Is there a FLOSS project that combines Scrapy scrapers and just makes the results publicly available?


This is the one I know about: https://morph.io/ and https://github.com/openaustralia/morph#readme (AGPLv3) -- they used to be at the intersection of "heroku for scrapers" and DoltHub (e.g. https://www.dolthub.com/repositories/dolthub/us-businesses/d...) since the scrapers would run but then make their data available as CSV or sqlite or whatever. But, when I just tried to load one of the morph.io scrapers, the page just said "creating new template" so I'm guessing they've gone the way of the ScraperWiki.com that preceded them: turns out, hosted compute for free isn't free


As someone who recently dealt with scraping sites behind cloudflare...I never want to scrape again


Honestly cloudflare isn't really a big deal in terms of scraping.


I think that if you don't want to invest a lot of time into learning web scraping and money into a pool of residential (or, even better, mobile) proxies, it's easy to quickly get good results with a web scraping API like https://scrapingfish.com. They have good blog posts, for example on how to scrape public Instagram profiles: https://scrapingfish.com/blog/scraping-instagram



