Trafilatura: Python tool to gather text on the Web (github.com/adbar)
134 points by kevin_hu on Aug 14, 2023 | hide | past | favorite | 22 comments



This tool is so great for robustly dealing with content in old and poorly formatted HTML. There are a lot of similar tools for extracting "the main text" from free-form HTML, but this was the most reliable in my experience, especially when dealing with web archives containing hand-written HTML back to the 1990s, working with non-English languages, etc.


Author here, nice to see the package on HN's front page this morning and thanks for the kind words! Just created an account to participate in the discussion, I'll try to answer your questions.


I’ve been using this package and like it a lot.

One problem I’d like to find a solution for is how to get past cookie pop ups when scraping a website. I’ve not found a satisfactory packaged solution for this. Clearly a tough problem in general but wondered if people have found good libs to help with this. I’ve heard of solutions involving playwright etc.


Thanks! Here is what I put together in the docs, you could basically preprocess/render/filter the webpages with the software of your choice and then pass the result to trafilatura: https://trafilatura.readthedocs.io/en/latest/troubleshooting...
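
For instance, with Playwright the glue could look roughly like this (untested sketch; the consent-button selector is just a placeholder and differs per site):

    # Render the page, try to dismiss a cookie banner, then hand the HTML to trafilatura.
    from playwright.sync_api import sync_playwright
    import trafilatura

    def extract_rendered(url):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            try:
                # placeholder selector for a consent button
                page.click("button#accept-cookies", timeout=2000)
            except Exception:
                pass  # no banner found, carry on
            html = page.content()
            browser.close()
        return trafilatura.extract(html, url=url)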


Cool tool, I used it for a scraping project and it performed quite well for extracting clean text and the date.


I wish there was a web service that used this tool to scrape nicely-formatted plain text from any website, then archive it and serve it as a super basic web reader.


You sort of, kind of, maybe just asked for roughly what RSS (Really Simple Syndication) provides...although your wish is more of a "pull", while RSS is more of a "push" in content access/distribution. :-) Don't get me wrong, I'm in agreement with you. I wish every website, web app, well, pretty much everything digital had an automated RSS feed available to consume and subscribe to!


With RSS you are at the mercy of the server, though. The content creator may only syndicate an excerpt of the whole article, remove pictures or formatting, yada yada. But yes the Web would be so much nicer if more websites provided at least some form of content syndication...


Agreed, one would definitely be at the mercy of the author/content creator... but I often feel like someone who is willing to offer an RSS feed would probably also make their content easy to consume even if one needs to actually visit the website. Of course that is a very broad generalization I'm making. But you're certainly not wrong.


ArchiveBox (https://archivebox.io/) will create a local dump of any site in a multitude of formats, including raw HTML, a printed PDF, and extracted body text. It also has an option to ask the Internet Archive to trigger a scrape of the page.


Not sure how it fares nowadays, but I used to employ Mercury Reader/API for this, now called Postlight Reader[1]. While not perfect, I found it to work for most daily reading needs.

[1]: https://reader.postlight.com/


Concerning tooling I'd say you have two different worlds, JavaScript and Python, each with a series of tools to tackle such tasks. It's not easy to compare them directly because of varying software environments and I haven't had a chance to test JS tools thoroughly.

For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.

[1]: https://github.com/mozilla/readability


Sounds trivial to implement using this library with a bit of glue code for the web bits:

https://archiw.fly.dev/
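
Something like this, plus storage for the archiving part (untested sketch; the Flask endpoint name and error handling are arbitrary choices):

    # Tiny Flask endpoint: fetch a URL, run trafilatura, return plain text.
    from flask import Flask, request, abort
    import trafilatura

    app = Flask(__name__)

    @app.route("/read")
    def read():
        url = request.args.get("url")
        if not url:
            abort(400)
        downloaded = trafilatura.fetch_url(url)
        if downloaded is None:
            abort(502)
        text = trafilatura.extract(downloaded, url=url)
        return text or "", {"Content-Type": "text/plain; charset=utf-8"}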


Check out the nghota.com API. It can pull the main text out of most non-ecommerce web pages and return it to you as JSON.


In general I'd be curious to try this, but your homepage is not very convincing.

The "demo" doesn't look like typing, it's a fade right, and it's painfully slow. And then, there's no library, it's just 'import requests', so even the demo is extra long. (Why not show curl then?)

Also, are there any benchmarks? Why should I take the time to evaluate this myself against existing open-source tools? It seems like it should be your responsibility, not mine, to spend the time on a detailed comparison and evaluation, presented in a way that feels open and trustworthy.

I respect what you are doing and share this feedback from the heart.


There’s been a few such things over the years. I even built one for iOS/iPad that’s still in the store. I found that doing the parsing client side is preferable because so many sites have paywalls and render some of their content with JS. I never did much with the app because it’s hard to monetize, but I maintain it occasionally.


What is the gap between this and beautiful soup?


This tool can extract data in a structured format from virtually any website, with any HTML structure.

With Beautiful Soup, you'd need to explicitly specify where each piece of data lives by referencing HTML tags, ids, classes, etc., for each website you want to process.
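
To make that concrete, roughly (the selectors are invented; every site would need its own):

    from bs4 import BeautifulSoup
    import trafilatura

    html = trafilatura.fetch_url("https://example.com/some-article")  # may be None on failure

    # Beautiful Soup: you encode the page structure yourself, per site.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.article-title")
    body = soup.select_one("div.article-body")

    # trafilatura: generic heuristics, no selectors needed.
    text = trafilatura.extract(html)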


The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features

Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
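
For example, the sitemap part alone comes down to a helper call (rough sketch; the domain is a placeholder):

    from trafilatura import sitemaps

    # Discover and parse sitemaps starting from the homepage.
    urls = sitemaps.sitemap_search("https://example.com")
    print(len(urls), "URLs found via sitemaps")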


Maybe bs4 + newspaper3k rolled into one? But still, what's the gap?


Regarding content extraction, it's more accurate than newspaper3k (especially for languages other than English) and it extracts more information: metadata, text, and comments. It works out of the box in most cases, so there's no need to write a dedicated scraper for a given website, which saves time. If you care about 2-3 websites and are willing to write and maintain scraping scripts, then bs4/lxml/whatever is also fine.

It also features functions and a command-line interface to collect data on your own (say, finding recent news using feeds). So it's not merely about text extraction in the end but also text discovery.
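
A rough sketch of that discovery side (the domain is a placeholder):

    from trafilatura import feeds, fetch_url, extract

    # Find feeds on a homepage, then download and extract the linked articles.
    article_urls = feeds.find_feed_urls("https://www.example-news-site.com")
    for url in article_urls[:10]:
        downloaded = fetch_url(url)
        if downloaded:
            print(extract(downloaded, url=url))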


Has anyone already used this package to build a web-article-to-Markdown downloader?



