
I am running ScrapeNinja, a web scraping API: https://scrapeninja.net. 10K+ subscribers.

It is a (rather messy) node.js codebase. Two rendering engines, including a hacked Puppeteer package with stealth mode for a better success rate. A big set of proxy providers under the hood. Bootstrapped.


What does "stealth mode" mean?

Quite curious: I have been scraping some websites for my girlfriend with node.js/puppeteer and putting the content into an .epub file (she likes to read on her e-reader), and bypassing some anti-scraping techniques can be quite annoying.
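
"Stealth mode" in this context typically means patching the fingerprints that betray headless Chrome (navigator.webdriver, missing plugins, inconsistent worker properties, and so on). A minimal sketch using puppeteer-extra and its stealth plugin (the same plugin the test results further down report as puppeteerExtraStealthUsed); the target URL is a placeholder:

    // npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');

    puppeteer.use(StealthPlugin()); // patches navigator.webdriver & friends

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto('https://example.com', { waitUntil: 'networkidle2' });
      const html = await page.content(); // raw HTML, ready for epub conversion
      await browser.close();
      console.log(html.length);
    })();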


Small piece of feedback: the main text "Smart Web Scraping API" looks pretty off-center to me (latest Chrome on macOS, 4K screen).

I use ClickHouse to store close to 1TB of API analytics data (which would be 10TB in MongoDB; ClickHouse has insane compression), and it's a wonderful and stable SQL-first alternative to DuckDB - which is a very exciting piece of software, but is indeed too young to embed into boring production. The last time I checked the DuckDB npm package, it used callbacks instead of promises/await.
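
For anyone who wants to sanity-check the compression claim on their own data: ClickHouse exposes compressed and uncompressed byte counts in its system.parts system table. A minimal sketch (the api_analytics table name is hypothetical):

    -- Compression ratio per table from ClickHouse's system.parts metadata
    SELECT
        table,
        formatReadableSize(sum(data_compressed_bytes))   AS compressed,
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
        round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
    FROM system.parts
    WHERE active AND table = 'api_analytics' -- hypothetical table name
    GROUP BY table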


I can understand how the older callback API for node.js might form a negative impression, but it's really not indicative of the maturity of the core DB engine at all. And remember: the vast majority of users use the Python API. Even better news: as of a couple of months ago, there is a package (which I wrote at MotherDuck and we have open sourced) that provides typed promise wrappers for the DuckDB API: https://www.npmjs.com/package/duckdb-async. It is an independent npm package for now, but was developed in close coordination with the DuckDB core team.
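
A minimal usage sketch of duckdb-async, assuming its documented promise API (Database.create / db.all):

    // npm install duckdb-async
    const { Database } = require('duckdb-async');

    async function main() {
      // in-memory DuckDB instance; pass a file path for a persistent DB
      const db = await Database.create(':memory:');
      const rows = await db.all('SELECT 42 AS answer');
      console.log(rows); // [ { answer: 42 } ]
      await db.close();
    }

    main();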


Interesting test suite, thanks! I have tested scrapeninja.net via https://scrapeninja.net/scraper-sandbox and I got:

    {
      "puppeteerEvaluationScript": "OK",
      "webdriverPresent": "OK",
      "connectionRTT": "OK",
      "refMatch": "OK",
      "overrideTest": "OK",
      "overflowTest": "OK",
      "puppeteerExtraStealthUsed": "OK",
      "inconsistentWebWorkerNavigatorPropery": "OK",
      "inconsistentServiceWorkerNavigatorPropery": "OK"
    }

and the IP range of the "us" geo proxy gives is_abuse: true. I consider this okayish though, given that this is the default proxy pool.


The scrapeninja.net /scrape-js endpoint scrapes G2 company pages without big trouble (with "us"/"eu" proxy geo in their online sandbox: https://scrapeninja.net/scraper-sandbox). They also have /scrape, which is much faster because it does not bootstrap a real browser, and it bypasses the Cloudflare TLS fingerprint check: https://pixeljets.com/blog/bypass-cloudflare/
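
The calling convention for this kind of scraping API is typically a single POST. A hypothetical sketch with Node 18+'s built-in fetch - the endpoint URL, body fields (url, geo), and auth header are illustrative assumptions, not ScrapeNinja's documented API:

    (async () => {
      const res = await fetch('https://scraping-api.example.com/scrape', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'X-Api-Key': process.env.SCRAPE_API_KEY, // assumed auth scheme
        },
        body: JSON.stringify({
          url: 'https://www.g2.com/products/some-product', // illustrative target
          geo: 'us', // proxy geo, as in the sandbox above
        }),
      });
      const data = await res.json();
      console.log(data);
    })();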


If you are a minimalist and are using VS Code, try https://marketplace.visualstudio.com/items?itemName=humao.re... which is a pure-text syntax to describe API requests and execute them right from the editor window. I now have an api.http text file in every API-first project I am building, and I love it.
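
A minimal sketch of what such an api.http file can look like (### separates requests, per the REST Client extension's conventions; the URL is a placeholder):

    ### Get a list of objects
    GET https://api.example.com/objects
    Accept: application/json

    ### Create an object
    POST https://api.example.com/objects
    Content-Type: application/json

    { "name": "demo" }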


I like this one because it's easy to keep API workflows with my projects. The scripting ability here is phenomenal. However, it's only really useful if you code in VS Code.


JetBrains also provides a similar, albeit slightly incompatible, syntax for the same thing.

In the end, I think hurl [0] is nicer, because it's open source and it's a CLI tool (and VS Code also has a syntax-highlighting plugin for it), making it editor-independent.

[0]: https://github.com/Orange-OpenSource/hurl
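
A minimal hurl file sketch, assuming recent hurl syntax (the URL and assertion are placeholders):

    # run with: hurl --test health.hurl
    GET https://api.example.com/health
    HTTP 200
    [Asserts]
    jsonpath "$.status" == "ok"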


Do you have a single api.http file, or do you use multiple {route}.http files?


Not OP, but you can store all your routes in one file or split them across multiple; it's up to you.

Personally, what I do is script out full API workflows in different files. So one file might log in, then POST to add an object, then GET that object from an endpoint, then PATCH it, then GET it again (sketched below).

Another workflow might log in, upload an image, get that image, etc. For me, the scripting is what makes this appealing.

But you could set up one file that documents and tests all your endpoints, similar to Postman.
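
A sketch of such a chained workflow in REST Client syntax, using its documented named-request variables ({{login.response.body.$.token}} is a JSONPath into a previous response); URLs and field names are placeholders:

    @baseUrl = https://api.example.com

    ### Log in and name the request so later requests can reference it
    # @name login
    POST {{baseUrl}}/login
    Content-Type: application/json

    { "user": "demo", "password": "secret" }

    ### Create an object using the captured token
    # @name createObject
    POST {{baseUrl}}/objects
    Authorization: Bearer {{login.response.body.$.token}}
    Content-Type: application/json

    { "name": "demo-object" }

    ### Fetch the object back
    GET {{baseUrl}}/objects/{{createObject.response.body.$.id}}
    Authorization: Bearer {{login.response.body.$.token}}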


For me, it is always a pain to write and test cheerio code unless I did it the previous week. The syntax of cheerio is somewhat similar to jQuery, but it is still node.js, not a "real" DOM.

I suffered every time I googled "cheerio quick examples", so I built a cheerio sandbox to quickly test cheerio syntax against various test inputs. It is already helpful for me, and I think it saves me up to 15-30 minutes on every simple scraper I write, just because I have working selector samples at hand and can quickly test my new selectors.
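
For reference, a minimal cheerio sketch (its jQuery-like API runs over a parsed HTML string in node.js, not a live browser DOM; the HTML input here is made up):

    // npm install cheerio
    const cheerio = require('cheerio');

    const html = '<ul><li class="item">a</li><li class="item">b</li></ul>';
    const $ = cheerio.load(html); // parse the string into a queryable tree

    const items = $('li.item')
      .map((i, el) => $(el).text())
      .get(); // .get() unwraps the cheerio collection into a plain array
    console.log(items); // [ 'a', 'b' ]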


I have built an online tool which does just one thing: traces a handwritten signature.


Thank you for what you guys do. The Altinity blog and videos are an outstanding source of practical, in-depth knowledge on the subject, much needed for ClickHouse recognition.


You are most welcome. The webinars and blog articles are incredibly fun to work on.


https://github.com/meilisearch/MeiliSearch has been getting a lot of traction recently. There are also Sphinx and its fork https://manticoresearch.com/ - very lightweight and fast.


I immediately thought of Sphinx when I saw MeiliSearch... it's uncanny how the use case and implementation semantics haven't changed much in 15 years.

The beauty of pointing it at your MySQL tables and getting full-text-search-via-API on the other side was quite nice.


I mentioned Pinot and Druid briefly in a 2018 writeup: https://pixeljets.com/blog/clickhouse-as-a-replacement-for-e... (see "Compete with Pinot and Druid").

