Show HN: Web scraping that just works with OpenFaaS with Puppeteer (openfaas.com)
145 points by alexellisuk on Oct 31, 2020 | 15 comments



I adore OpenFaaS; it gives you the best of 1) serverless, 2) containers, and 3) portability.

The OpenFaaS Function Store is a useful way to mix and match runtime components very easily! Example: use sentiment analysis to automatically mark new Issues with rants as bugs: https://github.com/openfaas/workshop/blob/master/lab5.md


Interesting article. I operate a somewhat similar service at https://headlesstesting.com where we provide a grid of headless browsers. You can connect to these via Puppeteer (and Playwright) to do various things discussed in the article: scraping, taking screenshots, generating PDFs, ...


OpenFaaS has been pretty easy to work with for anything we've thrown at it, ranging from web scraping to ML deployment. If anyone here has been on the fence, definitely give it a shot.

Back when we first set it up, a few of the defaults were utter garbage. Autoscaling, for example, ramped (thankfully slowly) up or down between MIN and MAX instances based purely on whether you were above or below a QPS threshold. But there aren't all that many features, so reading the whole manual and configuring it the way you want is a cinch.
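To make the autoscaling complaint concrete, here is a rough sketch of a threshold-only scaler like the one described. It is not OpenFaaS's actual code: the `ScalerConfig` fields, the `step` size, and the tick-based loop are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalerConfig:
    # All values here are hypothetical, chosen only for the example.
    min_replicas: int = 1
    max_replicas: int = 10
    qps_threshold: float = 5.0
    step: int = 4  # replicas added or removed per evaluation tick


def next_replicas(current: int, observed_qps: float, cfg: ScalerConfig) -> int:
    """Move one step up if QPS is above the threshold, otherwise one step down.

    Only the side of the threshold matters, not how far the load is from it.
    """
    if observed_qps > cfg.qps_threshold:
        target = current + cfg.step
    else:
        target = current - cfg.step
    # Clamp to the configured MIN/MAX bounds.
    return max(cfg.min_replicas, min(cfg.max_replicas, target))


def simulate(qps_series, cfg: ScalerConfig):
    """Replay a series of QPS readings, starting from min_replicas."""
    replicas = cfg.min_replicas
    history = []
    for qps in qps_series:
        replicas = next_replicas(replicas, qps, cfg)
        history.append(replicas)
    return history


if __name__ == "__main__":
    cfg = ScalerConfig()
    print(simulate([6, 6, 6, 1, 1, 1], cfg))
```

Note that 5.1 QPS and 500 QPS produce exactly the same scaling decision here, which is why a scheme like this tends to bounce between the bounds instead of settling on a replica count that matches the load.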


OpenFaaS is great... but it's quite slow compared to something like Nuclio: https://github.com/nuclio/nuclio


Thanks for that pointer. Nuclio wasn't on my radar.


Could you give a TL;DR version of how we can use OpenFaaS with Google Cloud Functions? Or is it meant to be deployed to GKE?


OpenFaaS is more-or-less a competitor to Google Cloud Functions or AWS Lambda. None is really quite a subset of the other in terms of features, so you might gain some benefit by using multiple FaaS offerings, but they all occupy the same niche.

You can deploy OpenFaaS on any Kubernetes offering, Google Cloud Run, Docker Swarm, etc. It runs on your favorite Docker substrate without much hassle.


I've been working on creating something like this, but using Kubeless instead. It's been pretty great so far.

I wish I could use OpenFaaS as a Serverless Framework provider. Being able to create and deploy your workloads anywhere you want (cloud, on-premises, self-hosted, etc.) is really valuable, especially when you're doing web scraping.


As a newbie, I see a lot of articles about scraping. Why is that so interesting?


From my experience, any business beyond a certain size ends up requiring scraping to get access to some data that is not published cleanly or available programmatically. It is also one of the common gateways to programming, and hence gets written about more, I guess?
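As a rough illustration of what "data that is not available programmatically" looks like in practice, here is a minimal sketch that pulls one value out of a page's HTML using only the Python standard library. The `PAGE` markup and the `balance` id are made up for the example. It only handles static HTML; pages that render their content with JavaScript need a real browser, which is where headless Chrome and Puppeteer come in.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextByIdParser(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id: str):
        super().__init__()
        self.target_id = target_id
        self._depth = 0      # > 0 while we are inside the target element
        self._done = False   # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self._done or tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            self._done = True

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def extract_text_by_id(html: str, target_id: str):
    """Return the stripped text of the element with the given id, or None."""
    parser = TextByIdParser(target_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


# Hypothetical page fragment standing in for a site with no API.
PAGE = ('<html><body><div class="card">'
        '<span id="balance">$ <b>12.50</b></span>'
        '</div></body></html>')

if __name__ == "__main__":
    print(extract_text_by_id(PAGE, "balance"))
```

In a real scraper the HTML would come from an HTTP response or from a headless browser's rendered DOM rather than a string literal, and the selector would break whenever the site changes its markup, which is a big part of why this topic generates so many articles.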


"I want a piece of information from a page programmatically, but there's no API."

Common ask, especially in anything dealing with secondary markets (card balances, information on sales, shoe drops, etc.).

An API will be much faster, but there's a place for this.

This is also common when talking to IVR systems (e.g. I want a card balance, so I have some code call the phone number on the card, walk through the phone tree, and use voice recognition to get the number).


APIs may also present the data differently, not have all the data, require payment, or have much lower rate limits than a browser solution.

(Disclosure: I work at a SaaS that does both, and I have done some for personal projects.)


Great tutorial. OpenFaaS is an amazing piece of software, thanks for putting in so much time and effort, Alex!


For folks looking for similar solutions: browserless.io is worth a look too.

Disclaimer: No affiliation.


Is there a solution that makes logging in easy for scraping behind paywalls?



