Show HN: Run Puppeteer on AWS Lambda

sriram_iyengar · on Dec 6, 2018

What is the advantage of running automated tests on Lambda? Typically, automated tests are long-running processes, so Lambda execution may not be suitable. Lambda's cold start time is another challenge. What is a good practice/model for running suites? One test per Lambda, or one spec per Lambda? I'm still inclined to have EC2 instances created and destroyed via DevOps tools like Terraform to run automation. Thoughts please.

alixaxel · on Dec 6, 2018

Well, puppeteer can be used for more than test suites (think screenshots, PDF rendering, proxified APIs, ...).

But for running long automated tests, I'd probably look into alternatives like Fargate, where the billing model is per-second with a one minute minimum. Terraform + EC2 spot instances works too, obviously. :)

sriram_iyengar · on Dec 6, 2018

Thanks. The example you’ve mentioned will be useful.

tnolet · on Dec 6, 2018

My SaaS uses Puppeteer for website monitoring. Shameless plug: https://checklyhq.com/product/transaction-monitoring/

It's been quite difficult to get it right, with Puppeteer being young etc. but it chugs along nicely now at around 10k-15k runs per day spread over 4 regions.

defied · on Dec 6, 2018

There are quite a few disadvantages to using AWS Lambda for automated testing (test time is capped at 15 min, cold start wait time, ...).

The advantage is that you can run a lot of tests concurrently at a relatively cheap cost.

The company [1] I work for offers VMs that are created/destroyed automatically after each test. There's no cold start, and no time limit. Plus you can choose to run headless like Puppeteer or test in an actual OS like Win/Mac.

[1] https://testingbot.com

asien · on Dec 6, 2018

Lambda has a maximum running time of 15 min. Even with a cold start, it won't take more than a minute for Chrome to be up and running.

Meaning you'll get at least 10 to 14 minutes of headless testing.

As for recommendations on how to do so: unless your testing is super long, you should just run the entire thing in one function. Otherwise, decouple your testing based on the various modules of your app (i.e. one module per function).
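
A minimal sketch of that kind of split, assuming each spec comes with a rough duration estimate. The names `Spec` and `packIntoBatches` are hypothetical, and the 14-minute budget is just the 15 min cap minus up to a minute of cold start; none of this is part of Lambda or Puppeteer.

```typescript
// Hypothetical shape: one test spec plus a rough duration estimate in seconds.
interface Spec {
  name: string;
  estimatedSeconds: number;
}

// Greedy first-fit-decreasing packing: take specs longest-first and put each
// one into the first batch that still has room under the time budget.
// Each returned batch is meant to run inside a single function invocation.
function packIntoBatches(specs: Spec[], budgetSeconds: number): Spec[][] {
  const batches: { total: number; specs: Spec[] }[] = [];
  const sorted = [...specs].sort((a, b) => b.estimatedSeconds - a.estimatedSeconds);

  for (const spec of sorted) {
    if (spec.estimatedSeconds > budgetSeconds) {
      // A spec that cannot fit in one invocation needs a different runner.
      throw new Error(`${spec.name} exceeds the per-invocation budget`);
    }
    const target = batches.find((b) => b.total + spec.estimatedSeconds <= budgetSeconds);
    if (target) {
      target.total += spec.estimatedSeconds;
      target.specs.push(spec);
    } else {
      batches.push({ total: spec.estimatedSeconds, specs: [spec] });
    }
  }
  return batches.map((b) => b.specs);
}

// Assumed budget: 14 minutes of test time per invocation.
const exampleBatches = packIntoBatches(
  [
    { name: "login", estimatedSeconds: 300 },
    { name: "checkout", estimatedSeconds: 600 },
    { name: "search", estimatedSeconds: 200 },
    { name: "profile", estimatedSeconds: 400 },
  ],
  14 * 60,
);
console.log(exampleBatches.map((batch) => batch.map((s) => s.name)));
```

The packing only helps when specs are independent; a suite whose scenarios must run in sequence cannot be split this way.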

sriram_iyengar · on Dec 6, 2018

Understood on the 15 min limit and the <1 min cold start. The question was more about test suites with hundreds of tests for really large products, which you cannot break up; the scenarios won't make sense on their own. The feedback from these tests will not be done in 15 minutes.

whoisjuan · on Dec 6, 2018

I think it is more cost-efficient and scalable than having an EC2 instance created and destroyed every time you run a test.

hayd · on Dec 6, 2018

parallelization

tnolet · on Dec 6, 2018

I run a ton of Puppeteer jobs (300k in the last month), currently on EC2 and DigitalOcean VMs, mostly due to the subtle difficulties of running Puppeteer on Lambda.

Will certainly have a look at this project and contribute where possible.

My main concerns are not so much cold start time, as for my use case this is not really a huge issue, but mainly the performance of Chrome on AWS Lambda boxes. The rendering, navigation etc. needs to be snappy.

thesandlord · on Dec 6, 2018

Google App Engine and Google Cloud Functions got native support for Puppeteer a few months ago as well. Let me know what you think if you try it out.

https://news.ycombinator.com/item?id=17795626

(I work for Google Cloud)

alixaxel · on Dec 6, 2018

The performance of puppeteer is super bad on GCF (you can read more about it here https://github.com/GoogleChrome/puppeteer/issues/3120). It would actually be great to have someone really improve this situation instead of dismissing it as a weird IO problem.

thesandlord · on Dec 6, 2018

Did some research internally, this is being tracked but still no root cause AFAIK :(

tnolet · on Dec 7, 2018

Would love to use GCF but the performance is terrible (as mentioned) and I need more geographic locations than GCF offers.

alixaxel · on Dec 6, 2018

I also run hundreds of thousands of puppeteer sessions every month, all on Lambda and so far I'm pretty happy with it, from scalability itself to session performance.

Granted, there are some issues with rendering (fonts, emojis and whatnot) but meanwhile there are solutions available that could be explored.

Feel free to try it out and share your specific challenges on GitHub, I'll do my best to come up with solutions for them.

russian_bot · on Dec 7, 2018

hi,

out of curiosity, what is it that you do that demands so many sessions? Just webscraping?

dschep · on Dec 7, 2018

Here's another alternative: a Lambda layer containing headless Chrome, with a Puppeteer example: https://github.com/RafalWilinski/serverless-puppeteer-layers

nailer · on Dec 6, 2018

This is fantastic.

- I'm just getting started with Lambda so pardon if this is ignorant, but what's the cold start time of Chromium? Or can you warm start it somehow?

- Since scraping often depends on state, wouldn't you hit a timeout doing longer scraping jobs?

alixaxel · on Dec 6, 2018

Thanks!

So usually with Lambda, you want your jobs to be as atomic/quick as possible, as Lambda is stateless and has a maximum duration of 15 minutes.

As for the warm up times, the decompression of Chromium with Brotli takes about 700ms on a 1.5GB Lambda (this is faster than Gzip/Zip). Launching Chromium itself and opening a new tab takes another 400ms or so. If you keep your Lambdas warm (by registering a scheduled CloudWatch event every 15 minutes for instance) your startup time will effectively be those 400ms.
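
A rough sketch of the keep-warm idea, assuming the scheduled rule delivers the standard scheduled-event payload (`source: "aws.events"`, `detail-type: "Scheduled Event"`). `isWarmUpPing`, `runJob`, and `handler` are hypothetical names, and the real work (launching Chromium and so on) is stubbed out; in a real deployment `handler` would be exported as the function's entry point.

```typescript
// Only the event fields this sketch inspects; everything else is passed through.
interface IncomingEvent {
  source?: string;
  "detail-type"?: string;
  [key: string]: unknown;
}

// True for events produced by a scheduled CloudWatch Events rule.
function isWarmUpPing(event: IncomingEvent): boolean {
  return event.source === "aws.events" && event["detail-type"] === "Scheduled Event";
}

// Placeholder for the real job; a real handler would drive Chromium here.
async function runJob(event: IncomingEvent): Promise<string> {
  return `processed ${String(event.url ?? "unknown")}`;
}

// Warm-up pings return immediately so they only keep the container alive.
async function handler(event: IncomingEvent): Promise<string> {
  if (isWarmUpPing(event)) {
    return "warm";
  }
  return runJob(event);
}

// Approximate timings quoted above, in milliseconds.
const DECOMPRESS_MS = 700; // Brotli decompression of Chromium, paid on cold start
const LAUNCH_MS = 400; // launching Chromium and opening a tab, paid every time
const coldStartupMs = DECOMPRESS_MS + LAUNCH_MS;
const warmStartupMs = LAUNCH_MS;
console.log({ coldStartupMs, warmStartupMs });
```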

jpambrun · on Dec 6, 2018

If you keep your Lambda warm, shouldn't you just use something like browserless (https://github.com/joelgriffith/browserless)?

e1g · on Dec 6, 2018

I run browserless Docker container on-prem and it works very well for us. Fire&forget, +1.

Touche · on Dec 6, 2018

Presumably you can also keep Chromium running and keep a pool of tabs to reuse. Doing this, it would be pretty fast, I imagine.

Depending on your use case you can also disable security and open as many iframes as you want in a single tab. Not sure how this compares to multiple tabs though.

Of course you'll run into cold start again when lambda has to scale.
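
One way to sketch the "keep Chromium running" part is to cache the browser at module scope so warm invocations reuse it. The `launch` argument stands in for something like `puppeteer.launch`; `BrowserLike` and `createBrowserCache` are hypothetical names, and the fake launcher exists only so the sketch runs without Chromium.

```typescript
// Minimal surface this sketch needs from a browser object (an assumption).
interface BrowserLike {
  close(): Promise<void>;
}

// Wraps a launcher so the browser is started once and then reused by every
// later call made in the same (warm) container.
function createBrowserCache<B extends BrowserLike>(launch: () => Promise<B>) {
  let cached: Promise<B> | null = null;

  return {
    get(): Promise<B> {
      if (cached !== null) {
        return cached;
      }
      // Cache the promise itself so concurrent callers share one launch.
      const pending = launch().catch((err: unknown): never => {
        cached = null; // allow a retry on the next call
        throw err;
      });
      cached = pending;
      return pending;
    },
    // Closes the cached browser, e.g. to flush session data between jobs.
    async reset(): Promise<void> {
      const current = cached;
      cached = null;
      if (current !== null) {
        const browser = await current;
        await browser.close();
      }
    },
  };
}

// Fake launcher that counts how many times a "browser" is started.
let launches = 0;
const browserCache = createBrowserCache(async () => {
  launches += 1;
  return { id: launches, close: async () => {} };
});
```

Calling `browserCache.get()` repeatedly resolves to the same object until `reset()` is called; a new container still pays the full launch cost.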

alixaxel · on Dec 6, 2018

Yes, this is a good solution but only if you don't have any sort of session data that you want flushed out after it runs. One could argue you could use browser contexts (incognito tabs) to have ephemeral sessions, but unfortunately that feature doesn't work in --single-process mode (which AWS Lambda requires).

pouta · on Dec 6, 2018

I'm using incognito mode to parse some pages that, for some reason, I can't parse using the normal context.

I have been considering moving my pool of chromium workers to lambda functions so we can avoid api slowdowns due to a high number of parsings at the same time.

Are there any other side effects of running chromium headless in a lambda function?