I have an upcoming side project that will lean heavily on scraping and screenshotting, and I plan to use either headless Chrome or SlimerJS. It will be a Rails app queueing up lots of Sidekiq jobs, each of which will (presumably) call a headless browser instance, probably running in a container. Ideally I'll run this on Heroku or, even better, on a DO droplet running Dokku.
I found browserless.io and the supporting GitHub project over the weekend, and it looks promising.
Who else is doing this? Is the architecture I just touched on the right way to go about doing this? Or should I have each sidekiq job spin up its own browser container for the scraping job?
Also, Chrome vs. Firefox won't really matter in this instance, but reliability, performance, and memory use will. Any recommendations?
I built https://github.com/danielwestendorf/pdf-bot-pro a few days ago, a scalable 1-click Heroku deploy rewrite of pdf-bot. It's fully functional, except I built the background queuing on Faktory, and there is no working Faktory add-on available for Heroku yet (and thus, no real promotion of the project yet). You could set up a Faktory server elsewhere and define the URL as an ENV var and it would work today.
Nice, thanks for the link. So if I'm understanding correctly, your app essentially runs as a standalone API: it receives PDF requests, queues them, processes them, and then hits a specific webhook URL on success so other systems can query the PDF tool for the resulting PDF.
This is approximately how I was thinking of setting things up as well, so thanks for this!
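For anyone following along, here's a minimal sketch of that last "hit a webhook on success" step in plain Ruby. The payload fields (job_id, status, result_url) and the example URLs are made up for illustration; this is not pdf-bot-pro's actual format, just one way the flow could look.

```ruby
require "json"
require "net/http"
require "uri"

# Build the POST that tells the calling system a job finished.
# The payload shape (job_id, status, result_url) is an assumption for
# illustration, not pdf-bot-pro's actual format.
def build_webhook_request(webhook_url, job_id:, result_url:)
  uri = URI.parse(webhook_url)
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate({ job_id: job_id, status: "done", result_url: result_url })
  [uri, request]
end

# Send the notification. A background job would call this once the render
# succeeds; raising on a non-2xx response lets the queue's retry logic
# redeliver later.
def notify(webhook_url, job_id:, result_url:)
  uri, request = build_webhook_request(webhook_url, job_id: job_id, result_url: result_url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  raise "webhook returned #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response
end
```

The receiving system then fetches the file from result_url whenever it is ready, so the scraping service never has to know anything about its callers beyond the webhook URL.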
"With Firefox 56 and more, SlimerJS can be truly headless by adding the --headless option on the command line"
So that could be an interesting thing to play with. I've thought about just having my Sidekiq jobs spin up a Docker container with the browser in it and then limiting the number of jobs running at once to stay within resource constraints, and now seeing your comment I think I might just give it a shot.
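A rough sketch of what that container-per-job idea could look like, in plain Ruby so it runs without Sidekiq installed. The image name example/headless-chrome, the /out mount, and the memory/CPU values are placeholders I made up, and it assumes the image's entrypoint is the Chrome binary. The docker run flags (--rm, --memory, --cpus, --shm-size) and the Chrome flags (--headless, --disable-gpu, --screenshot) are the standard ones.

```ruby
require "open3"

# Hypothetical image whose entrypoint is assumed to be the Chrome binary.
BROWSER_IMAGE = "example/headless-chrome"

# Build the argv for a throwaway container that screenshots one URL.
# The resource caps are illustrative values, not tuned recommendations.
def screenshot_command(url, out_dir:, memory: "512m", cpus: "1")
  # Only accept http(s) URLs so a value like "--some-flag" can't be
  # smuggled in as a browser option.
  raise ArgumentError, "not an http(s) URL: #{url}" unless url.match?(%r{\Ahttps?://})

  [
    "docker", "run", "--rm",
    "--memory", memory,        # hard memory cap for this job's browser
    "--cpus", cpus,            # CPU cap
    "--shm-size", "256m",      # Chrome tends to need more /dev/shm than Docker's default
    "-v", "#{out_dir}:/out",   # where the screenshot lands on the host
    BROWSER_IMAGE,
    "--headless", "--disable-gpu",
    "--screenshot=/out/shot.png",
    url
  ]
end

# What a job's perform method might call. Passing an array to Open3 avoids
# the shell, so the URL is never interpreted as shell syntax.
def take_screenshot(url, out_dir:)
  _stdout, stderr, status = Open3.capture3(*screenshot_command(url, out_dir: out_dir))
  raise "screenshot failed: #{stderr}" unless status.success?
  File.join(out_dir, "shot.png")
end
```

Sidekiq's concurrency setting (-c on the command line, or :concurrency: in sidekiq.yml) then caps how many of these containers run at once per worker process, so worst-case memory is roughly concurrency times the per-container cap.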