so I have an upcoming side project that will heavily leverage scraping/screensho...

deedubaya · on March 6, 2018

I built https://github.com/danielwestendorf/pdf-bot-pro a few days ago, a scalable 1-click heroku deploy rewrite of pdf-bot. It's fully functional, except I built the background queuing on Faktory, and there is no working Faktory add-on available for heroku yet (and thus, no real promotion of the project yet). You could set up a faktory server elsewhere and define the URL as an ENV var and it would work today.

dchuk · on March 6, 2018

Nice, thanks for the link. So if I'm understanding correctly, your app essentially runs as a standalone api: it receives pdf requests, queues them, processes them, then hits a specific webhook url on success so other systems can query the pdf tool to get the resulting pdf.

This is approximately how I was thinking of setting things up as well, so thanks for this!
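
For anyone following along, here is a rough Python sketch of that flow: submit, queue, process, ping a webhook, then fetch. It is not pdf-bot-pro's actual code. `PdfService`, `render` and `notify` are made-up names, and the in-memory queue and dict stand in for a real job server and storage.

```python
import queue
import uuid


class PdfService:
    """Toy in-memory stand-in for the API + queue + worker + webhook flow."""

    def __init__(self, render, notify):
        self.render = render    # hypothetical callable: url -> PDF bytes
        self.notify = notify    # hypothetical callable: (webhook_url, job_id) -> None
        self.jobs = queue.Queue()
        self.results = {}

    def submit(self, url, webhook_url):
        """API entry point: enqueue the request and return a job id right away."""
        job_id = uuid.uuid4().hex
        self.jobs.put((job_id, url, webhook_url))
        return job_id

    def work_once(self):
        """Worker step: render one queued job, store the PDF, then hit the webhook."""
        job_id, url, webhook_url = self.jobs.get()
        self.results[job_id] = self.render(url)
        self.notify(webhook_url, job_id)
        self.jobs.task_done()

    def fetch(self, job_id):
        """What the other system calls after its webhook has been hit."""
        return self.results[job_id]
```

In a real deployment `render` would drive a headless browser and `notify` would make an HTTP POST; both are passed in here so the flow can be exercised without either.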

deedubaya · on March 6, 2018

Yup, exactly.

africajam · on March 6, 2018

Nice, thanks for sharing! I built a simple scraper and I need to add background processing etc:

https://github.com/RealEstateWebTools/property_web_scraper

Will dig into your code to see what I can learn.

smileysteve · on March 6, 2018

I used slimer because scraper detection didn't pick it up as much. Beware that slimer (when I used it in 2015) still required xvfb.

Chrome headless should be fastest. And if you're printing to pdf, phantom and chrome had the better ability to do that.

I had each sidekiq job spinning up its own process -- and needed to do some X display avoidance.
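
One possible reading of the display problem is several processes colliding on the same X display number. A minimal Python sketch of that case, assuming `xvfb-run` is installed; the helper names and the `slimerjs shot.js` command are made up, and this is not necessarily what the setup above did:

```python
import subprocess


def xvfb_wrapped(cmd):
    """Prefix cmd with xvfb-run; -a asks it to pick a free display number."""
    return ["xvfb-run", "-a", *cmd]


def run_browser(cmd, runner=subprocess.run):
    """Run one browser process under its own virtual X server."""
    return runner(xvfb_wrapped(cmd), check=True)
```

Calling `run_browser(["slimerjs", "shot.js"])` from each job would then give every process its own virtual display.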

dchuk · on March 6, 2018

Nice, thanks for the reply. Looking at their site, it looks like they now support a true headless mode:

https://docs.slimerjs.org/master/release-notes-1.0.html

"With Firefox 56 and more, SlimerJS can be trully headless by adding the --headless option on the command line"

So that could be an interesting thing to play with. I've thought about just having my sidekiq jobs spin up a docker container with the browser in it and then just limiting the number of jobs running at once for resource constraints, and now seeing your comment I think I might just give it a shot.
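
Roughly what I have in mind, sketched in Python rather than as real sidekiq code. The image name `my-headless-browser`, the `MAX_BROWSERS` value and the helper names are all placeholders, and a semaphore like this only limits containers started from a single worker process:

```python
import subprocess
import threading

MAX_BROWSERS = 2  # assumption: tune to the host's memory and CPU
_slots = threading.BoundedSemaphore(MAX_BROWSERS)


def docker_command(image, url):
    """Build a `docker run` invocation; --rm removes the container on exit."""
    return ["docker", "run", "--rm", image, url]


def run_job(url, image="my-headless-browser", runner=subprocess.run):
    """Wait for a free slot, then run one browser container for this job."""
    with _slots:
        return runner(docker_command(image, url), check=True)
```

With several worker processes, the cap would have to live in the queue's own concurrency settings or in some shared lock instead.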

africajam · on March 6, 2018

If you decide to open source your project, please share the link here. Thanks