Show HN: Getting started with Puppeteer and Chrome Headless for Web Scraping

paps · on Aug 29, 2017

Where I work we prefer jQuery to the native DOM API for scraping. It really speeds up the process of extracting data.

For example with Puppeteer you can do page.injectFile("jquery-3.2.1.min.js"). I think that would simplify your evaluate() calls.

It would also be easy to speed up the whole process by doing a single evaluate() call per page with all your scraping code in it.
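A minimal sketch of both suggestions, under some assumptions: `page.injectFile` was the Puppeteer call at the time of this thread, while later releases provide `page.addScriptTag`, which is what the sketch uses. The `scrapeWithJquery` helper, the default jQuery path, and the `h2 a` selector are made up for illustration.

```javascript
// Sketch only: inject a local jQuery build, then collect everything in ONE evaluate() call.
// Assumes a Puppeteer version that provides page.addScriptTag({ path }).
// `scrapeWithJquery`, the default path, and the selector are hypothetical.
async function scrapeWithJquery(page, jqueryPath = 'jquery-3.2.1.min.js') {
  await page.addScriptTag({ path: jqueryPath });

  // The callback runs inside the page, so `$` refers to the injected jQuery.
  return page.evaluate(() => {
    const links = $('h2 a')
      .map((i, el) => ({ text: $(el).text().trim(), href: el.href }))
      .get();
    return { title: document.title, links };
  });
}
```

Returning one plain object from a single `evaluate()` call keeps the number of round trips between Node.js and the browser to one per page.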

BTW we just released an article with tips & tricks for Headless Chrome: https://blog.phantombuster.com/web-scraping-in-2017-headless... What do you think?

emadehsan · on Aug 29, 2017

Good suggestion. I will update the article soon. Thank you.

Giroflex · on Aug 29, 2017

> Since the official announcement of Chrome Headless, many of the industry standard libraries for automated testing have been discontinued by their maintainers. The prominent of these are PhantomJS and Selenium IDE for Firefox.

Correct me if I'm wrong, but I believe Selenium IDE was discontinued due to a lack of maintainers, and that has little if any relation to Chrome Headless.

The IDE is just a more effective way of programming test behavior; the Selenium webdriver is still up and working with straight code (as is the case of this tutorial).

coolio222 · on Aug 29, 2017

Which is kind of sad. No Chrome Headless library supports downloading files.

jaxn · on Aug 29, 2017

Related or not, it seems like a valid point.

We switched to chrome headless after a post from thoughtbot made me question Capybara-WebKit's future.

escap · on Aug 29, 2017

Selenium IDE was discontinued because of the change of extension format (from XPI to WebExtension) in Firefox. It has nothing to do with Chrome Headless.

See https://seleniumhq.wordpress.com/2017/08/09/firefox-55-and-s... and the associated HN discussion: https://news.ycombinator.com/item?id=15061605

tw21 · on Aug 29, 2017

Plus, there are already new IDEs showing up, for example https://chrome.google.com/webstore/detail/kantu-browser-auto...

emadehsan · on Aug 29, 2017

You are right! Also the discontinuation has no relation to Chrome Headless as far as I know.

I have updated the article so it doesn't seem like the two events are related.

kyriakos · on Aug 29, 2017

There's still nothing equivalent to Selenium IDE for Chrome Headless.

ankit84 · on Aug 29, 2017

Great tutorial! Also, you look like a full-stack developer. How has the reception been for the HospitalRun software you worked on? (https://github.com/HospitalRun/hospitalrun-frontend)

"Somewhat similar is the case with Internet that we traversed today in quest of data."

emadehsan · on Aug 29, 2017

The HospitalRun team is great and very welcoming. You can join their Slack channel here: https://hospitalrun.slack.com/ . The project is expected to be taken on by the JS Foundation in the near future. And yes, I am a full-stack developer.

twsted · on Aug 29, 2017

Two things:

1. Please do not test a web app with Chrome only, we don't want to go back to a world with a single browser

2. > So, until puppeteer supports this, we will rely on jsdom, a package available via npm

JSDOM is not just a package on npm, it's a piece of engineering art

rodorgas · on Aug 29, 2017

Just to note, browser automation has other uses beyond testing (this article is about web scraping). Selenium WebDriver has its own limitations, and its maintainers aren't willing to add features to cover other use cases.

I hope Puppeteer becomes a standard.

iLemming · on Aug 30, 2017

Headless mode is coming to Firefox. I suspect Puppeteer will support it too very soon.

hugh7 · on Aug 29, 2017

I have given up on Firefox. Every bug I submit takes its own sweet time to get fixed.

rodorgas · on Aug 29, 2017

How about the bugfixes you submit, are they faster?

veb · on Aug 29, 2017

Ooooh I read this fantastic introduction the other day and wrote this wee HN demo using Cheerio. https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320

testcross · on Aug 29, 2017

Do you know if it is possible to render a page without serving it from a web server? For example, I have the HTML of one page of my domain, generated by a test, and I would like to use Puppeteer to render it without setting up an HTTP server. Ideally I would pass a string with the HTML plus a URL to page.goto and let it render the page as if it came from the real server.

I guess I can cheat by intercepting the request and responding with the HTML I already have. But I wonder whether something like this already exists.
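The interception idea might be sketched as follows. This assumes a Puppeteer release that offers `page.setRequestInterception()` and `request.respond()`; `renderHtmlAt` is a hypothetical helper name, not an existing API.

```javascript
// Sketch only: answer the navigation request for `url` with HTML we already have.
// Assumes page.setRequestInterception() and request.respond() are available
// (newer Puppeteer releases); `renderHtmlAt` is a hypothetical helper name.
async function renderHtmlAt(page, url, html) {
  const target = new URL(url).href; // normalise, e.g. adds a trailing "/"

  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.url() === target) {
      // Serve the generated HTML as if it came from the real server.
      request.respond({ status: 200, contentType: 'text/html', body: html });
    } else {
      // Let every other request (CSS, images, XHR) go to the network.
      request.continue();
    }
  });

  await page.goto(target);
}
```

Because the page is loaded under the requested URL, relative links and same-origin checks should behave much as they would against the real server.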

stokilo · on Aug 29, 2017

Save the response in a file and open it in Puppeteer using the file protocol, something like file:///c:/response.html
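A rough sketch of that approach using only Node.js built-ins. The `saveHtmlAsFileUrl` helper and the temporary directory prefix are hypothetical; the returned URL would then be passed to `page.goto()`.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { pathToFileURL } = require('url');

// Sketch only: write the HTML to a temporary file and return a file:// URL for it.
// `saveHtmlAsFileUrl` is a hypothetical helper, not part of Puppeteer.
function saveHtmlAsFileUrl(html, fileName = 'response.html') {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'scrape-'));
  const filePath = path.join(dir, fileName);
  fs.writeFileSync(filePath, html, 'utf8');
  return pathToFileURL(filePath).href; // usable as page.goto(fileUrl)
}
```

Note that the page then has a file:// origin rather than the real domain, so relative URLs and same-origin behaviour will differ from the live site.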

houli · on Aug 29, 2017

You should be able to use a data URI containing the HTML string

egeozcan · on Aug 29, 2017

It's not allowed anymore: https://groups.google.com/a/chromium.org/forum/m/#!topic/bli...

ConfucianNardin · on Aug 30, 2017

That's not correct.

My initial assumption when reading the thread was that navigating to a data URI would be handled like entering a data URI into the omnibox and would still be allowed.

A small test case confirms that assumption - it works.
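For illustration, a test along those lines could build the data URI like this. This is a sketch, not the commenter's actual test case, and `htmlToDataUri` is a hypothetical helper name.

```javascript
// Sketch only: turn an HTML string into a data: URI that can be passed to page.goto().
// `htmlToDataUri` is a hypothetical helper name.
function htmlToDataUri(html) {
  // Base64 avoids having to escape characters such as "#" and "%".
  const encoded = Buffer.from(html, 'utf8').toString('base64');
  return `data:text/html;charset=utf-8;base64,${encoded}`;
}

// Usage with an existing Puppeteer page:
//   await page.goto(htmlToDataUri('<h1>Hello</h1>'));
```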

hugh7 · on Aug 29, 2017

This is the point of chrome headless.

garou · on Aug 29, 2017

I am writing almost the same thing but for PDF [1]. But I am having trouble with scaling.

I managed to get it running inside a Docker container.

At the moment the example in the repo just returns a blank PDF, but that problem lies with the API Gateway.

[1] https://github.com/tecnospeed/pastor

MrBlue · on Aug 29, 2017

Puppeteer is definitely cool, but on a recent project I had to revert to NightmareJS because I needed to download files.

andrewguenther · on Aug 29, 2017

This is currently being worked on in Headless Chrome. There's been tons of development on the project and they're super open to feature requests.

gmac · on Aug 29, 2017

A simple option for web scraping is just to use the developer console in a real web browser.

I have a repo outlining the basics here: https://github.com/jawj/web-scraping-for-researchers

jasan_s · on Aug 29, 2017

Tried Puppeteer; it's pretty awesome. I'm a newbie in terms of scraping, but so far it's been a pleasant experience with this tool. Has anyone used artoo.js with Puppeteer successfully?

testcross · on Aug 29, 2017

Is it possible to call `const browser = await puppeteer.launch();` multiple times in the same Node.js process? I haven't found any information about that.

aslushnikov · on Aug 29, 2017

It is possible. Beware though that each `puppeteer.launch()` will spawn a chromium process.
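A minimal sketch of what that could look like. `launchBrowsers` and `closeBrowsers` are hypothetical helper names, and the `puppeteer` argument is assumed to be the module returned by `require('puppeteer')`.

```javascript
// Sketch only: start several independent browsers from one Node.js process.
// `launchBrowsers` is a hypothetical helper; pass in require('puppeteer').
async function launchBrowsers(puppeteer, count) {
  // Each launch() call spawns its own Chromium process.
  const launches = Array.from({ length: count }, () => puppeteer.launch());
  return Promise.all(launches);
}

// Close every browser so no Chromium processes are left behind.
async function closeBrowsers(browsers) {
  await Promise.all(browsers.map((browser) => browser.close()));
}
```

If the goal is only parallel scraping, opening several pages in one browser with `browser.newPage()` may be cheaper than several launches, since it avoids the extra Chromium processes.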

naveedahmada036 · on Sept 2, 2017

I can write a mini script to scrape emails and GitHub; what's with all the hype?

dchuk · on Aug 29, 2017

Correction: it's "scraping"

emadehsan · on Aug 29, 2017

Corrected! Thanks :)

desireco42 · on Aug 29, 2017

I tried it out when it was released, it works well and it is decently fast.

testcross · on Aug 29, 2017

What is faster than puppeteer? All the alternatives using electron look slower.

kasbah · on Aug 29, 2017

Seems like most of the parsing is done by JSDOM in this tutorial.

emadehsan · on Aug 29, 2017

Updated to use `page.evaluate`