Show HN: Getting started with Puppeteer and Chrome Headless for Web Scraping (github.com/emadehsan)
142 points by emadehsan on Aug 29, 2017 | 37 comments



Where I work we prefer jQuery to the native DOM API for scraping. It really speeds up the process of extracting data.

For example with Puppeteer you can do page.injectFile("jquery-3.2.1.min.js"). I think that would simplify your evaluate() calls.

It would also be easy to speed up the whole process by doing a single evaluate() call per page with all your scraping code in it.
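A minimal sketch of that suggestion, assuming a Puppeteer `page` that has already navigated to the target. The helper name `scrapeWithJquery`, the jQuery file path, and the selectors are illustrative and not taken from the tutorial. It uses `page.addScriptTag`, which newer Puppeteer releases provide for injecting a script file; `page.injectFile` was the call available when this comment was written.

```javascript
// Sketch: inject jQuery once, then collect everything in one evaluate() call.
// `page` is assumed to be a Puppeteer Page; path and selectors are placeholders.
async function scrapeWithJquery(page, jqueryPath = 'jquery-3.2.1.min.js') {
  // Load the local jQuery file into the page.
  await page.addScriptTag({ path: jqueryPath });

  // A single round trip to the browser. The callback runs inside the page,
  // so it can only return plain, serializable data.
  return page.evaluate(() => {
    const $ = window.jQuery;
    return {
      title: $('title').text().trim(),
      links: $('a[href]').map((i, el) => $(el).attr('href')).get(),
    };
  });
}
```

Called as `const data = await scrapeWithJquery(page);`, this replaces several separate `evaluate()` calls with one.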

BTW we just released an article with tips & tricks for Headless Chrome: https://blog.phantombuster.com/web-scraping-in-2017-headless... What do you think?


Good suggestion. I will update it soon. Thank you.


> Since the official announcement of Chrome Headless, many of the industry standard libraries for automated testing have been discontinued by their maintainers. The prominent of these are PhantomJS and Selenium IDE for Firefox.

Correct me if I'm wrong, but I believe Selenium IDE was discontinued due to a lack of maintainers, which has little if anything to do with Chrome Headless.

The IDE is just a more effective way of programming test behavior; the Selenium webdriver is still up and working with straight code (as is the case of this tutorial).


Which is kind of sad. No Chrome Headless library supports downloading files.


Related or not, it seems like a valid point.

We switched to chrome headless after a post from thoughtbot made me question Capybara-WebKit's future.


Selenium IDE was discontinued due to the change of extension (from XPI to WebExtension) in Firefox. Nothing to do with Chrome Headless.

see https://seleniumhq.wordpress.com/2017/08/09/firefox-55-and-s... and associated HN discussion https://news.ycombinator.com/item?id=15061605


Plus, there are already new IDEs showing up, for example https://chrome.google.com/webstore/detail/kantu-browser-auto...


You are right! Also the discontinuation has no relation to Chrome Headless as far as I know.

I have updated the article so it doesn't seem like the two events are related.


There's still nothing equivalent to Selenium IDE for Chrome Headless.


Great tutorial! Also, you look like a full-stack developer. How has the reception been for the HospitalRun software you worked on? (https://github.com/HospitalRun/hospitalrun-frontend)

"Somewhat similar is the case with Internet that we traversed today in quest of data."


The HospitalRun team is great and very welcoming. You can join their Slack channel here: https://hospitalrun.slack.com/ . The project is expected to be taken on by the JS Foundation in the near future. And yeah, I am a full-stack developer.


Two things:

1. Please do not test a web app with Chrome only, we don't want to go back to a world with a single browser

2. > So, until puppeteer supports this, we will rely on jsdom, a package available via npm

JSDOM is not just a package on npm, it's a piece of engineering art


Just to note, there are other uses for browser automation beyond testing (this article is about web scraping). Selenium WebDriver has its own limitations, and its maintainers aren't willing to add features to cover other use cases.

I hope Puppeteer becomes a standard.


Headless is coming to Firefox. I suspect Puppeteer will support it too very soon.


I have given up on Firefox. Every bug I submit takes its own sweet time to get fixed.


How about the bugfixes you submit, are they faster?


Ooooh I read this fantastic introduction the other day and wrote this wee HN demo using Cheerio. https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320


Do you know if it is possible to render a page without serving it from a web server? For example, I have the HTML of one page of my domain generated by a test. I would like to use Puppeteer to render it, but I don't want to set up an HTTP server for this. I would like to give page.goto a string with the HTML plus a URL and have it render the page as if it came from the real server.

I guess I can cheat by intercepting the request and responding with the HTML I already have. But I wonder if something for this already exists.
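The interception idea could look roughly like the sketch below, assuming a Puppeteer `page`. The helper name `gotoWithHtml` is hypothetical. It relies on `page.setRequestInterception`, `request.isNavigationRequest`, `request.respond`, and `request.continue` from Puppeteer's request interception API; the exact matching rule (comparing normalized URLs) is an assumption and may need adjusting for redirects or fragments.

```javascript
// Sketch: answer the navigation request for `url` with HTML we already hold,
// so the page renders under that URL without any server behind it.
async function gotoWithHtml(page, url, html) {
  // Normalize so that "https://example.com" matches "https://example.com/".
  const target = new URL(url).href;

  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.isNavigationRequest() && request.url() === target) {
      // Serve the prepared HTML instead of hitting the network.
      request.respond({ status: 200, contentType: 'text/html', body: html });
    } else {
      // Everything else (images, scripts, XHR) goes out as usual.
      request.continue();
    }
  });

  return page.goto(target);
}
```

Because the document is served under the requested URL, relative links and same-origin checks behave as they would against the real server, while subresources are still fetched from the network.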


Save the response in a file and open it in Puppeteer using the file protocol, something like file:///c:/response.html
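A small sketch of the file approach, using only Node's standard library. The helper name `saveHtmlForFileUrl` and the temporary directory prefix are made up for illustration; `pathToFileURL` takes care of producing a valid `file://` URL on both Windows and POSIX paths.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { pathToFileURL } = require('url');

// Sketch: write the HTML string to a fresh temp directory and
// return a file:// URL that can be passed to page.goto().
function saveHtmlForFileUrl(html, fileName = 'response.html') {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'render-'));
  const filePath = path.join(dir, fileName);
  fs.writeFileSync(filePath, html, 'utf8');
  return pathToFileURL(filePath).href;
}
```

Usage would be `await page.goto(saveHtmlForFileUrl(html));`. The page then runs under a `file://` origin rather than the real domain.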


You should be able to use a data URI containing the HTML string
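As a rough sketch of the data URI route, the HTML string can be percent-encoded and prefixed with a `data:text/html` header. The helper name `htmlToDataUri` is illustrative.

```javascript
// Sketch: turn an HTML string into a data: URI suitable for page.goto().
// encodeURIComponent escapes characters such as '#' and '%' that would
// otherwise cut the document short or corrupt it.
function htmlToDataUri(html) {
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
}
```

Usage would be `await page.goto(htmlToDataUri('<h1>Hello</h1>'));`. Puppeteer also offers `page.setContent(html)` for the same purpose. In both cases the page does not carry the URL of the real domain, so this fits best when only the rendered markup matters.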



That's not correct.

My initial assumption when reading the thread was that navigating to a data URI would be handled like entering a data URI into the omnibox, and would still be allowed.

A small test case confirms that assumption - it works.


This is the point of chrome headless.


I am writing almost the same thing but for PDF [1]. But I am having trouble with scaling.

I was able to make it run inside a Docker container.

At the moment the example in the repo just returns a blank PDF, but the problem is in the API Gateway.

[1] https://github.com/tecnospeed/pastor


Puppeteer is definitely cool, but on a recent project I had to revert to using NightmareJS as I needed to download files.


This is currently being worked on in Headless Chrome. There's been tons of development on the project and they're super open to feature requests.


A simple option for web scraping is just to use the developer console in a real web browser.

I have a repo outlining the basics here: https://github.com/jawj/web-scraping-for-researchers


Tried Puppeteer, it's pretty awesome. I'm a newbie in terms of scraping, but thus far it's been a pleasant experience with this tool. Has anyone used artoo.js with Puppeteer successfully?


Is it possible to call const browser = await puppeteer.launch(); multiple times in the same Node.js process? I haven't found any information about that.


It is possible. Beware though that each `puppeteer.launch()` will spawn a separate Chromium process.
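A hedged sketch of what several launches in one process might look like. The helper name `withBrowsers` is hypothetical, and the first argument is assumed to be the object returned by `require('puppeteer')` (anything with a compatible `launch()` works). For brevity it does not clean up browsers that started successfully if another launch fails.

```javascript
// Sketch: launch `count` browsers in parallel, run `task` against each,
// and close every browser afterwards even if a task throws.
async function withBrowsers(puppeteer, count, task) {
  // Each launch() call starts its own Chromium process.
  const browsers = await Promise.all(
    Array.from({ length: count }, () => puppeteer.launch())
  );
  try {
    return await Promise.all(browsers.map((browser, i) => task(browser, i)));
  } finally {
    await Promise.all(browsers.map((browser) => browser.close()));
  }
}
```

When full process isolation is not needed, opening several tabs with `browser.newPage()` on a single browser is usually lighter than launching multiple Chromium processes.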


I can write a mini script to scrape emails and GitHub, so what's with all the hype?


Correction: it's "scraping"


Corrected! Thanks :)


I tried it out when it was released; it works well and is decently fast.


What is faster than puppeteer? All the alternatives using electron look slower.


Seems like most of the parsing is done by JSDOM in this tutorial.


Updated to use `page.evaluate`



