I used puppeteer in my book’s (https://www.handsonscala.com/) build pipeline to convert HTML sources into PDFs, for online distribution and printing. This let me write and style my book using common web technologies (e.g. Bootstrap CSS) rather than needing to fiddle with specialized tooling like Pandoc or LaTeX, and it ended up looking pretty good.
Works flawlessly, and has exactly the configuration knobs you would expect and want. Took a bit of plumbing to call into a Node.js script from my Scala build logic, but all in all ended up being like 20 lines of plumbing which was straightforward to write and understand. A welcome change after struggling with bugs in wkhtmltopdf!
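For context, the core of that kind of pipeline is small. Here's a hedged sketch (not the author's actual build code; the file paths and PDF options are illustrative):

```typescript
// Render a locally generated HTML chapter to PDF with Puppeteer.
import puppeteer from 'puppeteer';

async function renderChapter(htmlPath: string, pdfPath: string): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Load the local HTML and wait for stylesheets/fonts/images to settle.
  await page.goto(`file://${htmlPath}`, { waitUntil: 'networkidle0' });

  await page.pdf({
    path: pdfPath,
    format: 'A4',              // print-friendly page size
    printBackground: true,     // keep CSS backgrounds and colors
    margin: { top: '20mm', bottom: '20mm', left: '15mm', right: '15mm' },
  });

  await browser.close();
}

renderChapter('/tmp/chapter1.html', '/tmp/chapter1.pdf').catch(console.error);
```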
This is a really interesting use case! Screenshots of your book look great.
I'd just started looking into existing e-book authoring tools like the ones you mentioned, and it quickly got overwhelming. From what you've described, I'm interested in looking into puppeteer as an alternative workflow.
One day I will write an extensive post (or set of them) about using Puppeteer to bypass sites' anti-bot measures. It's a fascinating (and annoying) cat-and-mouse game. But at the end of the day, almost all bot detection measures rely on JavaScript to report back metrics about the browser, yet those measures run in an environment where the bot completely controls what JavaScript reports back.
One of my favorite tricks I've seen employed is detection measures that check whether common detection-bypass tricks have been implemented (like checking the toString output of commonly overridden native functions).
I recently was working on the same thing (https://github.com/chris124567/puppeteer-bypassing-bot-detec...). The existing solutions (like the headless-cat-n-mouse repo) seemed to be pretty incomplete and easily detected. I got mine to pass all the checks on Antoine Vastel’s site along with Distil Networks’ and PerimeterX’s bot detection (although in practice they may have other ways of detection like checking for rapid URL visits).
Something worth noting about toString is that it can now be undetectably modified (to fake “native code”) with the new ES6 Proxy object. There was a really interesting blog post written about this at https://adtechmadness.wordpress.com/2019/03/23/javascript-ta... (I also incorporated this into my project).
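For anyone wondering what that looks like mechanically, here's a minimal, generic sketch (not taken from the linked post or repo; Math.random is just a stand-in for whatever native function a script overrides), showing why a naive toString check no longer catches the override in current V8:

```typescript
const original = Math.random;

// Naive override: Function.prototype.toString leaks the JavaScript source.
Math.random = function () { return original(); };
console.log(String(Math.random)); // "function () { return original(); }"

// Proxy-based override: the apply trap intercepts calls, but toString on the
// proxy still reports native code in current V8, so this check passes.
Math.random = new Proxy(original, {
  apply(target, thisArg, args) {
    return Reflect.apply(target, thisArg, args);
  },
});
console.log(String(Math.random)); // "function () { [native code] }"
```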
CF seems to have started classifying browsers with no existing CF cookies as likely bots (a score of 10 or less, where 99 is a human and 1 is confirmed bot) for enterprise users of their Bot Management feature[0]. From my testing, it happens for both puppeteer and incognito tabs of Chrome, even with perfect IP reputation.
That would explain why I always see CF bot prompts when visiting a site, whether for the first time or the hundredth time, in Safari with a few layers of tracking protection and no third-party cookies. If answering captchas is the price I pay for a bit more privacy, then so be it.
CF and Google Captcha are really making the web unbrowsable with hardened browsers. The web is looking really grim for people who care about privacy these days.
This may be a very noob tool for this game, but it has served me well. I'm guessing most people know about it already, but just sharing it for reference:
I wonder if google captcha will always be able to defeat puppeteer? Seems odd for google to publish a set of abuse-able APIs, and not be able to detect their use.
There are farms of people who literally sit around all day and solve CAPTCHAs - there's no surefire way to address this problem, and it usually ends up being an orchestration of reputation-scoring tooling (including making a user fill out a CAPTCHA) to fingerprint a bot.
If you're good at spoofing all of that fingerprinting you'll blow straight past them - it's all client-side, which means you have control all the way down to the bits and bytes.
I've used and love Puppeteer, but it also makes me realize with enough money and the right skills, these tools are exactly what would be used to mass manipulate social media. You could literally create thousands or millions of accounts and add enough entropy to make it undetectable.
I've created a couple scripts to delete accounts, and even signing in can include randomization between scrolls and clicks to make each change entirely unique and to mimic real user interactions. Sort of scary to think about what is possible with this, especially given a large pool of residential IP addresses.
OP here: funny and true story. We had a misconfiguration in our rate limiting. One smarty pants used it to blast his Twitch channel with "viewers", i.e. 1,000 concurrent Puppeteer sessions.
It was probably great for his engagement / viewership numbers till we shut him down.
I have to ask... What was the reason you shut this person down? Was it simply that they were violating your rate limit, or something more?
E.g. if they were paying for the 1000 concurrent puppeteer sessions, would everything be in the clear with your SaaS?
Presumably, your service doesn't care what the users use it for. Sure, though, it's a violation of twitch's terms of service to fake viewers on the platform. I may be naive -- could twitch sue a service that is used to fake viewers?
They were violating our terms of service: rate limiting, responsible behaviour, etc. So yes, we care quite a bit about what Checkly is used for. We have a page for it:
That you were asked this question makes me a little sad.
Besides risks like Twitch (or worse, all of Amazon) blocking your traffic wholesale, letting someone use your product to abuse another is just a crappy thing to do; especially if you allow it because "well, we're making money".
I crawled a few social media platforms that do not offer accessible APIs as an experiment. I mimicked user behavior, and I even occasionally bypassed Google's noCaptcha on a few sites (although that could have been just pure luck).
Seeing what people put in their bios, what their network looks like, their posts and all the metadata, it's incredible how much user data is pretty much out there in public. Given that this is just the tip of the iceberg compared to what companies like Facebook are collecting, all those stories about algorithms predicting pregnancies before the people themselves know about it seem much more realistic.
The main thing holding people back from mass-creating accounts to spam social media (even more) is captchas, and those can be outsourced to captcha farms somewhere in the world.
Absolutely, there are already cheap third-party captcha solvers that basically outsource the captchas to real people. I could also see companies using this to inflate their user count.
Try out the Selenium IDE.[0] It's a browser extension that allows you to record your movements/actions in the browser and then export a scaffold of the code to replicate the interactions. It could serve as a decent starting point.
I work at Sauce Labs (inc. TestObject and Screener.io) and the Record-and-Playback tests are the ones that tend to be the most...
Not great. They're prone to timing problems and fragile selectors. They work well for rapidly recorded, rapidly changing sites (where you can throw them away) or very stable sites (where the DOM is basically fixed), but they rapidly become a hassle in the middle.
But they sure are easy. I'd suggest recording as a first pass, then going back and refactoring them to be a bit more stable; Change XPATH or structural CSS selectors to use classes and IDs, add waits and assert for page loads, etc.
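For example, a recorded step refactored along those lines might look something like this (a sketch using selenium-webdriver's Node bindings; the URL, ids, and selectors are made up):

```typescript
import { Builder, By, until } from 'selenium-webdriver';

async function searchFlow(): Promise<void> {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://example.com');

    // Instead of a recorded positional XPath like //div[3]/form/input[1],
    // target a stable id and wait explicitly for it to appear.
    const input = await driver.wait(until.elementLocated(By.id('query')), 5000);
    await input.sendKeys('puppeteer');
    await input.submit();

    // Assert the page actually transitioned before touching results.
    await driver.wait(until.elementLocated(By.css('#search-results')), 10000);
  } finally {
    await driver.quit();
  }
}

searchFlow().catch(console.error);
```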
A good use case would be indexing real estate portals for interested buyers. I could imagine it as a relatively attractive side business, targeting real estate agencies and buyers/renters alike. But I haven't had the time to actually build it
Having worked ~8 years in residential real estate (and overseen development and maintenance of an in-house API), I've learned that the real estate data on all those sites (from Zillow to any other broker site, etc.) is very much not clean, not always correct, and not updated at a realistic interval. From a data perspective, it is extremely messy to work with. Plus, real estate brokers (and related data brokers) have little incentive to clean up the data, let alone make updates more real-time. This might not be a big issue for personal apps, but just be aware of the not-so-little gaps in real estate data. Now, if someone did this as an exercise to learn about APIs, scraping, and web browser automation, then by all means enjoy; there's plenty to learn for sure! But don't expect the data to be that usable. (Oh, and my claims of data messiness are not limited to automated access, as anyone searching for homes can tell you about their disappointments when asking about a home only to hear from the agent: "Sorry, that home was sold, we/someone forgot to take the listing down...").
This is unfortunately the truth of the situation. I spent way too much time writing scrapers for Zillow and Redfin to guide home purchasing decisions, only to find out half the stuff listed for sale had been under contract for weeks, places listed with a garage that had none, homes with pictures of pools and no pool in the data fields... just tons of errors in the data that no amount of vigilance could clean.
The mad thing is, I would happily pay for the real estate sites to provide me with this kind of data because buying a house is so expensive and such a pain...
But it would STILL suck, because the input is so dodgy.
I've been a long-time Puppeteer user, it's been game-changing in so many ways. We even built a whole platform with it to automate our company's social media presence, among countless small bash scripts, archiving utilities, QA workflows, etc.
Playwright interests me even more though. We've been getting a lot of requests for ArchiveBox.io to support other browsers as the rendering engine for web archives, and it's always seemed daunting to try and reimplement multi-browser support ourselves for puppeteer-style workflows, but Playwright seems to completely take care of that!
I've never really worked too heavily with headless, so what would be some examples of 'real-life' applications using this API? Looking for some inspiration to maybe build a side project around this :)
Thanks for the mention (browserless.io)! Really enjoying this site, let me know if you'd like a guest post and I'd happily add something. We've gathered a few weird tricks and tips as well. Definitely a lot that can be shared!
I don't like to have my home WiFi on all the time, but I want to be able to activate/deactivate it from my phone at any time. So I send an SMS to my Raspberry Pi (running raspisms), which launches an app that connects to my ISP router box and activates the WiFi like I would do manually.
My app is powered by Selenium but the concept is the same.
I'm actually using it in building an in-house health-check platform for our sites.
Its purpose in our case is front-end performance measurement. In short, a script runs checks from a cron job and publishes static reports we can reference over a timeline.
Nothing overly sophisticated, but it suits our current budget ($0).
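A rough sketch of that kind of check (the names, thresholds, and report format here are illustrative, not anyone's actual setup): load a page headlessly, pull navigation timings, and append the result to a JSON report a static page can chart over time.

```typescript
import { existsSync, readFileSync, writeFileSync } from 'fs';
import puppeteer from 'puppeteer';

async function healthCheck(url: string, reportPath: string): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const start = Date.now();
  const response = await page.goto(url, { waitUntil: 'load', timeout: 30000 });
  const loadMs = Date.now() - start;

  // Pull browser-side timing for a finer-grained view.
  const navTiming = await page.evaluate(() =>
    JSON.stringify(performance.getEntriesByType('navigation')[0]),
  );

  const entry = {
    url,
    timestamp: new Date().toISOString(),
    status: response ? response.status() : null,
    loadMs,
    navTiming: JSON.parse(navTiming),
  };

  // Append to the existing report so a static page can chart the history.
  const history = existsSync(reportPath)
    ? JSON.parse(readFileSync(reportPath, 'utf8'))
    : [];
  history.push(entry);
  writeFileSync(reportPath, JSON.stringify(history, null, 2));

  await browser.close();
}

healthCheck('https://example.com', './report.json').catch(console.error);
```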
Maybe fun to add: the guides and articles are all on GitHub, as is the source code for the site. It's VuePress-based, so it might help folks who want to make their own knowledge base using that framework.
This is highly opinionated, but I have been using and teaching how to use Selenium and Appium for a few years so that is unavoidable :) I would say the pros of Puppeteer & Playwright:
- generally higher speed
- higher reliability (specifically lower base false-positive rate)
These are mainly due to architectural choices (fewer moving parts between script and browser).
That being said, Selenium has been the open-source standard in cross-browser testing for a long time now, and is more polished and feature-rich. Also, multi-language support makes it an easier choice for non-JS teams. I would suggest a quick hands-on POC if you want to use these tools in a project.
Hey, I work on browserless.io, which supports both puppeteer and selenium.
Selenium uses a chatty HTTP interface, whereas puppeteer/playwright use WebSockets or pipes to communicate. Under the hood, however, Selenium is simply using Chrome's DevTools protocol to communicate with it. Selenium does this through another binary, generally a `driver`, that has the protocol "baked" into it and exposes the HTTP Selenium API as its input interface.
This is all a long way of saying that puppeteer/playwright have a lot fewer moving parts and are generally more approachable. Selenium _does_ have a lot more history behind it, better support across languages and frameworks, and is more stable, but it's also much larger and "clunkier" feeling. It's also a lot harder to scale with load-balancers since, again, it's all over HTTP, so you'll need some way to load-balance with sticky sessions.
Practically speaking they all do the same thing at some layer. Both are high-level APIs around the devtools protocol, it's just what higher-level interface you prefer and what your language/runtime is.
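To make the difference concrete, here's roughly what the WebSocket model looks like from the Puppeteer side (the endpoint URL and token are placeholders, not a real service endpoint): instead of launching Chrome locally, you attach to a remote browser over the DevTools protocol.

```typescript
import puppeteer from 'puppeteer';

async function run(): Promise<void> {
  // Attach to an already-running browser over a single WebSocket connection.
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.example.com?token=YOUR_TOKEN',
  });

  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());

  // Detach without killing the remote browser.
  await browser.disconnect();
}

run().catch(console.error);
```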
^^ All of this. I work for Sauce Labs; We've been pretty focused around Selenium but we're building out support for Puppeteer, Playwright, Cypress et al.
The newer automation tools benefit from being newer; They can take advantage of hardened, well designed interfaces (like the Dev Tool protocol). Selenium's been around for a bit longer, and was built when browsers didn't make it easy to control them. That's influenced the semantics of Selenium quite a lot, as well as explaining the extra moving parts (Drivers exist to map the Selenium Wire Protocol (or W3C protocol) to whatever they're driving because Selenium wasn't built with a specific browser in mind).
I feel like, at this point in time, the real difference is how much abstraction you want from the browser. Selenium is a set of knives, Puppeteer is a die cutter. You'll put in more work with Selenium, but maybe you need something to happen in a REALLY specific way. Or, you might just need shapes cut, and Puppeteer will be more reliable and faster.
Seems like this doesn't support anything but Chrome/Chromium?
Edit: I was kind of wrong already and it seems I will be completely wrong soon. Which is good in this case :-) As hlenke points out below, Firefox support is on its way :-)
tl;dr: Better ergonomics, faster, more reliable and more coverage of web platform.
* The Playwright API auto-waits for the right conditions on every action on the page (click, fill). This ensures automation scripts are concise to write and maintain over time.[1]
* Unlike Selenium, Playwright uses a bi-directional channel between the browser and the automation script. This channel is used to listen to events from the browser (like the page "load" event or network requests). These events enable Playwright scripts to be precise about browser state and avoid relying on sleeps/timeouts, which contribute to the flakiness of Selenium scripts. This is also exposed in the API, for more powerful automation[2] (a small sketch follows after this list).
* Playwright also has a wider coverage for modern browser features, including device emulation, web workers, shadow DOM, geolocation, and permissions.
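Here's a minimal sketch of the first two points using Playwright's Node API (the URL and selectors are illustrative):

```typescript
import { chromium } from 'playwright';

async function run(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Listen to network requests via the event channel.
  page.on('request', (request) => console.log('->', request.method(), request.url()));

  await page.goto('https://example.com');

  // No manual sleep: click() auto-waits until the element is visible,
  // stable and enabled before acting.
  await page.click('text=More information');

  // Wait for a concrete browser state instead of an arbitrary timeout.
  await page.waitForLoadState('networkidle');

  await browser.close();
}

run().catch(console.error);
```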
I work for https://headlesstesting.com where we provide a grid of browsers, which people can use in combination with Puppeteer and Playwright. One of the reasons people use this, instead of Selenium, is because of the increase in speed.
How does one reliably use puppeteer and know when a page has loaded?
I've tried using the various networkidle events and waiting for some DOM element, and I still find myself just using 5 seconds or something like that as the most reliable solution.
Is there a foolproof way of doing this? I feel like it should be way easier and less hacky.
Honestly, I'd completely avoid sleeps. I think it always comes back to bite you. In the end you need to learn and understand the context of the page to determine when the page has loaded.
When you say "wait for some DOM element", not sure what that refers to exactly, but how about using: `page.waitForSelector` and in addition, using the `visible: true` option?
It really depends on the page you're testing. When I'm automating Wikipedia, I very rarely have to wait for anything, whereas with JavaScript-heavy sites I use a custom wait function, or waitForSelector with visible: true.
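A small sketch of both approaches (the selectors are examples only, assuming a page that hydrates its results with JavaScript):

```typescript
import puppeteer from 'puppeteer';

async function run(): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Resolves only once the element exists AND is rendered (non-zero size,
  // not visibility:hidden / display:none).
  await page.waitForSelector('#results', { visible: true, timeout: 15000 });

  // Custom condition for pages that fill in content with JavaScript.
  await page.waitForFunction(
    () => document.querySelectorAll('#results .item').length > 0,
    { timeout: 15000 },
  );

  await browser.close();
}

run().catch(console.error);
```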
Depends on what you mean by “loaded”. Most tools in this space, like WebPageTest or commercial tools, have both a “Load” event and a “Fully Loaded” event.
Loaded is usually the DOM complete or window.onload event. Fully Loaded is an algorithmic metric that looks for a time window after the load event when there isn't any network activity (usually 1.5 to 2 seconds, depending on the tool).
Best advice is to just follow what the open source tools do. Same for other computed metrics like TBT or SpeedIndex. (I'm the CTO of a company in this space and that's what we do.)
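A rough sketch of a "Fully Loaded"-style heuristic in Puppeteer (the 2-second quiet window is an assumption based on the description above, not any particular tool's implementation):

```typescript
import puppeteer from 'puppeteer';

async function waitForFullyLoaded(url: string, quietMs = 2000): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Track in-flight requests and the time of the last network activity.
  let inflight = 0;
  let lastActivity = Date.now();
  page.on('request', () => { inflight++; lastActivity = Date.now(); });
  page.on('requestfinished', () => { inflight--; lastActivity = Date.now(); });
  page.on('requestfailed', () => { inflight--; lastActivity = Date.now(); });

  await page.goto(url, { waitUntil: 'load' });

  // Poll until the network has been quiet for `quietMs` with nothing in flight.
  while (inflight > 0 || Date.now() - lastActivity < quietMs) {
    await new Promise((resolve) => setTimeout(resolve, 100));
  }

  console.log('fully loaded (by this heuristic)');
  await browser.close();
}

waitForFullyLoaded('https://example.com').catch(console.error);
```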
You can't really know whether a site has fully loaded and is just sending analytics AJAX to the server, or whether it's querying some random service and will trigger a navigation based on the response, or whether there's a setTimeout running somewhere that'll trigger a navigation, etc. I've found a timeout to be the safest bet: if onload has fired and the page hasn't jumped in the last X seconds, it's probably going to stay at that URL.
Humans don't know either, they're just better at guessing based on visual cues.
One could design a page that visually loaded, then jumped to a redirect after 10 seconds on the page. But who would?
The primary approach is always event-based, because most pages do that sanely.
If not... the best approach I've found is looking for sentinel elements.
Essentially, something that only matches once the website is de facto loaded (regardless of events). Sometimes it's a "search results found" bit of text, sometimes a first element. But more or less, "How do I (a human) know when the page is ready?"
Affiliate networks would (the shadier they are, the more likely), because they are weird and load third-party tracking beacons on transitional pages, and want to make extra sure that the beacons (which can themselves redirect multiple times) have been loaded. To add to the fun, they're also adding random new tracking domains (to avoid being blocked, I assume), so you can't even say whether you expect some domain to be transitional or final to increase your confidence in what you measure.
You're right though, looking for elements is a pretty good way if you know the page you're checking. If you're going in blind, you can still look for things they probably have (e.g. <nav>, <header>, <section> etc), but I haven't found any that are reliably on a "real" page and reliably not on a redirect page.
That's a use case I haven't encountered, nor considered!
Most of my work is making known transitions (e.g. page1 to page2) work reliably, so I have the benefit of knowing the landing page structure.
If you're crawling pathological, client-side redirect chains, maybe do pattern-matching scans on the loaded code for the full set of redirect methods? There are only so many, and includes / doesn't-include seems a fair way to bucket pages.
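A rough sketch of that kind of scan (a heuristic only, and, as the reply below notes, easy to defeat; the patterns are an assumed, non-exhaustive set):

```typescript
// Common client-side redirect mechanisms to look for in fetched HTML/JS.
const redirectPatterns: RegExp[] = [
  /<meta[^>]+http-equiv=["']?refresh["']?[^>]*>/i,          // meta refresh
  /(?:self|window|top|document)\.location(?:\.href)?\s*=/i, // location assignment
  /location\.(?:replace|assign)\s*\(/i,                     // location.replace()/assign()
];

function looksLikeRedirectPage(html: string): boolean {
  return redirectPatterns.some((pattern) => pattern.test(html));
}

// Example: bucket a fetched page before deciding whether to keep waiting.
const sample = '<meta http-equiv="refresh" content="0; url=https://example.com/">';
console.log(looksLikeRedirectPage(sample)); // true
```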
Yeah, we had been doing that initially and found that there are lots of imaginative ways to use e.g. refresh meta-tags that browsers do accept but we did not (e.g. somebody might say content="0.0; url=https://example.com/"), and more and more networks and agencies switching to JS-based redirects led to a headless browser being easier in the end, despite having to deal with these specific issues.
A simple self.location.href = ... is still doable (-ish, because I've seen conditional changes that were essentially if(false)... to disable a redirect, which we obviously didn't consider when pattern matching), but once they include e.g. jQuery (and some do on a simple redirect page) it gets far too complicated.
I've found success using page.goto(url, { waitUntil: "networkidle2" }).
The default waitUntil fires when the onload event is triggered, networkidle2 waits until there have been no more than 2 network connections for 500ms, and networkidle0 waits until there have been none for 500ms. This + waitForSelector should help you out.
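In code, that combination looks something like this (the URL and selector are illustrative; the thresholds in the comments are Puppeteer's documented behaviour):

```typescript
import puppeteer from 'puppeteer';

async function run(): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // 'load'         -> window onload fired
  // 'networkidle2' -> no more than 2 in-flight requests for 500ms (tolerates long-polling)
  // 'networkidle0' -> no in-flight requests for 500ms (strictest)
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Belt and braces: also require a concrete element before proceeding.
  await page.waitForSelector('#app', { visible: true });

  await browser.close();
}

run().catch(console.error);
```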
The other answers pretty much sum this up already: the topic is not as straightforward as many think. I believe a dedicated post on this would be beneficial, so I will put it on my list of articles for the basics on theheadless.dev (I am one of the initial contributors).
I don't think that's possible in the most generic way. Think about when would you say the Facebook page (post login) is fully loaded.
You've got there the infinite scroll, the chat box, the notifications. All of those are never fully loaded.
Do you reliably know whether all the data on any page you look at has loaded? If I think about it honestly, I have to answer: not really, I just make some assumptions.
So when using a crawler I make some assumptions.
I wait for either more than 10 links or 20 seconds; the few pages that have 10 links or fewer get a longer wait.
I also do things like checking the links once I've moved into page processing: if a page took 20 seconds and has no links to any other page on the domain, something is probably wrong, so it goes into a problem-checking queue, etc.
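Expressed with Puppeteer, that heuristic might look something like this (thresholds copied from the comment above; the helper name and queueing are made up):

```typescript
import type { Page } from 'puppeteer';

// Returns true once the page exposes more than 10 links, or false after
// waiting 20 seconds, so the caller can route it to a problem-checking queue.
async function crawlReady(page: Page): Promise<boolean> {
  try {
    await page.waitForFunction(
      () => document.querySelectorAll('a[href]').length > 10,
      { timeout: 20000 },
    );
    return true;
  } catch {
    // Fewer than 10 links after 20s: not necessarily broken, but suspicious.
    return false;
  }
}
```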
It has to be hacky because there is no way for the browser to tell you with any certainty that everything has rendered, that there are no events in the queue that will affect the display, etc.
You might wanna switch to another analytics provider (temporarily) when posting on HN, because 60%+ of users here probably block it anyway, so you'll lose the insights, etc.
I am asking out of curiosity: is it possible & legal to set up a simple proxy server to redirect analytics data to the provider? So that the analytics traffic goes to the same domain - and is not blocked?
Google probably don't want that happening because they use the data to inform decisions for all kinds of other things (search, ad sales, filling in gaps in their creepy profiles of people, maybe even metrics-driven machine-learning-for-web-design, who knows) and are more worried about web site owners feeding them mountains of fake data than about missing stats on some % of ad-blocking users.
uBlock Origin blocks that too (CNAME blocking), but only on Firefox, since Chrome doesn't support it.
When you're on example.com, a request is sent to definitely-not-tracking.example.com, but it's really just going to tracking.ad-server.com. uBO blocks those on Firefox.
Quick question: if you were to start a new project today, what would be your choice between puppeteer and playwright?
Haven't worked extensively with them, but it seems to me Playwright is the no-brainer now because of the people behind it (the creators of Puppeteer), their experience and lessons learned with Puppeteer, and also the support for Firefox and WebKit.
I guess so; the emergence of Playwright is really interesting in general. It's two frameworks by arguably the same folks. Time will tell what happens. Competition is never bad!
I thought it was a knowledge base about e2e testing using Playwright/Puppeteer. The problem I wrote about is one that's solvable using those tools (I think and hope). It's a problem I'm trying to solve in my limited free time, and I was hoping someone had already solved it. But if that's not the case and someone else also finds it interesting, how should I have let people know there's already an open-source playground for this, without posting the project?
I get that, but regarding TypeScript and React: these tools are pretty indifferent to the underlying frameworks and are primarily used to simulate web browser interactions. I suppose you could integrate them into your e2e testing automatically, but every site is different, so it would likely be difficult to do anything useful.
I mentioned typescript and react (meaning jsx actually) because browsers run javascript.
So the browser will have to be able to make a mapping from javascript -> jsx -> typescript in order to report correctly what was covered during the execution.
That mapping is going to happen regardless of whether it's a headless or a real-user interaction, though. I don't think these tools are going to help with mapping unless you know what kind of interaction you want to simulate and can map the entire interaction. I think tracing and logging tools would be a lot more useful and appropriate for those types of scenarios. I do think they can help with e2e testing, just not in line with the frameworks you mentioned.