How does one reliably use Puppeteer and know when a page has loaded?
I've tried using the various networkidle events and waiting for some DOM element, but I find myself just sleeping for 5 seconds or so as the most reliable solution.
Is there a foolproof way of doing this? I feel like it should be way easier and less hacky.
Honestly, I'd avoid sleeps completely; they always come back to bite you. In the end you need to learn and understand the context of the page to determine when it has loaded.
When you say "wait for some DOM element", not sure what that refers to exactly, but how about using: `page.waitForSelector` and in addition, using the `visible: true` option?
It really depends on the page you're testing. When I'm automating Wikipedia, I very rarely have to wait for anything, whereas with JavaScript-heavy sites I use a custom wait function, or `waitForSelector` with `visible: true`.
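For example, a minimal sketch (assuming an existing Puppeteer `page`; `#results` is a placeholder selector):

```
// Wait until the element is both in the DOM and visible (not display:none
// or visibility:hidden). '#results' is a placeholder; use whatever element
// signals readiness on the page you're automating.
await page.waitForSelector('#results', { visible: true, timeout: 30000 });
```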
Depends on what you mean by “loaded”. Most tools in this space, like WebPageTest or commercial tools, have both a “Load” event and a “Fully Loaded” event.
Loaded is usually the DOM Complete or window.onload event. Fully Loaded is an algorithmic metric that looks for a time window after the load event when there isn't any network activity (usually 1.5 to 2 seconds, depending on the tool).
Best advice is to just follow what the open-source tools do. Same for other computed metrics like TBT or SpeedIndex. (I'm the CTO of a company in this space, and that's what we do.)
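A rough sketch of that "Fully Loaded" idea in Puppeteer (the helper name, 2-second quiet window, and timeout are my own assumptions, not any tool's actual implementation):

```
// Sketch: wait for the load event, then for a quiet window with no in-flight
// network requests. quietMs and timeoutMs are illustrative defaults.
async function waitForFullyLoaded(page, url, quietMs = 2000, timeoutMs = 60000) {
  let inflight = 0;
  let lastActivity = Date.now();
  page.on('request', () => { inflight++; lastActivity = Date.now(); });
  const settle = () => { inflight = Math.max(0, inflight - 1); lastActivity = Date.now(); };
  page.on('requestfinished', settle);
  page.on('requestfailed', settle);

  await page.goto(url, { waitUntil: 'load', timeout: timeoutMs });

  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (inflight === 0 && Date.now() - lastActivity >= quietMs) return;
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  throw new Error('network never went quiet');
}
```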
You can't really know whether a site has fully loaded and is just sending analytics AJAX to the server, or is querying some random service and will trigger a navigation based on the response, or whether there's a setTimeout running somewhere that'll trigger a navigation, etc. I've found a timeout to be the safest bet: if onload has fired and the page hasn't jumped in the last X seconds, it's probably going to stay at that URL.
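A sketch of that heuristic (the function name and thresholds are mine, purely illustrative):

```
// Wait for onload, then consider the page settled once the URL has stayed
// the same for stableMs.
async function waitForStableUrl(page, url, stableMs = 5000, timeoutMs = 60000) {
  await page.goto(url, { waitUntil: 'load', timeout: timeoutMs });
  let current = page.url();
  let since = Date.now();
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 250));
    const now = page.url();
    if (now !== current) {
      current = now;        // it jumped; restart the stability clock
      since = Date.now();
    } else if (Date.now() - since >= stableMs) {
      return current;       // hasn't jumped in the last X seconds
    }
  }
  throw new Error('URL never settled');
}
```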
Humans don't know either, they're just better at guessing based on visual cues.
One could design a page that visually loads, then jumps to a redirect after 10 seconds. But who would?
The primary approach is always event-based, because most pages do that sanely.
If not... the best approach I've found is looking for sentinel elements.
Essentially, something that only matches once the website is de facto loaded (regardless of events). Sometimes it's a "search results found" bit of text, sometimes the first result element. But more or less: "How do I (a human) know when the page is ready?"
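For example (assuming an existing `page`; the selector and text are placeholders):

```
// Sentinel element: something that only exists once the page is de facto loaded.
await page.waitForSelector('.search-result', { visible: true });

// Or a sentinel bit of text, like "results found":
await page.waitForFunction(
  () => document.body && document.body.innerText.includes('results found')
);
```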
Affiliate networks (the shadier they are, the more likely), because they're weird: they load third-party tracking beacons on transitional pages and want to make extra sure that the beacons (which can themselves redirect multiple times) have loaded. To add to the fun, they also keep adding random new tracking domains (to avoid being blocked, I assume), so you can't even say whether you expect a given domain to be transitional or final to increase your confidence in what you measure.
You're right though, looking for elements is a pretty good way if you know the page you're checking. If you're going in blind, you can still look for things they probably have (e.g. `<nav>`, `<header>`, `<section>`, etc.), but I haven't found any that are reliably on a "real" page and reliably not on a redirect page.
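E.g. something like (assuming an existing `page`):

```
// Going in blind: wait for any common structural landmark. A selector list
// lets one waitForSelector cover several candidates; this still won't
// distinguish a "real" page from a well-built redirect page.
await page.waitForSelector('nav, header, section, main', { timeout: 10000 });
```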
That's a use case I haven't encountered, nor considered!
Most of my work is making known transitions (e.g. page1 to page2) work reliably, so I have the benefit of knowing the landing page structure.
If you're crawling pathological client-side redirect chains, maybe do pattern-matching scans on the loaded code for the full set of redirect methods? There are only so many, and includes / doesn't-include seems a fair way to bucket pages.
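A sketch of that bucketing approach (assuming an existing `page`; the patterns are illustrative, nowhere near the full set, and the reply below explains where this breaks down):

```
// Bucket a loaded page by whether its source contains known redirect methods.
const redirectPatterns = [
  /<meta[^>]+http-equiv=["']?refresh/i,                       // <meta http-equiv="refresh">
  /\b(?:self|window|top|document)\.location(?:\.href)?\s*=/,  // location assignment
  /\blocation\.replace\s*\(/,                                 // location.replace(...)
];
const html = await page.content();
const looksTransitional = redirectPatterns.some((re) => re.test(html));
```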
Yeah, we had been doing that initially and found that there are lots of imaginative ways to use e.g. refresh meta tags that browsers accept but we did not (somebody might write `content="0.0; url=https://example.com/"`), and with more and more networks and agencies switching to JS-based redirects, a headless browser ended up being easier in the end, despite having to deal with these specific issues.
A simple `self.location.href = ...` is still doable (-ish, because I've seen conditional changes that were essentially `if (false) ...` to disable a redirect, which we obviously didn't consider when pattern matching), but once they include e.g. jQuery (and some do on a simple redirect page) it gets far too complicated.
I've found success using `page.goto(url, { waitUntil: "networkidle2" })`.
The default `waitUntil` fires when the onload event is triggered; `networkidle2` waits until there have been no more than 2 network connections for at least 500ms, and `networkidle0` waits until there have been no connections for at least 500ms. This plus `waitForSelector` should help you out.
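Putting that together, something like this (with `https://example.com` and `#app` as placeholders for your URL and the element you actually care about):

```
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Resolves once there have been no more than 2 network connections
  // for at least 500ms.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Then wait for the element you actually care about.
  await page.waitForSelector('#app', { visible: true });

  await browser.close();
})();
```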
The other answers pretty much sum this up already: the topic is not as straightforward as many think. I believe a dedicated post on this might be beneficial, so I will put it on my list of articles for the basics on theheadless.dev (I am one of the initial contributors).
I don't think that's possible in the most generic way. Think about when you would say the Facebook page (post-login) is fully loaded.
You've got the infinite scroll, the chat box, the notifications. None of those are ever fully loaded.
Do you reliably know whether all the data on any page you look at has loaded? If I think about it honestly, I have to answer: not really, I just make some assumptions.
So when using a crawler I make some assumptions.
I wait for either more than 10 links or 20 seconds; the few pages that have 10 links or fewer get a longer wait.
I also do things like checking the links once I've gone into page processing: if a page took the full 20 seconds and has no links to any other page on the domain, something is probably wrong, so it goes into a problem-checking queue, and so on.
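The link-count part of that could look roughly like this in Puppeteer (assuming an existing `page`; the thresholds are the ones described above):

```
// Wait up to 20s for more than 10 links; pages that never get there fall
// through to the longer-wait / problem-checking path described above.
try {
  await page.waitForFunction(
    () => document.querySelectorAll('a[href]').length > 10,
    { timeout: 20000 }
  );
} catch (err) {
  // 10 links or fewer after 20 seconds: give it a longer wait and flag it.
}
```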
It has to be hacky, because there is no way for the browser to tell you with any surety that everything has rendered and that there are no events in the event queue that will affect the display, etc.