How does one reliably use Puppeteer and know when a page has loaded?
I've tried using the various networkidle events and waiting for some DOM element, but I find myself just sleeping for 5 seconds or so as the most reliable solution.
Is there a foolproof way of doing this? I feel like it should be way easier and less hacky.
Honestly, I'd avoid sleeps completely; they always come back to bite you. In the end you need to learn and understand the context of the page to determine when it has loaded.
When you say "wait for some DOM element", not sure what that refers to exactly, but how about using: `page.waitForSelector` and in addition, using the `visible: true` option?
It really depends on the page you're testing. When I'm automating Wikipedia, I very rarely have to wait for anything, whereas with JavaScript-heavy sites I use a custom wait function, or `waitForSelector` with `visible: true`.
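For example, a minimal sketch (assuming an existing Puppeteer `page`; `#results` is a placeholder selector):

```
// Wait until the element is both in the DOM and visible (not display:none
// or visibility:hidden). '#results' is a placeholder; use whatever element
// signals readiness on the page you're automating.
await page.waitForSelector('#results', { visible: true, timeout: 30000 });
```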
Depends on what you mean by “loaded”. Most tools in this space, like WebPageTest or commercial tools, have both a “Load” event and a “Fully Loaded” event.
Loaded is usually the DOM Complete or window.onload event. Fully Loaded is an algorithmic metric that looks for a time window after the load event when there isn't any network activity (usually 1.5 to 2 seconds, depending on the tool).
Best advice is to just follow what the open-source tools do. Same for other computed metrics like TBT or SpeedIndex. (I'm the CTO of a company in this space, and that's what we do.)
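A rough sketch of that "Fully Loaded" idea in Puppeteer (the helper name, 2-second quiet window, and timeout are my own assumptions, not any tool's actual implementation):

```
// Sketch: wait for the load event, then for a quiet window with no in-flight
// network requests. quietMs and timeoutMs are illustrative defaults.
async function waitForFullyLoaded(page, url, quietMs = 2000, timeoutMs = 60000) {
  let inflight = 0;
  let lastActivity = Date.now();
  page.on('request', () => { inflight++; lastActivity = Date.now(); });
  const settle = () => { inflight = Math.max(0, inflight - 1); lastActivity = Date.now(); };
  page.on('requestfinished', settle);
  page.on('requestfailed', settle);

  await page.goto(url, { waitUntil: 'load', timeout: timeoutMs });

  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (inflight === 0 && Date.now() - lastActivity >= quietMs) return;
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  throw new Error('network never went quiet');
}
```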
You can't really know whether a site has fully loaded and is just sending analytics AJAX to the server, or is querying some random service and will trigger a navigation based on the response, or whether there's a setTimeout running somewhere that'll trigger a navigation, etc. I've found a timeout to be the safest bet: if onload has fired and the page hasn't jumped in the last X seconds, it's probably going to stay at that URL.
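A sketch of that heuristic (the function name and thresholds are mine, purely illustrative):

```
// Wait for onload, then consider the page settled once the URL has stayed
// the same for stableMs.
async function waitForStableUrl(page, url, stableMs = 5000, timeoutMs = 60000) {
  await page.goto(url, { waitUntil: 'load', timeout: timeoutMs });
  let current = page.url();
  let since = Date.now();
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 250));
    const now = page.url();
    if (now !== current) {
      current = now;        // it jumped; restart the stability clock
      since = Date.now();
    } else if (Date.now() - since >= stableMs) {
      return current;       // hasn't jumped in the last X seconds
    }
  }
  throw new Error('URL never settled');
}
```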
Humans don't know either, they're just better at guessing based on visual cues.
One could design a page that visually loads, then jumps to a redirect after 10 seconds. But who would?
The primary approach is always event-based, because most pages do that sanely.
If not... the best approach I've found is looking for sentinel elements.
Essentially, something that only matches once the website is de facto loaded (regardless of events). Sometimes it's a "search results found" bit of text, sometimes the first result element. But more or less: "How do I (a human) know when the page is ready?"
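For example (assuming an existing `page`; the selector and text are placeholders):

```
// Sentinel element: something that only exists once the page is de facto loaded.
await page.waitForSelector('.search-result', { visible: true });

// Or a sentinel bit of text, like "results found":
await page.waitForFunction(
  () => document.body && document.body.innerText.includes('results found')
);
```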
Affiliate networks (the shadier they are, the more likely), because they're weird: they load third-party tracking beacons on transitional pages and want to make extra sure that the beacons (which can themselves redirect multiple times) have loaded. To add to the fun, they also keep adding random new tracking domains (to avoid being blocked, I assume), so you can't even say whether you expect a given domain to be transitional or final to increase your confidence in what you measure.
You're right though, looking for elements is a pretty good way if you know the page you're checking. If you're going in blind, you can still look for things they probably have (e.g. `<nav>`, `<header>`, `<section>`, etc.), but I haven't found any that are reliably on a "real" page and reliably not on a redirect page.
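E.g. something like (assuming an existing `page`):

```
// Going in blind: wait for any common structural landmark. A selector list
// lets one waitForSelector cover several candidates; this still won't
// distinguish a "real" page from a well-built redirect page.
await page.waitForSelector('nav, header, section, main', { timeout: 10000 });
```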
That's a use case I haven't encountered, nor considered!
Most of my work is making known transitions (e.g. page1 to page2) work reliably, so I have the benefit of knowing the landing page structure.
If you're crawling pathological client-side redirect chains, maybe do pattern-matching scans on the loaded code for the full set of redirect methods? There are only so many, and includes / doesn't-include seems a fair way to bucket pages.
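A sketch of that bucketing approach (assuming an existing `page`; the patterns are illustrative, nowhere near the full set, and the reply below explains where this breaks down):

```
// Bucket a loaded page by whether its source contains known redirect methods.
const redirectPatterns = [
  /<meta[^>]+http-equiv=["']?refresh/i,                       // <meta http-equiv="refresh">
  /\b(?:self|window|top|document)\.location(?:\.href)?\s*=/,  // location assignment
  /\blocation\.replace\s*\(/,                                 // location.replace(...)
];
const html = await page.content();
const looksTransitional = redirectPatterns.some((re) => re.test(html));
```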
Yeah, we had been doing that initially and found that there are lots of imaginative ways to use e.g. refresh meta tags that browsers accept but we did not (somebody might write `content="0.0; url=https://example.com/"`), and with more and more networks and agencies switching to JS-based redirects, a headless browser ended up being easier in the end, despite having to deal with these specific issues.
A simple `self.location.href = ...` is still doable (-ish, because I've seen conditional changes that were essentially `if (false) ...` to disable a redirect, which we obviously didn't consider when pattern matching), but once they include e.g. jQuery (and some do on a simple redirect page) it gets far too complicated.
I've found success using `page.goto(url, { waitUntil: "networkidle2" })`.
The default `waitUntil` fires when the onload event is triggered; `networkidle2` waits until there have been no more than 2 network connections for at least 500ms, and `networkidle0` waits until there have been no connections for at least 500ms. This plus `waitForSelector` should help you out.
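Putting that together, something like this (with `https://example.com` and `#app` as placeholders for your URL and the element you actually care about):

```
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Resolves once there have been no more than 2 network connections
  // for at least 500ms.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Then wait for the element you actually care about.
  await page.waitForSelector('#app', { visible: true });

  await browser.close();
})();
```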
The other answers pretty much sum this up already: the topic is not as straightforward as many think. I believe a dedicated post on this might be beneficial, so I will put it on my list of articles for the basics on theheadless.dev (I am one of the initial contributors).
I don't think that's possible in the most generic way. Think about when you would say the Facebook page (post-login) is fully loaded.
You've got the infinite scroll, the chat box, the notifications. None of those are ever fully loaded.
Do you reliably know whether all the data on any page you look at has loaded? If I think about it honestly, I have to answer: not really, I just make some assumptions.
So when using a crawler I make some assumptions.
I wait for either more than 10 links or 20 seconds; the few pages that have 10 links or fewer get a longer wait.
I also do things like checking the links once I've gone into page processing: if a page took the full 20 seconds and has no links to any other page on the domain, something is probably wrong, so it goes into a problem-checking queue, and so on.
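The link-count part of that could look roughly like this in Puppeteer (assuming an existing `page`; the thresholds are the ones described above):

```
// Wait up to 20s for more than 10 links; pages that never get there fall
// through to the longer-wait / problem-checking path described above.
try {
  await page.waitForFunction(
    () => document.querySelectorAll('a[href]').length > 10,
    { timeout: 20000 }
  );
} catch (err) {
  // 10 links or fewer after 20 seconds: give it a longer wait and flag it.
}
```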
It has to be hacky, because there is no way for the browser to tell you with any surety that everything has rendered and that there are no events in the event queue that will affect the display, etc.