Whilst the spec certainly allows you to omit the closing tags of a whole range of elements, it's not necessarily the wisest choice to make. In my experience, the parser does actually get slower when you fail to close your tags.
Unscientific stats from a recent project where I noticed it:
+ Document is about 50,000 words in size. About 150 words to a paragraph element, on average.
+ Converting the entire thing to p elements with omitted end tags added an overhead of about 120ms in Firefox on Linux, before initial render.
+ Converting the entire thing to p elements with omitted end tags added an overhead of about 480ms in Chrome on Linux, before initial render.
+ Converting the entire thing to p elements with omitted end tags added an overhead of about 400ms in Firefox on Android, before initial render.
+ Converting the entire thing to p elements with omitted end tags added an overhead of about 560ms in Chrome on Android, before initial render.
+ The time differences appeared to increase linearly as the document grew from 20,000 to 50,000 words.
+ Curiously, Quirks Mode also increased the load times by about 250ms on Firefox and 150ms on Chrome. (Tried it just because I was surprised at the massive overhead of removing/adding the tag endings.)
The most common place this was going to be opened was Chrome on Android, and a whopping half-second slower to first render is going to be noticeable to the end user, all for some prettier markup.
Whilst you can debate whether that increased latency actually affects the user, decreased latency will always make people smile more, so including the end tags is a no-brainer. Feel free to write without them - but you _might_ consider generating them before you serve up the content, depending on your target.
I can't verify your numbers. As far as I can tell, loading a ~900,000 word document with no other differences than including or excluding </p> has about the same load time, though there's too much variance from load to load for me to really give definitive numbers.
Are you sure you converted it properly? I'd expect those kinds of numbers if your elements were very deeply nested by mistake (e.g. omitting tags where it's not valid to do so), but I don't see why leaving out </p> should be so slow.
For five runs, on the same hardware with the same load:
+ Unclosed: 4.00s, 3.91s, 3.59s, 4.45s, 3.93s
+ Closed: 3.90s, 2.74s, 3.90s, 2.05s, 3.39s
Though I'd note that the newline you have immediately after each paragraph, even in the closed version, would probably reduce the backtracking effect. And having no explicit body or head element would probably cause some different rendering patterns as well.
I don't know what you're measuring (onload?), but it's not giving you enough precision to make a conclusion about the performance of the HTML parser.
If you profile the page with the DevTools Performance panel, you'll see that just 5% of the CPU cost used to load and render the page is spent parsing the HTML. At that level I'm seeing costs of 22-36ms per load.
And, spoiler alert: after repeated runs I'm not seeing any substantial difference between these test pages. And based on how the HTML parser works, I wouldn't expect it.
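If you do want a number for parsing specifically without opening the profiler every time, the Navigation Timing entry gets you closer. A rough console sketch; treating domInteractive minus responseEnd as a "parse-ish" window is my own loose proxy, since that window also includes blocking script execution:

  // Rough sketch: read the navigation timing entry after the page settles.
  // domInteractive - responseEnd is only a loose proxy for "parse + build DOM"
  // (it also covers blocking script execution), but it is far more precise
  // than eyeballing the load time in the Network tab.
  const [nav] = performance.getEntriesByType("navigation");
  console.table({
    "fetch (ms)": (nav.responseEnd - nav.startTime).toFixed(1),
    "parse-ish (ms)": (nav.domInteractive - nav.responseEnd).toFixed(1),
    "DOMContentLoaded (ms)": (nav.domContentLoadedEventEnd - nav.startTime).toFixed(1),
    "load (ms)": (nav.loadEventEnd - nav.startTime).toFixed(1),
  });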
Were the five unclosed runs before the five closed runs? I could see that making a difference vs. interleaving them, if the hardware needs to "warm up" first.
For me, on Firefox on Linux (I know it's the one with the smallest difference, but I don't have the others on hand, sorry), using the "load" time at the bottom of the Network tab, with cache disabled and refreshing with Ctrl+F5, interleaving the tests:
- Unclosed: 1.38s, 1.49s, 1.45s, 1.52s, 1.48s
- Closed: 1.47s, 1.37s, 1.48s, 1.49s, 1.35s
The one with </p> omitted takes about 0.032s longer on average going by these numbers, but that's about 2 frames of extra latency for a page almost twice the length of The Lord of the Rings.
Regarding the page itself, I tried to keep everything else as identical between the two versions as possible, including the DOM, which is why I wrote the </p> immediately before each <p>. As for backtracking, I'm not sure what you mean. The rule for the parser is simply "If the start tag is one from this list, and there's an open <p> element on the stack, close the <p> element before handling the start tag."
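For what it's worth, that rule is cheap to apply. Here's a toy sketch of its shape (my simplification; the real spec step also checks "button scope" and uses a longer list of start tags):

  // Toy sketch of that tree-construction step, not the spec's full "in body"
  // insertion mode. These are some of the start tags that implicitly close
  // an open <p>; the spec's list is longer.
  const CLOSES_P = new Set(["p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
                            "ul", "ol", "li", "blockquote", "pre", "table"]);

  function handleStartTag(tagName, openElements) {
    // openElements: stack of open element names, most recently opened last.
    if (CLOSES_P.has(tagName) && openElements.includes("p")) {
      // "Close a p element": pop until the <p> has been popped.
      while (openElements.pop() !== "p") { /* generate implied end tags */ }
    }
    openElements.push(tagName); // then insert the new element as usual
  }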
Well, this sounds like a really interesting observation. May I ask where exactly the original closing tags were located and what the stripped source looked like? I can imagine there _might_ be some differences among differently formatted code: e.g. I'd expect
<p>Content<p>Content[EOF fig1]
to be (slightly) slower, than
<p>Content</p><p>Content</p>[EOF fig2]
(most likely because of some "backtracking" when hitting `<p[>]`), or
<p>Content</p>
<p>Content</p>[EOF fig3]
(with that small, insignificant `\n` text node between paragraph nodes), which should possibly be faster than "the worst scenarios":
<p>Content
<p>Content[EOF fig4a]
or even
<p>
Content
<p>
Content
[EOF fig4b]
with paragraph text nodes `["Content\n","Content"]` / `["\nContent\n","\nContent\n"]`, where the "\n" must also be preserved in the DOM but, due to white-space collapsing rules, is not present in the render tree (if not overridden by some non-default CSS), but still with backtracking, which
<p>Content
</p>
<p>Content
</p>[EOF fig5]
should eliminate (again, similarly to fig2 vs fig1).
(Sorry for wildly biased guesswork, worthless without measurements.)
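One cheap way to ground the DOM side of that guesswork is to run each shape through DOMParser in a console and look at the child nodes it produces; a sketch (DOMParser uses the same HTML parsing algorithm as a page load):

  // Sketch: inspect which text nodes each formatting variant actually yields.
  const variants = {
    fig1:  "<p>Content<p>Content",
    fig2:  "<p>Content</p><p>Content</p>",
    fig4a: "<p>Content\n<p>Content",
    fig5:  "<p>Content\n</p>\n<p>Content\n</p>",
  };
  for (const [name, html] of Object.entries(variants)) {
    const body = new DOMParser().parseFromString(html, "text/html").body;
    const shape = [...body.childNodes].map(n =>
      n.nodeType === Node.TEXT_NODE
        ? JSON.stringify(n.data)                    // stray inter-paragraph text node
        : `<p>${JSON.stringify(n.textContent)}</p>` // paragraph and the text inside it
    );
    console.log(name, shape.join(" "));
  }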
It was just paragraphs of text. p, strong, em, and q mingled at most. No figures or images or anything of the like to radically shift DOM computations. That the effect can even be seen is probably due to the scale of the document, as I noted it's a little larger than most things.
All paragraphs had a blank line between them, both with and without the p end tag. The p opening tag always started at the beginning of its line, with no gap between it and the content.
So, for example:
<p>Cheats open the doorway for casual play. They make it easier for disabled players to enjoy the same things as their peers, and allow people to skip parts of a game that <em>they bought</em> that they find too difficult.</p>
<p>Unfortunately, cheats are going away, because of extensive online play, and a more corporate approach to developing games that despises anything hidden.</p>
Versus:
<p>Cheats open the doorway for casual play. They make it easier for disabled players to enjoy the same things as their peers, and allow people to skip parts of a game that <em>they bought</em> that they find too difficult.
<p>Unfortunately, cheats are going away, because of extensive online play, and a more corporate approach to developing games that despises anything hidden.
(You can also discount CSS from having a major effect. Less than a hundred lines of styles, where most rules are no more complicated than: `p { font-family: sans-serif; }`. No whitespace rules.)
However, if you wanted to look at this in a more scientific way, it should be entirely possible to generate test cases fairly easily, given the simplicity of the text data I saw my results with.
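Something like this throwaway Node sketch would do it (the filler word, word counts, and file names here are all arbitrary):

  // Throwaway test-case generator: identical content, one file with </p>,
  // one without. ~50,000 words at ~150 words per paragraph roughly matches
  // the document described above; the filler word itself does not matter.
  const fs = require("fs");

  const WORDS_TOTAL = 50000;
  const WORDS_PER_P = 150;
  const paragraph = Array(WORDS_PER_P).fill("word").join(" ");
  const count = Math.ceil(WORDS_TOTAL / WORDS_PER_P);

  function page(closeTags) {
    const p = closeTags ? `<p>${paragraph}</p>` : `<p>${paragraph}`;
    return `<!DOCTYPE html>\n<title>test</title>\n${Array(count).fill(p).join("\n\n")}\n`;
  }

  fs.writeFileSync("closed.html", page(true));
  fs.writeFileSync("unclosed.html", page(false));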
Finally did some synthetic measurements of (hopefully) parse times (not render, nor CSSOM, nor anything like that). The differences seem microscopic but overall aligned with my initial expectations (omitting the closing tag actually shaves off a bit of yak's hair), so I suspect that the real overhead you observed is caused by something happening after parsing, where the absence of trailing white-space in DOM text nodes (a side effect of the closing tags) helps in some way. I'd guess something around white-space handling or text layout. (Speaking of insignificant white-space, you could probably gain a few more microseconds by sticking paragraphs together (`..</p>\n\n<p>..` -> `..</p><p>..`), though such minification seems like a nuisance.)
Tested only on Windows, in browser consoles.
Numbers:
Firefox (Nightly) (performance.now is clamped to milliseconds)
(Before realizing I could use a synthetic DOMParser, I made something that measures document load time in an iframe (http://myfonj.github.io/tst/html-parsing-times.html), but it gives quite unconvincing results, although probably closer to the real world. Understandably, a synthetic DOMParser can crunch much more code than a visible iframe.)
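The synthetic part boils down to roughly this shape (simplified, not the exact script behind the numbers above; `closedHtml` and `unclosedHtml` stand in for the two test documents loaded as strings):

  // Time DOMParser alone, so layout, CSSOM and rendering stay out of the
  // picture. performance.now() resolution varies by browser (Firefox clamps
  // it), hence repeating the parse and taking the median.
  function timeParse(html, runs = 20) {
    const parser = new DOMParser();
    const samples = [];
    for (let i = 0; i < runs; i++) {
      const t0 = performance.now();
      parser.parseFromString(html, "text/html");
      samples.push(performance.now() - t0);
    }
    samples.sort((a, b) => a - b);
    return samples[Math.floor(runs / 2)]; // median, in ms
  }

  console.log("closed  :", timeParse(closedHtml));   // closedHtml / unclosedHtml:
  console.log("unclosed:", timeParse(unclosedHtml)); // the two documents as strings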
I'm not claiming you claimed it was a simple specification :-)
I just find it interesting. This would indicate to me that there are 500 "features" in the language. I thought markdown languages just provided a few shortcuts for producing the most commonly needed HTML features and then provided a fallback to HTML. So if you cannot do it in the markdown language, use HTML instead.
I can't really be bothered to take a look at the tests, but I strongly doubt there are actually 500 features. A large part of those tests is probably trying combinations of features. E.g. suppose markdown only had tables as a feature, and nothing else. That feature alone deserves several tests (for tables of various sizes, edge cases such as having only the header, having rows with an incorrect number of columns, etc.).
But let's assume we can get away with just a single test for tables. And then we introduce the features "section headers" and "bold" and "underline". All these features can interact (e.g. underlined bold section headers), so we want to test combinations of all those features, and have a nice combinatorial explosion.
I see, combinations. But the ability to use different combinations of "basic" features is, in a sense, a specific feature too. Like you can mark text bold and you can mark text as representing a table. But can you mark text within tables bold? If you can, that would be a "feature" too, to me. If you cannot, then that "feature" is missing.
Why would I care if one is merely “formatting” or not? If I have to run a tool either way, I would prefer one that accepts a user-friendly input language and decouples content from presentation.
Because transforming an .md file into an .html file is a lot more invasive (though taken for granted here I think) than just writing the .html file. It's a build step where there wasn't one before.
What do you mean by lookaround/backtracking? You're inside <p>. You encounter another <p>. You can't nest one <p> inside another <p>, so you close current <p> and open new <p>. That's about it. I fail to see where do you need any kind of backtracking.
Well, even in this one example, imagine parser combinators, which often mean backtracking out of the inner <p> so that you can commit to the `openTag('p')` parser. Or your logic may be `consume all tags that aren't <p>`, which is a lookahead.
A better example here is whether you are lenient and accept an unescaped "<" in text instead of requiring the escaped "&lt;". If you require it to be escaped, or if all such characters in your inputs are always escaped, then your text parser never has to backtrack. But if you are lenient, your text parser can do catastrophic levels of backtracking if there is a single stray "<" somewhere (unless you are careful). Imagine input that starts off "<a small mouse once said". It could be quite a while before your parser knows it's not an anchor open tag.
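To make that concrete, here's a deliberately naive lenient scanner (a toy of my own, not the HTML5 tokenizer, which never rewinds): on a stray `<` it scans all the way to the end of the input before giving up and calling it text.

  // Naive "lenient" tokenizer: on '<' it optimistically looks for a matching
  // '>' and only treats the '<' as literal text once it has scanned to the
  // end without finding one. On "<a small mouse once said..." that means
  // walking the whole remaining input before backtracking.
  function naiveTokenize(input) {
    const tokens = [];
    let i = 0;
    while (i < input.length) {
      if (input[i] === "<") {
        const close = input.indexOf(">", i + 1);
        if (close !== -1) {
          tokens.push({ type: "tag", raw: input.slice(i, close + 1) });
          i = close + 1;
          continue;
        }
        // No '>' anywhere ahead: backtrack, the '<' was just text after all.
      }
      tokens.push({ type: "text", raw: input[i] });
      i += 1;
    }
    return tokens;
  }

  naiveTokenize("<a small mouse once said hello"); // one long scan, then all text tokens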
> Are more strict html parsers/renderers, and aren't they faster?
Are what more strict? You're missing a subject there.
At a guess, you're referencing the differences between Chrome/Firefox rendering times? And are surprised that Chrome is always slower?
In the same completely unscientific stat-taking, I found that Chrome was significantly faster at parsing the HTML head element of a document than Firefox, and that difference was enough for Chrome to pull ahead of Firefox in overall rendering times for smaller pages. (Chrome spent about 30% as much time in the head as Firefox did.)
However, Firefox was faster at parsing the body, and as I had a larger-than-usual body (50k words is not your average webpage), Firefox was overall faster.
To you and all that have responded: there is no variation in HTML parsing between browsers. All engines are using precisely the same exhaustively-defined algorithm. There is no leniency or strictness. Their performance characteristics may differ outside of parsing, which includes what they do with the result of parsing, but in the parsing itself there should be basically no difference between engines or parsers.
That’s interesting, but surely relying on the user agent to ‘fill in the gaps’ is error-prone? Surely transpiling prior to or during render would be more resilient than trusting browser behaviour.
If you're in a situation where resilience against odd browser quirks matters, you probably shouldn't be writing HTML like this anyway. This style is fine for writing HTML for a blog. For any kind of application, it would be a nightmare to try to maintain.
Every time the author introduced a shorthand, they had to clarify that it works only in specific situations. The result of those qualifiers is that you will have to have some code written in the more verbose style anyway. Context switching between those styles and having to decide whether the shorthand works in any given case just isn't worth it on a large project that you'll be making changes to over time.
HTML parsing is exhaustively defined, so there’s not any filling of gaps, but only rules to be aware of. If you don’t know those rules, this may be error-prone, but if you do, it’s not, and things like the start and end tag omissions discussed in the article are quite straightforward rules to learn.
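If you want to see those rules in action, parse a tag-omitted fragment and serialize it back; every conforming engine builds, and prints, the same tree. A console one-liner, roughly:

  // The tree builder infers the optional end tags; serializing the result
  // shows them. Any spec-conforming engine gives the same output.
  const doc = new DOMParser().parseFromString(
    "<p>First paragraph<p>Second paragraph", "text/html");
  console.log(doc.body.innerHTML);
  // "<p>First paragraph</p><p>Second paragraph</p>"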