Whilst the spec certainly allows you to ignore closing of a whole range of eleme...

niconii · on June 10, 2022

I can't verify your numbers. As far as I can tell, loading a ~900,000 word document with no other differences than including or excluding has about the same load time, though there's too much variance from load to load for me to really give definitive numbers.

Are you sure you converted it properly? I'd expect those kinds of numbers if your elements were very deeply nested by mistake (e.g. omitting tags where it's not valid to do so), but I don't see why leaving out should be so slow.

Try these two pages:

https://niconii.github.io/lorem-unclosed.html

https://niconii.github.io/lorem-closed.html

shakna · on June 10, 2022

For five runs, on the same hardware with the same load:

+ Unclosed: 4.00s, 3.91s, 3.59s, 4.45s, 3.93s

+ Closed: 3.90s, 2.74s, 3.9s, 2.05s, 3.39s

Though I'd note that the newline you have immediately following the paragraph, even when closing, would probably reduce the backtracking effect. And having no explicit body or head element would probably cause some different rendering patterns as well.

paulirish · on June 10, 2022

I don't know what you're measuring (onload?), but it's not giving you enough precision to make a conclusion about the performance of the HTML parser. If you profile the page w/ devtools Performance panel, you'll see that just 5% of the CPU cost used to load & render the page is spent parsing the HTML. At that level I'm seeing costs of 22-36ms per load.

And, spoiler alert: after repeated runs I'm not seeing any substantial difference between these test pages. And based on how the HTML parser works, I wouldn't expect it.

(I work on web performance on the Chrome team)

niconii · on June 10, 2022

Were the five unclosed runs before the five closed runs? I could see that making a difference vs. interleaving them, if the hardware needs to "warm up" first.

For me, on Firefox on Linux (I know it's the one with the smallest difference, but I don't have the others on hand, sorry), using the "load" time at the bottom of the Network tab, with cache disabled and refreshing with Ctrl+F5, interleaving the tests:

- Unclosed: 1.38s, 1.49s, 1.45s, 1.52s, 1.48s

- Closed: 1.47s, 1.37s, 1.48s, 1.49s, 1.35s

The one with omitted takes about 0.032s longer on average going by these numbers, but that's about 2 frames of extra latency for a page almost twice the length of The Lord of the Rings.

Regarding the page itself, I tried to keep everything else as identical between the two versions as possible, including the DOM, hence why I wrote the immediately before each . As for backtracking, I'm not sure what you mean. The rule for the parser is simply "If the start tag is one from this list, and there's an open element on the stack, close the element before handling the start tag."

myfonj · on June 10, 2022

Well this sounds like really interesting observation. May I ask where exactly were the original closing tags located and how the stripped source looked like? I can imagine there _might_ be some differences among differently formatted code: e.g. I'd expect

    <p>Content<p>Content[EOF fig1]

to be (slightly) slower, than

    <p>Content</p><p>Content</p>[EOF fig2]

(most likely because of some "backtracking" when hitting `<p[>]`), or

    <p>Content</p>
    <p>Content</p>[EOF fig3]

(with that that small insignificant `\n` text node between paragraph nodes), what should be possibly faster than "the worst scenarios":

    <p>Content
    <p>Content[EOF fig4a]

or even

    <p>
    Content
    <p>
    Content
    [EOF fig4b]

with paragraph text nodes `["Content\n","Content]"` / `["\nContent\n","\nContent\n]"`, where the "\n" must be also preserved in the DOM but due white-space collapsing rules not present in the render tree (if not overridden by some non-default CSS) but still with backtracking, that

    <p>Content
    </p>
    <p>Content
    </p>[EOF fig5]

should eliminate (again, similarly to fig2 vs fig1).

(Sorry for wildly biased guesswork, worthless without measurements.)

shakna · on June 10, 2022

It was just paragraphs of text. p, strong, em, and q mingled at most. No figures or images or anything of the like to radically shift DOM computations. That the effect can even be seen is probably due to the scale of the document, as I noted it's a little larger than most things.

All paragraphs had a blank line between them, both with and without the p end tag. The p opening tag was always at the top-left, with no gap between it and the content.

So, for example:

    <p>Cheats open the doorway for casual play. They make it easier for disabled players to enjoy the same things as their peers, and allow people to skip parts of a game that <em>they bought</em> that they find too difficult.</p>

    <p>Unfortunately, cheats are going away, because of extensive online play, and a more corporate approach to developing games that despises anything hidden.</p>

Versus:

    <p>Cheats open the doorway for casual play. They make it easier for disabled players to enjoy the same things as their peers, and allow people to skip parts of a game that <em>they bought</em> that they find too difficult.

    <p>Unfortunately, cheats are going away, because of extensive online play, and a more corporate approach to developing games that despises anything hidden.

(You can also discount CSS from having a major effect. Less than a hundred lines of styles, where most rules are no more complicated than: `p { font-family: sans-serif; }`. No whitespace rules.)

However, if you wanted to look at this in a more scientific way - it should be entirely possible to generate test cases fairly easily, given the simplicity of the text data I saw my results with.

myfonj · on June 10, 2022

Yay, thanks for info and inspiration, sure it seems like fun weekend project.

(BTW your snippet's content sounds interesting and feels relatable, definitely intrigued.)

myfonj · on June 14, 2022

Finally did some synthetic measurements of (hopefully) parse times (not render nor CSSOM or anything like that). Differences seems microscopic but overall aligned with my initial expectations (omitting the closing tag actually shaves a bit of yak's hair), so I suspect that the real overhead you observed is caused by something happening after parse, where absence of trailing white-space in DOM nodes (ensued by closing tags) helps in some way. I guess something around that white-space or text layout. (Speaking of insignificant white-space, you could probably gain some more microseconds if you'd stuck paragraphs together (`..\n\n..` -> `....`), however such minification seems like a nuisance.)

Tested only on Windows, in browser consoles.

Numbers:

Firefox (Nightly) (performance.now is clamped to miliseconds)

    total; median; average; snippet
    2279.0; 4.0; 4.558; '<p>_'
    2652.0; 4.0; 5.304; '<p>_</p>'
    2471.0; 4.0; 4.942; '<p>_abcd'
    2387.0; 4.0; 4.774; '<p>_\n'
    3615.0; 5.0; 7.230; '<p>_</p>\n'
    2380.0; 4.0; 4.760; '<p>_abcd\n'
    3093.0; 5.0; 6.186; '<p>_\n</p>\n'
    3107.0; 5.0; 6.214; '<p>_</p>\n\n'
    2317.0; 4.0; 4.634; '<p>_abcd\n\n'
    2344.0; 4.0; 4.688; '<p>_\n\n'

Google Chrome (performance.now is sub-milisecond)

    total; median; average; snippet
    2870.4; 5.2; 5.741; '<p>_'
    2895.2; 5.4; 5.790; '<p>_</p>'
    2684.7; 5.2; 5.369; '<p>_abcd'
    2845.4; 5.2; 5.690; '<p>_\n'
    3836.7; 7.3; 7.673; '<p>_</p>\n'
    2837.8; 5.2; 5.676; '<p>_abcd\n'
    4022.5; 7.4; 8.045; '<p>_\n</p>\n'
    4044.3; 7.3; 8.089; '<p>_</p>\n\n'
    2928.4; 5.2; 5.857; '<p>_abcd\n\n'
    2805.3; 5.2; 5.611; '<p>_\n\n'

Test config

    Snippets per document: 5000
    Rounds: 500
    Wrap: '<!doctype html>(items-paragraphs)'
    Content each item (_): bunch of random digits chunks, something like '1943965927 52 27 5 51664138859173 5161 7226 5 15 2 55679 6553712585'

Code: https://gist.github.com/myfonj/57a6a8fcb1c5686527412543a897c...

(Before realizing I can use synthetic domparser I made something what measures document load time in iframe (http://myfonj.github.io/tst/html-parsing-times.html) but it gives quite unconvincing results, although probably closer to the real world. Understandably, synthetic domparser can crunch much more code than visible iframe.)

toqy · on June 10, 2022

> For some prettier mark up.

But then if you run it through Prettier it'll add all the closing tags for you :)

throwaway894345 · on June 10, 2022

If you’re running it through a processor, why it just write markdown and call it a day?

galaxyLogic · on June 10, 2022

Is there a standard definition for the "Markdown" -language?

There are several for HTML different versions and it is standardized that you can omit some closing tags and some tags altogether.

The benefit of writing in a standardized language is that later you or anybody can run tools against your sources that check for conformity.

So that is why I prefer HTML. But I would like to hear your opinion on what is the best mark-down dialect currently?

throwaway894345 · on June 10, 2022

Yes, CommonMark is a standard with implementations in many different languages.

galaxyLogic · on June 10, 2022

That is an interesting development.

From their Github page I read: "The spec contains over 500 embedded examples which serve as conformance tests."

So it's not so simple any more, is it?

(https://github.com/commonmark/commonmark-spec)

jmalicki · on June 11, 2022

Less than a 1000 conformance tests for a standard? Sounds Pretty simple to me, no way you could make an HTML compliance suite that small.

throwaway894345 · on June 10, 2022

> So it's not so simple any more, is it?

I claimed the specification existed, I didn’t claim it was a simple specification.

galaxyLogic · on June 11, 2022

I'm not claiming you claimed it was a simple specification :-)

I just find it interesting. This would indicate to me that there are 500 "features" in the language. I thought mark-down languages just provided a few shortcuts for producing the most commonly needed HTML features and then provide a fallback to HTML. So if you cannot do it in the markdown language, use HTML instead.

Thiez · on June 11, 2022

I can't really be bothered to take a look at the tests, but I strongly doubt there are actually 500 features. A large part of those tests are probably trying combinations of features. E.g. suppose markdown only had tables as a feature, and nothing else. That feature alone deserves several several tests (for tables of various sizes, edge cases such as having only the header, having rows with an incorrect number of columns, etc.).

But let's assume we can get away with just a single test for tables. And then we introduce the features "section headers" and "bold" and "underline". All these features can interact (e.g. underlined bold section headers), so we want to test combinations of all those features, and have a nice combinatorial explosion.

galaxyLogic · on June 13, 2022

I see, combinations. But also the ability to use different combinations of "basic" features in a sense are a specific feature too. Like you can mark text bold and you can mark text as representing a table. But can you mark text within tables bold? If you can that would to me be a "feature" too. If you can not, then that "feature" is missing.

hombre_fatal · on June 10, 2022

Well, one simply formats the source file as you write it. The other requires a infile -> outfile build step that's more complex.

Whether the latter is worth it tends to depend on other things than parse time.

throwaway894345 · on June 10, 2022

Why would I care if one is merely “formatting” or not? If I have to run a tool either way, I would prefer one that accepts a user-friendly input language and decouples content from presentation.

hombre_fatal · on June 14, 2022

Because transforming an .md file into an .html file is a lot more invasive (though taken for granted here I think) than just writing the .html file. It's a build step where there wasn't one before.

I'm not saying it's never worth it.

bentley · on June 10, 2022

How does Markdown decouple content from presentation?

throwaway894345 · on June 10, 2022

You typically write your content in markdown and merge it with HTML templates and CSS.

jokoon · on June 10, 2022

Are more strict html parsers/renderers, and aren't they faster?

hombre_fatal · on June 10, 2022

Lenient parsers still benefit from strict input because it lets them avoid lookaround/backtracking.

vbezhenar · on June 10, 2022

What do you mean by lookaround/backtracking? You're inside . You encounter another . You can't nest one inside another , so you close current and open new . That's about it. I fail to see where do you need any kind of backtracking.

hombre_fatal · on June 14, 2022

Well, even in this one example, imagine parser combinators which often mean backtracking the inner so that you can commit to the `openTag('p')` parser. Or your logic may be 'consume all tags that aren't ` which is a lookahead.

A better example here is whether you are lenient and accept unescaped html entities like "<" vs "<". If you require it to be escaped "<" or if all entities in your inputs are always escaped, then your text parser never has to backtrack. But if you are lenient, your text parser can do catastrophic levels of backtracking if there is a single "<" somewhere (unless you are careful). Imagine input that starts off "<a small mouse once said". If could be quite a while before your parser knows it's not an anchor open tag.

shakna · on June 10, 2022

> Are more strict html parsers/renderers, and aren't they faster?

Are what more strict? You're missing a subject there.

At a guess, you're referencing the differences between Chrome/Firefox rendering times? And are surprised that Chrome is always slower?

In the same completely unscientific stat taking, I found that Chrome was significantly faster at parsing the HTML head element of a document than Firefox, and that difference was enough for Chrome to pull ahead of Firefox in overall rendering times for smaller pages. (Chrome was about 30% of Firefox's time spent in the head.)

However, Firefox was faster at parsing the body, and as I had a larger-than-usual body (50k words is not your average webpage), Firefox was overall faster.

chrismorgan · on June 11, 2022

To you and all that have responded: there is no variation in HTML parsing between browsers. All engines are using precisely the same exhaustively-defined algorithm. There is no leniency or strictness. Their performance characteristics may differ outside of parsing, which includes what they do with the result of parsing, but in the parsing itself there should be basically no difference between engines or parsers.

hsbauauvhabzb · on June 10, 2022

That’s interesting, but surely relying on user agent to ‘fill in the gaps’ is error prone? Surely transpiling prior or during render would be more resilient than trusting browser behaviour

lolinder · on June 11, 2022

If you're in a situation where resilience against odd browser quirks matters, you probably shouldn't be writing HTML like this anyway. This style is fine for writing HTML for a blog. For any kind of application, it would be a nightmare to try to maintain.

Every time the author introduced a shorthand, they had to clarify that it works only in specific situations. The result of those qualifiers is that you will have to have some code written in the more verbose style anyway. Context switching between those styles and having to decide whether the shorthand works in any given case just isn't worth it on a large project that you'll be making changes to over time.

chrismorgan · on June 11, 2022

HTML parsing is exhaustively defined, so there’s not any filling of gaps, but only rules to be aware of. If you don’t know those rules, this may be error-prone, but if you do, it’s not, and things like the start and end tag omissions discussed in the article are quite straightforward rules to learn.