Hacker News | niconii's comments

> If hypothetically 10% of the population said A but you slice up B into specific enough buckets then A wins even if the overwhelming majority dislike A.

Yes, if you're only allowed to vote for a single option. If you're allowed to vote yes/no for each option, or rank them from best to worst, then this problem doesn't happen.
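
To make that concrete, here is a rough sketch with a made-up electorate (the 100 voters, the ten B buckets, and the names `ballots`, `plurality`, and `approval` are all hypothetical). It only illustrates the vote-splitting effect and makes no claim about real polls.

```python
from collections import Counter

# Hypothetical electorate of 100 voters (the numbers are illustrative):
# 10 voters want A; the other 90 want "some kind of B", but their
# first choices are split evenly across ten B variants.
B_OPTIONS = [f"B{i}" for i in range(1, 11)]

ballots = [{"first": "A", "approves": {"A"}} for _ in range(10)]
for option in B_OPTIONS:
    # Each B voter would accept any B variant, but not A.
    ballots += [{"first": option, "approves": set(B_OPTIONS)} for _ in range(9)]


def plurality(ballots):
    """Single-choice voting: count only each voter's first choice."""
    return Counter(b["first"] for b in ballots)


def approval(ballots):
    """Yes/no on every option: count every option a voter approves of."""
    tally = Counter()
    for b in ballots:
        tally.update(b["approves"])
    return tally


single = plurality(ballots)
multi = approval(ballots)
print(single["A"], max(single[o] for o in B_OPTIONS))  # A's votes vs. the best B bucket
print(multi["A"], max(multi[o] for o in B_OPTIONS))    # A's approvals vs. the best B
```

With single-choice ballots A leads every B bucket 10 to 9. With yes/no ballots each B variant collects 90 approvals against A's 10, so the split no longer helps A.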


It’s very hard to make multiple choices representative. Yes/No is fine, but the most popular option from 20 yeses and 20 nos isn’t meaningful.


Although there are some other uses for <p>, it is perfectly valid to use <p> tags for textual paragraphs and that has been the main use for <p> for as long as HTML has existed. I'm not sure why you believe otherwise.

Take a look at the source code for http://info.cern.ch/hypertext/WWW/MarkUp/Future.html for instance, which was written by the creator of HTML, Tim Berners-Lee.

You can also look at the source code for any page of the current HTML spec (e.g. https://html.spec.whatwg.org/multipage/introduction.html) where, again, <p> is used for each paragraph in the text.


I didn't say it's not a valid use, I said that it's not its primary use.

Paragraphs relate to grouping content[1], not textual content. There's no logic in paragraphs.

I quote here the official spec, which gives various examples of how paragraphs are not related to logical paragraphs:

> The solution is to realize that a paragraph, in HTML terms, is not a logical concept, but a structural one. In the fantastic example above, there are actually five paragraphs as defined by this specification: one before the list, one for each bullet, and one after the list.

And I'll quote also the definition on MDN:

> The <p> HTML element represents a paragraph. Paragraphs are usually represented in visual media as blocks of text separated from adjacent blocks by blank lines and/or first-line indentation, but HTML paragraphs can be any structural grouping of related content, such as images or form fields.

Failing to realize that paragraphs are grouping content rather than logical content leads to frequent misuses of paragraphs, and this comment section is literally filled with bad paragraph examples, which suggests the community is largely ignorant of HTML.

[1] https://html.spec.whatwg.org/multipage/grouping-content.html...


In this comment section? Are you talking about stuff like the example I used earlier?

    <p><div></div></p>
Yes, obviously this is bad and nonsensical HTML. Under no circumstances does it make sense to have a div inside a p. In fact, the above doesn't even work, being parsed as

    <p></p><div></div></p>
But the intention of this example is not to show good HTML. The point is that many people have only a very basic understanding of HTML syntax, under the impression that

    <foo><bar></bar></foo>
works for any elements, because there's a <foo> and a </foo> so clearly anything inside it must be inside the foo element, right? But this is not the case for all elements. HTML's syntax is more complicated than that. My example was only intended to correct this misconception, not to demonstrate semantically-correct HTML, and that goes for other similar examples made by other people in the comments too.


One interesting detail is that a lack of deep nesting was in fact a deliberate design goal for HTML originally, to make WYSIWYG editing more feasible.

http://info.cern.ch/hypertext/WWW/MarkUp/HTMLConstraints.htm...


I can't verify your numbers. As far as I can tell, loading a ~900,000 word document with no other differences than including or excluding </p> has about the same load time, though there's too much variance from load to load for me to really give definitive numbers.

Are you sure you converted it properly? I'd expect those kinds of numbers if your elements were very deeply nested by mistake (e.g. omitting tags where it's not valid to do so), but I don't see why leaving out </p> should be so slow.

Try these two pages:

https://niconii.github.io/lorem-unclosed.html

https://niconii.github.io/lorem-closed.html


For five runs, on the same hardware with the same load:

+ Unclosed: 4.00s, 3.91s, 3.59s, 4.45s, 3.93s

+ Closed: 3.90s, 2.74s, 3.90s, 2.05s, 3.39s

Though I'd note that the newline you have immediately following the paragraph, even when closing, would probably reduce the backtracking effect. And having no explicit body or head element would probably cause some different rendering patterns as well.


I don't know what you're measuring (onload?), but it's not giving you enough precision to make a conclusion about the performance of the HTML parser. If you profile the page w/ devtools Performance panel, you'll see that just 5% of the CPU cost used to load & render the page is spent parsing the HTML. At that level I'm seeing costs of 22-36ms per load.

And, spoiler alert: after repeated runs I'm not seeing any substantial difference between these test pages. And based on how the HTML parser works, I wouldn't expect it.

(I work on web performance on the Chrome team)


Were the five unclosed runs before the five closed runs? I could see that making a difference vs. interleaving them, if the hardware needs to "warm up" first.

For me, on Firefox on Linux (I know it's the one with the smallest difference, but I don't have the others on hand, sorry), using the "load" time at the bottom of the Network tab, with cache disabled and refreshing with Ctrl+F5, interleaving the tests:

- Unclosed: 1.38s, 1.49s, 1.45s, 1.52s, 1.48s

- Closed: 1.47s, 1.37s, 1.48s, 1.49s, 1.35s

The one with </p> omitted takes about 0.032s longer on average going by these numbers, but that's about 2 frames of extra latency for a page almost twice the length of The Lord of the Rings.
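
For anyone who wants to check the arithmetic, a quick sketch using the Firefox timings listed above. The 60 Hz refresh rate used for the frame conversion is an assumption, since the display isn't specified.

```python
from statistics import mean

# "load" times in seconds, copied from the two lists above.
unclosed = [1.38, 1.49, 1.45, 1.52, 1.48]
closed = [1.47, 1.37, 1.48, 1.49, 1.35]

diff = mean(unclosed) - mean(closed)  # average extra time without </p>

FRAME_SECONDS = 1 / 60  # assumption: a 60 Hz display
frames = diff / FRAME_SECONDS

print(f"difference: {diff:.3f}s, about {frames:.1f} frames")
```

With only five runs each and this much run-to-run variance, a difference of that size is well within noise.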

Regarding the page itself, I tried to keep everything else as identical between the two versions as possible, including the DOM, hence why I wrote the </p> immediately before each <p>. As for backtracking, I'm not sure what you mean. The rule for the parser is simply "If the start tag is one from this list, and there's an open <p> element on the stack, close the <p> element before handling the start tag."
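
That rule can be sketched in a few lines. This is a simplified illustration built on Python's `html.parser`, which does not implement HTML tree construction itself. It is not how any browser implements the rule: the class `ExplicitP` and the function `explicit` are hypothetical names, attributes are dropped, and scoping details of the real algorithm are ignored. It rewrites markup so that every implied `</p>` is spelled out.

```python
from html.parser import HTMLParser

# Start tags that implicitly close an open <p>; this is the list from the
# spec sentence on optional </p> tags.
CLOSES_P = frozenset("""
    address article aside blockquote details div dl fieldset figcaption
    figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr main menu nav
    ol p pre section table ul
""".split())

# A few elements that never have an end tag (a subset, enough for the sketch).
VOID = frozenset({"hr", "br", "img"})


class ExplicitP(HTMLParser):
    """Re-serialize markup with implied end tags written out."""

    def __init__(self):
        super().__init__()
        self.open = []  # stack of open element names
        self.out = []   # output fragments

    def _close_through(self, name):
        # Pop everything up to and including the innermost open `name`.
        while self.open:
            tag = self.open.pop()
            self.out.append(f"</{tag}>")
            if tag == name:
                break

    def handle_starttag(self, tag, attrs):
        # The rule: a start tag from the list closes an open <p> first.
        if tag in CLOSES_P and "p" in self.open:
            self._close_through("p")
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch
        if tag not in VOID:
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open:
            self._close_through(tag)
        elif tag == "p":
            # A stray </p> is a parse error; the HTML parsing rules recover
            # by creating an empty paragraph at that point.
            self.out.append("<p></p>")
        # Any other stray end tag is ignored here.

    def handle_data(self, data):
        self.out.append(data)


def explicit(markup):
    parser = ExplicitP()
    parser.feed(markup)
    parser.close()
    while parser.open:  # end of input closes whatever is still open
        parser.out.append(f"</{parser.open.pop()}>")
    return "".join(parser.out)


print(explicit("<p>Paragraph 1<p>Paragraph 2"))
print(explicit("<p><div>hello</div></p>"))
```

There is no backtracking anywhere in this: each start tag triggers one set lookup and, at most, a few pops off the stack.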



> A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element, or an autonomous custom element.

Whoops! Not in the spec!


What's not in the spec? Every example in the article is valid HTML, and the article itself, which is written in the same style, is valid as well:

https://validator.w3.org/nu/?doc=https%3A%2F%2Flofi.limo%2Fb...

> Document checking completed. No errors or warnings to show.

Or are you complaining that the rules are too complicated? It's very verbose and explicit because this is a specification, but the basic rule of thumb is that anything that would normally be a block element and thus doesn't make sense inside a paragraph will end that paragraph. In practice, this is not really an issue I run into.

Moreover, you need to know about this rule even if you don't omit </p>, because this is the list of elements that implicitly ends a paragraph. For example, <p><div></div></p> is invalid HTML because <div> ends the paragraph implicitly, making it equivalent to <p></p><div></div></p>.

If you don't like that, then your problem is not with this particular code style but HTML itself, which is reasonable. HTML's syntax is very complicated due to its history and doesn't always make sense. But you still have to know how it works regardless of how you personally like to write it.


Read the quoted sentence again (from the source you brought here), none of those clauses apply to:

    <p>
    Block of text ...

Which is what they do in the article.


I don't understand what you mean. Please elaborate.

That quote says that </p> is not needed in many cases. When you say "none of those clauses apply to <p>", this is true, you can't omit <p>, only </p>... but the blog article doesn't advocate for omitting <p> at any point.


1. The article omits </p> liberally.

2. The spec details situations where the </p> tag may be omitted.

3. None of these (2) apply to what's going on at (1).


Okay, let's go over this then.

First, let's talk about the basic case where there's no whitespace between the two paragraphs.

    <p>Paragraph 1</p><p>Paragraph 2</p>
In this case, the first </p> can be omitted according to the rule "A p element's end tag may be omitted if the p element is immediately followed by an [...] p [...] element [...]", resulting in this code:

    <p>Paragraph 1<p>Paragraph 2</p>
If we assume that the body ends immediately after this (either because there's a </body> or because we've reached the end of the file, since </body> and </html> are optional tags) then we can remove the second </p> as well because of the rule "A p element's end tag may be omitted if [...] there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element, or an autonomous custom element":

    <p>Paragraph 1<p>Paragraph 2
Now, let's get into the case where there is whitespace between the two paragraphs:

    <p>Paragraph 1</p>
    
    <p>Paragraph 2</p>
In this case, you can't remove the first </p>, because the rule is that it must be "immediately followed" by another p element. However, what if we start with this code?

    <p>Paragraph 1
    
    </p><p>Paragraph 2</p>
In this case, we can remove the first </p>, resulting in:

    <p>Paragraph 1
    
    <p>Paragraph 2</p>
and again, we can remove the last </p>, resulting in:

    <p>Paragraph 1
    
    <p>Paragraph 2
Now, this is different from what we started with. The whitespace is now inside the first paragraph instead of after it. But since HTML does not render this extra whitespace by default, it's of no real consequence.

And that leads us to this point: the HTML spec is specifying the exact circumstances where you can omit tags without changing the DOM. However, if we are okay with changing the DOM a bit, by moving that whitespace into the first paragraph, then we can simply pretend that we wrote

    <p>Paragraph 1
    
    </p><p>Paragraph 2</p>
from the beginning, and apply the rules to that instead.


> But since HTML does not render this extra whitespace by default, it's of no real consequence.

But it does render that extra whitespace (as a single space). Try selecting the text of the article and you'll see there is a trailing space after every paragraph.


You misunderstand the spec. Exactly what confuses you is hard to discern, but perhaps you misread “the p element” as referring to the <p> start tag, when in fact the element includes the start tag, text contents, and (if present) the end tag.


You have to know about what breaks out of <p> tags regardless of whether or not you leave off the end tag, though.

<p><div></div></p> is invalid HTML because <div> ends the paragraph, resulting in an unpaired </p>.


And not just because of that. In XHTML‐as‐XML, where <div> does not implicitly end the paragraph, what you posted is still invalid because <p> cannot contain <div>.


Okay, I've gotten a bit carried away with HTML spec minutiae elsewhere in these comments, mainly because this stuff really isn't common knowledge among web developers and it's hard to resist talking about it.

However, I also want to add to the other point here; namely, "why do this in the first place?"

Google mentions "file size optimization" in the style guide linked above, and it led to criticism on HN and elsewhere along the lines of "sure, maybe shaving off a few bytes adds up for Google, but you're not Google." However, this really isn't why I do it, and I think maybe it's poisoned the discussion a bit.

For me, it's mostly about reducing visual noise. It's not necessarily less typing if you're using an editor that auto-inserts closing tags, but I think it's harder to read with the end tags, particularly when it comes to tables. For instance, consider this table (which, if you're curious, is about Super Nintendo audio samples):

  <table>
    <tr><th>sample rate</th> <th>data rate</th>    <th>max length</th></tr>
    <tr><td>32000 Hz</td>    <td>18000 byte/s</td> <td> ~3.641 s</td></tr>
    <tr><td>16000 Hz</td>    <td> 9000 byte/s</td> <td> ~7.282 s</td></tr>
    <tr><td> 8000 Hz</td>    <td> 4500 byte/s</td> <td>~14.564 s</td></tr>
  </table>
And now, take a look at it without the optional end tags:

  <table>
    <tr><th>sample rate <th>data rate    <th>max length
    <tr><td>32000 Hz    <td>18000 byte/s <td> ~3.641 s
    <tr><td>16000 Hz    <td> 9000 byte/s <td> ~7.282 s
    <tr><td> 8000 Hz    <td> 4500 byte/s <td>~14.564 s
  </table>
Personally, this is much easier for me to read and maintain. There's also much less chance of me accidentally mismatching opening and closing tags, since I've eliminated almost all of the closing tags besides </table>.

On top of that, let's be honest, <html><head></head><body></body></html> is just useless boilerplate. We all know what goes in the head and what goes in the body, and I think we can all figure out where the HTML starts and ends. The browser knows all of this too, which is why all of those tags are completely optional. Not only does getting rid of that make the code less noisy, it solves the age-old problem of "should I indent the head and body or not?" (Though, in practice, I still keep <html lang=en> so that I can specify the language for people who use screen readers. </html> is just silly, though.)


How likely is it that a dev would be reading an HTML table with hardcoded data in the code editor in this day and age?

Today, data tables are mostly dynamically generated, probably via accessing an API endpoint of some sort. The dev would at most write a function that would output the data inside a <table> element. But the table data itself would only be visible in the browser, it wouldn't be hardcoded into the HTML markup.


If you're following the advice of the post, and using HTML as your authoring format, this is pretty common. I do it a lot on my blog.

It's similar to how often you see tables in Markdown.


Even the official Markdown guide suggests using a table generator because the markup is a hassle:

https://www.markdownguide.org/extended-syntax/

HTML tables are similarly cumbersome, with or without closing tags. If you have a table of static information to fill in, copy/pasting from Excel into an HTML table generator and pasting the output is significantly faster.


Markdown is suggesting using a generator for writing, but you were saying it was unlikely that someone would be reading a table in source.

I'd also much rather write an HTML table by hand than a markdown one.


What is your point? Even if it is a rare occurrence, from time to time you might want to output an html table using "print", and it will be slightly more comfortable that way. I sometimes find myself editing html by hand. Thanks to the optional and auto-closing tags it is just as easy as markdown, and you don't need an extra conversion step.

The verbose format has no advantage whatsoever, unless you are doing weird xml stuff.
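
As a sketch of the "print" case, here is one way to emit a table without the optional end tags. The helper name `table` is hypothetical, and the sample values are the Super Nintendo figures from the table earlier in the thread.

```python
from html import escape

HEADER = ("sample rate", "data rate", "max length")
ROWS = [
    ("32000 Hz", "18000 byte/s", "~3.641 s"),
    ("16000 Hz", "9000 byte/s", "~7.282 s"),
    ("8000 Hz", "4500 byte/s", "~14.564 s"),
]


def table(header, rows):
    """Build an HTML table, leaving out the optional </th>, </td>, </tr> tags."""
    lines = ["<table>"]
    lines.append("  <tr>" + "".join(f"<th>{escape(h)}" for h in header))
    for row in rows:
        # escape() protects against <, > and & in cell text.
        lines.append("  <tr>" + "".join(f"<td>{escape(str(c))}" for c in row))
    lines.append("</table>")
    return "\n".join(lines)


print(table(HEADER, ROWS))
```

Because no end tags are written for rows or cells, there is nothing to keep balanced when the output is built one piece at a time.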


I mentioned "block elements" in my other comment, but to be clear, CSS has no effect on how HTML is parsed, so the display property isn't relevant here.

Furthermore, you need to know about this rule even if you do write `</p>`. If you write this:

    <p>
        <div>hello</div>
    </p>
it is invalid HTML. This is because the paragraph is auto-closed, and then your code is equivalent to this:

    <p>
        </p><div>hello</div>
    </p>
resulting in an error because of the extra `</p>` after the `</div>`.


Do browsers actually do that?


They can't produce imbalanced tags because it's impossible to represent in a DOM tree.

They will indeed auto close the (first) paragraph, and also auto open a new <p> element because of the extra closing </p> tag.

Try it out by typing this in your address bar and open the inspector:

    data:text/html;charset=utf-8,<!DOCTYPE html><p><div>hello</div></p>
The generated HTML (document.body.parentNode.innerHTML):

    <head></head><body><p></p><div>hello</div><p></p></body>
Browsers will go out of their way to produce a (valid) DOM from pretty much any string (as per the HTML5 spec). (Almost?) nothing is a syntax error. You won't get a HTML5-compliant browser to show a parse error to the user in HTML (of course, this is not the case in XHTML).

edit: sorry niconii, I edited my comment under your feet.


Just to add to this, HTML parsers will attempt to make sense of any random line noise you give them and turn it into a DOM. When I say something is "invalid" HTML, what I mean is that it's not allowed by the spec and will result in an error if you run it through a validator (which you should do!).

For example, try running the following document through the W3C's HTML validator[1]:

  <!DOCTYPE html>
  <html lang=en>
  <title>Test Document</title>

  <p>
    <div></div>
  </p>
The HTML spec contains a list of all possible parse errors[2].

[1] https://validator.w3.org/nu/#textarea

[2] https://html.spec.whatwg.org/multipage/parsing.html#parse-er...


It is allowed by the spec afaik (the spec precisely tells browsers how to interpret this, so everything is perfectly specified in the spec boundaries - and so it's not "out of spec").

Of course, that screams "MISTAKE" that a validator should warn you about. Like a linter that would spot missing extra parentheses around an assignment in an if condition in a C-like language. It is allowed to omit the parentheses, but it is recommended to include them.

And of course, that makes "Valid HTML" (almost?) redundant (There are probably "vocabulary" errors that are possible, like a missing src attribute for an img or a missing title tag in head - don't take my word for this though).

div in p is not invalid, it's outright impossible to obtain from HTML parsing.

You can obtain this by doing this in JavaScript:

    document.body.appendChild(document.createElement("p"));
    document.body.firstChild.appendChild(document.createElement("div"));
Or by parsing as XHTML:

    data:application/xhtml+xml;charset=utf-8,<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Hello</title></head><body><p><div></div></p></body></html>

You get:

    document.body.innerHTML

    <p xmlns="http://www.w3.org/1999/xhtml"><div/></p>
I realize this is actually a bit scary: I go out of my way to write XHTML in the hope that any error will be caught, but parsing as text/html actually produces a valid DOM where parsing as XHTML won't necessarily.


This is not quite right. The HTML spec specifies not only what browsers/user agents are allowed to do, but also what document authors are allowed to do.

While the HTML parser does handle errors, to conform to the spec, document authors must not make these errors.

Here is an excerpt from the spec[1]:

> As described in the conformance requirements section below, this specification describes conformance criteria for a variety of conformance classes. In particular, there are conformance requirements that apply to producers, for example authors and the documents they create, and there are conformance requirements that apply to consumers, for example web browsers. They can be distinguished by what they are requiring: a requirement on a producer states what is allowed, while a requirement on a consumer states how software is to act.

Furthermore, a user agent is not required to correct errors, and can simply halt at the first error[2]:

> The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

[1] https://html.spec.whatwg.org/multipage/introduction.html#how...

[2] https://html.spec.whatwg.org/multipage/parsing.html#parse-er...


I stand corrected! Thank you. I should have read the spec before posting.


Yes, that is a case where it's not legal to omit `</p>` if you don't want the image to be inside the paragraph. Take a look at the spec[1], which outlines exactly when it is valid to omit tags.

The `p` element in particular has pretty scary rules, but it basically boils down to "if it doesn't make sense for this to be inside a paragraph (usually because it's a block element of some kind), it ends the paragraph."

[1] https://html.spec.whatwg.org/multipage/syntax.html#optional-...


Just to clarify, HTML5 didn't introduce this, it's been the case since at least 1993.

https://www.w3.org/MarkUp/HTMLPlus/htmlplus_1.html


Before HTML5 it was common to hear things like "technically you don't need it, but it's not to spec so don't do it".


Well, then they were wrong, because it's always been part of every HTML spec.

Perhaps they were thinking of XHTML, which did require all tags because it was based on XML.


I believe before HTML5 they weren't required, but what the parsed DOM tree looked like wasn't clearly defined if they were omitted, and in practice different browsers did it differently, meaning you could very easily end up with a page rendering differently in different browsers. HTML5 fixed that.


This is not correct. What you’re describing sounds like how HTML5 standardized parsing of invalid HTML, but that is not the same thing as implicit closing tags, which have always been valid, correct HTML producing an unambiguous, clearly defined result.


Oh, for sure, actually parsing HTML was awful before HTML5. The spec sometimes diverged from how browsers actually interpreted HTML, and error correction basically boiled down to browsers trying to reverse-engineer each other to figure out how they handled broken HTML. HTML5 was a godsend for actually standardizing all of that.


Probably mostly a casualty of that window when people were trying to make XHTML happen, so all the SGML-isms like omitted close tags became verboten.


Browsers literally would not render with one tiny syntax error/spec-deviation in your HTML

No different than having a JSX syntax error in React today


In XHTML strict mode, which basically nobody used.

In XHTML transitional, and HTML1-4, what you'd get is browsers with divergent understandings of your DOM tree structure such that content that worked fine in one would be a horribly mangled mess (often breaking layout in difficult to read ways) in another browser.


Some of us tried XHTML. A little bit of extra consistency in syntax and closed tags was rewarded with being able to use tooling built around XML expectations. I dug that.

And then I found out there were some browsers that wouldn't handle it right unless content-type headers were set up right on the server, and of course that made it an extra pain, especially on commodity hosting where you might not have that remotely under your control.

The "strict warnings"/"rendering problems crash the page" setting was also... mixed. Certainly prompted you to pinpoint the issue and fix it but that level of nag often felt unnecessarily militant after what everyone was used to.

