
To implement an HTML parser, you don’t need to worry about the corner cases at all, because the spec has your back and spells out exactly how every single case should be handled, in the form of state machines, which is how you will implement it. There are involved details, to be sure, but it’s genuinely not hard to follow the spec. The idiosyncrasies document you cite is for authors of HTML: implementers genuinely don’t need to worry about them, because it’s all covered by the spec.

For document.title: naturally you would use that in a browser; I intended to describe how you would achieve the same result without it. And for some reason I completely forgot about document.getElementsByTagNameNS, which is of course more sensible than a querySelector + find.

Note that .textContent.trim() doesn’t match the algorithm spelled out in the spec (just below the earlier link), on two counts. Firstly, every sequence of ASCII whitespace in the middle of the string needs to be collapsed to a single space ("\r\n\f\t hello\r\n\f\t world\r\n\f\t " → "hello world"). Secondly, .textContent includes too much: it also picks up the text content of element children, whereas the spec says child text content (with a link to the exact definition), meaning only the Text nodes that are direct children. HTML syntax can’t produce a title with element children (the parser switches into RCDATA state), but XML syntax can, as can DOM manipulation by scripting. Two examples whose title is “included” rather than the textContent “inclexcludeduded”:

  data:application/xhtml+xml,<title xmlns="http://www.w3.org/1999/xhtml">incl<b>excluded</b>uded</title>
  data:text/html,<title>incl</title><script>document.querySelector("title").append(b=document.createElement("b"),"uded"),b.append("excluded")</script>
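
For concreteness, here is a rough sketch of the two steps described above: take the child text content, then strip and collapse ASCII whitespace. The function names (stripAndCollapseAsciiWhitespace, childTextContent, titleText) are my own, not anything from the spec or the DOM API, and this is only my reading of the algorithm rather than a reference implementation. It only touches childNodes, nodeType and data, so it can be tried on plain objects as well as real DOM nodes.

```javascript
// ASCII whitespace: TAB, LF, FF, CR and SPACE.
const ASCII_WHITESPACE_RUN = /[\t\n\f\r ]+/g;

// Collapse every run of ASCII whitespace to one space, then drop the
// single space that may be left at either end.
function stripAndCollapseAsciiWhitespace(s) {
  return s.replace(ASCII_WHITESPACE_RUN, " ").replace(/^ | $/g, "");
}

// nodeType 3 is Text; nodeType 4 is CDATASection, which inherits from Text
// and can appear in XML documents (assumption: both should count here).
const TEXT_NODE = 3;
const CDATA_SECTION_NODE = 4;

// Concatenate the data of direct Text children only. Unlike .textContent,
// this does not descend into element children.
function childTextContent(node) {
  let out = "";
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE || child.nodeType === CDATA_SECTION_NODE) {
      out += child.data;
    }
  }
  return out;
}

// Hypothetical helper combining the two steps for a given title element.
function titleText(titleElement) {
  return stripAndCollapseAsciiWhitespace(childTextContent(titleElement));
}
```

In a browser you could pass it document.getElementsByTagNameNS("http://www.w3.org/1999/xhtml", "title")[0]; on either of the two data: URLs above it should give “included”.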


