* All white space in HTML and XML is preserved verbatim.
* HTML has a default presentation scheme that varies by interpreter. For everything else use CSS.
* The default presentation of white space in HTML and XML is what is called tokenized space, which is that all consecutive white space characters are displayed as a single space character. Again, you can control this with CSS.
* White space does not determine the behavior or display of other HTML tags.
* White space is a text node in the DOM. If it is adjacent to other text that text and white space are one text node, otherwise the white space is its own text node.
That should be all there is to it. JavaScript has absolutely no bearing on this subject.
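To make the text-node rule concrete, here is a minimal sketch using Python's standard `html.parser`. It is only a tokenizer, not a browser DOM, and the names `TextNodeCollector` and `collect_text_nodes` are made up for illustration. It just shows that each run of character data between tags arrives untouched, and that whitespace sitting between two tags arrives as its own chunk.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects each run of character data between tags, unmodified."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        # Called once per contiguous run of text; whitespace is not trimmed.
        self.text_nodes.append(data)


def collect_text_nodes(markup):
    collector = TextNodeCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at end of input
    return collector.text_nodes


markup = "<p>foo   bar</p>\n\n<p> baz</p>"
nodes = collect_text_nodes(markup)
print(nodes)  # ['foo   bar', '\n\n', ' baz']
```

The three spaces inside `foo   bar` stay part of the surrounding text, while the `\n\n` between the two paragraphs shows up as a separate whitespace-only chunk.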
> White space does not determine the behavior or display of other HTML tags.
As mentioned in the article, the collapsing behavior of leading/trailing white space can affect other elements.
The only difference between the two cases in the following example is that the latter has a space at the end of the first element's content. That causes the second element's leading space to not be rendered: https://codepen.io/mkantor/pen/RNbXVJM?editors=1000
In addition, the two cases can be toggled on and off using CSS alone, because `display:inline`, `display:block`, and `display:inline-block` all do different things to the two pieces of whitespace:
* `display:inline` allows the trailing space to be rendered (in red). There is now whitespace between the two elements, therefore all other whitespace up until the next non-whitespace character will be ignored.
* `display:inline-block` will remove the trailing red whitespace. Therefore when the blue whitespace is found, this is rendered.
* `display:block` will render the element as a block. Therefore trailing whitespace in that element and leading whitespace in the next element will both be ignored.
This is surprisingly weird and funky. I thought I had a fairly good handle on how odd whitespace in HTML could be, but I didn't realise how much it was controlled by different CSS display settings.
> All white space in HTML and XML is preserved verbatim.
Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"? That's a perspective you can take, but it has pretty undesirable results: it means you can't ever reformat HTML or XML source. And it very clearly contradicts how most people understand and expect to work with HTML.
> The default presentation of white space in HTML and XML is what is called tokenized space, which is that all consecutive white space characters are displayed as a single space character. Again, you can control this with CSS.
Where is the CSS option to display multiple consecutive spaces, but line break as normal? That's something very much wanted in a lot of cases (and what the author's &ncsp; idea is getting at).
Also, this doesn't make it clear what happens (or should happen) with consecutive white space characters that have different markup.
> White space is a text node in the DOM. If it is adjacent to other text that text and white space are one text node, otherwise the white space is its own text node.
What does "adjacent" mean here? And what happens if the white space is adjacent to text on both sides?
> Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"?
In the strictest sense? Yes.
Of course, we can build tools that make... call it "unsound assumptions"... and I'll happily use them and encourage their use, because you can make the correct judgement call that those assumptions should hold in your context (and that the one causing the assumptions to be broken, if they ever are, is the one "at fault" rather than the tools.)
On the other hand, if those same tools are then automatically applied beyond your control, there's a good chance those unsound assumptions will be broken, and become a source of pain and suffering for whatever strange - or not so strange - edge cases your own context comes with.
Whitespace isn't the only source of this problem - and it's one of the problems I have with WYSIWYG editors in general. Often, they don't clean up after themselves and leave behind a bunch of editor shrapnel, in part because they can't remove stuff that might technically be semantically inequivalent. Those same editors might also remove stuff I wanted to keep!
> Of course, we can build tools that make... call it "unsound assumptions"... and I'll happily use them and encourage their use, because you can make the correct judgement call that those assumptions should hold in your context (and that the one causing the assumptions to be broken, if they ever are, is the one "at fault" rather than the tools.)
That's a pretty bad way to standardise a data format IMO. If readers, writers, and tools all want these representations to be equivalent, far better to make that equivalence part of the standard - the point of the standard is to support the use cases, and being able to sensibly reformat HTML is far more valuable than being able to preserve a distinction that doesn't show up in any browser and most writers would never intend anyway.
The need for equivalence, if any, is in the parsing and not the visual presentation. HTML does not consider itself, according to its maintainers, to be a presentation format.
Which makes it decidedly unfortunate that you cannot determine whether a sequence of spaces is collapsible or not without consulting the presentation layer, even though this should be a semantic/parsing question.
Are there actual renderers that don't, and are there real users who consider that reasonable behaviour? I mean maybe someone somewhere has a spacebar heating workflow that relies on it, but file formats and standards should not add ways to shoot yourself in the foot if they can help it.
As this article shows, browsers actually collapse spaces differently based on the specific CSS applied - and this is in fact intended behavior, not some corner case.
Also, the output of HTML parsers is the HTML structure, and changing that to collapse spaces would break numerous tools. So while probably all HTML renderers do some kind of space collapsing, there are many other uses of HTML parsing that don't. Most likely the syntax highlighting in your HTML editor of choice in fact relies on a space-preserving HTML parser, just for one example.
> browsers actually collapse spaces differently based on the specific CSS applied
Sure. But they do all collapse spaces. I don't think anyone wants their browser to always preserve all the spaces that are in the source.
> and this is in fact intended behavior, not some corner case.
Eh maybe. They collapse the spaces of block elements like block elements and the spaces of inline elements like inline elements; that seems like the obvious thing that your renderer would do if you didn't make any deliberate design decision.
> So while probably all HTML renderers do some kind of space collapsing, there are many other uses of HTML parsing that don't. Most likely the syntax highlighting in your HTML editor of choice in fact relies on a space-preserving HTML parser, just for one example.
I very much doubt it. And even if it did, that would be an incredibly backwards reason to keep that behaviour - "we've spent all this effort working around our bad standard, that would be wasted if we fixed the standard".
Creating parsers which entirely ignore parts of the input is generally a bad idea, because you lose the ability to round-trip. That is, it's often a desirable property to have a way to go text1 -> DOM -> text2, and have text2 be identical to text1, or at least very close to it. This is particularly true for markup languages, which intermix text and tags.
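A rough sketch of that text1 -> parse -> text2 round trip, again with Python's `html.parser` standing in for a real DOM builder. `RoundTripper`, `round_trip`, and the `drop_blank` flag are hypothetical names for this example, and it assumes simple input: lower-case tags, no entities, comments, or self-closing tags.

```python
from html.parser import HTMLParser


class RoundTripper(HTMLParser):
    """Re-emits the parsed pieces so the output can be compared to the input."""

    def __init__(self, drop_blank=False):
        super().__init__(convert_charrefs=False)
        self.drop_blank = drop_blank  # simulate a parser that ignores blank text
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.drop_blank and data.strip() == "":
            return  # whitespace-only text is thrown away
        self.parts.append(data)


def round_trip(text1, drop_blank=False):
    parser = RoundTripper(drop_blank=drop_blank)
    parser.feed(text1)
    parser.close()
    return "".join(parser.parts)


text1 = '<div>\n  <a href="ABC">foo </a> bar\n</div>\n'
text2 = round_trip(text1)
lossy = round_trip(text1, drop_blank=True)

print(text2 == text1)  # True: keeping every text chunk round-trips
print(lossy == text1)  # False: the indentation and final newline are gone
```

Once the parser discards whitespace-only text there is no way to get the original source back, which is the property being argued for here.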
But somehow almost every programming language and data format manages to define these equivalences and have it not ruin their editors. JSON is whitespace-insensitive but syntax highlighting it in my editor works fine; I don't know or care what the parser implementation that accomplishes that is, but it's never caused any problems I've heard of.
I really don't get what you mean. HTML and JSON behave essentially the same way in relation to spaces. It's you who seems to be asking for the HTML parsers to apply display logic in the parsing step. And sure, JSON parsers discard whitespace information outside of JSON strings, but that only works because JSON has an explicit string type. In HTML everything is a user-visible string unless it's a tag, so the same logic fundamentally can't be applied.
In fact JSON is the perfect example - if you have multiple spaces or \n in a JSON string and load that into some DOM element with JS at runtime, those spaces will be eaten up just as much by the browser renderer as any spaces that were part of the original HTML. Because, again, HTML and even the DOM don't do any kind of space collapsing; only the browser render step does that, as instructed by CSS.
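For the JSON half of this, a small sketch with Python's standard `json` module (the sample documents are made up): whitespace between tokens makes no difference to the parsed value, while whitespace inside a string is data and survives parsing untouched.

```python
import json

# Same data, formatted differently; only whitespace between tokens differs.
compact = '{"text":"a  b\\nc"}'
spaced = '{\n    "text" :   "a  b\\nc"\n}'

print(json.loads(compact) == json.loads(spaced))  # True

# Whitespace inside the string is part of the value and is kept as-is.
value = json.loads(spaced)["text"]
print(repr(value))  # 'a  b\nc'

# Changing whitespace inside the string yields a different value.
changed = '{"text":"a b\\nc"}'
print(json.loads(changed) == json.loads(compact))  # False
```

Whether those two spaces and the newline are later shown as one space is then up to whatever renders `value`, not the JSON parser.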
> JSON parsers discard whitespace information outside of JSON strings, but that only works because JSON has an explicit string type. In HTML everything is a user-visible string unless it's a tag, so the same logic fundamentally can't be applied.
Well, sure. The point is that's an unfortunate design.
But it's a core part of the concept, the whole idea behind a markup language. Basically the whole point of HTML, and even of SGML before it, is that you are adding annotations in-line in a text, not representing a text as a tree-like data structure, at least for much of it.
> > All white space in HTML and XML is preserved verbatim.
> Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"? That's a perspective you can take, but it has pretty undesirable results: it means you can't ever reformat HTML or XML source. And it very clearly contradicts how most people understand and expect to work with HTML.
I don't think that's what grandparent meant. I read that HTML and XML do not impose any coalescing of whitespace. Whatever whitespace is read by a parser is accepted as such. Whether the whitespace has semantic value or not is not a concern for HTML or XML as data formats.
On the other hand, coalescing whitespace is a feature of HTML and XML renderers.
And you are correct: a tool that reformats whitespace inside a <verbatim> tag will output semantically wrong results (e.g. if the contents are Python code). Which supports the point: the semantics of whitespace are not determined by the HTML or XML data formats, but by the tools generating and consuming the data.
> What does "adjacent" mean here? And what happens if the white space is adjacent to text on both sides?
I think it is the simplest sense of adjacent: if the character at the previous or next position in the bytestream is considered text, then the whitespace character is part of the same text node; if it is considered anything else, it's a separate text node. This applies recursively, since whitespace is text itself. If you wanted to specify it very formally, you probably need to include some extra verbiage for escape sequences which represent text characters, but that's the only ambiguity I can think of.
> Any collapsing, uncollapsing, ignoring, and so on are handled by manipulating these nodes further. But this is the semantics of the HTML itself.
Well the post I replied to talked about the DOM, and as a description of the behaviour of the DOM I don't think your description is accurate - when you write `<a href="ABC">foo </a> bar` you only end up with one space character contained in a DOM text node, not two.
So yes, the DOM itself contains what I and the poster before mentioned. It's the presentation layer that decides how to do space collapse.
As a bonus, you can also see that the DOM of the page has four children of `body`: `<div>`, then a text node with the content "\n\n", `<script>`, and another text node containing "\n".
Tested all this with a simple HTML I saved on disk and opened in Firefox.
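The same check can be approximated outside a browser. This is a sketch using Python's `xml.dom.minidom`, which builds an XML DOM rather than a browser's HTML DOM; the `<p>` wrapper is only there because XML needs a single root element, and the regex is a deliberately crude stand-in for the default CSS collapsing, not the real algorithm.

```python
import re
from xml.dom.minidom import parseString

# The snippet from the discussion, wrapped in a hypothetical <p> root.
doc = parseString('<p><a href="ABC">foo </a> bar</p>')
link, after = doc.documentElement.childNodes

inside = link.firstChild   # text node inside <a>
print(repr(inside.data))   # 'foo '
print(repr(after.data))    # ' bar'

# Both spaces exist in the tree: one in each text node.
joined = inside.data + after.data
print(repr(joined))        # 'foo  bar'

# Crude model of the presentation step: collapse runs of white space.
rendered = re.sub(r"[ \t\n\r\f]+", " ", joined)
print(repr(rendered))      # 'foo bar'
```

So the tree holds two space characters in two text nodes, and only the rendering step turns them into one.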
> Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"?
In a very limited sense it is like this for most languages, at least ones where error traces include line:cols numbers or that give magic macros like php's __LINE__.
If you were to use that data in your logic, reformatting could break your code. Similarly, if you read the text content of your HTML and use its whitespace in your logic, reformatting could break your code. Aside from that, in most cases 1 whitespace character or 1000 are generally equivalent in HTML.
> In a very limited sense it is like this for most languages, at least ones where error traces include line:cols numbers or that give magic macros like php's __LINE__.
But generally people take the view that code that throws error traces with different line numbers can still be semantically equivalent, and that changing your code's behaviour depending on __LINE__ is unreasonable. Ultimately which files are considered equivalent will be a social convention, but it should be a social convention that fits the use cases and makes the file format easier to work with.