I'm really hesitating to comment. It's way too easy for us types to miss really good ideas due to our preconceptions. But it's hard for me to look past the fact that this looks like a really bad abstraction layered on top of the structured document abstraction that was designed specifically to suit NPR's purpose.
For example, they show the <em> tag being stored by position in the database separately from the content itself, seemingly unaware of the significance that <em> is a semantic tag. It means emphasis. If the iPod doesn't like <em> for emphasis, well then that's why it's a semantic tag. You can remove semantic tags or replace them with alternative markup or binary formats as needed because they have meaning designed for parsing and formatting.
It's all well and good to hate on XML for inappropriate uses. But, critical point here, documents are what XML is for! And there is a shocking wealth of libraries in just about every language available today that will handle document markup just fine.
It seems to me to be a quintessential reinvention of the wheel, and the XML hating has just gone on too far at this point.
It's a lot cheaper to insert your markup by position than to parse the XML on every read. Of course, caching would mitigate much of that, but still I admire the design.
You wouldn't even have to parse the XML -- just anoint one form the canonical internal representation and then have filters which turn the internal representation into other formats.
For example, if you store all documents in a restricted version of HTML, then you can generate a text-only version by replacing all <em> and </em> with a star, etc. This is much cheaper than actually parsing XML (no need to build up a data structure representation of the document, just assume it is mostly right already and apply textual transformations).
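To make the filter idea concrete, here is a rough sketch in Python. The tag set and the replacement strings are my own assumptions for illustration, not anything NPR has described, and it only works if the stored markup really is restricted to these few tags.

```python
import re

# Hypothetical filter table: the tags and replacements are illustrative only.
# Each entry is applied as a plain textual substitution, with no parsing step.
TEXT_FILTERS = [
    (re.compile(r"</?em>"), "*"),        # emphasis becomes *stars*
    (re.compile(r"</?strong>"), "**"),   # strong emphasis becomes **double stars**
    (re.compile(r"</p>\s*"), "\n\n"),    # a closing paragraph becomes a blank line
    (re.compile(r"<p>"), ""),            # an opening paragraph is simply dropped
]


def to_plaintext(restricted_html: str) -> str:
    """Turn the canonical restricted-HTML form into plain text by substitution."""
    text = restricted_html
    for pattern, replacement in TEXT_FILTERS:
        text = pattern.sub(replacement, text)
    return text.strip()


print(to_plaintext("<p>This is <em>really</em> cheap.</p>"))
```

The trade-off is exactly the one stated above: this trusts the input to be mostly right, so it is only as safe as the rules enforced when the article is written.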
Then, once you have the plaintext version, cache it. Journalism sites are going to have an absurd number of reads to writes. (One guy writes the article, one guy edits it, 50,000 people see it within a minute of it being posted.) Caching is cheap and easy. Caching is robust against the article changing, too -- you purge the cache (or potentially just let it expire) and bam your new article is mostly right, whereas tracking the offset of <em>s seems to intuitively have a lot of possible failure modes to me. (For example, the first time one of your reporters tries to add a bit of local color with Japanese text in an article, I'll bet dollars to donuts that every tag after the Japanese breaks and the surrounding content gets corrupted.)
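Here is a small sketch of the failure mode I have in mind. It assumes (and this is purely my guess, the article doesn't say) that offsets get computed in characters by one component and applied to UTF-8 bytes by another. The function names are made up for the example.

```python
def insert_tags(text: str, tags):
    """Insert (offset, tag) pairs into a str, where offsets count code points.

    Tags are applied back to front so that earlier offsets stay valid.
    """
    for offset, tag in sorted(tags, reverse=True):
        text = text[:offset] + tag + text[offset:]
    return text


def insert_tags_bytes(data: bytes, tags):
    """Same logic, but the offsets are (wrongly) treated as byte positions."""
    for offset, tag in sorted(tags, reverse=True):
        data = data[:offset] + tag.encode("utf-8") + data[offset:]
    return data


content = "東京 is big"
# Offsets measured in characters: "big" spans positions 6 to 9.
tags = [(6, "<em>"), (9, "</em>")]

print(insert_tags(content, tags))
print(insert_tags_bytes(content.encode("utf-8"), tags).decode("utf-8"))
```

With pure ASCII both functions agree. As soon as the article contains multi-byte characters, the byte-based version puts the tags in the wrong place, and with less lucky offsets it can split a character in half.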
That just seems really fragile to me. Yank one space off the front of any article and the entire thing is jacked. I'm just not convinced that it's all that terrible to parse the markup on the way out if you need it transformed, while enforcing strict rules on what (valid) markup can go in.
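For what it's worth, enforcing strict rules on the way in doesn't take much. Below is a minimal sketch using Python's standard html.parser; the whitelist and the "no attributes" rule are assumptions I picked for the example, not rules from the article.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "em", "strong"}  # hypothetical whitelist


class StrictMarkupChecker(HTMLParser):
    """Collects errors for disallowed tags, attributes, and bad nesting."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.errors = []  # human-readable problems found so far

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.errors.append(f"tag not allowed: <{tag}>")
        if attrs:
            self.errors.append(f"attributes not allowed on <{tag}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # The closing tag must match the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.errors.append(f"unbalanced </{tag}>")


def validate(markup: str):
    """Return a list of problems; an empty list means the markup is accepted."""
    checker = StrictMarkupChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors + [f"unclosed <{t}>" for t in checker.stack]


print(validate("<p>ok <em>fine</em></p>"))
print(validate("<em>oops"))
```

Reject anything with a non-empty error list at write time, and the read path can assume the stored markup is well formed.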
I agree. The majority of their concerns can be addressed by parsing out markup on the way into the DB and storing the markup-free version in a separate location. This is a lot of acrobatics for marginal return.
I'm especially doubtful of it, because the article doesn't explain the problem with the obvious approach you suggested.
I feel bad to be so negative toward an idea I haven't fully explored, but I get the impression that the author of this article is trying hard to justify a bad approach rather than fixing the code.
Either way, you would have to parse it into something equivalent to what they are storing in the database so you could handle the various target formats individually.
As long as you tightly control writes to the database, I don't see the problem. And, you don't have the overhead of parsing on every read. For an environment where reads will be a few orders of magnitude greater than writes, it does make some sense.
It seems incredible that they can't sacrifice some speed to parse on every read, and yet are also unwilling to sacrifice some disk space to cache several fully parsed versions.
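A rough sketch of what caching the rendered versions could look like, with a purge on write. The class and method names are invented for illustration, and a real site would presumably use memcached or similar rather than an in-process dict.

```python
from typing import Callable, Dict, Tuple


class RenderCache:
    """Stores canonical markup and caches one rendered copy per output format."""

    def __init__(self, renderers: Dict[str, Callable[[str], str]]):
        self.renderers = renderers                   # format name -> render function
        self.source: Dict[int, str] = {}             # article id -> canonical markup
        self.cache: Dict[Tuple[int, str], str] = {}  # (article id, format) -> output
        self.render_count = 0                        # how often we actually rendered

    def write(self, article_id: int, markup: str) -> None:
        """Save a new version and purge every cached rendering of that article."""
        self.source[article_id] = markup
        for key in [k for k in self.cache if k[0] == article_id]:
            del self.cache[key]

    def read(self, article_id: int, fmt: str) -> str:
        """Render on the first read after a write, then serve from the cache."""
        key = (article_id, fmt)
        if key not in self.cache:
            self.render_count += 1
            self.cache[key] = self.renderers[fmt](self.source[article_id])
        return self.cache[key]


cache = RenderCache({"shout": str.upper})
cache.write(1, "breaking news")
for _ in range(50000):
    cache.read(1, "shout")
print(cache.render_count)
```

With a read-heavy workload the render cost is paid once per edit per format, which is why parsing on the way out doesn't strike me as the bottleneck.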
Considering how quickly CPU and disk performance change, this type of tightrope-balanced optimization seems crazy. Or maybe they're way smarter than me.
This is clearly working for them. What is interesting about this is that it puts a vote on the SGML idea that content and presentation are nicely separated and identified by the tags. But this has always faintly concerned me, as to whether this really will work.
As an example, madair notes in an earlier comment that em is intended to be a semantic tag. But do the semantics of this particular tag ever mean anything outside of "show this text as emphasized"?
I am wondering if whatever semantic markup we do is still all about presentation, and takes a fairly arbitrary abstraction of what the text is to mean.
It's a twist, but if we say that an iPod cannot emphasize text, then we can instruct the system to ignore <em> tags. Because the semantics of <em> are clearly defined, a judgment call is available to us to feel confident that we can ignore the tag completely under the circumstances. So we utilize the tag itself still by instructing the system to swallow it.
See, this is what I find a bit unclear. The concept is that presentation is distinct from content. But if we say that the semantics of the em tag are to affect the presentation, they no longer seem to be separated, right?
At a certain point the semantics of the content get translated into an output format. In a browser it's italic. In an iPod it's ignored.
The semantic tags are the means, but output presentation is still the end, and there's certainly no reason to expect that all devices can take the original semantic markup and use CSS or some other mechanism to present it. And because of the semantic meaning we can make the em disappear with confidence.
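As a sketch of what I mean, here is the per-device decision written out in Python. The target names and the chosen renderings are assumptions for the example; the point is only that the same semantic tag maps to a different output, or to nothing, depending on the device.

```python
# Hypothetical mapping from output target to how emphasis is presented there.
EM_RENDERINGS = {
    "browser": ("<i>", "</i>"),  # shown as italics
    "plaintext": ("*", "*"),     # shown with surrounding stars
    "ipod": ("", ""),            # cannot emphasize, so the tag is swallowed
}


def render_em(markup: str, target: str) -> str:
    """Replace <em>...</em> with whatever the target uses for emphasis."""
    open_marker, close_marker = EM_RENDERINGS[target]
    return markup.replace("<em>", open_marker).replace("</em>", close_marker)


story = "Officials said the bridge is <em>not</em> closed."
for target in EM_RENDERINGS:
    print(target, "->", render_em(story, target))
```

Dropping the tag for the iPod loses the emphasis but never corrupts the sentence, which is the confidence the semantic meaning buys you.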