Posted by Jeff on 01/29/08 18:33
Andy Dingley wrote:
> On 29 Jan, 15:38, Jeff <jeff@spam_me_not.com> wrote:
>>> _Inside_ a CMS, there are strong arguments for using XHTML (or at
>>> least, some XML that shares the same XML schema)
>>> On publishing from a CMS, it's usually easier to serve the document to
>>> the web as HTML than it is as XHTML
>> That has been my thinking. To create it as xhtml and serve it as html.
>> That leaves me with either fixing the stray bits like <br /> or just
>> ignoring them as the browsers do anyway. I suppose I should fix them...
> In general, you just shouldn't even think about writing output
> Really. Stop it right now.
> Why do you need to write a "serialiser" at all? You're probably
> working with XML, from a well-known language. In which case, there's a
> range of XML DOMs to choose from and they're already written for you.
> In particular, skilled people have worked hard to write standards-
> compliant serialiser methods on them, including support for varying
> character encodings. These things work. They work better than most
> people have the skill to duplicate. They take more time to write again
> than the competent people can afford.
> It's like cryptography. There's only half-a-dozen people who should
> ever be allowed to write the components, one of them's insane, one's
> Finnish, two are academics with scary hair, and the others are kept in
> a locked cupboard by the NSA. The rest of us should only ever re-use
> these components, not re-invent them.
> So your serialiser should understand the difference between SGML and
> XML, or at least HTML and XML. Then you tell it which, and it just
> works. If it doesn't, it's broken (so why trust it to get anything
Well, we seem to have diverged somewhere. I'm not converting html to xml
or vice versa.
All I'm doing is taking the CMS data and outputting it as html. That's
pretty easy, as the CMS is nothing but a collection of heading, paragraph,
image, list, class... objects, one set of those after another. That's
all html is anyway. It has to write correct XHTML-style html, because
the author is not creating paragraphs and lists but merely filling them
in. I understand that many CMSs have an editor component that edits
like a "word" doc. I've always thought that was wrong.
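To make that concrete, here is a rough Python sketch of what "one set of objects after another" could look like. The block kinds, the tuple layout and the names render_block and render_page are all made up for illustration; the actual CMS object model isn't shown here. It also escapes everything the author typed, which is a simplification.

```python
from html import escape

# Hypothetical block model: (kind, content) or (kind, content, css_class).
# The real CMS objects are not shown in this thread; this is only a sketch.
def render_block(kind, content, css_class=None):
    """Render one CMS block as an HTML fragment."""
    attr = f' class="{escape(css_class, quote=True)}"' if css_class else ""
    if kind == "heading":
        return f"<h2{attr}>{escape(content)}</h2>"
    if kind == "paragraph":
        return f"<p{attr}>{escape(content)}</p>"
    if kind == "list":
        # content is a sequence of item strings
        items = "".join(f"<li>{escape(item)}</li>" for item in content)
        return f"<ul{attr}>{items}</ul>"
    if kind == "image":
        # content is the image URL; HTML syntax, so no trailing slash
        return f'<img{attr} src="{escape(content, quote=True)}" alt="">'
    raise ValueError(f"unknown block kind: {kind}")

def render_page(blocks):
    """Render the blocks in order, one fragment per line."""
    return "\n".join(render_block(*block) for block in blocks)
```

Since the author only fills in blocks, the structure always comes from the renderer, which is why the output can be kept well-formed.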
There are two issues that arise. The first is what to do with
linebreaks in paragraphs and headings. Generally the author expects to
see those as newlines, so I convert them to <br>s, with an option not to.
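Something along these lines, as a minimal sketch in Python (the function name and the convert flag are just illustrative, not the actual CMS code):

```python
import re

def newlines_to_br(text, convert=True):
    """Turn author linebreaks into <br> tags, unless the option is off."""
    if not convert:
        return text
    # Handle Windows (\r\n), old Mac (\r) and Unix (\n) line endings alike.
    return re.sub(r"\r\n|\r|\n", "<br>\n", text)
```

The "\r\n" alternative comes first so a Windows line ending becomes one <br> rather than two.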
The other is what to do with extraneous markup the author adds. In
RSS this is no problem, as it is escaped. Otherwise it's not hard to
ensure that tags are nested correctly and closed properly: it's just a
couple of greedy regexes that check each outside pair of tags to see
that they match. That only leaves single tags such as <br> that the
author may add, and that is roughly what I was asking about. But it's at
worst just another regex to remove trailing "/"s in tags.
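Roughly what I have in mind, sketched in Python. The trailing-slash part is a single regex as described. For the nesting check I've used a small stack here instead of greedy regexes, because it is easier to show in a few lines; that is a swapped-in technique for illustration, not what the CMS does. The names and the VOID_TAGS list are assumptions, and attribute values containing ">" are not handled.

```python
import re

# Assumed (partial) list of tags that never take a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")

def strip_trailing_slashes(markup):
    """Rewrite XHTML-style <br /> or <img ... /> as <br> or <img ...>."""
    return re.sub(r"<([a-zA-Z][^<>]*?)\s*/>", r"<\1>", markup)

def tags_balanced(markup):
    """Return True if every non-void tag is closed in the right order."""
    stack = []
    for closing, name in TAG_RE.findall(markup):
        name = name.lower()
        if name in VOID_TAGS:
            continue  # single tags such as <br> need no partner
        if closing:
            if not stack or stack.pop() != name:
                return False  # stray or mis-nested closing tag
        else:
            stack.append(name)
    return not stack  # anything left over was never closed
```

So strip_trailing_slashes('a<br />b') gives 'a<br>b', and tags_balanced('<b><i>x</b></i>') is False.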
> If you aren't using some sort of intermediate DOM (i.e. direct
> document.write()s) then fix that first. _Especially_ for anything
> resembling XML.
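For what it's worth, here is the sort of thing that seems to be meant, as a small Python sketch using the standard library's xml.etree.ElementTree (the choice of library and the sample paragraph are mine, not from this thread). You build the tree once and tell the serialiser which syntax you want:

```python
import xml.etree.ElementTree as ET

# Build a tiny document tree: <p>first line<br/>second line</p>
para = ET.Element("p")
para.text = "first line"
br = ET.SubElement(para, "br")
br.tail = "second line"

# Same tree, two serialisations: the serialiser decides how <br> is written.
as_xml = ET.tostring(para, encoding="unicode", method="xml")
as_html = ET.tostring(para, encoding="unicode", method="html")

print(as_xml)   # <p>first line<br />second line</p>
print(as_html)  # <p>first line<br>second line</p>
```

With method="html" the empty element comes out as a plain <br>, so there is no trailing "/" to clean up afterwards.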
> Here's a spare Clue: If you come to my project again and tell me that
> "We need to write a new serialiser from scratch, because the standard
> one doesn't work because "Our Project Am Spesshull.", then you'll get
> a swift dose of Clueiron justice. I've had this imposed on me three
> times now, all incorrectly. First time was dumb, second time was dumb,
> a big project and disastrous, third time I just blew the bastard thing
> clean out of the source repository (and we all lived happily ever
> Mostly people think this in the first place because they don't grok
> character encodings.