Re: Writing HTML parser wasn't as hard as I thought it'd be — HTML


Posted by John Thingstad on 04/20/07 12:05

On Fri, 20 Apr 2007 09:48:18 +0200, Robert Maas, see http://tinyurl.com/uh3t <rem642b@yahoo.com> wrote:

>
> Anyway, after spending an hour single-stepping it all, and finding
> it working perfectly, I had a DOM (Document Object Model)
> structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
> then of course I prettyprinted it to disk. Have a look if you're
> curious:
> <http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt>
> Any place you see a :TAG that means an opening tag without any
> matching close tag. For <br>, and for the various <option> inside a
> <select>, that's perfectly correct. But for the other stuff I
> mentioned such as <b> and <font> that isn't valid HTML and never
> was, right? I wonder what the w3c validator says about the HTML?
> <http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fadvanced_group_search%3Fhl%3Den>
> Result: Failed validation, 707 errors
> No kidding!!! Over seven hundred mistakes in a one-page document!!!
> It's amazing my parser actually parses it successfully!!
> Actually, to be fair, many of the errors are because the doctype
> declaration claims it's XHTML transitional, which requires
> lower-case tags, but in fact most tags are upper case. (And my
> parser is case-insensitive, and *only* parses, doesn't validate at
> all.) I wonder, if all the tags were changed to lower case, how
> many fewer errors would show up in the w3c validator? Modified GG page:
> <http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
> <http://validator.w3.org/check?uri=http%3A%2F%2Fwww.rawbw.com%2F%7Erem%2FNewPub%2Ftmp-ggadv.html>
> Result: Failed validation, 693 errors
> Hmmm, this validation error concerns me:
> 145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
> NO was specified.
> My guess is some smartypants at Google thought it'd make good P.R.
> to declare the document as XHTML instead of HTML, without realizing
> that the document wasn't valid XHTML at all, and the DTD used was
> totally inappropriate for this document. Does anybody know, from
> eyeballing the entire WebPage source, which DOCTYPE/DTD
> declaration would be appropriate to make it almost pass
> validation? I bet, with the correct DOCTYPE declaration, there'd
> be only fifty or a hundred validation errors, mostly the kind I
> mentioned earlier which I discovered when testing my new parser.
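
The :TAG bookkeeping described in the quoted message (an opening tag with no matching close tag) can be sketched roughly as follows. This is Python using the standard html.parser module, not the CMUCL parser from the post; the names UnclosedTagFinder, find_unclosed and the VOID set are made up for illustration, and optional end tags such as </option> are deliberately not handled, so they would be reported too.

```python
from html.parser import HTMLParser

# Elements that never take a close tag (a partial list, assumed for this sketch).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class UnclosedTagFinder(HTMLParser):
    """Collects opening tags that never get a matching close tag."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open tags still waiting for a close tag
        self.unclosed = []  # tags that were opened but never closed

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so matching is case-insensitive.
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag: ignore it
        # Pop back to the matching open tag; everything above it was left open.
        while self.stack:
            top = self.stack.pop()
            if top == tag:
                break
            self.unclosed.append(top)

    def close(self):
        super().close()
        # Whatever is still on the stack at end of input was never closed.
        self.unclosed.extend(reversed(self.stack))
        self.stack.clear()


def find_unclosed(html):
    """Return the unclosed tag names in the order they were detected."""
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    return finder.unclosed
```

For example, find_unclosed('<P><B>bold<BR>text</P>') reports the <b> but not the <br>, which matches the distinction drawn in the quoted text.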

As an ex-employee of Opera I can say that writing a web browser is hard!
It is not so much the parsing of correct HTML as the parsing of incorrect
HTML that poses the problem. Let's face it: it could be simple
if we all used XHTML and the browser aborted with an error message
when an error occurred. Unfortunately that is hardly the case.
SGML-based HTML is more difficult to parse. Then there is the fact that many
sites rely on errors in the HTML being handled just like in
Microsoft Internet Explorer. I can't count the number of times I heard that
Opera was broken, only to find that it was an HTML error on the web site that
Explorer got around.
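
To make the contrast concrete, here is a small sketch of the two policies: abort on the first error, as an XML parser must, versus carry on regardless, as browsers do with tag soup. It uses Python's standard xml.etree.ElementTree and html.parser modules; the SAMPLE string and the helper names parse_strict, parse_lenient and TagCollector are invented for this example, and it says nothing about how Opera or Explorer actually recover from errors.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# <br> is never closed: acceptable in HTML, a fatal error in XML.
SAMPLE = "<p>line one<br>line two</p>"


def parse_strict(markup):
    """XML-style policy: the first well-formedness error aborts the parse."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as err:
        return f"aborted: {err}"


class TagCollector(HTMLParser):
    """Tag-soup policy: record every opening tag and never raise on bad nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_lenient(markup):
    """Return the opening tags seen, however badly the markup is nested."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(parse_strict(SAMPLE))   # reports that the parse was aborted
print(parse_lenient(SAMPLE))  # still lists the tags it saw
```

The strict version is the easy one to write. The hard part is everything the lenient version does not show: deciding what tree to build from the broken input so that pages look the way their authors saw them in Explorer.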

