Re: Writing HTML parser wasn't as hard as I thought it'd be

Posted by Robert Maas, see http://tinyurl.com/uh3t on 04/22/07 17:48

> From: t...@sevak.isi.edu (Thomas A. Russ)
> it could also be argued that, especially early on, before web
> authoring tools existed, such laxity contributed to the
> widespread adoption of html. By making the renderer not
> particularly picky about the input, it made it easier for authors
> to hand create the html pages without the frustration of having
> things get rejected and not appear at all.

That part is fine, but what you say next isn't quite right...

> That provided a nicer development environment (somewhat
> reminiscent of Lisp environments), where things would work, even
> if not every part of the document were well-formed and correct.

There are two major aspects of Lisp environments, only one of which
is present in an HTML-coding/viewing environment:
- Tolerant of mistakes: one mistake doesn't abort compilation and
cause totally null output except for compiler diagnostics. TRUE
- Interactive R-E-P loop whereby you can instantly see the result
of each line of code as you write it, and after a mistake
immediately modify your attempt until you get it right before
moving on to the next line of code. NO!!
The interactive model for any Web-based service (HTML, CGI, PHP,
etc.) is very different from Lisp's (or Java's BeanShell's) R-E-P.
Web-based services always deal with an entire application, even if
some parts aren't yet written, either missing or stubbed to show
where they *would* be. The entire application is always re-started
each time you try one new line of code, and you must manually
search to the bottom of the output to see where it is, which is
more work for the human visual processing system than just watching
the R-E-P scroll by where your input is immediately followed by the
corresponding output. (And if you insert new code *between*
existing code, then it's even more effort to scroll to where the
new effect should be located to see how it looks when rendered.)

As an example of this difference without changing languages, I
write both R-E-P applications and CGI applications using Common
Lisp. Whenever I am writing a CGI application, it's a lot more
hassle, because of the totally different debugging environment. I'm
constantly fighting it somehow. Depending on the application, or
which part of the application I'm writing, I use one of two
strategies:
- If I'm writing a totally simple application, I copy an old CGI
launchpad (a .cgi file which does nothing except invoke CMUCL
with appropriate Lisp file to load) and change the name of the
file (the name of Lisp file to load), then I create a dummy Lisp
file which does nothing except make the call to my library
routine to generate CGI/MIME header for either TEXT/PLAIN or
TEXT/HTML, and print a banner, and exit. Then I immediately
start the Web browser to make sure I have at least that "hello
world" piece of trivia working at all before I go on. Then I add
one new line of code at a time and immediately re-load the Web
page, to force re-execution of the entire program up to that
point, and scroll if necessary to bring the result of the new
line of code on-screen. I include a lot of FORTRAN-style
debugging printouts to explicitly show me the result of each new
line of code even if that result wouldn't normally be shown
during normal running of the finished application. After a few
more lines of code have been added and FORTRAN-style
debug-printed out, I start to comment out some of the early
debug-print statements that I'm tired of seeing over and over.
This necessity to add a print statement for virtually every new
line of code is a big nuisance compared to the R-E-P loop where
that always happens by default, and commenting out the print
statements late is a nuisance compared to the R-E-P loop where
old printout simply scrolls off screen all by itself.
- Whenever I write a significant D/P (data-processing) algorithm,
to avoid the hassle described above, I usually develop the
entire algorithm in the normal R-E-P loop, then port it to CGI
using the above technique at the very end, so only the interface
from CGI and the toplevel calls to various algorithms need be
debugged in the CGI environment with FORTRAN-style print
statements etc. If the algorithm needs the results from a HTML
form, sometimes I first write a dummy application which does
nothing except call the library function to decode the
urlencoded form contents, then print the association list to
screen. Then I run it once, copy the association list from
screen and paste into the R-E-P debug environment. Then after
the algorithm using that data has been completely debugged, I
splice a call to the algorithm back into the CGI application and
finish up debugging there.
The point is that debugging in a Web-refresh-to-restart-whole-program
environment is so painful compared to R-E-P that I avoid it as much
as possible. But with HTML (or PHP), there's no alternative. There
simply is no way, that I know of anyway, to develop new code in a
R-E-P sort of environment.
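
A rough sketch of the two dummy programs described above (the "hello
world" launch test and the form-dump step), written in Python rather
than Common Lisp. The function names emit_header, decode_form and main
are hypothetical stand-ins, not the library routines mentioned in the
post, and the form data is assumed to arrive in QUERY_STRING (a GET
request).

```python
import os
import sys
from urllib.parse import parse_qsl


def emit_header(out, content_type="text/plain"):
    """Write the CGI/MIME header plus the blank line that terminates it."""
    out.write(f"Content-Type: {content_type}\r\n\r\n")


def decode_form(query_string):
    """Decode urlencoded form contents into a list of (name, value) pairs,
    roughly the counterpart of a Lisp association list."""
    return parse_qsl(query_string, keep_blank_values=True)


def main(environ, out):
    emit_header(out)
    # Banner: proves the launchpad and the script run at all.
    out.write("hello world: CGI skeleton is alive\n")
    # FORTRAN-style debug print: dump the decoded form so it can be
    # copied from the browser and pasted into an interactive session.
    out.write(repr(decode_form(environ.get("QUERY_STRING", ""))) + "\n")


if __name__ == "__main__":
    main(os.environ, sys.stdout)
```

The printed list can be pasted straight into an interactive Python
session, which is the same trick as pasting the association list into
the R-E-P.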

Now to be fair, in HTML nearly *every* line of code written (not
counting stylesheets, which are recent compared to "early HTML"
discussion here) produces some visual effect which is physically
located in the same relationships to other visual effects as the
physical relationship of the corresponding source (HTML) code. So
we never have to add extra "print" statements and later comment
them out. At most we might sometimes have to add extra visual
characters around white space just to show whether the white space
is really there, since white space at end of line doesn't show
visually. But still, the need to type input in one window and then
switch to another window and deliberately invoke a page-reload
command and then wait for a network transaction (even if working on
local server) before seeing the result, and *not* seeing the source
code and visual effect together on one screen where the eye can
dart back and forth to spot what mistake in source caused the bad
output, is a significant pain during development, whereby your glib
comparison between HTML code development and Lisp R-E-P code
development just isn't true.

Now if somebody could figure out a way to "block" pieces of HTML
code so that it would be possible for a development environment to
alternate showing source code and rendered output within a single
window, and in fact the programmer could type the source directly
onto this intermix-window, either typing a new block of code at the
bottom, or editing an old block of code, that would make it like
the Lisp R-E-P. But then since HTML is primarily a visual-effect
language, and what is really being debugged is the way text looks
nice laid out on a page, the interspersed source would ruin the
visual effect and in some ways make debugging more difficult. So
maybe instead it could use a variation of the idea whereby the main
display screen shows exactly the rendered output, but aside it is
the source screen, with blocks of code mapped to blocks of
presentation via connecting brackets, somewhat like this:

PRESENTATION                     SOURCE
                         ----+   +----
Hi, this is a paragraph      |   | <p>Hi,
of rendered text, all        +---+ this is a paragraph of rendered text,
nicely aligned. I wonder     |   | all nicely aligned.
if it will work?             |   | I wonder if it will work?</p>
                         ----+   +----

But of course, although that might help today's HTML authors if
somebody created such a tool, no such tool existed back in the
early days we're talking about here, so my argument about the pain
of HTML coding compared to Lisp R-E-P stands.

(Also, it may be difficult to work with tables using the idea of
sequential blocks of HTML source, in fact the whole idea may be
useless for such "interesting" (in Chinese sense) coding.)
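
A crude sketch of the block mapping pictured above, in Python. It
assumes the page consists only of well-formed, non-nested <p>...</p>
blocks, and it approximates "rendered" text by stripping tags and
collapsing white space. It makes no attempt at tables, which matches
the caveat just given. The names pair_blocks and rendered_text are
hypothetical.

```python
import re
from html import unescape

# One block = one <p ...> ... </p> element (assumed not to nest).
BLOCK_RE = re.compile(r"<p\b[^>]*>.*?</p>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def rendered_text(source_block):
    """Approximate the displayed text: drop tags, decode entities,
    and collapse runs of white space the way a browser would."""
    text = unescape(TAG_RE.sub("", source_block))
    return " ".join(text.split())


def pair_blocks(html):
    """Return (presentation, source) pairs, one per paragraph block,
    in document order."""
    return [(rendered_text(m.group(0)), m.group(0))
            for m in BLOCK_RE.finditer(html)]
```

A real tool would still need to draw the two columns and the
connecting brackets; this only produces the pairing.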

> The author could then go back and fix the places that didn't
> work.

Which is rather different from Lisp R-E-P development, where you
hardly ever have to go *back* to fix stuff that didn't work, rather
you fix it immediately while it's still the latest thing you wrote.
If you try to write a whole bunch of Lisp code without bothering to
test each part individually, and *then* you try to run the whole
mess, what happens is similar to what happens when programming in
C, the very first thing that bombs out causes nothing after it to
be properly tested at all. This is a significant difference between
HTML (and other formatting languages, where the various parts of
the script are rather independent), vs. any programming language
where later processing is heavily dependent on earlier results.

Now for a real bear, try PHP: It works *only* in a Web environment,
so you can't try it in an interactive environment as you could with
Lisp or Perl, but it's a true programming language, where later
processing steps are heavily dependent on early results, so you
can't just throw together a lot of stuff (as with HTML) and debug
all the independent pieces in any sequence you want. You are
essentially forced to use that painful style of development I
described as the first (least preferred) style of CGI programming.

Back to the main topic: One thing, for the early days, which might
have bridged the gap between sloppy first-cut HTML where the
browser guesses what you really meant (and different browsers guess
differently) and good HTML, would be a way of switching "pedantic"
mode off and on. But hardly any C programmers ever use the pedantic
mode, so why should we expect HTML authors to do so either??
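
To make the "pedantic" switch concrete, here is a minimal sketch in
Python using the standard html.parser module. It checks only tag
nesting, which is far less than real validation. The class TagChecker,
the function check and the short VOID list are assumptions for
illustration, not any browser's actual behaviour.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (deliberately incomplete list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagChecker(HTMLParser):
    def __init__(self, pedantic=False):
        super().__init__()
        self.pedantic = pedantic
        self.stack = []      # currently open tags
        self.problems = []   # collected only in lax mode

    def _report(self, message):
        if self.pedantic:
            raise ValueError(message)   # reject at the first mistake
        self.problems.append(message)   # tolerate it and carry on

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # <br/> style tags open and close at once

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self._report(f"unexpected </{tag}>")

    def close(self):
        super().close()
        for tag in reversed(self.stack):
            self._report(f"unclosed <{tag}>")
        self.stack.clear()


def check(html, pedantic=False):
    """Return the list of nesting problems; raise instead if pedantic."""
    checker = TagChecker(pedantic=pedantic)
    checker.feed(html)
    checker.close()
    return checker.problems
```

With pedantic off, check("<p><b>bold</p>") returns a list of
complaints and the author can keep going; with pedantic on, the same
input raises at the first complaint.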

The bottom line is that there's a conflict between ease of
first-cut authoring that made HTML so popular in the early days,
and strict following of the specs to make proper HTML source, and I
don't see any easy solution. Maybe the validation services (such as
W3C provides), together with a "hall of shame" for the worst
offenders at HTML that grossly fails validation, would coerce some
decent fraction of authors to eventually fix their original HTML to
become proper HTML?? (Or maybe Google could do validation on all
Web sites it indexes, and demote any site that fails validation, so
it doesn't show up in the first page of search results, and the
more severely a Web page fails validation the further down the
search results it's forced? If Google can fight the government of
the USA regarding invasion of privacy of users, maybe they can try
my idea here too?? Google *is* the 800 pound gorilla of the Web,
and if they applied reward/punishment to good/bad Web authors, I
think it would have a definite effect. Unfortunately, Google is one
of the worst offenders, as I noted the other day. Never mind...)

Anybody want to join me in building a Hall of Shame for HTML
authors, starting with Google's grossly bad HTML (declared as
transitional XHTML which is totally bogus, ain't even close to
XHTML)?

 
