Re: Special characters in attributes — HTML


Posted by Jukka K. Korpela on 09/19/07 13:24

Scripsit SDG:

> Hi, I'm writing a web scraper to extract text from a web page,

Sounds like reinventing the wheel. Do you intend to reinvent it from
scratch, or are you using some software package for parsing HTML?

> and I
> need to know what characters can be present inside an attribute of a
> tag.

Apparently you are not using some software package for parsing HTML. Do you
really think you are competent enough to consider SGML parsing, XML parsing,
and tagsoup parsing, including their conflicts?

> So far, in the code of my program, I've written that attributes can
> contain these characters: '!=@/ \[]#.:_()-&;?

What an interesting set of characters. I think it's probably the set you
found lying on your keyboard, excluding - for some odd reason - letters and
digits. And you didn't notice e.g. the poor lonesome "+" or the
innocent-looking "$".

> Did I forget something?

Oh, just about 1,000,000 characters. (I'm not kidding. The character set of
HTML is defined as UCS, commonly known as the Unicode character set, though
more formally the ISO 10646 set. Currently only about 100,000 code points
have been allocated, but can you disallow, in HTML parsing, the unassigned
code points? Hardly.)
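
To put rough numbers on this, here is a minimal Python sketch. It assumes
Python 3 and its bundled unicodedata module; the names CODE_SPACE and
count_assigned are made up for this illustration. The size of the code space
is fixed, but the count of assigned code points depends on which Unicode
version your Python ships with, so treat that figure as indicative only.

```python
import sys
import unicodedata

# Total number of code points in the Unicode / ISO 10646 code space
# (U+0000 through U+10FFFF).
CODE_SPACE = sys.maxunicode + 1


def count_assigned():
    """Count code points whose general category is not 'Cn' (unassigned).

    The result varies with the Unicode database bundled with the
    interpreter, so no particular value is assumed here.
    """
    return sum(
        1
        for cp in range(CODE_SPACE)
        if unicodedata.category(chr(cp)) != "Cn"
    )


if __name__ == "__main__":
    print("code space:", CODE_SPACE)
    print("Unicode database version:", unicodedata.unidata_version)
    print("code points not categorized as unassigned:", count_assigned())
```

Whatever the second number turns out to be, a scraper that whitelists a
couple of dozen punctuation characters is not even close.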

> I've looked if there's an official
> specification (like a regular expression for HTML or even only for
> attributes), but so far I haven't found anything.

There are several official specifications for HTML. Didn't you know this?
The character repertoire allowed inside an attribute value depends on the
declaration of the attribute, but it can be CDATA, i.e. arbitrary character
data, excluding just the string delimiter (" or ') and, with some variation
between HTML versions, the ampersand character & as such in many or all
contexts. So the question is what can and needs to be _excluded_ (or, better,
treated as markup errors).
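
As a sketch of the alternative to a hand-made character whitelist, the
following uses html.parser from the Python standard library, which is a
lenient, tagsoup-style parser and not an SGML or XML parser. The names
AttrCollector, collect_attributes and sample are hypothetical, chosen only
for this example. It shows that a quoted attribute value may carry "+", "$",
a non-ASCII character and even the other kind of quote, and that the parser
hands you the value with the &amp; reference already decoded.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute name, attribute value) for every start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # removed the quotes and replaced character references in the value.
        for name, value in attrs:
            self.found.append((tag, name, value))


def collect_attributes(markup):
    """Return all attributes found in the given markup string."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


# \u20ac is the euro sign; the title value is delimited by single quotes,
# so it may contain double quotes as ordinary data.
sample = '<a href="page?a=1&amp;b=2" title=\'\u20ac + $ "quoted"\'>link</a>'

if __name__ == "__main__":
    for tag, name, value in collect_attributes(sample):
        print(tag, name, repr(value))
```

Nothing in this code lists allowed characters. The only things treated
specially inside the value are the delimiter in use and the ampersand.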

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
