Reply to Re: Regex to get the <html></html> — PHP Programming Language

Posted by gosha bine on 08/02/07 17:36

FFMG wrote:
> Hi,
>
> I want to get the <head> code, and a 'simple?' solution seems to
> be...
>
> preg_match_all("/<[html]+[^>]*>\s*(.*\s*)<\/html>\s*/i", $html,
> $matches, PREG_SET_ORDER);
>
> but I want to make sure that there isn't a better solution to the
> problem especially if the head contains invalid code like...
>
> //--
> <head>
> <meta name="description" content="<head></head>" />
> </head>
> //--
>
> unfortunately this is my html code so I cannot ignore invalid <head>
> like the one above.
>
> So...
> How can I change my regex to ignore head tags inside double or single
> quotes?

I'd suggest

$re = <<<HTML
~
<\w+ \b
(?: " [^"]* " | ' [^']* ' | [^"'>]+ )*
>
| </ \w+ >
| [^<]+
| <
~six
HTML;

This should be able to parse most HTML or HTML-like streams, even
hopelessly malformed ones.
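
For readers who want to see what each branch does, here is an annotated copy of the same pattern. It is only a sketch: the comments are my reading of the regex, the sample string is made up, and the pattern is written in a single-quoted string (rather than the heredoc above) so the quotes need escaping.

```php
<?php
// Annotated copy of the pattern. The x flag lets whitespace and
// "#" comments sit inside the regex itself.
$re = '~
    <\w+ \b              # opening tag: "<" followed by the tag name
    (?:
        " [^"]* "        # a double-quoted attribute value, may contain < and >
      | \' [^\']* \'     # a single-quoted attribute value
      | [^"\'>]+         # anything else up to the next quote or ">"
    )*
    >                    # end of the opening tag
  | </ \w+ >             # closing tag
  | [^<]+                # a run of plain text
  | <                    # a stray "<" that starts no tag
~six';

// Made-up sample: a ">" inside a quoted attribute and a bare "<" in text.
$sample = '<a href="x>y">link</a> 1 < 2';
preg_match_all($re, $sample, $m);
print_r($m[0]);
```

Because the quoted-value branches are tried before the "anything else" branch, a ">" inside quotes does not end the tag, and the last branch keeps the tokenizer moving when it meets a "<" that opens nothing.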

This is how it works with your example:

///

$html = 'text <head>
<meta name="description" content="<head></head>" />
</head> more text';

preg_match_all($re, $html, $m);
print_r($m[0]);

///

output:

Array
(
[0] => text
[1] => <head>
[2] =>

[3] => <meta name="description" content="<head></head>" />
[4] =>

[5] => </head>
[6] => more text
)
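
To get from this token list back to the original question (the contents of <head>), one option is to walk the tokens and collect everything between the opening and closing tag. The sketch below assumes that approach; extract_element() is a hypothetical helper name, not something from the post or from PHP itself, and it does not handle nested elements of the same name.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// returns the text between the first <$tag ...> and the next </$tag>,
// or null if no complete element is found.
function extract_element($html, $tag)
{
    // Same tokenizer pattern as in the post, single-quoted here.
    $re = '~
        <\w+ \b
        (?: " [^"]* " | \' [^\']* \' | [^"\'>]+ )*
        >
        | </ \w+ >
        | [^<]+
        | <
    ~six';

    preg_match_all($re, $html, $m);

    $name   = preg_quote($tag, '~');
    $inside = false;
    $body   = '';

    foreach ($m[0] as $token) {
        if (!$inside) {
            // Look for a token that is an opening tag with this name.
            if (preg_match('~^<' . $name . '\b~i', $token)) {
                $inside = true;
            }
            continue;
        }
        // A closing-tag token with this name ends the element.
        if (preg_match('~^</' . $name . '>$~i', $token)) {
            return $body;
        }
        $body .= $token;
    }

    return null;
}

$html = 'text <head>
<meta name="description" content="<head></head>" />
</head> more text';

var_dump(extract_element($html, 'head'));
```

The quoted "<head></head>" never shows up as a token of its own, because it is consumed as part of the <meta> tag, so it cannot be mistaken for the real head element.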

--
gosha bine

extended php parser ~ http://code.google.com/p/pihipi
blok ~ http://www.tagarga.com/blok
