Reply to Re: Regex to get the <html></html>

Posted by gosha bine on 08/02/07 17:36

FFMG wrote:
> Hi,
>
> I want to get the <head> code and a 'simple?' solution seems to
> be...
>
> preg_match_all("/<[html]+[^>]*>\s*(.*\s*)<\/html>\s*/i", $html,
> $matches, PREG_SET_ORDER);
>
> but I want to make sure that there isn't a better solution to the
> problem especially if the head contains invalid code like...
>
> //--
> <head>
> <meta name="description" content="<head></head>" />
> </head>
> //--
>
> unfortunately this is my html code so I cannot ignore invalid <head>
> like the one above.
>
> So...
> How can I change my regex to ignore head tags inside double or single
> quotes?

I'd suggest

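$re = <<<HTML
~
<\w+ \b                                  # opening tag: "<" and the tag name
(?: " [^"]* " | ' [^']* ' | [^"'>]+ )*   # attributes; quoted values may contain < and >
>
| </ \w+ >                               # closing tag
| [^<]+                                  # plain text up to the next "<"
| <                                      # a lone "<" that is not part of a tag
~six
HTML;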

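This should be able to tokenize most HTML or HTML-like streams, even
hopelessly malformed ones. Each match is an opening tag, a closing tag,
or a run of text. Quoted attribute values are consumed as a unit, so a
<head> inside quotes never shows up as a tag of its own.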

This is how it works with your example:

///

$html = 'text <head>
<meta name="description" content="<head></head>" />
</head> more text';

preg_match_all($re, $html, $m);
print_r($m[0]);

///

output:

Array
(
[0] => text
[1] => <head>
[2] =>

[3] => <meta name="description" content="<head></head>" />
[4] =>

[5] => </head>
[6] => more text
)
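
To get from that token list to the actual <head> contents, something
along these lines might do. It's only a sketch: extract_head() is a
made-up helper name, and it assumes you want everything between the
first opening head tag and the next closing one. The nowdoc and the
type hints need PHP 7.1 or later.

```php
<?php
// Same tokenizer pattern as above, in a nowdoc so nothing is interpolated.
$re = <<<'HTML'
~
<\w+ \b
(?: " [^"]* " | ' [^']* ' | [^"'>]+ )*
>
| </ \w+ >
| [^<]+
| <
~six
HTML;

// Hypothetical helper (not from the original post): returns the raw markup
// between the first <head ...> token and the next </head> token, or null
// if no such pair is found.
function extract_head(string $html, string $re): ?string
{
    preg_match_all($re, $html, $m);
    $inside = false;
    $buf = '';
    foreach ($m[0] as $token) {
        if (!$inside) {
            // Opening tag token whose name is exactly "head" (not "header").
            if (preg_match('~^<head\b~i', $token)) {
                $inside = true;
            }
            continue;
        }
        if (preg_match('~^</head>$~i', $token)) {
            return $buf;
        }
        $buf .= $token;
    }
    return null;
}

$html = 'text <head>
<meta name="description" content="<head></head>" />
</head> more text';

var_dump(extract_head($html, $re));
```

The quoted <head></head> stays inside the <meta> token, so it is never
mistaken for the real tags. Note that a closing tag written with extra
whitespace, such as "</head >", is not matched by the closing-tag
branch and would need its own handling.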


--
gosha bine

extended php parser ~ http://code.google.com/p/pihipi
blok ~ http://www.tagarga.com/blok
