Posted by gosha bine on 08/02/07 17:36
FFMG wrote:
> Hi,
>
> I want to get the <head> code and a 'simple?' solution seems to
> be...
>
> preg_match_all("/<[html]+[^>]*>\s*(.*\s*)<\/html>\s*/i", $html,
> $matches, PREG_SET_ORDER);
>
> but I want to make sure that there isn't a better solution to the
> problem especially if the head contains invalid code like...
>
> //--
> <head>
> <meta name="description" content="<head></head>" />
> </head>
> //--
>
> unfortunately this is my html code so I cannot ignore invalid <head>
> like the one above.
>
> So...
> How can I change my regex to ignore head tags inside double or single
> quotes?
I'd suggest
$re = <<<HTML
~
<\w+ \b
(?: " [^"]* " | ' [^']* ' | [^"'>]+ )*
>
| </ \w+ >
| [^<]+
| <
~six
HTML;
This should be able to parse most HTML or HTML-like streams, even
hopelessly malformed ones.
This is how it works with your example:
///
$html = 'text <head>
<meta name="description" content="<head></head>" />
</head> more text';
preg_match_all($re, $html, $m);
print_r($m[0]);
///
output:
Array
(
[0] => text
[1] => <head>
[2] =>
[3] => <meta name="description" content="<head></head>" />
[4] =>
[5] => </head>
[6] => more text
)
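To get from this token list to the actual contents of <head>, one option is to walk the tokens and collect everything between the first opening head tag and the next </head> token. The following is only a minimal sketch under that assumption: head_contents() is a name made up for this example, and the pattern is the same one as above, just written as a nowdoc with comments (which the x flag permits).

```php
<?php
// Same tokenizer pattern as above. A nowdoc (<<<'HTML') is used here so that
// PHP does not interpret backslashes or "$" inside the pattern.
$re = <<<'HTML'
~
<\w+ \b                                   # opening tag name
(?: " [^"]* " | ' [^']* ' | [^"'>]+ )*    # attributes; quoted values may hold < and >
>
| </ \w+ >                                # closing tag
| [^<]+                                   # text between tags
| <                                       # stray "<" that starts no tag
~six
HTML;

// Hypothetical helper (not from the original post): returns the raw text
// between the first opening head tag and the following </head>, or null
// when no complete pair is found.
function head_contents($html, $re)
{
    preg_match_all($re, $html, $m);
    $inside = false;
    $out = '';
    foreach ($m[0] as $token) {
        if (!$inside) {
            // "<head" followed by a word boundary, so "<header>" is skipped.
            if (preg_match('~^<head\b~i', $token)) {
                $inside = true;
            }
            continue;
        }
        if (preg_match('~^</head>$~i', $token)) {
            return $out;
        }
        $out .= $token;
    }
    return null;
}

$html = 'text <head>
<meta name="description" content="<head></head>" />
</head> more text';

echo trim(head_contents($html, $re));
// prints: <meta name="description" content="<head></head>" />
```

Because a quoted attribute value is consumed as part of its tag token, the <head></head> inside content="..." never appears as a token of its own. Note that a closing tag written with whitespace, such as </head >, is not recognised as a tag by this pattern and would need an extra alternative.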
--
gosha bine
extended php parser ~ http://code.google.com/p/pihipi
blok ~ http://www.tagarga.com/blok