Re: How to strip styles embedded within html tags?

Posted by John Dunlop on 10/18/06 15:26

Steve:

> I'm using the following function that I found on php.net to strip
> out html and return only the text.

PHP has a built-in function strip_tags():

http://www.php.net/manual/en/function.strip-tags.php

But strip_tags() doesn't really do what it says on the tin. It doesn't
know what is and what isn't a tag, and it commits the cardinal sin of
lumping other markup constructs - e.g., comment declarations - under
the rubric of "tag". Neither does it know about the minutiae of HTML,
such as markup minimisation. Whatever looks like a tag, *is* a tag in
its eyes, and vice versa. Add tag soup and non-arbitrary markup to the
mix, and strip_tags() causes real problems.
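
To make that concrete, here is a small sketch. The helper name
demo_strip() and both sample strings are made up for illustration; the
only real function involved is strip_tags() itself.

```php
<?php
// Hypothetical helper (not from the original post): just wraps strip_tags().
function demo_strip(string $text): string
{
    return strip_tags($text);
}

// "<b and c>" is ordinary prose, but it starts with '<' and ends with '>',
// so it is treated as a tag and removed.
$prose = demo_strip('if a <b and c> d then stop');

// A comment declaration is not a tag, yet it is removed along with the tags.
$comment = demo_strip('before<!-- a comment -->after');

var_dump($prose, $comment);
```

The first string loses part of its text even though it contains no
markup at all, and the comment declaration disappears as if it were a
tag.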

> $search = array('@<script[^>]*?>.*?</script>@si', // Strip out javascript

Since '>' can occur in attribute values, the scan-forward-until-'>'
technique is primitive and can generate false positives. Without
parsing the rest of the tag, you can't tell whether a '>' is character
data or closes the tag. But I shouldn't think this would cause a
problem, in all probability. I mean, major browser vendors got away
with this technique, so why can't you?
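
A minimal sketch of the false positive, for what it's worth. The
pattern here is a simplified stand-in for the scan-forward-until-'>'
idea, not Steve's exact pattern, and naive_strip() and the sample
markup are invented for the example.

```php
<?php
// Hypothetical helper: removes anything from '<' up to the next '>'.
// Simplified stand-in for the technique, not the pattern quoted above.
function naive_strip(string $html): string
{
    return preg_replace('@<[^>]*>@', '', $html);
}

// The alt value contains a '>' that is character data, not a tag close.
$html = '<img alt="a > b" src="x.png">caption';

echo naive_strip($html), "\n";
```

The match stops at the '>' inside the alt value, so the tail of the
tag leaks into the output as if it were text.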

Inverting a quantifier's greediness twice, as the pattern below does,
puts it back to its default, greedy behaviour:

> '@<style[^>]*?>.*?</style>@siU', // Strip style tags properly

The U pattern modifier inverts quantifier greediness pattern-wide, but
the '?' after each star inverts it again, so the second star, the one
that matters here, is greedy again. Not what you want. The fix is to
remove either the U pattern modifier or the second '?'.
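
A quick sketch of the difference, assuming a document with two style
elements and some text between them. The helper strip_style() and the
sample string are made up; the three patterns are the quoted one and
the two fixes just described.

```php
<?php
// Hypothetical helper: applies one of the style-stripping patterns.
function strip_style(string $pattern, string $html): string
{
    return preg_replace($pattern, '', $html);
}

$html = '<style>a{}</style>KEEP<style>b{}</style>';

// As quoted: U plus '?' makes both stars greedy.
$quoted   = '@<style[^>]*?>.*?</style>@siU';
// Fix 1: drop the U modifier, so '?' means lazy again.
$no_u     = '@<style[^>]*?>.*?</style>@si';
// Fix 2: keep U, drop the second '?', so the second star is lazy under U.
$no_qmark = '@<style[^>]*?>.*</style>@siU';

var_dump(
    strip_style($quoted, $html),
    strip_style($no_u, $html),
    strip_style($no_qmark, $html)
);
```

With the quoted pattern the greedy star runs from the first opening
tag to the last closing tag and takes the text in between with it;
either fix leaves that text alone.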

> '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA

This is made up.

Comments are defined in SGML (ISO 8879) as:

[91] comment declaration =
       mdo,
       ( comment,
         ( s |
           comment )* )?,
       mdc

[92] comment =
       com,
       SGML character*,
       com

In English, the declaration opens with '<!' (MDO) and closes with '>'
(MDC). In between is an optional comment followed by zero or more
comments or "whitespaces". Comments themselves are composed of '--'
(COM) followed by zero or more SGML characters followed by COM. As a
regular expression (untested):

/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s
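
Here is a rough sketch of how that pattern might be used. The helper
name strip_comment_declarations() and the sample strings are my own
invention, and the pattern is still only lightly exercised, so treat
it as a starting point rather than a finished comment stripper.

```php
<?php
// Hypothetical helper wrapping the pattern above.
function strip_comment_declarations(string $html): string
{
    // '<!' (MDO), optionally a comment followed by any mix of whitespace
    // and further comments, then '>' (MDC). The s modifier lets '.' match
    // newlines, so multi-line comments are covered.
    $pattern = '/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s';
    return preg_replace($pattern, '', $html);
}

$samples = [
    'a<!-- x > y -->b',          // '>' inside the comment
    'a<!-- one -- -- two -->b',  // two comments in one declaration
    '<!DOCTYPE html>x',          // a declaration, but not a comment
];

foreach ($samples as $sample) {
    var_dump(strip_comment_declarations($sample));
}
```

The first two samples are removed whole, including the '>' inside the
first comment, while the DOCTYPE declaration is left untouched because
nothing after its '<!' starts with '--'.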

> "@</?[^>]*>*@"

You say error, I say broken by design.

--
Jock
