Posted by John Dunlop on 10/18/06 15:26
Steve:
> I'm using the following function that I found on php.net to strip
> out html and return only the text.
PHP has a built-in function strip_tags():
http://www.php.net/manual/en/function.strip-tags.php
But strip_tags() doesn't really do what it says on the tin. It doesn't
know what is and what isn't a tag, and it commits the cardinal sin of
lumping other markup constructs - e.g., comment declarations - under
the rubric of "tag". Neither does it know about the minutiae of HTML,
such as markup minimisation. Whatever looks like a tag, *is* a tag in
its eyes, and vice versa. Add tag soup and non-arbitrary markup to the
mix, and strip_tags() causes real problems.
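A minimal sketch of the complaint, using made-up input strings. The first
two results follow from the manual's own description (tags and HTML comments
are stripped). The third input is character data that only resembles a tag;
I'd expect strip_tags() to drop it as well, but run it rather than take my
word for it.

```php
<?php
// Ordinary markup: the tags go, the text stays.
$plain = strip_tags('<p>Hello</p>');

// A comment declaration is not a tag, yet it is removed along with the tags.
$comment = strip_tags('a<!-- note -->b');

// Character data that merely looks like a tag (hypothetical input).
$lookalike = strip_tags('if x <y and z> w');

var_dump($plain, $comment, $lookalike);
```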
> $search = array('@<script[^>]*?>.*?</script>@si', // Strip out javascript
Since '>' can occur in attribute values, the scan-forward-until-'>'
technique is primitive and can generate false positives. Without
parsing the rest of the tag, there is no telling whether a '>' is
character data or closes the tag. Who knows? But in all probability I
shouldn't think this would cause a problem. I mean, major browser
vendors got away with this technique, so why can't you?
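To put something concrete behind that, here is a small sketch. The script
pattern is the one quoted above. The bare tag stripper and both input
strings are made up for illustration. The script pattern stops early on the
'>' in the attribute value but then scans on to '</script>', so nothing
leaks. The bare stripper has no end tag to resynchronise on and leaves
residue behind.

```php
<?php
$scriptPattern = '@<script[^>]*?>.*?</script>@si'; // from the quoted code
$tagPattern    = '@<[^>]*>@';                      // hypothetical bare stripper

// Both inputs carry a '>' inside an attribute value.
$script = '<script data-note="a>b">alert(1)</script>after';
$link   = '<a title="x > y">link</a>';

// The start-tag match ends at the first '>', but '.*?' absorbs the rest
// of the tag while looking for '</script>'.
$scriptResult = preg_replace($scriptPattern, '', $script);

// Here the match ends at the first '>' and the tail of the tag survives.
$linkResult = preg_replace($tagPattern, '', $link);

var_dump($scriptResult); // string(5) "after"
var_dump($linkResult);   // string(8) " y">link"
```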
Inverting a quantifier twice, as the pattern below does, means it
reverts to its default greediness:
> '@<style[^>]*?>.*?</style>@siU', // Strip style tags properly
The U pattern modifier inverts quantifier greediness pattern-wide, and
the '?' after each star inverts it again, so both stars are greedy once
more. That is harmless for '[^>]*?', which can't run past a '>', but
'.*?' now swallows everything up to the last '</style>'. Not what you
want. The fix is to either remove the U pattern modifier or remove the
second '?'.
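A quick sketch of the difference, assuming a made-up input with two style
elements and some text to keep in between. The variable names are mine. The
first pattern is the one quoted above and the other two are the fixes just
described.

```php
<?php
$html = '<style>a{}</style>keep<style>b{}</style>';

$asPosted = '@<style[^>]*?>.*?</style>@siU'; // U plus '?': '.*?' is greedy
$withoutU = '@<style[^>]*?>.*?</style>@si';  // fix 1: drop the U modifier
$withoutQ = '@<style[^>]*?>.*</style>@siU';  // fix 2: drop the second '?'

// The greedy star runs to the LAST '</style>', taking "keep" with it.
$broken = preg_replace($asPosted, '', $html);

// Either fix makes the star lazy, so each element is matched on its own.
$fixed1 = preg_replace($withoutU, '', $html);
$fixed2 = preg_replace($withoutQ, '', $html);

var_dump($broken); // string(0) ""
var_dump($fixed1); // string(4) "keep"
var_dump($fixed2); // string(4) "keep"
```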
> '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA
This is made up.
Comments are defined:
  [91] comment declaration =
         mdo,
         ( comment,
           ( s |
             comment )* )?,
         mdc
  [92] comment =
         com,
         SGML character*,
         com
In English, the declaration opens with '<!' (MDO) and closes with '>'
(MDC). In between, optionally, is a comment followed by zero or more
comments or white-space characters (s). Comments themselves are composed of '--'
(COM) followed by zero or more SGML characters followed by COM. As a
regular expression (untested):
/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s
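Since I called that untested, here is a rough check against a few made-up
inputs, alongside the pattern Steve posted. The helper name strip_comments()
is hypothetical. This only exercises the cases shown and is no proof that
the expression covers every declaration the grammar allows.

```php
<?php
$posted   = '@<![\s\S]*?--[ \t\n\r]*>@';             // from the quoted code
$proposed = '/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s';  // expression above

// Hypothetical helper: remove comment declarations using $proposed.
function strip_comments(string $html, string $pattern): string
{
    return preg_replace($pattern, '', $html);
}

$simple = strip_comments('a<!-- hi -->b', $proposed);            // one comment
$withGt = strip_comments('a<!-- x > y -->b', $proposed);         // '>' inside
$double = strip_comments('a<!-- one -- -- two -->b', $proposed); // two comments
$empty  = strip_comments('a<!>b', $proposed);                    // no comment

// A doctype is a markup declaration but not a comment declaration.
$page = '<!DOCTYPE html><p>text</p><!-- c -->';
$keptDoctype = strip_comments($page, $proposed);

// The posted pattern starts at the doctype's '<!' and runs to the '-->'.
$swallowed = preg_replace($posted, '', $page);

var_dump($simple, $withGt, $double, $empty); // each is string(2) "ab"
var_dump($keptDoctype); // string(26) "<!DOCTYPE html><p>text</p>"
var_dump($swallowed);   // string(0) ""
```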
> "@</?[^>]*>*@"
You say error, I say broken by design.
--
Jock