Posted by Rik on 02/03/07 16:49
Michael <mischa.tbaINSERTATHERExs4all.nl> wrote:
> Hello all you (reg|parsing)experts
>
> Here's one for you ;)
>
> I'm creating a kind of markup system, parsing a number of (custom) markup
> tags with the following syntax:
> [tag|arg1|arg2|...]contents[/tag]
> where any tag with no arguments may be written as [tag]contents[/tag] and
> any tag with no contents may be written as [tag|arg1|arg2|...|argn /].
>
> What I have now works fine for constructions like
> [first][second]Hello[/second] world[/first]!
> But consider
> [first]Hello [first]world[/first][/first]!
> Here it will look for the closing tag for the first tag, [first], which it
> finds right after "world". It will then process its contents, "Hello
> [first]world". If [first] happens to be a tag which leaves its contents more
> or less intact, it will then find another [first] / [/first] pair and parse
> it, but if [first] returns something entirely different (like a database
> value) this will leave a trailing end tag (e.g. if the tag maps "Hello
> [first]world" to "Nicey-nice", the result will be "Nicey-nice[/first]").
> Of course, what I want it to do is replace the innermost tag first (with
> contents "world"), substitute the result in the original string and then
> process the outermost tag (if first replaces "world" with "earth", the
> original string would first become "[first]Hello earth[/first]" which poses
> no further problem).
>
> The regex I'm currently using is
>
> which basically
> a) finds an opening tag [aaa]
> b) gobbles arguments separated by | until it finds
> b1) the closing bracket ], the contents of the tag -- that's the
> (.*?) part -- and a closing tag [/aaa]
> or
> b2) an implied bracket /]
> Of course the problem is the (.*?) part, which stops as soon as a matching
> closing tag is encountered. What I actually need is some way to count the
> number of opening tags of the same name (say aaa) and the number of those
> that are closed, and only match [/aaa] once no more [aaa] tags INSIDE the
> one I'm parsing are open.
> Another way I could think of to solve the problem is to find the first
> opening tag, find the LAST matching closing tag (which could be very far
> away, if I'd use <B>..</B> on each line it would match the entire body of
> the document), then recursively find the first opening tag INSIDE of that
> with a matching closing tag, until there are no more opening tags inside -
> effectively parsing the "shortest" tags first (in other words: from the
> inside outwards).
>
> If you're still with me, please help me out on this because currently I'm
> kind of at a loss.
Ask any expert, and they'll say regexes have serious limitations when it
comes to parsing nested structures. You might want to consider scanning the
string and building a stack of the tags. That seems to me a better way to
handle 'broken' tags as well.
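
The stack idea could look roughly like the sketch below. It is an
illustration, not tested against the poster's markup: the function name
renderMarkup() and the $handlers map (tag name => callable taking the
argument list and the contents) are made up for this example, and it
assumes PHP 7 or later. Tags are rendered when their closing tag is seen,
so inner tags are always finished before the tag around them; stray or
unclosed tags are left in the output as plain text.

```php
<?php
// Sketch only. renderMarkup() and the $handlers map are hypothetical names:
// $handlers maps a tag name to a callable (array $args, string $contents): string.
function renderMarkup(string $input, array $handlers): string
{
    // Split into plain text and bracketed tokens; captured delimiters are kept.
    $tokens = preg_split('%(\[[^\]\[]+\])%', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Frame 0 collects the final output; every open tag pushes a new frame.
    $stack = [['name' => null, 'args' => [], 'raw' => '', 'out' => '']];

    foreach ($tokens as $token) {
        $top = count($stack) - 1;
        if (preg_match('%^\[/([^\]|/\s]+)\]$%', $token, $m)
            && $top > 0 && $stack[$top]['name'] === $m[1]) {
            // Closing tag for the innermost open tag: render it into its parent.
            $frame = array_pop($stack);
            $stack[$top - 1]['out'] .= $handlers[$frame['name']]($frame['args'], $frame['out']);
        } elseif (preg_match('%^\[([^\]|/\s]+)((?:\|[^\]|]*?)*)\s*(/?)\]$%', $token, $m)
            && isset($handlers[$m[1]])) {
            $args = $m[2] === '' ? [] : explode('|', substr($m[2], 1));
            if ($m[3] === '/') {
                // Self-contained tag: render immediately with empty contents.
                $stack[$top]['out'] .= $handlers[$m[1]]($args, '');
            } else {
                $stack[] = ['name' => $m[1], 'args' => $args, 'raw' => $token, 'out' => ''];
            }
        } else {
            // Plain text, unknown tag, or stray closing tag: keep it literally.
            $stack[$top]['out'] .= $token;
        }
    }
    // Tags that were never closed are put back as literal text.
    while (count($stack) > 1) {
        $frame = array_pop($stack);
        $stack[count($stack) - 1]['out'] .= $frame['raw'] . $frame['out'];
    }
    return $stack[0]['out'];
}

$handlers = ['first' => function (array $args, string $contents): string {
    return strtoupper($contents);
}];
echo renderMarkup('[first]Hello [first]world[/first][/first]!', $handlers), "\n";
```

Because each frame only ever sees contents that have already been rendered,
the "trailing end tag" problem from the question should not arise here,
whatever the handler returns.
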
However, for your particular problem, I'd go the opposite way from what you
propose. You propose to process the innermost tags first; I'd say process
the outer tags and let the pattern ignore inner tags with exactly the same
name. Maybe something like this helps (beware of typing errors, untested):
'%\[                           # start of opening tag
([^\]|/\s]+)                   # type of opening tag, captured in \1
(?:\|([^\]]*?))?\s*            # optional arguments, captured in \2
(?:
  /\]                          # self-closing tag
|                              # or
  \]                           # end of tag, scan further for end tag:
  (                            # contents, captured in \3
    (?:
      (?!\[/?\1[\]|/\s]).      # data that is not a tag of the same type
    |                          # or a nested tag of the same type:
      \[\1(?:\|[^\]]*?)?\s*    # start of nested opening tag
      (?:                      # ignore until closing tag of nested set
        /\]                    # again, either self-closing
      |                        # or
        \](?:(?!\[/?\1[\]|/\s]).)*\[/\1\]   # data + closing tag
      )
    )*                         # allow for more than one nested tag
  )
  \[/\1\]                      # closing of actual tag
)%sx'
Something similar to this should work, but you can see the problem: it
only allows for one level of nesting; for every additional level we would
have to alter the regex.
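
One possible way around the fixed nesting depth, not part of the original
suggestion and only a sketch: PCRE has a recursion construct, (?R), which
re-enters the whole pattern, so a tag's contents can be described as "text
without brackets, or another complete tag". The variable names below are
made up for illustration. Note the assumption that a literal "[" never
appears in the contents except as part of a well-formed tag; otherwise the
pattern simply fails to match that tag.

```php
<?php
// Assumption: (?R) is PCRE's "recurse into the whole pattern" construct.
$nested = '%\[                 # start of opening tag
    ([^\]|/\s]+)               # tag name, captured in group 1
    ((?:\|[^\]|]*?)*)          # optional |arg1|arg2..., captured in group 2
    \s*
    (?:
        /\]                    # self-contained tag
    |
        \]                     # end of opening tag
        ( (?: [^\[]++          # contents (group 3): text without brackets
            | (?R)             # or one complete, balanced nested tag
          )* )
        \[/\1\]                # closing tag with the same name
    )%sx';

$subject = '[first]Hello [first]world[/first][/first]!';
if (preg_match($nested, $subject, $m)) {
    echo $m[0], "\n";   // the whole outer tag
    echo $m[3], "\n";   // its contents, inner tag included
}
```

To actually process the tags, a pattern like this would typically be handed
to preg_replace_callback(), with the callback running the same replacement
on the captured contents first; that part is left out here.
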
One way I've used in the past is the following:
1. Match (don't replace yet) all opening tags, closing tags and
'self-contained' tags into separate arrays, with PREG_OFFSET_CAPTURE.
2. For every opening tag, search the other array for the nearest closing
tag that has no opening tag in between.
3. Replace in the string (using substr() for instance), and remove the
opening and closing tags from their respective arrays (keep track of
changing offsets, it's a pain...)
4. Loop until the arrays are empty, or throw an error when tags don't match
(opening tag without closing tag or the other way around).
I know I have that code somewhere, can't seem to find it at the moment
though :P
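
In the meantime, the four steps could be sketched along these lines. This
is a reconstruction under assumptions, not the code referred to above:
renderInsideOut() and the $handlers map (tag name => callable taking the
argument list and the contents) are hypothetical names, PHP 7 or later is
assumed, and it deviates from the description in one respect: instead of
adjusting stored offsets after each replacement, it rescans the string,
which is slower but avoids the bookkeeping. It uses one match list in
document order rather than separate arrays.

```php
<?php
// Sketch only: renderInsideOut() and $handlers are hypothetical names.
function renderInsideOut(string $input, array $handlers): string
{
    if (!$handlers) {
        return $input;
    }
    $names = implode('|', array_map(function ($n) {
        return preg_quote($n, '%');
    }, array_keys($handlers)));
    // Groups: 1 = "/" on a closing tag, 2 = name, 3 = "|arg1|arg2...",
    // 4 = "/" on a self-contained tag.
    $tagRe = '%\[(/?)(' . $names . ')((?:\|[^\]|]*?)*)\s*(/?)\]%';

    while (true) {
        // Step 1: collect every known tag together with its byte offset.
        preg_match_all($tagRe, $input, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
        if (!$tags) {
            return $input;                        // step 4: nothing left to do
        }
        $done = false;
        foreach ($tags as $i => $tag) {
            $name  = $tag[2][0];
            $args  = $tag[3][0] === '' ? [] : explode('|', substr($tag[3][0], 1));
            $start = $tag[0][1];
            $end   = $start + strlen($tag[0][0]); // offset just past this tag
            $contents = null;
            if ($tag[1][0] === '' && $tag[4][0] === '/') {
                $contents = '';                   // self-contained tag
            } elseif ($tag[1][0] === '' && isset($tags[$i + 1])
                && $tags[$i + 1][1][0] === '/' && $tags[$i + 1][2][0] === $name) {
                // Step 2: the very next tag closes this one, so nothing is nested.
                $close = $tags[$i + 1];
                $contents = substr($input, $end, $close[0][1] - $end);
                $end = $close[0][1] + strlen($close[0][0]);
            }
            if ($contents !== null) {
                // Step 3: splice the handler's result into the string.
                $input = substr_replace($input, $handlers[$name]($args, $contents),
                    $start, $end - $start);
                $done = true;
                break;                            // offsets are stale now: rescan
            }
        }
        if (!$done) {
            // Step 4: tags remain but none can be paired.
            throw new RuntimeException('Unbalanced tags near offset ' . $tags[0][0][1]);
        }
    }
}

$handlers = ['first' => function (array $args, string $contents): string {
    return strtoupper($contents);
}];
echo renderInsideOut('[first]Hello [first]world[/first][/first]!', $handlers), "\n";
```

One caveat of rescanning: if a handler returns text that itself looks like
a known tag, that text is parsed again on the next pass.
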
--
Rik Wasmus