Posted by Rik on 06/26/06 12:41
McHenry wrote:
> <h2>Field1</h2>
>
> <h3>
>
> $123,456.78 - $987,654.32
>
> </h3>
>
> I would like to capture Field1 and the first numeric value only.
> I have created the following that works somewhat:
> $pattern='%<h2>(?P<field1>.*?)</h2>
> .*?
>
> <h3>.*?\$(?P<field2>.*?)\s.*?</h3> %six'; However I would like to
> improve field2's capture to be the first series of numbers after <h3>
> excluding the thousand seperator and stop the capture as soon as a
> non numeric is encountered other than the decimal point, I cannot
> depend on the dollar sign always being present, so in this case I'd
> capture 123456.78
>
> Thanks in advance...
Simple one: capture at least one digit, followed by any further digits,
decimal or thousand separators:
<h3>.*?(?P<field2>[0-9]+[0-9\.,]*).*?</h3>
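A minimal sketch of using the simple pattern from PHP, assuming the markup from the question is held in a string. The variable names and the str_replace() step are my own additions: the pattern captures the thousand separators along with the digits, so they are stripped afterwards to get the plain number.

```php
<?php
// Sample markup from the question (the variable name is an assumption).
$html = "<h2>Field1</h2>\n\n<h3>\n\n\$123,456.78 - \$987,654.32\n\n</h3>";

// 's' lets '.' match newlines, 'i' ignores case.
$pattern = '%<h3>.*?(?P<field2>[0-9]+[0-9\.,]*).*?</h3>%si';

$value = null;
if (preg_match($pattern, $html, $m)) {
    // The capture is "123,456.78"; drop the thousand separators.
    $value = str_replace(',', '', $m['field2']);
}
echo $value, "\n"; // prints 123456.78
```

Note that this pattern also accepts a trailing '.' or ',' directly after the digits, so a value at the end of a sentence may need an extra rtrim().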
Advanced, will validate the currency format:
<h3>.*?(?P<field2>(?:[1-9][0-9]{0,2}(?:,[0-9]{3})*|0)(?:\.[0-9]{2})?).*?</h3>
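To see what the currency part accepts, here is a small sketch that tests the number sub-pattern on its own. The ^ and $ anchors and the sample strings are assumptions added for illustration; they are not part of the pattern above.

```php
<?php
// Number part of the pattern, anchored with ^ and $ (my addition)
// so that the whole string has to fit the format.
$currency = '%^(?:[1-9][0-9]{0,2}(?:,[0-9]{3})*|0)(?:\.[0-9]{2})?$%';

// Hypothetical sample values.
$samples = array('123,456.78', '0.50', '1,000', '1234.56', '12,34', '007');

$accepted = array();
foreach ($samples as $s) {
    if (preg_match($currency, $s) === 1) {
        $accepted[] = $s;
    }
}
echo implode(' | ', $accepted), "\n"; // prints 123,456.78 | 0.50 | 1,000
```

Inside the full <h3> pattern the number part is not anchored, so it picks up the first run of characters that fits the format instead of rejecting the heading. For instance, a heading containing 1234.56 would give '123' as field2.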
Allow for unexpected HTML tags/attributes, where we don't want to match the
'10' in a '<span margin="10px">' for instance:
<h3[^>]*>(?:[^<]*?(?:<[^>]*>)?)*?(?P<field2>(?:[1-9][0-9]{0,2}(?:,[0-9]{3})*|0)(?:\.[0-9]{2})?).*?</h3>
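A short comparison of the simple pattern and the tag-tolerant one on markup with an attribute that contains digits. The sample markup and variable names are assumptions made up for this sketch.

```php
<?php
// Hypothetical markup with digits inside a tag attribute.
$html = '<h3><span margin="10px">$123,456.78</span> - $987,654.32</h3>';

// Simple pattern: the first digits it meets are inside the attribute.
$simple = '%<h3>.*?(?P<field2>[0-9]+[0-9\.,]*).*?</h3>%si';

// Tag-tolerant pattern: whole tags are skipped before the number is tried.
$tolerant = '%<h3[^>]*>(?:[^<]*?(?:<[^>]*>)?)*?'
          . '(?P<field2>(?:[1-9][0-9]{0,2}(?:,[0-9]{3})*|0)(?:\.[0-9]{2})?)'
          . '.*?</h3>%si';

preg_match($simple, $html, $a);
preg_match($tolerant, $html, $b);

echo $a['field2'], "\n"; // prints 10
echo $b['field2'], "\n"; // prints 123,456.78
```

The nested quantifiers in the tolerant pattern can backtrack heavily on long headings that contain no number at all, so it is probably best applied to small fragments rather than a whole page.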
Of course, if you're naming your captures 'field1' & 'field2', you might as
well not name them at all.
Grtz,
--
Rik Wasmus