Posted by Jochem Maas on 01/21/05 00:02
Tim Boring wrote:
> Hello! I'm having an odd regex problem. Here's a summary of what I'm
> trying to accomplish:
>
> I've got a report file generated from our business management system
> (Progress 4GL), one fixed-width record per line. I've got a php script
> that reads in the raw file one line at a time, and "strips" out any
> "unwanted" lines (repeated column headings, mostly).
>
> I'm stripping out unwanted lines by looking at the beginning of each
> line and doing the following:
> 1. If the line begins with a non-word character (\W+), discard it;
> 2. If the line begins with the word "Vendor", discard it;
> 3. If the line begins with "Loc", discard it;
> 4. If the line begins with a dash, discard it;
> 5. Else keep the line and write it to an output file.
>
> The way I've implemented this in code is via the code snippet below.
> The problem I'm encountering, however, is that any line that begins with
> a word, such as "AKRN", is matching rule #1, thus discarding the line.
> This is not what I want, but I'm having difficulty spotting my mistake.
>
> To try to help spot the issue, I put in the if(preg_match("/^\W+/",
> $line)) logic, and the weird thing is that this logic isn't outputting
> the line beginning with things like "AKRN", yet the same line is getting
> caught in the switch statement and being discarded.
>
> Any suggestions?
>
> while (!feof($input_handle))
> {
> $line = fgets($input_handle);
>
\W is every NON-word character.
> if (preg_match("/^\W+/", $line))
> {
the manual says: "preg_match() returns the number of times pattern matches.
That will be either 0 times (no match) or 1 time because preg_match()
will stop searching after the first match."
so preg_match() gives you an integer (0 or 1), while $line is a string.
> echo "$line\n";
> }
>
if the string is 0 bytes long, $line equates to false in the switch
and matches the first case expression that is also false.
> switch ($line)
> {
is $total_counter less than or equal to 5?
if yes then this case runs. else...
> case ($total_counter <= 5):
> fwrite($output_handle, $line);
> $counter++;
> $total_counter++;
> break;
> // Rule #1: non-word character
if $line is empty then it typecasts to boolean as false.
if the regexp does not find a match (which it wouldn't in the case of
the empty string) then preg_match() returns 0, which also equates to
false. so this case always runs when $line is empty.
probably every $line will fire this case.
> case preg_match("/^\W+/", $line):
> array_push($tossed_lines, $line);
> echo "Rule #1 violation\n";
> $tossed_counter++;
> $total_counter++;
> break;
> // Rule #2: "Vendor" at beginning of line
none of the rest will fire if $line is empty.
this case should fire if $line is not empty and starts with 'Vendor'
(case insensitive). a non-empty string and numeric 1 (return val from
preg_match()) both equate to true.
yadda yadda yadda... I just ran a little test on PHP5 regarding this
little problem. it's all a misunderstanding regarding automatic
typecasting of strings, I think:
$> php -r '
switch ("") {
case 0:echo "hello\n";
}
switch ("yes") {
case 1:echo "hello again\n";
}
switch ("") {
case 1:echo "huh?\n";
}
switch ("yes") {
case 0:echo "huh again?\n";
}
assert("" == 0);
assert("yes" == 1);
assert("" == 1);
assert("yes" == 0);
switch ("1yes") {
case 1: echo "oh?\n";
}
switch ("0yes") {
case 0: echo "geddit?\n";
}
'
hello
huh again?
Warning: assert(): Assertion failed in Command line code on line 17
Warning: assert(): Assertion failed in Command line code on line 18
oh?
geddit?
PHP 5.0.2 (cli) (built: Oct 21 2004 13:52:27)
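the test above suggests that switch compares $line against each case value loosely, converting types as it goes. one way around that, as a rough sketch: compare the return value of preg_match() strictly against 1, so no string is ever compared with an integer. the helper name rule_matches() and the sample strings are made up for illustration, they are not from the original script.

```php
<?php
// Hypothetical helper (not part of the original script): returns true only
// when the pattern actually matched, so no string/integer conversion occurs.
function rule_matches($pattern, $line)
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $line) === 1;
}

// Made-up sample lines.
var_dump(rule_matches('/^\W+/', "AKRN  12345"));  // bool(false): starts with a word character
var_dump(rule_matches('/^\W+/', "-----------"));  // bool(true): starts with a dash
```

with a strict check like this, a line such as "AKRN" is only discarded when the pattern really matches it.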
> case preg_match("/^Vendor/i", $line):
> array_push($tossed_lines, $line);
> echo "Rule #2 violation\n";
> $tossed_counter++;
> $total_counter++;
> break;
> // Rule #3: "Loc" at beginning of line
> case preg_match("/^Loc/i", $line):
> array_push($tossed_lines, $line);
> echo "Rule #3 violation\n";
> $tossed_counter++;
> $total_counter++;
> break;
> // Rule #4: dash character at beginning of line
I think the /^\W+/ above will always catch this case first, since a dash
is itself a non-word character. change the order of the case statements?
> case preg_match("/^\-/", $line):
> array_push($tossed_lines, $line);
> echo "Rule #4 violation\n";
> $tossed_counter++;
> $total_counter++;
> break;
> default:
> fwrite($output_handle, $line);
> $counter++;
> $total_counter++;
> break;
> }
> }
>
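for what it's worth, here is a minimal sketch of the five rules written as plain if tests instead of a switch on $line, so each preg_match() result is checked on its own and never compared against the line itself. the function name violated_rule() and the sample lines are assumptions made up for illustration; the dash rule is tested first because a dash would otherwise be caught by rule #1. the special handling of the first five lines in the original script is left out.

```php
<?php
// Hypothetical helper: returns the number of the rule that rejects $line,
// or 0 when the line should be kept.
function violated_rule($line)
{
    if (preg_match('/^-/', $line) === 1) {
        return 4; // Rule #4: dash (checked before rule #1, which would also match it)
    }
    if (preg_match('/^\W/', $line) === 1) {
        return 1; // Rule #1: any other non-word character
    }
    if (preg_match('/^Vendor/i', $line) === 1) {
        return 2; // Rule #2: "Vendor" at beginning of line
    }
    if (preg_match('/^Loc/i', $line) === 1) {
        return 3; // Rule #3: "Loc" at beginning of line
    }
    return 0;     // Rule #5: keep the line
}

// Made-up sample lines, for illustration only.
$lines = array("Vendor  Name", "Loc  Qty", "------", "  indented", "AKRN  100");

$kept = array();
foreach ($lines as $line) {
    if (violated_rule($line) === 0) {
        $kept[] = $line; // the real script would fwrite() this to the output file
    }
}

print_r($kept);
```

in the real loop the same function could decide between fwrite() to the output file and array_push() onto $tossed_lines.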