Posted by Jason Barnett on 11/26/57 11:16
Pablo Gosse wrote:
> Howdy folks. I'm running into something strange with array_diff that
> I'm hoping someone can shed some light on.
>
> I have two tab-delimited text files, and need to find the lines in the
> first that are not in the second, and vice-versa.
>
> There are 794 records in the first, and 724 in the second.
>
> Simple enough, I thought. The following code should work:
>
> $tmpOriginalGradList = file('/path/to/graduate_list_original.txt');
> $tmpNewGradList = file('/path/to/graduate_list_new.txt');
>
> $diff1 = array_diff($tmpOriginalGradList, $tmpNewGradList);
> $diff2 = array_diff($tmpNewGradList, $tmpOriginalGradList);
>
> I expected that this would set $diff1 to have all elements of
> $tmpOriginalGradList that did not exist in $tmpNewGradList, but it
> actually contains many elements that exist in both.
>
> The same is true for $diff2, in that many of its elements exist in both
> $tmpOriginalGradList and $tmpNewGradList as well.
>
> Since this returns $diff1 as having 253 elements and $diff2 as having
> 183, it sort of makes sense, since the difference between those two
> numbers is 70, which is the difference between the number of lines in
> the two files. But the bottom line is that both $diff1 and $diff2
> contain elements common to both files, which using array_diff simply
> should not be the case.
>
Hard to say what happened here. If I had to take a guess I might say
that you're getting line wrapping in the middle of 183 different records.
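A rough way to test that guess is to count the tab-separated fields on each line and flag the lines that do not have the expected number. This is only a sketch: the function name, the sample data and the expected field count of 3 are all made up for illustration, so substitute the real column count from your files.

```php
<?php
// Hypothetical helper: return the lines whose tab-separated field count
// differs from the expected one, keyed by 1-based line number.
function findIrregularRecords(array $lines, int $expectedFields): array
{
    $irregular = [];
    foreach ($lines as $index => $line) {
        // Strip only the line ending so the tabs inside the record survive.
        $record = rtrim($line, "\r\n");
        $fieldCount = count(explode("\t", $record));
        if ($fieldCount !== $expectedFields) {
            $irregular[$index + 1] = $record;
        }
    }
    return $irregular;
}

// Made-up sample data (assumption): the second record has been wrapped
// onto two lines, so neither half has the expected three fields.
$sampleLines = [
    "Smith\tJane\tMSc\n",
    "Jones\tBob\n",
    "\tPhD\n",
    "Lee\tAnn\tMA\n",
];

$suspects = findIrregularRecords($sampleLines, 3);
var_dump($suspects);
```

With real data you would pass in the array returned by file() instead of $sampleLines. If nothing is flagged, wrapping is probably not the cause.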
> However, when I loop through each file and strip out all the tabs:
>
And really since you have tab-delimited records you should be exploding
on those tabs in order to get the data set. But because I'm slightly
paranoid I would do it on the entire string of the file.
<?php
// Read each file as a single string.
$str_OriginalGradList = file_get_contents('/path/to/graduate_list_original.txt');
$str_NewGradList = file_get_contents('/path/to/graduate_list_new.txt');

// Split each string on the tab character (chr(9)).
$ary_OriginalGradList = explode(chr(9), $str_OriginalGradList);
$ary_NewGradList = explode(chr(9), $str_NewGradList);

// Values present in one list but not in the other.
$diff1 = array_diff($ary_OriginalGradList, $ary_NewGradList);
$diff2 = array_diff($ary_NewGradList, $ary_OriginalGradList);

echo '<pre>';
var_dump($diff1);
var_dump($diff2);
echo '</pre>';
?>
> foreach ($tmpOriginalGradList as $k=>$l) {
> $tmp = str_replace(chr(9), '', $l);
> $tmpOriginalGradList[$k] = $tmp;
> }
>
> foreach ($tmpNewGradList as $k=>$l) {
> $tmp = str_replace(chr(9), '', $l);
> $tmpNewGradList[$k] = $tmp;
> }
>
> I get $diff1 as having 75 elements and $diff2 as having 5, which also
> sort of makes sense, since numerically there is a difference of 70
> lines between the two files.
>
> I also manually replaced the tabs and checked about 20 of the elements
> in $diff1 and none were found in the new text file, and none of the 5
> elements in $diff2 were found in the original text file.
>
75 / 5 is probably the right mix. Programmatically you can check this
by comparing the diffs with each list.
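One caveat if you do that check: array_intersect() uses the same string comparison as array_diff(), so intersecting a diff with the other list always comes back empty. A possibly more useful variant is to compare after normalising whitespace, which shows which "differences" are whitespace only. The sketch below assumes that collapsing runs of tabs, spaces and line endings is an acceptable normalisation for your data. The function names and sample records are hypothetical.

```php
<?php
// Hypothetical normalisation: collapse runs of whitespace (tabs, spaces,
// line endings) into one space and trim both ends.
function normaliseRecord(string $record): string
{
    return trim(preg_replace('/\s+/', ' ', $record));
}

// Hypothetical helper: entries of $diff that match some entry of
// $otherList once both sides are normalised. Original keys are kept.
function findWhitespaceOnlyMismatches(array $diff, array $otherList): array
{
    // Flip so the normalised records become keys for quick isset() lookups.
    $otherNormalised = array_flip(array_map('normaliseRecord', $otherList));
    $matches = [];
    foreach ($diff as $key => $record) {
        if (isset($otherNormalised[normaliseRecord($record)])) {
            $matches[$key] = $record;
        }
    }
    return $matches;
}

// Made-up sample data (assumption): the "Jones" record differs only by a
// doubled tab and a \r\n line ending; "Lee" is missing from the new list.
$originalList = ["Smith\tJane\n", "Jones\tBob\n", "Lee\tAnn\n"];
$newList      = ["Smith\tJane\n", "Jones\t\tBob\r\n"];

$diff1   = array_diff($originalList, $newList);
$suspect = findWhitespaceOnlyMismatches($diff1, $newList);
var_dump($suspect);
```

Anything reported in $suspect is in the diff only because of whitespace, not because the record is actually missing from the other file.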
> However, if in the code above I replace the tabs with a space instead of
> just stripping them out, then the numbers are again 253 and 183.
>
> I'm inclined to think the second set of results is accurate, since I was
> unable to find any of the 20 elements I tested in $diff1 in the new text
> file, and none of the elements in $diff2 are in the original text file.
>
> Does anyone have any idea why this is happening? The tab-delimited
> files were generated from Excel spreadsheets using the same script, so
> there wouldn't be any difference in the formatting of the files.
>
The sad truth is that this is quite possibly the root cause of your
problem. I have had many many problems caused by MS Excel conversion
to/from other types of data. I don't completely understand the escaping
process in Excel, but double quotes have always been a problem. And
occasionally it seems like Excel just barfs on a tab / comma. Why it
does that is completely beyond me. I can't count the number of times
that I have opened up a comma delimited file in Excel, just *looked* at
the file, saved it, and when I view the source it's been mangled a bit.
Moral of the story: I don't ever use Excel to view tab or comma
delimited types of data unless I have a backup someplace.