|
Posted by Csaba Gabor on 06/21/06 09:56
I'm comparing the text of (snippets of) web pages which I expect to be
quite different or quite similar. In the case where they are similar,
I would like to display the more recent one and say something like:
Word 2 added [before word 2 in original]: "Jack be nimble"
Words 10-11 changed to: "the quick brown fox"
[from words 9-11 in original]: "the brown fast quick fox"
Words [20-22 in original] before word 20 removed: "sat in a corner on"
One way to do this is to replace all spacing chars with \n, write the
strings to two files, and then run diff (FC on my Win XP Pro = file
compare), collecting the output. Does anyone happen to have PHP code
for this where I don't have to write files? Note that the diff is on
words of the strings and not characters.
In particular, the normal algorithms for this (longest common
subsequence) only produce a number, and don't note the differences.
Also, I wanted to give a threshhold (about 10, say) and if the longest
common subsequence differs from the shorter string (strings are on the
order of 100 words) by at least this amount, then simply fail (since
the difference between the two strings would be deemed to great). This
should make the algorithm far more efficient. The corresponding
argument in FC would be /LB10.
Thanks,
Csaba Gabor from Vienna
Ref: http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html
http://en.wikipedia.org/wiki/Longest-common_subsequence_problem
Note that this problem is distinct from the longest common substring
problem.
Navigation:
[Reply to this message]
|