|  | Posted by Csaba Gabor on 06/21/06 09:56 
I'm comparing the text of (snippets of) web pages which I expect to bequite different or quite similar.  In the case where they are similar,
 I would like to display the more recent one and say something like:
 Word 2 added [before word 2 in original]: "Jack be nimble"
 
 Words 10-11 changed to: "the quick brown fox"
 [from words 9-11 in original]: "the brown fast quick fox"
 
 Words [20-22 in original] before word 20 removed: "sat in a corner on"
 
 
 One way to do this is to replace all spacing chars with \n, write the
 strings to two files, and then run diff (FC on my Win XP Pro = file
 compare), collecting the output.  Does anyone happen to have PHP code
 for this where I don't have to write files?  Note that the diff is on
 words of the strings and not characters.
 
 In particular, the normal algorithms for this (longest common
 subsequence) only produce a number, and don't note the differences.
 
 Also, I wanted to give a threshhold (about 10, say) and if the longest
 common subsequence differs from the shorter string (strings are on the
 order of 100 words) by at least this amount, then simply fail (since
 the difference between the two strings would be deemed to great).  This
 should make the algorithm far more efficient.  The corresponding
 argument in FC would be /LB10.
 
 Thanks,
 Csaba Gabor from Vienna
 
 Ref: http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html
 http://en.wikipedia.org/wiki/Longest-common_subsequence_problem
 Note that this problem is distinct from the longest common substring
 problem.
  Navigation: [Reply to this message] |