| 
	
 | 
 Posted by groupy on 05/31/06 08:37 
ok, let's take a look at a sample table representing the problem: 
 
A  | B | C | D 
-------------------- 
a1   b1  c1 d1 
a1   b2  c2 d2 
a1   b1  c3 d3 
a4   b4  c4 d3 
a5   b5  c5 d5 
a6   b6  c6 d3 
 
The duplications are: 
rows 1+2+3 on A 
row 1+3 on B 
rows 3+4+6 on D 
the only unique (in all params) row is 5 
note: finding first that row 1 similar to 2 on A & deleting it will 
loose information because we WON'T know if row 1 similar to row 3 on B. 
The same goes for the deletion of row 3 : it will cause lose of data 
regarding it's similarity to row 4 on D 
 
 
The Simple query for retriving all duplicated rows which consumes most 
time is: 
SELECT COUNT(*),A,B,C,D 
 FROM tbl 
GROUP BY A,B,C,D 
HAVING count(*)>1 
It takes about 2 weaks on a 1.5 million rows, while all fields are 
nvchars & the DB is in SQL-Server 
 
THANK YOU ALL
 
  
Navigation:
[Reply to this message] 
 |