You are here: Re: VERY chalanging question - Explanation « MsSQL Server « IT news, forums, messages
Re: VERY chalanging question - Explanation

Posted by Erland Sommarskog on 05/31/06 11:40

groupy (liav.ezer@gmail.com) writes:
> A | B | C | D
> --------------------
> a1 b1 c1 d1
> a1 b2 c2 d2
> a1 b1 c3 d3
> a4 b4 c4 d3
> a5 b5 c5 d5
> a6 b6 c6 d3
>
> The duplications are:
> rows 1+2+3 on A
> row 1+3 on B
> rows 3+4+6 on D
> the only unique (in all params) row is 5
> note: finding first that row 1 similar to 2 on A & deleting it will
> loose information because we WON'T know if row 1 similar to row 3 on B.
> The same goes for the deletion of row 3 : it will cause lose of data
> regarding it's similarity to row 4 on D
>
>
> The Simple query for retriving all duplicated rows which consumes most
> time is:
> SELECT COUNT(*),A,B,C,D
> FROM tbl
> GROUP BY A,B,C,D
> HAVING count(*)>1
> It takes about 2 weaks on a 1.5 million rows, while all fields are
> nvchars & the DB is in SQL-Server

I sincerely doubt that this statement takes two weeks to run for 1.5
million rows. Had you said 1.5 milliard rows, I could maybe have
believed it.

Anyway, first index each column individually. Then try:

DELETE tbl
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.B > b.B OR
a.C > b.C OR
a.D > b.D

DELETE tbl
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.C > b.C OR
a.D > b.D

DELETE tbl
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.D > b.C

After this operation, you still have the rows that have the same values
in four columns. But it is not clear from your description whether you
have such duplicates. If you have this maybe the best:

ATLER TABLE tbl ADD ident int IDENTITY

DELETE tbl
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.ident > b.ident

DELETE tbl
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.ident > b.ident

DELETE tbl
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.ident > b.ident

ALTER TABLE tbl DROP COLUMN ident

Note: all the above is untested. For tested solutions (at least with
regards to correctness), please post:

o CREATE TABLE statement for the table.
o INSERT statements with sample data.
o The desired result given the sample.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

 

Navigation:

[Reply to this message]


Удаленная работа для программистов  •  Как заработать на Google AdSense  •  England, UK  •  статьи на английском  •  PHP MySQL CMS Apache Oscommerce  •  Online Business Knowledge Base  •  DVD MP3 AVI MP4 players codecs conversion help
Home  •  Search  •  Site Map  •  Set as Homepage  •  Add to Favourites

Copyright © 2005-2006 Powered by Custom PHP Programming

Сайт изготовлен в Студии Валентина Петручека
изготовление и поддержка веб-сайтов, разработка программного обеспечения, поисковая оптимизация