Reply to Re: Assigning group numbers for millions of data

Posted by Erland Sommarskog on 07/19/69 11:43

(jacob.dba@gmail.com) writes:
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.

What if you have both John Smith and Southerland Jane with the same
SSN? Are they the same person or not?

This looks like a very difficult task, and the fact that you have
800 million rows certainly does not help to make it easier.

I think you need to scrap the idea you got from Itzik. My gut feeling
says that it will not scale.

Here is a very simple-minded solution where I've assumed that as
long as any combination of initials match, it's the same group.


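CREATE TABLE [TU_People_Data] (
   [tu_id]    [bigint] NOT NULL,
   [count_id] [int] NOT NULL,
   [fname]    [varchar] (32) COLLATE Latin1_General_CI_AS NULL,
   [lname]    [varchar] (32) COLLATE Latin1_General_CI_AS NULL,
   [ssn]      [int] NULL,
   CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
   (
      [tu_id],
      [count_id]
   ) ON [PRIMARY]
) ON [PRIMARY]
GO
-- One row per distinct (ssn, fname, lname), with order-independent initials.
CREATE TABLE #initials (ssn      int         NOT NULL,
                        fname    varchar(32) NOT NULL,
                        lname    varchar(32) NOT NULL,
                        initials char(2)     NOT NULL)
go
-- One row per (ssn, initials). The IDENTITY column supplies the group number.
CREATE TABLE #ssnmania (ident    int IDENTITY NOT NULL,
                        ssn      int          NOT NULL,
                        initials char(2)      NOT NULL,
                        PRIMARY KEY(ssn, initials))
go
-- Put the two initials in a fixed order, so that John Smith and
-- Smith John get the same value. Rows with a NULL ssn or name are
-- skipped, since the columns in #initials are NOT NULL; they need
-- separate handling.
INSERT #initials (ssn, fname, lname, initials)
   SELECT DISTINCT ssn, fname, lname,
          CASE WHEN fname < lname
               THEN substring(fname, 1, 1) + substring(lname, 1, 1)
               ELSE substring(lname, 1, 1) + substring(fname, 1, 1)
          END
   FROM   TU_People_Data
   WHERE  ssn IS NOT NULL
     AND  fname IS NOT NULL
     AND  lname IS NOT NULL
go
INSERT #ssnmania (ssn, initials)
   SELECT DISTINCT ssn, initials
   FROM   #initials
go
-- Every name gets the group number of its (ssn, initials) pair.
SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident
FROM   #initials i
JOIN   #ssnmania s ON i.ssn = s.ssn
                  AND s.initials = i.initials
go
-- Cleanup for this demo script. Leave TU_People_Data out of the list
-- if it is your real table!
DROP TABLE #initials, #ssnmania, TU_People_Data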
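
To see what the initials rule does with the names from the question, here is a minimal sketch on a handful of rows. The table name #sample is made up for the illustration, and the rows are just the names mentioned in this thread, all assumed to share SSN 1234. It only computes the initials with the same CASE expression as above; it does not assign group numbers.

```sql
-- Hypothetical sample table; not part of the original schema.
CREATE TABLE #sample (fname varchar(32) NOT NULL,
                      lname varchar(32) NOT NULL,
                      ssn   int         NOT NULL)
go
INSERT #sample (fname, lname, ssn) VALUES ('John', 'Smith', 1234)
INSERT #sample (fname, lname, ssn) VALUES ('Smith', 'John', 1234)
INSERT #sample (fname, lname, ssn) VALUES ('S', 'John', 1234)
INSERT #sample (fname, lname, ssn) VALUES ('J', 'Smith', 1234)
INSERT #sample (fname, lname, ssn) VALUES ('Dan', 'Brown', 1234)
INSERT #sample (fname, lname, ssn) VALUES ('Southerland', 'Jane', 1234)
go
-- Same CASE expression as in the script above: the alphabetically
-- smaller of the two names contributes the first initial.
SELECT fname, lname, ssn,
       initials = CASE WHEN fname < lname
                       THEN substring(fname, 1, 1) + substring(lname, 1, 1)
                       ELSE substring(lname, 1, 1) + substring(fname, 1, 1)
                  END
FROM   #sample
ORDER  BY initials, fname
go
DROP TABLE #sample
```

The four spellings of John Smith all come out as JS, and Dan Brown comes out as BD, so they land in different groups as requested. But Southerland Jane also comes out as JS, so this simple-minded rule puts her in the same group as John Smith. Whether that is acceptable is the question I asked above.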




--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
