Subscribe to
Posts
Comments

I was asked about this question at work and since I asked myself this question before I thought there might be a lot of people that would encounter this question. Hopfully I can save you some time if you have this question.

Suppose you have a table called MyTable that has millions of records and has no primiary key or unique index. You know that there are some records duplicated on fieldA and fieldB. How do you get rid of the dupilicates so that all records in the table have uniqe fieldA and filedB combined?

A T-SQL script that answers this question is as follow

1) First we create a index on the two fields

create index ix_fieldAB on MyTable (fieldA, fieldB)

2) After that we find out the values of fieldA and fieldB that have duplicates and put them in a temporary table

select fieldA, fieldB into #temtable1 from Mytable group by fieldA, fieldB having count(*)>1

3) Then we set aside all the records that have duplicates and add a unique field, seq. We do that by joining Mytable to #temtable1. In order to speed up query we create an index on #temtable1

create index ix_fieldAB on #temtable1 (fieldA, fieldB)
select a.*, identity(int, 1, 1) as seq into #temtable2 from MyTable a, #temtable1 b where a.fieldA=b.fieldA and a.fieldB=b.fieldB

4) We then remove all records from MyTable that have duplicates

delete a from Mytable a, #temtable1 b where a.fieldA=b.fieldA and a.fieldB = b.fieldB

5) We keep only one of the duplicates and delete the rest of it. After that we remove the seq field and put all of the remining records back to MyTable

delete from #temtable2 where seq not in (select min(seq) from #temtable2)
alter table #temtable2 drop column seq
GO

insert into MyTable select * from #temtable2

Now the table is de-duped

You would ask what if I want to de-dup records that are duplicated on more than two fields. You can follow the same methodology generally. Another trick is to use the checksum function. Here is an example.

1) Add a checksumvalue field

alter table MyTable add checksumvalue INT
GO

2) Calculate the check sum for the fields that have records duplicate on

update Mytable set checksum_value = checksum(fieldA, fieldB, fieldC, ….)

3) Create index on the checksum field

create index ix_checksum on MyTable (checksum_value)

4) Find out the checksumvalue that have duplicates

select checksum_value into #temtable1 from MyTable group by checksum_value having count(*)>1

5) Set aside the records that are duplicated on checksum_value (hence duplicated on fieldA, fieldB, fieldC, …)

create index ix_checksum on #temtable1 (checksum_value)
select a.*, identity(int, 1, 1) into #temtable2 from MyTable a, #temtable1 b where a.checksum_value = b.checksum_value

6) Remove all records that have duplicates on fieldA, fieldB, fieldC, ….

delete from Mytable a, #temtable1 b where a.checksum_value=b.checksum_value

7) We keep only one of the duplicates and delete the rest of it. After that we remove the seq field and put all of the remining records back to MyTable

delete from #temtable2 where seq not in (select min(seq) from #temtable2)
alter table #temtable2 drop column seq
GO

insert into MyTable select * from #temtable2

At the end we remove the checksum_value field

Alter table Mytable drop column checksum_value

Now the table is de-duped


Related Posts:

  • A Fast Way to Update Large Amount of Data in SQL Server With Select Into Statement
  • NULL Value and Aniti Semi Join in SQL - Watch out the Pitfall
  • MySQL Update Statement
  • Join to Parametrized Table Function - Cross Apply and Outer Apply in SQL Server 2005
  • A Real Life Job Interview Question for SQL Database Administrator


  • 1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
    Loading ... Loading ...

    RSS feed | Trackback URI

    Comments »

    No comments yet.

    Name (required)
    E-mail (required - never shown publicly)
    URI
    Subscribe to comments via email
    Your Comment (smaller size | larger size)
    You may use <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> in your comment.