How to De-dup a Table That Has Records Duplicated on Two or More fields Efficiently
October 14th, 2007 by Andrew Chen
I was asked about this question at work and since I asked myself this question before I thought there might be a lot of people that would encounter this question. Hopfully I can save you some time if you have this question.
Suppose you have a table called MyTable that has millions of records and has no primiary key or unique index. You know that there are some records duplicated on fieldA and fieldB. How do you get rid of the dupilicates so that all records in the table have uniqe fieldA and filedB combined?
A T-SQL script that answers this question is as follow
1) First we create a index on the two fields
create index ix_fieldAB on MyTable (fieldA, fieldB)
2) After that we find out the values of fieldA and fieldB that have duplicates and put them in a temporary table
select fieldA, fieldB into #temtable1 from Mytable group by fieldA, fieldB having count(*)>1
3) Then we set aside all the records that have duplicates and add a unique field, seq. We do that by joining Mytable to #temtable1. In order to speed up query we create an index on #temtable1
create index ix_fieldAB on #temtable1 (fieldA, fieldB)
select a.*, identity(int, 1, 1) as seq into #temtable2 from MyTable a, #temtable1 b where a.fieldA=b.fieldA and a.fieldB=b.fieldB
4) We then remove all records from MyTable that have duplicates
delete a from Mytable a, #temtable1 b where a.fieldA=b.fieldA and a.fieldB = b.fieldB
5) We keep only one of the duplicates and delete the rest of it. After that we remove the seq field and put all of the remining records back to MyTable
delete from #temtable2 where seq not in (select min(seq) from #temtable2)
alter table #temtable2 drop column seq
GO
insert into MyTable select * from #temtable2
Now the table is de-duped
You would ask what if I want to de-dup records that are duplicated on more than two fields. You can follow the same methodology generally. Another trick is to use the checksum function. Here is an example.
1) Add a checksumvalue field
alter table MyTable add checksumvalue INT
GO
2) Calculate the check sum for the fields that have records duplicate on
update Mytable set checksum_value = checksum(fieldA, fieldB, fieldC, ….)
3) Create index on the checksum field
create index ix_checksum on MyTable (checksum_value)
4) Find out the checksumvalue that have duplicates
select checksum_value into #temtable1 from MyTable group by checksum_value having count(*)>1
5) Set aside the records that are duplicated on checksum_value (hence duplicated on fieldA, fieldB, fieldC, …)
create index ix_checksum on #temtable1 (checksum_value)
select a.*, identity(int, 1, 1) into #temtable2 from MyTable a, #temtable1 b where a.checksum_value = b.checksum_value
6) Remove all records that have duplicates on fieldA, fieldB, fieldC, ….
delete from Mytable a, #temtable1 b where a.checksum_value=b.checksum_value
7) We keep only one of the duplicates and delete the rest of it. After that we remove the seq field and put all of the remining records back to MyTable
delete from #temtable2 where seq not in (select min(seq) from #temtable2)
alter table #temtable2 drop column seq
GO
insert into MyTable select * from #temtable2
At the end we remove the checksum_value field
Alter table Mytable drop column checksum_value
Now the table is de-duped


No comments yet.