SQLBank: Delete Duplicate Records

Thursday, January 28, 2010

SQL Server - Working with Duplicate Records

The following examples deals with how to retrieve or find duplicate records and how to remove them in various scenarios. Someone has asked me that he had imported a file twice by mistake and which caused duplicates in the data and he wanted to revert back it. In such scenarios, the following code will help you out even if there is no unique identity in the data.

To retrieve the duplicate records, use the following code
SELECT Column2, Column3, COUNT(*) FROM Test1
GROUP BY Column2, Column3
ORDER BY COUNT(*) DESC

Or you can use the following
SELECT Column2, Column3, COUNT(*) RepeatedCount FROM Test1
GROUP BY Column2, Column3
ORDER BY RepeatedCount DESC

How to remove such duplicates?
DELETE FROM Test1
WHERE column1 NOT IN
(SELECT MAX(Column1) FROM Test1
GROUP BY Column2, Column3)

However, the above code will not help you if you have multiple duplicate records. Moreover if there is no duplicate for any record, such records will also be deleted. Refer to the following scenario.

CREATE TABLE Test1(
column2 INT,
column3 VARCHAR(10))

INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')
INSERT INTO Test1 VALUES(30, 'Meadows')
INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')

SELECT * FROM Test1

SELECT Column1= IDENTITY(INT, 1,1),
Column2,
Column3
INTO Test2
FROM Test1
GROUP BY Column2, Column3

SELECT * FROM Test2

if you require a identity value to all your data then use the following; Drop the table if it is already exists.

DROP TABLE Test2;

SELECT Column1= IDENTITY(INT, 1,1),
Column2,
Column3
INTO Test2
FROM Test1

SELECT * FROM Test2

If you require a identity value to the unique records then

DELETE FROM Test2 WHERE Column1 IN
(
SELECT MAX(Column1) FROM Test2
WHERE Column2 IN
(
SELECT Column2 FROM Test2
GROUP BY Column2, Column3
HAVING COUNT(*) > 1
)
GROUP BY Column2, Column3
)

SELECT * FROM Test2

In the above scenarios we have an identity column which is mandatory while removing such duplicates from the system. If your table do not have such column then you need to dump all the unique records into another table in which you should maintain one identity column. Refer to the following scenario.

CREATE TABLE Test1(
column1 INT IDENTITY,
column2 INT,
column3 VARCHAR(10)
)

INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')
INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')

SELECT * FROM Test1

Thursday, January 28, 2010

SQL Server - Working with Duplicate Records

Big Data & SQL