Thursday, January 28, 2010

SQL Server - Working with Duplicate Records

The following examples deals with how to retrieve or find duplicate records and how to remove them in various scenarios. Someone has asked me that he had imported a file twice by mistake and which caused duplicates in the data and he wanted to revert back it. In such scenarios, the following code will help you out even if there is no unique identity in the data.

To retrieve the duplicate records, use the following code
SELECT Column2, Column3, COUNT(*) FROM Test1
GROUP BY Column2, Column3
ORDER BY COUNT(*) DESC


Or you can use the following

SELECT Column2, Column3, COUNT(*) RepeatedCount FROM Test1
GROUP BY Column2, Column3
ORDER BY RepeatedCount DESC


How to remove such duplicates?

DELETE FROM Test1
WHERE column1 NOT IN
(SELECT MAX(Column1) FROM Test1
GROUP BY Column2, Column3)


However, the above code will not help you if you have multiple duplicate records. Moreover if there is no duplicate for any record, such records will also be deleted. Refer to the following scenario.


CREATE TABLE Test1(
column2 INT,
column3 VARCHAR(10))

INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')
INSERT INTO Test1 VALUES(30, 'Meadows')
INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')

SELECT * FROM Test1

SELECT Column1= IDENTITY(INT, 1,1),
Column2,
Column3
INTO Test2
FROM Test1
GROUP BY Column2, Column3

SELECT * FROM Test2


if you require a identity value to all your data then use the following; Drop the table if it is already exists.


DROP TABLE Test2;

SELECT Column1= IDENTITY(INT, 1,1),
Column2,
Column3
INTO Test2
FROM Test1

SELECT * FROM Test2


If you require a identity value to the unique records then


DELETE FROM Test2 WHERE Column1 IN
(
SELECT MAX(Column1) FROM Test2
WHERE Column2 IN
(
SELECT Column2 FROM Test2
GROUP BY Column2, Column3
HAVING COUNT(*) > 1
)
GROUP BY Column2, Column3
)

SELECT * FROM Test2


In the above scenarios we have an identity column which is mandatory while removing such duplicates from the system. If your table do not have such column then you need to dump all the unique records into another table in which you should maintain one identity column. Refer to the following scenario.


CREATE TABLE Test1(
column1 INT IDENTITY,
column2 INT,
column3 VARCHAR(10)
)

INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')
INSERT INTO Test1 VALUES(10, 'Greens')
INSERT INTO Test1 VALUES(20, 'Springs')

SELECT * FROM Test1

No comments:

Post a Comment

Big Data & SQL

Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...