Problem solve Get help with specific problems with your technologies, process and projects.

Finding and deleting duplicate records in a table

I have a table with one field, UserID. This table inadvertently had duplicate UserIDs appended. How can I delete duplicate UserIDs and leave one record for each ID so I don't have to use a "SELECT DISTINCT USERID..." when querying the table?

Let me start by saying that it is much, much easier to set up constraints on a table initially than to try to fix data anomalies after the table is populated. If you only want one instance of a user in the table, then a PRIMARY KEY constraint would definitely be in order. Furthermore, any host application or other procedural code should be able to handle the exception raised when someone or some process tries to insert a duplicate.

After having said that, I realize you still have a problem to solve. I know of no way in standard SQL to solve the problem without copying distinct duplicates out of the table first, removing the duplicates and then reinserting them. I'll illustrate this concept with an example, and then, I'll talk about a way to do it with non-standard SQL. Here is our table and our data:

 create table Users ( UserID varchar(30) ); insert into Users values ('MAX'); insert into Users values ('STEPHANIE'); insert into Users values ('BOB'); insert into Users values ('BOB'); insert into Users values ('FRED'); insert into Users values ('SALLY'); insert into Users values ('FRED'); insert into Users values ('SALLY'); insert into Users values ('BOB'); insert into Users values ('BOB'); insert into Users values ('SALLY'); insert into Users values ('BOB'); insert into Users values ('SALLY');

To find the UserIDs that are duplicated we use the HAVING clause:

 select UserID, count(*) from Users group by UserID having count(*) > 1;

The results are what we would expect:

 USERID COUNT(*) ---------- ---------- BOB 5 FRED 2 SALLY 4

So we would need to remove four BOBs, one FRED, and three SALLYs to eliminate the duplication in the table. The problem stems from the fact that standard SQL does not have support for accessing rows other than by the values of their columns. There's really no way to say, "Delete all but the first one." This is by design, as the relational model which SQL attempts to implement doesn't support duplicate rows. One solution might be to create an interim table to store a distinct copy of the duplicated rows while we clean house. I'll illustrate. First we create a temporary table and populate it with distinct copies of those rows that are duplicated. The GROUP BY USERID clause ensures the rows being inserted are unique.

 create table TemporaryUsers ( UserID varchar(30) ); insert into TemporaryUsers select UserID from Users group by UserID having count(*) > 1;

Then we remove those rows that are duplicated from the original table.

 delete from Users t where UserID in ( select UserID from TemporaryUsers );

Then we repopulate the original table with the values in our temporary table, and drop the temporary table because it is no longer needed.

 insert into Users select UserID from TemporaryUsers; drop table TemporaryUsers;

Here's what the Users table looks like now:

 select * from Users; USERID ---------- MAX STEPHANIE BOB FRED SALLY

Some vendors have extended their implementation of SQL to uniquely identify rows other than using the primary key, usually with some physical addressing. By deleting all but the first physical row, we can accomplish the same results. Oracle has a ROWID pseudo column which can be used to do the job. Here's the Oracle SQL:

 delete from Users t where rowid != ( select min(rowid) from Users where UserID = t.UserID );

Dig Deeper on Using Oracle PL-SQL