I have 2 tables (half a million rows each) that should be identical, but one has more records than the other, for whatever reason. What is the best, most efficient way to determine which records are different? (There are no duplicates in either table.) I have heard subselects, joins, etc. suggested by others.
"Best" and "most efficient" are not necessarily congruent. Sometimes a good solution (easy to write, easy to understand, easy to maintain) performs horribly. Sometimes the most efficient solution requires query gyrations that I would not classify as a good solution. In your case, where the tables are of reasonable size, indexes will be important no matter what you do.
You are right that you can achieve what you want several ways -- subselects, joins, and special operators.
Let's use table1 and table2 as our example tables, each with a key column called id, and let's assume we want to find the rows in each table that have no match in the other.
The subselect method goes like this --
select table1.columns from table1 where not exists (select 1 from table2 where table2.id = table1.id)
This gives you all the rows in table1 that don't have matching rows in table2. Note that the subselect has to select something after the word SELECT, so I conveniently choose the integer 1 instead of a table column -- it could be anything, really (including the asterisk, but that's a different subject for another day). Since NOT EXISTS only ever evaluates to true or false, the subselect doesn't need to return anything other than an indication that a row was or was not found. (If this sounds familiar, it's my standard spiel about the EXISTS subselect, which I last used in a previous answer.)
We also want to check for rows in table2 that don't have matching rows in table1, and this second query is like the previous one, but with the tables reversed --
select table2.columns from table2 where not exists (select 1 from table1 where table1.id = table2.id)
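To try the two NOT EXISTS queries on a small scale, here is a minimal sketch that runs them through Python's built-in sqlite3 module. The in-memory SQLite database, the name column and the sample rows are all assumptions made up for illustration; only the shape of the queries comes from the answer above.

```python
import sqlite3

# Hypothetical sample data: id 3 exists only in table1, id 4 only in table2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (id integer primary key, name text);
    create table table2 (id integer primary key, name text);
    insert into table1 values (1, 'ann'), (2, 'bob'), (3, 'cat');
    insert into table2 values (1, 'ann'), (2, 'bob'), (4, 'dan');
""")

# Rows of table1 with no matching id in table2.
only_in_table1 = conn.execute("""
    select table1.id, table1.name from table1
    where not exists (select 1 from table2 where table2.id = table1.id)
    order by table1.id
""").fetchall()

# Same query with the tables reversed.
only_in_table2 = conn.execute("""
    select table2.id, table2.name from table2
    where not exists (select 1 from table1 where table1.id = table2.id)
    order by table2.id
""").fetchall()

print(only_in_table1)  # [(3, 'cat')]
print(only_in_table2)  # [(4, 'dan')]
```

Rows 1 and 2 have a partner in the other table, so neither query returns them.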
The second method involves using left joins instead of subselects --
select table1.columns from table1 left join table2 on table1.id = table2.id where table2.id is null
This may sound a little weird, joining on a column and checking it for nulls, but that is exactly what to do to find those rows of table1 which do not have a matching row from table2. In a left join, the database places nulls into all the columns from table2 when there is no matching row from table2.
And to find all the rows from table2 that are different, that don't have a match in table1, we could use "table1 right join table2" instead of a left join, but that just confuses things unduly and I prefer to write left joins in all cases --
select table2.columns from table2 left join table1 on table2.id = table1.id where table1.id is null
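The null trick is easier to believe once you see the raw left join output before it is filtered. The sketch below, again using Python's sqlite3 module with invented sample rows (both are assumptions, not part of the original question), first prints the unfiltered join and then applies the IS NULL test in each direction.

```python
import sqlite3

# Hypothetical sample data: id 3 exists only in table1, id 4 only in table2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (id integer primary key, name text);
    create table table2 (id integer primary key, name text);
    insert into table1 values (1, 'ann'), (2, 'bob'), (3, 'cat');
    insert into table2 values (1, 'ann'), (2, 'bob'), (4, 'dan');
""")

# Unfiltered left join: table2.id comes back as NULL (None in Python)
# for the table1 row that has no partner.
joined = conn.execute("""
    select table1.id, table2.id from table1
    left join table2 on table1.id = table2.id
    order by table1.id
""").fetchall()
print(joined)  # [(1, 1), (2, 2), (3, None)]

# Keeping only the NULL rows leaves the unmatched rows of table1.
only_in_table1 = conn.execute("""
    select table1.id, table1.name from table1
    left join table2 on table1.id = table2.id
    where table2.id is null
""").fetchall()

# Reversed: unmatched rows of table2, still written as a left join.
only_in_table2 = conn.execute("""
    select table2.id, table2.name from table2
    left join table1 on table2.id = table1.id
    where table1.id is null
""").fetchall()

print(only_in_table1)  # [(3, 'cat')]
print(only_in_table2)  # [(4, 'dan')]
```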
The third method is the "best" solution in my opinion, because it uses SQL language operators intended for just this situation. However, not all databases implement these operators.
To find all the rows of table1 that do not exist in table2, use this query --
select table1.columns from table1 except select table2.columns from table2
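The EXCEPT operator is called MINUS in Oracle. Note that EXCEPT compares every column in the two select lists, not just id, so both lists need the same number of columns with compatible types, and a row whose id exists in both tables but whose other values differ will also be reported.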
We also want the rows of table2 that aren't in table1, and the query for that is, yup, you guessed it --
select table2.columns from table2 except select table1.columns from table1
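Here is a small sketch of the EXCEPT queries, run through Python's sqlite3 module (SQLite happens to support EXCEPT; the database choice, the name column and the sample rows are assumptions for illustration). The sample data deliberately includes id 2 in both tables with different names, to show how whole-row comparison treats it.

```python
import sqlite3

# Hypothetical sample data: id 3 is only in table1, id 4 only in table2,
# and id 2 is in both tables but with a different name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (id integer primary key, name text);
    create table table2 (id integer primary key, name text);
    insert into table1 values (1, 'ann'), (2, 'bob'), (3, 'cat');
    insert into table2 values (1, 'ann'), (2, 'bobby'), (4, 'dan');
""")

# Rows of table1 that do not appear, column for column, in table2.
in_table1_not_table2 = conn.execute("""
    select id, name from table1
    except
    select id, name from table2
    order by id
""").fetchall()

# And the reverse direction.
in_table2_not_table1 = conn.execute("""
    select id, name from table2
    except
    select id, name from table1
    order by id
""").fetchall()

print(in_table1_not_table2)  # [(2, 'bob'), (3, 'cat')]
print(in_table2_not_table1)  # [(2, 'bobby'), (4, 'dan')]
```

Row 2 shows up in both results because its name differs, which the id-only NOT EXISTS and left join queries would not report.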
As for efficiency, the database will determine its own access strategy -- for instance, subselects are usually implemented as though they were joins anyway. I haven't seen how the EXCEPT operator is implemented, but it's fair to assume that it will be just as efficient as the other methods. Don't forget your indexes on the primary keys!
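Rather than guessing at the access strategy, you can usually ask the database to show it. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module, purely as an assumed stand-in for whatever database you actually run; other products have their own equivalents, and the exact wording of the plan varies between versions, so the code only checks whether the primary key is mentioned.

```python
import sqlite3

# Hypothetical empty tables; the planner only needs the schema here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (id integer primary key, name text);
    create table table2 (id integer primary key, name text);
""")

query = """
    select table1.id from table1
    where not exists (select 1 from table2 where table2.id = table1.id)
"""

# The last column of each plan row is a readable description of one step.
plan_details = [row[-1] for row in conn.execute("explain query plan " + query)]
for detail in plan_details:
    print(detail)

# True when some step looks rows up by primary key instead of scanning.
uses_primary_key = any("PRIMARY KEY" in detail for detail in plan_details)
print(uses_primary_key)
```

If the lookup into table2 shows up as a full scan instead, that is the cue to add the missing index.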
This was first published in May 2001