NULL Value and Aniti Semi Join in SQL - Watch out the Pitfall
March 25th, 2007 by Andrew Chen
I use this Anti Semi Join query at work frequently,
select * from #tableA a where a.ID not in (select ID from #tableB b)
The purpose of the query is to return all the records in temporary table #tableA that their IDs do not appear in temporary table #tableB. This is a typical Anti Semi Join operation.
Contrary to Anti Semi Join the Semi Join query,
select * from #tableA a where a.ID in (select ID from #tableB b)
is to find the records in #tableA that their IDs appear in #tableB. Only the records
Anyway I remember I experienced an interesting problem when I did the similar Anti Semi Join query. I remember I had around 10,000 records in #tableA and only around 1,000 thousand records in #tableB. The ID in #tableA had an unique index on it. When I ran the query it returned zero records. It really didn’t make any sense to me. As you can see there were 10,000 unique IDs in #tableA. They could not possibly all appear in #tableB which had only 1,000 records. The Anti Semi Join query,
select * from #tableA a where a.ID not in (select ID from #tableB b)
should have at least returned 10,000 - 1,000 = 9,000 rows. However it returned zero rows. What on earth was happening? I remember it took me a while to find out that one of the records in #tableB had its value of ID set to NULL and that was the culprit.
NULL is a very special value. It means that the value can not be determined. You can interpret it as not having any value. So when you compare something with a specific value to something that has no value the result is neither true nor false. That is the behavior when ANSI_NULLS is set to ON. Basically, we should not compare to NULL value. Let’s look at why the Anti Semi Join query yields zero records. In order for this query to return a record from #tableA, the computer will compare its ID to every single ID in #tableB. Only when all these comparisons yield false, in other words only when it doesn’t equal to any of the IDs in #tableB, will it be returned in the result set. Since there is a NULL ID in #tableB, not all the comparison yields false because the comparison to NULL doesn’t yield false. Hence the record will not be returned in the result set. That applies to every single record in #tableA. Therefore the final result contains no records.
To make it clear here is an example
/*create #tableA and #tableB and insert some records*/
Create table #tableA (ID int, MemberName varchar(50))
Create table #tableB (ID int)
Insert into #tableA values (1, 'Andrew Chen')
Insert into #tableA values (2, 'Marvin Chen')
Insert into #tableA values (3, 'Chen Andrew')
Insert into #tableB values (1)
/*do the Anti Semi Join query*/
select * from #tableA a where a.ID not in (select ID from #tableB b)
The result set is as follow. 2 and 3 do not exist in #tableB so they are returned
ID MemberName
----------- ----------
2 Haishan Chen
3 Chen Andrew
Now you insert a NULL into #tableB
Insert into #tableB values (NULL)
Run the query again you will find nothing is return.
Null value is not comparable. That is the behavior when ANI_NULLS is set to ON.
If you run SET ANI_NULLS OFF and the run the query again you will find 2 and 3 are returned. When ANI_NULLS is off NULL value can be compared. Most tools connect to SQL server using ODBC and OLEDB and both library call SET ANI_NULLS ON for you when connecting to SQL Server. So the default behavior is that NULL value is not comparable.


No comments yet.