The ‘= NULL’ Mistake and other SQL NULL Heresies
The SQL Prompt Best Practice rule checks whether a comparison or expression includes a NULL literal ('NULL'), which in SQL Server, rather than result in an error, will simply always produce a NULL result. Phil Factor explains how to avoid this, and other SQL NULL-related calamities.
SQL Prompt has a code analysis rule (BP011) that checks whether a comparison or expression includes a NULL
literal ('NULL'
). These will always produce a NULL
result. To determine whether a datatype is, or isn’t, NULL
, use IS
NULL
or IS
NOT
NULL
.
Beyond that, working with a nullable datatype in an expression requires use of the COALESCE()
, ISNULL()
or NULLIF()
functions, as appropriate, to deal with NULL
values safely.
What is NULL?
SQL allows any datatype to have a NULL
value. This isn’t the same as a blank string or a zero integer. It boils down to the meaning ‘Unknown’. If you ask SQL theoreticians, they’ll say that it represents missing information and inapplicable information.
Having a NULL
string, for example, isn’t the same as knowing that you have a blank string. It means you don’t know what is in the string. If you are doing a database of your ancestors, and you put the value of NULL
in for your great grandmother, it means that you don’t yet know her name, but certainly doesn’t imply that you haven’t got a great grandmother. NULL
is a state, not a value. It is useful to have a state for data, and the father of the relational model, E.F.Codd, wanted two of them: ‘A-Values’ and ‘I-Values’, representing ‘Missing But Applicable’ and ‘Missing But Inapplicable’. Sadly, there is but one.
NULLs in a comparison
If you’re not used to the concept of ‘null’, and you treat it like a number such as pi, then strange perplexing things will happen with the result of SQL Expressions. Many developers get into trouble when they want a WHERE
clause that filters on a column with a NULL
value in it.
Imagine you have a database of Irish saints, each with their saints’ days, contained in a date column. If a saint doesn’t have a saint’s day, then this column is NULL
, for that row.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
CREATE TABLE #IrishSaintsDay (name NVARCHAR(50), saintsDay DATETIME2(7)); INSERT INTO #IrishSaintsDay VALUES (N'Abbán moccu Corbmaic', N'2019-03-16T00:00:00'), (N'Abel of Reims', N'2019-08-05T00:00:00'), (N'Abran', NULL), (N'Adalgis', NULL), (N'Adomnán', NULL), (N'Adomnán of Coldingham', NULL), (N'Áed mac Bricc', N'2019-11-10T00:00:00'), (N'Áed of Clonmore', N'2019-04-11T00:00:00'), (N'Aidan of Lindisfarne', NULL), (N'Ailbe of Emly', N'2019-09-12T00:00:00'), (N'Ailerán', NULL), (N'Alto of Altomünster', NULL), (N'Andrew the Scot', NULL), (N'Assicus (Tassach)', N'2019-04-14T00:00:00'), (N'Athracht', N'2019-08-11T00:00:00'), (N'Autbod', NULL), (N'Baíthéne', NULL) |
The full table build with 190 saints (there are plenty more) is included with this article (CreateIrishSaintsDays) so you can play along.
You want to find all the saints that don’t have a saint’s day:
1 |
SELECT Name FROM #IrishSaintsDay AS ISD WHERE ISD.saintsDay = NULL |
This seems to go well. No rows are returned, which surely means there are no Irish saints without a saint’s day, which is a relief. Wrong! If you are using SQL Prompt, you will see a squiggly green line under that ISD.saintsDay = NULL
expression.
On investigation you will see this…
The expression ISD.saintsDay = NULL
means “where the saints day is equal to something unknown“. This sort of expression must return NULL
, regardless of whether you compare it to a value or to another NULL
. Putting a NULL
in the value for the saints day could mean we don’t know the saint’s day, or it could mean that the saint never had his or her day, so how can one ever say that something unknown is the same as something else unknown?
To be accurate, the SQL Standard says that this use of WHERE
<expression> = NULL
is invalid syntax and shall lead to an error message or an exception. Unfortunately, in SQL Server it doesn’t: it just returns ‘incorrect’ results.
What you meant to ask is whether the value in the datatype is unknown, not whether it is equal to something that is unknown. In other words, the SQL standard insists that you cannot use =
NULL
, or <>
NULL
, and instead should use IS
NULL
or IS
NOT
NULL
.
We change it…
NULLs in expressions
It isn’t just a comparison with NULL
that will always return NULL
. When you do arithmetic on numeric datatypes, and one of the operands or arguments is NULL
, it is likely that the result will be NULL
. Imagine we were to try running this query:
1 |
SELECT 22.0/(7.0*NULL) |
Yes, it returns NULL
and you get the warning from SQL Prompt with the squiggly green line for the BP011 warning. However, it is an odd mistake to make.
What is more likely is that a developer creates a variable and forgets to initialize it.
1 2 3 4 |
DECLARE @pi FLOAT DECLARE @MutiplicationFactor INT SELECT @Pi=22.0/(7.0*@MutiplicationFactor) SELECT @pi |
The result will be null. Obviously, if you have seven apples and you multiply that by unknown quantity of apples, then the result is an unknown quantity.
We hurriedly initialize the variable with a value.
1 2 3 4 5 |
DECLARE @pi FLOAT DECLARE @MutiplicationFactor INT SELECT @MutiplicationFactor =1 SELECT @Pi=22.0/(7.0*@MutiplicationFactor) SELECT @pi |
This returns 3.14285714285714 (which isn’t quite pi, I know, but doing the Gregory-Leibniz Nilakantha approximation might be a distraction).
So, why didn’t SQL Prompt give a squiggly line when we used a null variable in an expression? It will only protest if you use a null literal (NULL
) in either an expression or a condition. It is doing a static test so cannot detect if you have a NULL
value in a variable. Even if it could do so, it wouldn’t be a good thing to flag, because then we’d get warning on code that uses nullable variables in the useful and valuable ways that you intended. NULL
s can save a lot of tedious work in building strings from components, some of which are optional (concatenation); it also makes building lists easier and in creating reports where the data isn’t continuous.
NULLs in string concatenations
How could we use NULL
to our advantage? Well, to show you an obvious example, we’ll use the person.person
table in AdventureWorks
that has the various parts of what we understand to be a person’s name.
Some of these constituents must be there (NOT
NULL
) , such as FirstName
and LastName
, but others can be NULL
, meaning you just don’t know. You may not know a person’s preferred title of address, middle names or suffix. If you know that someone doesn’t have a middle name, for example, you can legitimately put a blank string. Alternatively, for a part of the name that isn’t necessary for searching and sorting, or identification then entering ‘unknown’ is a reasonable option. Had the creators of the data known definitively that no-one had suffixes, for example, then there would have been no need to use a nullable column to store it.
We are told by the moody boss that he needs a list of all names in the Adventureworks
database, not just first and last names but all parts of the name that we have. We’ll do a subset of the ten most recent customers.
1 2 3 |
SELECT TOP 10 Title + FirstName + MiddleName + LastName + Suffix AS customer FROM Person.Person ORDER BY LastName DESC; |
That didn’t go so well did it? Again, we didn’t get BP011 warning because SQL Prompt, as discussed, merely detects NULL
literals in expressions or conditions.
There were NULL
s in every row (the suffix was, in every case, unknown) and they ‘propagated’ to create a resulting NULL
. It makes sense. If you concatenate strings and you don’t know what is in one of the strings, the result of the concatenation must indeed be NULL
. From the perspective of SQL Server, you asked a silly question and got a silly, but correct, answer.
What you meant to say was ‘if you don’t know it, leave that bit out’, and for this task, the COALESCE
function is your friend. You just give it a list of parameters. It evaluates each in turn until it finds one that is not NULL
, and it returns this as a value. If they are all NULL
, it returns a NULL
. In this case, we can use it to return a blank string if the value in the column is NULL
. Its potential power in dealing with NULL
s is great.
1 2 3 4 5 |
SELECT TOP 10 Coalesce(Title, '') + ' ' + FirstName + ' ' + Coalesce(MiddleName, '') + ' ' + LastName + Coalesce(Suffix, '') AS customer FROM Person.Person ORDER BY LastName DESC; |
After your excitement dies down, you notice that the list doesn’t left-align; there are extra spaces here and there that will trip you up subsequently. You could have tidied this up with a mix of LTRIM
, RTRIM
and REPLACE
, but then other database developers would smirk at it when doing code reviews. So, you do this:
1 2 3 4 5 |
SELECT TOP 10 Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') AS customer FROM Person.Person ORDER BY LastName DESC; |
Suddenly, you are over the pain barrier and letting NULL
s work for you. If there is a value in the Title
column, you add a space to it and return it. If the Title column has no value, then NULL
, when added to a string with a space in it, still results in a NULL
, and so the COALESCE
function examines the next parameter in the list and returns the blank string. Some columns in the list are nullable and some aren’t, so we can put spaces, or any other delimiter we want, between the column values, according to whether they are NULL
. We’ve done all this with only slight changes.
However, the moody boss is not happy. It turns out he wanted a comma-delimited list. OK. We now know all about dealing with delimiters because a space character is a delimiter of sorts.
We create a variable, which by its very nature is nullable. We want a comma delimiter between each string and if we just join strings together that we’ve added to a ‘,’ delimiter, we get a comma at the head of the list. To remove it, we use the COALESCE
function to make sure that the first name isn’t preceded with a comma. If the variable has never been set it is NULL
, so it is easy to have special conditions for the first string to be added to a variable. If you add a comma to a NULL
you get a NULL
, so before any values are inserted into the variable, you won’t get the initial delimiter.
1 2 3 4 5 6 7 |
DECLARE @CustomerList NVARCHAR(4000) SELECT TOP 10 @CustomerList=coalesce(@CustomerList+', ','') + Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') FROM Person.Person ORDER BY LastName DESC; SELECT @CustomerList |
Which gives the result that we want:
1 2 3 |
Mr. Michael J. Zwilling, Michael J Zwilling, Jake Zukowski, Ms. Judy N. Zugelder, Ms. Patricia M. Zubaty, Ms. Carla J. Zubaty, Karin Zimprich, Ms. Karin Zimprich, Tiffany E Zimmerman, Marc Zimmerman |
However, the moody boss now comes up with an odd request. He only wants people in the list who have a single initial, for middle name, or nothing at all. Those with full middle names will be dealt with in a separate list.
Simple. We just exclude those whose middle-names are more than one character long:
1 2 3 4 5 |
SELECT TOP 10 Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') AS customer FROM Person.Person WHERE LEN(middlename) < 2 ORDER BY LastName DESC; |
Unfortunately, this does not work. It gives us all those with single initials (so we no longer see any who have a single initial followed by a dot). However, we’ve lost all the people with no middle name:
If the middle name is NULL, then the answer to “is NULL less that 2” is unknown, and the row is excluded. COALESCE()
would again have come to the rescue.
1 2 3 4 5 |
SELECT TOP 10 Coalesce(Title + ' ', '') + FirstName + Coalesce(' ' + MiddleName + ' ', ' ') + LastName + Coalesce(' ' + Suffix + ' ', '') AS customer FROM Person.Person WHERE LEN(Coalesce(middlename,'')) < 2 ORDER BY LastName DESC; |
This now gives the correct answer.
COALESCE()
is by far the most useful function for dealing with nullable columns or variables safely. However, there are two other functions that I’ll mention just briefly:
ISNULL()
– does less thanCOALESCE()
and could almost have been devised to confuse the unwary because it looks likeIS
NULL
, which means something else. It is also not standard SQL.NULLIF()
– returnsNULL
if two expressions passed as parameters are equal. This is SQL Standard, but is just syntactic candy on top of theCASE
expression, which is easier to understand.
For more details see How to Get NULLs Horribly Wrong in SQL Server.
NULLs in aggregate values
Now, if you are adding up a list of numbers, you will want the NULL
s to propagate. If you add a list of numbers, and you don’t know one of the numbers, then the result should be NULL
, meaning that there is no way of knowing what the sum of that list is. Let’s put this to the test.
1 2 3 |
SELECT Sum(TheValue) AS [sum of The First value], Sum(OtherValue) AS [sum of The other value] FROM (VALUES (1, 1), (3, 1), (7, 2), (NULL, 0)) AS f (TheValue, OtherValue); |
Surprisingly, from what we’ve learned so far, this SQL aggregation expression returns 11 for the sum of the first column rather than NULL…
…but it will warn you that NULL
values were eliminated.
Why was the NULL
ignored rather than propagated? The SQL Standard, I believe, dropped the ball. Officially, ‘The inconsistencies in the SQL standard mean that it is not possible to ascribe any intuitive logical semantics to the treatment of nulls in SQL”.
The SQL Standard decided that all aggregate functions except COUNT(*)
and grouping functions should ignore NULL
s. The COUNT(*)
function returns either a number or zero. All the other aggregate functions will ignore rows that contain NULL
and return NULL
only if the data set contains no rows or contains only rows with NULL
s as arguments to the aggregate function.
Ignoring NULL
s is not the same as replacing NULL
with zero. We can demonstrate this with our existing example by getting the average calculated in different ways.
1 2 3 4 5 6 |
SELECT Avg(TheValue) AS [Average of The First value], Sum(TheValue)/Count(*) AS [Calculated Average of The First value], Avg(OtherValue) AS [Average of The second value], Sum(OtherValue)/Count(*) AS [Calculated Average of The second value] FROM (VALUES (Convert(FLOAT,1),Convert(FLOAT, 1)), (3, 1), (7, 2), (NULL, 0)) AS f (TheValue, OtherValue); |
For the TheValue
column, which had a NULL
value, the calculated average wasn’t the same as the average produced by the aggregate function. Why was that? The COUNT(*)
returned the number of rows whether or not they contained null values, whereas the AVG
function ignored the row if the column contained a NULL
.
What I should have done was to specify the column in the COUNT()
aggregate. Then it can ignore rows whenever that column has a NULL
in it:
1 2 3 4 5 6 |
SELECT Avg(TheValue) AS [Average of The First value], Sum(TheValue)/Count(TheValue) AS [Calculated Average of The First value], Avg(OtherValue) AS [Average of The second value], Sum(OtherValue)/Count(OtherValue) AS [Calculated Average of The second value] FROM (VALUES (Convert(FLOAT,1),Convert(FLOAT, 1)), (3, 1), (7, 2), (NULL, 0)) AS f (TheValue, OtherValue); |
OK. We’re feeling less confused now. When you ask for the sum or average, you are asking for all non-NULL
values in the column to be aggregated. If you are doing calculations which rely on the value of the count of the values, you must always specify the column in the COUNT()
aggregate so as to count only the non-null rows for that column.
NULLs with grouping, sorting and deduplicating
The SQL Standard says that NULL
is a state of a datatype rather than a value. So, how then do you sort data? You might expect from the behavior of NULL
values in expressions that sorting by a nullable column will result in those rows that have a NULL
value in a sort column being scattered randomly around the sort order. Not so. The NULL
s are sorted as if they were the same value, all together. Where in the sort order should they be? The SQL Standard allows you to specify whether they float to the top or sink to the bottom of the sort order with the keywords NULLS FIRST
and NULLS LAST
. SQL Server is always NULLS
FIRST
.
We can try this with the saints of Ireland and their saint’s days.
1 |
SELECT * FROM #IrishSaintsDay ORDER BY saintsDay, name DESC; |
…and…
1 |
SELECT * FROM #IrishSaintsDay ORDER BY saintsDay, name ASC; |
You will find that in both cases, the result starts with NULL
s.
Like sorting, grouping is done as if all NULL
values are considered equal, in contrast to the way that they are treated in an expression. They are collected into a single group.
1 |
SELECT Count(*) AS [Number of saints], SaintsDay FROM #IrishSaintsDay GROUP BY #IrishSaintsDay.saintsDay |
62 saints have an unknown or non-existent saint’s day.
What about UNION
when a column has NULL
s in it? The UNION
operation will remove all duplicate rows. However, what if two rows both have NULL
s for one or columns but are otherwise-identical? If the database engine were unable to compare NULL
s, then it would be unable to say ‘deduplicate’ during UNION
operations. We can easily test this.
1 2 3 4 |
SELECT Count(*) FROM (SELECT * FROM #IrishSaintsDay WHERE name <'d' UNION SELECT * FROM #IrishSaintsDay)saints |
The result is still 190, which is the number in the original table. The UNION
operation compared rows with NULLs but were otherwise the same, decided they were identical and removed the duplicate row!
If we had done a UNION
ALL
instead of a UNION
, we’d have found that there were 272 rows because UNION
ALL
doesn’t remove duplicates from the result.
1 2 3 4 |
SELECT Count(*) FROM (SELECT * FROM #IrishSaintsDay WHERE name <'d' UNION all SELECT * FROM #IrishSaintsDay)saints |
SQL also gives you the answer you expect ,if you use DISTINCT
:
1 2 3 4 |
SELECT DISTINCT name, SaintsDay FROM (SELECT * FROM #IrishSaintsDay WHERE name <'d' UNION all SELECT * FROM #IrishSaintsDay)saints |
Again, only 190 saints are found.
To cut a long story short, whether you use PARTITION
BY
, DISTINCT
, GROUP BY
, UNION
, INTERSECT
, ORDER
BY
and EXCEPT
, SQL plays nice, figures out the answer you expect, and treats NULL
values as if they were a single value, in contrast to the way that it treats then in expressions or comparisons.
You will, however, be thinking to yourself, like professor Ron van der Meyden, ‘The inconsistencies in the SQL standard mean that it is not possible to ascribe any intuitive logical semantics to the treatment of nulls in SQL‘.
Summary
NULL
values are an essential part of the relational database, because data can be both intentionally and unintentionally missing, and it can be unknown.
NULL
can be assigned, but using ‘=
NULL
‘, ‘<>
NULL
‘, or any other comparison operator, in an expression with NULL
as a value, is illegal in SQL and should trigger an error. It can never be correct. The expression will always return NULL
.
Comparisons or arithmetic involving either a variable datatype or a nullable column are unlikely to be safe unless used with COALESCE()
, NULLIF()
or ISNULL()
. However these functions can be very handy, if used guardedly.
Grouping, sorting and ‘uniquifying’ will behave intuitively by ignoring rows where the columns being specified have NULL
s in them, but when doing sums and averages, for example, care needs to be taken to specify the column in the COUNT()
aggregate, rather than using the '*'
parameter.
NULL
s can trip up a programmer, yet provide a very versatile way of handling data, providing a range of software devices for dealing successfully with commercial database processes and reports. Don’t avoid NULL
s; instead, get familiar with them and their use.