The Task Defined.
Imagine this: The customer runs the application that you’ve designed, and cannot see any sign of business transactions over a time period where he’d expect to see them. Your manager asks you to investigate. He will almost certainly ask you to find all of the missing days or hours in a date range where no data has been recorded: The date range can be several years.
The situation may become even worse; missing data can have unpleasant consequences, and that fact will make both the customer and the boss very nervous.
As a result, you will be under stress: you have to reassure your boss and the customer, and you need to do that in a very limited amount of time.
In this article, we will discuss a few SQL, one-query techniques that can help you to find the missing dates.
Preparations
Firstly, you will need to create and load an auxiliary table that will store the sequence of whole numbers. The volume of numbers in that table will be equal to the number of days in the year 2007 (see Listing1):
Listing 1– Create and load an auxiliary tableSET NOCOUNT ON;
DECLARE @startDate smalldatetime, @endDate smalldatetime
SELECT @startDate = ’Jan 01, 2007′,
@endDate = DATEADD(yy, 1, @startDate);
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘sequence’)) AND xtype = ’U’)
DROP TABLE sequence;
CREATE TABLE sequence(num INT NOT NULL PRIMARY KEY);
WITH numbers AS
(
SELECT 1 AS num
UNION ALL
SELECT num + 1 FROM numbers
WHERE num < (SELECT DATEDIFF(dd, @startDate, @endDate))
)
INSERT INTO sequence
SELECT num FROM numbers
OPTION (MAXRECURSION 0);
Secondly, generate some test data and load that data into a sample table. Then produce the gaps, using the next algorithm: delete dates, where the number of days since Jan 01, 1900, divided by 4 or 5 or 6 has zero in remainder.
The second preparation step can be done this way (see Listing2):
Listing2. Create and load a sample tableDECLARE @iniDate datetime;
SELECT @iniDate = 'Dec 31, 2006';
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID('sampleDates')) AND xtype = 'U')
DROP TABLE sampleDates;
CREATE TABLE sampleDates(theDate smalldatetime PRIMARY KEY)
INSERT INTO sampleDates
SELECT DATEADD(dd, num, @iniDate) theDate
FROM sequence;
-- Create gaps
DELETE sampleDates
WHERE DATEDIFF(dd, 0, theDate)%3 = 0
OR DATEDIFF(dd, 0, theDate)%4 = 0
OR DATEDIFF(dd, 0, theDate)%5 = 0;
SELECT * FROM sampleDates;
Results:
theDate
-----------------------
2007-01-02 00:00:00.000
2007-01-03 00:00:00.000
2007-01-06 00:00:00.000
2007-01-09 00:00:00.000
. . . . . . . . . . . .
2007-12-20 00:00:00.000
2007-12-23 00:00:00.000
2007-12-25 00:00:00.000
2007-12-28 00:00:00.000
2007-12-29 00:00:00.000
(146 row(s) affected)
Now that we have the auxiliary table and the sample data, we can try out a few techniques.
Technique #1 -The simple list
The first technique is quite common and uses an outer join (see Listing3):
Listing3. Common solution– The solution, where missing days are represented by NULLs
SELECT theDate
FROM sequence t1 LEFT OUTER JOIN sampleDates t2
ON DATEADD(dd, t1.num, ’Dec 31, 2006′) = t2.theDate;
Results:
theDate
———————–
NULL
2007-01-02 00:00:00.000
2007-01-03 00:00:00.000
NULL
NULL
2007-01-06 00:00:00.000
NULL
NULL
2007-01-09 00:00:00.000
. . . . . . . . . . . .
2007-12-20 00:00:00.000
NULL
NULL
2007-12-23 00:00:00.000
NULL
2007-12-25 00:00:00.000
NULL
NULL
2007-12-28 00:00:00.000
2007-12-29 00:00:00.000
NULL
NULL
(365 row(s) affected)
– The solution that shows only missing days
SELECT DATEADD(dd, num, ’Dec 31, 2006′) missingDate
FROM sequence t1 LEFT OUTER JOIN sampleDates t2
ON DATEADD(dd, t1.num, ’Dec 31, 2006′) = t2.theDate
WHERE t2.theDate IS NULL;
Results:
missingDate
———————–
2007-01-01 00:00:00.000
2007-01-04 00:00:00.000
2007-01-05 00:00:00.000
2007-01-07 00:00:00.000
2007-01-08 00:00:00.000
2007-01-10 00:00:00.000
. . . . . . . . . . . .
2007-12-21 00:00:00.000
2007-12-22 00:00:00.000
2007-12-24 00:00:00.000
2007-12-26 00:00:00.000
2007-12-27 00:00:00.000
2007-12-30 00:00:00.000
2007-12-31 00:00:00.000
(219 row(s) affected)
There is nothing wrong with the solutions shown in the Listing3. However, if you need to find the missing hours in the whole year or the missing days in a few years, these queries may produce a long list of dates, which would be soon become unwieldy and difficult to interpret.
Your manager would not appreciate being given this sort of report.
Therefore, you need to find another solution, which will represent the result in a way that is more convenient for reading and analyzing.
Technique #2 -finding the ranges
Think of the missing dates as being a number of gaps in the continuous sequence of the dates. Then, instead of generating each missing date, you can show just the beginning and the end of each gap. The code in the Listing4 demonstrates two variants of such a solution:
Listing 4. The ‘gaps’ solution– First variant without SQL Server date/time function
SELECT t1.col1 AS startOfGap, MIN(t2.col1) AS endOfGap
FROM
(SELECT col1 = theDate + 1 FROM sampleDates tbl1
WHERE NOT EXISTS(SELECT * FROM sampleDates tbl2
WHERE tbl2.theDate = tbl1.theDate + 1)
AND theDate <> (SELECT MAX(theDate) FROM sampleDates)) t1
INNER JOIN
(SELECT col1 = theDate - 1 FROM sampleDates tbl1
WHERE NOT EXISTS(SELECT * FROM sampleDates tbl2
WHERE tbl1.theDate = tbl2.theDate + 1)
AND theDate <> (SELECT MIN(theDate) FROM sampleDates)) t2
ON t1.col1 <= t2.col1
GROUP BY t1.col1;
Results:
startOfGap endOfGap
———————– ———————–
2007-01-04 00:00:00.000 2007-01-05 00:00:00.000
2007-01-07 00:00:00.000 2007-01-08 00:00:00.000
2007-01-10 00:00:00.000 2007-01-10 00:00:00.000
2007-01-12 00:00:00.000 2007-01-13 00:00:00.000
. . . . . . . . . . . . . . . . . . . . . . . .
2007-12-18 00:00:00.000 2007-12-18 00:00:00.000
2007-12-21 00:00:00.000 2007-12-22 00:00:00.000
2007-12-24 00:00:00.000 2007-12-24 00:00:00.000
2007-12-26 00:00:00.000 2007-12-27 00:00:00.000
(108 row(s) affected)
– Second variant with SQL Server date/time functions
SELECT t1.col1 AS startOfGap, MIN(t2.col1) AS endOfGap
FROM
(SELECT col1 = DATEADD(dd, 1, theDate) FROM sampleDates tbl1
WHERE NOT EXISTS(SELECT * FROM sampleDates tbl2
WHERE DATEDIFF(dd, tbl1.theDate, tbl2.theDate) = 1)
AND theDate <> (SELECT MAX(theDate) FROM sampleDates)) t1
INNER JOIN
(SELECT col1 = DATEADD(dd, -1, theDate) FROM sampleDates tbl1
WHERE NOT EXISTS(SELECT * FROM sampleDates tbl2
WHERE DATEDIFF(dd, tbl2.theDate, tbl1.theDate) = 1)
AND theDate <> (SELECT MIN(theDate) FROM sampleDates)) t2
ON t1.col1 <= t2.col1
GROUP BY t1.col1
ORDER BY t1.col1;
Results:
startOfGap endOfGap
———————– ———————–
2007-01-04 00:00:00.000 2007-01-05 00:00:00.000
2007-01-07 00:00:00.000 2007-01-08 00:00:00.000
2007-01-10 00:00:00.000 2007-01-10 00:00:00.000
2007-01-12 00:00:00.000 2007-01-13 00:00:00.000
. . . . . . . . . . . . . . . . . . . . . . . .
2007-12-18 00:00:00.000 2007-12-18 00:00:00.000
2007-12-21 00:00:00.000 2007-12-22 00:00:00.000
2007-12-24 00:00:00.000 2007-12-24 00:00:00.000
2007-12-26 00:00:00.000 2007-12-27 00:00:00.000
(108 row(s) affected)
Both queries from the Listing 4 use self-join and produce the same result in approximately the same time. However, if you test the queries on a larger sample set; for ten years, for example, instead of one year, you will find that the second query runs much faster than the first one.
The reason for such a difference in performance is the way in which the queries deal with dates.
The first query uses an arithmetic operator (+) in order to add one day to the date. The second query for the same operation employs SQL Server built-in function.
(You can read the article “Best Practices for Date/Time Calculations in SQL Server” , which explains, why it is possible to use the arithmetic operators in date/time calculations and why you should avoid such a usage.
The first query from the Listing 4 is typical in showing how the use of arithmetic operations in date/time calculations can decrease the queries’ performance.)
So far, we examined the technique #2 on relatively small data sets.
There are, however, the situations when the amount of data that needs to be processed will be significantly larger. For example, your system may collect the POS (point of sale) transactions from multi-chain department and grocery stores located in different countries and (or) time zones.
In that case, if you asked to find the missing minutes (seconds) in a few months or even years, you may find that the fastest (second) query from the Listing 4 is also inefficient.
Then, you will need to enhance your queries in order to get the acceptable execution time.
One possible enhancement would be to employ an auxiliary column, as shown in the Listing 5:
Listing5. Enhanced solutions for gaps– Solution with identity column
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘sampleDates2′)) AND xtype = ’U’)
DROP TABLE sampleDates2;
CREATE TABLE sampleDates2(n INT IDENTITY(1,1) PRIMARY KEY, theDate datetime);
INSERT INTO sampleDates2(theDate) SELECT theDate FROM sampleDates;
SELECT DATEADD(dd, 1, a.theDate) AS startOfGap,
DATEADD(dd, -1, b.theDate) AS endOfGap
FROM sampleDates2 a INNER JOIN sampleDates2 b
ON a.n = b.n - 1
WHERE DATEDIFF(dd, a.theDate, DATEADD(dd, -1, b.theDate)) <> 0
GO
– Solution with ROW_NUMBER() function
SELECT DATEADD(dd, 1, a.theDate) AS startOfGap,
DATEADD(dd, -1, b.theDate) AS endOfGap
FROM
(SELECT ROW_NUMBER() OVER(ORDER BY (theDate)) AS RowNum, theDate
FROM sampleDates) a
INNER JOIN
(SELECT ROW_NUMBER() OVER(ORDER BY (theDate)) AS RowNum, theDate
FROM sampleDates) b
ON a.RowNum = b.RowNum - 1
WHERE DATEDIFF(dd, a.theDate, DATEADD(dd, -1, b.theDate)) <> 0
GO
– Solution with common table expression
WITH sampleCTE(rowNum, theDate) AS
(
SELECT ROW_NUMBER() OVER(ORDER BY theDate) AS RowNum, theDate
FROM sampleDates
)
SELECT DATEADD(dd, 1, a.theDate) AS startOfGap,
DATEADD(dd, -1, b.theDate) AS endOfGap
FROM sampleCTE a INNER JOIN sampleCTE b
ON a.RowNum = b.RowNum - 1
WHERE DATEDIFF(dd, a.theDate, DATEADD(dd, -1, b.theDate)) <> 0
The queries from the Listing 5 are based on the idea of SQL Server MVP Erland Sommarskog (Stockholm).
All you need to do is to assign the order number to each row in the table and then to compare the values (they can be dates or numbers or even letters) in the adjacent rows, using the order numbers.
The first query in the Listing 5 uses the identity column in order to enumerate the rows in the table.
The second and third queries use ranking function ROW_NUMBER().
All three queries are much faster than the fastest (second) query from the Listing 4.
Note, that queries from the Listing 5 require column theDate (table sampleDates) to be sorted before assigning the order number. Since table sampleDates has a primary key constraint, the clustered index is also created on column theDate. That means you don’t need to sort the data in table sampleDates explicitly: clustered index will do that automatically.
One more thing that you should note is the incomplete results, returned by all the queries from the Listing 4 and Listing 5.
Indeed, Jan 01, Dec 30 and Dec 31 are missing dates in table sampleDates. However, they are not shown in the result as the gaps, because all the queries treat Jan 02, 2007 as the beginning and Dec 29, 2007 as the end of the sample data set.
In order to get the correct result, you need to insert the dates Dec 31, 2006 and Jan 01, 2008 into the table sampleDates.
Technique #3 -the Pivot table
One more technique presents missing dates in the form of the spreadsheet.
This type of output is one of the most attractive for the human eye and probably is the favorite document type of your boss.
The technique shown in the Listing 6, uses SQL Server 2005 PIVOT operator in order to produce tabular output:
Listing6. Solution with PIVOT operatorSELECT *
FROM
(SELECT
CONVERT(VARCHAR(13),DATEADD(dd,num,’Dec 31, 2006′),107) AS missingDate,
DATEPART(yy, DATEADD(dd, num, ’Dec 31, 2006′)) YearNum,
DATEPART(wk, DATEADD(dd, num, ’Dec 31, 2006′)) Wk_Year,
DATENAME(dw, DATEADD(dd, num, ’Dec 31, 2006′)) weekDayName
FROM sequence t1 LEFT OUTER JOIN sampleDates t2
ON DATEADD(dd, t1.num, ’Dec 31, 2006′) = t2.theDate
WHERE t2.theDate IS NULL
) sourceTable
PIVOT
(MIN(missingDate)
FOR weekDayName
IN
(Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)) AS pivotTable;
The beauty of the query from the Listing 6 is not only in its cross-tab view. That query also shows days of the week, which is very convenient, when you need to analyze business (application) that runs not whole week, let us say 5-6 days per week. In that case, the missing Saturdays and Sundays will be considered as weekends, but not as the missing days.
How To Find The Missing Hours
This bonus example shows how to apply the above techniques to the case of missing hours
You can also use this paragraph as a reference, when you need to find missing minutes or seconds.
First, create a sample data set (see Listing 7):
Listing7. Create sample data set for hours– Generate hours for the year 2007
DECLARE @iniDate smalldatetime;
SELECT @iniDate = ’Dec 31, 2006′;
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘sampleDates_2′)) AND xtype = ’U’)
DROP TABLE sampleDates_2;
CREATE TABLE sampleDates_2(theDate smalldatetime PRIMARY KEY);
INSERT INTO sampleDates_2
SELECT DATEADD(hh,num-1, days) hours
FROM (SELECT DATEADD(dd, num, @iniDate) days
FROM sequence) t1
CROSS JOIN sequence t2
WHERE num <=24
ORDER BY hours
– Create gaps in the sequence of hours
DELETE sampleDates_2
WHERE DATEDIFF(hh, 0, theDate)%4 = 0
AND DATEDIFF(dd, ’Jan 01, 2007′,theDate) < 100;
DELETE sampleDates_2
WHERE DATEDIFF(hh, 0, theDate)%5 = 0
AND DATEDIFF(dd, ’Jan 01, 2007′,theDate) BETWEEN 50 AND 150;
DELETE sampleDates_2
WHERE DATEDIFF(hh, 0, theDate)%7 = 0
AND DATEDIFF(dd, ’Jan 01, 2007′,theDate) BETWEEN 75 AND 325;
– Just to make the last hour of the year missing
DELETE sampleDates_2 WHERE theDate = ’2007-12-31 23:00:00′;
– In order to get the very first gap in the beginning of the year and
– the very last gap at the end of the year, add the next dates
INSERT INTO sampleDates_2 VALUES(‘Dec 31, 2006 23:00:00’);
INSERT INTO sampleDates_2 VALUES(‘Jan 01, 2008 00:00:00’);
The following script shows the solutions for missing hours (see Listing 8):
Listing8. Solutions for missing hours– 1) The gaps solution for missing hours
WITH sampleCTE(rowNum, theDate) AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (theDate)) AS RowNum, theDate
FROM sampleDates_2
)
SELECT DATEADD(hh, 1, a.theDate) AS startOfGap,
DATEADD(hh, -1, b.theDate) AS endOfGap
FROM sampleCTE a INNER JOIN sampleCTE b
ON a.RowNum = b.RowNum - 1
WHERE DATEDIFF(hh, a.theDate, DATEADD(hh, -1, b.theDate)) <> 0;
– 2) The cross-tab solution that counts the number of missing hours per each day
WITH sourceTable(hours, days, weekDayName) AS
(
SELECT
t3.hours,
CAST(DATEDIFF(dd, 0, t3.hours) AS smalldatetime) days,
DATENAME(dw, t3.hours) weekDayName
FROM (SELECT DATEADD(hh,num-1, days) hours, t2.num
FROM
(SELECT DATEADD(dd, num, ’Dec 31, 2006′) days
FROM sequence) t1 CROSS JOIN sequence t2
WHERE num <=24) t3 LEFT OUTER JOIN sampleDates_2 t4
ON t3.hours = t4.theDate
WHERE t4.theDate IS NULL
)
SELECT *
FROM sourceTable
PIVOT
(COUNT(hours)
FOR weekDayName
IN
(Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)) AS pivotTable;
– 3) The cross-tab solution that shows missing hours per each day
SELECT *
FROM
(SELECT CONVERT(VARCHAR(15), t3.hours, 108) hours,
t3.num - 1 AS dayHour,
DATEPART(mm, t3.hours) monthNum,
DATEPART(dd, t3.hours) dateNum,
CONVERT(VARCHAR(13), t3.hours, 107) days
FROM (SELECT DATEADD(hh,num-1, days) hours, t2.num
FROM
(SELECT DATEADD(dd, num, ’Dec 31, 2006′) days
FROM sequence) t1 CROSS JOIN sequence t2
WHERE num <=24) t3 LEFT OUTER JOIN sampleDates_2 t4
ON t3.hours = t4.theDate
WHERE t4.theDate IS NULL
) sourceTable
PIVOT
(MIN(hours)
FOR dayHour
IN
([0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],
[13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23])) AS pivotTable;
The first solution from the Listing 8 shows the missing hours as the beginning and the end of the gap.
The second solution shows the number of missing hours per each day
If this is not enough, the third solution will give you detailed information about missing hours.
The Task Completed: Here Are The Results
You found missing dates and sent the results to your boss. After short investigation, the problem was identified: some data has been deleted from database by mistake, during the application upgrade.
In that situation, everybody relied on you and once again, you rose to the challenge. You found the old backup, restored the database under another name and moved the missing data to the original database.
Here is the result of your efforts:
The customer is satisfied. The manager impressed by your skills. You can be proud of your achievements.
Load comments