One interesting problem that you can come across occasionally is to find the contiguous sequence of numbers in a one-dimensional array of numbers that gives you the greatest sum value. Now that doesn’t sound particularly interesting until you are asked for a particular report like ‘How long was the period in which we were trading most profitably last year?’, or you maybe need to find the brightest part of an image. It is one of those techniques that like linear regression or ANOVA that you find all sorts of uses for once you understand it.
It is called the Maximum subarray problem, Kadane’s algorithm, The ‘Largest Sum Contiguous Subarray’ or the ‘Greatest subsequential sum’ algorithm. Learn it and you’ve aced four technical interview questions! I’ve never seen a SQL Solution.
Normally, the answer you’d give when you see a one-dimensional array of numbers like 45,12,32,90,48 would be ‘the whole lot contribute to the sum’, but that would assume that all the numbers were positive. What if they are not? They might, for example represent the difference from the mean. They might represent both input and output of a black-box system. Imagine that you were doing cashbook entries where cash was going in, and payments were going out, and you wanted to know the longest sequence of entries where you were accumulating money? What if you were doing just-in-time stocking of components and wanted to know those periods in which you were piling up components?
In the sequence of numbers −2, 1, −3, 4, −1, 2, 1, −5, 4 – the contiguous subarray with the largest sum is 4, −1, 2, 1, – with sum of 6.
for the sequence -1, 0, 15, 3, -9, 12, -4 – it would be 0, 15, 3, -9, 12 – with a sum of 21
for the sequence -1, -2, 3, 5, 6, -2, -1, 4, -4, 2, -2 – it would be 3, 5, 6, -2, -1, 4 – with a sum of 15
There are plenty of brute force ways of doing this but Kadane’s algorithm succeeds in being linear. This is given in the Wikipedia. Although it is used mostly for one-dimensional arrays, it will work with two-dimensional arrays as well if you simply sum up the rows, and then let the basic kadane algorithm run on the resulting one-dimensional array.
So how is this done in SQL? The simplest thing is to translate the algorithm into SQL directly, because it is very efficient. I’ve added to it slightly to give you the subarray as well as the sum. This works well-enough on small arrays and is linear. Ideally, you’d want a set-based method but I’ve never outgrown an iterative solution
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
IF(OBJECT_ID('tempDB..#MaxSubarray') IS NOT NULL) DROP TABLE #MaxSubarray; --if the temp tables exists then delete it CREATE TABLE #MaxSubarray ( TheOrder INT IDENTITY PRIMARY KEY, value INT ); --create a table representing the list with the sequence no. as the primary key INSERT INTO #MaxSubarray (value) SELECT x.VALUE FROM (VALUES (-2),(1),(-3),(4),(-1),(2),(1),(-5),(4)) x(VALUE); --and now fill it with the list items DECLARE @max_ending_here INT, --the total so far (the partial sum accumulator) @max_so_far INT, --The Maximum of the current solution @ii INT, --loop variable @iiMax INT, --length of the array @x INT, --the value of the current array element in the sequence (current row) @start INT, --the start point of the current solution @end INT; --the end - so far- of the current best solution --so now we intialise all our variables SELECT @start = -1, @end = -1, @max_so_far = 0, @ii = 1, @iiMax = MAX(#MaxSubarray.TheOrder) FROM #MaxSubarray; --iterate through the array (I know!) WHILE @ii <= @iiMax BEGIN --get the value of the next value in the sequence SELECT @x = #MaxSubarray.value FROM #MaxSubarray WHERE #MaxSubarray.TheOrder = @ii; --update the running sum total (partial sum accumulator) SELECT @max_ending_here = COALESCE(@max_ending_here, 0) + @x; --IF it IS greater value than anything so far update our maximum --and update the index of the end of the run IF @max_ending_here > @max_so_far SELECT @max_so_far = @max_ending_here, @end = @ii; SELECT @ii = @ii + 1; --Now, if our accumulator has gone negative then we start again --at the next value following IF @max_ending_here < 0 SELECT @max_ending_here = 0, @start = @ii; END; SELECT @max_so_far; -- our highest positive sum fr4om a subsequence SELECT TheOrder,value --and list the subsequence that gave the highest sum FROM #MaxSubarray WHERE #MaxSubarray.TheOrder BETWEEN @start AND @end ORDER BY #MaxSubarray.TheOrder; --if nothing in the array, the subarray is empty and the total is 0 |
Now there are several ways you might want to use this, but we want to have a version that we can unit-test easily as we work with the algorithm so we’d want to feed it lists whare the solution has been done already and see if it gives the correct solution. Normally, you’d probably take a table as a parameter and return a table result representing the subarray, but that is a bit messier to unit-test.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 |
IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.ROUTINES WHERE ROUTINES.ROUTINE_TYPE='procedure' AND ROUTINES.ROUTINE_SCHEMA='dbo' AND ROUTINES.ROUTINE_NAME = 'MaximumValueSubarray') DROP PROCEDURE MaximumValueSubarray go CREATE procedure dbo.MaximumValueSubarray /** summary: > This function takes a sequence as a list of numbers and returns both a list and a value. The list represents the subsequence as an ascll string representing a sequenced list of integers in the array and the value is the maximum sum value from aggregating that subsequence Author: Phil Factor Revision: 1.0 date: 19 Jan 2017 example: > DECLARE @SubArray VARCHAR(MAX), @MaxValue int EXECUTE MaximumValueSubarray '−2, 1, −3, 4, −1, 2, 1, −5, 4', @SubArray OUTPUT, @MaxValue OUTPUT SELECT @SubArray as 'subarray', @MaxValue as 'Max_Value' returns: > The subsequence as an ascll string representing a sequenced list of integers in the array The maximum sum value from aggregating a subsequence **/ @Sequence VARCHAR(MAX), --eg '−2, 1, −3, 4, −1, 2, 1, −5, 4' @SubArray VARCHAR(MAX) OUTPUT, @MaxValue INT OUTPUT AS SELECT @Subarray='', @MaxValue=0 --initialise the output variables IF COALESCE(@Sequence,'')='' RETURN --nothing to do DECLARE @XMLVersionOfList XML --we convert the list into XML SELECT @XMLVersionOfList = '<list><i>' + REPLACE(@Sequence, ',', '</i><i>') + '</i></list>'; --we now take each component and insert it into a row of a table variable DECLARE @MaxSubarray TABLE ( TheOrder INT IDENTITY PRIMARY KEY, value INT ); INSERT INTO @MaxSubarray(Value) SELECT x.y.value('.', 'varchar(10)') AS IDs FROM @XMLVersionOfList.nodes('/list/i/text()') AS x(y); DECLARE @max_ending_here INT, --the total so far (the partial sum accumulator) @max_so_far INT, --The Maximum of the current solution @ii INT, --loop variable @iiMax INT, --length of the array @x INT, --the value of the current array element in the sequence (current row) @start INT, --the start point of the current solution @end INT; --the end - so far- of the current best solution --so now we intialise all our variables SELECT @start = -1, @end = -1, @max_so_far = 0, @ii = 1, @iiMax = MAX(TheOrder) FROM @MaxSubarray; --iterate through the array (I know!) WHILE @ii <= @iiMax BEGIN --get the value of the next value in the sequence SELECT @x = value FROM @MaxSubarray WHERE TheOrder = @ii; --update the running sum total (partial sum accumulator) SELECT @max_ending_here = COALESCE(@max_ending_here, 0) + @x; --IF it IS greater value than anything so far update our maximum --and update the index of the end of the run IF @max_ending_here > @max_so_far SELECT @max_so_far = @max_ending_here, @end = @ii; SELECT @ii = @ii + 1; --Now, if our accumulator has gone negative then we start again --at the next value following IF @max_ending_here < 0 SELECT @max_ending_here = 0, @start = @ii; END; SELECT @MaxValue=@max_so_far; -- our highest positive sum from a subsequence SELECT @SubArray= @SubArray+','+CONVERT(VARCHAR(10),value) --and list the subsequence that gave the highest sum FROM @MaxSubarray WHERE TheOrder BETWEEN @start AND @end ORDER BY TheOrder; SELECT @SubArray=STUFF(@SubArray,1,1,'') --if nothing in the array, the subarray is empty and the total is 0 GO --now we test it. DECLARE @SubArray VARCHAR(MAX), @MaxValue INT EXECUTE MaximumValueSubarray '−2, 1, −3, 4, −1, 2, 1, −5, 4', @SubArray OUTPUT, @MaxValue OUTPUT IF @SubArray <>'4,-1,2,1' or @MaxValue <> 6 RAISERROR ('Test 1 failed with ''%s'' and ''%d''',16,1,@SubArray,@MaxValue) EXECUTE MaximumValueSubarray '-1, 0, 15, 3, -9, 12, -4 ', @SubArray OUTPUT, @MaxValue OUTPUT IF @SubArray <>'0,15,3,-9,12' or @MaxValue <>21 RAISERROR ('Test 2 failed with ''%s'' and ''%d''',16,1,@SubArray,@MaxValue) EXECUTE MaximumValueSubarray ' -1, -2, 3, 5, 6, -2, -1, 4, -4, 2, -2', @SubArray OUTPUT, @MaxValue OUTPUT IF @SubArray <>'3,5,6,-2,-1,4' or @MaxValue <>15 RAISERROR ('Test 3 failed with ''%s'' and ''%d''',16,1,@SubArray,@MaxValue) DECLARE @SubArray VARCHAR(MAX), @MaxValue int EXECUTE MaximumValueSubarray '1', @SubArray OUTPUT, @MaxValue OUTPUT SELECT @SubArray as 'subarray', @MaxValue as 'Max_Value' |
So there you have it. As you can see, it is iterative, but grows linearly with the length of the array. but you can use Quirky Update to speed it up considerably. I don’t know of a fast set-based algorithm but I’d be pleasantly surprised if you can come up with something faster than Kadane’s Algorithm.
Load comments