This post is part of an ongoing series of blogs I am writing while preparing to give a presentation based on the prefix of this blog's title. It is a work in progress as I explore the effects of the new in-memory model. I would love comments that tell me that I am wrong (especially if it turns out that I am!)
The first issue that arose in my tests was with the new concurrency model. I started out with the stored procedure I will show later in the entry. It checks whether a customer address exists, and if not, creates it. The table is the CustomerAddress table, with the following structure:
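Roughly, the on-disk version looks like this (a sketch reconstructed from the procedure's parameters; the exact data types, nullability, and constraint names are approximations):

CREATE TABLE Customers.CustomerAddress
(
    CustomerAddressId INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PKCustomerAddress PRIMARY KEY,
    CustomerId INT NOT NULL,
    Line1 NVARCHAR(30) NULL,
    Line2 NVARCHAR(30) NULL,
    City NVARCHAR(30) NULL,
    ZipCode VARCHAR(20) NULL,
    USStateCode CHAR(2) NOT NULL,
    AddressTypeCode NVARCHAR(20) NOT NULL,
    --customerId first in the key so the existence check can seek on it
    CONSTRAINT AKCustomerAddress UNIQUE
        (CustomerId, Line1, Line2, City, ZipCode, USStateCode, AddressTypeCode)
)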
With on-disk structures, I have a uniqueness constraint on the address columns, along with the customerId. The customerId is the first column in the constraint declaration, so the performance was great. I don't use a transaction or error handling (yet) because I want to simulate what a normal application might do. I put the code in stored procedure form because, well, it will just be easier for me to manage. (Plus we can make the code more robust a lot more easily later, and when I build the natively compiled procedures, things will be easier to simulate.)
Basically, the gist is: look up the address using all of the address columns that are in the uniqueness constraint. If you get a customerAddressId back, return it; if not, create the address and then return its key by looking it up again with the alternate key values:
CREATE PROCEDURE Customers.CustomerAddress$CreateAndReturn
(
    @customerId INT,
    @Line1 NVARCHAR(30),
    @Line2 NVARCHAR(30),
    @City NVARCHAR(30),
    @ZipCode VARCHAR(20),
    @USStateCode CHAR(2),
    @AddressTypeCode NVARCHAR(20)
)
AS
BEGIN
    SET NOCOUNT ON

    DECLARE @customerAddressId INT = (
            SELECT CustomerAddressId
            FROM   Customers.CustomerAddress
            WHERE  CustomerId = @CustomerId
              AND  (Line1 = @Line1 OR (Line1 IS NULL AND @Line1 IS NULL))
              AND  (Line2 = @Line2 OR (Line2 IS NULL AND @Line2 IS NULL))
              AND  (City = @City OR (City IS NULL AND @City IS NULL))
              AND  (ZipCode = @ZipCode OR (ZipCode IS NULL AND @ZipCode IS NULL))
              AND  USStateCode = @USStateCode
              AND  AddressTypeCode = @AddressTypeCode ) --we are assuming validation is done elsewhere

    IF @customerAddressId IS NULL
    BEGIN
        INSERT INTO [Customers].[CustomerAddress] ( [CustomerId], [Line1], [Line2], [City], [ZipCode], [USStateCode], [AddressTypeCode] )
        VALUES ( @CustomerId, @Line1, @Line2, @City, @ZipCode, @USStateCode, @AddressTypeCode )

        SET @customerAddressId = (
            SELECT CustomerAddressId
            FROM   Customers.CustomerAddress
            WHERE  CustomerId = @CustomerId
              AND  (Line1 = @Line1 OR (Line1 IS NULL AND @Line1 IS NULL))
              AND  (Line2 = @Line2 OR (Line2 IS NULL AND @Line2 IS NULL))
              AND  (City = @City OR (City IS NULL AND @City IS NULL))
              AND  (ZipCode = @ZipCode OR (ZipCode IS NULL AND @ZipCode IS NULL))
              AND  USStateCode = @USStateCode
              AND  AddressTypeCode = @AddressTypeCode )
    END

    RETURN COALESCE(@customerAddressId, -100)
END
GO
My test data resides in a side database, with a lot of duplicated customers and addresses. The goal is to do a good number of reads, but not make the code so complex that we start to deal with slight variances in addresses. These are the kinds of things one might do in an ETL process, or certainly as a background process in an OLTP system. For this example, I want to avoid duplication, but only when the address is a 100% exact duplicate.
The code is called in batches of calls to four interconnected procedures. Each returns a surrogate key via a RETURN statement (it isn't 100% finished, as I haven't really handled things like -100 being returned, even though that really couldn't occur with my current data; a rough sketch of that check follows the batch):
DECLARE @CustomerId INT
EXEC @CustomerId = Customers.Customer$CreateAndReturn @firstName = 'Tyler', @MiddleName = 'R', @LastName = 'David', @EmailAddress = 'Tyler@email.com'

DECLARE @CustomerAddressId INT
EXEC @CustomerAddressId = Customers.CustomerAddress$CreateAndReturn @customerId = @CustomerId, @Line1 = '9 Gordon Highway', @Line2 = 'Apt 296', @City = 'Buffalo', @ZipCode = '81254', @USStateCode = 'MO', @AddressTypeCode = 'Office'

DECLARE @SalesOrderId INT
EXEC @SalesOrderId = Sales.SalesOrder$Create @CustomerId = @CustomerId, @CustomerAddressId = @CustomerAddressId, @SalesOrderStatusCode = 'New'

EXEC Sales.SalesOrderItem$Create @SalesOrderId = @SalesOrderId, @SalesOrderItemNumber = 48904, @Quantity = 3.6386, @UnitPrice = 14.57, @ProductPriceId = 3779
EXEC Sales.SalesOrderItem$Create @SalesOrderId = @SalesOrderId, @SalesOrderItemNumber = 98015, @Quantity = 3.0596, @UnitPrice = 5.31, @ProductPriceId = 1043
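The missing check would probably be nothing fancier than the following (a placeholder sketch, not the finished error handling), placed between the address call and the order creation:

--placeholder sketch: stop the batch if the address procedure signals failure
--with its -100 return value (this check is not in the generated files yet)
IF @CustomerAddressId = -100
BEGIN
    RAISERROR('CustomerAddress$CreateAndReturn did not return a key', 16, 1)
    RETURN
END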
I have 4 files of these stored procedure batches, each file having 10,000 batches in it. I start them all at the same time using my WaitFor Sync FrameWork that I blogged about a few years back. When I was using the on-disk tables, not once did this code fail to work, and not once did I get an error, even though I certainly could have if a duplicated row was inserted:
DECLARE @customerAddressId INT = (
        SELECT CustomerAddressId
        FROM   Customers.CustomerAddress
        WHERE  CustomerId = @CustomerId
          AND  (Line1 = @Line1 OR (Line1 IS NULL AND @Line1 IS NULL))
          AND  (Line2 = @Line2 OR (Line2 IS NULL AND @Line2 IS NULL))
          AND  (City = @City OR (City IS NULL AND @City IS NULL))
          AND  (ZipCode = @ZipCode OR (ZipCode IS NULL AND @ZipCode IS NULL))
          AND  USStateCode = @USStateCode
          AND  AddressTypeCode = @AddressTypeCode ) --we are assuming validation is done elsewhere

IF @customerAddressId IS NULL
BEGIN
    INSERT INTO [Customers].[CustomerAddress] ( [CustomerId], [Line1], [Line2], [City], [ZipCode], [USStateCode], [AddressTypeCode] )
    VALUES ( @CustomerId, @Line1, @Line2, @City, @ZipCode, @USStateCode, @AddressTypeCode )
Time is the enemy. Time passes between the fetch of the CustomerAddressId and the insert. If another connection inserts the same row in the moments between those statements, you could get a duplicate key error, which I accepted as a possibility. If I finished this code for on-disk tables, I would handle that error by refetching the key. No problem.
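That handling would look something like this sketch (a rough idea only, not the finished code); if the insert loses the race and hits the alternate key, the error is swallowed and the refetch that follows in the procedure picks up the row the other connection created:

BEGIN TRY
    INSERT INTO [Customers].[CustomerAddress] ( [CustomerId], [Line1], [Line2], [City], [ZipCode], [USStateCode], [AddressTypeCode] )
    VALUES ( @CustomerId, @Line1, @Line2, @City, @ZipCode, @USStateCode, @AddressTypeCode )
END TRY
BEGIN CATCH
    --only swallow the duplicate key errors; anything else is rethrown
    IF ERROR_NUMBER() NOT IN (2601, 2627)
        THROW;
END CATCH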
When I changed my table to an in-memory table (I will talk more about the actual structure in a blog or two), I immediately got duplicated data. Why? Three reasons.
1. The in-memory code ran 300% faster with little optimization.
2. No UNIQUE key constraint. I put a hash index on the customerId so the WHERE clause would use a simple seek operation, but with only a PRIMARY KEY constraint on the surrogate key, there was nothing to stop the duplicates (see the sketch after this list).
3. The lock-free, optimistic concurrency control. (I won't go into too much detail, but read this if you haven't yet: http://sqlblog.com/blogs/kalen_delaney/archive/2013/10/20/sql-server-2014-in-memory-oltp-hekaton-whitepaper-for-ctp2.aspx)
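For reference, the in-memory version of the table is shaped roughly like the sketch below (I will cover the real structure in a blog or two; the bucket counts and durability setting here are placeholders):

--rough sketch of the memory-optimized table: the only uniqueness enforced is the
--PRIMARY KEY on the surrogate, plus a non-unique hash index on CustomerId to
--support the existence check in the procedure
CREATE TABLE Customers.CustomerAddress
(
    CustomerAddressId INT IDENTITY(1,1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
    CustomerId INT NOT NULL
        INDEX IX_CustomerAddress_CustomerId NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
    Line1 NVARCHAR(30) NULL,
    Line2 NVARCHAR(30) NULL,
    City NVARCHAR(30) NULL,
    ZipCode VARCHAR(20) NULL,
    USStateCode CHAR(2) NOT NULL,
    AddressTypeCode NVARCHAR(20) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)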
Each of these factors figured into the issue. Since the code is faster, it is more likely that we will have collisions. This was certainly a factor: in my first tests I forgot to index the customerId, the code ran twice as slow as the on-disk version, and there were no collisions.
Without a uniqueness constraint, the rows will get created without complaint, even if you accidentally create duplicates. The lack of constraints is one of my least favorite parts of the whole in-memory structure.
Under Multi-Version Concurrency Control (MVCC), even while the row is being inserted (or really, as long as the inserting transaction has not been committed), other connections can execute the search for the address and get back no results. In the default isolation level, access is effectively basic SNAPSHOT isolation: the second connection sees the table as it was when its transaction started. Even if I escalated to REPEATABLE READ or SERIALIZABLE, it wouldn't eliminate the duplication, since the row's uniqueness is defined on the surrogate key and there would technically be no collision. Is using a surrogate key the best solution? It isn't feeling like it for these tables, but you know what, this is the norm for most people. I could easily just chuck the primary key on the natural key here and let the identity column be the possible (and far less likely) duplication point. But that seems like cheating, and what if a table has two natural key choices, or three, or more? You would be stuck with this issue again (and it will not be easy to solve).
Fortunately, this code is going to fail IF multiple rows have been committed by the time it executes:
SET @customerAddressId = (
        SELECT CustomerAddressId
        FROM   Customers.CustomerAddress
        WHERE  CustomerId = @CustomerId
          AND  (Line1 = @Line1 OR (Line1 IS NULL AND @Line1 IS NULL))
          AND  (Line2 = @Line2 OR (Line2 IS NULL AND @Line2 IS NULL))
          AND  (City = @City OR (City IS NULL AND @City IS NULL))
          AND  (ZipCode = @ZipCode OR (ZipCode IS NULL AND @ZipCode IS NULL))
          AND  USStateCode = @USStateCode
          AND  AddressTypeCode = @AddressTypeCode )
That works because a scalar subquery can only return one value without raising an error; with multiple matching rows, it fails. But under MVCC, even this safety net is shaky if we want to put the whole thing in a transaction and roll back on the error. Sadly, it will not be an easy fix, because the statement only fails if the transaction that inserted the competing row has already been committed.
As I start to think about how to solve this problem, the solution is going to have to be very different from the one for disk-based tables. There, if I threw a transaction around the search and the insert (and possibly a few locking hints to block other readers), I could stop the duplication easily. However, in all of the MVCC isolation levels, readers are never blocked. I could use an application lock to single-thread the inserts, but that would definitely not be the most efficient operation, and it would not work at all in natively compiled code.
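For the interpreted T-SQL version, that application-lock approach would look something like this (an illustrative sketch, not a recommendation; the resource name is made up):

--single-thread the check-then-insert by taking an exclusive application lock;
--every caller serializes on the same resource name, which is exactly why this
--is not attractive for a high-throughput insert path
BEGIN TRANSACTION

EXEC sp_getapplock @Resource = 'CustomerAddress$CreateAndReturn',
                   @LockMode = 'Exclusive',
                   @LockOwner = 'Transaction',
                   @LockTimeout = 5000

--the existence check and INSERT from the procedure go here (the memory-optimized
--table will also need a WITH (SNAPSHOT) hint, or the database option
--MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT, to be used inside an explicit transaction)

COMMIT TRANSACTION --the application lock is released along with the transaction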
In reality, there is another, more common solution (as I expect a lot of people wouldn't even have the alternate key on the address to start with): just give in and accept the duplicates as part of high-performance business. We can minimize the damage by changing the two subqueries to:
SET @customerAddressId = (SELECT MIN(CustomerAddressId)
So we always try to use the first one. With a low-cardinality table like a customer's addresses, you will only get back a few rows at most, so the aggregate isn't going to be terribly costly. And you build cleanup routines to reallocate duplicated addresses, which is something you will need to do with the data in any case, since it is altogether possible that the user types '9 Gordon Hwy' instead of '9 Gordon Highway' and you get duplicates you need to clean up anyhow. My current time to enter the 40,000 rows across 4 connections using the command line SQLCMD interface (the files crashed SSMS!) is around 12 seconds.
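A cleanup routine of the sort I mean might look something like this sketch (Sales.SalesOrder is the only referencing table repointed here; a real version would have to cover every table with a CustomerAddressId column):

--find each group of exactly-duplicated addresses, repoint referencing rows at the
--copy with the lowest surrogate key, then delete the extra copies
;WITH Dups AS
(
    SELECT CustomerAddressId,
           MIN(CustomerAddressId) OVER (PARTITION BY CustomerId, Line1, Line2, City,
                                        ZipCode, USStateCode, AddressTypeCode) AS KeepAddressId
    FROM Customers.CustomerAddress
)
UPDATE so
SET    so.CustomerAddressId = Dups.KeepAddressId
FROM   Sales.SalesOrder AS so
       JOIN Dups ON Dups.CustomerAddressId = so.CustomerAddressId
WHERE  Dups.CustomerAddressId <> Dups.KeepAddressId

;WITH Dups AS
(
    SELECT CustomerAddressId,
           MIN(CustomerAddressId) OVER (PARTITION BY CustomerId, Line1, Line2, City,
                                        ZipCode, USStateCode, AddressTypeCode) AS KeepAddressId
    FROM Customers.CustomerAddress
)
DELETE FROM Dups
WHERE  CustomerAddressId <> KeepAddressId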
When I get back to refining the code tomorrow, this is going to be my new tack for this (and perhaps the customer create) procedure. Minimize duplication, but go ahead and allow it. That problem is for the ETL Architect to deal with. (Now if much of my daily job wasn’t writing ETL, that would actually be comforting.)