This is an addition to my book, on page 255, to the end of the section on collation (before the statement that I won’t delve any deeper into collations, of course :)), I want to add the following bit of information about collations and how they affect sorting and searching.
You can download the entire addendum with source code from the:
Pro SQL Server 2005 Database Design and Optimization Homepage
Sort and Searching
One of the more confusing parts of collation is how they affect the ordering of data when sorted and how that can affect other operations like the LIKE operator. When it comes to case insensitive collations, it is pretty clear, data is sorted as if ‘A’ and ‘a’, are the same character. The other common situation is accent insensitivity where ‘a’ and ‘ấ’ appear as the same character. When you are using a case and accent insensitive collation there will be no guarantee when sorting data that either of these characters would come first in the list. The great part about this is that when you search for where like ‘%A%’, you know that you will get back: aardvark, Abel, Penấ with no worries.
In some situations, this is not desirable to have this be the case, and you set up a column, table, or database to be case sensitive and accent sensitive. This is where sometimes you can get confused when using between, like, or other range type queries on characters.
As an example, let me create the following table and seed it with some characters:
create table dbo.TestSorting
(
value nvarchar(1) collate Latin1_General_CI_AI
)
insert into dbo.TestSorting
select ‘A’
union all
select ‘a’
union all
select nchar(256) –Ā
union all
select ‘b’
union all
select ‘B’
Doing a normal case insensitive search using the base collation:
select value
from dbo.TestSorting
where value like ‘[A-Z]%’
All rows are returned, as expected:
value
—–
A
a
Ā
b
B
The problem comes in when you want to do a case sensitive search. You choose a case sensitive collation, and either use it in the WHERE clause or in your table declaration, and:
select value
from dbo.TestSorting
where value like ‘[A-Z]%’ collate Latin1_General_CS_AI
This returns what seems a confusing set:
value
—–
A
Ā
b
B
Everyone who does this the first time thinks “Why?” And I did too (in fact it was why I wrote this section of the book!) The key is to kook at how the values sort in this collation:
select value
from dbo.TestSorting
order by value collate Latin1_General_CS_AI
value
—–
a
Ā
A
b
B
Case sensitive sorts things aAbBcC, not, as most of us programmers expect intuitively: ABCabc. That is actually how a binary collation works, since it sorts based on the ASCII or Unicode value of the character. Using a binary collation would work, but then you lose the ability to include the special characters in your search which is given to you by using accent insensitivity.
Instead, you can simply use a LIKE such as this:
select value
from dbo.TestSorting
where value like ‘[ABCDEFGHIJKLMNOPQRSTUVWXYZ]%’ collate Latin1_General_CS_AI
This returns only the capital letters including the accented capital letter I included.
value
—–
A
Ā
B
It is a good idea to take a look at the sorting of data when you choose a collation to make sure it is clear to you how data will be sorted.
First in the data you have in your table, like this, if you want to see how it will sort in a binary collation:
select value
from dbo.TestSorting
order by value collate Latin1_General_bin
value
—–
A
B
a
b
Ā
And then I have a query I use for this purpose that I can use to generate the characters I am interested and see how they sort
;with digits (i) as( select 1 as i union all select 2 as i union all select 3 union all
select 4 union all select 5 union all select 6 union all select 7 union all
select 8 union all select 9 union all select 0)
,sequence (i) as (
SELECT D1.i + (10*D2.i) + (100*D3.i) + (1000*D4.i) + (10000*D5.i)
FROM digits AS D1 CROSS JOIN digits AS D2 CROSS JOIN digits AS D3
CROSS JOIN digits AS D4 CROSS JOIN digits as D5
)
select i, nchar(i) as character
from sequence
where i between 48 and 122 –vary to include any characters in the character set of choice
order by nchar(i) collate Latin1_General_bin
Crossposted to http://drsql.spaces.live.com
Load comments