02 June 2017

23760 views

15 0

The Basics of Good T-SQL Coding Style – Part 2: Defining Database Objects

Technical debt is a real problem in database development, where corners have been cut in the rush to keep to dates. The result may work but the problems are in the details: such things as inconsistent naming of objects, or of defining columns; sloppy use of data types, archaic syntax or obsolete system functions. With databases, technical debt is even harder to pay back. Robert Sheldon explains how and why you can get it right first time instead.

The series so far:

When rushing to implement or update a database, developers might fudge on the T-SQL code for the sake of keeping the project moving forward, especially when managers and scrum overlords are calling for blood. That’s not to say the code will generate errors or take down backend systems, but it might fail to adhere to best practices or acceptable standards, resulting in code that can be difficult to review, troubleshoot, or refactor (and consequently wasting a lot more time than had originally been saved).

Having concise and solid coding standards in place can help mitigate some of these challenges because developers have a set of guidelines for naming objects, defining columns, using data types, implementing stored procedures, and taking a number of other steps. Adhering to standards also makes it easier for team members to collaborate on projects, without anyone having to slog through and repair inconsistent and ineffective code.

In this article, we cover many of the T-SQL coding considerations to take into account when defining or updating database objects. This article is the second in a series about T-SQL coding. In the first one, The Basics of Good T-SQL Coding Style, we covered general practices that can apply to any type of code. In subsequent articles, we’ll get into specifics about T-SQL queries and data manipulation, as well as performance-related issues.

Object identifiers

Agreeing on a common approach to naming database objects can be one of the most contentious issues you’ll run up against when standardizing you’re T-SQL code. Practices and opinions vary widely, and getting team consensus can represent a significant effort. Regardless, the underlying rule that should govern all naming practices is consistency. No matter what standards you decide to implement, everyone in your organization should be working toward this goal.

Unfortunately, standardizing how to approach identifiers is no small task, not only because of the wide range of opinions (even within a small team), but also because of the many factors that should be taken into account. To give you a sense of what you’ll be up against, let’s start with an example of T-SQL code that includes several database definition language (DDL) statements:

USE master;

CREATE DATABASE hr_db;

USE hr_db;

CREATE SCHEMA hr;

CREATE TABLE hr.tblPerson(

person_id INT IDENTITY PRIMARY KEY,

FirstName VARCHAR(50) NOT NULL,

lastName VARCHAR(50) NOT NULL);

CREATE TABLE hr.tbl_personnel(

id INT IDENTITY PRIMARY KEY,

first_name VARCHAR(50) NOT NULL,

last_name VARCHAR(50) NOT NULL,

liid VARCHAR(50) NOT NULL);

CREATE TABLE HR.HR(

[NUM] INT IDENTITY PRIMARY KEY,

[ID] INT NOT NULL REFERENCES hr.tblPerson(person_id),

[CASE] INT NOT NULL,

[CS CODE] CHAR(2) NOT NULL DEFAULT 'ZX',

[DESC] VARCHAR(250) NOT NULL,

[DATE] DATETIME2 NOT NULL);

The T-SQL code is fairly straightforward. We’re creating a database, adding a schema to the database, and adding three tables to the schema. Although the code should run fine in SQL Server, the identifier styles are all over the place. A single developer might not be so inconsistent (at least that’s the hope), but multiple developers often are, even when contributing to a relatively small database.

To be begin with, there’s the issue of the table names. The first one takes a singular approach (tblPerson), the second takes a plural approach (tbl_personnel), and the third name (HR) is extremely vague and uses the same name as the schema, which is also vague. The database itself is named hr_db, which appears specific to human resources, so the schema and table names offer little insight into the underlying data

In addition, the table names, along with other identifiers, use an inconsistent combination of uppercase, lowercase and underscores. Some of the column names, such as CASE, DESC, and DATE, use T-SQL reserved words, and one column (CS CODE) includes a space. As a result, we must enclose the identifiers in quotes or brackets when we define the columns and when we query them, as shown in the following example:

1 2	SELECT [CASE], [ID], [CS CODE], [DESC], [DATE] FROM HR.HR;

As with the hr schema and HR table, many of the identifiers are simply too vague to provide us with insight into the underlying data. What data does the liid column contain? How does the NUM column differ from the ID column in the HR table? Is the CASE column an identifier of some type, given that it is defined with the INT data type? And what is the CS CODE column for?

Database identifiers should be clear, concise, and meaningful, with abbreviations used only when there is no doubt of their meaning (such as using ID in place of identification). Object names should not include reserved words or special characters (e.g., tab, newline, quotation mark, bracket), unless they cannot be avoided, as in the case of legacy systems or object names generated by an application). Some developers include spaces in the list of special characters to be avoided because they too must be delimited. As for underscores, some developers like them, others do not. With hyphens, you can string together multiple terms without having to delimited them.

Again, your team should aim to be as consistent as possible when it comes to identifiers. You might decide to use pascal case (PascalCase) or camel case (camelCase) for all identifiers, or you might use pascal case for objects such as tables, but use lowercase and underscores for objects such as columns. The goal is to make the code as readable as possible so that it is easy to review, troubleshoot, and update.

Another issue that the above code raises is whether to prefix object names with an abbreviation that indicates the object type, as in tblPerson and tbl_personnel. Many would argue against this convention, although the use of prefixes has its supporters. Again, the key is to be consistent.

Be aware, however, that SQL Server adds the sp_ prefix to its system stored procedure names. As a result, when the database engine encounters any procedure using this prefix, it first tries to locate the it in the master database. If the procedure is user-defined, the database engine must waste processing cycles to locate it. In addition, the database ignores any user-defined procedures that share the same name as a system stored procedure. Plus, the sp_ prefix can result in confusion when reviewing database objects.

The same is true for system functions. SQL Server uses the fn_ prefix for their names, so using that prefix for user-defined functions can result in unnecessary confusion. Also resist the temptation to include the data type in the column name. Data types change, but the uncertainty remains.

In addition, it’s a good idea to manually provide names for objects such as constraints or indexes, rather than going with the system-generated names, which are generally fairly ugly. For example, when I ran the above code on my system, the database engine generated the following names for the HR table:

Primary key: PK__HR__C7D1B76B3F4C721D
Foreign key: FK__HR__ID__276EDEB3
Default constraint: DF__HR__TYPE__276EDEB3

When developing naming standards, you should also decide whether to take different approaches to labelling based on object type. For example, you might want to use nouns for table and view names (such as Personnel), but use a mix of verbs and nouns for functions and stored procedure names (such as GetSalesTotals). Also avoid using the same name for different types of objects, as in the hr schema and the HR table.

When it comes to standardizing database identifiers, you must take into account all object types, including tables, views, stored procedures, triggers, indexes, constraints, columns, functions, schemas, and the database itself. Anyone who reviews the T-SQL or examines the database should be able to get a good sense of each object’s purpose, without having to turn to additional documentation or other resources.

Table design

When implementing a data-driven solution with SQL Server in the backend, one of the first steps most teams will likely undertake is to create a detailed database design to support the front-end application. As part of this process, a data architect will define the various tables and relationships between them, often using a modeling tool that generates the scripts necessary to create the database objects. The T-SQL code might need to be refined, but most of the work is done.

Whether or not a data modeling tool is used, a database team must still ensure that its DDL scripts adhere to a set of agreed-upon standards to ensure code consistency within and across projects. Nowhere is this more important within the table definitions themselves. Whether developing a database from scratch or updating an existing database, developers need to make certain they’re implementing tables that follow best practices and conform to team standards.

Although a full discussion of database and table design is beyond the scope of this article, there are still a number of issues worth looking at in preparation for developing your coding standards. Let’s start with the following CREATE TABLE statements:

CREATE TABLE hr.Personnel(

PersonID INT IDENTITY,

FirstName VARCHAR(50) NOT NULL,

LastName VARCHAR(50) NOT NULL,

LoginID VARCHAR(50) NOT NULL,

MgrFirstLastName VARCHAR(50) NOT NULL);

CREATE TABLE hr.Cases(

ReportID INT IDENTITY,

PersonID INT NOT NULL PRIMARY KEY,

FirstName VARCHAR(50) NOT NULL,

LastName VARCHAR(50) NOT NULL,

CaseID INT NOT NULL,

CaseType VARCHAR(50) NULL,

CaseDescription VARCHAR(250) NOT NULL,

ReportDate DATETIME2 NOT NULL);

First, there is the issue of primary keys. In this case, the Personnel table includes no primary key, and the Cases table defines the primary key on the PersonID column, when in fact the ReportID column is more likely the better candidate. Except in rare cases, you should define a primary key on each table, and that primary key should be made up of the column or columns that uniquely identify each row in the table.

You must also keep in mind how the database engine creates a table’s clustered index. When you define a primary key on one or more columns, the database engine automatically uses those columns for the clustered index, unless a clustered index already exists or you specify otherwise. You should choose the clustered column (or columns) based on your queries and performance considerations. Although the primary key is often the best candidate for the clustered index, which is why this is the default behavior, this is not a foregone conclusion. It is not at all unusual to define the clustered index on a column other than the primary key.

When defining a primary key, you should specifically include the index type (clustered or nonclustered). If the plan to use a different column for your clustered index, you should create a unique index on that column and specify that it is the clustered index. Every table should have a clustered index so the data is not stored as a simple heap.

Also be certain that you define the necessary foreign key constraints on your tables. For example, the PersonID column in the Cases table above should actually be a foreign key that references the PersonID column in the Personnel table. In many cases, you will also want to define indexes on your foreign key columns, depending on how you query the data.

That said, you’ll likely put off many decisions about indexing until you are performance-tuning your database, although you should at least be considering your clustered indexes when defining your tables. A full discussion of indexing is beyond the scope of this article, but you’ll want to be sure you standardize how to create and name those indexes.

In general, your tables should follow good relational design practices. Many database gurus recommend that you normalize to at least the third normal form, with some flexibility in there to address performance issues. At the very least, aim for atomic, non-repetitive data, while keeping NULL values to a minimum.

Returning to the example above, notice the MgrFirstLastName column in the Personnel table stores the name of the employee’s manager. If we define the table in this way, we will end up with repetitive, non-atomic data, which makes it far more difficult to ensure the integrity of the data. A better option would be to simply list the PersonID value for that manager, perhaps using the hierarchical data types to better track who reports to whom.

Now let’s return to the Cases table, which has its own share of problems. To begin with, the FirstName and LastName columns include the same information as the Personnel table, so these columns can be eliminated.

Also notice that the CaseType column is defined with a VARCHAR(50) data type. If there were only a relatively few different case types, we would end up with a great deal of redundant data. In addition, because the data type specifies a length of 50, we are either sizing the type too large or we’re expecting long descriptive names. If the latter, we should consider short abbreviations rather than full descriptions, perhaps putting the descriptions in a separate table or using extended properties.

In addition, the CaseType column permits NULL values, which can conceivable add up, depending on the nature of the data. A situation such as this might warrant the use of a sparse column or the creation of a separate table that maps case types to report IDs, which can help eliminate the NULL values altogether.

When you design your tables, you must assess how each column is defined to ensure that you’re storing the data the best way possible. Also be sure that you’re not creating tables that are too wide or that contain unrelated data. An effective table contains a logical set of data that interfaces with other sets of data in a meaningful way, while ensuring the integrity of that data.

Column definitions

Now let’s move on to the column definitions themselves. We’ll start with the following CREATE TABLE statement:

CREATE TABLE hr.Vendors(

VendorID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,

VendorName NVARCHAR(200) COLLATE Latin1_General_100_CI_AS,

VendorRating DECIMAL(3,1) NOT NULL,

PurchasesLastYear MONEY NULL,

PurchasesYTD MONEY NULL,

PurchasesDiff AS PurchasesYTD - PurchasesLastYear);

As with previous examples, this statement is full of questionable code. For example, the first column, VendorID, is defined as the primary key and uses the UNIQUEIDENTIFIER data type, which means the column will store only GUID values. Because no clustered index is defined elsewhere on the table, the database engine will create a clustered index based on this column, which can result in significant page and index fragmentation, especially when used in conjunction with the NEWID function.

GUIDs also require more memory and storage when compared to integers, which can be a factor with large data sets. However, if you must use GUIDs for your clustered index, at least consider using the NEWSEQUENTIALID function, rather than NEWID to help minimize some of the fragmentation. However, use NEWSEQUENTIALID only if you’re not concerned about privacy because the function makes it possible to guess the value of the next generated GUID.

Now let’s look at the second column definition, VendorName, which includes the COLLATE clause. Although this in itself is not a problem, you want to be careful not to introduce collation mismatches in your query. In fact, there are a number of issues to take into account when introducing multiple collations in your database. If you want to learn more about working with collations, check out my article Questions About SQL Server Collations You Were Too Shy to Ask.

The VendorName column also fails to specify the nullability. As a result, SQL Server will assume that NULL values are permitted. Chances are good this was not the intent, but even if it were, you should specify the nullability, just so there is no question. Also, when working with character columns, you might consider creating a default constraint that specifies an empty string, rather than permitting NULL values. But avoid defining default constraints on columns that permit NULL values.

While we’re on the topic of constraints, don’t be afraid to use them, especially check constraints. Sometimes a few well-placed constraints can prove invaluable in helping to ensure data integrity. You can always use the NOCHECK option to disable them for large load operations. Just be sure you re-enable them when you’re done.

The last column in the Vendors table is PurchasesDiff, a computed column that calculates the difference between the PurchasesLastYear and PurchasesYTD columns. Although there is nothing wrong with including computed columns, you should be careful about their use. Be sure any query performance gains you might see are worth the additional overhead.

Data types

We’ll finish up our discussion of column definitions by covering data types. You must specify a data type for each column. The data type limits the kind of data that can be stored within that column, helping to enforce data integrity. You should select the data type that is most appropriate for the data that will be stored in the column.

The following CREATE TABLE statement helps to illustrate many of the considerations to take into account when defining your columns:

CREATE TABLE hr.Candidates(

CandidateID INT IDENTITY PRIMARY KEY,

FirstName VARCHAR NOT NULL,

LastName NVARCHAR(50) NOT NULL,

InterviewRating DECIMAL NOT NULL,

CorrelationScores FLOAT(25) NOT NULL,

CurrentResume VARBINARY(MAX) NULL,

CandidatePic IMAGE NULL);

Let’s start with the FirstName column, which is defined with the VARCHAR data type. Notice that no length is provided, which means the database engine will assign a length of 1. As a result, you’ll receive an error if you try to insert a larger value into the column. You should always specify the length of a character column, even if you want it to be 1.

Also notice that the FirstName column is defined with a non-Unicode type, whereas the LastName column is defined with a Unicode type (NVARCHAR). Make sure you’re using Unicode and non-Unicode types appropriately for the data you’re working with, keeping Unicode storage requirements in mind.

Also note that some database developers warn against using the CHAR or NCHAR data type for columns that permit NULL values because of the potential for wasting storage space. Be sure your standards cover these types of issues.

Now take a look at the InterviewRating column, which is configured with the DECIMAL data type. Whenever you specify numeric types, you should include the precision and scale when they’re supported, otherwise you will again have the defaults forced on you.

Also be careful using the REAL or FLOAT data types, as in the CorrelationScores column. These are approximate-number types, which means not all values in the permitted range can be represented exactly. For more information about some of the issues with types, see my article Questions about SQL Server Data Types You were Too Shy to Ask.

Now onto the last two columns. The CurrentResume column takes the VARBINARY data type, which stores variable-length binary data (in this case, resume documents), and the CandidatePic column takes the IMAGE data type for storing the candidate’s picture. It is generally considered a best practice not to store binary or image data within the database. Instead, you should store only pointers to the files and keep the files themselves separate from the database.

The CandidatePic column also has another problem. It uses a deprecated data type, something you should avoid whenever possible.

Stored procedures

Now we get to the topic of stored procedures, which come with a wide assortment of considerations when trying to shore up T-SQL coding standards. In this article, we’ll focus on the procedure’s structure as a whole, and leave the data manipulation language (DML) for later in this series.

We’ll start with an example stored procedure, but before we create it, let’s update the Candidates table from an earlier example:

ALTER TABLE hr.Candidates

ALTER COLUMN FirstName VARCHAR(50) NOT NULL;

ALTER TABLE hr.Candidates

ALTER COLUMN InterviewRating DECIMAL(3,1) NOT NULL;

All we’re doing here is fixing the data types in the two columns so the VARCHAR data type includes the length and the DECIMAL data type includes the precision and scale.

Next, we create the AddCandidate stored procedure, which includes four input parameters for passing in the data we need to add a new candidate:

CREATE PROCEDURE hr.AddCandidate

@FirstName NVARCHAR(50) = '',

@LastName NVARCHAR(50) = '',

@InterviewRating DECIMAL(3,1) = 0,

@CorrelationScores FLOAT(25) = 0

SET QUOTED_IDENTIFIER OFF;

SET ANSI_NULLS OFF;

SET ARITHABORT OFF

INSERT INTO hr.Candidates

(FirstName, LastName, InterviewRating, CorrelationScores)

VALUES(@FirstName, @LastName, @InterviewRating, @CorrelationScores)

DECLARE @CandidateID INT = @@IDENTITY

SELECT @CandidateID AS CandidateID;

Although this is a very basic procedure, it helps demonstrate some important issues. For example, the procedure does not include the SET NOCOUNT ON statement before the DML statements. Be default, the option is set as OFF, which means the database engine will send DONE_IN_PROC messages to the client for each executed statement in the procedure. For procedures that include numerous statements, this can represent a significant increase in network traffic, especially if the statements return little to no data. In addition, the database engine has to count the number of affected rows. In most cases, you should set the NOCOUNT option to ON. (Note that this issue also applies to triggers.)

Although the procedure doesn’t include the SET NOCOUNT ON statement, it does include several other SET statements. The first two statements set the QUOTED_IDENTIFIER and ANSI_NULL options to OFF. Unfortunately, these statements have no impact. The database engine ignores these SET statements when they’re included within the body of a stored procedure, trigger, or function. The engine instead uses the settings as they’re defined at the time the object is created or modified.

Also note that some SET statements, such as SET ARITHABORT OFF, cause the stored procedure to be recompiled each time it runs. For any SET statement you include, be sure you understand its impact on performance and whether it will force a recompilation. Related to this, also be careful when specifying the RECOMPILE option within your procedures.

Another consideration when standardizing procedure code is whether to enclose everything after the AS keyword in a BEGIN…END block, which is an optional syntax element. Some developers recommend that BEGIN…END always be used. Others suggest using them only when you have multiple statements. And there are those who believe they’re completely unnecessary. Whatever your team decides, just be consistent.

One other note. You might have noticed that the procedure includes no error handling. Chances are, you’ll want to include the necessary TRY…CATCH blocks in your procedures to gracefully handle any issues that might arise.

Also be sure that you’re using the RETURN statement correctly, whether inside or outside of the TRY…CATCH structure. The statement’s purpose is to return execution status only, not query results. If you want to return actual data, you can use SELECT statements or output parameters. For scalar values, it’s usually recommended to use output parameters because they’re better utilized by ADO. For example, in the procedure above, we’re using a SELECT statement to return the CandidateID value, but we can instead use an output parameter.

Also notice that, to get the CandidateID value, we use the @@IDENTITY system variable, but we might want to instead use the SCOPE_IDENTITY() function, depending on how we want to control the session scope.

Another important issue to take into account is dynamic SQL. You should be very careful about including it in your procedures. Not only can it represent a security risk (SQL injection), but it also has performance implications because the query engine generates the execution plan at runtime. However, if you can’t come up with a workaround to dynamic SQL, consider using sp_executesql when running the dynamic SQL and including an EXECUTE AS clause if necessary to control the security context in which the procedure runs.

Not surprisingly, there are numerous other issues to take into account when standardizing your stored procedure definitions, such as when to use global temporary tables, whether cursors should be permitted, whether parameters should be defined with default values, and how to handle functions when they’re called repeatedly within a procedure (i.e., call the function once and put the results in a variable).

Other considerations

There are, of course, plenty of other considerations to take into account when standardizing your DDL code. For example, you should avoid creating triggers that return results or creating views that contain an ORDER BY clause. In addition, your standard should cover how to use views to abstract the underlying tables and persist complex logic. You should also consider using constraints instead of triggers to enforce data integrity, when it is practical to do so.

Other considerations include the enforcement of object-level security and writing code for both SQL Server databases and Azure SQL Database (taking into account the differences between the two). And, of course, you should always keep performance at the forefront of your thinking. For example, try to keep your stored procedures and user-defined functions in the same database where most or all of the data is located.

Clearly, when it comes to standardizing your T-SQL code, there are many factors you must be aware of. And we haven’t even gotten to the DML stuff. The point is, standardizing your team’s code is no small matter, and it will take a concerted effort on everyone’s part to come up with a strategy that the team can work with. Only then can you hope to turn out readable and high-performing code.

The Basics of Good T-SQL Coding Style – Part 2: Defining Database Objects

The series so far:

Object identifiers

Table design

Column definitions

Data types

Stored procedures

Other considerations

Subscribe for more articles

Rate this article

Robert Sheldon

Related articles

The Basics of Good T-SQL Coding Style – Part 4: Performance

Snake draft sorting in SQL Server, part 2

Tags