Menu Bar

Saturday, September 10, 2011

Clustered Indexes and Primary Keys and Unique Constraints, Oh My!


"Captain, can you stop her from being cheerful please." - Jayne Cobb, Firefly Pilot

At a presentation at SQL Saturday #94 today, the presenter brought up a point I hadn't thought about.  The differences between a primary key, a unique index and a clustered index.

I admit, I wasn't up to snuff on the intricate details between the three so I did a little research into the matter to clarify for myself.

Primary Key

A Primary Key enforces entity integrity on a table, that is, uniqueness.  A Primary Key is one or more columns that uniquely identify a row in a table.  Each table may have one and only one Primary Key.  Any column declared in a Primary Key cannot contain NULL values.  Obviously, because a primary key enforces uniqueness, there can be no duplicate values in the index.

When a Primary Key constraint is created for a table, the database engine creates a unique index for the columns defined in the Primary Key (Clustered, by default, if the table doesn't already have one).  Columns of PRIMARY KEY constraints that are specified when a table is created are implicitly converted to NOT NULL.

If a clustered index does not already exist on the table or a nonclustered index is not explicitly specified, a unique, clustered index is created to enforce the PRIMARY KEY constraint.  We'll get to clustered indexes in a moment.

A Primary Key can also be used in conjunction with with a Foreign Key of another table to enforce referential integrity.

A Primary Key constraint cannot be dropped if it is referenced by a Foreign Key constraint in another table.  The Foreign Key constraint must be deleted first.

Unique

Unique can apply in two ways - Unique Constraint and Unique Index.

A Unique Index will enforce uniqueness on one or more columns in a database.  There may be more than one Unique Index on a database.  Unique Indexes do not have to be the Primary Key of the table. 

A Unique Constraint can be applied to a column in a table.  When a column is defined with a Unique constraint, the database engine automatically creates a unique nonclustered index covering those columns.

Unique keys and indexes may also be used as the reference point in a Foreign Key Constraint.

Unique Indexes / Constraints allow nulls (where Primary Keys do not).  But, like any value appearing in a unique column, only one null is allowed - uniqueness must be maintained. There could be more than one null in a unique index if the rest of the columns in the index are still unique even with the null.

Example:

create table tab1 (col1 char(2), col2 char(2));
insert into tab1 values ('1', null);
insert into tab1 values ('2', null);

Executing the following fails (duplicate nulls):

create unique nonclustered index idx1 on tab1 (col2);

Executing the following succeeds (Even though there are duplicate nulls, the indexes columns still remain unique):

create unique nonclustered index idx1 on tab1 (col1, col2);

Clustered Indexes

A Clustered Index is created to physically order the data on disk.  The leaf nodes of the index actually contain the data rows of the table.  A table may have only one Clustered Index (since the Clustered Index represents the physical ordering of the data, having more than one would be impossible).

A Clustered Index is not required to be Unique.

The Clustered Index should be built before the Nonclustered Indexes because creating a Clustered Index causes all existing Nonclustered Indexes to be rebuilt.

If not specifically specified to be Clustered, the default setting for creating an Index is Nonclustered.

It is important to note that if "ON FILEGROUP" is specified in creating a Clustered Index, since the table's data pages are stored in the index leaf pages, if a different filegroup is specified on the index that was used in the CREATE TABLE statement, then the table is moved to the filegroup specified in the CREATE INDEX statement.

The Unique Clustered Index on a table does not necessarily have to be the Primary Key of the table.

Non Clustered Index

The leaf layer of a Nonclustered Index contains index pages instead of data pages.

(OK stealing the following couple of paragraphs from the Nonclustered Index Structures page from Books Online since they cover it so very well):

[Begin Steal]

The row locators in nonclustered index rows are either a pointer to a row or are a clustered index key for a row, as described in the following:

For a table with no Clustered Index (called a Heap), the index page contains a row locator (pointer) to the row.  This pointer is the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).

For a table that uses a Clustered Index, the row locator is the clustered index key for the row. If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.

[End Steal]

So, a Primary Key without a Clustered Index on the table would use a RID on a related Nonclustered Index to find a row in the table where as a Clustered Index on a table uses the Clustered Index Key on a related Nonclustered Index lookup.

The end result of the fetch is the same, but the how it does it is different.

Now, after reading all this, we get to the difference between a Primary Key and a Clustered Index.  Both can be used to specify uniqueness on a table (if the Clustered Index is defined as Unique of course).  The Primary Key just specifies Uniqueness.  The Clustered Index can define how the Nonclustered Indexes behave.

In order to prevent page splits, it is wise to have the primary index on a table be an ever increasing value.  If rows are inserted into the middle of the existing primary index values, inevitably page splits will occur that fragment the index.

As Kimberly Tripp notes on her blog (see below for the complete article):

In summary, the clustering key really has all of these purposes:

1) It defines the lookup value used by the nonclustered indexes (should be unique, narrow and static)

2) It defines the table's order (physically at creation and logically maintained through a linked list after that) - so we need to be careful of fragmentation

3) It can be used to answer a query (either as a table scan - or, if the query wants a subset of data (a range query) and the clustering key supports that range, then yes, the clustering key can be used to reduce the cost of the scan (it can seek with a partial scan)

This certainly become longer than I had intended, but I believe this covers the major aspects of Clustered and Nonclustered Indexing (with and without Uniqueness) in SQL Server.

References:
http://msdn.microsoft.com/en-us/library/ms191236.aspx
http://msdn.microsoft.com/en-us/library/ms181043.aspx
http://msdn.microsoft.com/en-us/library/ms191166.aspx
http://msdn.microsoft.com/en-us/library/ms177420.aspx
http://msdn.microsoft.com/en-us/library/ms175132.aspx
http://msdn.microsoft.com/en-us/library/ms177443.aspx
http://msdn.microsoft.com/en-us/library/ms177484.aspx
http://sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx