How to ensure data quality


The most fundamental thing you can do to ensure the quality of your data is to choose a reasonable container for the value being stored. Want to store an integer value? Use an integer type. Want to store an image? Use a varbinary(max). Could you use a varchar( 50) or varchar(max) to store every value? In almost every situation, you sure could. You could also stick your tongue in a lamp socket, but I guarantee you it's a bad idea unless you want to say hi to your cat that your mother told you was "sent off to live on the farm." Way too often, I see designs where every column is nvarchar(50), unless they want to store something larger; then it's an nvarchar(max). Storagewise, unless you expect that some values will have special characters, using a Unicode-enabled type just costs you space, and even worse is absolutely horrible for data integrity. It's perfectly possible to do this and get by, as all values can be represented as a textual value in a T-SQL DML statement. So why bother? First, consider performance. As an example, consider storing a GUID value in a varchar column. Using the uniqueidentifier data type, you can store the 36-character value in 16 bytes. Put it in a varchar column and you'll need 36 bytes, or 72 if the column is nvarchar. The second, more important problem is data integrity. A uniqueidentifier column requires a properly formatted GUID to be stored, whereas a varchar column is happy to take Your data integrity blows as a value. Setting the NULL specification of a column to allow or disallow NULLs is an important step in getting correct data. It's quite common to see that someone has created a perfectly acceptable design, but left each column nullable. Part of this is probably laziness, as not specifying a column's NULL specification will by default allow the column to allow NULL values. NULLs greatly increase the complexity of using a column, because NULL isn't equal to anything, including itself. But NULL = NULL isn't false; rather it evaluates to NULL. Then consider the expression NOT(NULL) which also evaluates to NULL, but really looks like it should be true, right? You might be thinking that it's silly, that you'd never need to care about this type of thing, right? Well, consider the following:

SELECT *
FROM table
WHERE NOT(nullableColumn = @value)

If all values for nullableColumn were NULL, then that statement would return the same rows as the following:

SELECT *
FROM table
WHERE (nullableColumn = @value)

Because the expression nullableColumn = @value evaluates to a Boolean value, in the case where nullableColumn is NULL, nothing is returned. Only rows where the expression evaluates to TRUE are returned. If you need to allow a value to have no known value, then having nullable columns is acceptable. But, in many cases, columns allow NULLs because the implementer didn't think or care if it mattered and let some other code manage whether data is optional. The problem here is twofold:

The SQL Server engine uses all information it has to make your queries faster. The fact that all values must be filled in helps the optimizer to not worry about the case where a value is NULL.

You have to write code that validates this over and over. Placing it in the definition of the data makes sure that it's always adhered to, regardless of whether the UI has checked to make sure that the user entered a value.

So the fact is, there's no reason not to disallow nulls where they should never occur; it can only help your system, and never hurt it. I'm not going to get any deeper into the NULL discussion than this, and I leave it to you to find all of the ways that NULL makes comparisons more interesting. The point of this is to show that although NULL can be a useful tool, it behooves you to protect against unnecessary NULL values, as unhandled NULLs in comparisons will mess up your query results in ways that won't be obvious to you without some thought.

One of the most common newbie questions is how to remove duplicate data from a table. After demonstrating the requisite query to remove the duplicated data (removing what might actually be a row of important data that the user updated instead of the kept copy), a nice but stern rant follows on why you must protect your data against duplicated data. In SQL Server, uniqueness is enforced using either a primary key or a unique constraint. The primary key is used for the main value that each row will be referenced as. Unique constraints are then applied to all other candidate keys, which are frequently other sets of columns that a user will want to reference a row as. Every table should have a primary key to make sure that you have some way to distinguish one row from other rows. In addition, it's important to make sure that you have either the primary key or a unique key on a value in the table that has some meaning. An identity value in the table doesn't have any actual meaning, as you could have two rows with exactly the same value other than the identity key, and the two rows would be technically indistinguishable from one another. This isn't always completely possible, as there are some tables where a key that has full meaning can't be created, such as a table that logs events that have occurred, where multiple events of the same type could occur at the same instance. For logging situations, adding a sequence number is an acceptable trade-off, though that shouldn't be the only value in the key. As an example, if you had a customer table with an identity value customerId for a surrogate key (the name given a key that's generated without regard to any meaningful data to stand in for a more complex key value) and a customerNumber. You can also use ALTER TABLE syntax to add a PRIMARY KEY or UNIQUE constraint once the table has been created. Internally, PRIMARY KEYS and UNIQUE constraints are actually hybrid objects-they're constraints, but physically they're implemented internally by creating a unique index. The syntax for creating a PRIMARY KEY or UNIQUE constraint includes several of the different settings for an index, such as whether the index to create is clustered, what partition to place it on, and so on. Semantically, the distinction between a unique index and a constraint should be considered as part of your design. A constraint declares that some factor must be true, and for key constraints, this is declaring that the values in the selected columns must be unique. An index is a physical construct used specifically to speed up some operation; in this case, SQL Server is applying a unique index to speed up checking for duplicate values. I generally think of an index as something that can be created or dropped by the production staff without helping or harming anything other than performance. As mentioned before, indexes are, in general, only to be used to improve performance, not for data protection. For SQL Server 2008, Microsoft has given us a WHERE clause on the CREATE INDEX syntax that also has data protection capabilities not found in UNIQUE constraints. If you want to protect for unique values except for when the value is some given value (most commonly, the use is unique values when not NULL), you can use a unique index with a WHERE clause. For example, if you have a table that holds persons, and you want to associate a unique customer number with the user if she becomes a customer, but not until, you could define an index such as the following:

CREATE INDEX AKSperson_customerNumber on person(customerNumber)
WHERE customerNumber is not null

If you're using a previous version of SQL Server, there are other methods to use, such as an indexed view where you define the view to exclude data like you did with the WHERE clause on the CREATE INDEX statement. Then you add a clustered unique index on the view. Either method is acceptable, and implements the same sort of constructs with the same overhead.

Legal Disclaimer

Our website is not responsible for the information contained by this article. Articleinput.com is a free articles resource thus practically any visitor can submit an article. However if you notice any copyrighted material, please contact us and we will remove the article(s) in discussion right away.

Note: This article was sent to us by: Steve Miller at 01072010

Related Articles

1. Lambda expressions and generator functions in Python
Short functions like those you just saw can also be defined using lambda expressions of the form. Lambda expressions are anonymous little funct...

2. How to create a basic program in Python
Up until now, you've been using the Python interpreter mainly in interactive mode. For production use, you'll want to create Python programs or...

3. How to make Python script execution in UNIX and Mac OS X and in Windows
If you're on UNIX, you can easily make a script directly executable. Note that if Python 3.x isn't your default version of Python, you may need...

4. The difference between scripts on Windows scripts on UNIX
The way you call scripts on Windows differs from the way scripts are called on Linux/ UNIX, and that difference can affect what kind of scripts...

5. Python applications are distributed as source files
You can distribute your scripts as source files (as .py files). You can also ship them as byte code (as .pyc or .pyo files). A byte code file w...

6. How to use the Python file system
Working with files involves one of two things: basic I/O and working with the filesystem (for example, naming, creating, moving, or referring t...

7. Client complaints about inconsistent query results
You've probably seen it before: a client has called the help desk and reported that a couple of queries have seemingly inconsistent results. Yo...

8. How to use a foreign key constraint
A foreign key constraint is used to make sure that columns involved in a relationship contain only correct data. I start out the article with a...

9. Triggers in SQL Server programming
Triggers are stored batches of T-SQL, much like stored procedures. But instead of being called directly, they execute indirectly when you execu...