It can be argued that database development, as an engineering discipline, was born along with the relational model in 1970. It has been almost 40 years (as I write these words), yet the field continues to grow and evolve-seemingly at a faster rate every year. This tremendous growth can easily be seen in the many facets of the Microsoft database platform. SQL Server is no longer just a simple SQL database system; it has become an application platform, a vehicle for the creation of complex and multifaceted data solutions. Today's database developer is expected to understand not only the Transact- SQL dialect spoken by SQL Server, but also the intricacies of the many components that must be controlled in order to make the database system do their bidding. This variety can be seen in the many topics discussed in the pages ahead: indexing, full-text search, SQL CLR integration, XML, external interfaces such as ADO.NET, and even mobile device development are all subjects within the realm of database development. The sheer volume of knowledge both required and available for consumption can seem daunting, and giving up is not an option.
The most important thing we can do is understand that while no one can know everything, we can strive to continually learn and enhance our skill sets, and that is where this article comes in. When reading SQL Server newsgroups or blogs, you could easily get the impression that there are two ways to manipulate data: declarative (set-based) or iterative (cursor-based). And that iterative code is always bad and should be avoided like the plague. Those impressions are both wrong. Iterative code isn't always bad (though, in all honesty, it usually is). And there's more to SQL Server than declarative or iterative-there are ways to combine them, adding their strengths and avoiding their weaknesses. This article is about one such method: set-based iteration. The technique of set-based iteration can lead to efficient solutions for problems that don't lend themselves to declarative solutions, because those would result in an amount of work that grows exponentially with the amount of data. In those cases, the trick is to find a declarative query that solves a part of the problem (as much as feasible), and that doesn't have the exponential performance problem- then repeat that query until all work has been done. So instead of attempting a single set-based leap, or taking millions of single-row-sized miniature steps in a cursor, set-based iteration arrives at the destination by taking a few seven-mile leaps.
In this article, I'll first explain the need for an extra alternative by discussing the weaknesses and limitations of purely iterative and purely declarative coding. I'll then explain the technique of set-based iteration by presenting two examples: first a fairly simple one, and then a more advanced case. Developing SQL Server code can be challenging. You have so many ways to achieve the same result that the challenge isn't coming up with working code, but picking the "best" working code from a bunch of alternatives. So what's the point of adding yet another technique, other than making an already tough choice even harder? The answer is that there are cases (admittedly, not many) where none of the existing options yield acceptable performance, and set-based iteration does. Declarative coding is, without any doubt, the most-used way to manipulate data in SQL Server. And for good reason, because in most cases it's the fastest possible code. The basic principle of declarative code is that you don't tell the computer how to process the data in order to create the required results, but instead declare the results you want and leave it to the DBMS to figure out how to get those results. Declarative code is also called set-based code because the declared required results aren't based on individual rows of data, but on the entire set of data. For example, if you need to find out which employees earn more than their manager, the declarative answer would involve one single query, specifying all the tables that hold the source data in its FROM clause, all the required output columns in its SELECT clause, and using a WHERE clause to filter out only those employees that meet the salary requirement.
The main benefit of declarative coding is its raw performance. For one thing, SQL Server has been heavily optimized toward processing declarative code. But also, the query optimizer-the SQL Server component that selects how to process each query-can use all the elements in your database (including indexes, constraints, and statistics on data distribution) to find the most efficient way to process your request, and even adapt the execution plan when indexes are added or statistics indicate a major change in data distribution. Another benefit is that declarative code is often much shorter and (once you get the hang of it) easier to read and maintain than iterative code. Shorter, easier-to-read code directly translates into a reduction of development cost, and an even larger reduction of future maintenance cost. Aside from the learning curve for people with a background in iterative coding, there's only one problem with the set-based approach. Because you have to declare the results in terms of the original input, you can't take shortcuts by specifying end results in terms of intermediate results. In some cases, this results in queries that are awkward to write and hard to read. In other cases, it may result in queries that force SQL Server to do more work than would otherwise be required. Running totals is an example of this. There's no way to tell SQL Server to calculate the running total of each row as the total of the previous row plus the value of the current row, because the running total of the previous row isn't available in the input, and partial query results (even though SQL Server does know them) can't be specified in the language.
The only way to calculate running totals in a set-based fashion is to specify each running total as the sum of the values in all preceding rows. That implies that a lot more summation is done than would be required if intermediate results were available. This results in performance that degrades exponentially with the amount of data, so even if you have no problems in your test environment, you will have problems in your 100-million-row production database!
Our website is not responsible for the information contained by this article. Articleinput.com is a free articles resource thus practically any visitor can submit an article. However if you notice any copyrighted material, please contact us and we will remove the article(s) in discussion right away.
Note: This article was sent to us by: Christopher Baker at 01072010
1. Associative arrays in Python are dictionaries
All articles are property of their respective authors. Please read our Privacy Policy!
© 2009 ArticleInput.com.