Sean Holtshousen – Developer Blog

I created this blog to provide F.A.Q. to day to day SQL, C#.Net and other language questions.

November 10, 2009

Common Table Expressions (CTE)

0
Interesting article I found here: http://www.simple-talk.com/sql/t-sql-programming/sql-server-2005-common-table-expressions/

CTE basics
The concept behind a CTE is simplicity itself. Consider the following statement:

with MyCTE(x)
as
(select x=’hello’)
select x from MyCTE

This defines a CTE called MyCTE. In brackets after the as keyword is the query that defines the CTE. The subsequent query references our CTE, in this case simply returning the string “hello”.

Like a derived table, a CTE lasts only for the duration of a query but, in contrast to a derived table, a CTE can be referenced multiple times in the same query. So, we now we have a way of calculating percentages and performing arithmetic using aggregates without repeating queries or using a temp table:

with MyCTE(x)
as
(
select top 10 x = id from sysobjects
)
select x, maxx = (select max(x) from MyCTE), pct =
100.0 * x / (select sum(x) from MyCTE)
from MyCTE

This returns (on my system):

x maxx pct
4 2137058649 2.515723270440
5 2137058649 3.144654088050
7 2137058649 4.402515723270
8 2137058649 5.031446540880
13 2137058649 8.176100628930
15 2137058649 9.433962264150
25 2137058649 15.723270440251
26 2137058649 16.352201257861
27 2137058649 16.981132075471
29 2137058649 18.238993710691

Note that although this has only referenced sysobjects once in the CTE, the query plan will confirm that sysobjects is actually scanned 3 times – each aggregate in the result set causes an additional scan. As a result, it would still be more efficient to accumulate the values in a temporary table or table variable if you are accessing large tables.

CTE and recursion
More interesting, in my opinion, is the use of recursion with CTEs. The table defined in the CTE can be referenced in the CTE itself to give a recursive expression, using union all:

with MyCTE(x)
as
(
select x = convert(varchar(8000),’hello’)
union all
select x + ‘a’ from MyCTE where len(x) < 100
)
select x from MyCTE
order by x

The query:

select x = convert(varchar(8000),'hello')

is called the anchor member. This is executed in the first pass and will populate the CTE with the result, in this case hello. This initial CTE is repeatedly executed until the complete result set is returned. The next entry:

select x + 'a' from MyCTE where len(x) < 100

is a recursive member as it references the CTE, MyCTE. The recursive member is executed with the anchor member output to give helloa. The next pass takes helloa as input and returns helloaa, and so on so that we arrive at a CTE populated with rows as follows:

hello
helloa
helloaa
helloaaa
helloaaaa
…

The recursion will terminate when the recursive member produces no rows – in this case recursion stops when the length of x equals 99 – or when the recursion limit is reached (more about that later). The CTE is then output by the following statement:

select x from MyCTE
order by x

There are a few interesting issues associated with even this simple CTE usage. Note that the anchor member populates the CTE with a varchar(8000). Had the convert not been included then you would expect the datatype of x to be defined by that of the anchor member (varchar(5)) and to give an error when trying to insert the longer recursive entries. This does indeed happen – but you would expect varchar(1000) to be fine. Not so. This still produces the error that the datatypes don't match. In fact, you should always cast the recursive member output to be the same as the anchor member:

with MyCTE(x)
as
(
select x = convert(varchar(1000),'hello')
union all
select convert(varchar(1000),x + 'a') from MyCTE
where len(x) < 100
)
select x from MyCTE
order by x

But why, then, did the code work with varchar(8000)? This looks like a bug. As far as I can tell varchar(8000) and nvarchar(4000) seem to be the only values that work. Varchar(max) and varchar(7999) give an error as do all the other values that I have tried.

Multiple anchor members
Any entry that does not reference the CTE will be considered an anchor member so we can also include multiple anchor members using a union all:

with MyCTE(x)
as
(
select x = convert(varchar(1000),'hello')
union all
select x = convert(varchar(1000),'goodbye')
union all
select convert(varchar(1000),x + 'a') from MyCTE
where len(x) < 10
)
select x from MyCTE
order by len(x), x

This adds two rows from the anchor members and hence the recursive member acts on two rows for each pass to produce the output:

hello
helloa
goodbye
helloaa
goodbyea
helloaaa
goodbyeaa
helloaaaa
goodbyeaaa
helloaaaaa

The first recursive pass produces the output "helloa" and "goodbyea". The second produces "helloaa" and "goodbyeaa". The third produces "helloaaa" and "goodbyeaaa". For subsequent passes the string containing "goodbye" is too long for the check len(x)<10 but recursion continues as long as some output is produced – so we get more entries from hello than goodbye.

Multiple recursive members
We can also include multiple recursive members to produce extra rows at each pass:

with MyCTE(x)
as
(
select x = convert(varchar(1000),'hello')
union all
select convert(varchar(1000),x + 'a') from MyCTE where len(x) < 10
union all
select convert(varchar(1000),x + 'b') from MyCTE where len(x) < 10
)
select x from MyCTE
order by len(x), x

This produces the result:

hello
helloa
hellob
helloaa
helloab
helloba
hellobb
helloaaa
helloaab
helloaba
helloabb
hellobaa
hellobab
hellobba
hellobbb
helloaaaa
….
63 rows of output.

In order to understand this output, you need to remember that the input to each pass is the output from the previous pass. On the first recursive pass the input is the row from the anchor member. This produces two rows – one from each recursive member. On the next pass the input is the two rows output from the previous pass. Each recursive member outputs two rows producing four in total. In other words, the number of rows output doubles with each pass.

We can modify the output by limiting the rows on which the recursive members act. For example:

with MyCTE(x)
as
(
select x = convert(varchar(1000),'hello')
union all
select convert(varchar(1000),x + 'a') from MyCTE
where len(x) < 10 and (len(x) = 5 or x like '%a')
union all
select convert(varchar(1000),x + 'b') from MyCTE
where len(x) < 10 and (len(x) = 5 or x like '%b')
)
select x from MyCTE
order by len(x), x

giving:

hello
helloa
hellob
helloaa
hellobb
helloaaa
hellobbb
helloaaaa
hellobbbb
helloaaaaa
hellobbbbb

Another method is to flag the recursive members:

with MyCTE(i, x)
as
(
select i = 0, x = convert(varchar(1000),'hello')
union all
select i = 1, convert(varchar(1000),x + 'a') from MyCTE
where len(x) < 10 and i in (0,1)
union all
select i = 2, convert(varchar(1000),x + 'b') from MyCTE
where len(x) < 10 and i in (0,2)
)
select x from MyCTE
order by len(x), x

This gives the same result as above. It forces each recursive member to act only on the anchor member output and on any output that it itself has produced. This is at the expense of including the redundant data column "i" in the CTE but it does make the code a lot more readable. Using a similar technique we can output a value to indicate which pass produces each row:

with MyCTE(r1, r2, i, x)
as
(
select r1 = 1, r2 = 1, i = 0, x = convert(varchar(1000),'hello')
union all
select r1 = r1 + 1, r2 = r2, i = 1, convert(varchar(1000),x + 'a') from MyCTE
where len(x) < 10 and i in (0,1)
union all
select r1 = r1, r2 = r2 + 1, i = 2, convert(varchar(1000),x + 'b') from MyCTE
where len(x) < 10 and i in (0,2)
)
select r1, r2, x from MyCTE
order by len(x), x

This returns the following, r1 giving the pass number for the first recursive member and r2 for the second:

r1 r2 x
1 1 hello
2 1 helloa
1 2 hellob
3 1 helloaa
1 3 hellobb
4 1 helloaaa
1 4 hellobbb
5 1 helloaaaa
1 5 hellobbbb
6 1 helloaaaaa
1 6 hellobbbbb

This is very useful for debugging and also for detecting how a CTE is processed and for controlling the number of passes for each recursive member. For instance:

with MyCTE(r1, r2, i, x)
as
(
select r1 = 1, r2 = 1, i = 0, x = convert(varchar(1000),'hello')
union all
select r1 = r1 + 1, r2 = r2, i = 1, convert(varchar(1000),x + 'a') from MyCTE
where i in (0,1) and R1 < 3
union all
select r1 = r1, r2 = r2 + 1, i = 2, convert(varchar(1000),x + 'b') from MyCTE
where i in (0,2) and R2 < 6
)
select r1, r2, x from MyCTE
order by len(x), x

Here I have terminated the first recursive member after two iterations and the second after five, giving:

1 1 hello
2 1 helloa
1 2 hellob
3 1 helloaa
1 3 hellobb
1 4 hellobbb
1 5 hellobbbb
1 6 hellobbbbb

Note that recursion continues until a pass produces no output – although the first recursive member terminates early recursion continues until both recursive members produce no output.

Recursion limit
In order to demonstrate the recursion limit, we start by producing a series of numbers in a CTE:

with MyCTE(i)
as
(
select i = 1
union all
select i = i + 1 from MyCTE where i < 100
)
select i
from MyCTE
order by i

This happily produces the numbers 1 to 100 (a tally table). However, try increasing the recursion limit as follows:

with MyCTE(i)
as
(
select i = 1
union all
select i = i + 1 from MyCTE where i < 1000
)
select i
from MyCTE
order by i

You will receive the following error:

"The statement terminated. The maximum recursion 100 has been exhausted before statement completion"

It sounds like there is a limit of 100 for recursion but luckily this is just the default limit. For development this is very useful; to save time and space consider setting it lower for first attempts.

The maximum number of recursions can be set via the maxrecursion option, up to a maximum of 32767 (wouldn't it be nice, though, if the error message suggested that the recursion limit should be increased?):

with MyCTE(i)
as
(
select i = 1
union all
select i = i + 1 from MyCTE where i < 1000
)
select i
from MyCTE
order by i
option (maxrecursion 1000)

Uses for Common Table Expressions
Finally, let's review some of the more interesting applications of CTEs.

Traversing a hierarchy
This is covered well in BOL and now gives SQL Server something to compete with the Oracle connect by operator.

Date Ranges
A common requirement is to aggregate entries per day (or month or year). This is easy using a group by, if there are entries for every day:

declare @Sales table (TrDate datetime, Amount money)
insert @Sales select '20060501', 200
insert @Sales select '20060501', 400
insert @Sales select '20060502', 1200

select [day] = TrDate, sum(amount) total_sales
from @sales
group by TrDate
order by TrDate

giving:

day total_sales
2006-05-01 00:00:00.000 600.00
2006-05-02 00:00:00.000 1200.00

However, if there are some days with no transactions the result set, rather than report zero, does not give an entry for that day. To get round this we need to left join to a tally table which includes the days on which we wish to report. With a CTE this becomes simple. For example, for a year:

declare @Sales table (TrDate datetime, Amount money)
insert @Sales select '20060501', 200
insert @Sales select '20060501', 400
insert @Sales select '20060502', 1200

;with MyCTE(d)
as
(
select d = convert(datetime,'20060101')
union all
select d = d + 1 from MyCTE where d < '20061231'
)
select [day] = d.d, sum(coalesce(s.amount,0))
from MyCTE d
left join @sales s
on s.TrDate = d.d
group by d.d
order by d.d
option (maxrecursion 1000)

Note the ";" before the CTE definition – that's just a syntax requirement if the CTE declaration is not the first statement in a batch.

Parsing CSV values
It is common to pass a CSV string in to a stored procedure as a means of passing in an array of values. Often this is turned into a table via a function. Using a CTE we can now contain this code within the stored procedure:

declare @s varchar(1000)
select @s = 'a,b,cd,ef,zzz,hello'

;with csvtbl(i,j)
as
(
select i=1, j=charindex(',',@s+',')
union all
select i=j+1, j=charindex(',',@s+',',j+1) from csvtbl
where charindex(',',@s+',',j+1) 0
)
select substring(@s,i,j-i)
from csvtbl

The output is as follows:

a
b
cd
ef
zzz
hello

How does this work? The anchor member, select i=1, j=charindex(‘,’,@s+’,'), returns 1 and the location of the first comma. The recursive member gives the location of the first character after the comma and the location of the next comma (we append a comma to the string to get the last entry). The result set is then obtained by using these values in a substring.

In the previous example the CTE output was the start and end locations of each of the strings. We can instead produce the strings themselves; the CTE code becomes a little more complicated but the following query is simplified:

declare @s varchar(1000)
select @s = ‘a,b,cd,ef,zzz,hello’

;with csvtbl(i,j, s)
as
(
select i=1, s=charindex(‘,’,@s+’,'),
substring(@s, 1, charindex(‘,’,@s+’,')-1)
union all
select i=j+1, j=charindex(‘,’,@s+’,',j+1),
substring(@s, j+1, charindex(‘,’,@s+’,',j+1)-(j+1))
from csvtbl where charindex(‘,’,@s+’,',j+1) 0
)
select s from csvtbl

Beyond 32767
What happens if you want a list of numbers that extends beyond 32767? Although the recursion limit is 32767 it is possible to create extra entries. For example, the following CTE returns all the numbers from 0 to 64000. The result set produced is used to check the results. The maximum and count verify that all of the numbers are present, assuming that the result is a sequence:

with n(rc,i)
as
(
select rc = 1, i = 0
union all
select rc = 1, i = i + 1 from n where rc = 1 and i < 32000
union all
select rc = 2, i = i + 32001 from n where rc = 1 and i < 32000
)
select count_i = count(*), max_i = max(i)
from n
option (maxrecursion 32000)

To take this a step further we can accumulate values by manipulating the CTE:

with n (j)
as
(
select j = 0
union all
select j = j + 1 from n where j < 32000
)
select max_i = max (na.i), count_i = count(*)
from ( select i = j + k
from n
cross join
( select k = j * 32001
from n
where j < 32
) n2
) na
option (maxrecursion 32000)

giving:

max_i count_i
———– ——-
1024031 1024032

In other words, this returns all the numbers from 0 to 1024031. How does this work? The derived table, n2, consists of all the numbers from 0 to 32 multiplied by 32001, i.e.

0
32001
64002
96003
….

This is cross-joined with the result of the CTE, n, which consists of all the numbers from 0 to 32000, and the result is the sum of the values from n and n2. Hence we get:

from n 0- 32000 with 0 from n2 to give 0 – 32000
from n 0- 32000 with 32001 from n2 to give 32001 – 64001
from n 0- 32000 with 64002 from n2 to give 64002 – 96002
from n 0- 32000 with 96003 from n2 to give 64003 – 96003
…..

to give the values 0 – 1,024,031. If more values are required then just increase the size of n2 by increasing the maximum value from 32.

November 5, 2009

Space Saving Using VarDecimal

0
Here is a very interesting article a colleague found on reducing the size of databases in SQL using what is called a vardecimal storage format for DECIMAL and NUMERIC datatypes:

http://technet.microsoft.com/en-gb/library/bb508963(SQL.90).aspx

It basically has the same effect as using VARCHAR instead of CHAR on string fields i.e. it treats DECIMAL and NUMERIC data types as variable length instead of fixed-length. This feature is available from SQL 2005 SP 2 onward, which means we would be able to use it straight away.

As an example, taking a large table TableA, it estimates (there is a system proc that allows for this – all details are in article) that the average row length will decrease by about 148 bytes per row:

Executing the following:

EXEC sys.sp_estimated_rowsize_reduction_for_vardecimal ‘TableA’

returns the following results:

avg_rowlen_fixed_format avg_rowlen_vardecimal_format row_count
1,402.86 1,254.28 76,527,900

So, saving 1402.86 – 1254.28 = +- 148 bytes per row * 76,527,900 rows in table = 11,326,129,200 bytes = +- 10.5 GB !!!!!

Obviously, there would be some performance implications when reading data out as SQL has to explicitly convert between vardecimal and fixed format, or vice versa when writing data back to disk. So, CPU bound workloads could adversely affect performance but I/O bound workloads could result in performance gain.

November 4, 2009

IN vs. EXISTS vs. JOIN

0
These are some of the ways of checking if data exists in a table. Each way has its advantages.

1. count(*)
IF(SELECT COUNT(*) FROM [table] WHERE … ) > 0
Definitely the simplest to understand but it has terrible performance. Everytime this is done a full table scan is performed. Huge resource toll.

2. IN
SELECT
[field names]
FROM
[table]
WHERE
IN (SELECT [field names] FROM [table])
Functionally, they are the same (as compared to NOT IN vs NOT EXISTS which are functionally different in one scenario – read this post for the differences between NOT IN and NOT EXISTS clauses : http://decipherinfosys.wordpress.com/2007/01/21/32/ ). However, there are performance implications of using one over the other that one needs to be aware of. Assume that we have two tables : TABLE_A and TABLE_B and the match is being done on TABLE_A.col1 = TABLE_B.col2. In that scenario, an in statement like:

select [select column list] from TABLE_A where col1 in (Select col2 from TABLE_B)

will get processes in this way:

1) The sub-query gets evaluated first and the results are distinct’ed and indexed,

2) The output from it is then joined with TABLE_A.

Re-writing the above query using the EXISTS clause will give:

Select [select column list] from TABLE_A

where exists (select 1 from Table_B where Table_B.col2 = Table_A.col1)

This gets evaluated in this order:

1) For every value of Table_A.col1, loop through and match the values in Table_B.col2.

2) If we get a match, select that value and move on to the next one. If there is no match, discard that value.

So, where should one use an IN vs the EXISTS clause? If the result of the sub-query “Select col2 from TABLE_B” is huge and the TABLE_A is a relatively small set and executing “select 1 from Table_B where Table_B.col2 = Table_A.col1″ is very fast because of proper index on Table_B.col2, then an exists clause will be better since the optimizer can do a FTS on Table_A and then use the index to do the probe/seek operations for Table_B.

If the result of the sub-query is small, then the IN clause is much faster. If the results of the both the sub-query as well as the outer query is large, then either IN or EXISTS would work the same – it depends upon your indexing scheme.

Please do note that the example used above is a very simplistic one in order to illustrate the point – in real world, you would have queries that have additional filter criteria on those tables that narrows down the result sets. As a generic rule, if the result of the outer query is small and the result set of the inner sub-query is large, then use EXISTS – if it is the other way around, then use the IN clause.

3. EXISTS
SELECT
FROM [table]
WHERE
EXISTS (SELECT FROM [table])
This method is best to use when it comes to resource hogging, because exists stops the execution as soon as it reads the first row.

October 23, 2009

VARCHAR VS NVARCHAR

0
SQL Server provides both datatypes to store character information. For the most part the two datatypes are identical in how you would work with them within SQL Server or from an application. The difference is that nvarchar is used to store unicode data, which is used to store multilingual data in your database tables. Other languages have an extended set of character codes that need to be saved and this datatype allows for this extension. If your database will not be storing multilingual data you should use the varchar datatype instead. The reason for this is that nvarchar takes twice as much space as varchar, this is because of the need to store the extended character codes for other languages. From guyS’s WebLog