MainframeSupports
tip week 40/2004:

I have often seen DISTINCT used with UNION or UNION ALL. I have reached to the conclusion, that many SQL users doesn't fully understand how UNION works. Imagine two SQL statements, the first one called SELECT X returns rows A, B, B and C. The second one is called SELECT Y and returns rows B, C and D.

Now we make a third SQL statement, SELECT X UNION ALL SELECT Y. This SQL statement will return rows A, B, B, C, B, C and D. The result of a UNION ALL is always all rows from both involved SELECT statements even though some of the rows may be duplicates.

Next step is to execute the SQL statement SELECT X UNION SELECT Y. This SQL statement will return rows A, B, C and D. UNION without ALL removes all duplicate rows and leave one of the duplicates unremoved from the result. That is the reason why rows B and C only shows up one time each in the result. To produce this result DB2 has to scan the result of the two SELECT statements and eliminate the duplicates, which makes UNION a lot more expensive than UNION ALL.

If you know in advance that your SELECT X and SELECT Y each doesn't return the same row twice or more and furthermore knows that no rows in SELECT X is returned by SELECT Y and vice versa, I will strongly recommend you to use UNION ALL between SELECT X and SELECT Y, because UNION without ALL will create an unnessecary and expensive overhead. Surprisingly you will find that a lot of UNION's can be coded with UNION ALL and still produce the correct result.

Now what about SQL statement SELECT DISTINCT X UNION SELECT Y? This SQL statement will return exactly the same rows as SELECT X UNION SELECT Y. Again the addition of the DISTINCT will add an expensive overhead to your SQL statement as DISTINCT will remove duplicates in exactly the same manner as UNION does. UNION just removes duplicates without knowing which of the two SELECT statements the rows came from. So you should never use DISTINCT and UNION without ALL together.

Now we just have to look at the result of a SELECT DISTINCT X UNION ALL SELECT Y. This SQL statement will return rows A, B, C, B, C, D. We can conclude that DISTINCT and UNION ALL used together makes sense in order to produce the desired result.

Now I just have to add that UNION without ALL and DISTINCT are two real CPU consumers. DB2 will always make a sort when you use any of these two SQL facilities even when there is absolutely no reason for a sort to be carried out. DB2 doesn't sort anything when you use UNION ALL. It just returns all rows from the first SELECT and afterwards all the rows from the next SELECT and so on.

Previous tip in english        Sidste danske tip        Tip list