Joe Celko's SQL for Smarties - Advanced SQL Programming, P50


CHAPTER 21: AGGREGATE FUNCTIONS
21.4 Extrema Functions

or:

    SELECT P2.dept_nbr, MIN(P1.salary_amt)
      FROM Personnel AS P1, Personnel AS P2
     WHERE P1.dept_nbr = P2.dept_nbr
       AND P1.salary_amt >= P2.salary_amt
     GROUP BY P2.dept_nbr, P2.salary_amt
    HAVING COUNT(DISTINCT P1.salary_amt) <= 3;

21.4.4 GREATEST() and LEAST() Functions

Oracle has a proprietary pair of functions that return the greatest and least values, respectively, a sort of “horizontal” MAX() and MIN(). The syntax is GREATEST (<list of values>) and LEAST (<list of values>). Awkwardly, DB2 allows MIN and MAX as synonyms for LEAST and GREATEST.

If you have NULLs, you have to decide whether they sort high or low, and whether they are excluded or propagate, so this function can be defined in several ways.

If you don’t have NULLs in the data:

    CASE WHEN col1 > col2 THEN col1 ELSE col2 END

If you want the highest non-NULL value:

    CASE WHEN col1 > col2 THEN col1 ELSE COALESCE(col2, col1) END

If you want to return NULL where one of the cols is NULL:

    CASE WHEN col1 > col2 OR col1 IS NULL THEN col1 ELSE col2 END

But for the rest of this section, let’s assume (a < b) and NULL is high:

    GREATEST (a, b) = b
    GREATEST (a, NULL) = NULL
    GREATEST (NULL, b) = NULL
    GREATEST (NULL, NULL) = NULL

We can write this as:

    GREATEST(x, y) ::= CASE WHEN x > y OR x IS NULL
                            THEN x
                            ELSE y END

The rules for LEAST() are:

    LEAST (a, b) = a
    LEAST (a, NULL) = a
    LEAST (NULL, b) = b
    LEAST (NULL, NULL) = NULL

This is written:

    LEAST(x, y) ::= CASE WHEN (COALESCE (x, y) <= COALESCE (y, x))
                         THEN COALESCE (x, y)
                         ELSE COALESCE (y, x) END

This can be done in Standard SQL, but takes a little bit of work. Let’s assume that we have a table that holds the scores for a player in a series of five games and we want to get his best score from all five games.
    CREATE TABLE Games
    (player CHAR(10) NOT NULL PRIMARY KEY,
     score_1 INTEGER NOT NULL DEFAULT 0,
     score_2 INTEGER NOT NULL DEFAULT 0,
     score_3 INTEGER NOT NULL DEFAULT 0,
     score_4 INTEGER NOT NULL DEFAULT 0,
     score_5 INTEGER NOT NULL DEFAULT 0);

and we want to find the GREATEST (score_1, score_2, score_3, score_4, score_5).

    SELECT player,
           MAX(CASE X.seq_nbr
               WHEN 1 THEN score_1
               WHEN 2 THEN score_2
               WHEN 3 THEN score_3
               WHEN 4 THEN score_4
               WHEN 5 THEN score_5
               ELSE NULL END) AS best_score
      FROM Games
           CROSS JOIN
           (VALUES (1), (2), (3), (4), (5)) AS X(seq_nbr)
     GROUP BY player;

Another approach is to use a pure CASE expression. As written, this one returns the LEAST() of the five scores; reverse each comparison to >= to get the GREATEST():

    CASE WHEN score_1 <= score_2 AND score_1 <= score_3
              AND score_1 <= score_4 AND score_1 <= score_5
         THEN score_1
         WHEN score_2 <= score_3 AND score_2 <= score_4
              AND score_2 <= score_5
         THEN score_2
         WHEN score_3 <= score_4 AND score_3 <= score_5
         THEN score_3
         WHEN score_4 <= score_5
         THEN score_4
         ELSE score_5 END

A final trick is to use a bit of algebra. You can define:

    GREATEST(a, b) ::= (a + b + ABS(a - b)) / 2
    LEAST(a, b) ::= (a + b - ABS(a - b)) / 2

Then iterate on it as a recurrence relation on numeric values. For example, for three items, you can use GREATEST (GREATEST(a, b), c), which expands to:

    ((a + b) + ABS(a - b) + 2 * c
     + ABS((a + b) + ABS(a - b) - 2 * c)) / 4

Watch for possible overflow errors if the numbers are large, and remember that NULLs propagate through the math functions. Here is the answer for five scores.
    (score_1 + score_2 + 2*score_3 + 4*score_4 + 8*score_5
     + ABS(score_1 - score_2)
     + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
     + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
           + ABS(score_1 - score_2)
           + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2)))
     + ABS(score_1 + score_2 + 2*score_3 + 4*score_4 - 8*score_5
           + ABS(score_1 - score_2)
           + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
           + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
                 + ABS(score_1 - score_2)
                 + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2)))
          )) / 16

21.5 The LIST() Aggregate Function

The LIST([DISTINCT] <string expression>) is part of Sybase’s SQL Anywhere (formerly WATCOM SQL). It is the only aggregate function to work on character strings. It takes a column of strings, removes the NULLs and merges them into a single result string with commas between each of the original strings. The DISTINCT option removes duplicates as well as NULLs before concatenating the strings. This function is a generalized version of concatenation, just as SUM() is a generalized version of addition.

MySQL 4.1 extended this function into the GROUP_CONCAT() function, which does the same thing but adds options for ORDER BY and SEPARATOR.

This is handy when you use SQL to write SQL queries. As one simple example, you can apply it against the schema tables and obtain the names of all the columns in a table, then use that list to expand a SELECT * into the current column list.

One nonproprietary way of doing this query is with scalar subquery expressions.
Assume we have these two tables:

    CREATE TABLE People
    (person_id INTEGER NOT NULL PRIMARY KEY,
     person_name CHAR(10) NOT NULL);

    INSERT INTO People
    VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');

    CREATE TABLE Clothes
    (person_id INTEGER NOT NULL,
     seq_nbr INTEGER NOT NULL,
     item_name CHAR(10) NOT NULL,
     worn_flag CHAR(1) NOT NULL
       CONSTRAINT worn_flag_yes_no
       CHECK (worn_flag IN ('Y', 'N')),
     PRIMARY KEY (person_id, seq_nbr));

    INSERT INTO Clothes
    VALUES (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
           (2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
           (3, 1, 'Shoes', 'N'),
           (4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');

Using the LIST() function, we could get an output of the outfits of the people with the simple query:

    SELECT P0.person_id, P0.person_name, LIST(item_name) AS fashion
      FROM People AS P0, Clothes AS C0
     WHERE P0.person_id = C0.person_id
       AND C0.worn_flag = 'Y'
     GROUP BY P0.person_id, P0.person_name;

Result

    person_id  person_name  fashion
    ===================================
    1          'John'       'Hat, Glove'
    2          'Mary'       'Hat, Coat'
    4          'Jane'       'Socks'

21.5.1 The LIST() Function with a Procedure

To do this without an aggregate function, you must first know the highest sequence number, so you can create the query. In this case, the query is a simple “SELECT MAX(seq_nbr) FROM Clothes” statement, but you might have to use a COUNT(*) for other tables.
    SELECT DISTINCT P0.person_id, P0.person_name,
           SUBSTRING (
             COALESCE ((SELECT CASE WHEN C1.worn_flag = 'Y'
                                    THEN (', ' || C1.item_name)
                                    ELSE '' END
                          FROM Clothes AS C1
                         WHERE C1.person_id = C0.person_id
                           AND C1.seq_nbr = 1), '')
             || COALESCE ((SELECT CASE WHEN C2.worn_flag = 'Y'
                                       THEN (', ' || C2.item_name)
                                       ELSE '' END
                             FROM Clothes AS C2
                            WHERE C2.person_id = C0.person_id
                              AND C2.seq_nbr = 2), '')
             || COALESCE ((SELECT CASE WHEN C3.worn_flag = 'Y'
                                       THEN (', ' || C3.item_name)
                                       ELSE '' END
                             FROM Clothes AS C3
                            WHERE C3.person_id = C0.person_id
                              AND C3.seq_nbr = 3), '')
             FROM 3) AS list
      FROM People AS P0, Clothes AS C0
     WHERE P0.person_id = C0.person_id;

    person_id  person_name  list
    ================================
    1          John         Hat, Glove
    2          Mary         Hat, Coat
    3          Fred
    4          Jane         Socks

A scalar subquery that finds no row returns NULL, and a NULL propagates through concatenation, so each subquery is wrapped in COALESCE() to replace the NULL with an empty string. If you don’t want to see that Fred is naked (has an empty string of clothing), then change the outermost WHERE clause to read:

    WHERE P0.person_id = C0.person_id
      AND C0.worn_flag = 'Y';

Since you don’t want to see a leading comma, remember to TRIM() it off or to use the SUBSTRING() function to remove the first two characters. I opted for the SUBSTRING(), because the TRIM() function requires a scan of the string.
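The procedure described in this section (look up the highest sequence number, then generate one scalar subquery per sequence number) can be sketched as a small script. This is only an illustration under stated assumptions: it runs on SQLite through Python's standard sqlite3 module rather than on SQL Anywhere, it uses SQLite's SUBSTR() in place of SUBSTRING(... FROM 3), the helper name build_list_query is hypothetical, and it correlates each subquery directly to People instead of joining to Clothes with DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People
(person_id INTEGER NOT NULL PRIMARY KEY,
 person_name CHAR(10) NOT NULL);
INSERT INTO People VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');

CREATE TABLE Clothes
(person_id INTEGER NOT NULL,
 seq_nbr INTEGER NOT NULL,
 item_name CHAR(10) NOT NULL,
 worn_flag CHAR(1) NOT NULL CHECK (worn_flag IN ('Y', 'N')),
 PRIMARY KEY (person_id, seq_nbr));
INSERT INTO Clothes VALUES
 (1, 1, 'Hat', 'Y'), (1, 2, 'Coat', 'N'), (1, 3, 'Glove', 'Y'),
 (2, 1, 'Hat', 'Y'), (2, 2, 'Coat', 'Y'),
 (3, 1, 'Shoes', 'N'),
 (4, 1, 'Pants', 'N'), (4, 2, 'Socks', 'Y');
""")


def build_list_query(connection):
    """Generate the concatenation query (hypothetical helper).

    One scalar subquery is emitted per sequence number, from 1 up to
    the highest seq_nbr currently stored in Clothes.
    """
    (max_seq,) = connection.execute(
        "SELECT MAX(seq_nbr) FROM Clothes").fetchone()
    pieces = [
        # COALESCE turns the NULL of a missing row into an empty string.
        "COALESCE((SELECT CASE WHEN C.worn_flag = 'Y' "
        "THEN ', ' || C.item_name ELSE '' END "
        "FROM Clothes AS C "
        f"WHERE C.person_id = P0.person_id AND C.seq_nbr = {n}), '')"
        for n in range(1, max_seq + 1)
    ]
    # SUBSTR(..., 3) drops the leading comma and space.
    return (
        "SELECT P0.person_id, P0.person_name, "
        f"SUBSTR({' || '.join(pieces)}, 3) AS list "
        "FROM People AS P0 ORDER BY P0.person_id"
    )


rows = conn.execute(build_list_query(conn)).fetchall()
for row in rows:
    print(row)
```

Because the query text is generated from MAX(seq_nbr), adding a fourth item for any person only requires running the generator again; nothing else has to be edited by hand.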
21.5.2 The LIST() Function by Crosstabs

Carl Federl used this to get a similar result:

    CREATE TABLE Crosstabs
    (seq_nbr INTEGER NOT NULL PRIMARY KEY,
     seq_nbr_1 INTEGER NOT NULL,
     seq_nbr_2 INTEGER NOT NULL,
     seq_nbr_3 INTEGER NOT NULL,
     seq_nbr_4 INTEGER NOT NULL,
     seq_nbr_5 INTEGER NOT NULL);

    INSERT INTO Crosstabs
    VALUES (1, 1, 0, 0, 0, 0),
           (2, 0, 1, 0, 0, 0),
           (3, 0, 0, 1, 0, 0),
           (4, 0, 0, 0, 1, 0),
           (5, 0, 0, 0, 0, 1);

    SELECT Clothes.person_id,
           TRIM (MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_1 * 10))
             || ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_2 * 10))
             || ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_3 * 10))
             || ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_4 * 10))
             || ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_5 * 10)))
      FROM Clothes, Crosstabs
     WHERE Clothes.seq_nbr = Crosstabs.seq_nbr
       AND Clothes.worn_flag = 'Y'
     GROUP BY Clothes.person_id;

21.6 The PRD() Aggregate Function

Bob McGowan sent me a message on CompuServe asking for help with a problem. His client, a financial institution, tracks investment performance with a table something like this:

    CREATE TABLE Performance
    (portfolio_id CHAR(7) NOT NULL,
     execute_date DATE NOT NULL,
     rate_of_return DECIMAL(13,7) NOT NULL);

To calculate a rate of return over a date range, you use the formula:

    (1 + rate_of_return [day_1])
    * (1 + rate_of_return [day_2])
    * (1 + rate_of_return [day_3])
    * (1 + rate_of_return [day_4])
    ...
    * (1 + rate_of_return [day_N])

How would you construct a query that would return one row for each portfolio’s return over the date range? What Mr. McGowan really wants is an aggregate function in the SELECT clause to return a columnar product, just as SUM() returns a columnar total. If you were a math major, you would write these functions as capital Sigma (∑) for summation and capital Pi (∏) for product.
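As a worked instance of the date-range formula, here is a minimal sketch in plain Python. The three daily rates are made-up illustration values, not data from Mr. McGowan's client, and compound_return is a hypothetical helper name.

```python
import math


def compound_return(daily_rates):
    """Multiply (1 + rate) across every day in the range."""
    return math.prod(1.0 + rate for rate in daily_rates)


# Hypothetical daily rates of return: +1%, -2%, +3%.
rates = [0.01, -0.02, 0.03]
growth = compound_return(rates)
print(growth)
```

The three factors are 1.01, 0.98 and 1.03, so the portfolio ends the range at roughly 1.0195 times its starting value. This is the columnar product that the rest of the section tries to obtain inside SQL.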
If such an aggregate function existed in SQL, the syntax for it would look something like:

    PRD ([DISTINCT] <expression>)

While I am not sure that there is any use for the DISTINCT option, the new aggregate function would let us write his problem simply as:

    SELECT portfolio_id, PRD(1.00 + rate_of_return)
      FROM Performance
     WHERE execute_date BETWEEN start_date AND end_date
     GROUP BY portfolio_id;

21.6.1 PRD() Function by Expressions

There is a trick to doing this, but you need a second table that looks like this and covers a period of five days:

    CREATE TABLE BigPi
    (execute_date DATE NOT NULL,
     day_1 INTEGER NOT NULL,
     day_2 INTEGER NOT NULL,
     day_3 INTEGER NOT NULL,
     day_4 INTEGER NOT NULL,
     day_5 INTEGER NOT NULL);

Let’s assume we wanted to look at January 6 to 10, so we need to set the execute_date column to that range, thus:

    INSERT INTO BigPi
    VALUES ('2006-01-06', 1, 0, 0, 0, 0),
           ('2006-01-07', 0, 1, 0, 0, 0),
           ('2006-01-08', 0, 0, 1, 0, 0),
           ('2006-01-09', 0, 0, 0, 1, 0),
           ('2006-01-10', 0, 0, 0, 0, 1);

The idea is that there is a one in the column when BigPi.execute_date is equal to the nth date in the range, and a zero otherwise. The query for this problem is:

    SELECT portfolio_id,
           (SUM((1.00 + P1.rate_of_return) * M1.day_1)
            * SUM((1.00 + P1.rate_of_return) * M1.day_2)
            * SUM((1.00 + P1.rate_of_return) * M1.day_3)
            * SUM((1.00 + P1.rate_of_return) * M1.day_4)
            * SUM((1.00 + P1.rate_of_return) * M1.day_5)) AS product
      FROM Performance AS P1, BigPi AS M1
     WHERE M1.execute_date = P1.execute_date
       AND P1.execute_date BETWEEN '2006-01-06' AND '2006-01-10'
     GROUP BY portfolio_id;

If a portfolio is missing a rate_of_return entry on a date in that range, that portfolio's product will be zero.
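The BigPi query can be tried out with a short script. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module, where DATE and DECIMAL are only type names with numeric affinity, and the two portfolios 'FULL' and 'GAP' are invented sample data. 'FULL' has a row for every day; 'GAP' has no row for 2006-01-08.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Performance
(portfolio_id CHAR(7) NOT NULL,
 execute_date DATE NOT NULL,
 rate_of_return DECIMAL(13,7) NOT NULL);

CREATE TABLE BigPi
(execute_date DATE NOT NULL,
 day_1 INTEGER NOT NULL, day_2 INTEGER NOT NULL, day_3 INTEGER NOT NULL,
 day_4 INTEGER NOT NULL, day_5 INTEGER NOT NULL);

INSERT INTO BigPi VALUES
 ('2006-01-06', 1, 0, 0, 0, 0), ('2006-01-07', 0, 1, 0, 0, 0),
 ('2006-01-08', 0, 0, 1, 0, 0), ('2006-01-09', 0, 0, 0, 1, 0),
 ('2006-01-10', 0, 0, 0, 0, 1);

-- Hypothetical sample rows: 'FULL' covers all five days, 'GAP' lacks Jan 8.
INSERT INTO Performance VALUES
 ('FULL', '2006-01-06', 0.01), ('FULL', '2006-01-07', 0.02),
 ('FULL', '2006-01-08', -0.01), ('FULL', '2006-01-09', 0.00),
 ('FULL', '2006-01-10', 0.03),
 ('GAP', '2006-01-06', 0.01), ('GAP', '2006-01-07', 0.02),
 ('GAP', '2006-01-09', 0.00), ('GAP', '2006-01-10', 0.03);
""")

query = """
SELECT P1.portfolio_id,
       (SUM((1.00 + P1.rate_of_return) * M1.day_1)
        * SUM((1.00 + P1.rate_of_return) * M1.day_2)
        * SUM((1.00 + P1.rate_of_return) * M1.day_3)
        * SUM((1.00 + P1.rate_of_return) * M1.day_4)
        * SUM((1.00 + P1.rate_of_return) * M1.day_5)) AS product
  FROM Performance AS P1, BigPi AS M1
 WHERE M1.execute_date = P1.execute_date
   AND P1.execute_date BETWEEN '2006-01-06' AND '2006-01-10'
 GROUP BY P1.portfolio_id
"""

# Map each portfolio_id to its product over the five-day range.
results = dict(conn.execute(query).fetchall())
print(results)
```

Each SUM() picks out exactly one day because the other four rows are multiplied by zero. For 'GAP' the day_3 sum has nothing but zeros to add, so the whole product collapses to zero.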
That might be fine, but if you needed to get a NULL when you have missing data, then replace each SUM() expression with a CASE expression like this:

    CASE WHEN SUM((1.00 + P1.rate_of_return) * M1.day_N) = 0.00
         THEN CAST (NULL AS DECIMAL(6, 4))
         ELSE SUM((1.00 + P1.rate_of_return) * M1.day_N) END

Alternatively, if your SQL has the full set of Standard SQL expressions, use this shorter version, which returns NULL when the sum is zero:

    NULLIF (SUM((1.00 + P1.rate_of_return) * M1.day_N), 0.00)

21.6.2 The PRD() Aggregate Function by Logarithms

Roy Harvey, another SQL guru who answered questions on CompuServe, found a different solution, one that could only come from someone old enough to remember slide rules and multiplication by adding logs. But there are a lot of warnings about this approach.

Some older SQL implementations might have trouble with using an aggregate function result as a parameter. This has always been part of the standard, but some SQL products use very different mechanisms for the aggregate functions.

Another, more fundamental problem is that a log of zero or less is undefined, so your SQL might return a NULL or an error message. You will also see some SQL products that use LN() for the natural log and LOG10() for the logarithm base ten, and some SQLs that use LOG(<parameter>, <base>) for a general logarithm function.

Given all those warnings, the expression for the product of a column from logarithm and exponential functions is:

    SELECT (EXP (SUM (LN (CASE WHEN nbr = 0.00
                               THEN CAST (NULL AS FLOAT)
                               ELSE ABS(nbr) END))))
           * (CASE WHEN MIN (ABS (nbr)) = 0.00
                   THEN 0.00
                   ELSE 1.00 END)
           * (CASE WHEN MOD (SUM (CASE WHEN SIGN(nbr) = -1
                                       THEN 1
                                       ELSE 0 END), 2) = 1
                   THEN -1.00
                   ELSE 1.00 END) AS big_pi
      FROM NumberTable;

The nice part of this is that you can also use the SUM (DISTINCT <expression>) option to get the equivalent of PRD (DISTINCT <expression>).
You should watch the data type of the column involved and use either integer 0 and 1 or decimal 0.00 and 1.00 as is appropriate in the CASE expressions. It is worth studying the three CASE expressions that make up the terms of the product calculation.

The first CASE expression ensures that every zero is converted to a NULL and every negative number to its absolute value before the LN() function sees it, just in case your SQL raises an exception.

The second CASE expression will return zero as the answer if there is a zero in the nbr column of any selected row. The MIN(ABS(nbr)) trick is handy for detecting the existence of a zero in a list of both positive and negative numbers with an aggregate function.

The third CASE expression will return −1 if there is an odd number of negative numbers in the nbr column. The innermost CASE expression uses a SIGN() function, which returns +1 for a positive number, −1 for a negative number and 0 for a zero. The SUM() counts the −1 results, and the MOD() test in the query checks whether that count is odd.
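The three terms can be exercised with a small script. This is a sketch under stated assumptions, not a portable recipe: it runs on SQLite through Python's sqlite3 module, and because not every SQLite build ships LN(), EXP(), SIGN() and MOD(), the script registers its own NULL-safe versions with create_function(). The names product_by_logs and _null_safe are hypothetical helpers.

```python
import math
import sqlite3

QUERY = """
SELECT (EXP(SUM(LN(CASE WHEN nbr = 0.00
                        THEN CAST(NULL AS FLOAT)
                        ELSE ABS(nbr) END))))
       * (CASE WHEN MIN(ABS(nbr)) = 0.00 THEN 0.00 ELSE 1.00 END)
       * (CASE WHEN MOD(SUM(CASE WHEN SIGN(nbr) = -1
                                 THEN 1 ELSE 0 END), 2) = 1
               THEN -1.00 ELSE 1.00 END) AS big_pi
  FROM NumberTable
"""


def _null_safe(fn):
    """Wrap fn so that any NULL (None) argument yields NULL, as in SQL."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return fn(*args)
    return wrapper


def product_by_logs(values):
    """Load values into NumberTable and return the log-based product."""
    conn = sqlite3.connect(":memory:")
    # Register the math functions the query needs (assumption: the
    # local SQLite build may not provide them).
    conn.create_function("LN", 1, _null_safe(math.log))
    conn.create_function("EXP", 1, _null_safe(math.exp))
    conn.create_function("SIGN", 1, _null_safe(lambda x: (x > 0) - (x < 0)))
    conn.create_function("MOD", 2, _null_safe(lambda a, b: a % b))
    conn.execute("CREATE TABLE NumberTable (nbr FLOAT NOT NULL)")
    conn.executemany("INSERT INTO NumberTable VALUES (?)",
                     [(value,) for value in values])
    (big_pi,) = conn.execute(QUERY).fetchone()
    conn.close()
    return big_pi


print(product_by_logs([2.0, -3.0, 4.0]))   # one negative: sign flips
print(product_by_logs([2.0, 0.0, -5.0]))   # a zero forces a zero product
```

The result is a floating-point approximation because it goes through LN() and EXP(), so compare it against an expected value with a tolerance rather than with equality.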

Posted: 06/07/2014, 09:20
