In Part 1 of this answer we examined averages produced by the AVG function on the following table:
subject | year | enrolled ----------+---------+------------- subject1 | 1998 | 20 subject1 | 1999 | 23 subject1 | 2000 | 16 subject2 | 1999 | 10 subject2 | 2000 | 21 subject3 | 2000 | 9
The averages for each subject were the same whether we supplied the missing years or not. Here we'll explore why, and how to work with NULLs and aggregates.
When we supplied the missing years and found that averages were not affected, we demonstrated that aggregate functions exclude NULLs. The average was calculated using the number of non-NULL values in each subject group.
To get averages calculated over all years in the given span, where enrollment rows are missing, we must assume that the number enrolled was zero. This may not be a valid assumption in all applications. Remember, NULL is not equal to any value, and in particular, NULL is not equal to zero, so we have to do something specific to make it work that way.
The obvious solution is to check when the column is NULL, and use zero instead, which is completely in line with the assumption we are making and is the best way to solve the problem. We simply use the COALESCE function:
select allyears.subject , avg(enrolled) as avgamt , avg(coalesce(enrolled,0)) as avgamtzero from ( select distinct subject , 1998+i as theyear from integers , subjects where i between 0 and 2 ) as allyears left outer join subjects on allyears.subject = subjects.subject and allyears.theyear = subjects.theyear group by allyears.subject
subject avgamt avgamtzero subject1 19.67 19.67 subject2 15.50 10.33 subject3 9.00 3.00
COALESCE is a standard SQL function. If your database does not support it, look for an equivalent function like ISNULL or NVL.
We made the correct calculation over the span of years by generating rows that were missing. Was this the easiest way? Consider the following query:
select allyears.subject , sum(enrolled) as sumamt , count(enrolled) as countamt , count(*) as countrows , avg(enrolled) as avgamt , avg(coalesce(enrolled,0)) as avgamtzero from ( select distinct subject , 1998+i as theyear from integers , subjects where i between 0 and 2 ) as allyears left outer join subjects on allyears.subject = subjects.subject and allyears.theyear = subjects.theyear group by allyears.subject
subject sumamt countamt countrows avgamt avgamtzero subject1 59 3 3 19.67 19.67 subject2 31 2 3 15.50 10.33 subject3 9 1 3 9.00 3.00
Notice that COUNT(*) counts rows, whether any particular column had NULLs or not. COUNT(*) ignores NULLs. This is the only exception to the rule that aggregate functions exclude NULLs, because COUNT(*) does not even look at columns.
So the other way of getting the desired averages is:
sum(enrolled) / count(*) as avgamtzero
I can't say whether this method is as efficient as AVG(COALESCE(xxx)), but my guess is that they are approximately the same.
Finally, there is one other approach. Instead of generating rows that were missing with a cross join to the integers table, just do this:
select subject , sum(enrolled) / 3 as avgamt from subjects group by subject
subject avgamt subject1 19.67 subject2 10.33 subject3 3.00
Is this reasonable? Can we just plug 3 into the calculation for the range of years desired? The derived table subquery was fashioned using inspection to determine the range of years. In general, it's not that easy. The generation of missing rows using the cross join would have used additional subqueries to determine the first and last years, so that inspection would not be necessary. If you can do it by inspection, okay, but it's nice to know how to attack the general problem, too.
Dig Deeper on Oracle and SQL
Have a question for an expert?
Please add a title for your question
Get answers from a TechTarget expert on whatever's puzzling you.