I have a table:
site_id gender client_id 1 M 1a 1 F 1b 1 F 1c 2 F 2a 2 M 2b 3 M 3a
1. I want to select the site_ids that have a majority (over 50%) of female gender.
2. Alternatively, select a list of client_ids corresponding to all the site_ids selected above.
The first query is pretty straightforward. We simply do a GROUP BY on the site_id, and count how many females as compared to the overall count:
select site_id from daTable group by site_id having 100.0 * count(case when gender='F' then 937 end) / count(*) > 50.0
You might be wondering why the HAVING clause uses 100.0 in its calculation. This is because counts are integers, and whenever an arithmetic calculation involves only integers, the result will be an integer. So without the 100.0, which is a decimal, the result of the division would be either 0 or 1. With the 100.0, the calculation changes to a decimal calculation, and can be compared to a percentage.
Next, you may wonder what the 937 is doing. Fair question, and the use of that value is, admittedly, a wee red herring. The point is, 937 is not NULL. Notice that in the CASE expression, there is no ELSE, which means that, by default, the ELSE value is NULL. So the CASE expression evaluates to a non-NULL value for females, and NULL for males. Now, recall that aggregate functions like COUNT ignore NULLs, and the solution becomes clear.
Sometimes you will also see the partial count implemented like this:
having 100.0 * sum(case when gender='F' then 1 else 0 end) / count(*) > 50.0
Here SUM is used instead of COUNT, and the CASE returns 1s and 0s instead. It would also be okay to omit the "ELSE 0" here.
As for your second question, you probably meant to say "additionally" instead of "alternatively." The answer is:
select distinct client_id from daTable where site_id in ( first query goes here )
This was first published in January 2007