Weighted percentiles, part 1
From this table:
SELECT item_id , units , price FROM items ORDER BY price ; ITEM_ID UNITS PRICE --------- ---------- ---------- 1 4 1.99 3 6 2.29 4 11 2.49 2 1 2.99 ...
...and so on, I can find the average price, min, max, etc. What I need to know is the fifth percentile and the 95th percentile of the price data, weighted by units. I have noticed that there are functions like percentile_disc. But my problem is that I have to weight the price by the units sold.
Do you have any suggestions in Oracle SQL?
Let's assume your data consists of just the four rows shown above. To restate the problem, you want to find the price such that 5% of the units sold had a lower (or equal) price, and the price such that 95% of the units sold had a lower (or equal) price. Since the total number of units is 22, you want the prices of the "1.1st" unit (.05 * 22 = 1.1) and the "20.9th" unit.
The analytic function percentile_disc was introduced in Oracle 9 to answer questions just like yours on unweighted data. Unfortunately, like you, I don't see how it can be used on weighted data. There is, however, another analytic function, sum, that can help by telling you how many units sold at a lower (or equal) price for each row of your table. With that number, and the grand total of the units sold, you can compute the percentiles.
SELECT item_id -- Query 1 , units , price , SUM (units) OVER (ORDER BY price RANGE UNBOUNDED PRECEDING) AS running_total , ( SELECT SUM (units) FROM items ) AS grand_total FROM items ORDER BY price ; ITEM_ID UNITS PRICE RUNNING_TOTAL GRAND_TOTAL ---------- ---------- ---------- ------------- ----------- 1 4 1.99 4 22 3 6 2.29 10 22 4 11 2.49 21 22 2 1 2.99 22 22
You can query the result set above to get the information you need, one item at a time, as shown below. Notice that the in-line view (the subquery) is (almost) exactly the same as Query 1 shown above.
SELECT price AS price_05 -- Query 2 FROM ( -- begin Sub-query 2a, same as Query 1 SELECT item_id , units , price , SUM (units) OVER (ORDER BY price RANGE UNBOUNDED PRECEDING) AS running_total , ( SELECT SUM (units) FROM items ) AS grand_total FROM items ) -- end sub-query 2a WHERE (.05 * grand_total) BETWEEN (running_total - units) AND units ; PRICE_05 ---------- 1.99
Click to jump to part 2.