Using T-SQL windowing functions to calculate the median value of a collection

December 6, 2013

T-SQL includes an AVG() function which makes it easy to calculate the average value of a collection. The average price of a collection, for instance, is the total value of the collection members divided by the number of members in the collection. So if your collection includes 3, 5 and 12 the average is (3+5+12) / 3 or 6.667.

But what if you want to calculate the median, or middle, value in the collection? In this case there are three numbers, so we simply choose the middle one, or 5. Unfortunately, T-SQL doesn’t include a built-in function to calculate the median price so we have to roll our own, using a common table expression and windowing functions.

The example code calculates the median price in a particular customer’s service line, for example lawn care or pest control. Assuming not all prices for orders in the service like are the same, the example will find the middle value in the collection.

But what happens if the collection contains an even number of rows, and there are two in the middle? If the collection contains 3, 5, 12, 25 then the numbers 5 and 12 are in the middle, the median is halfway between these two values, or 8.5

USE TempDB
GO

IF OBJECT_ID('dbo.SalesData', 'U') IS NOT NULL
  DROP TABLE dbo.SalesData
GO

CREATE TABLE dbo.SalesData
(
    Counter int NOT NULL IDENTITY (1,1), 
    AccountNum int,
    ServiceLine varchar(20),
    SaleAmount decimal (19,4)
    CONSTRAINT PK_sample_table PRIMARY KEY (Counter)
)

--  Populate table with sample sales data for two customers and multiple service lines.
INSERT INTO  dbo.SalesData (AccountNum, ServiceLine, SaleAmount)
Values  (1000, 'Lawn Care', 39.95), (1000, 'Lawn Care', 39.95), (1000, 'Lawn Care', 39.95), (1000, 'Lawn Care', 39.95),(1000, 'Lawn Care', 39.95),(1000, 'Lawn Care', 42.95),(1000, 'Lawn Care', 42.95), (1000, 'Lawn Care', 42.95)
     , (2000, 'Tree Care', 45.95), (2000, 'Tree Care', 49.95), (2000, 'Tree Care', 49.95), (2000, 'Tree Care', 49.95),(2000, 'Tree Care', 49.95),(2000, 'Tree Care', 62.95),(2000, 'Tree Care', 55.95), (2000, 'Tree Care', 52.95)
     , (2000, 'Pest Control', 89.95), (2000, 'Pest Control', 89.95), (2000, 'Pest Control', 109.95);

/* 
    Find the median sales amount for each customer and service line
    ROW_NUMBER() establishes a ranking order for the values in the partition.
    Cnt shows the number of rows in that partition.
*/

WITH Median AS
(
     SELECT AccountNum, ServiceLine, SaleAmount,
       ROW_NUMBER() OVER(PARTITION BY AccountNum, ServiceLine ORDER BY SaleAmount) AS RowNum,
       COUNT(*) OVER(PARTITION BY AccountNum, ServiceLine) AS Cnt
     FROM dbo.SalesData
)

/*
    In the case that two numbers are in the middle of the collection, the true median lies as the average between the two values.
*/

SELECT AccountNum, ServiceLine, AVG(SaleAmount) AS MedianPrice  
FROM Median
WHERE RowNum IN((Cnt + 1) / 2, (Cnt + 2) / 2)
GROUP BY AccountNum, ServiceLine
comments powered by Disqus