'Restructuring Cumulative Running Totals
Firstly apologies for my title I wasnt sure how to able this. The below code outputs the running avg of male and female salaries for each year.
SELECT
Date,
Gender,
Employee_Groups,
SUM(SUM(salary)) OVER(PARTITION BY Gender ORDER BY Date)
/
SUM(SUM(employee_cnt)) OVER(PARTITION BY Gender ORDER BY Date) AS avg_salary,
SUM(SUM(employee_cnt)) OVER(PARTITION BY Gender ORDER BY Date) AS employee_cnt
FROM(
SELECT
FORMAT_DATE('%Y', DATE(PARSE_DATE('%m/%d/%Y', Timestamp))) AS Date,
Gender,
Employee_Groups,
salary,
COUNT(DISTINCT Employee_ID) AS employee_cnt
FROM
a_table
WHERE
Gender IS NOT NULL
GROUP BY
Date,
salary,
Employee_Groups,
Gender
ORDER BY
Date,
Gender
)
GROUP BY
Date,
Gender,
Employee_Groups
ORDER BY
Date,
Gender
The code output the following table:
| Date | Gender | avg_salary | employee_cnt |
|---|---|---|---|
| 2013 | Male | 24,000 | 3 |
| 2014 | Female | 20,000 | 1 |
| 2019 | Female | 23,000 | 2 |
My issue is one of structuring I think. Their values are cumulative averages so there was 1 female employee in 2014 and an additional female employee in 2019 making the employee_cnt 2 and the avg_salary the average between the two. If I have both female and male in a year this would be easy but since it's not it's causing me some issues
I want to find the Wage Equity Ratio which is Avg Female Salary / Avg Male Salary over time. So starting from 2014 when there were both male and female employees I want the cumulative ratio over the Years.
So the desired output would be
| Date | Ratio | female_employees | male_employees |
|---|---|---|---|
| 2014 | 0.833 | 1 | 3 |
| 2019 | 0.958 | 2 | 3 |
Solution 1:[1]
Averaging the ratio over the past years can be done by a window function.
With people as (
SELECT "male" as Gender, salary , year from unnest(generate_array(2012,2020,1)) as year, unnest(generate_array(23000,25000,1000)) as salary
union all SELECT "female" , 20000 ,2013
union all SELECT "female" , 24000 ,2019
union all SELECT "female" , 22000 ,2019
)
SELECT *,
avg(ratio) over win as ratio_average,
# I would take this kpi:
sum(salary_female) over win
/sum(count_female) over win
/ sum(salary_male) over win
* sum(count_male) over win as rolling_average
From
(
SELECT *, avg_female/avg_male as ratio
from people
PIVOT(AVG(salary) avg, COUNT(1) count,sum(salary) salary FOR Gender IN ('male','female') )
)
window win as (order by year asc RANGE BETWEEN UNBOUNDED PRECEDING and CURRENT ROW)
Having small numbers in a sample and then doing an average of an average may leed to wrong results. Averaging values which are average and have different sizes is an issue, because of over and under representation.
The rolling_average calculates the average over all years and thus ignores facts like inflation over time.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Samuel |
