'Restructuring Cumulative Running Totals

Firstly apologies for my title I wasnt sure how to able this. The below code outputs the running avg of male and female salaries for each year.

SELECT 
        Date,
        Gender,
        Employee_Groups,
        SUM(SUM(salary)) OVER(PARTITION BY Gender ORDER BY Date) 
        / 
        SUM(SUM(employee_cnt)) OVER(PARTITION BY Gender ORDER BY Date) AS avg_salary,
        
        SUM(SUM(employee_cnt)) OVER(PARTITION BY Gender ORDER BY Date) AS employee_cnt
    FROM(
        SELECT 
            FORMAT_DATE('%Y', DATE(PARSE_DATE('%m/%d/%Y', Timestamp))) AS Date,
            Gender,
            Employee_Groups,
            salary,
            COUNT(DISTINCT Employee_ID) AS employee_cnt
        FROM 
            a_table
        WHERE   
            Gender IS NOT NULL
        
        GROUP BY 
            Date,
            salary,
            Employee_Groups,
            Gender
        ORDER BY 
            Date,
            Gender
        )
    GROUP BY 
        Date,
        Gender,
        Employee_Groups
    
    ORDER BY 
        Date,
        Gender

The code output the following table:

Date Gender avg_salary employee_cnt
2013 Male 24,000 3
2014 Female 20,000 1
2019 Female 23,000 2

My issue is one of structuring I think. Their values are cumulative averages so there was 1 female employee in 2014 and an additional female employee in 2019 making the employee_cnt 2 and the avg_salary the average between the two. If I have both female and male in a year this would be easy but since it's not it's causing me some issues

I want to find the Wage Equity Ratio which is Avg Female Salary / Avg Male Salary over time. So starting from 2014 when there were both male and female employees I want the cumulative ratio over the Years.

So the desired output would be

Date Ratio female_employees male_employees
2014 0.833 1 3
2019 0.958 2 3


Solution 1:[1]

Averaging the ratio over the past years can be done by a window function.

With people as (
    SELECT "male" as Gender, salary , year from unnest(generate_array(2012,2020,1)) as year, unnest(generate_array(23000,25000,1000)) as salary 
    union all  SELECT "female" , 20000 ,2013
    union all  SELECT "female" , 24000 ,2019
    union all  SELECT "female" , 22000 ,2019
    
)

SELECT *,
avg(ratio) over win as ratio_average,
# I would take this kpi:
sum(salary_female) over win
/sum(count_female) over win 
/ sum(salary_male) over win
* sum(count_male) over win  as rolling_average
From
(
SELECT *, avg_female/avg_male as ratio
 from people
  PIVOT(AVG(salary) avg, COUNT(1) count,sum(salary) salary FOR Gender IN ('male','female') )
)   
window win as (order by year asc RANGE BETWEEN UNBOUNDED PRECEDING and CURRENT ROW)

Having small numbers in a sample and then doing an average of an average may leed to wrong results. Averaging values which are average and have different sizes is an issue, because of over and under representation.

The rolling_average calculates the average over all years and thus ignores facts like inflation over time.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Samuel