'How to use NTILE function to create two groups for two columns where one is a nested NTILE of the other?
I have financial data that I want to categorise (using NTILE()) by two columns that contain percentage values (risk, debt_to_assets). My end goal is that I want to aggregate some other column (profit) by the two categories, but that's not related to my issue at hand. The data looks something like this:
| profit | risk | debt_to_assets |
|---|---|---|
| 7000 | 0.10 | 0.20 |
| 1000 | 0.40 | 0.70 |
| 3000 | 0.15 | 0.50 |
| 4000 | 0.30 | 0.30 |
| 2000 | 0.20 | 0.60 |
The issue I'm trying to solve is that I want the categories to be nested such that, in addition to the population distribution being uniform, the categories of the inner quantile are consistent across the categories of the outer quantile in terms of the range that defines the categories (i.e. I want the min and max value for the inner categories (x0, y0), (x1, y0), (x2, y0), ... to all be the same or as close as possible, where the x's are the outer category and the y's are the inner category).
Ideally if I were to aggregate the columns used for the NTILE() function (using 3 inner categories and 3 outer categories for example) I'd want a table that resembles the following:
| risk_cat | dta_cat | min_risk | max_risk | min_dta | max_dta | count |
|---|---|---|---|---|---|---|
| 1 | 1 | 0.00 | 0.33 | 0.00 | 0.33 | 100 |
| 1 | 2 | 0.00 | 0.33 | 0.34 | 0.67 | 100 |
| 1 | 3 | 0.00 | 0.33 | 0.68 | 1.00 | 100 |
| 2 | 1 | 0.34 | 0.67 | 0.00 | 0.33 | 100 |
| 2 | 2 | 0.34 | 0.67 | 0.34 | 0.67 | 100 |
| 2 | 3 | 0.34 | 0.67 | 0.68 | 1.00 | 100 |
| 3 | 1 | 0.68 | 1.00 | 0.00 | 0.33 | 100 |
| 3 | 2 | 0.68 | 1.00 | 0.34 | 0.67 | 100 |
| 3 | 3 | 0.68 | 1.00 | 0.68 | 1.00 | 100 |
These are the solutions I've tried but they only solve part of the issue, not the whole thing:
SELECT *,
NTILE(3) OVER (
ORDER BY risk
) AS risk_cat,
NTILE(3) OVER (
ORDER BY debt_to_assets
) AS dta_cat
FROM my_table
This would result in an aggregated table like this:
| risk_cat | dta_cat | min_risk | max_risk | min_dta | max_dta | count |
|---|---|---|---|---|---|---|
| 1 | 1 | 0.00 | 0.33 | 0.00 | 0.33 | 10 |
| 1 | 2 | 0.00 | 0.33 | 0.34 | 0.67 | 55 |
| 1 | 3 | 0.00 | 0.33 | 0.68 | 1.00 | 180 |
| 2 | 1 | 0.34 | 0.67 | 0.00 | 0.33 | 135 |
| 2 | 2 | 0.34 | 0.67 | 0.34 | 0.67 | 140 |
| 2 | 3 | 0.34 | 0.67 | 0.68 | 1.00 | 100 |
| 3 | 1 | 0.68 | 1.00 | 0.00 | 0.33 | 130 |
| 3 | 2 | 0.68 | 1.00 | 0.34 | 0.67 | 110 |
| 3 | 3 | 0.68 | 1.00 | 0.68 | 1.00 | 40 |
The problem is that the count across the two categories isn't uniform.
WITH outer_cat AS (
SELECT *,
NTILE(3) OVER (
ORDER BY risk
) AS risk_cat
FROM my_table
)
SELECT *,
NTILE(3) OVER(
PARTITION BY risk_cat
ORDER BY debt_to_assets
) AS dta_cat
FROM outer_cat
The aggregated table for this might resemble the following:
| risk_cat | dta_cat | min_risk | max_risk | min_dta | max_dta | count |
|---|---|---|---|---|---|---|
| 1 | 1 | 0.00 | 0.33 | 0.10 | 0.70 | 100 |
| 1 | 2 | 0.00 | 0.33 | 0.71 | 0.90 | 100 |
| 1 | 3 | 0.00 | 0.33 | 0.91 | 1.00 | 100 |
| 2 | 1 | 0.34 | 0.67 | 0.05 | 0.35 | 100 |
| 2 | 2 | 0.34 | 0.67 | 0.36 | 0.60 | 100 |
| 2 | 3 | 0.34 | 0.67 | 0.61 | 0.90 | 100 |
| 3 | 1 | 0.68 | 1.00 | 0.00 | 0.25 | 100 |
| 3 | 2 | 0.68 | 1.00 | 0.26 | 0.50 | 100 |
| 3 | 3 | 0.68 | 1.00 | 0.51 | 0.80 | 100 |
The problem this time is that the min and max values for the inner category vary to much across the outer category.
SELECT *,
NTILE(9) OVER(
ORDER BY risk, debt_to_assets
) AS dual_cat
FROM my_table
The aggregated table for this looks something like the following:
| dual_cat | min_risk | max_risk | min_dta | max_dta | count |
|---|---|---|---|---|---|
| 1 | 0.00 | 0.11 | 0.55 | 1.00 | 100 |
| 2 | 0.12 | 0.22 | 0.35 | 1.00 | 100 |
| 3 | 0.23 | 0.33 | 0.15 | 1.00 | 100 |
| 4 | 0.34 | 0.44 | 0.40 | 1.00 | 100 |
| 5 | 0.45 | 0.55 | 0.10 | 1.00 | 100 |
| 6 | 0.56 | 0.66 | 0.10 | 0.95 | 100 |
| 7 | 0.67 | 0.77 | 0.05 | 1.00 | 100 |
| 8 | 0.78 | 0.88 | 0.20 | 1.00 | 100 |
| 9 | 0.89 | 1.00 | 0.00 | 1.00 | 100 |
This was just a last attempt at a solution after the previous two didn't work. This attempt didn't capture any of the behaviour that I was looking for.
Is there a solution to my problem that I'm not seeing?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
