'Efficiently join latest entry of first table to second table depending on entity characteristics of first table
Looking for efficient solution to join two tables but with the caveat that characteristics of second table should determine what is joined to first table (in Google BigQuery).
Lets say I have two tables. One Table with events (id, session, event_date) and a second with policies applying to events (event_id, policy, create_date) and I want to determine which policy applied to an event based on the policy create date and the event date.
CREATE TEMP TABLE events AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS id, "A" AS session, "2021-11-05" AS event_date),
(1, "B", "2021-12-17"),
(2, "A", "2021-08-13")
])
);
CREATE TEMP TABLE policies AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS event_id, "foo" AS policy, "2021-01-01" AS create_date),
(1, "bar", "2021-12-01"),
(2, "foo", "2021-02-01")
])
)
In my example, the result should look like this if I get the latest policy_create_date that was in existence by the time of the event (enevt_date).
| id | session | policy_create_date |
|---|---|---|
| 1 | A | 2021-01-01 |
| 1 | B | 2021-12-01 |
| 2 | A | 2021-02-01 |
The following solution would provide the result I want, but it create a N:N JOIN and can become quite big and calculation intense, if both tables get large (especially if I have many of the same events and many policy changes). Hence, I'm looking for a solution that is more efficient than the solution below and avoids the N:N JOIN.
SELECT
e.id,
e.session,
MAX(p.create_date) AS policy_create_date -- get latest policy amongst all policies for an event_id that existed before the session took place
FROM events e
INNER JOIN policies p
ON e.id = p.event_id -- match event and policy based on event_id
AND p.create_date < e.event_date -- match only policies that existed before the session of the event took place
GROUP BY 1, 2
TY!!!
Edit: I adjusted the known but inefficient solution to better reflect my goal. Of course, I want the policy in the end, but that is not in focus here.
Solution 1:[1]
You can try the window function
WITH cte AS (
SELECT e.id, e.session, p.policy
, row_number() over(partition by e.id, e.session order by p.create_date desc) rn
FROM events e
INNER JOIN policies p
ON e.id = p.event_id AND p.create_date < e.event_date
)
SELECT c.id, c.session, c.policy
FROM cte c
where rn=1
Solution 2:[2]
I have tried the following code on Postgres, but there shouldn't be anything in there that is postgres specific.
Your query can be reorganised using a subquery to:
SELECT
e.id,
e.session,
(SELECT MAX(create_date) FROM policies AS p WHERE e.id = p.event_id AND p.create_date < e.event_date) AS policy_create_date
FROM events e
WHERE policy_create_date IS NOT NULL
While this query should show similar performance it makes it easier to spot the problem with the overall query: While finding the MAX the database has already found and read the row from policies with the highest date, but you are not getting the the value of the policy column out. So, you need to do a second join.
Using a lateral join you can get the complete relevant row from policies in one go.
SELECT
e.id,
e.session,
p2.policy,
p2.create_date
FROM events AS e
INNER JOIN LATERAL
(SELECT
*
FROM policies AS p
WHERE e.id = p.event_id AND p.create_date < e.event_date
ORDER BY p.create_date DESC
LIMIT 1) AS p2
ON TRUE;
This should use an index on policies. So, time should increase linear with size of events and logarithmic with size of policies.
Nevertheles, you can't expect great performance when you do this for large resultsets, because there will be lots of cache-misses while accessing the policies table.
Solution 3:[3]
Another option is to interleave the two tables, then use LAST_VALUE() to look back to find the policy data...
WITH
interleave AS
(
SELECT
id AS event_id,
event_date AS event_date,
session AS event_session,
NULL AS policy_label,
NULL AS policy_date
FROM
events
UNION ALL
SELECT
event_id,
create_date,
NULL,
policy,
create_date
FROM
policies
),
lookback AS
(
SELECT
event_id,
event_session,
event_date,
LAST_VALUE(policy_label IGNORE NULLS) OVER event_order AS policy_label,
LAST_VALUE(policy_date IGNORE NULLS) OVER event_order AS policy_date
FROM
interleave
WINDOW
event_order AS (
PARTITION BY event_id
ORDER BY event_date,
event_session NULLS FIRST
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING
)
)
SELECT
event_id,
event_session,
event_date,
policy_label,
policy_date
FROM
lookback
WHERE
event_session IS NOT NULL
This presumes that the events table is vastly larger than the policies table.
I'd also recommend ensuring the tables are partitioned by the event_id and clustered by their respective date column.
Solution 4:[4]
Another option is to use LEAD() to find a policy's "expiry" date, then use that in the join...
WITH
policy_range AS
(
SELECT
event_id,
policy,
create_date,
LEAD(create_date, 1, DATE '9999-12-31') OVER event_order AS expiry_date
FROM
policies
WINDOW
event_order AS (
PARTITION BY event_id
ORDER BY create_date
)
)
SELECT
e.id,
e.session,
e.event_date,
p.policy,
p.create_date
FROM
policy_range AS p
INNER JOIN
events AS e
ON e.id = p.event_id
AND e.event_date > p.create_date
AND e.event_date <= p.expiry_date
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Serg |
| Solution 2 | Hinni |
| Solution 3 | MatBailie |
| Solution 4 |
