'Efficiently join latest entry of first table to second table depending on entity characteristics of first table

Looking for efficient solution to join two tables but with the caveat that characteristics of second table should determine what is joined to first table (in Google BigQuery).

Lets say I have two tables. One Table with events (id, session, event_date) and a second with policies applying to events (event_id, policy, create_date) and I want to determine which policy applied to an event based on the policy create date and the event date.

CREATE TEMP TABLE events AS (
    SELECT *
    FROM UNNEST([
        STRUCT(1 AS id, "A" AS session, "2021-11-05" AS event_date),
        (1, "B", "2021-12-17"),
        (2, "A", "2021-08-13")
    ])
);
CREATE TEMP TABLE policies AS (
SELECT *
    FROM UNNEST([
        STRUCT(1 AS event_id, "foo" AS policy, "2021-01-01" AS create_date),
        (1, "bar", "2021-12-01"),
        (2, "foo", "2021-02-01")
    ])
)

In my example, the result should look like this if I get the latest policy_create_date that was in existence by the time of the event (enevt_date).

id session policy_create_date
1 A 2021-01-01
1 B 2021-12-01
2 A 2021-02-01

The following solution would provide the result I want, but it create a N:N JOIN and can become quite big and calculation intense, if both tables get large (especially if I have many of the same events and many policy changes). Hence, I'm looking for a solution that is more efficient than the solution below and avoids the N:N JOIN.

SELECT
  e.id,
  e.session, 
  MAX(p.create_date) AS policy_create_date -- get latest policy amongst all policies for an event_id that existed before the session took place
FROM events e
INNER JOIN policies p
  ON e.id = p.event_id -- match event and policy based on event_id
  AND p.create_date < e.event_date -- match only policies that existed before the session of the event took place
GROUP BY 1, 2

TY!!!

Edit: I adjusted the known but inefficient solution to better reflect my goal. Of course, I want the policy in the end, but that is not in focus here.



Solution 1:[1]

You can try the window function

WITH cte AS (
  SELECT e.id, e.session, p.policy
   , row_number() over(partition by e.id, e.session order by p.create_date desc) rn
  FROM events e
  INNER JOIN policies p
  ON e.id = p.event_id AND p.create_date < e.event_date
)
SELECT c.id, c.session, c.policy
FROM cte c
where rn=1

Solution 2:[2]

I have tried the following code on Postgres, but there shouldn't be anything in there that is postgres specific.

Your query can be reorganised using a subquery to:

SELECT
  e.id,
  e.session, 
  (SELECT MAX(create_date) FROM policies AS p WHERE e.id = p.event_id AND p.create_date < e.event_date) AS policy_create_date
FROM events e
WHERE policy_create_date IS NOT NULL

While this query should show similar performance it makes it easier to spot the problem with the overall query: While finding the MAX the database has already found and read the row from policies with the highest date, but you are not getting the the value of the policy column out. So, you need to do a second join.

Using a lateral join you can get the complete relevant row from policies in one go.

SELECT
  e.id,
  e.session,
  p2.policy,
  p2.create_date
FROM events AS e
INNER JOIN LATERAL
   (SELECT
     * 
    FROM policies AS p 
    WHERE e.id = p.event_id AND p.create_date < e.event_date 
    ORDER BY p.create_date DESC
    LIMIT 1) AS p2 
ON TRUE;

This should use an index on policies. So, time should increase linear with size of events and logarithmic with size of policies.

Nevertheles, you can't expect great performance when you do this for large resultsets, because there will be lots of cache-misses while accessing the policies table.

Solution 3:[3]

Another option is to interleave the two tables, then use LAST_VALUE() to look back to find the policy data...

WITH
  interleave AS
(
  SELECT
    id          AS event_id,
    event_date  AS event_date,
    session     AS event_session,
    NULL        AS policy_label,
    NULL        AS policy_date
  FROM
    events

  UNION ALL

  SELECT
    event_id,
    create_date,
    NULL,
    policy,
    create_date
  FROM
    policies
),
  lookback AS
(
  SELECT
    event_id,
    event_session,
    event_date,
    LAST_VALUE(policy_label IGNORE NULLS) OVER event_order   AS policy_label,
    LAST_VALUE(policy_date  IGNORE NULLS) OVER event_order   AS policy_date
  FROM
    interleave
  WINDOW
    event_order AS (
      PARTITION BY event_id
          ORDER BY event_date,
                   event_session NULLS FIRST
      ROWS BETWEEN UNBOUNDED PRECEDING
               AND         1 PRECEDING
    )
)
SELECT
  event_id,
  event_session,
  event_date,
  policy_label,
  policy_date
FROM
  lookback
WHERE
  event_session IS NOT NULL

This presumes that the events table is vastly larger than the policies table.

I'd also recommend ensuring the tables are partitioned by the event_id and clustered by their respective date column.

Solution 4:[4]

Another option is to use LEAD() to find a policy's "expiry" date, then use that in the join...

WITH
  policy_range AS
(
  SELECT
    event_id,
    policy,
    create_date,
    LEAD(create_date, 1, DATE '9999-12-31') OVER event_order  AS expiry_date
  FROM
    policies
  WINDOW
    event_order AS (
      PARTITION BY event_id
          ORDER BY create_date
    )
)
SELECT
  e.id,
  e.session,
  e.event_date,
  p.policy,
  p.create_date
FROM
  policy_range  AS p
INNER JOIN
  events        AS e
    ON  e.id          = p.event_id
    AND e.event_date >  p.create_date
    AND e.event_date <= p.expiry_date

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Serg
Solution 2 Hinni
Solution 3 MatBailie
Solution 4