'Moving Distinct Count in Big Query (SQL syntax)

I have sort of a weird one for all the SQL masters out there. I need to get the distinct count of items in a moving window of 14 days. I tried dense_rank but it didn't specify (or I did not know how to) specify the 14 day moving window.

For simplicity my dataset has 3 columns.

  1. store (string)
  2. item code (string)
  3. date (date)

A quick example of my endo goal would be the following:

  • Day 1 I scan items 1,2,3,4
  • Day 2 I scan items 2,3,4,5
  • Day 3 I scan items 1,6

So then for day 1 my uniques would be 4, Day 2 my uniques would be 5 and day 3 my uniques would be 6 (1,2,3,4,5,6)

Once I get to day 15 I would ignore values found in day 1 and only take days 2-15

Any and all help would be greatly appreciated.



Solution 1:[1]

Another option to consider - with use of HyperLogLog++ functions - so it consumes less resources and faster

select store, date, 
  ( select hll_count.merge(sketch)
    from t.sketches_14days sketch 
  ) distinct_items_count
from (
  select store, date, 
    array_agg(daily_sketch) over(partition by store order by unix_date(date) range between 13 preceding and current row) sketches_14days
  from (
    select store, date, hll_count.init(item_code) daily_sketch
    from your_table
    group by store, date
  )
) t      

Note:

HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mikhail Berlyant