'Bigquery Rolling Count Distinct

I'd like to have a rolling count for daily visitors

Example.

date visitor
2022-02-01 A
2022-02-01 B
2022-02-01 C
2022-02-02 D
2022-02-02 E
2022-02-03 C
2022-02-03 F

I want the output to be:

date count_visitor
2022-02-01 3 (ABC)
2022-02-02 5 (DE)
2022-02-03 6 (CF)

I can't seem to find the query for this. Kindly need your help/



Solution 1:[1]

Consider below approach

select date, daily_count, daily_visitors_list, 
  ( select count(distinct visitor)
    from unnest(split(rolled_visitors)) visitor
  ) rolled_count
from (
  select date, daily_count, daily_visitors_list, 
    string_agg(daily_visitors_list) over(order by unix_date(date)) rolled_visitors
  from (
    select date, count(distinct visitor) daily_count, 
      string_agg(distinct visitor) daily_visitors_list, 
    from your_table
    group by date
  )
) t      

if applied to sample data in your question - output is

enter image description here

If you have heavy amount of data - above approach might end up with resource related error - in this case you can use below approach - it is using HyperLogLog++ functions

select date, daily_count, daily_visitors_list,
  ( select hll_count.merge(sketch)
    from t.rolling_visitors_sketches sketch 
  ) rolled_count
from (
  select date, daily_count, daily_visitors_list,
    array_agg(daily_visitors_sketch) over(order by unix_date(date)) rolling_visitors_sketches
  from (
    select date, count(distinct visitor) daily_count, 
      string_agg(distinct visitor) daily_visitors_list, 
      hll_count.init(visitor) daily_visitors_sketch
    from your_table
    group by date
  )
) t      

HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mikhail Berlyant