'Bigquery Rolling Count Distinct
I'd like to have a rolling count for daily visitors
Example.
| date | visitor |
|---|---|
| 2022-02-01 | A |
| 2022-02-01 | B |
| 2022-02-01 | C |
| 2022-02-02 | D |
| 2022-02-02 | E |
| 2022-02-03 | C |
| 2022-02-03 | F |
I want the output to be:
| date | count_visitor |
|---|---|
| 2022-02-01 | 3 (ABC) |
| 2022-02-02 | 5 (DE) |
| 2022-02-03 | 6 (CF) |
I can't seem to find the query for this. Kindly need your help/
Solution 1:[1]
Consider below approach
select date, daily_count, daily_visitors_list,
( select count(distinct visitor)
from unnest(split(rolled_visitors)) visitor
) rolled_count
from (
select date, daily_count, daily_visitors_list,
string_agg(daily_visitors_list) over(order by unix_date(date)) rolled_visitors
from (
select date, count(distinct visitor) daily_count,
string_agg(distinct visitor) daily_visitors_list,
from your_table
group by date
)
) t
if applied to sample data in your question - output is
If you have heavy amount of data - above approach might end up with resource related error - in this case you can use below approach - it is using HyperLogLog++ functions
select date, daily_count, daily_visitors_list,
( select hll_count.merge(sketch)
from t.rolling_visitors_sketches sketch
) rolled_count
from (
select date, daily_count, daily_visitors_list,
array_agg(daily_visitors_sketch) over(order by unix_date(date)) rolling_visitors_sketches
from (
select date, count(distinct visitor) daily_count,
string_agg(distinct visitor) daily_visitors_list,
hll_count.init(visitor) daily_visitors_sketch
from your_table
group by date
)
) t
HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mikhail Berlyant |

