'Bigquery query to get sum of values of one column based on another column
I want to write a bigquery query to get values of a column which is the sum of values of another column, based on a "like" condition. In the below table column starts_with_count is what I want to fill. I have added the expected values for this column manually to show my expectation. Other column values are already present. The starts_with_count value is sum (full_count) where its link appears in other rows.
| company | link | full_count | starts_with_count (expected) |
|---|---|---|---|
| abc | http://www.abc.net1 | 1 | 15 (= sum (full_count) where link like 'http://www.abc.net1%') |
| abc | http://www.abc.net1/page1 | 2 | 9 (= sum (full_count) where link like 'http://www.abc.net1/page1%') |
| abc | http://www.abc.net1/page1/folder1 | 3 | 3 (= sum (full_count) where link like 'http://www.abc.net1/page1/folder1%') |
| abc | http://www.abc.net1/page1/folder2 | 4 | 4 |
| abc | http://www.abc.net1/page2 | 5 | 5 |
| xyz | http://www.xyz.net1/ | 6 | 21 |
| xyz | http://www.xyz.net1/page1/ | 7 | 15 |
| xyz | http://www.xyz.net1/page1/file1 | 8 | 8 |
Solution 1:[1]
Try this:
WITH sample AS (
SELECT * FROM UNNEST([
STRUCT('abc' AS company, 'http://www.abc.net1' AS link, 1 AS full_count),
('abc', 'http://www.abc.net1/page1', 2),
('abc', 'http://www.abc.net1/page1/folder1', 3),
('abc', 'http://www.abc.net1/page1/folder2', 4),
('abc', 'http://www.abc.net1/page2', 5),
('xyz', 'http://www.xyz.net1/', 6),
('xyz', 'http://www.xyz.net1/page1/', 7),
('xyz', 'http://www.xyz.net1/page1/file1', 8)
])
)
SELECT first.company, first.link, SUM(second.full_count) AS starts_with_count
FROM sample first, sample second
WHERE STARTS_WITH(second.link, first.link)
GROUP BY 1, 2
;
output:
Solution 2:[2]
Another option
select * except(links),
( select sum(full_count)
from t.links
where starts_with(link, t.link)
) starts_with_count
from (
select *,
array_agg(struct(link, full_count)) over(partition by company) links
from your_table
) t
if applied to sample data in your question - output is
For the simple/dummy example you provided - performance improvement is significant!
Which one to use really depends on your real data!
To analyze - use EXECUTION DETAILS tab
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |



