BigQuery query performance when using STARTS_WITH() on a table of 12 million rows
I have a table, company_totals, with the following schema:
| column_name | column_data_type |
|---|---|
| company | STRING |
| link | STRING |
| full_count | FLOAT |
| starts_with_count | FLOAT |
Number of rows = 12,000,000. Table size = 1.6 GB. CLUSTERED BY = company, link. SEARCH INDEX created on column = link.
I have the following SELECT statement, which runs for more than 6 hours and then times out with this error: "Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit."
SELECT first_table.company, first_table.link, NULL AS full_count, SUM(second_table.full_count) AS starts_with_count
FROM company_totals first_table, company_totals second_table
WHERE STARTS_WITH(second_table.link, first_table.link)
GROUP BY first_table.company, first_table.link
The query above computes the starts_with_count column: for each row, it is the sum of full_count over every row whose link starts with that row's link (the row itself included). This is the column I want to fill in company_totals; the other column values are already present. In the table below I have entered the expected starts_with_count values by hand to show the result I am after.
| company | link | full_count | starts_with_count (expected) |
|---|---|---|---|
| abc | http://www.abc.net1 | 1 | 15 (= sum (full_count) where link like 'http://www.abc.net1%') |
| abc | http://www.abc.net1/page1 | 2 | 9 (= sum (full_count) where link like 'http://www.abc.net1/page1%') |
| abc | http://www.abc.net1/page1/folder1 | 3 | 3 (= sum (full_count) where link like 'http://www.abc.net1/page1/folder1%') |
| abc | http://www.abc.net1/page1/folder2 | 4 | 4 |
| abc | http://www.abc.net1/page2 | 5 | 5 |
| xyz | http://www.xyz.net1/ | 6 | 21 |
| xyz | http://www.xyz.net1/page1/ | 7 | 15 |
| xyz | http://www.xyz.net1/page1/file1 | 8 | 8 |
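To pin down the expected values above, here is a minimal Python sketch of the same computation outside BigQuery. It is not a BigQuery solution, and the function name `starts_with_totals` is hypothetical. It assumes the rows fit in memory and, like the query, sums over all rows regardless of company. It relies on the fact that, after sorting the links, every link that starts with a given prefix sits in one contiguous run right after that prefix, so a single pass with a stack can replace the self-join.

```python
from collections import defaultdict


def starts_with_totals(rows):
    """rows: list of (company, link, full_count) tuples.

    Returns {(company, link): starts_with_count}, where starts_with_count
    is the sum of full_count over all rows whose link starts with `link`.
    """
    # Collapse duplicate links first so each distinct link appears once.
    own = defaultdict(float)
    for _company, link, full_count in rows:
        own[link] += full_count

    totals = {}
    stack = []  # chain of open prefixes; each entry is [link, running_total]

    def close_top():
        # Finalize the innermost prefix and roll its total into its parent.
        link, total = stack.pop()
        totals[link] = total
        if stack:
            stack[-1][1] += total

    for link in sorted(own):
        # Close every open prefix that the current link does not start with.
        while stack and not link.startswith(stack[-1][0]):
            close_top()
        stack.append([link, own[link]])

    while stack:
        close_top()

    return {(company, link): totals[link] for company, link, _ in rows}
```

Called on the eight sample rows, this should reproduce the starts_with_count column of the table, for example 15 for http://www.abc.net1 and 21 for http://www.xyz.net1/. The cost is dominated by the sort rather than by comparing every pair of rows.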
I would highly appreciate any help with this issue.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
