BigQuery query performance when using STARTS_WITH() on a table of 12 million rows

I have a table, company_totals, with the following schema:

column_name column_data_type
company STRING
link STRING
full_count FLOAT
starts_with_count FLOAT

Number of rows = 12,000,000. Table size = 1.6 GB. CLUSTERED BY = company, link. SEARCH INDEX created on column = link.

I have the following SELECT statement, which runs for more than 6 hours and then times out with this error: Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit.

SELECT first_table.company, first_table.link, NULL AS full_count, SUM(second_table.full_count) AS starts_with_count
FROM company_totals first_table, company_totals second_table
WHERE STARTS_WITH(second_table.link, first_table.link)
GROUP BY first_table.company, first_table.link

The above query calculates starts_with_count, which for each row is the sum of full_count over every row whose link starts with that row's link (the row itself included). This is the column I want to fill in the company_totals table; the other column values are already present. In the sample rows that follow, I have entered the expected starts_with_count values by hand to show what I expect.

company link full_count starts_with_count (expected)
abc http://www.abc.net1 1 15 (= sum (full_count) where link like 'http://www.abc.net1%')
abc http://www.abc.net1/page1 2 9 (= sum (full_count) where link like 'http://www.abc.net1/page1%')
abc http://www.abc.net1/page1/folder1 3 3 (= sum (full_count) where link like 'http://www.abc.net1/page1/folder1%')
abc http://www.abc.net1/page1/folder2 4 4
abc http://www.abc.net1/page2 5 5
xyz http://www.xyz.net1/ 6 21
xyz http://www.xyz.net1/page1/ 7 15
xyz http://www.xyz.net1/page1/file1 8 8
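
To make the expected values easy to check, here is a minimal sketch that runs the same self-join on just the eight sample rows. The CTE name sample_totals is hypothetical and stands in for company_totals; the rows are copied from the sample above. It only illustrates the definition of starts_with_count on a tiny input and does nothing about the performance problem.

```sql
-- sample_totals is a hypothetical stand-in for company_totals,
-- holding only the eight sample rows from the question.
WITH sample_totals AS (
  SELECT * FROM UNNEST([
    STRUCT('abc' AS company, 'http://www.abc.net1' AS link, 1.0 AS full_count),
    ('abc', 'http://www.abc.net1/page1', 2.0),
    ('abc', 'http://www.abc.net1/page1/folder1', 3.0),
    ('abc', 'http://www.abc.net1/page1/folder2', 4.0),
    ('abc', 'http://www.abc.net1/page2', 5.0),
    ('xyz', 'http://www.xyz.net1/', 6.0),
    ('xyz', 'http://www.xyz.net1/page1/', 7.0),
    ('xyz', 'http://www.xyz.net1/page1/file1', 8.0)
  ])
)
-- Same join as the original query: every row (first_table) is paired with
-- every row (second_table) whose link begins with first_table.link.
SELECT
  first_table.company,
  first_table.link,
  SUM(second_table.full_count) AS starts_with_count
FROM sample_totals first_table, sample_totals second_table
WHERE STARTS_WITH(second_table.link, first_table.link)
GROUP BY first_table.company, first_table.link
ORDER BY first_table.company, first_table.link
```

On these eight rows the query should return the hand-entered values: 15, 9, 3, 4, 5 for the abc links and 21, 15, 8 for the xyz links. On the full table the same join has to compare every pair of the 12,000,000 rows, which is why it does not finish.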

I would highly appreciate any help with this issue.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow