Google BigQuery: Query exceeded resource limits
I'm setting up a crude data warehouse for my company. I've successfully pulled contact, company, deal, and association data from our CRM into BigQuery, but when I join these together into a master table for analysis via our BI platform, I continually get the error:
Query exceeded resource limits. This query used 22602 CPU seconds but would charge only 40M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 22602 CPU seconds were used, and this query must use less than 10200 CPU seconds.
As such, I'm looking to optimise my query. I've already removed all GROUP BY and ORDER BY clauses, and have tried using WHERE clauses to do additional filtering, but this seems illogical to me as it would add processing demands.
My current query is:
SELECT
  coy.company_id,
  cont.contact_id,
  deals.deal_id,
  {another 52 fields}
FROM `{contacts}` AS cont
LEFT JOIN `{assoc-contact}` AS ac
  ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy
  ON CAST(ac.from_id AS INT64) = coy.company_id
LEFT JOIN `{assoc-deal}` AS ad
  ON coy.company_id = CAST(ad.from_id AS INT64)
LEFT JOIN `{deals}` AS deals
  ON ad.to_id = deals.deal_id;
FYI, {assoc-contact} and {assoc-deal} are both separate views I created from the associations table to make it easier to associate contacts and deals with the companies table.
It should also be noted that this query has occasionally run successfully, so I know it works; it just fails about 90% of the time because the query is so big.
Solution 1:
TL;DR
Check your join keys. 99% of the time, the cause of the problem is a combinatorial explosion.
I can't know for sure since I don't have access to the underlying tables' data, but I will give a general investigation method which, in my experience, has worked every time to find the root cause.
Long Answer
Investigation method
Say you are joining two tables:
SELECT
  cols
FROM L
JOIN R ON L.c1 = R.c1 AND L.c2 = R.c2
and you run into this error. The first thing you should do is check for duplicates in both tables:
SELECT
  c1, c2, COUNT(1) AS nb
FROM L
GROUP BY c1, c2
ORDER BY nb DESC
Do the same thing for each table involved in the join.
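For the question's own tables, that check on one of the association views might look like this (a sketch; the backticked names are the question's placeholders):
SELECT
  to_id, COUNT(1) AS nb
FROM `{assoc-contact}`
GROUP BY to_id
ORDER BY nb DESC
LIMIT 20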
I bet you will find that your join keys are duplicated. BigQuery is very scalable, so in my experience this error happens when you have a join key that repeats more than 100,000 times in both tables. That means that after your join, you will have 100,000^2 = 10 billion rows!
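You can also confirm the blow-up before paying for the full join by estimating its output size from the per-key counts alone. This sketch reuses the generic L and R tables from above; for an inner join, the sum of the per-key count products is exactly the number of rows the join would produce:
SELECT
  -- Each matching key pair contributes (rows in L) * (rows in R) output rows.
  SUM(lhs.nb * rhs.nb) AS estimated_join_rows
FROM (SELECT c1, c2, COUNT(1) AS nb FROM L GROUP BY c1, c2) AS lhs
JOIN (SELECT c1, c2, COUNT(1) AS nb FROM R GROUP BY c1, c2) AS rhs
  ON lhs.c1 = rhs.c1 AND lhs.c2 = rhs.c2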
Why BigQuery gives this error
In my experience, this error message means that your query does too much computation compared to the size of its inputs. No wonder you're getting this if you end up with 10 billion rows after joining tables with a few million rows each.
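If the duplicates turn out to be unintentional, the usual fix (an assumption on my part, since I don't know what the association views should contain) is to collapse each view to one row per key pair before joining, for example:
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    -- Without an ORDER BY inside OVER(), which row survives is arbitrary.
    ROW_NUMBER() OVER (PARTITION BY from_id, to_id) AS rn
  FROM `{assoc-contact}`
)
WHERE rn = 1
If you don't need any columns beyond the key pair itself, SELECT DISTINCT from_id, to_id does the same job more simply.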
BigQuery's on-demand pricing model is based on the amount of data read from your tables. This means that people could try to abuse it by, say, running CPU-intensive computations while reading only small datasets. To give an extreme example, imagine someone wrote a JavaScript UDF to mine bitcoin and ran it on BigQuery:
SELECT MINE_BITCOIN_UDF()
The query would be billed $0 because it reads nothing, but it would consume hours of Google's CPU time. Of course, Google had to do something about this.
So this ratio exists to make sure that users don't do anything sketchy by burning hours of CPU while processing only a few MB of input.
Other MPP platforms with a different pricing model (e.g. Azure Synapse, which charges for the bytes processed rather than, like BigQuery, the bytes read) would perhaps have run the query without complaining, and then billed you for 10 TB of processing on that 40 MB table.
P.S.: Sorry for the late and long answer; it's probably too late for the person who asked, but hopefully it will help whoever runs into this error.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | FurryMachine |
