BigQuery: Propagate partition filter on alias of _PARTITIONTIME
I'm having an issue with partitioned tables and the propagation of the partition field.
with base_table as (
--Consumes about 3 MB
SELECT id, count(*) FROM `project.dataset.base_table` Group by id
)
,table1 as (
--With WHERE partition_date="2022-04-26", consumes 600.5 MB
SELECT partition_date, id, count(*) as Amount1 FROM `project.dataset.view1` Group by partition_date,id
)
,table2 as (
--With WHERE partition_date="2022-04-26", consumes 33.5 MB
--Without partition filter, consumes 15 GB
SELECT partition_date, id, count(*) as amount2 FROM `project.dataset.view2` Group by partition_date,id
)
Select
bt.id,t1.amount1, t2.amount2
FROM base_table bt
LEFT JOIN table1 t1 ON
bt.id=t1.id
LEFT JOIN table2 t2 ON
bt.id=t2.id AND
t1.partition_date = t2.partition_date
WHERE bt.id IS NOT NULL and t1.partition_date="2022-04-26"
This query consumes about 15.1 GB. But if I run the same query with the following filter added:
and t2.partition_date="2022-04-26"
Then the query consumes about 636 MB.
So what I take from this is that the partition filter is not being propagated through the join.
Note: The views look something like this:
SELECT *, DATE(_PARTITIONTIME) AS PARTITION_DATE FROM `project.dataset.table1` WHERE DATE(_PARTITIONTIME) >= "2021-01-01"
For security reasons, I have no direct access to the tables.
Is there anything I can do to avoid writing the partition filter multiple times? (The original query has 15+ partitioned tables.)
Solution 1:[1]
I don't think there is anything you can do to avoid writing the partition filter multiple times. If the partition is not specified before the join, all the partitions need to be scanned to find the rows that match the join condition.
You can keep the partition filter value in a single place, so that changing the date filter is easier:
with partition_filter as (
--DATE literal, so the comparison with partition_date (a DATE) is valid
select DATE '2022-04-26' as start_date
)
,base_table as (
--Consumes about 3 MB
SELECT id, count(*) FROM `project.dataset.base_table` Group by id
)
,table1 as (
--With partition_date="2022-04-26", consumes 600.5 MB
SELECT partition_date, id, count(*) as Amount1
FROM `project.dataset.view1`, partition_filter
WHERE partition_date = partition_filter.start_date
Group by partition_date,id
)
,table2 as (
--With partition_date="2022-04-26", consumes 33.5 MB
--Without partition filter, consumes 15 GB
SELECT partition_date, id, count(*) as amount2
FROM `project.dataset.view2`, partition_filter
WHERE partition_date = partition_filter.start_date
Group by partition_date,id
)
Select
bt.id,t1.amount1, t2.amount2
FROM base_table bt
LEFT JOIN table1 t1 ON
bt.id=t1.id
LEFT JOIN table2 t2 ON
bt.id=t2.id AND
t1.partition_date = t2.partition_date
WHERE bt.id IS NOT NULL
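One caveat with the query above: BigQuery documents that partitions are pruned only when the filter is a constant expression, so a date read from another CTE may still lead to a full scan. A possible variation, as a rough sketch only, is to hold the date in a scripting variable instead. This assumes multi-statement queries (scripting) are allowed in your project; `target_date` and `row_count` are hypothetical names that are not in the original query. I have not measured this against your views, so check the bytes processed after running it.

```sql
-- Sketch only: assumes BigQuery scripting is available.
-- target_date is a hypothetical variable name; change the date in one place.
DECLARE target_date DATE DEFAULT DATE '2022-04-26';

WITH base_table AS (
  SELECT id, COUNT(*) AS row_count
  FROM `project.dataset.base_table`
  GROUP BY id
),
table1 AS (
  -- The filter is applied inside the CTE, before the join.
  SELECT partition_date, id, COUNT(*) AS amount1
  FROM `project.dataset.view1`
  WHERE partition_date = target_date
  GROUP BY partition_date, id
),
table2 AS (
  SELECT partition_date, id, COUNT(*) AS amount2
  FROM `project.dataset.view2`
  WHERE partition_date = target_date
  GROUP BY partition_date, id
)
SELECT bt.id, t1.amount1, t2.amount2
FROM base_table AS bt
LEFT JOIN table1 AS t1
  ON bt.id = t1.id
LEFT JOIN table2 AS t2
  ON bt.id = t2.id
  AND t1.partition_date = t2.partition_date
WHERE bt.id IS NOT NULL;
```

The filter still appears once per partitioned view, but the value itself lives in a single `DECLARE`, so with 15+ tables only one line changes per run.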
Solution 2:[2]
I think this could work
for f in *read1.fastq.gz; do echo $f;zcat $f|wc -l ; done > read_count.txt
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Damião Martins |
| Solution 2 | Lino_ares |
