'How to query INFORMATION_SCHEMA view using spark bq connector?

I'm trying to identify partitions which got updated from a BQ table using the below query:

select * from PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS where 
table_name='TABLE-NAME' and 
extract(date from last_modified_time)='TODAY-DATE'

This is working fine from the BQ console. However when I use the same query from spark-bq connector it's failing.

spark.read.format("bigquery").load("PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS")

Error:

"Invalid project ID PROJECT-ID. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash."

I tried multiple combinations like by adding ` after PROJECT-ID but the API is still throwing 400 error.

What is the right way to query the INFORMATION_SCHEMA from spark-bq connector?



Solution 1:[1]

Setting the project is as parentProject is solving the issue.

spark.read
.format("bigquery")
.option('parentProject', project_id)

Solution 2:[2]

INFORMATION_SCHEMA is not a standard dataset in BigQuery, and as such is not available via the BigQuery Storage API used by the spark-bigquery connector. However, you can query it and load the data into a dataframe in the following manner:

spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")
val tablesDF = spark.read.format("bigquery").load("select * from `<projectId>.<dataset>.__TABLES__`")

Solution 3:[3]

table = "INFORMATION_SCHEMA.TABLES"

sql = f"""SELECT * FROM {project_id}.{dataset}.{table} """

return ( spark. read. format('bigquery'). load(sql) )

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 SANN3
Solution 2 David Rabinowitz
Solution 3 Kev Pinto