'How to query INFORMATION_SCHEMA view using spark bq connector?
I'm trying to identify partitions which got updated from a BQ table using the below query:
select * from PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS where
table_name='TABLE-NAME' and
extract(date from last_modified_time)='TODAY-DATE'
This is working fine from the BQ console. However when I use the same query from spark-bq connector it's failing.
spark.read.format("bigquery").load("PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS")
Error:
"Invalid project ID PROJECT-ID. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash."
I tried multiple combinations like by adding ` after PROJECT-ID but the API is still throwing 400 error.
What is the right way to query the INFORMATION_SCHEMA from spark-bq connector?
Solution 1:[1]
Setting the project is as parentProject is solving the issue.
spark.read
.format("bigquery")
.option('parentProject', project_id)
Solution 2:[2]
INFORMATION_SCHEMA is not a standard dataset in BigQuery, and as such is not available via the BigQuery Storage API used by the spark-bigquery connector. However, you can query it and load the data into a dataframe in the following manner:
spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")
val tablesDF = spark.read.format("bigquery").load("select * from `<projectId>.<dataset>.__TABLES__`")
Solution 3:[3]
table = "INFORMATION_SCHEMA.TABLES"
sql = f"""SELECT * FROM {project_id}.{dataset}.{table} """
return ( spark. read. format('bigquery'). load(sql) )
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SANN3 |
| Solution 2 | David Rabinowitz |
| Solution 3 | Kev Pinto |
