'Retrieving BigQuery validation errors when loading JSONL data via the Python API
How can I retrieve more information relating to the validation errors when loading a JSONL file into BigQuery? (The question is not about solving the issue)
Example code:
from google.cloud.bigquery import (
LoadJobConfig,
QueryJobConfig,
Client,
SourceFormat,
WriteDisposition
)
# variables depending on the environment
filename = '...'
gcp_project_id = '...'
dataset_name = '...'
table_name = '...'
schema = [ ... ]
# loading data
client = Client(project=project_id)
dataset_ref = client.dataset(dataset_name)
table_ref = dataset_ref.table(table_name)
job_config = LoadJobConfig()
job_config.source_format = SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = WriteDisposition.WRITE_APPEND
job_config.schema = schema
LOGGER.info('loading from %s', filename)
with open(filename, "rb") as source_file:
job = client.load_table_from_file(
source_file, destination=table_ref, job_config=job_config
)
# Waits for table cloud_data_store to complete
job.result()
Here I am using bigquery-schema-generator to generate a schema (as BigQuery otherwise only looks at the first 100 rows).
Running that might error with the following error message (google.api_core.exceptions.BadRequest
):
400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Looking at the errors
property basically doesn't provide any new information:
[{'reason': 'invalid',
'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}]
I also looked at __dict__
of the exception but that hasn't revealed any further information.
Trying to load the table using the bq
command line (in this case without explicit schema) results in a much more helpful message:
BigQuery error in load operation: Error processing job '...': Provided Schema does not match Table <table name>. Field <field name> has changed type from TIMESTAMP to DATE
My question now is how would I be able to retrieve such helpful message from the Python API?
Solution based on accepted answer
Here is a copy and past workaround that one could add in order to show more information by default. (There may be downsides to it)
import google.cloud.exceptions
import google.cloud.bigquery.job
def get_improved_bad_request_exception(
job: google.cloud.bigquery.job.LoadJob
) -> google.cloud.exceptions.BadRequest:
errors = job.errors
result = google.cloud.exceptions.BadRequest(
'; '.join([error['message'] for error in errors]),
errors=errors
)
result._job = job
return result
def wait_for_load_job(
job: google.cloud.bigquery.job.LoadJob
):
try:
job.result()
except google.cloud.exceptions.BadRequest as exc:
raise get_improved_bad_request_exception(job) from exc
Then calling wait_for_load_job(job)
instead of job.result()
directly, will result in a more useful exception (the error message and errors
property).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|