Google BigQuery unexpected behaviour with the ignoreUnknownValues option on a load job (receiving an error for an extra column)
I'm using Airflow to trigger a load job in Google BigQuery. The source files consist of multiple NDJSON files.
Here is the Airflow operator (probably not relevant, but given for context):
load = GoogleCloudStorageToBigQueryOperator(
    task_id="load",
    bigquery_conn_id="bigquery_default",
    pool="bigquery_insert",
    destination_project_dataset_table="<HIDDEN>",
    bucket="<HIDDEN>",
    source_objects=list_files(),
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_APPEND",
    autodetect=True,
    ignore_unknown_values=True,
)
To check that it's not Airflow's fault, I debugged and extracted the exact payload that Airflow is sending to the Google BigQuery REST API:
{
    "configuration": {
        "load": {
            "autodetect": True,
            "createDisposition": "CREATE_IF_NEEDED",
            "destinationTable": {
                "projectId": "<PRIVATE>",
                "datasetId": "<PRIVATE>",
                "tableId": "<PRIVATE>"
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "sourceUris": [
                "<PRIVATE>"
            ],
            "writeDisposition": "WRITE_APPEND",
            "ignoreUnknownValues": True
        }
    }
}
Since I'm setting the ignoreUnknownValues option (documentation), I would expect JSON fields that are present in my source files but not in my destination schema to be ignored. Instead, I'm getting the following error back from BigQuery:
Exception: BigQuery job failed. Final error was: {'reason': 'invalid', 'message': 'Provided Schema does not match Table [PRIVATE]. Cannot add fields (field: source_fingerprint)'}. The job was: {'kind': 'bigquery#job', 'etag': '[PRIVATE]', 'id': '[PRIVATE]', 'selfLink': '[PRIVATE]', 'user_email': '[PRIVATE]', 'configuration': {'load': {'sourceUris': [[PRIVATE]], 'destinationTable': {'projectId': '[PRIVATE]', 'datasetId': 'airflow', 'tableId': '[PRIVATE]'}, 'createDisposition': 'CREATE_IF_NEEDED', 'writeDisposition': 'WRITE_APPEND', 'sourceFormat': 'NEWLINE_DELIMITED_JSON', 'ignoreUnknownValues': True, 'autodetect': True}, 'jobType': 'LOAD'}, 'jobReference': {'projectId': '[PRIVATE]', 'jobId': '[PRIVATE]', 'location': 'EU'}, 'statistics': {'creationTime': '1581675754961', 'startTime': '1581675755090', 'endTime': '1581675755491'}, 'status': {'errorResult': {'reason': 'invalid', 'message': 'Provided Schema does not match Table [PRIVATE]. Cannot add fields (field: source_fingerprint)'}, 'errors': [{'reason': 'invalid', 'message': 'Provided Schema does not match Table [PRIVATE]. Cannot add fields (field: source_fingerprint)'}], 'state': 'DONE'}}
Note that the ignoreUnknownValues option also comes back in the response, so it was understood on their side.
I would expect the extra columns to be ignored and the job to complete successfully, as per the documentation:
ignoreUnknownValues: boolean
[Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. The sourceFormat property determines what BigQuery treats as an extra value:
CSV: Trailing columns
JSON: Named values that don't match any column names
Does anyone know what is happening?
Note that I do not want to update my schema (and hence, I'm not using the option schemaUpdateOptions). I would like the extra columns to be ignored.
Thanks
--
Update 1: I'm using Airflow 1.10.3, which already supports this syntax for this option. Older versions of Airflow had a different way of passing this parameter, but as we can see in the payload I posted, Airflow seems to be sending the right option to the Google BigQuery API (so the related question does not apply).
Update 2: I'm also getting the same error when using the bq CLI:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON --noreplace --ignore_unknown_values [MY TABLE NAME] [MY GCS PATH]
Waiting on bqjob_[...]_1 ... (0s) Current status: DONE BigQuery error in load operation: Error processing job '[...]': Provided Schema does not match Table [...]. Cannot add fields (field: metadata_deposit_00_sourceId)
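For completeness, the same load configuration can also be expressed with the google-cloud-bigquery Python client, which is a convenient way to reproduce this outside Airflow. A minimal sketch (the table ID and GCS URI below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.autodetect = True
job_config.ignore_unknown_values = True

# Placeholder URI and table; the combination of autodetect and
# ignore_unknown_values mirrors the payload shown above.
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.json",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # blocks until the job finishes and raises on error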
Update 3: It looks like the issue happens when I use both autodetect and ignore_unknown_values at the same time. If I provide the existing schema as schema_fields, then ignore_unknown_values works as I expect, although that isn't made clear in the documentation (see the sketch below).
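A minimal sketch of that workaround on the operator; the schema_fields entries below are made-up examples and should be replaced with the existing table schema:

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

load = GoogleCloudStorageToBigQueryOperator(
    task_id="load",
    bigquery_conn_id="bigquery_default",
    destination_project_dataset_table="<HIDDEN>",
    bucket="<HIDDEN>",
    source_objects=list_files(),
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_APPEND",
    # Explicit schema instead of autodetect=True; field names are illustrative.
    schema_fields=[
        {"name": "id", "type": "STRING", "mode": "REQUIRED"},
        {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
    ],
    # With an explicit schema, unknown JSON fields in the source are dropped.
    ignore_unknown_values=True,
)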
Solution 1:
I found a way to ignore unknown values while using autodetect and without needing to pass the schema, essentially replicating the typical bq CLI command below by passing the source-format-specific parameters on the operator (src_fmt_configs):
bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect dataset.table table_source
to_bq = GCSToBigQueryOperator(
    task_id="to_bq",
    bucket=YOUR_BUCKET_NAME,
    source_objects=[YOUR_SOURCE_OBJECTS],
    destination_project_dataset_table=YOUR_TABLE,
    source_format="NEWLINE_DELIMITED_JSON",
    src_fmt_configs={"ignoreUnknownValues": True, "autodetect": False},
    write_disposition=YOUR_WRITE_DISPOSITION,
)
If you need to specify a partitioning field, do it through the operator's time_partitioning parameter, as in the sketch below.
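A short sketch extending the snippet above; the partitioning type and field name are illustrative placeholders:

from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

to_bq = GCSToBigQueryOperator(
    task_id="to_bq",
    bucket=YOUR_BUCKET_NAME,
    source_objects=[YOUR_SOURCE_OBJECTS],
    destination_project_dataset_table=YOUR_TABLE,
    source_format="NEWLINE_DELIMITED_JSON",
    src_fmt_configs={"ignoreUnknownValues": True, "autodetect": False},
    write_disposition=YOUR_WRITE_DISPOSITION,
    # Day-partition the destination table on an illustrative timestamp column.
    time_partitioning={"type": "DAY", "field": "created_at"},
)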
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nicolas Anchano |
