BigQueryOperator in Spark - can't write array struct to BigQuery table
In BigQuery, I have a column called actions whose type is RECORD in REPEATED mode. In Spark, I have a schema defined as:
import org.apache.spark.sql.types._

val action: StructType = (new StructType)
  .add("id", StringType)
  .add("name", StringType)
  .add("last", StringType)

val actionsList = new ArrayType(action, true)

val finalStruct: StructType = (new StructType)
  .add("record", StringType)
  .add("d", StringType)
  .add("actions", actionsList)
This is how my schema is defined; I then simply read the data in and write it to BigQuery:
val df = spark.read.schema(finalStruct).json(rdd)
df.createOrReplaceTempView("myData")
val finalDf = spark.sql("SELECT record AS my_rec, d AS inc_date, actions FROM myData")
finalDf.write.mode("append").format("bigquery")...save()
However, when I attempt to write the DataFrame, I get this error:
BigQuery error was provided Schema does not match Table <table_name_here>.
Cannot add fields (field: actions.list)
What's the proper way to define this schema? My incoming data is JSON, like:
{
  "recordName": "name_here",
  "date": "2020-01-01",
  "actions": [
    {
      "id": "1",
      "name": "aaa",
      "last": "bbb"
    },
    {
      "id": "2",
      "name": "qqq",
      "last": "www"
    }
  ]
}
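As an aside: the sample JSON's keys (recordName, date) don't match the schema's field names (record, d), and spark.read.json matches fields by name, so those two columns would read back as null. A schema matching the sample, reusing the same actionsList, would be:

val finalStruct: StructType = (new StructType)
  .add("recordName", StringType)
  .add("date", StringType)
  .add("actions", actionsList)

(The SELECT would then reference recordName and date instead of record and d.)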
Solution 1:[1]
It's a known issue when the connector is used with its default settings, where Parquet is used as the intermediate format (see a similar bug report).
Changing the intermediate format to ORC solves the issue:
spark.conf.set("spark.datasource.bigquery.intermediateFormat", "orc")
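The setting can also be scoped to a single write instead of the whole session, since the connector accepts intermediateFormat as a write option. A minimal sketch, where the GCS bucket and table names are placeholders:

finalDf.write
  .mode("append")
  .format("bigquery")
  // Write the intermediate files as ORC instead of the default Parquet
  .option("intermediateFormat", "orc")
  // Bucket for the intermediate files; "some-bucket" is a placeholder
  .option("temporaryGcsBucket", "some-bucket")
  .save("dataset.table_name") // placeholder table reference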
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mariusz |
