How to automate casting of empty array<string> elements to array<struct<>> elements in PySpark

We have a column, LogicalLinks, in a PySpark DataFrame whose schema is an array of structs.

Data:

{
    "LogicalLinks": [
        {
            "ClassId": "myclassId",
            "DbId": 1140,
            "IsPresent": true,
            "LinkAddress1": "",
            "LinkAddress2": "",
            "ObjectType": "myObjectType",
            "State": "Established",
            "Type": "KGF",
            "Uptime": "18:14:41"
        }
    ]
}


Schema:

 |-- LogicalLinks: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ClassId: string (nullable = true)
 |    |    |-- DbId: long (nullable = true)
 |    |    |-- IsPresent: boolean (nullable = true)

The struct has several more nested fields, as the sample data above shows.

If the value of this column is not an empty array, the data is saved successfully to the destination Hudi table.

The issue is that when the same column arrives as a blank or empty array, it is treated as array(string) in the DataFrame.

Data:

{"LogicalLinks": []}


Schema:

 |-- LogicalLinks: array (nullable = true)
 |    |-- element: string (containsNull = true)

This is not saved to the Hudi table because the schemas differ: the table expects array(struct), but receives array(string) when the column is an empty array. If the same column arrives as an array of struct elements, the write works fine.

Could you please advise how to handle this scenario in PySpark to avoid the following error?

"Failed to merge due to data type mismatch: cannot cast array to array<struct<..... Expected instance of group converter but got "org.apache.parquet.avro.AvroConverters$FieldUTF8Converter"

How can we handle array of struct and array of string in the same column so that, even when we receive an empty array or blank values, we can save the data by casting the column to an array of structs?

Thanks



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
