'Extracting the value from an xml attribute while reading xml file in spark

I am loading a xml file using spark-xml connector and I want to parse everything from xml.

XML:

<order order-no="15">
    <custom-attributes>
        <custom-attribute attribute-id="ageCheck">true</custom-attribute>
        <custom-attribute attribute-id="brandCode">ideal</custom-attribute>
        <custom-attribute attribute-id="couponCodes">
            <value>{"type":"coupon","ID":"432hiu328dsbwe","couponCode":"SALE10"}</value>
            <value>{"type":"coupon","ID":"ewiuf4378498hs","couponCode":"SALE20"}</value>
        </custom-attribute>
        <custom-attribute attribute-id="discount">36.74</custom-attribute>
        <custom-attribute attribute-id="originalPrice">121.68</custom-attribute>
    </custom-attributes>
</order>

The xml data has a custom-attribute tag that contains an xml attribute: attribute-id. The problem is that, this attribute can have a child element with tag "value" as in case of couponCodes:

<custom-attribute attribute-id="couponCodes">
      <value>{"type":"coupon","ID":"432hiu328dsbwe","couponCode":"SALE10"}</value>
      <value>{"type":"coupon","ID":"ewiuf4378498hs","couponCode":"SALE20"}</value>
</custom-attribute>

or it may have just a normal value without any child elements like this :

<custom-attribute attribute-id="ageCheck">true</custom-attribute>

I created a schema to read this xml file into a spark dataframe:

customSchema = StructType([
  StructField("_order-no" ,StringType()),
  StructField("custom-attributes",StructType([
    StructField("custom-attribute",ArrayType(StructType([
      StructField("_VALUE",StringType()),
      StructField("_attribute-id",StringType()),
      StructField("value",ArrayType(StringType()))
      ]))
    )])
  )])

df = (spark.read.format('xml')
        .option("rowTag", "order")
        .schema(customSchema)
        .load("orders.xml"))

But in this case, it doesn't parse the _VALUE inside custom-attributes. The Dataframe looks like this Data

And If I try creating a schema by removing "value" array from custom-attributes, then it parses the _VALUE field correctly. However it stops at "couponCodes" and does not parse anything after that.

customSchema = StructType([
  StructField("_order-no" ,StringType()),
  StructField("custom-attributes",StructType([
    StructField("custom-attribute",ArrayType(StructType([
      StructField("_VALUE",StringType()),
      StructField("_attribute-id",StringType())
        ]))
      )])
    )])

df = (spark.read.format('xml')
      .option("rowTag", "order")
      .schema(customSchema)
      .load("orders.xml"))

In this case the data look like this: Data

Is there a way to parse all types of values from attribute-id in this xml ?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source