'Extracting the value from an xml attribute while reading xml file in spark
I am loading a xml file using spark-xml connector and I want to parse everything from xml.
XML:
<order order-no="15">
<custom-attributes>
<custom-attribute attribute-id="ageCheck">true</custom-attribute>
<custom-attribute attribute-id="brandCode">ideal</custom-attribute>
<custom-attribute attribute-id="couponCodes">
<value>{"type":"coupon","ID":"432hiu328dsbwe","couponCode":"SALE10"}</value>
<value>{"type":"coupon","ID":"ewiuf4378498hs","couponCode":"SALE20"}</value>
</custom-attribute>
<custom-attribute attribute-id="discount">36.74</custom-attribute>
<custom-attribute attribute-id="originalPrice">121.68</custom-attribute>
</custom-attributes>
</order>
The xml data has a custom-attribute tag that contains an xml attribute: attribute-id. The problem is that, this attribute can have a child element with tag "value" as in case of couponCodes:
<custom-attribute attribute-id="couponCodes">
<value>{"type":"coupon","ID":"432hiu328dsbwe","couponCode":"SALE10"}</value>
<value>{"type":"coupon","ID":"ewiuf4378498hs","couponCode":"SALE20"}</value>
</custom-attribute>
or it may have just a normal value without any child elements like this :
<custom-attribute attribute-id="ageCheck">true</custom-attribute>
I created a schema to read this xml file into a spark dataframe:
customSchema = StructType([
StructField("_order-no" ,StringType()),
StructField("custom-attributes",StructType([
StructField("custom-attribute",ArrayType(StructType([
StructField("_VALUE",StringType()),
StructField("_attribute-id",StringType()),
StructField("value",ArrayType(StringType()))
]))
)])
)])
df = (spark.read.format('xml')
.option("rowTag", "order")
.schema(customSchema)
.load("orders.xml"))
But in this case, it doesn't parse the _VALUE inside custom-attributes. The Dataframe looks like this Data
And If I try creating a schema by removing "value" array from custom-attributes, then it parses the _VALUE field correctly. However it stops at "couponCodes" and does not parse anything after that.
customSchema = StructType([
StructField("_order-no" ,StringType()),
StructField("custom-attributes",StructType([
StructField("custom-attribute",ArrayType(StructType([
StructField("_VALUE",StringType()),
StructField("_attribute-id",StringType())
]))
)])
)])
df = (spark.read.format('xml')
.option("rowTag", "order")
.schema(customSchema)
.load("orders.xml"))
In this case the data look like this: Data
Is there a way to parse all types of values from attribute-id in this xml ?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
