Using XSD in PySpark

I am building a data warehouse in Azure Synapse where one of the sources is a set of about 20 different types of XML files (each with its own XSD schema) plus 1 base schema.

What I am looking for is a way to extract all XML elements and store them in files (1 per type) in my data lake. For that I need a unique name per element, for example the whole path as the name. I tried to define dicts per type with all element names, but this is quite some work. To automate this (the XSDs are updated yearly), I tried to code it out in Excel and VBA, but the XSDs are quite complex, with nested complex types etc. Below is a snippet of baseschema.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="http://www.website.org/typ/1/baseschema/schema" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:iwmo="http://www.website.org/typ/1/baseschema/schema">
    <xs:complexType name="Complex_Address">
        ...
        <xs:sequence>
            <xs:element name="Home" type="iwmo:Complex_House" minOccurs="0">
                ...
            </xs:element>
            <xs:element name="Postalcode" type="iwmo:Simple_Postalcode" minOccurs="0">
                ...
            </xs:element>
            <xs:element name="Streetname" type="iwmo:Simple_Streetname" minOccurs="0">
                ...
            </xs:element>
            <xs:element name="Areaname" type="iwmo:Simple_Areaname" minOccurs="0">
                ...
            </xs:element>
            <xs:element name="CountryCode" type="iwmo:Simple_CountryCode" minOccurs="0">
                ...
            </xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="Complex_House">
        ...
        <xs:sequence>
            <xs:element name="Housenumber" type="iwmo:Simple_Housenumber">
                ...
            </xs:element>
            <xs:element name="Houseletter" type="iwmo:Simple_Houseletter" minOccurs="0">
                ...
            </xs:element>
            <xs:element name="HousenumberAddition" type="iwmo:Simple_HousenumberAddition" minOccurs="0">
                ...
            </xs:element>
            <xs:element name="IndicationAddress" type="iwmo:Simple_IndicationAddress" minOccurs="0">
                ...
            </xs:element>
        </xs:sequence>
    </xs:complexType>

    <xs:complexType name="Complex_MessageIdentification">
        ...
        <xs:sequence>
            <xs:element name="Identification" type="iwmo:Simple_IdentificationMessage">
                ...
            </xs:element>
            <xs:element name="Date" type="iwmo:Simple_Date">
                ...
            </xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="Complex_Product">
        ...
        <xs:sequence>
            <xs:element name="Categorie" type="iwmo:Simple_ProductCategory">
                ...
            </xs:element>
            <xs:element name="Code" type="iwmo:Simple_ProductCode" minOccurs="0">
                ...
            </xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="Complex_XsdVersion">
        <xs:sequence>
            <xs:element name="BaseschemaXsdVersion" type="iwmo:Simple_Version">
            </xs:element>
            <xs:element name="MessageXsdVersion" type="iwmo:Simple_Version">
            </xs:element>
        </xs:sequence>
    </xs:complexType>

And here is a snippet of the XSD of one of the message types:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:typ="http://www.website.org/typ/1/baseschema/schema" xmlns:type1="http://www.website.org/typ/1/type1/schema" targetNamespace="http://www.website.org/typ/1/type1/schema" elementFormDefault="qualified">
    <xs:import namespace="http://www.website.org/typ/1/baseschema/schema" schemaLocation="baseschema.xsd"></xs:import>
    <xs:element name="Message" type="type1:Root"></xs:element>
    <xs:complexType name="Root">
        ...
        <xs:sequence>
            <xs:element name="Header" type="type1:Header"></xs:element>
            <xs:element name="Client" type="type1:Client"></xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="Header">
        <xs:sequence>
            <xs:element name="Person" type="typ:Simple_SpecialCode">
                ...
            </xs:element>
            <xs:element name="MessageIdentification" type="typ:Complex_MessageIdentification">
                ...
            </xs:element>
            <xs:element name="XsdVersion" type="typ:Complex_XsdVersion">
                ...
            </xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="Client">
        ...
        <xs:sequence>
            <xs:element name="AssignedProducts" type="type1:AssignedProducts"></xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="AssignedProducts">
        <xs:sequence>
            <xs:element name="AssignedProduct" type="type1:AssignedProduct" maxOccurs="unbounded"></xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="AssignedProduct">
        ...
        <xs:sequence>
            <xs:element name="ToewijzingNummer" type="typ:Simple_Nummer">
                ...
            </xs:element>
            <xs:element name="Product" type="typ:Complex_Product" minOccurs="0">
                ...
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:schema>

Then this would be the desired output:

Header_Person
Header_MessageIdentification_Identification
Header_MessageIdentification_Date
Header_XsdVersion_BaseschemaXsdVersion
Header_XsdVersion_MessageXsdVersion
Client_AssignedProduct_ToewijzingNummer
Client_AssignedProduct_Product_Categorie
Client_AssignedProduct_Product_Code

In the baseschema snippet I also included a nested complex type (Complex_House inside Complex_Address), to show the complexity.

Is there a Python package or something similar that can help me achieve this? A tool that simply writes this list of elements to a text file would also be great; I could then easily copy it into a variable.

I'm not sure whether my requirements are clear, or whether this is posted in the correct group with the correct tags, but I hope someone can point me to a good solution.

Ronald



Solution 1:[1]

I found a workaround after all: I put all fields from the XSDs in variables. It's not ideal, but any other way would be too complex.
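
For schemas shaped like the snippets in the question, the path list could probably also be generated with a recursive walk over the XSDs. Below is a minimal sketch using only the Python standard library; it is not the workaround described above and has not been run against the real XSDs. The helper names (`load_schema`, `leaf_paths`, `element_paths`) and the two cut-down inline schemas are made up for illustration. It assumes every element has `name` and `type` attributes, child elements sit directly in an `xs:sequence`, and types are not recursive; `xs:choice`, extensions, `ref=` and anonymous types are ignored.

```python
import io
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
BASE_NS = "http://www.website.org/typ/1/baseschema/schema"
MSG_NS = "http://www.website.org/typ/1/type1/schema"

# Cut-down stand-ins for baseschema.xsd and one message XSD (assumption:
# simple types are left out, so unknown type names are treated as leaves).
BASE_XSD = f"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:iwmo="{BASE_NS}" targetNamespace="{BASE_NS}">
  <xs:complexType name="Complex_Product">
    <xs:sequence>
      <xs:element name="Categorie" type="iwmo:Simple_ProductCategory"/>
      <xs:element name="Code" type="iwmo:Simple_ProductCode" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

MESSAGE_XSD = f"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:typ="{BASE_NS}" xmlns:type1="{MSG_NS}" targetNamespace="{MSG_NS}">
  <xs:element name="Message" type="type1:Root"/>
  <xs:complexType name="Root">
    <xs:sequence>
      <xs:element name="Header" type="type1:Header"/>
      <xs:element name="Client" type="type1:Client"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Header">
    <xs:sequence>
      <xs:element name="Person" type="typ:Simple_SpecialCode"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Client">
    <xs:sequence>
      <xs:element name="Product" type="typ:Complex_Product" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def load_schema(xsd_text, complex_types):
    """Register each named complexType of one XSD under (namespace, name)."""
    prefixes = {}
    parser = ET.iterparse(io.BytesIO(xsd_text.encode("utf-8")),
                          events=("start-ns",))
    for _, (prefix, uri) in parser:
        prefixes[prefix] = uri  # prefix -> namespace URI declared in this file
    root = parser.root  # available once the source is fully read
    target_ns = root.get("targetNamespace")
    for ctype in root.findall(XS + "complexType"):
        complex_types[(target_ns, ctype.get("name"))] = (ctype, prefixes)
    return root, prefixes


def leaf_paths(type_qname, prefixes, complex_types, path=()):
    """Yield '_'-joined element paths below the type named by type_qname."""
    prefix, _, local = type_qname.rpartition(":")
    entry = complex_types.get((prefixes.get(prefix), local))
    if entry is None:
        # Not a registered complexType: treat the current path as a leaf.
        yield "_".join(path)
        return
    ctype, ctype_prefixes = entry
    # Only direct xs:sequence children are followed (see assumptions above).
    for child in ctype.findall(f"{XS}sequence/{XS}element"):
        yield from leaf_paths(child.get("type", ""), ctype_prefixes,
                              complex_types, path + (child.get("name"),))


def element_paths(message_xsd, *imported_xsds):
    """Return leaf paths under the first global element of message_xsd."""
    complex_types = {}
    for xsd in imported_xsds:
        load_schema(xsd, complex_types)
    root, prefixes = load_schema(message_xsd, complex_types)
    top = root.find(XS + "element")  # e.g. the Message element
    return list(leaf_paths(top.get("type"), prefixes, complex_types))


paths = element_paths(MESSAGE_XSD, BASE_XSD)
print("\n".join(paths))
```

Every element level ends up in the path, so a wrapper such as AssignedProducts would be included, which differs from the desired output listed in the question. In practice the strings would be read from the .xsd files, with baseschema.xsd passed as an imported schema. A full XSD library such as the xmlschema package on PyPI may cover more of the specification than this sketch does.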

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ronald Hensbergen