_VALUE column when reading XML

Given this rather funky XML structure:

<Report>
  <Table>
    <List>
      <DTL a="abc"
           b="xyz"
           .../>
      <DTL a="bcd"
           b="foo"
...

If I read that into a data frame, I end up with this schema:

root
 |-- Table: struct (nullable = true)
 |    |-- List: struct (nullable = true)
 |    |    |-- DTL: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _a: string (nullable = true)
 |    |    |    |    |-- _b: string (nullable = true)
 |    |    |    |    |-- _VALUE: string (nullable = true)

I can't quite work out where this _VALUE column is coming from. I understand that a, b, etc. are attributes. The documentation says: "valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE."

What does an attribute having no child actually mean here? Other than excluding it from a downstream dataframe, is there any way to avoid this column?
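
One possible reading of that documentation sentence, offered as an assumption rather than as confirmed reader behaviour: "having no child" describes the element, not the attribute. An element that carries attributes and has no child elements can still hold text, and that text needs a column name of its own next to the attribute columns. The sketch below uses only Python's standard library (not the Spark XML reader) on a hypothetical sample modelled on the structure above. The second DTL, with text content, is invented for contrast.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question's structure.
# The second DTL carries text content, which is an invented contrast case.
SAMPLE = """
<Report>
  <Table>
    <List>
      <DTL a="abc" b="xyz"/>
      <DTL a="bcd" b="foo">some text</DTL>
    </List>
  </Table>
</Report>
"""

def describe(elem):
    """Return (attributes, text content or None, number of child elements)."""
    text = elem.text.strip() if elem.text and elem.text.strip() else None
    return dict(elem.attrib), text, len(elem)

root = ET.fromstring(SAMPLE.strip())
rows = [describe(dtl) for dtl in root.iter("DTL")]

for attrs, text, n_children in rows:
    # Attributes would map to the _a/_b style columns; the text content is
    # what a value column such as _VALUE would have to hold.
    print(attrs, text, n_children)
```

Under that reading, a self-closing element like the ones in the question has attributes, no children, and no text, so its value slot would simply be null in every row.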



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
