What exactly are the limitations on Pandas read_xml now?

I'm not the first person to ask a question like this. However, that was 7 months ago, so I searched to see whether the Pandas team has since addressed this officially, so that a workaround is no longer needed.

I first found this issue. Response from maintainer:

Thanks @wangrenz! This is a great use case. Large XML file support was in the works for both lxml and etree parsers (see #40131 ) We may not need an additional argument but have read_xml catch this exception and attempt the lxml workaround.

How large was your XML file? Can you post a reproducible example of its content (redact as needed)?

(The answer was > 10 GB)

Then the issue was closed, referencing a PR.

I read through the PR, but it was not clear to me what exactly was being discussed or what the outcome was.

Can someone explain whether this PR was supposed to solve the problem, or what, generally speaking, the outcome was?

The Notes section of the documentation says:

This method is best designed to import shallow XML documents in following format which is the ideal fit for the two-dimensions of a DataFrame (row by column).

This seems to suggest a two-dimensional structure rather than a deeply nested one, but phrases like "is the ideal fit" and "best designed" aren't exactly precise. I'm left wondering whether my use case (just 3 elements deep) is supported or not. DataFrames themselves don't seem to impose such a limitation.

My Pandas version:

1.4.1

My code:

import pandas as pd
df = pd.read_xml('./data/xml/boot.xml')
# >1GB file, sampled below

The sample shown in the documentation:

<root>
    <row>
      <column1>data</column1>
      <column2>data</column2>
      <column3>data</column3>
      ...
   </row>
   <row>
      ...
   </row>
   ...
</root>

My XML is one element deeper, but again, the documentation does not clearly state whether there is a hard limit on depth:

<Events>
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
        <System>
            <Provider Guid="{9e814aad-3204-11d2-9a82-006008a86939}" />
            <EventID>0</EventID>
            <Version>3</Version>
            <Level>0</Level>
            <Task>0</Task>
            <Opcode>10</Opcode>
            <Keywords>0x0</Keywords>
            <TimeCreated SystemTime="2022-03-15T02:38:36.618192200-04:00" />
            <Correlation ActivityID="{00000000-0000-0000-0000-000000000000}" />
            <Execution ProcessID="4294967295" ThreadID="4294967295" ProcessorID="0" KernelTime="0" UserTime="0" />
            <Channel />
            <Computer />
        </System>
        <EventData>
            <Data Name="DiskNumber">       0</Data>
            <Data Name="IrpFlags">0x60043</Data>
            <Data Name="TransferSize">   16384</Data>
            <Data Name="Reserved">       0</Data>
            <Data Name="ByteOffset">145884773376</Data>
            <Data Name="FileObject">0xFFFF8607EDCE8700</Data>
            <Data Name="Irp">0xFFFFC10F8E1C0420</Data>
            <Data Name="HighResResponseTime">1598</Data>
            <Data Name="IssuingThreadId">    4192</Data>
        </EventData>
        <RenderingInfo Culture="en-US">
            <Opcode>Read</Opcode>
            <Provider>MSNT_SystemTrace</Provider>
            <EventName xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">DiskIo</EventName>
        </RenderingInfo>
        <ExtendedTracingInfo xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">
            <EventGuid>{3d6fa8d4-fe05-11d0-9dda-00c04fd7ba7c}</EventGuid>
        </ExtendedTracingInfo>
    </Event>
</Events>
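
For reference, this is roughly the kind of workaround I was hoping not to need. It is only a minimal sketch, assuming the standard library's xml.etree.ElementTree.iterparse; the helper name iter_event_rows, the trimmed SAMPLE document, and the column naming (for example TimeCreated_SystemTime) are my own choices for illustration, not anything pandas provides. It walks the file one Event at a time and flattens System and EventData into a single dict per event:

```python
import io
import xml.etree.ElementTree as ET

# Default namespace declared on <Event> in the sample above.
NS = "{http://schemas.microsoft.com/win/2004/08/events/event}"


def iter_event_rows(source):
    """Yield one flat dict per <Event> (hypothetical helper, not a pandas API)."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "Event":
            continue
        row = {}
        system = elem.find(NS + "System")
        if system is not None:
            for child in system:
                name = child.tag.split("}", 1)[-1]  # strip the namespace prefix
                if child.text and child.text.strip():
                    row[name] = child.text.strip()
                for attr, value in child.attrib.items():
                    row[f"{name}_{attr}"] = value
        for data in elem.iterfind(f"{NS}EventData/{NS}Data"):
            row[data.get("Name")] = (data.text or "").strip()
        yield row
        # Drop the children of the processed event. The emptied <Event>
        # element itself stays attached to the root, so memory is reduced
        # but not perfectly constant.
        elem.clear()


# Trimmed-down version of the sample document, for a quick check.
SAMPLE = b"""<Events>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
      <EventID>0</EventID>
      <TimeCreated SystemTime="2022-03-15T02:38:36.618192200-04:00" />
    </System>
    <EventData>
      <Data Name="DiskNumber">       0</Data>
      <Data Name="TransferSize">   16384</Data>
    </EventData>
  </Event>
</Events>"""

rows = list(iter_event_rows(io.BytesIO(SAMPLE)))
```

The resulting list of dicts could then be passed to pd.DataFrame(rows); for the real file, a path would replace the BytesIO object. All values come back as strings, so any numeric conversion would be a separate step.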

My error is the same as the one in the older question:

MemoryError                               Traceback (most recent call last)
Input In [1], in <cell line: 3>()
      1 import pandas as pd
----> 3 df = pd.read_xml('./data/xml/boot.xml')

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:938, in read_xml(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, encoding, parser, stylesheet, compression, storage_options)
    738 @deprecate_nonkeyword_arguments(
    739     version=None, allowed_args=["path_or_buffer"], stacklevel=2
    740 )
   (...)
    757     storage_options: StorageOptions = None,
    758 ) -> DataFrame:
    759     r"""
    760     Read XML document into a ``DataFrame`` object.
    761 
   (...)
    935     2  triangle      180    3.0
    936     """
--> 938     return _parse(
    939         path_or_buffer=path_or_buffer,
    940         xpath=xpath,
    941         namespaces=namespaces,
    942         elems_only=elems_only,
    943         attrs_only=attrs_only,
    944         names=names,
    945         encoding=encoding,
    946         parser=parser,
    947         stylesheet=stylesheet,
    948         compression=compression,
    949         storage_options=storage_options,
    950     )

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:733, in _parse(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, encoding, parser, stylesheet, compression, storage_options, **kwargs)
    730 else:
    731     raise ValueError("Values for parser can only be lxml or etree.")
--> 733 data_dicts = p.parse_data()
    735 return _data_to_frame(data=data_dicts, **kwargs)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:389, in _LxmlFrameParser.parse_data(self)
    380 """
    381 Parse xml data.
    382 
   (...)
    385 and parse original or transformed XML and return specific nodes.
    386 """
    387 from lxml.etree import XML
--> 389 self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
    391 if self.stylesheet is not None:
    392     self.xsl_doc = XML(self._parse_doc(self.stylesheet))

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:545, in _LxmlFrameParser._parse_doc(self, raw_doc)
    531 from lxml.etree import (
    532     XMLParser,
    533     fromstring,
    534     parse,
    535     tostring,
    536 )
    538 handle_data = get_data_from_filepath(
    539     filepath_or_buffer=raw_doc,
    540     encoding=self.encoding,
    541     compression=self.compression,
    542     storage_options=self.storage_options,
    543 )
--> 545 with preprocess_data(handle_data) as xml_data:
    546     curr_parser = XMLParser(encoding=self.encoding)
    548     if isinstance(xml_data, io.StringIO):

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:636, in preprocess_data(data)
    627 """
    628 Convert extracted raw data.
    629 
   (...)
    632 StringIO/BytesIO) or is a string or bytes that is an XML document.
    633 """
    635 if isinstance(data, str):
--> 636     data = io.StringIO(data)
    638 elif isinstance(data, bytes):
    639     data = io.BytesIO(data)

MemoryError: 


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
