AWS Lambda, Python 3 and large XLSX files to CSV files

I have a batch of XLSX files that range from 10Mb to 400Mb in size. They always have the same sheets and structure inside, but some contain more data than others.

I am trying to process these using AWS Lambda; it's part of a submission process, so the file dropping into S3 is the event that triggers the Lambda.

I'm quickly learning that XLSX is a horrid format for this, but I cannot change it. Currently my main Lambda works using this class I found online and slightly altered. Memory usage and speed have improved over pandas' read_excel, but it's still not enough: with the 400Mb files the Lambda simply times out or exhausts its memory allocation (even at the maximum).

Doing some memory profiling on the script, I can see that memory usage drops after the pivot operation, but I can't really skip that step.

Pre-pivot DF: ~1230Mb (memory-profiler screenshot)

Post-pivot DF: ~220Mb (memory-profiler screenshot)
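For reference, per-frame numbers like the ones above can be checked without a full profiler using pandas' own accounting. This is a sketch with a toy frame, not the original profiling setup:

```python
from pandas import DataFrame

# Toy frame standing in for the pre-pivot DataFrame; deep=True makes
# pandas count the actual Python string payloads, not just the pointers.
df = DataFrame({"row": range(1000), "value": ["x" * 20] * 1000})
mb = df.memory_usage(deep=True).sum() / 1024 ** 2  # total size in Mb
```

Object-dtype columns are where the pre-pivot blow-up comes from: every cell holds a full Python string object, so the long pre-pivot frame (one row per cell) dwarfs the pivoted result.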

Any tips on how I can improve this to be more memory efficient?

Each sheet needs to be saved to its own CSV, but it can be split into multiple CSV files if that helps, maybe with a chunksize-style iterator?

import io
import zipfile
from lxml import etree
from pandas import read_csv, to_numeric


class ExcelParse:
    sheet_xslt = etree.XML('''
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:sp="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
            >
            <xsl:output method="text"/>
            <xsl:template match="sp:row">
               <xsl:for-each select="sp:c">
                <xsl:value-of select="parent::*/@r"/> <!-- ROW -->
                <xsl:text>,</xsl:text>
                <xsl:value-of select="@r"/> <!--REMOVEME-->
                <xsl:text>,</xsl:text>
                <xsl:value-of select="@t"/> <!-- TYPE -->
                <xsl:text>,</xsl:text>
                <xsl:value-of select="sp:v/text()"/> <!-- VALUE -->
               <xsl:text>\n</xsl:text>
               </xsl:for-each>
            </xsl:template>
        </xsl:stylesheet>
    ''')

    def __init__(self, file):
        self.fh = zipfile.ZipFile(file)
        self.ns = {
            'ns': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
        }
        self.shared = self.load_shared()
        self.workbook = self.load_workbook()

    def load_workbook(self):
        # Load workbook
        name = 'xl/workbook.xml'
        root = etree.parse(self.fh.open(name))
        res = {}
        for el in etree.XPath("//ns:sheet", namespaces=self.ns)(root):
            res[el.attrib['name']] = str(
                int(el.attrib['sheetId']) -
                1)  # Sheet ID in the XML starts at 2 for some reason?
        return res

    def load_shared(self):
        # Load shared strings
        name = 'xl/sharedStrings.xml'
        root = etree.parse(self.fh.open(name))
        res = etree.XPath("/ns:sst/ns:si/ns:t", namespaces=self.ns)(root)
        return {str(pos): el.text for pos, el in enumerate(res)}

    def _parse_sheet(self, root):
        transform = etree.XSLT(self.sheet_xslt)
        result = transform(root)
        df = read_csv(io.StringIO(str(result)),
                      header=None,
                      names=['row', 'cell', 'type', 'value'])
        return df

    def read(self, sheet_name):
        sheet_id = self.workbook[sheet_name]
        sheet_path = f'xl/worksheets/sheet{sheet_id}.xml'
        root = etree.parse(self.fh.open(sheet_path))
        df = self._parse_sheet(root)

        # First row numbers are filled with nan
        df['row'] = to_numeric(df['row'].fillna(0))

        # Translate string contents
        cond = (df.type == 's') & (~df.value.isnull())
        df.loc[cond, 'value'] = df[cond]['value'].map(self.shared)
        # Add column number and sort rows
        df['col'] = df.cell.str.replace(r'[0-9]+', '', regex=True)

        # Pivot everything
        df = df.pivot(index='row', columns='col',
                      values='value').reset_index(drop=True)
        df.columns.name = None  # pivot adds a name to the "columns" array
        # Sort columns (pivot will put AA before B)
        cols = sorted(df.columns, key=lambda x: (len(x), x))
        df = df[cols]
        df = df.dropna(how='all')  # Ignore empty lines
        df = df.dropna(how='all', axis=1)  # Ignore empty cols

        new_header = df.iloc[0]  # Grab the first row for the header
        df = df[1:]  # Take the data less the header row
        df.columns = new_header  # Set the header row as the df header

        return df


def new_method():
    xlsx = ExcelParse(
        'BigFile.xlsx'
    )
    print(xlsx.workbook)
    df = xlsx.read('Task')

    # for sheet_name, sheet_id in xlsx.workbook.items():
    #     df = xlsx.read(sheet_name)
    #     do stuff


new_method()



Solution 1:[1]

To speed up the process, you can try the following:

  1. Try using xlsx2csv to convert each sheet in the workbook to CSV, then do a pd.read_csv(). CSV reads are faster than Excel reads.
  2. Since, as you said, the data has a fixed structure that will not change, use the 'dtype' option in read_csv. This lets pandas skip automatically inferring the data type of each column, which should save some time.
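A minimal sketch of both points together. The xlsx2csv step is shown only as a comment because it needs a real workbook on disk (and `pip install xlsx2csv`); the file names are assumptions:

```python
import io
from pandas import read_csv

# Step 1 (assumed setup, not runnable here):
#   from xlsx2csv import Xlsx2csv
#   Xlsx2csv("BigFile.xlsx", outputencoding="utf-8").convert("sheet1.csv", sheetid=1)

# Step 2: once each sheet is a CSV, an explicit dtype skips pandas'
# per-column type inference entirely.
csv_data = io.StringIO("id,name,amount\n1,foo,10.5\n2,bar,20.0\n")
df = read_csv(csv_data, dtype=str)  # every column stays a string
```

With a fixed structure you could also pass a per-column mapping, e.g. `dtype={"id": "int32", "amount": "float32"}`, to trade the all-string approach for smaller numeric columns.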

For the memory issue,

  1. I was not able to wrap my head around 10GB of RAM being fully used, but in any case, as far as I know, a drastic improvement may be hard. The 'dtype' option may help here too, because some data types take more memory than others; you could make all columns str and check. read_csv also has a chunksize option you can explore.

  2. You can also explore the low_memory and memory_map options of read_csv. I haven't tried them, so I can't really speak to their effectiveness.

  3. In case I got it wrong and you are talking about storage space: you can use storage_options in read_csv to read/write directly from/to S3, if you are not doing that already.

  4. If you need more local storage, check out EFS for Lambda.

  5. Alternatively, you can build a Fargate image and invoke it from your Lambda. That gives you long execution times, if that is the constraint.
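The chunksize pattern from point 1 fits the asker's "split into multiple CSV files" requirement directly. A sketch, with StringIO standing in for the input sheet-CSV and the output files:

```python
import io
from pandas import read_csv

# Ten small rows stand in for a sheet already converted to CSV; in the
# real pipeline this would be a file handle (or an S3 object).
source = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# chunksize turns read_csv into an iterator of small DataFrames, so only
# one chunk is resident in memory at a time.
parts = []
for i, chunk in enumerate(read_csv(source, dtype=str, chunksize=4)):
    out = io.StringIO()  # stand-in for one output CSV file
    chunk.to_csv(out, index=False, header=(i == 0))
    parts.append(out.getvalue())
```

Writing the header only on the first chunk lets the parts be concatenated back into a single valid CSV later if needed.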

Solution 2:[2]

Note that if you allocate less than about 1.7GB of memory to a Lambda, you get only a fraction of a vCPU. That's why things may look surprisingly slow. Doubling the memory doubles the CPU, which halves execution time for CPU-bound tasks; and double the memory for half the duration costs the same amount (again, assuming the task is CPU-bound).

Try bumping the memory up to 1.7GB if it's a single-threaded process, and even more if it's multi-threaded. (I think pandas uses aiobotocore under the hood to download from S3, so more may help.)
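The cost claim checks out with quick arithmetic. The per-GB-second rate below is an assumption (it varies by region), but the comparison does not depend on it:

```python
# Lambda bills memory (GB) x billed duration (s) at a flat per-GB-second
# rate. The exact figure here is an assumed example rate.
RATE_USD_PER_GB_S = 0.0000166667

def lambda_cost(memory_gb: float, duration_s: float) -> float:
    """Back-of-the-envelope Lambda compute cost for one invocation."""
    return memory_gb * duration_s * RATE_USD_PER_GB_S

# For a CPU-bound task, doubling memory doubles CPU and halves duration:
base = lambda_cost(1.7, 60.0)      # 1.7GB for 60s
doubled = lambda_cost(3.4, 30.0)   # 3.4GB for 30s: same bill, done sooner
```

So for CPU-bound work, more memory is effectively free speed up to the point where the task stops being CPU-bound.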

That helps address speed.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nishad Wadwekar
Solution 2 falsePockets