'Extracting Embedded Sheet Objects in Presentations
I need to extract embedded excel sheets from several .pptx's so that we can ultimately aggregate the data into 1 large data table. I've found a non-ideal solution to this using python and some manual formatting but am looking to fully automate this process. I'll do my best to provide all relevant information but this is my first attempt at posting a question so forgive me if it's lacking. Happy to provide snippets of my (very very very ugly) code and/or any additional information/clarification as necessary.
The workflow for extracting the embedded worksheets look like this:
- Loop through all .pptx's and extract the embedded .xlsx files using the zipfile library in the '...\ppt\embeddings' directory. The embedded .xlsx files are placed into another directory within respective folders representing each .pptx. There are anywhere between 5 to 26 .xlsx files in each folder.
- Loop through all the .xlsx files in each folder and create an aggregated .xlsx file with every active sheet from each .xlsx file. The active sheets happens to be the sheet that contains the table/data presented in the .pptx. (each book has several irrelevant sheets...not sure why this is the case)
- There's additional data in a lot of the sheets that need to be removed so the sheets are reviewed against the .pptx's and manually cleaned (all irrelevant data is deleted and the relevant data is moved to the top left corner of the spreadsheet).
- The aggregated .xlsx files are looped through and the data tables are converted to pandas dataframes.
I'm looking for alternative automated methods for step 3. I've considered converting the embedded sheets to table objects and extracting the data as I already have code that can parse through table objects but I've not been able to find anything regarding converting to table objects. This would greatly streamline my code.
I was also thinking that there must be some kind of indicator telling the .pptx file to display just the selected data but I don't know how to find it, let alone search for that concept. If there was, I'm wondering if I could use that to automatically extract the relevant data from the sheets instead of manually cleaning them.
Not sure if this is relevant, but I'm using xlwings to interface with excel.
Thanks in advance for any insight/help you guys can provide.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
