Is there a function to scrape the notes sections of PowerPoint slides?

I am attempting to read through roughly 100 PowerPoint slides and extract the notes section of each slide. I will do some text wrangling and write to CSV afterwards, but I need to get the notes into a workable format first.

I am currently using the read_pptx function from the officer package, but I am open to whatever packages are needed. It doesn't seem to pull in the notes, but I may just be looking at this wrong.

Here is a bit of what I've tried:

library(officer)

ppt_var <- read_pptx('test_presentation.pptx')
view(ppt_var)

Ideally, I could get the text of each notes slide added to individual variables to write to a csv. I am confident that I can handle the manipulation once I get the notes read in, but cannot seem to get that part down.

Thank you for any pointers or support!



Solution 1:[1]

How to do that is shown in the code here: https://github.com/davidgohel/officer/issues/117 .

The following is based on that code:

library(magrittr)
library(officer)
library(xml2)

p <- read_pptx("mypresentation.pptx")
notes_dir <- file.path(p$package_dir, "ppt", "notesSlides")
files <- list.files(pattern = "\\.xml$", path = notes_dir, full.names = TRUE)

Notes <- lapply(files,
 . %>% 
   read_xml %>%
   xml_find_all("//a:t") %>%
   xml_text
)
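
To get from the Notes list above to the CSV the question asks for, each element can be collapsed to a single string and paired with its file name. The following is a minimal sketch in base R, not part of the linked issue: notes_to_df is a hypothetical helper name, and the sample files and Notes values only stand in for the objects created above. Because list.files sorts names alphabetically (notesSlide10.xml comes before notesSlide2.xml), the helper reorders by the number in the file name. That number is not guaranteed to match the slide's position in the deck.

```r
# Hypothetical helper: one row per notes file, with the text runs joined.
notes_to_df <- function(notes, files) {
  # Pull the number out of "notesSlide<N>.xml" so 2 sorts before 10.
  n <- as.integer(gsub("[^0-9]", "", basename(files)))
  ord <- order(n)
  data.frame(
    file = basename(files)[ord],
    notes = vapply(notes[ord], paste, character(1), collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Stand-in values; in practice use the `files` and `Notes` objects from above.
files <- c("ppt/notesSlides/notesSlide1.xml",
           "ppt/notesSlides/notesSlide10.xml",
           "ppt/notesSlides/notesSlide2.xml")
Notes <- list(c("First", "note"), "Tenth note", character(0))

notes_df <- notes_to_df(Notes, files)
write.csv(notes_df, file.path(tempdir(), "notes.csv"), row.names = FALSE)
```

A file with no text runs ends up as an empty string rather than being dropped, so the row count stays equal to the number of notes files.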

Solution 2:[2]

Assuming you are using the DocumentFormat.OpenXml package (the Open XML SDK) in C#, a more native way would be:

    public static SlidePart GetSlidePart(PresentationDocument pptxDoc, int index)
    {
        // Get the relationship ID of the slide at the given index.
        PresentationPart presentationPart = pptxDoc.PresentationPart;
        OpenXmlElementList slideIds = presentationPart.Presentation.SlideIdList.ChildElements;
        string relId = (slideIds[index] as SlideId).RelationshipId;

        // Get the slide part from the relationship ID.
        return (SlidePart)presentationPart.GetPartById(relId);
    }

    public static string GetNoteText(PresentationDocument pptxDoc, int index)
    {
        // Get the slide part.
        SlidePart slidePart = GetSlidePart(pptxDoc, index);
        // Extract the note text; NotesSlidePart is null when the slide has no notes.
        return slidePart.NotesSlidePart?.NotesSlide.InnerText ?? string.Empty;
    }
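
The C# code above reaches the notes through the slide part rather than by listing the notesSlides folder. For readers staying in R, roughly the same lookup can be sketched with xml2 by reading a slide's relationships file, which in an unpacked .pptx sits under ppt/slides/_rels/ (for example slide1.xml.rels). This is only a sketch under stated assumptions: notes_target is a hypothetical helper, the sample .rels content is made up for illustration, and the code assumes the notes relationship's Type ends in "/notesSlide". It does not read the slide order from presentation.xml.

```r
library(xml2)

# Hypothetical helper: given the path to a slide's .rels file, return the
# file name of the notes part it points to, or NA if there is none.
notes_target <- function(rels_path) {
  rels <- read_xml(rels_path)
  # local-name() sidesteps the default namespace on <Relationship> elements.
  rel_nodes <- xml_find_all(rels, "//*[local-name()='Relationship']")
  types <- xml_attr(rel_nodes, "Type")
  targets <- xml_attr(rel_nodes, "Target")
  hit <- targets[grepl("/notesSlide$", types)]
  if (length(hit) == 0) NA_character_ else basename(hit[1])
}

# Stand-in .rels file so the sketch runs without a real presentation.
rels_file <- tempfile(fileext = ".rels")
writeLines(c(
  '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">',
  '  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>',
  '  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/notesSlide" Target="../notesSlides/notesSlide3.xml"/>',
  '</Relationships>'
), rels_file)

target <- notes_target(rels_file)
```

With an officer object p as in Solution 1, the real .rels files should be found under file.path(p$package_dir, "ppt", "slides", "_rels"), and the returned file name can be matched against the files listed from notesSlides.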

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 jmerrill2001