How to subset text from a word docx AFTER a matching phrase

I would like to subset text from an original Word docx ("original.docx") into a new Word docx ("desired.docx"), keeping everything from the matching phrase "Drop Text Before Here" onward and preserving the original formatting of the retained text.

I have modified the example from the {officer} package documentation for body_remove() to show the original and desired results (in docx form). The difference is that the example in the documentation keeps the portion of text before, and I would like to keep the text after the matched phrase.

library(officer)

# Original text
str1 <- rep("Lorem ipsum dolor sit amet, consectetur adipiscing elit. ", 3)
str1 <- paste(str1, collapse = "")

str2 <- "Drop Text Before Here"

str3 <- rep("Aenean venenatis varius elit et fermentum vivamus vehicula. ", 3)
str3 <- paste(str3, collapse = "")

# Create original_docx prior to subset
original_docx <- read_docx()
original_docx <- body_add_par(original_docx, value = str1, style = "Normal")
original_docx <- body_add_par(original_docx, value = str2, style = "centered")
original_docx <- body_add_par(original_docx, value = str3, style = "Normal")

# Save original docx in local directory
print(original_docx, "original.docx")

# Desired docx after subset starting at "Drop Text Before Here"
desired_docx <- read_docx()
desired_docx <- body_add_par(desired_docx, value = str2, style = "centered")
desired_docx <- body_add_par(desired_docx, value = str3, style = "Normal")

# Save desired docx in local directory
print(desired_docx, "desired.docx")

Created on 2022-04-09 by the reprex package (v2.0.1)



Solution 1:[1]

You might use a custom function that steps backward through the document from the current cursor position, removing the body element at each step. It stops when it hits the error that signals the beginning of the document.

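# Recursively remove every body element that precedes the cursor.
body_remove_before_cursor <- function(x) {
  tryCatch(
    {
      # Step back one element, delete it, then repeat.
      x <- officer::cursor_backward(x)
      x <- officer::body_remove(x)
      body_remove_before_cursor(x)
    },
    error = function(e) {
      # An error is taken to mean the start of the document was reached.
      return(x)
    }
  )
}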

library(officer)

# str2 is the phrase defined in the question ("Drop Text Before Here").
desired_2_docx <- read_docx("original.docx")
desired_2_docx <- cursor_reach(desired_2_docx, str2)
desired_2_docx <- body_remove_before_cursor(desired_2_docx)
print(desired_2_docx, "desired_2.docx")
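
If you would rather not depend on an error to detect the top of the document, or the document is long enough that recursion becomes a concern, a loop-based variant is possible. The following is only a sketch. `body_keep_from_phrase()` is a hypothetical helper name, not part of {officer}. It assumes the document body contains only paragraphs and tables, so that `doc_index` from `docx_summary()` lines up with the position of each body element. It finds the first element containing the phrase, then repeatedly removes the first element of the document until the match is at the top.

```r
library(officer)

# Hypothetical helper (not part of officer): drop every body element that
# precedes the first element whose text contains `phrase`.
# Assumption: the body holds only paragraphs and tables, so doc_index
# matches the position of each element in the body.
body_keep_from_phrase <- function(x, phrase) {
  content <- docx_summary(x)
  hits <- content$doc_index[grepl(phrase, content$text, fixed = TRUE)]
  if (length(hits) == 0) {
    stop("Phrase not found in document: ", phrase)
  }
  n_drop <- min(hits) - 1  # number of body elements before the match
  for (i in seq_len(n_drop)) {
    x <- cursor_begin(x)   # move to the current first element
    x <- body_remove(x)    # and remove it
  }
  x
}

# Build a small document shaped like the one in the question.
doc <- read_docx()
doc <- body_add_par(doc, "Lorem ipsum dolor sit amet.", style = "Normal")
doc <- body_add_par(doc, "Drop Text Before Here", style = "centered")
doc <- body_add_par(doc, "Aenean venenatis varius elit.", style = "Normal")
path <- print(doc, target = tempfile(fileext = ".docx"))

# Trim it and inspect what is left.
trimmed <- body_keep_from_phrase(read_docx(path), "Drop Text Before Here")
remaining <- docx_summary(trimmed)
remaining[, c("doc_index", "style_name", "text")]
```

Checking the result with `docx_summary()` before writing the file is a cheap way to confirm that only the intended paragraphs, with their style names, remain.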

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow