Extracting last element of a list with last()

I identify some chapter headings in a text. These headings are later used as starting points to extract subsequent text.

The starting point should be the first appearance of the heading chapterX or chapterY or, if neither heading is present, the last chapter heading.

In the reprex below, I manage to identify the first appearance of chapterX or chapterY, but for doc1, which contains neither, I get chapterA instead of chapterC. As far as I understand, dplyr's last() should be the right tool for this.

Any help? Guessing from the warning messages, I think I am doing something wrong with the 'level' (list vs character vector) I am operating on.

library(tidyverse)
#> Warning: package 'dplyr' was built under R version 4.1.3

df_docs <- tibble::tribble(
  ~doc_id,                                            ~doc_text,
   "doc1", "chapterA txt txt chapterB txt txt chapterC txt txt",
   "doc2", "chapterA txt txt chapterX txt txt chapterY txt txt",
   "doc3", "chapterA txt txt chapterY txt txt chapterX txt txt"
  )


pattern_headings <- "chapterA|chapterB|chapterC|chapterY|chapterX"

df_docs <- df_docs %>% 
  mutate(headings=str_extract_all(doc_text, regex(pattern_headings))) %>% 
  mutate(extraction_start2=case_when(
    str_detect(headings, regex("chapterX|chapterY"))==T ~ str_extract(headings, regex("chapterX|chapterY")),
    !str_detect(headings, regex("chapterX|chapterY")) ~ dplyr::last(headings)))
#> Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
#> opts(pattern)): argument is not an atomic vector; coercing
#> Warning in stri_extract_first_regex(string, pattern, opts_regex =
#> opts(pattern)): argument is not an atomic vector; coercing
#> Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
#> opts(pattern)): argument is not an atomic vector; coercing

df_docs$headings
#> [[1]]
#> [1] "chapterA" "chapterB" "chapterC"
#> 
#> [[2]]
#> [1] "chapterA" "chapterX" "chapterY"
#> 
#> [[3]]
#> [1] "chapterA" "chapterY" "chapterX"
df_docs
#> # A tibble: 3 x 4
#>   doc_id doc_text                                      headings extraction_star~
#>   <chr>  <chr>                                         <list>   <chr>           
#> 1 doc1   chapterA txt txt chapterB txt txt chapterC t~ <chr>    chapterA        
#> 2 doc2   chapterA txt txt chapterX txt txt chapterY t~ <chr>    chapterX        
#> 3 doc3   chapterA txt txt chapterY txt txt chapterX t~ <chr>    chapterY

vec <- c("A", "B", "C")
dplyr::last(vec)
#> [1] "C"

Created on 2022-04-27 by the reprex package (v2.0.1)
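
On the 'level' guess in the question: `headings` is a list column, so `dplyr::last(headings)` returns the last element of the whole list (doc3's vector), not the last heading of each row. That vector has length 3, the same as the number of rows, so `case_when()` uses it element by element, and row 1 receives its first entry, "chapterA". The warnings appear to come from stringr coercing each list element to a single string. The sketch below rebuilds the list by hand (an assumption made for illustration, mirroring the `df_docs$headings` output above) to show the difference:

```r
library(dplyr)
library(purrr)

# Hand-built copy of the list column shown in df_docs$headings
headings <- list(
  c("chapterA", "chapterB", "chapterC"),
  c("chapterA", "chapterX", "chapterY"),
  c("chapterA", "chapterY", "chapterX")
)

# On the whole list, last() returns the final list element, i.e. doc3's vector
last_of_list <- dplyr::last(headings)
last_of_list
#> [1] "chapterA" "chapterY" "chapterX"

# Applied per element, last() returns each document's final heading
last_per_doc <- map_chr(headings, dplyr::last)
last_per_doc
#> [1] "chapterC" "chapterY" "chapterX"
```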



Solution 1:[1]

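One option is to collapse each document's headings into a single comma-separated string, so that str_detect(), str_extract() and word() all operate on an ordinary character column: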

library(tidyverse)

df_docs <- tribble(
  ~doc_id, ~doc_text,
  "doc1",  "chapterA txt txt chapterB txt txt chapterC txt txt",
  "doc2",  "chapterA txt txt chapterX txt txt chapterY txt txt",
  "doc3",  "chapterA txt txt chapterY txt txt chapterX txt txt"
)

pattern <- "chapterA|chapterB|chapterC|chapterY|chapterX"

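df_docs <- df_docs %>%
  mutate(
    # Collapse each document's matches into one string, e.g. "chapterA, chapterB, chapterC"
    headings = map_chr(str_extract_all(doc_text, pattern), ~ str_c(.x, collapse = ", ")),
    extraction_start2 = case_when(
      # chapterX or chapterY present: take whichever appears first
      str_detect(headings, "chapterX|chapterY") ~ str_extract(headings, "chapterX|chapterY"),
      # otherwise: take the last heading, i.e. the last space-separated word
      TRUE                                      ~ word(headings, -1)
    )
  )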


df_docs
#> # A tibble: 3 × 4
#>   doc_id doc_text                                      headings extraction_star…
#>   <chr>  <chr>                                         <chr>    <chr>           
#> 1 doc1   chapterA txt txt chapterB txt txt chapterC t… chapter… chapterC        
#> 2 doc2   chapterA txt txt chapterX txt txt chapterY t… chapter… chapterX        
#> 3 doc3   chapterA txt txt chapterY txt txt chapterX t… chapter… chapterY

Created on 2022-04-27 by the reprex package (v2.0.1)
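
If the list column should be kept as it is, a possible alternative is to apply the selection rule to each document's vector of headings with `purrr::map_chr()`. This is only a sketch: `pick_start()` is a hypothetical helper introduced here for illustration, not part of the original answer or of any package, and the `NA` branch for documents without any heading is an added assumption.

```r
library(dplyr)
library(purrr)
library(stringr)

df_docs <- tibble::tribble(
  ~doc_id, ~doc_text,
  "doc1",  "chapterA txt txt chapterB txt txt chapterC txt txt",
  "doc2",  "chapterA txt txt chapterX txt txt chapterY txt txt",
  "doc3",  "chapterA txt txt chapterY txt txt chapterX txt txt"
)

pattern_headings <- "chapterA|chapterB|chapterC|chapterY|chapterX"

# Hypothetical helper: choose the starting heading from ONE document's
# character vector of headings.
pick_start <- function(h) {
  preferred <- str_subset(h, "chapterX|chapterY")
  if (length(preferred) > 0) {
    preferred[[1]]    # first appearance of chapterX or chapterY
  } else if (length(h) > 0) {
    dplyr::last(h)    # otherwise the last heading found in this document
  } else {
    NA_character_     # assumption: no heading at all yields NA
  }
}

df_docs <- df_docs %>%
  mutate(
    headings = str_extract_all(doc_text, pattern_headings),  # stays a list column
    extraction_start2 = map_chr(headings, pick_start)        # one string per row
  )

df_docs$extraction_start2
#> [1] "chapterC" "chapterX" "chapterY"
```

Because `pick_start()` only ever sees one character vector at a time, `dplyr::last()` works on the intended level and stringr has nothing to coerce.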

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Carl