Extracting last element of a list with last()
I identify chapter headings in a text. These headings later serve as starting points for extracting the subsequent text.
The starting point should be the first occurrence of the heading chapterX or chapterY or, if neither of them is present, the last chapter heading.
In the reprex below, I manage to identify the first occurrence of chapterX or chapterY, but for doc1 I get chapterA instead of chapterC. As far as I understand, dplyr's last() should be the right tool here.
Any help? Judging from the warning messages, I suspect I am operating on the wrong 'level' (list vs. character vector).
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 4.1.3
df_docs <- tibble::tribble(
  ~doc_id, ~doc_text,
  "doc1", "chapterA txt txt chapterB txt txt chapterC txt txt",
  "doc2", "chapterA txt txt chapterX txt txt chapterY txt txt",
  "doc3", "chapterA txt txt chapterY txt txt chapterX txt txt"
)
pattern_headings <- "chapterA|chapterB|chapterC|chapterY|chapterX"
df_docs <- df_docs %>%
  mutate(headings=str_extract_all(doc_text, regex(pattern_headings))) %>%
  mutate(extraction_start2=case_when(
    str_detect(headings, regex("chapterX|chapterY"))==T ~ str_extract(headings, regex("chapterX|chapterY")),
    !str_detect(headings, regex("chapterX|chapterY")) ~ dplyr::last(headings)))
#> Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
#> opts(pattern)): argument is not an atomic vector; coercing
#> Warning in stri_extract_first_regex(string, pattern, opts_regex =
#> opts(pattern)): argument is not an atomic vector; coercing
#> Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
#> opts(pattern)): argument is not an atomic vector; coercing
df_docs$headings
#> [[1]]
#> [1] "chapterA" "chapterB" "chapterC"
#>
#> [[2]]
#> [1] "chapterA" "chapterX" "chapterY"
#>
#> [[3]]
#> [1] "chapterA" "chapterY" "chapterX"
df_docs
#> # A tibble: 3 x 4
#> doc_id doc_text headings extraction_star~
#> <chr> <chr> <list> <chr>
#> 1 doc1 chapterA txt txt chapterB txt txt chapterC t~ <chr> chapterA
#> 2 doc2 chapterA txt txt chapterX txt txt chapterY t~ <chr> chapterX
#> 3 doc3 chapterA txt txt chapterY txt txt chapterX t~ <chr> chapterY
vec <- c("A", "B", "C")
dplyr::last(vec)
#> [1] "C"
Created on 2022-04-27 by the reprex package (v2.0.1)
Solution 1:[1]
Try this ...
library(tidyverse)
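df_docs <- tribble(
  ~doc_id, ~doc_text,
  "doc1", "chapterA txt txt chapterB txt txt chapterC txt txt",
  "doc2", "chapterA txt txt chapterX txt txt chapterY txt txt",
  "doc3", "chapterA txt txt chapterY txt txt chapterX txt txt"
)
pattern <- "chapterA|chapterB|chapterC|chapterY|chapterX"
df_docs <- df_docs %>%
  mutate(
    # Collapse each document's headings into one comma-separated string,
    # so that headings is a plain character column rather than a list column.
    headings = map_chr(str_extract_all(doc_text, pattern), ~ str_c(.x, collapse = ", ")),
    extraction_start2 = case_when(
      # First occurrence of chapterX or chapterY, if either is present ...
      str_detect(headings, "chapterX|chapterY") ~ str_extract(headings, "chapterX|chapterY"),
      # ... otherwise the last heading (the last word of the collapsed string).
      TRUE ~ word(headings, -1)
    )
  )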
df_docs
#> # A tibble: 3 × 4
#> doc_id doc_text headings extraction_star…
#> <chr> <chr> <chr> <chr>
#> 1 doc1 chapterA txt txt chapterB txt txt chapterC t… chapter… chapterC
#> 2 doc2 chapterA txt txt chapterX txt txt chapterY t… chapter… chapterX
#> 3 doc3 chapterA txt txt chapterY txt txt chapterX t… chapter… chapterY
Created on 2022-04-27 by the reprex package (v2.0.1)
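As for why the original attempt returns chapterA for doc1, as far as I can tell: inside mutate(), headings refers to the whole list column, so dplyr::last(headings) returns the last element of that list, i.e. all three headings of doc3, rather than the last heading of each row. case_when() then uses that length-3 vector row by row, and its first entry, chapterA, ends up in doc1. The str_detect() and str_extract() warnings point to the same mismatch, since the list has to be coerced to strings. Below is a minimal sketch of a row-wise alternative that keeps headings as a list column. It assumes purrr is available, and pick_start() is a hypothetical helper name introduced only for this illustration, not a function from any package.

```r
library(dplyr)
library(purrr)
library(stringr)

df_docs <- tibble::tribble(
  ~doc_id, ~doc_text,
  "doc1", "chapterA txt txt chapterB txt txt chapterC txt txt",
  "doc2", "chapterA txt txt chapterX txt txt chapterY txt txt",
  "doc3", "chapterA txt txt chapterY txt txt chapterX txt txt"
)

pattern_headings <- "chapterA|chapterB|chapterC|chapterY|chapterX"

# One character vector of headings per document (a list of length 3).
headings_list <- str_extract_all(df_docs$doc_text, pattern_headings)

# last() applied to the list returns its last element: the headings of doc3,
# not the last heading of each document.
last_of_list <- dplyr::last(headings_list)

# Hypothetical helper (an assumption for this sketch): return the first
# preferred heading if one is present, otherwise the last heading found.
pick_start <- function(h, preferred = c("chapterX", "chapterY")) {
  if (length(h) == 0) return(NA_character_)  # no heading found at all
  hits <- h[h %in% preferred]
  if (length(hits) > 0) hits[[1]] else h[[length(h)]]
}

# Apply the helper to each row's heading vector separately.
res <- df_docs %>%
  mutate(
    headings = headings_list,
    extraction_start2 = map_chr(headings, pick_start)
  )

res$extraction_start2
```

With the example data this should give chapterC, chapterX and chapterY for doc1 to doc3, and headings stays a list column in case the individual headings are needed later.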
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Carl |
