'Spacy (in R) - How to save full name of a person

I have a large dataframe with articles from German newspapers. I want to loop through all articles and save the full name (if available) of each person mentioned in the text in a row of the corresponding article. I am using the spacyr-package in R for the NER.

The function spacy_extract_entity() actually does this job and gives me the full names of the person mentioned. Unfortunately it returns to many false-positive entities in PER column. Because of that, I tried to use spacy_parse and filter the person entities for myself.

library("spacyr")
spacy_initialize(model = "de_core_news_lg")

for (i in 1:nrow(data)){
  data[i,]<-data[i,] %>%
    mutate(
      persons =  spacy_parse( data[i,]$text, dependency = FALSE, lemma = FALSE)  %>%
        filter(
          entity == "PER_B" &  pos == "PROPN" | entity == "PER_I" &  pos == "PROPN"
        )  %>%
        pull(token) %>%
        toString(.)
    )
}

The results are much better for my purpose. For the text example below this post I get this output:

Kenji, Mizoguchi, Alain, Resnais, Rainer, Werner, Fassbinder, Alfred, Hitchcock, Werner, Herzog, Aguirre, Claude, Chabrol, Hannelore, Elsner, Steven, Spielberg

Now I want to achieve that the full name of a person isn't comma-separated (Rainer Werner Fassbinder Alfred, Hitchcock etc.).

In theory I should check for each row, if the next row has the same sentence_id and if that is TRUE if the token_id of the next row is one number higher and if this is TRUE and so on till FALSE. Then I would save these tokens as one full name comma seperated from the next name. I have a hard time writing a code chuck who does it and seems also slow to do it like this for a bigger corpus. I am really glad about any idea how to this or an alternative solution. Thanks a lot!

example-scrapy_parse output

structure(list(doc_id = c("text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 
49L, 49L, 49L, 53L, 53L, 53L, 53L, 55L, 55L), token_id = c(13L, 
14L, 16L, 17L, 19L, 20L, 21L, 23L, 24L, 26L, 27L, 32L, 15L, 16L, 
18L, 19L, 19L, 20L), token = c("Kenji", "Mizoguchi", "Alain", 
"Resnais", "Rainer", "Werner", "Fassbinder", "Alfred", "Hitchcock", 
"Werner", "Herzog", "Aguirre", "Claude", "Chabrol", "Hannelore", 
"Elsner", "Steven", "Spielberg"), pos = c("PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN"), entity = c("PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_I", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_B", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I")), row.names = c(NA, 
-18L), class = c("spacyr_parsed", "data.frame")) 

example-Text:

Es gibt immer Klassiker von großen Regisseuren, aktuell unter anderem von Kenji Mizoguchi, Alain Resnais, Rainer Werner Fassbinder, Alfred Hitchcock und Werner Herzog - dessen großer "Aguirre" ist beispielsweise noch sieben Tage zu sehen. In ähnlichen Bereichen wie Mubi sind auch lokale deutsche Anbieter aktiv, so wie zum Beispiel Alles Kino oder Filmfriend. Letztere kooperiert mit Bibliotheken in ganz Deutschland, deren Mitglieder die Mediathek mit ihrem Bibliotheksausweis nutzen können. Die Betreiber teilen mit, dass sie bereits eine leicht höhere Nutzung ihres Portals feststellen können, wobei noch nicht ganz klar sei, ob die wirklich auf die Epidemie zurückgehe. Filmfriend hat keine Eigenproduktionen, bietet aber Werkschauen einzelner Filmemacher (derzeit zum Beispiel Claude Chabrol und Hannelore Elsner) und hat eine gute Auswahl an deutschen Kinoklassikern - auch viele Produktionen der DDR-Filmschmiede Defa. Und wenn das immer noch nicht genug Futter ist? Dann startet am 6. April in den USA die Plattform Quibi, auf der Stars und Meister wie Steven Spielberg in handytauglichen Kurzformaten die Zukunft des Entertainment-Häppchens erproben, mit Studio-Unterstützung und Milliardenbudget.

EDIT: With a slightly adaption of Ivans suggetion I implemented that in a way that fits my purpose. Because PER_I and PER_B are not always detected right, you rely on the sentence and word count. My adaptaion:

 fix_per_names <- function(entity_names){
  a <- entity_names$token
  b <- entity_names$token_id
  c <- entity_names$sentence_id
  d <- entity_names$entity
  e <- NULL
  i <- 0
  
  while(i < (length(a))){
    i  <- i+1
    if(i == length(a)){
      e[i] <- a[i]
    }else if(c[i] == c[i+1] & ( b[i+1] - b[i] == 1 ) ){
      e[i] <- paste(a[i],a[i+1])
    }else if(d[i] == 'PER_I' & d[i+1] == 'PER_B'){
      e[i] <- NA
    }else {
      e[i] <- a[i]
    }
  }
  e <-  toString(e[!is.na(e)])
  return(e)
}


Solution 1:[1]

It's not the fifth wonder of the world, but it helps.

dad = structure(list(doc_id = c("text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 
49L, 49L, 49L, 53L, 53L, 53L, 53L, 55L, 55L), token_id = c(13L, 
14L, 16L, 17L, 19L, 20L, 21L, 23L, 24L, 26L, 27L, 32L, 15L, 16L, 
18L, 19L, 19L, 20L), token = c("Kenji", "Mizoguchi", "Alain", 
"Resnais", "Rainer", "Werner", "Fassbinder", "Alfred", "Hitchcock", 
"Werner", "Herzog", "Aguirre", "Claude", "Chabrol", "Hannelore", 
"Elsner", "Steven", "Spielberg"), pos = c("PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", 
"PROPN"), entity = c("PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_I", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I", "PER_B", 
"PER_B", "PER_I", "PER_B", "PER_I", "PER_B", "PER_I")), row.names = c(NA, 
-18L), class = c("spacyr_parsed", "data.frame"))

a <- dad$token
b <- dad$entity
d <- NULL

i <- 0

while(i < (length(a)-1)){
      i  <- i+1
      print(i)
      if(b[i] == 'PER_B' & b[i+1] == 'PER_I'){
         d[i] <- paste(a[i],a[i+1])
         }else if(b[i] == 'PER_I' & b[i+1] == 'PER_B'){
         d[i] <- NA
         }else if(b[i] == 'PER_B' & b[i+1] == 'PER_B'){
         d[i] <- a[i]
         }else{
         d[i] <- a[i+1]
         }
      }
      
e <- d[!is.na(d)]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ivan Bezerra Allaman