Slicing dataset for relevant information

I have a large dataset showing the social network links in a corpus. I want to extract just the entities from this corpus. Within the dataset (sample below), the entities can be extracted by capturing the first set of values in the entity2 column for the first entity in each paragraph.

My sample dataset:

structure(list(X = c(6166L, 6168L, 6170L, 6175L, 6177L, 6180L, 
34062L, 34063L, 34064L, 34065L, 34066L), entity1 = c("Epicurus", 
"Epicurus", "Epicurus", "Charles Lamb", "Charles Lamb", "Roman", 
"Egypt", "Egypt", "Egypt", "India", "India"), type1 = c("person", 
"person", "person", "person", "person", "group", "geopolitical area", 
"geopolitical area", "geopolitical area", "geopolitical area", 
"geopolitical area"), entity2 = c("Epic", "Charles Lamb", "Roman", 
"Charles Lamb", "Roman", "Roman", "Egypt", "India", "Arabia", 
"India", "Arabia"), type2 = c("person", "person", "group", "person", 
"group", "group", "geopolitical area", "geopolitical area", "geopolitical area", 
"geopolitical area", "geopolitical area"), text = c("plutarch.txt", 
"plutarch.txt", "plutarch.txt", "plutarch.txt", "plutarch.txt", 
"plutarch.txt", "civilization.txt", "civilization.txt", "civilization.txt", 
"civilization.txt", "civilization.txt"), paragraph = c(49L, 49L, 
49L, 49L, 49L, 49L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA, 
-11L))

For this sample, the desired output would include only the rows for Epicurus and Egypt. The dataset has 150,000 lines, so this needs to be done programmatically. The paragraphs are numbered in the paragraph column, and the numbering resets for each work, so the numbers are not unique across the dataset. I'm not sure whether the tidyverse has anything for this, or whether I need to do something like extracting the first set of rows with duplicated values in the entity1 column for each paragraph.

Any help is appreciated.

r


Solution 1:[1]

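library(data.table)
setDT(df)  # convert the data.frame to a data.table by reference
# first row of each paragraph; text is part of the key because paragraph numbers reset per work
df[, .SD[1], by = .(text, paragraph)]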
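The question notes that paragraph numbers reset for each work, so the grouping key probably needs both text and paragraph. A minimal sketch comparing the two groupings, using hypothetical toy data (the file names a.txt and b.txt and the object names are made up for illustration):

```r
library(data.table)

# Hypothetical toy data: paragraph 1 occurs in two different works
toy <- data.table(
  text      = c("a.txt", "a.txt", "b.txt", "b.txt"),
  paragraph = c(1L, 1L, 1L, 1L),
  entity1   = c("Epicurus", "Roman", "Egypt", "India")
)

# Grouping by paragraph alone treats both works as one group
by_paragraph <- toy[, .SD[1], by = paragraph]

# Grouping by text and paragraph keeps one row per work
by_text_paragraph <- toy[, .SD[1], by = .(text, paragraph)]
```

Here by_paragraph has a single row, while by_text_paragraph has one row for each work.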

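For large data, this option may be more efficient:

# .I[1] is the row index of the first row in each group; subset df by those indices
df[df[, .I[1], by = .(text, paragraph)]$V1, ]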

       X  entity1             type1 entity2             type2             text paragraph
1:  6166 Epicurus            person    Epic            person     plutarch.txt        49
2: 34062    Egypt geopolitical area   Egypt geopolitical area civilization.txt        15
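The output above keeps one row per paragraph. If the goal is instead every row belonging to the first entity in each paragraph (three rows each for Epicurus and Egypt in the sample), one possible sketch uses rleid() to mark the first run of identical entity1 values within each group. This assumes the rows are already in document order; only the columns needed here are rebuilt from the question's sample, and the names idx, first_entity_rows and entities are illustrative.

```r
library(data.table)

# Sample rows from the question (only the columns needed for this sketch)
df <- data.table(
  entity1 = c("Epicurus", "Epicurus", "Epicurus", "Charles Lamb",
              "Charles Lamb", "Roman", "Egypt", "Egypt", "Egypt",
              "India", "India"),
  entity2 = c("Epic", "Charles Lamb", "Roman", "Charles Lamb", "Roman",
              "Roman", "Egypt", "India", "Arabia", "India", "Arabia"),
  text      = rep(c("plutarch.txt", "civilization.txt"), times = c(6, 5)),
  paragraph = rep(c(49L, 15L), times = c(6, 5))
)

# Row indices of the first run of identical entity1 values
# within each (text, paragraph) group
idx <- df[, .I[rleid(entity1) == 1L], by = .(text, paragraph)]$V1
first_entity_rows <- df[idx]

# Distinct entities named in the entity2 column of those rows
entities <- unique(first_entity_rows$entity2)
```

For readers who prefer the tidyverse, a roughly equivalent dplyr filter would be group_by(text, paragraph) followed by filter(entity1 == first(entity1)), with the caveat that it also keeps any later, non-adjacent rows that repeat the first entity.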

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Yuriy Saraykin