Slicing dataset for relevant information
I have a large dataset showing the social network links in a corpus. I want to extract just the entities from this corpus. Within the dataset (sample below), the entities can be extracted by capturing the first set of values in the entity2 column for the first entity in each paragraph.
My sample dataset:
structure(list(X = c(6166L, 6168L, 6170L, 6175L, 6177L, 6180L,
34062L, 34063L, 34064L, 34065L, 34066L), entity1 = c("Epicurus",
"Epicurus", "Epicurus", "Charles Lamb", "Charles Lamb", "Roman",
"Egypt", "Egypt", "Egypt", "India", "India"), type1 = c("person",
"person", "person", "person", "person", "group", "geopolitical area",
"geopolitical area", "geopolitical area", "geopolitical area",
"geopolitical area"), entity2 = c("Epic", "Charles Lamb", "Roman",
"Charles Lamb", "Roman", "Roman", "Egypt", "India", "Arabia",
"India", "Arabia"), type2 = c("person", "person", "group", "person",
"group", "group", "geopolitical area", "geopolitical area", "geopolitical area",
"geopolitical area", "geopolitical area"), text = c("plutarch.txt",
"plutarch.txt", "plutarch.txt", "plutarch.txt", "plutarch.txt",
"plutarch.txt", "civilization.txt", "civilization.txt", "civilization.txt",
"civilization.txt", "civilization.txt"), paragraph = c(49L, 49L,
49L, 49L, 49L, 49L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA,
-11L))
The desired output would include just the rows for Epicurus and Egypt. The dataset has 150,000 rows, so this needs to be done programmatically. The paragraphs are numbered in the paragraph column, and these numbers reset for each work, so they are not unique across the dataset. I'm not sure whether the tidyverse has anything for this, or whether I need to do something like extracting the first run of rows with duplicated values in the entity1 column for each paragraph.
Any help is appreciated
Solution 1:[1]
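library(data.table)
setDT(df)  # convert df to a data.table by reference
# Paragraph numbers reset for each work, so group by text and paragraph.
df[, .SD[1], by = .(text, paragraph)]  # first row of each group
For large data, this option may be more efficient:
df[df[, .I[1], by = .(text, paragraph)]$V1, ]  # find first-row indices, then subset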
       X  entity1             type1 entity2             type2             text paragraph
1:  6166 Epicurus            person    Epic            person     plutarch.txt        49
2: 34062    Egypt geopolitical area   Egypt geopolitical area civilization.txt        15
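The output above keeps only the first row per paragraph, whereas the question asks for all rows belonging to the first entity in each paragraph. A minimal sketch of that variant follows, assuming "first entity" means the value of entity1 in the first row of each text and paragraph group. The name `first_entity_rows` is illustrative, and the small table is a cut-down copy of the question's sample, not the full dataset.

```r
library(data.table)

# Cut-down stand-in for the question's sample (values copied from it).
df <- data.table(
  entity1 = c("Epicurus", "Epicurus", "Epicurus", "Charles Lamb",
              "Charles Lamb", "Roman", "Egypt", "Egypt", "Egypt",
              "India", "India"),
  entity2 = c("Epic", "Charles Lamb", "Roman", "Charles Lamb", "Roman",
              "Roman", "Egypt", "India", "Arabia", "India", "Arabia"),
  text = rep(c("plutarch.txt", "civilization.txt"), times = c(6, 5)),
  paragraph = rep(c(49L, 15L), times = c(6, 5))
)

# For each (text, paragraph) group, collect the row indices whose entity1
# equals the entity1 of the group's first row, then subset df by them.
# Assumption: entity1 contains no NA values.
idx <- df[, .I[entity1 == entity1[1L]], by = .(text, paragraph)]$V1
first_entity_rows <- df[idx]

print(first_entity_rows)
```

On the sample this keeps the three Epicurus rows and the three Egypt rows. Grouping by both text and paragraph matters because paragraph numbers repeat across works.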
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Yuriy Saraykin |
