Avoiding a Loop to Extract Vectors From Rows in Dataframe in R
I have data in a text file in which the cases are stacked in a single column. I need to extract selected lines from each case into vectors. I know how to do this with a loop that parses every line, but I'd like to know if this can be done in R without using a loop.
Here's a demo data frame:
demodat <- data.frame(V1 = c(
"case01",
"sid: 112905",
"form3: 2",
"form2: 0",
"form1: An interesting comment",
"form0: 8",
"case02",
"sid: 132788",
"form3: 1",
"form2: 1",
"form1: Not sure about this",
"form0: 17",
"case03",
"sid: 102296",
"form3: 1",
"form2: 0",
"form1: This is obvious",
"form0: 12"))
Here's an example of the loop I'm using to extract case, form0, and form1 into vectors:
library(tidyverse)
datlines <- 6 # Number of rows per case
case <- NA
form0 <- NA
form1 <- NA
j <- 1
for (i in 1:nrow(demodat)) {
  if (str_sub(demodat[i, 1], 1, 4) == "case") case[j] <- demodat[i, 1]
  #
  if (str_sub(demodat[i, 1], 1, 6) == "form0:") form0[j] <- str_replace(demodat[i, 1], "form0: ", "")
  if (str_sub(demodat[i, 1], 1, 6) == "form1:") form1[j] <- str_replace(demodat[i, 1], "form1: ", "")
  #
  if (i %% datlines == 0) j <- j + 1
}
case
form0
form1
This approach works, but the real data frame has tens of thousands of rows and I need to extract many vectors from each case. I'm hoping to find a more efficient approach that avoids looping through every row of the data frame.
I would appreciate any advice.
Solution 1:[1]
Here is a simple base R way with scan and grep.
s <- scan(textConnection(demodat$V1), what = character(), sep = ":")
s <- trimws(s)
case <- grep("case", s, value = TRUE)
form0 <- s[grep("form0", s) + 1L]
form1 <- s[grep("form1", s) + 1L]
rm(s)
case
#> [1] "case01" "case02" "case03"
form0
#> [1] "8" "17" "12"
form1
#> [1] "An interesting comment" "Not sure about this" "This is obvious"
Created on 2022-05-06 by the reprex package (v2.0.1)
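One caveat with splitting on ":" is that a comment containing a colon would itself be split. A minimal sketch of a variant that matches on the line prefix instead is shown below. It assumes every case contains each field exactly once (otherwise the vectors would not line up), and get_field is a hypothetical helper name, not part of the original answer.

```r
# Demo data repeated from the question so the block runs on its own
demodat <- data.frame(V1 = c(
  "case01", "sid: 112905", "form3: 2", "form2: 0",
  "form1: An interesting comment", "form0: 8",
  "case02", "sid: 132788", "form3: 1", "form2: 1",
  "form1: Not sure about this", "form0: 17",
  "case03", "sid: 102296", "form3: 1", "form2: 0",
  "form1: This is obvious", "form0: 12"))

# Hypothetical helper: values of all lines that start with "<key>: "
get_field <- function(lines, key) {
  prefix <- paste0(key, ": ")
  hits <- startsWith(lines, prefix)
  # Drop the prefix and keep the rest of the line, colons included
  substring(lines[hits], nchar(prefix) + 1L)
}

# as.character() guards against V1 being a factor in R < 4.0
lines <- as.character(demodat$V1)

case  <- lines[startsWith(lines, "case")]
form0 <- get_field(lines, "form0")
form1 <- get_field(lines, "form1")
```

Each call is vectorized over all rows, so extracting more fields only means more calls to get_field, not a longer loop.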
Solution 2:[2]
Something along these lines?
library(dplyr)
library(tidyr)
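demodat %>%
  separate_rows(V1, sep = ',') %>% ## one row per ','-separated term
  separate(V1, into = c('parameter', 'value'), sep = ':',
           extra = 'merge', fill = 'right') ## (1)
## (1) now you can filter for parameter, e.g. parameter == 'sid' or grepl('case', parameter)
## extra = 'merge' keeps values that contain ':' intact; fill = 'right' silences the warning for the case rows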
output:
## # A tibble: 18 x 2
## parameter value
## <chr> <chr>
## 1 case01 NA
## 2 sid " 112905"
## 3 form3 " 2"
## 4 form2 " 0"
## 5 form1 " An interesting comment"
## 6 form0 " 8"
## 7 case02 NA
## ...
Edit: to keep track of the case ID, add the following to the pipeline:
## ... %>%
  mutate(case_id = ifelse(grepl('case', parameter),
                          gsub('^case(.*)$', '\\1', parameter),
                          NA)
  ) %>%
  fill(case_id, .direction = 'down')
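Putting those pieces together, one possible end-to-end pipeline that yields one row per case is sketched below. It is a variation rather than the exact code above: it tags each row with its case before splitting, splits on ": " so values carry no leading space, and assumes field names are unique within a case. The name wide is illustrative only.

```r
library(dplyr)
library(tidyr)

# Demo data repeated from the question so the block runs on its own
demodat <- data.frame(V1 = c(
  "case01", "sid: 112905", "form3: 2", "form2: 0",
  "form1: An interesting comment", "form0: 8",
  "case02", "sid: 132788", "form3: 1", "form2: 1",
  "form1: Not sure about this", "form0: 17",
  "case03", "sid: 102296", "form3: 1", "form2: 0",
  "form1: This is obvious", "form0: 12"),
  stringsAsFactors = FALSE)

wide <- demodat %>%
  # Mark the header row of each case, then carry the label down
  mutate(case_id = ifelse(grepl("^case", V1), V1, NA)) %>%
  fill(case_id, .direction = "down") %>%
  # The header rows have done their job; keep only "key: value" rows
  filter(!grepl("^case", V1)) %>%
  # extra = "merge" keeps any later ": " inside the value
  separate(V1, into = c("parameter", "value"), sep = ": ", extra = "merge") %>%
  # One column per parameter, one row per case
  pivot_wider(names_from = parameter, values_from = value)
```

The requested vectors are then plain columns, for example wide$form0 and wide$form1. All values are character and can be converted with as.integer() where appropriate.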
Solution 3:[3]
Another base R approach, using read.dcf:
demo2 <- read.dcf(textConnection(gsub('case*', 'case*: ', demodat$V1)), all = TRUE)
> demo2
case* sid form3 form2
1 01, 02, 03 112905, 132788, 102296 2, 1, 1 0, 1, 0
form1 form0
1 An interesting comment, Not sure about this, This is obvious 8, 17, 12
> class(demo2)
[1] "data.frame"
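The result above is a single record whose columns hold all three cases together. read.dcf treats blank lines as record separators, so a sketch that inserts a blank line before each case (and tags the case line as a "case" field) should give one row per case instead. This is an assumption-laden variant, not the original answer's code; dcf_lines and demo3 are illustrative names.

```r
# Demo data repeated from the question so the block runs on its own
demodat <- data.frame(V1 = c(
  "case01", "sid: 112905", "form3: 2", "form2: 0",
  "form1: An interesting comment", "form0: 8",
  "case02", "sid: 132788", "form3: 1", "form2: 1",
  "form1: Not sure about this", "form0: 17",
  "case03", "sid: 102296", "form3: 1", "form2: 0",
  "form1: This is obvious", "form0: 12"))

lines <- as.character(demodat$V1)
is_case <- startsWith(lines, "case")

# Turn "case01" into a regular "tag: value" line
lines[is_case] <- paste0("case: ", lines[is_case])

# Put an empty separator line before every case except the first:
# interleave a blank/NA vector with the lines, then drop the NAs
blank <- ifelse(is_case & seq_along(lines) > 1L, "", NA)
dcf_lines <- c(rbind(blank, lines))
dcf_lines <- dcf_lines[!is.na(dcf_lines)]

# With all = FALSE (the default) read.dcf returns a character matrix,
# one row per record and one column per field
con <- textConnection(dcf_lines)
demo3 <- read.dcf(con)
close(con)

form0 <- demo3[, "form0"]
form1 <- demo3[, "form1"]
```

Wrapping the result in as.data.frame(demo3) gives a data frame with one column per field.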
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rui Barradas |
| Solution 2 | |
| Solution 3 | Chris |
