Avoiding a Loop to Extract Vectors From Rows in a Data Frame in R

I have data in a text file in which the cases are stacked in a single column. I need to extract selected lines from each case into vectors. I know how to do this with a loop that parses every line, but I'd like to know if this can be done in R without using a loop.

Here's a demo data frame:

demodat <- data.frame(V1 = c(
"case01",
"sid: 112905",
"form3: 2",
"form2: 0",
"form1: An interesting comment",
"form0: 8",
"case02",
"sid: 132788",
"form3: 1",
"form2: 1",
"form1: Not sure about this",
"form0: 17",
"case03",
"sid: 102296",
"form3: 1",
"form2: 0",
"form1: This is obvious",
"form0: 12"))

Here's an example of the loop I'm using to extract case, form0, and form1 into vectors:

library(tidyverse)

datlines <- 6  # Number of rows per case
case <- NA
form0 <- NA
form1 <- NA
j <- 1
for(i in 1:nrow(demodat)) {
   if (str_sub(demodat[i,1],1,4)=="case") case[j] <- demodat[i,1]
   #
   if (str_sub(demodat[i,1],1,6)=="form0:") form0[j] <- str_replace(demodat[i,1],"form0: ","")   
   if (str_sub(demodat[i,1],1,6)=="form1:") form1[j] <- str_replace(demodat[i,1],"form1: ","")      
   #
   if(i%%datlines == 0) j <- j + 1
   }

case
form0
form1

This approach works, but the real data frame has tens of thousands of rows and I need to extract many vectors from each case. I'm hoping to find a more efficient approach that avoids looping through every row of the data frame.

I would appreciate any advice.



Solution 1:[1]

Here is a simple base R way with scan and grep.

s <- scan(textConnection(demodat$V1), what = character(), sep = ":")
s <- trimws(s)
case <- grep("case", s, value = TRUE)
form0 <- s[grep("form0", s) + 1L]
form1 <- s[grep("form1", s) + 1L]
rm(s)

case
#> [1] "case01" "case02" "case03"
form0
#> [1] "8"  "17" "12"
form1
#> [1] "An interesting comment" "Not sure about this"    "This is obvious"

Created on 2022-05-06 by the reprex package (v2.0.1)
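
One caveat with the scan approach: splitting on ":" means a comment that itself contains a colon (or quote characters) may be tokenized differently than expected. A minimal sketch of a prefix-based alternative follows. It is not part of the original answer; extract_field is a hypothetical helper name, and it assumes every line of interest starts with "<key>: " and that every case contains each requested field, otherwise the vectors will not line up.

```r
# Demo data from the question as a plain character vector
# (equivalent to demodat$V1).
txt <- c(
  "case01", "sid: 112905", "form3: 2", "form2: 0",
  "form1: An interesting comment", "form0: 8",
  "case02", "sid: 132788", "form3: 1", "form2: 1",
  "form1: Not sure about this", "form0: 17",
  "case03", "sid: 102296", "form3: 1", "form2: 0",
  "form1: This is obvious", "form0: 12"
)

# Hypothetical helper: values of all lines that start with "<key>: ".
# Everything after the prefix is kept, including any further colons.
extract_field <- function(x, key) {
  prefix <- paste0(key, ": ")
  hits <- x[startsWith(x, prefix)]
  substring(hits, nchar(prefix) + 1L)
}

case  <- txt[startsWith(txt, "case")]
form0 <- extract_field(txt, "form0")
form1 <- extract_field(txt, "form1")

# Several fields at once, returned as a named list of character vectors.
fields <- c("sid", "form3", "form2", "form1", "form0")
extracted <- lapply(setNames(fields, fields), extract_field, x = txt)
```

Because startsWith and substring are vectorized, no explicit loop over rows is needed, and adding another field only means adding its name to fields.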

Solution 2:[2]

Something along these lines?

library(dplyr)
library(tidyr)

demodat %>%
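    separate_rows(V1, sep = ',') %>% ## splits ','-separated terms into rows (a no-op for the demo data, which has no commas)
    separate(V1, into = c('parameter', 'value'), sep = ':') ## (1) 'case' rows have no ':', so their value is NA (with a warning)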

## (1) now you can filter for parameter, e.g. 'sid' or grepl('case', parameter)

output:


## # A tibble: 18 x 2
##    parameter value                    
##    <chr>     <chr>                    
##  1 case01     NA                      
##  2 sid       " 112905"                
##  3 form3     " 2"                     
##  4 form2     " 0"                     
##  5 form1     " An interesting comment"
##  6 form0     " 8"                     
##  7 case02     NA       
## ...

Edit: to keep track of the case ID, add the following to the pipeline:

## ... %>%
mutate(case_id = ifelse(grepl('case', parameter),
                        gsub('^case(.*)$','\\1',parameter),
                        NA)
       ) %>%
fill(case_id, .direction = 'down')
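
For comparison, the same idea of carrying the case ID down the rows can be sketched in base R, with cumsum playing the role of fill. This is an illustration rather than the answerer's code: the variable names (case_idx, wide, and so on) are made up here, and it assumes each case block starts with a line beginning with "case" and that every other line has the form "key: value".

```r
# Demo data from the question as a plain character vector
# (equivalent to demodat$V1).
txt <- c(
  "case01", "sid: 112905", "form3: 2", "form2: 0",
  "form1: An interesting comment", "form0: 8",
  "case02", "sid: 132788", "form3: 1", "form2: 1",
  "form1: Not sure about this", "form0: 17",
  "case03", "sid: 102296", "form3: 1", "form2: 0",
  "form1: This is obvious", "form0: 12"
)

# Running case index: increases by one at every "case" line.
is_case  <- startsWith(txt, "case")
case_idx <- cumsum(is_case)

# Split the remaining lines into key and value at the first colon.
kv    <- txt[!is_case]
key   <- sub(":.*$", "", kv)
value <- sub("^[^:]*:[[:space:]]*", "", kv)

# Fill a case-by-key character matrix via two-column matrix indexing.
# Fields missing from a case simply stay NA.
cases <- txt[is_case]
keys  <- unique(key)
wide  <- matrix(NA_character_, nrow = length(cases), ncol = length(keys),
                dimnames = list(cases, keys))
wide[cbind(case_idx[!is_case], match(key, keys))] <- value
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
```

Each field is then available as a column, for example wide$form0 or wide$form1, with the case names as row names.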

Solution 3:[3]

Another base R option, using read.dcf:

demo2 <- read.dcf(textConnection(gsub('case*', 'case*: ', demodat$V1)), all = TRUE)
> demo2
       case*                    sid   form3   form2
1 01, 02, 03 112905, 132788, 102296 2, 1, 1 0, 1, 0
                                                         form1     form0
1 An interesting comment, Not sure about this, This is obvious 8, 17, 12
> class(demo2)
[1] "data.frame"
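
In the output above all cases end up in a single record, because read.dcf separates records at blank lines and the data has none. A possible variation, sketched here under the assumption that each case starts with a line beginning with "case", is to append a blank line to each case block before parsing, which should give one row per case. The temporary file and the variable names are illustrative choices, not part of the original answer.

```r
# Demo data from the question as a plain character vector
# (equivalent to demodat$V1).
txt <- c(
  "case01", "sid: 112905", "form3: 2", "form2: 0",
  "form1: An interesting comment", "form0: 8",
  "case02", "sid: 132788", "form3: 1", "form2: 1",
  "form1: Not sure about this", "form0: 17",
  "case03", "sid: 102296", "form3: 1", "form2: 0",
  "form1: This is obvious", "form0: 12"
)

# Turn "case01" into a "case: 01" field so every line is "key: value".
is_case <- startsWith(txt, "case")
dcf <- sub("^case", "case: ", txt)

# Group lines by case and end each group with a blank line,
# which read.dcf treats as a record separator.
pieces  <- split(dcf, cumsum(is_case))
records <- unlist(lapply(pieces, function(p) c(p, "")), use.names = FALSE)

# Write to a temporary file and parse it as DCF.
tmp <- tempfile(fileext = ".dcf")
writeLines(records, tmp)
m <- read.dcf(tmp)   # character matrix: one row per case, one column per field
unlink(tmp)

form0 <- m[, "form0"]
form1 <- m[, "form1"]
```

All values come back as character strings, so numeric fields such as form0 would still need as.numeric or similar.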

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rui Barradas
Solution 2
Solution 3 Chris