if, ifelse, or apply in R loop to split and format datasets

I am reformatting ~150 files in R. I have been doing this with the code provided below, but I now realize that I need to split each of these files into 2 files based on a column variable, and then do the reformatting. I am stuck on the splitting! Here's the process:

I have a folder of files and I am reading file names into R like so:

setwd("/home/intersected_beds")
path <- "/home/intersected_beds"
data <- dir(path)

All of the files look something like this, with varying number of rows but columns matching exactly:

   chr     start       end dir subfamily  family
1 chr1  87144764  87150794   C      L1HS LINE/L1
2 chr2 173179999 173186025   +      L1HS LINE/L1
3 chr2 181698389 181704416   C      L1HS LINE/L1
4 chr3 108468248 108474272   +      L1HS LINE/L1
5 chr3 132664851 132670878   C      L1HS LINE/L1
6 chr4  53682624  53688653   +      L1HS LINE/L1

I need to reformat the data so that it looks like this:

    chr    start      end                  unique
1 chr11 20363925 20370314 chr11_20363925-20370314
2 chr13 46788764 46795064 chr13_46788764-46795064
etc

Which I have been doing like this:

for(i in data){
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE) # load file
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  fmt <- paste(t$chr,"_", t$start,"-", t$end)
  fmt <- gsub(" ","", fmt)
  fmt <- as.data.frame(fmt)
  new.t <- cbind(t, fmt)
  colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
  new.t.bed <- new.t %>% select(chr, start, end, unique)
  write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
}

However, it turns out that I need to split my data based on the "dir" column in the input data, and then do this reformatting. The dir column consists of + or c values. I think I need an ifelse statement because I ran into "the condition has length > 1 and only the first element will be used" errors with a regular if statement. I've attempted to split and format them like so:

for(i in data){
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE)
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  if(t$dir =="+"){
    fmt <- paste(t$chr,"_", t$start,"-", t$end)
    fmt <- gsub(" ","", fmt)
    fmt <- as.data.frame(fmt)
    new.t <- cbind(t, fmt)
    colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
    new.t.bed <- new.t %>% select(chr, start, end, unique)
    write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_plus", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
  } else {
    fmt <- paste(t$chr,"_", t$start,"-", t$end)
    fmt <- gsub(" ","", fmt)
    fmt <- as.data.frame(fmt)
    new.t <- cbind(t, fmt)
    colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
    new.t.bed <- new.t %>% select(chr, start, end, unique)
    write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_c", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
  }
}

but I am still getting a warning "the condition has length > 1 and only the first element will be used" and the output does not have the data split by + and c. I thought about trying to make plussplit and csplit functions to use in an ifelse statement, but I got the error "Error in dir[[t]] : object of type 'closure' is not subsettable"


for(i in data){
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE)
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  ifelse(dir[[t]] =="+", plussplit, csplit)
}
  
plussplit <- function(t){
  fmt <- paste(t$chr,"_", t$start,"-", t$end)
  fmt <- gsub(" ","", fmt)
  fmt <- as.data.frame(fmt)
  new.t <- cbind(t, fmt)
  colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
  new.t.bed <- new.t %>% select(chr, start, end, unique)
  write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_plus", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
}

csplit <- function(t){
  fmt <- paste(t$chr,"_", t$start,"-", t$end)
  fmt <- gsub(" ","", fmt)
  fmt <- as.data.frame(fmt)
  new.t <- cbind(t, fmt)
  colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
  new.t.bed <- new.t %>% select(chr, start, end, unique)
  write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_c", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
}

I've looked through the posts on these topics and see the apply functions being suggested, but I can't get anything to work. I imagine there is something fundamental I am missing about this process, and I bet you royal beings of stackoverflow have the answers! Thanks!



Solution 1:[1]

Not sure if I got it right. You could use

library(dplyr)
library(purrr)

for(i in data) {
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE)
  
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  
  t %>% 
    # build the "chr_start-end" label, e.g. chr11_20363925-20370314
    mutate(unique = paste0(chr, "_", start, "-", end)) %>% 
    select(dir, chr, start, end, unique) %>% 
    split(f = .$dir) %>% 
    walk(~ {
      # scalar if/else picks the file suffix; ifelse() is meant for
      # building vectors, not for side effects such as writing files
      suffix <- if (all(.x$dir == "+")) "_2bit_plus" else "_2bit_c"
      .x %>% 
        select(-dir) %>% 
        write.table(paste0(tools::file_path_sans_ext(i), suffix, ".bed"), 
                    col.names = FALSE, 
                    row.names = FALSE, 
                    quote = FALSE)
    })
}
  • The mutate call does what your fmt steps do, just in one step.
  • split splits the data.frame by the values of a given variable, in this case dir. Since there are two unique values of dir, you get a list of two data.frames: one for + and another one for c.
  • walk from package purrr then applies a conditional to each of those data.frames: if all values of dir are +, the output file name ends in _2bit_plus.bed, otherwise it ends in _2bit_c.bed.
  • Be careful when running this code. You might overwrite existing files that you wanted to keep.
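
For comparison, here is a minimal base R sketch of the same idea, without dplyr or purrr. The warning in the question arises because t$dir == "+" produces one TRUE/FALSE per row, while if() expects a single value. Splitting first sidesteps that, since each piece then has one strand value. The helper name write_by_strand and the toy data frame are made up for illustration and are not part of the original post. The sketch assumes that every value of dir other than "+" should go to the _2bit_c file.

```r
# Hypothetical helper, not part of the original answer: write one BED file
# per value of `dir`, using base R only.
write_by_strand <- function(df, base_name) {
  # Build the "chr_start-end" label; paste0() inserts no spaces.
  df$unique <- paste0(df$chr, "_", df$start, "-", df$end)

  # split() returns a named list with one data frame per value of `dir`.
  pieces <- split(df[, c("chr", "start", "end", "unique")], df$dir)

  written <- character(0)
  for (strand in names(pieces)) {
    # `strand` is a single string here, so a plain if/else is safe.
    # Assumption: every value other than "+" belongs in the "_2bit_c" file.
    suffix <- if (strand == "+") "_2bit_plus" else "_2bit_c"
    out <- paste0(base_name, suffix, ".bed")
    write.table(pieces[[strand]], out,
                col.names = FALSE, row.names = FALSE, quote = FALSE)
    written <- c(written, out)
  }
  invisible(written)
}

# Toy input mirroring the first rows shown in the question.
toy <- data.frame(
  chr   = c("chr1", "chr2", "chr2"),
  start = c(87144764L, 173179999L, 181698389L),
  end   = c(87150794L, 173186025L, 181704416L),
  dir   = c("C", "+", "C"),
  stringsAsFactors = FALSE
)

# Write into a temporary directory so no real files are overwritten.
out_files <- write_by_strand(toy, file.path(tempdir(), "toy"))
```

Inside the original loop, the call would look like write_by_strand(t, tools::file_path_sans_ext(i)), placed after the colnames(t) line.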

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Martin Gal