if, ifelse, or apply in R loop to split and format datasets
I am reformatting ~150 files in R. I have been doing this with code provided below, but I now realize that I need to split each of these files into 2 files based on a column variable, and then do the reformatting. I am stuck with the splitting! Here's the process:
I have a folder of files and I am reading file names into R like so:
setwd("/home/intersected_beds")
path <-("/home/intersected_beds")
data <- dir(path)
All of the files look something like this, with varying number of rows but columns matching exactly:
chr start end dir subfamily family
1 chr1 87144764 87150794 C L1HS LINE/L1
2 chr2 173179999 173186025 + L1HS LINE/L1
3 chr2 181698389 181704416 C L1HS LINE/L1
4 chr3 108468248 108474272 + L1HS LINE/L1
5 chr3 132664851 132670878 C L1HS LINE/L1
6 chr4 53682624 53688653 + L1HS LINE/L1
I need to reformat the data so that it looks like this:
chr start end unique
1 chr11 20363925 20370314 chr11_20363925-20370314
2 chr13 46788764 46795064 chr13_46788764-46795064
etc
Which I have been doing like this:
for(i in data){
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE) # load file
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  fmt <- paste(t$chr,"_", t$start,"-", t$end)
  fmt <- gsub(" ","", fmt)
  fmt <- as.data.frame(fmt)
  new.t <- cbind(t, fmt)
  colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
  new.t.bed <- new.t %>% select(chr, start, end, unique)
  write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
}
However, it turns out that I need to split my data based on the "dir" column in the input data, and then do this reformatting. The dir column consists of + or c values. I think I need an ifelse statement because I ran into "the condition has length > 1 and only the first element will be used" errors with a regular if statement. I've attempted to split and format them like so:
for(i in data){
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE)
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  if(t$dir =="+"){
    fmt <- paste(t$chr,"_", t$start,"-", t$end)
    fmt <- gsub(" ","", fmt)
    fmt <- as.data.frame(fmt)
    new.t <- cbind(t, fmt)
    colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
    new.t.bed <- new.t %>% select(chr, start, end, unique)
    write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_plus", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
  } else {
    fmt <- paste(t$chr,"_", t$start,"-", t$end)
    fmt <- gsub(" ","", fmt)
    fmt <- as.data.frame(fmt)
    new.t <- cbind(t, fmt)
    colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
    new.t.bed <- new.t %>% select(chr, start, end, unique)
    write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_c", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
  }
}
but I am still getting a warning "the condition has length > 1 and only the first element will be used" and the output does not have the data split by + and c. I thought about trying to make plussplit and csplit functions to use in an ifelse statement, but I got the error "Error in dir[[t]] : object of type 'closure' is not subsettable"
for(i in data){
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE)
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  ifelse(dir[[t]] =="+", plussplit, csplit)
}
plussplit <- function(t){
  fmt <- paste(t$chr,"_", t$start,"-", t$end)
  fmt <- gsub(" ","", fmt)
  fmt <- as.data.frame(fmt)
  new.t <- cbind(t, fmt)
  colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
  new.t.bed <- new.t %>% select(chr, start, end, unique)
  write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_plus", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
}
csplit <- function(t){
  fmt <- paste(t$chr,"_", t$start,"-", t$end)
  fmt <- gsub(" ","", fmt)
  fmt <- as.data.frame(fmt)
  new.t <- cbind(t, fmt)
  colnames(new.t) <- c("chr", "start", "end", "dir", "subfamily", "family", "unique")
  new.t.bed <- new.t %>% select(chr, start, end, unique)
  write.table(new.t.bed, paste0(tools::file_path_sans_ext(i), "_2bit_c", ".bed"), col.names=FALSE, row.names=FALSE, quote = FALSE)
}
I've looked through the posts on these topics and see the apply functions being suggested, but I can't get anything to work. I imagine there is something fundamental I am missing about this process, and I bet you royal beings of stackoverflow have the answers! Thanks!
Solution 1:[1]
Not sure if I got it right. You could use
library(dplyr)
library(purrr)
for(i in data) {
  t <- read.table(i, header = FALSE, stringsAsFactors = FALSE)
  colnames(t) <- c("chr", "start", "end", "dir", "subfamily", "family")
  t %>%
    # build "chr_start-end" in one step; paste0 inserts no spaces
    mutate(unique = paste0(chr, "_", start, "-", end)) %>%
    select(dir, chr, start, end, unique) %>%
    # one data.frame per distinct value of dir
    split(f = .$dir) %>%
    walk(~ {
      # plain if/else: the condition is a single TRUE/FALSE per piece
      suffix <- if (all(.x$dir == "+")) "_2bit_plus" else "_2bit_c"
      .x %>%
        select(-dir) %>%
        write.table(paste0(tools::file_path_sans_ext(i), suffix, ".bed"),
                    col.names = FALSE,
                    row.names = FALSE,
                    quote = FALSE)
    })
}
- The mutate function essentially creates your fmt column, just in one step.
- split splits the data.frame by the values of a given variable, in this case dir. Since there are two unique values of dir, you get a list of two data.frames: one for + and another one for c.
- walk from the purrr package then checks each of those data.frames: if all values of dir are +, the data.frame is written to a file whose name ends in _2bit_plus.bed, otherwise to one ending in _2bit_c.bed.
- Be careful when running this code. You might overwrite existing files that you wanted to keep.
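To see what the split step does before touching the real files, it can help to try it on a small in-memory example. The following is only a sketch in base R: the toy data frame reuses four sample rows from the question, and the names toy, parts, suffix_for and out_dir are made up for illustration. It writes to a temporary directory, so nothing in the working folder is overwritten.

```r
# Toy stand-in for one input file (rows copied from the question's sample)
toy <- data.frame(
  chr   = c("chr1", "chr2", "chr2", "chr3"),
  start = c(87144764L, 173179999L, 181698389L, 108468248L),
  end   = c(87150794L, 173186025L, 181704416L, 108474272L),
  dir   = c("C", "+", "C", "+"),
  stringsAsFactors = FALSE
)

# This comparison yields one TRUE/FALSE per row, so it cannot serve as the
# single condition that if() expects
is_plus <- toy$dir == "+"

# Build the "chr_start-end" label; paste0 adds no separators of its own
toy$unique <- paste0(toy$chr, "_", toy$start, "-", toy$end)

# split() returns a named list with one data frame per distinct dir value
parts <- split(toy[c("chr", "start", "end", "unique")], toy$dir)

# Hypothetical helper: anything that is not "+" is treated as the c strand
suffix_for <- function(d) if (d == "+") "_2bit_plus" else "_2bit_c"

# Write each piece to a temporary directory (an assumption made for safety)
out_dir <- tempdir()
written <- character(0)
for (d in names(parts)) {
  out <- file.path(out_dir, paste0("toy", suffix_for(d), ".bed"))
  write.table(parts[[d]], out,
              col.names = FALSE, row.names = FALSE, quote = FALSE)
  written <- c(written, out)
}
```

Here names(parts) holds the distinct dir values, so the loop condition inside suffix_for is always a single string. That is the difference from testing the whole t$dir column at once.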
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Martin Gal |
