Concatenate values in each column to their column names in a data.frame

I want to rewrite the values in the attribute columns of the gtf.top.thyroid.gene data frame so that each value has:

  • the attribute key (e.g. gene_id) and an opening double quote prepended
  • a closing double quote and a semicolon ";" appended

For one column, I can do the following:

gtf.top.thyroid.gene$attribute <- paste0('gene_id "', gtf.top.thyroid.gene$attribute, '";')

But how can I write a for loop that replaces the following repeated lines?

gtf.top.thyroid.gene$attribute <- paste0('gene_id "', gtf.top.thyroid.gene$attribute, '";')
gtf.top.thyroid.gene$transcript_id <- paste0('transcript_id "', gtf.top.thyroid.gene$transcript_id, '";')
gtf.top.thyroid.gene$gene_name <- paste0('gene_name "', gtf.top.thyroid.gene$gene_name, '";')
gtf.top.thyroid.gene$transcript_name <- paste0('transcript_name "', gtf.top.thyroid.gene$transcript_name, '";')
write.table(gtf.top.thyroid.gene, file="topgene.gtf", row.names=F, col.names=F, quote=F, sep="\t")

My attempt:

for (i in gtf.top.thyroid.gene[,9:12]) {
  for (j in colnames(gtf.top.thyroid.gene)[9:12]) {
    i <- paste(j, ' "', i, '"; ')
  }
}

...but it did not change any of the column values.

> dput(gtf.top.thyroid.gene)
structure(list(seqid = c("NC_000001.11", "NC_000001.11", "NC_000001.11"
), source = c("BestRefSeq", "BestRefSeq", "BestRefSeq"), feature = c("exon", 
"exon", "exon"), start = c(11874L, 12613L, 13221L), end = c(12227L, 
12721L, 14409L), score = c(".", ".", "."), strand = c("+", "+", 
"+"), frame = c(".", ".", "."), attribute = c("gene0", "gene0", 
"gene0"), transcript_id = c("rna0", "rna0", "rna0"), gene_name = c("DDX11L1", 
"DDX11L1", "DDX11L1"), transcript_name = c("NR_046018.2", "NR_046018.2", 
"NR_046018.2")), class = "data.frame", row.names = c("1", "2", 
"3"))


Solution 1:[1]

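This can be handled with across() and cur_column() from dplyr. across() applies a function to each selected column, and cur_column() returns the name of the column currently being processed. The attribute column is renamed to gene_id first so that its name matches the key it should receive. Assign the result back to gtf.top.thyroid.gene if you want to keep the changes.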

library(dplyr)

gtf.top.thyroid.gene %>%
  rename(gene_id = attribute) %>%
  mutate(across(c(gene_id, transcript_id, gene_name, transcript_name),
                ~ paste0(cur_column(), ' "', .x, '";')))

Output:

#          seqid     source feature start   end score strand frame          gene_id         transcript_id            gene_name                transcript_name
# 1 NC_000001.11 BestRefSeq    exon 11874 12227     .      +     . gene_id "gene0"; transcript_id "rna0"; gene_name "DDX11L1"; transcript_name "NR_046018.2";
# 2 NC_000001.11 BestRefSeq    exon 12613 12721     .      +     . gene_id "gene0"; transcript_id "rna0"; gene_name "DDX11L1"; transcript_name "NR_046018.2";
# 3 NC_000001.11 BestRefSeq    exon 13221 14409     .      +     . gene_id "gene0"; transcript_id "rna0"; gene_name "DDX11L1"; transcript_name "NR_046018.2";
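
For a base R alternative that stays close to the for loop in the question, here is a minimal sketch (it is not part of the original answer). The attempt in the question changes nothing because i is only a copy of each column, so assigning to it never touches the data frame; the nested loop also pairs every column with every name, and paste() inserts spaces that paste0() does not. Looping over the column names and indexing the data frame with [[ avoids all three problems. The small df below is a stand-in that holds only the four attribute columns from the dput() output, with attribute already renamed to gene_id.

```r
# Stand-in for gtf.top.thyroid.gene: only the four attribute columns,
# using the values from the dput() output (attribute renamed to gene_id).
df <- data.frame(
  gene_id         = rep("gene0", 3),
  transcript_id   = rep("rna0", 3),
  gene_name       = rep("DDX11L1", 3),
  transcript_name = rep("NR_046018.2", 3),
  stringsAsFactors = FALSE
)

cols <- c("gene_id", "transcript_id", "gene_name", "transcript_name")

# Loop over column names, not column values, and write back into the
# data frame so the change persists.
out_loop <- df
for (col in cols) {
  out_loop[[col]] <- paste0(col, ' "', out_loop[[col]], '";')
}

# Loop-free variant: Map() walks the columns and their names in parallel.
out_map <- df
out_map[cols] <- Map(function(x, nm) paste0(nm, ' "', x, '";'), df[cols], cols)

print(out_loop)
```

On the real data, the same loop should work after renaming attribute to gene_id, for example with names(gtf.top.thyroid.gene)[9] <- "gene_id", assuming attribute is column 9 as in the dput() output.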

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Darren Tsai