'Grab specific values from columns with disordered rows

As the consequence of a misformatted file (which is unfortunately the only file available) I have a few thousand columns each with 9 rows of data in them. Unfortunately, the actual values are in a different order in each column.

I need to extract matched locus_tag= and gene= and/or product= values for each column whilst keeping the column order intact so that these do not get mismatched. Another complication is that these are formatted as "gene=ltas" so I had thought some kind of grepl would be useful.

However, I also need them ordered so that each row only contains one either the correct value (e.g. gene=) or NA:

Column A Column B
gene = ltas NA
NA product = hypothetical protein
locus_tag = RAS_R12345 locus_tag = RAS_R14053

Here is an example of the data that I am working with:

header 1 header 2
Parent=gene-SAS_RS00035 Name=hutH
gbkey=CDS gene_biotype=protein_coding
inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1 locus_tag=SAS_RS00040
Dbxref=Genbank:WP_000449218.1 gbkey=Gene
locus_tag=SAS_RS00035 old_locus_tag=SAS0008
Name=WP_000449218.1 gene=hutH
cds-WP_000449218.1 gene-SAS_RS00040
protein_id=WP_000449218.1
product=NAD(P)H-hydrate dehydratase

I'n not sure where to start with coding this as it is so disordered and poorly formatted, so any advice would be very welcomed.



Solution 1:[1]

How about this:

  dat <- structure(list(`header 1` = c("Parent=gene-SAS_RS00035", "gbkey=CDS", 
                                     "inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1", 
                                     "Dbxref=Genbank:WP_000449218.1", "locus_tag=SAS_RS00035", "Name=WP_000449218.1", 
                                     "cds-WP_000449218.1", "protein_id=WP_000449218.1", "product=NAD(P)H-hydrate dehydratase"
), `header 2` = c("Name=hutH", "gene_biotype=protein_coding", 
                  "locus_tag=SAS_RS00040", "gbkey=Gene", "old_locus_tag=SAS0008", 
                  "gene=hutH", "gene-SAS_RS00040", "", "")), row.names = c(NA, 
                                                                           9L), class = "data.frame")


prod <- apply(dat, 2, function(x){
  prod_ind <- grep("^product", x)
  if(length(prod_ind == 1)){
    out <- gsub("(product=.*)", "\\1", x[prod_ind])
  }else{
    out <- NA
  }
  out
})

gene <- apply(dat, 2, function(x){
  gene_ind <- grep("^gene=", x)
  if(length(gene_ind == 1)){
    out <- gsub("(gene=.*)", "\\1", x[gene_ind])
  }else{
    out <- NA
  }
  out
})

locus <- apply(dat, 2, function(x){
  locus_tag_ind <- grep("^locus_tag=", x)
  if(length(locus_tag_ind == 1)){
    out <- gsub("(locus_tag=.*)", "\\1", x[locus_tag_ind])
  }else{
    out <- NA
  }
  out
})

dplyr::bind_rows(gene, prod, locus)
#> # A tibble: 3 × 2
#>   `header 1`                          `header 2`           
#>   <chr>                               <chr>                
#> 1 <NA>                                gene=hutH            
#> 2 product=NAD(P)H-hydrate dehydratase <NA>                 
#> 3 locus_tag=SAS_RS00035               locus_tag=SAS_RS00040

Created on 2022-04-05 by the reprex package (v2.0.1)

The example above does this in three parts, each time searching for one of the things you're interested in. Then, it combines all the results together.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DaveArmstrong