'Grab specific values from columns with disordered rows

As the consequence of a misformatted file (which is unfortunately the only file available) I have a few thousand columns each with 9 rows of data in them. Unfortunately, the actual values are in a different order in each column.

I need to extract matched locus_tag= and gene= and/or product= values for each column whilst keeping the column order intact so that these do not get mismatched. Another complication is that these are formatted as "gene=ltas" so I had thought some kind of grepl would be useful.

However, I also need them ordered so that each row only contains one either the correct value (e.g. gene=) or NA:

`Column A`	`Column B`
`gene = ltas`	`NA`
`NA`	`product = hypothetical protein`
`locus_tag = RAS_R12345`	`locus_tag = RAS_R14053`

Here is an example of the data that I am working with:

`header 1`	`header 2`
`Parent=gene-SAS_RS00035`	`Name=hutH`
`gbkey=CDS`	`gene_biotype=protein_coding`
`inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1`	`locus_tag=SAS_RS00040`
`Dbxref=Genbank:WP_000449218.1`	`gbkey=Gene`
`locus_tag=SAS_RS00035`	`old_locus_tag=SAS0008`
`Name=WP_000449218.1`	`gene=hutH`
`cds-WP_000449218.1`	`gene-SAS_RS00040`
`protein_id=WP_000449218.1`
`product=NAD(P)H-hydrate dehydratase`

I'n not sure where to start with coding this as it is so disordered and poorly formatted, so any advice would be very welcomed.

r ncbi

Solution 1:^[1]

How about this:

  dat <- structure(list(`header 1` = c("Parent=gene-SAS_RS00035", "gbkey=CDS", 
                                     "inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1", 
                                     "Dbxref=Genbank:WP_000449218.1", "locus_tag=SAS_RS00035", "Name=WP_000449218.1", 
                                     "cds-WP_000449218.1", "protein_id=WP_000449218.1", "product=NAD(P)H-hydrate dehydratase"
), `header 2` = c("Name=hutH", "gene_biotype=protein_coding", 
                  "locus_tag=SAS_RS00040", "gbkey=Gene", "old_locus_tag=SAS0008", 
                  "gene=hutH", "gene-SAS_RS00040", "", "")), row.names = c(NA, 
                                                                           9L), class = "data.frame")


prod <- apply(dat, 2, function(x){
  prod_ind <- grep("^product", x)
  if(length(prod_ind == 1)){
    out <- gsub("(product=.*)", "\\1", x[prod_ind])
  }else{
    out <- NA
  }
  out
})

gene <- apply(dat, 2, function(x){
  gene_ind <- grep("^gene=", x)
  if(length(gene_ind == 1)){
    out <- gsub("(gene=.*)", "\\1", x[gene_ind])
  }else{
    out <- NA
  }
  out
})

locus <- apply(dat, 2, function(x){
  locus_tag_ind <- grep("^locus_tag=", x)
  if(length(locus_tag_ind == 1)){
    out <- gsub("(locus_tag=.*)", "\\1", x[locus_tag_ind])
  }else{
    out <- NA
  }
  out
})

dplyr::bind_rows(gene, prod, locus)
#> # A tibble: 3 × 2
#>   `header 1`                          `header 2`           
#>   <chr>                               <chr>                
#> 1 <NA>                                gene=hutH            
#> 2 product=NAD(P)H-hydrate dehydratase <NA>                 
#> 3 locus_tag=SAS_RS00035               locus_tag=SAS_RS00040

^{Created on 2022-04-05 by the reprex package (v2.0.1)}

The example above does this in three parts, each time searching for one of the things you're interested in. Then, it combines all the results together.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	DaveArmstrong

'Grab specific values from columns with disordered rows

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]