Rearranging cells in a .tsv file
I have a .tsv file, which I have attached to this post. Its rows (cells) are in the order A1, A2, A3, ..., A12, B1, ..., B12, ..., H1, ..., H12. I need to rearrange them into the order A1, B1, C1, D1, ..., H1, A2, B2, C2, ..., H2, ..., A12, B12, C12, ..., H12. I need to do this using Python.
I have another .tsv file to compare it with, called flipped.tsv. The flipped.tsv file contains the accurate well values corresponding to the cells. In other words, I must map the well values to their correct cell lines.
From what I understand, the cell lines in the metadata are incorrectly arranged in column-major order, whereas they have to be arranged in row-major order, as in the flipped.tsv file.
For example: "A2 of flipped_metadata.tsv has the same well values as that of B1 of metadata.tsv."
What logic can I use to do this in Python?
Solution 1:[1]
You could do the following:
import csv

# Read original file
with open("file.tsv", "r", newline="") as file:
    rows = list(csv.reader(file, delimiter="\t"))

# Key function for sorting
def key_func(row):
    """Transform a row into a sort key, e.g. ['A7', '1', '2'] -> (7, 'A')."""
    return int(row[0][1:]), row[0][0]

# Write `flipped` file: keep each label in place, take the values from the sorted rows
with open("file_flipped.tsv", "w", newline="") as file:
    csv.writer(file, delimiter="\t").writerows(
        row[:1] + flipped[1:]
        for row, flipped in zip(rows, sorted(rows, key=key_func))
    )
The flipping is done by sorting the original rows by
- first, the integer part of their first entry, int(row[0][1:]), and
- then the character part of their first entry, row[0][0].
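To see the effect on a small scale, here is a minimal sketch that uses a hypothetical 2x3 plate (rows A and B, columns 1 to 3) held in memory instead of a file. The values such as v_A1 are placeholders made up for illustration, not data from the question.

```python
# Hypothetical miniature plate: labels A1..A3, B1..B3 with placeholder values.
rows = [[f"{letter}{number}", f"v_{letter}{number}"]
        for letter in "AB" for number in range(1, 4)]

def key_func(row):
    """Sort key: number part first, then letter, e.g. ['A2', ...] -> (2, 'A')."""
    return int(row[0][1:]), row[0][0]

# Keep the labels in their original positions, take the values from the sorted rows.
flipped_rows = [row[:1] + flipped[1:]
                for row, flipped in zip(rows, sorted(rows, key=key_func))]

for row in flipped_rows:
    print(*row, sep="\t")
```

In this sketch, A2 ends up with the value that originally belonged to B1, which matches the example given in the question.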
If the effect of the sorting isn't obvious, take a look at the result of the same operation, just without the relabelling of the first column:
with open("file_flipped.tsv", "w", newline="") as file:
    csv.writer(file, delimiter="\t").writerows(
        sorted(rows, key=key_func)
    )
Output:
A1 26403 23273
B1 27792 8805
C1 5668 19510
...
F12 100 28583
G12 18707 14889
H12 13544 7447
The blocks are built based on the number part first, and within each block the lines run through the sorted characters.
This only works as long as the non-number part always has exactly one character.
If the non-number part always has exactly 2 characters, then the return value of the key function has to be adjusted to int(row[0][2:]), row[0][:2], and so on.
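As a rough sketch of the two-character case: the function name key_func_two_chars and the labels below are hypothetical, chosen only to illustrate the adjusted slicing.

```python
def key_func_two_chars(row):
    """Sort key for a two-character prefix, e.g. ['AB7', ...] -> (7, 'AB')."""
    return int(row[0][2:]), row[0][:2]

# Hypothetical labels with placeholder values, not taken from the question's file.
rows = [["AA1", "x"], ["AA2", "y"], ["AB1", "z"], ["AB2", "w"]]

ordered_labels = [row[0] for row in sorted(rows, key=key_func_two_chars)]
print(ordered_labels)
```

The labels come out grouped by number first (AA1, AB1, then AA2, AB2), just as in the one-character case.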
If there's more variability allowed, e.g. between 1 and 5 characters, then a regex approach would be more appropriate:
import re

re_key = re.compile(r"([a-zA-Z]+)(\d+)")

def key_func(row):
    """Transform a row into a sort key, e.g. ['Aa7', '10', '20'] -> (7, 2, 'Aa')."""
    word, number = re_key.match(row[0]).group(1, 2)
    return int(number), len(word), word
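A short usage sketch of this regex-based key, assuming every label consists of letters followed by digits (a label that does not match would raise an AttributeError, because match returns None). The labels are made up for illustration.

```python
import re

re_key = re.compile(r"([a-zA-Z]+)(\d+)")

def key_func(row):
    """Sort key: number, then word length, then the word itself."""
    word, number = re_key.match(row[0]).group(1, 2)
    return int(number), len(word), word

# Hypothetical labels mixing one- and two-letter prefixes.
labels = [["A1"], ["A2"], ["B1"], ["B2"], ["AA1"], ["AA2"]]

ordered = [row[0] for row in sorted(labels, key=key_func)]
print(ordered)
```

With these labels the result is grouped by number, and within each group the one-letter prefixes precede the two-letter ones.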
And, depending on how the words have to be sorted, it might be necessary to include the length of the word in the sort key: Python sorts ['B', 'AA', 'A'] lexicographically into ['A', 'AA', 'B'], not ['A', 'B', 'AA']. Adding the length, as in the function above, achieves the latter order.
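The difference between the two orderings can be checked directly with a small sketch (the variable names are arbitrary):

```python
words = ["B", "AA", "A"]

# Plain lexicographic sort: 'AA' lands between 'A' and 'B'.
plain = sorted(words)

# Length first, then alphabetical: single-letter words come before 'AA'.
by_length = sorted(words, key=lambda word: (len(word), word))

print(plain)
print(by_length)
```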
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
