(bash/Linux) most efficient way to subset rows and columns based on row and column indices in other files
I have a data file with millions of rows and thousands of columns. I need the data for certain rows and certain columns. I have a text file with the line numbers I need, and another text file with the column headers I need. I have to subset a couple hundred data files like this, so I'm trying to figure out how to do it quickly, without creating huge intermediate files.
I'm not very experienced in bash scripting. My best approach so far has been to use awk to extract the necessary lines to create an intermediate file, and pipe the results to cut so that the output file only contains the necessary columns:
COLS=($(zcat big_zipped_file.txt.gz | head -n 1 | tr '\t' '\n' | grep -nf column_headers.txt | sed 's/:.*$//'))
awk 'NR==FNR{a[$0];next} FNR in a' line_numbers.txt <(zcat big_zipped_file.txt.gz) |
cut -f 1$(printf ",%s" "${COLS[@]}") >> output_file
Is there a way to do this that's faster, and doesn't create large temp files?
Solution 1:
The idea is to scan each file exactly once. The script below uses gawk (GNU awk, the default awk on most Linux distributions); it relies on the gawk-specific ENDFILE rule. awk first reads the two index files, columns.txt and rows.txt (in either order), then one or more data files, one after the other, without creating any temporary files.
Input files
rows.txt
1
7
5
2
columns.txt
col5
col3
col2
data.txt
col1 col2 col3 col4 col5
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
61 62 63 64 65
71 72 73 74 75
script.awk
BEGIN {
    rowsNumbersFileName = "rows.txt";
    columnNamesFileName = "columns.txt";
}
ENDFILE { # count completed input files; files after the first two are data files
    ++inputFilesCount;
}
FILENAME == rowsNumbersFileName { # read desired row numbers
    rowsArr[$0] = ++rowsCount;
}
FILENAME == columnNamesFileName { # read desired column headers
    columnHeadersArr[$0] = ++columnHeadersCount;
}
inputFilesCount > 1 && FNR == 1 { # header line of a data file: locate wanted columns
    OFS = "";
    columnNumbersCount = 0; # reset per data file
    delete columnNumbersArr;
    for (i = 1; i <= NF; i++) { # scan all header names
        if ($i in columnHeadersArr) { # current field is a wanted column
            columnNumbersArr[++columnNumbersCount] = i; # remember its position
            printf("%s%s", OFS, $i); # output current field as header
            OFS = ", ";
        }
    }
    printf("\n");
    next;
}
inputFilesCount > 1 && (FNR - 1) in rowsArr { # data row whose number was requested
    OFS = "";
    for (i = 1; i <= columnNumbersCount; i++) { # numeric loop keeps column order stable
        printf("%s%s", OFS, $columnNumbersArr[i]);
        OFS = ", ";
    }
    printf("\n");
}
Output
awk -f script.awk rows.txt columns.txt data.txt
col2, col3, col5
12, 13, 15
22, 23, 25
52, 53, 55
72, 73, 75
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dudi Boy |
