(bash/Linux) most efficient way to subset rows and columns based on row and column indices in other files

I have a data file with millions of rows and thousands of columns. I need the data for certain rows and certain columns. I have a text file with the line numbers I need, and another text file with the column headers I need. I have to subset a couple hundred data files like this, so I'm trying to figure out how to do it quickly, without creating huge intermediate files.

I'm not very experienced in bash scripting. My best approach so far has been to use awk to extract the necessary lines to create an intermediate file, and pipe the results to cut so that the output file only contains the necessary columns:

COLS=($(zcat big_zipped_file.txt.gz | head -n 1 | tr '\t' '\n' | grep -nf column_headers.txt | sed 's/:.*$//'))

awk 'NR==FNR{a[$0];next} FNR in a' line_numbers.txt <(zcat big_zipped_file.txt.gz) | 
  cut -f 1$(printf ",%s" "${COLS[@]}") >> output_file

Is there a way to do this that's faster, and doesn't create large temp files?



Solution 1:[1]

The idea is to scan each input file only once, using a single gawk script (ENDFILE is a gawk extension, so plain POSIX awk will not do).

The first two files read are columns.txt and rows.txt (their order does not matter).

One or more data files are then read, one after the other, without any temporary files.

Input files

rows.txt

1
7
5
2

columns.txt

col5
col3
col2

data.txt

col1 col2 col3 col4 col5
11   12   13   14   15
21   22   23   24   25
31   32   33   34   35
41   42   43   44   45
51   52   53   54   55
61   62   63   64   65
71   72   73   74   75

script.awk

BEGIN {
    rowsNumbersFileName = "rows.txt";
    columnNamesFileName = "columns.txt";
}

ENDFILE { # gawk-specific: count finished files; data files are those read while inputFilesCount > 1
    ++inputFilesCount;
}

FILENAME == rowsNumbersFileName { # read desired row numbers
    rowsArr[$0] = ++rowsCount;
}

FILENAME == columnNamesFileName { # read desired column headers
    columnHeadersArr[$0] = ++columnHeadersCount;
}

inputFilesCount > 1 && FNR == 1 { # data file header: locate the desired columns
    columnNumbersCount = 0; # reset in case of multiple data files
    sep = "";
    for (i = 1; i <= NF; i++) { # scan all header names
        if ($i in columnHeadersArr) { # current field is a desired column
            columnNumbersArr[++columnNumbersCount] = i; # remember its field position
            printf ("%s   %s", sep, $i); # output current field as a header
            sep = ",";
        }
    }
    printf("\n");
    next;
}

inputFilesCount > 1 && (FNR - 1) in rowsArr { # row numbers count data lines, not the header
    sep = "";
    for (i = 1; i <= columnNumbersCount; i++) { # iterate in header order (for-in order is unspecified)
        printf ("%s     %s", sep, $columnNumbersArr[i]);
        sep = ",";
    }
    printf("\n");
}

Output

gawk -f script.awk rows.txt columns.txt data.txt
   col2,   col3,   col5
     12,     13,     15
     22,     23,     25
     52,     53,     55
     72,     73,     75

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Dudi Boy