Extract two consecutive lines that have non-consecutive strings

I have a very large text file with 2 columns and more than 10 million lines. In most lines, the number in column 2 equals the number in column 2 of the previous line plus 1. However, a few thousand lines behave differently (see the example below).

Input file:

A  1
A  2
A  3
A  10
A  11
A  12
A  40
A  41

I would like to extract each pair of consecutive lines that does not respect the +1 increment in column 2.

Desired output file:

A  3
A  10
A  12
A  40

Is there (preferably) an awk command that can do that? I tried several commands that compare column 2 of two consecutive lines, but so far none of them has worked (see the code below).

awk 'FNR==1 {print; next} $2==p2+1 {print p $0; p=""; next} {p=$0 ORS; p2=$2}' input.txt > output.txt

Thanks for your help. Best,



Solution 1:[1]

Would you please try the following:

awk 'NR>1 {if ($2!=p2+1) print p ORS $0} {p=$0; p2=$2}' input.txt > output.txt

Output:

A  3
A  10
A  12
A  40

  • The variable names are similar to yours: p holds the previous line and p2 holds the second column of the previous line.
  • The condition NR>1 suppresses printing on the 1st line, which has no previous line to compare against.
  • if ($2!=p2+1) print p ORS $0 prints each pair of lines whose second columns do not differ by exactly 1.
  • The block {p=$0; p2=$2} saves the values of the current line for the next iteration.
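
As a minimal sketch, the same logic can be written out as a commented shell function and run against the sample data from the question. The function name print_gap_pairs is a hypothetical helper chosen for illustration; it is not part of the original answer.

```shell
#!/bin/sh
# print_gap_pairs: hypothetical wrapper around the one-liner above.
# Reads the files given as arguments, or stdin when none are given.
print_gap_pairs() {
    awk '
        # From the 2nd line on, print the previous and current line
        # whenever column 2 is not the previous column 2 plus 1.
        NR > 1 && $2 != p2 + 1 { print p ORS $0 }
        # Remember the current line and its 2nd column for the next line.
        { p = $0; p2 = $2 }
    ' "$@"
}

# Demo on the sample input from the question.
printf '%s\n' 'A  1' 'A  2' 'A  3' 'A  10' 'A  11' 'A  12' 'A  40' 'A  41' |
    print_gap_pairs
```

Like the one-liner, this prints a line twice when it is both the second line of one pair and the first line of the next.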

Solution 2:[2]

I like Perl for text processing that needs arithmetic.

$ perl -ane 'print and next if $.<3; print $p and print if $F[3]!=$fp+1; $fp=$F[3]; $p=$_' input.txt
| COLUMN 1 | COLUMN 2 |
| -------- | -------- |
| A | 3 |
| A | 10 |
| A | 12 |
| A | 40 |
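
  • This uses -a to autosplit each line on whitespace into @F. Reading $F[3] (the 4th field) fits input laid out as a pipe-delimited table like the output above, where a row such as | A | 3 | splits into |, A, |, 3, |. For a plain two-column file like the one in the question, the number would be in $F[1] instead.
  • Prints the first 2 lines unchanged (here the table header and its separator row): print and next if $.<3
  • On subsequent lines, prints the previous line and the current line if the 4th field isn't exactly one more than the prior 4th field: print $p and print if $F[3]!=$fp+1
  • Saves the 4th field as $fp and the entire line as $p: $fp=$F[3]; $p=$_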
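
For comparison, here is a rough awk sketch of the same idea. It assumes the input is a pipe-delimited table with a header row and a separator row, so that the number is the 4th whitespace-separated field (awk's $4, Perl's $F[3]). The function name table_gap_pairs and the demo rows are made up for illustration. Unlike the Perl one-liner, it skips the comparison on the first data row, which has no predecessor.

```shell
#!/bin/sh
# table_gap_pairs: hypothetical helper, assuming table-shaped input
# with two leading header rows and the number in the 4th field.
table_gap_pairs() {
    awk '
        # Pass the header row and the separator row through unchanged.
        NR < 3 { print; next }
        # From the 2nd data row on, print previous and current row
        # when the 4th field is not the previous 4th field plus 1.
        NR > 3 && $4 != p4 + 1 { print p ORS $0 }
        # Remember the current row and its 4th field.
        { p = $0; p4 = $4 }
    ' "$@"
}

# Demo on a small table-shaped input.
printf '%s\n' \
    '| COLUMN 1 | COLUMN 2 |' \
    '| -------- | -------- |' \
    '| A | 1 |' '| A | 2 |' '| A | 3 |' '| A | 10 |' \
    '| A | 11 |' '| A | 12 |' '| A | 40 |' '| A | 41 |' |
    table_gap_pairs
```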

Solution 3:[3]

Assumptions:

  • columns are tab-delimited
  • the 1st column may contain white space (this isn't demonstrated in the sample provided by OP but it also hasn't been ruled out)
  • lines of interest must have the same value in the 1st column (i.e., if the values in the 1st column differ, we skip the comparison of the 2nd column and proceed to the next input line)
  • if 3 consecutive lines meet the criteria, the 2nd/middle line is only printed once

Setup:

$ cat input.txt
A       1
A       2
A       3           # match
A       10          # match
A       11
A       12          # match
A       23          # match
A       40          # match
A       41
X to Z  101
X to Z  102         # match
X to Z  104         # match
X to Z  105

NOTE: the comments are not part of the file; they are only added here to highlight the lines that match the search criteria

One awk idea:

awk -F'\t' '
FNR==1 { prevline=$0 }
FNR>1  { if ($1 == prev1 && $2+0 != prev2+1) {
            if (prevline) print prevline
            print
            prevline=""                          # make sure this line is not printed again if next line also meets criteria
         }
         else 
            prevline=$0
       }
       { prev1=$1; prev2=$2 }
' input.txt

This generates:

A       3
A       10
A       12
A       23
A       40
X to Z  102
X to Z  104

Solution 4:[4]

This might work for you (GNU sed):

sed -nE 'N;h
         s/.*\s+(.*)\n.*(\s.*)/echo "$((\1+1))\2"/e;/^(.*)\s\1$/!{x;p;x};x;D' file 

Maintain a two-line window that slides over the whole file.

Make a copy of the window and increment the 2nd column of the first line by one. If this amended value is not equal to the 2nd column of the second line, print both unadulterated lines.

Delete the first line and repeat.

N.B. This may print the second of these lines twice if the following line meets the same criteria.
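
If the duplicate is unwanted, one possible workaround is to keep a flag that records whether the previous line has already been printed. The following is a minimal awk sketch of that idea, not a rewrite of the sed command; the function name gap_pairs_once and the variable shown are hypothetical, and it assumes whitespace-separated columns with the number in column 2.

```shell
#!/bin/sh
# gap_pairs_once: hypothetical helper that prints each line of a
# non-consecutive pair at most once.
gap_pairs_once() {
    awk '
        NR > 1 && $2 != p2 + 1 {
            if (!shown) print p   # previous line not printed yet
            print                 # current line completes the pair
            shown = 1             # current line has now been printed
            p = $0; p2 = $2
            next
        }
        # Consecutive (or first) line: remember it, mark it as unprinted.
        { shown = 0; p = $0; p2 = $2 }
    ' "$@"
}

# Demo: 5 belongs to two pairs (1,5) and (5,9) but is printed once.
printf '%s\n' 'A 1' 'A 5' 'A 9' 'A 10' | gap_pairs_once
```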

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 stevesliva
Solution 3
Solution 4 potong