Splitting a file based on tags - grep, sed?

I have a file that consists of tags and content descriptions, e.g.:

@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

I would like to split this file based on the tags that have a certain prefix (e.g. "ABC"), keeping each tag line together with the content below it. So the example file above would be split into 3 files (since there are 3 distinct tags with the "ABC" prefix).

File "ABC-0000" (found 3 instances in the file):

@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

File "ABC-1111" (found two instances in the file):

@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

File "ABC-2222" (found 1 instance in the file):

@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2

I was trying to use bash script with sed:

  for i in $(grep -Eo '@ABC-[0-9]+' $file | sort -u); do
    sed -n -r "/${i}/,/^\s*$/p" $file >> $i.out
  done

It seems to work only if there is a blank line between a tag's content and the next tag line.

Is there a way to do this with grep, sed, or awk? Or maybe in python?

Thanks!!



Solution 1:[1]

You can use csplit to split up the files into sections with their tags:

# split the input into one chunk per tag line (chunks are named xx00, xx01, ...)
csplit --quiet -f xx ./input.txt '/^@/' '{*}'
# collect the unique tags
TAGS=$(grep -o '@[^ ]*' ./input.txt | sort | uniq)
# concatenate every chunk containing a given tag into a file named after that tag
for TAG in $TAGS
do
  grep -l "$TAG" xx* | xargs cat > "$(echo "$TAG" | tr -d '@')"
done
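
If only the tags with the ABC prefix are wanted, as in the question, the same approach can be narrowed and the temporary chunk files removed afterwards; a minimal variant, assuming GNU csplit and bash:

csplit --quiet -f xx ./input.txt '/^@/' '{*}'
# only collect tags with the ABC prefix
for TAG in $(grep -o '@ABC-[0-9]*' ./input.txt | sort -u)
do
  grep -l "$TAG" xx* | xargs cat > "${TAG#@}"
done
# remove the temporary chunk files
rm -f xx*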

Solution 2:[2]

Assumptions:

  • all tag lines start with a @ in column #1
  • all lines that start with @ in column #1 are tag lines

One awk idea:

awk '
$1 ~ /^@/ { delete flist                     # delete array of output files
            for (i=1;i<=NF;i++) {            # loop through list of tags
                if ($i ~ "^@ABC-") {         # if tag starts with "@ABC-" then ..
                   flist[substr($i,2)]       # strip off the "@" and save result as name of an output file
                }
            }
          }

          { for (file in flist)              # for each file in our array ...
                print $0 >> file             # append the current line
          }
' tag.dat

NOTES:

  • as currently coded awk will maintain an open file descriptor for each tag/file processed
  • for a smallish number of tags/files this likely won't be a problem for most awk implementations
  • if running GNU awk you should be able to maintain a sizeable number of open file descriptors
  • if receiving a message that awk has exceeded the max number of open file descriptors, there are a couple of ideas that come to mind:
    • before the delete flist statement, run for (file in flist) close(file); this will likely slow down the overall script due to the excessive number of open/close file operations
    • store each tag's data in memory (there are a few ways to do this) and, in END {...} processing, loop through a master list of tags, performing a single open/write-all-data-from-memory/close operation for each tag; this assumes the entire file can fit in memory (a minimal sketch of this approach follows below)

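One possible way to implement the second idea (buffer each tag's data in memory, then write everything out in the END block) is sketched below, assuming the same tag.dat input and @ABC- prefix as the script above:

awk '
$1 ~ /^@/ { delete flist                     # rebuild the list of target files for this block
            for (i=1;i<=NF;i++)
                if ($i ~ "^@ABC-")
                    flist[substr($i,2)]
          }

          { for (file in flist)              # buffer the current line, per output file
                data[file] = (file in data ? data[file] ORS $0 : $0)
          }

END       { for (file in data) {             # one open/write/close per output file
                print data[file] > file
                close(file)
            }
          }
' tag.dat
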
Results:

for f in ABC-*
do
    printf "\n############# $f\n"
    cat $f
done

############# ABC-0000
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

############# ABC-1111
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

############# ABC-2222
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2

Solution 3:[3]

$ cat tst.awk
/^@/ { prt(); rec=tags=$0; next }   # tag line: print the previous record, start a new one
{ rec=rec ORS $0 }                  # content line: append to the current record
END { prt() }                       # print the final record

function prt(   i,n,t) {
    n = split(tags,t,/ *@/)         # split the tag line into individual tags
    for ( i=2; i<=n; i++ ) {        # t[1] is empty (the line starts with @), so start at 2
        print rec > (t[i])          # write the record to one file per tag
    }
}

$ awk -f tst.awk file

$ head -50 A*
==> ABC-0000 <==
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2

==> ABC-1111 <==
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

==> ABC-2222 <==
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2

==> ANYTAG <==
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2

The above assumes you either don't have so many output files as to trigger the too many open files error or are using an awk version such as GNU awk that can handle arbitrary numbers of simultaneously open files. If that's not the case then change print rec > (t[i]) to print rec >> (t[i]); close(t[i]).
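
For reference, the prt() function with that change applied might look like the sketch below; note that >> does not truncate an existing file, so remove any leftover output files before re-running:

function prt(   i,n,t) {
    n = split(tags,t,/ *@/)
    for ( i=2; i<=n; i++ ) {
        print rec >> (t[i])         # append, then immediately release the file descriptor
        close(t[i])
    }
}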

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Gavin Haynes
Solution 2
Solution 3 Ed Morton