Splitting a file based on tags - grep, sed?
I have a file that consists of tags and content descriptions, e.g.:
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
I would like to split this file based on the tags with a certain prefix (e.g. "ABC"), keeping each tag's content below it. The example file above would therefore be split into 3 files, since it contains 3 distinct tags with the "ABC" prefix.
File "ABC-0000" (found 3 instances in the file):
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
File "ABC-1111" (found 2 instances in the file):
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
File "ABC-2222" (found 1 instance in the file):
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
I was trying to use a bash script with sed:
for i in $(grep -Eo '@ABC-[0-9]+' "$file" | sort -u); do
    sed -n -r "/${i}/,/^\s*$/p" "$file" >> "$i.out"
done
It seems to work only if there is a blank line between a tag's content and the next tag line.
Is there a way to do this with grep, sed, or awk? Or maybe in python?
Thanks!!
Solution 1:[1]
You can use csplit to split up the files into sections with their tags:
csplit --quiet -f xx ./input.txt '/^@/' '{*}'
TAGS=$(grep -o '@[^ ]*' ./input.txt | sort | uniq)
for TAG in $TAGS
do
grep -l $TAG xx* | xargs cat > $(echo $TAG | tr -d '@')
done
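As a check, the whole pipeline can be run end to end against the sample input from the question (assuming GNU csplit; the scratch directory is only there to keep the xx* pieces out of the way):

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# Recreate the sample input from the question.
cat > input.txt <<'EOF'
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
EOF

# One xx* piece per tag line (GNU csplit; '{*}' repeats the pattern).
csplit --quiet -f xx ./input.txt '/^@/' '{*}'

# For every distinct tag, concatenate the pieces that mention it.
TAGS=$(grep -o '@[^ ]*' ./input.txt | sort -u)
for TAG in $TAGS
do
    grep -l "$TAG" xx* | xargs cat > "$(echo "$TAG" | tr -d '@')"
done
```

This produces ABC-0000, ABC-1111 and ABC-2222 matching the expected output above, but note it also writes a file for every other tag (e.g. ANYTAG); if only the ABC files are wanted, build TAGS with grep -o '@ABC-[^ ]*' instead.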
Solution 2:[2]
Assumptions:
- all tag lines start with a "@" in column #1
- all lines that start with a "@" in column #1 are tag lines
One awk idea:
awk '
$1 ~ /^@/ { delete flist # delete array of output files
for (i=1;i<=NF;i++) { # loop through list of tags
if ($i ~ "^@ABC-") { # if tag starts with "@ABC-" then ..
flist[substr($i,2)] # strip off the "@" and save result as name of an output file
}
}
}
{ for (file in flist) # for each file in our array ...
print $0 >> file # append the current line
}
' tag.dat
NOTES:
- as currently coded awk will maintain an open file descriptor for each tag/file processed
- for a smallish number of tags/files this likely won't be a problem for most awk implementations
- if running GNU awk you should be able to maintain a sizeable number of open file descriptors
- if receiving a message that awk has exceeded the max number of open file descriptors, a couple of ideas come to mind:
  - before "delete flist" run "for (file in flist) close(file)"; this will likely slow down the overall speed of the script due to an excessive number of open/close file operations
  - store each tag's data in memory (there are a few ways to do this) and in the END {...} block loop through a master list of tags, performing a single open/write-all-data-from-memory/close operation for each tag; this assumes the entire file can fit in memory
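The first workaround can be sketched as follows, shown against a tiny 2-record input (the scratch directory and sample lines are illustrative):

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# Tiny 2-record input in the same shape as the question file.
cat > tag.dat <<'EOF'
@ABC-1111 @ANYTAG
line1
@ABC-2222 @ABC-1111
line2
EOF

awk '
$1 ~ /^@/ { for (file in flist)          # close descriptors from the previous record
                close(file)
            delete flist                 # then reset the list of output files
            for (i=1;i<=NF;i++)          # collect the "@ABC-" tags on this line
                if ($i ~ "^@ABC-")
                    flist[substr($i,2)]
          }
{ for (file in flist)                    # append the current line to every listed file;
      print $0 >> file                   # ">>" so reopened files are appended to, not truncated
}
' tag.dat
```

Because each file is closed before the list is reset, at most one record's worth of descriptors is ever open at once.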
Results:
for f in ABC-*
do
printf '\n############# %s\n' "$f"
cat $f
done
############# ABC-0000
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
############# ABC-1111
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
############# ABC-2222
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
Solution 3:[3]
$ cat tst.awk
/^@/ { prt(); rec=tags=$0; next }
{ rec=rec ORS $0 }
END { prt() }
function prt( i,n,t) {
n = split(tags,t,/ *@/)
for ( i=2; i<=n; i++ ) {
print rec > (t[i])
}
}
$ awk -f tst.awk file
$ head -50 A*
==> ABC-0000 <==
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
@ABC-0000
Content: anything here
content1
content2
==> ABC-1111 <==
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
==> ABC-2222 <==
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
==> ANYTAG <==
@ABC-1111 @ANYTAG
Content: description
content1
content2
@ABC-2222 @ABC-0000 @ANYTAG
Content: another description
content1
content2
@ANYTAG @ABC-1111 @ABC-0000
Content: yet another description
content1
content2
The above assumes you either don't have so many output files as to trigger the too many open files error or are using an awk version such as GNU awk that can handle arbitrary numbers of simultaneously open files. If that's not the case then change print rec > (t[i]) to print rec >> (t[i]); close(t[i]).
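With that change applied the script looks like this, exercised here on a tiny illustrative input (the scratch directory and sample lines are not from the question):

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# Small 2-record input in the same shape as the question file.
cat > file <<'EOF'
@ABC-1111 @ANYTAG
line1
@ABC-2222
line2
EOF

cat > tst.awk <<'EOF'
/^@/ { prt(); rec=tags=$0; next }
{ rec=rec ORS $0 }
END { prt() }
function prt( i,n,t) {
    n = split(tags,t,/ *@/)
    for ( i=2; i<=n; i++ ) {
        print rec >> (t[i])      # append: the file may already exist from an earlier record
        close(t[i])              # close immediately so at most one descriptor stays open
    }
}
EOF

awk -f tst.awk file
```

The trade-off is one open/close pair per record per tag instead of one long-lived descriptor per tag, which is slower but bounded.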
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gavin Haynes |
| Solution 2 | |
| Solution 3 | Ed Morton |
