Removing duplicate lines with sed in batch

I'm cleaning up a few hundred files on a Windows machine, and one of the things I need to do is remove some duplicate lines. An example file might look like this:

foo=false
bar=true
baz=false
baz=false
baz=false

So in working with sed I came across this site that showcased a line that removes duplicate lines.

sed "$!N; /^\(.*\)\n\1$/!P; D" textfile.txt

So I went and plugged it into a command window to see if it works and the console window showed the duplicate lines removed. After that I plugged that line into my batch script to run it against my list of files that needed to be edited.

FOR /F %%a IN (listfile.txt) DO (
  sed "$!N; /^\(.*\)\n\1$/!P; D" %%a
)

But when I ran this against my test list of files it removed every line from the file except for one of the duplicate lines.

I'm not familiar enough with sed to know exactly what that line is doing, but my test showed it doing what I wanted. So what gives? Am I missing something in the way sed works in a batch file?


Based on the comments I tried:

gawk "!a[$0]++" textfile.txt

and once again it works on the command line but not in the script. So there is definitely some issue with the way the batch file is running this command but I'm unable to figure out what that is.



Solution 1:[1]

After doing some more testing on the original sed statement I found that it was getting hung up on the ! in the command. So I started digging along that route and found that EnableDelayedExpansion was causing the ! characters and everything between them to be removed, even inside the quoted sed statement.

So my options were to escape the ! or narrow the scope of the EnableDelayedExpansion. Since escaping didn't seem to be working I just narrowed the scope to right around the specific variable that needed it and then the sed statement seemed to work correctly after that.
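To see why the command breaks, here is a minimal Python sketch of what delayed expansion does to a line. It is a simplified model, not cmd.exe's actual parser: the function name expand_delayed is made up for illustration, and caret escapes and the !var:old=new! forms are deliberately ignored.

```python
import re

def expand_delayed(command, variables=None):
    """Simplified model of cmd.exe delayed expansion on one command line.

    Assumption: caret escapes and !var:old=new! substitutions are ignored.
    """
    variables = variables or {}
    # Each !name! pair is replaced by the variable's value, or by nothing
    # when no variable with that name exists.
    expanded = re.sub(r"!([^!]*)!",
                      lambda m: variables.get(m.group(1), ""),
                      command)
    # Any leftover unpaired ! is dropped.
    return expanded.replace("!", "")

sed_cmd = r'sed "$!N; /^\(.*\)\n\1$/!P; D" file.txt'
print(expand_delayed(sed_cmd))
print(expand_delayed('gawk "!a[$0]++" file.txt'))
```

Under this model sed receives "$P; D", which prints only the last line of the input, and gawk receives "a[$0]++", which has lost its negation. That is consistent with both commands working at the prompt but misbehaving inside a script that has delayed expansion enabled.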

Solution 2:[2]

On Windows, this is straightforward in PowerShell (note that Sort-Object also sorts the lines, so the original order is not preserved):

get-content "textfile.txt" | sort-object -unique

Bill
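The commands in this thread do not all remove the same duplicates. The Python sketch below contrasts the three behaviours on one input. Python is used only for illustration, the function names are hypothetical, and the sorted variant assumes plain lowercase ASCII lines (it ignores the culture and case rules Sort-Object applies).

```python
from itertools import groupby

def drop_adjacent(lines):
    """Collapse runs of identical consecutive lines (the sed one-liner's target)."""
    return [key for key, _ in groupby(lines)]

def drop_repeats(lines):
    """Keep the first occurrence of each line, in original order (gawk '!a[$0]++')."""
    return list(dict.fromkeys(lines))

def sorted_unique(lines):
    """Sort the lines, then keep one of each (roughly Sort-Object -Unique)."""
    return sorted(set(lines))

lines = ["foo=false", "bar=true", "baz=false",
         "bar=true", "baz=false", "baz=false"]

print(drop_adjacent(lines))   # only the final consecutive pair is collapsed
print(drop_repeats(lines))    # foo, bar, baz in their original order
print(sorted_unique(lines))   # bar, baz, foo in sorted order
```

If the duplicates are always consecutive, as in the sample file in the question, the first two give the same result; the sorted variant still reorders the file.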

Solution 3:[3]

To remove duplicate lines with sed, consider the script below. Note that leading and trailing whitespace on each line is trimmed, and leading and trailing blank lines are removed from the output.

#  make an initial mark so that the simple regex also handles a
#+ duplicate on the second line
1{ x; s/^/\n/; x; }
# trimming
s/^\s*//
s/\s*$//
# main
H
x
s/\(\n.*\)\(\n.*\)*\1$/\1\2/
x
# print hold space at the end
$bItsOver
d
:ItsOver
x;
s/^\n*//
s/\n*$//

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Matthew Green
Solution 2 Bill_Stewart
Solution 3 Daniel Bandeira