PowerShell search script that ignores binary files
I am really used to doing grep -iIr on the Unix shell, but I haven't been able to find a PowerShell equivalent yet.
Basically, the above command searches the target folders recursively and ignores binary files because of the "-I" option. This option is equivalent to the --binary-files=without-match option, which says "treat binary files as not matching the search string".
So far I have been using Get-ChildItem -r | Select-String as my PowerShell grep replacement, with the occasional Where-Object added. But I haven't figured out a way to ignore all binary files like the grep -I command does.
How can binary files be filtered or ignored with PowerShell?
So for a given path, I only want Select-String
to search text files.
EDIT: A few more hours on Google produced this question: "How to identify the contents of a file is ASCII or Binary". The question says "ASCII", but I believe the writer meant "text encoded", as I do.
EDIT: It seems that an isBinary() function needs to be written to solve this issue, probably as a C# command-line utility to make it more useful.
EDIT: It seems that what grep is doing is checking for an ASCII NUL byte or an overlong UTF-8 sequence. If either exists, it considers the file binary. The NUL check is a single memchr() call.
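The NUL-byte part of that heuristic can be sketched in PowerShell roughly as follows. This is a minimal illustration, not what grep actually runs: the function name Test-HasNulByte and the 1024-byte sample size are assumptions made for the example, and the UTF-8 check is left out entirely.

```powershell
# Hypothetical helper (name and sample size are illustrative assumptions):
# returns $true if a NUL byte appears in the first $SampleSize bytes of a file.
function Test-HasNulByte
{
    param (
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $SampleSize = 1024
    )

    # Resolve to a full path, because .NET does not follow PowerShell's current location
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $stream = [System.IO.File]::OpenRead($fullPath)
    try
    {
        $buffer = New-Object byte[] $SampleSize
        $read = $stream.Read($buffer, 0, $buffer.Length)
    }
    finally
    {
        $stream.Dispose()
    }

    # One linear scan over the bytes actually read, in the spirit of memchr()
    return ([System.Array]::IndexOf($buffer, [byte]0, 0, $read) -ge 0)
}
```

Keep in mind that UTF-16 and UTF-32 text normally contains NUL bytes, so a check like this on its own would flag such files as binary.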
Solution 1:[1]
Ok, after a few more hours of research I believe I've found my solution. I won't mark this as the answer though.
Pro Windows PowerShell had a very similar example. I had completely forgotten that I had this excellent reference. Please buy it if you are interested in PowerShell. It went into detail on Get-Content and Unicode BOMs.
This answer to a similar question was also very helpful with the Unicode identification.
Here is the script. Please let me know if you know of any issues it may have.
# The file to be tested
param ($currFile)

# encoding variable
$encoding = ""

# Get the first 1024 bytes from the file. @() keeps the result an array
# even for empty or one-byte files. On PowerShell 6 and later, use
# -AsByteStream in place of -Encoding Byte.
$byteArray = @(Get-Content -Path $currFile -Encoding Byte -TotalCount 1024)

# First four bytes as two-digit hex, so that 0x00 becomes "00" and not "0"
$bom = -join ($byteArray | Select-Object -First 4 | ForEach-Object { "{0:X2}" -f $_ })

if( $bom.StartsWith("EFBBBF") )
{
    # Test for UTF-8 BOM
    $encoding = "UTF-8"
}
elseif( $bom -eq "FFFE0000" )
{
    # Test for the UTF-32; this must come before UTF-16,
    # because the UTF-16 BOM (FFFE) is a prefix of this one
    $encoding = "UTF-32"
}
elseif( $bom -eq "0000FEFF" )
{
    # Test for the UTF-32 Big Endian
    $encoding = "UTF-32 BE"
}
elseif( $bom.StartsWith("FFFE") )
{
    # Test for the UTF-16
    $encoding = "UTF-16"
}
elseif( $bom.StartsWith("FEFF") )
{
    # Test for the UTF-16 Big Endian
    $encoding = "UTF-16 BE"
}

if($encoding)
{
    # File is text encoded
    return $false
}

# So now we're done with text encodings that commonly have '0's
# in their byte streams. ASCII may have the NUL or '0' code in
# its stream, but that's rare apparently.
# Both GNU Grep and Diff use variations of this heuristic
if( $byteArray -contains 0 )
{
    # Test for binary
    return $true
}

# This should be ASCII encoded
$encoding = "ASCII"
return $false
Save this script as isBinary.ps1.
This script correctly classified every text and binary file I tried.
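One way to plug the script into the Get-ChildItem | Select-String pipeline from the question is sketched below. The function name Find-InTextFiles and its parameters are hypothetical, introduced only for this example. It assumes isBinary.ps1 has been saved in the current directory (or that its path is passed in), and that Get-ChildItem -File is available, which requires PowerShell 3.0 or later.

```powershell
# Hypothetical wrapper: a rough stand-in for "grep -iIr".
# Assumes isBinary.ps1 returns $true for binary files and $false for text files.
function Find-InTextFiles
{
    param (
        [Parameter(Mandatory = $true)] [string] $Pattern,
        [string] $Path = ".",
        [string] $IsBinaryScript = ".\isBinary.ps1"
    )

    Get-ChildItem -Path $Path -Recurse -File |
        # Keep only the files that the script does not classify as binary
        Where-Object { -not (& $IsBinaryScript $_.FullName) } |
        # Select-String is case-insensitive unless -CaseSensitive is given
        Select-String -Pattern $Pattern
}
```

A call such as Find-InTextFiles -Pattern "TODO" -Path C:\src would then search only the files classified as text. The script is invoked once per file, so on very large trees it may be worth filtering by extension first.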
Solution 2:[2]
I agree that the other answers are more "complete", but because I do not know what file extensions I will encounter within a folder and I want to look through them all, this is the easiest solution for me.
How about, instead of avoiding binary files, you just ignore the errors that you get from searching through them?
It doesn't take long to run a search even if there are binary files within the folder being searched.
In the end, all that you care about is the strings that match the pattern (and there is next to no chance of finding a string that matches the pattern inside a binary file).
GCI -Recurse -Force -ErrorAction SilentlyContinue | ForEach-Object { GC $_.FullName -ErrorAction SilentlyContinue | Select-String -Pattern "Pattern" } | Out-File -FilePath C:\temp\grep.txt -Width 999999
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Community |
Solution 2 | Scott Ferrell |