Search MS Word files in a directory for specific content in Linux
I have a directory structure full of MS Word files and I have to search the directory for a particular string. Until now I was using the following commands to search files in a directory:
find . -exec grep -li 'search_string' {} \;
find . -name '*' -print | xargs grep 'search_string'
But this search doesn't work for MS Word files.
Is it possible to do string search in MS word files in Linux?
Solution 1:[1]
I'm a translator and know next to nothing about scripting but I was so pissed off about grep not being able to scan inside Word .doc files that I worked out how to make this little shell script to use catdoc and grep to search a directory of .doc files for a given input string.
You need to install the catdoc and docx2txt packages.
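#!/bin/bash
echo -e "\n
Welcome to scandocs. This will search .doc AND .docx files in this directory for a given string. \n
Type in the text string you want to find... \n"
# -r keeps backslashes in the search string literal
read -r response
# .doc files: convert with catdoc, label each match with the file name
find . -name "*.doc" |
while IFS= read -r i; do catdoc "$i" |
grep --color=auto -iH --label="$i" -- "$response"; done
# .docx files: convert with docx2txt, label each match with the file name
find . -name "*.docx" |
while IFS= read -r i; do docx2txt < "$i" |
grep --color=auto -iH --label="$i" -- "$response"; done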
All improvements and suggestions welcome!
Solution 2:[2]
Here's a way to use "unzip" to print the entire contents to standard output, then pipe to "grep -q" to detect whether the desired string is present in the output. It works for docx format files.
#!/bin/bash
PROG=$(basename "$0")
if [ $# -eq 0 ]
then
echo "Usage: $PROG string file.docx [file.docx...]"
exit 1
fi
findme="$1"
shift
for file in "$@"
do
unzip -p "$file" | grep -q "$findme"
[ $? -eq 0 ] && echo "$file"
done
Save the script as "inword" and search for "wombat" in three files with:
$ ./inword wombat file1.docx file2.docx file3.docx
file2.docx
Now you know file2.docx contains "wombat". You can get fancier by adding support for other grep options. Have fun.
Solution 3:[3]
The more recent versions of MS Word intersperse a NUL byte (ascii[0]) between the letters of the text, for purposes I cannot yet understand. I have written my own MS Word search utilities that insert ascii[0] between each of the characters in the search string, and it works fine. Clumsy but OK. A lot of questions remain. Perhaps the junk characters are not always the same. More tests need to be done. It would be nice if someone could write a utility that would take all this into account. On my Windows machine the same files respond well to searches. We can do it!
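One likely explanation, which is an assumption here and not something this answer states, is that the text is stored as UTF-16LE, where every plain ASCII letter is followed by a zero byte. If that holds, the search can go the other way round: delete the NUL bytes from the file and grep for the ordinary string. A minimal sketch (contains_text is a hypothetical helper name, and it will miss non-ASCII text):

```shell
#!/bin/bash
# Sketch: report whether a file contains a string once NUL bytes are removed.
# Assumption: the text is UTF-16LE and mostly ASCII, so deleting the NUL
# bytes leaves readable text. contains_text is a hypothetical helper name.
contains_text() {
    local pattern="$1" file="$2"
    # tr deletes every NUL byte; grep -q prints nothing and only sets the exit status.
    tr -d '\000' < "$file" | grep -qi -- "$pattern"
}
```

Used as `contains_text wombat report.doc && echo report.doc`, it prints the file name only when the string is found.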
Solution 4:[4]
In a .doc file the text is generally present and can be found by grep, but that text is broken up and interspersed with field codes and formatting information so searching for a phrase you know is there may not match. A search for something very short has a better chance of matching.
A .docx file is actually a zip archive collecting several files together in a directory structure (try renaming a .docx to .zip then unzipping it!) -- with zip compression it's unlikely that grep will find anything at all.
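To make the zip point concrete: the body text of a .docx normally sits in the archive member word/document.xml, wrapped in XML tags. The sketch below is a rough filter, not a real XML parser; it reads that XML on standard input, turns paragraph ends into line breaks and drops the remaining tags. docx_xml_to_text is a hypothetical helper name, entities such as &amp; are left undecoded, and headers or footnotes stored in other archive members are not covered.

```shell
#!/bin/bash
# Sketch: turn the XML of word/document.xml (read on stdin) into plain text.
# docx_xml_to_text is a hypothetical helper name, not part of any package.
docx_xml_to_text() {
    awk '{
        gsub(/<\/w:p>/, "\n")   # end of a paragraph becomes a line break
        gsub(/<[^>]*>/, "")     # drop every remaining XML tag
        print
    }'
}
```

Assuming unzip is installed, a search then looks like `unzip -p file.docx word/document.xml | docx_xml_to_text | grep -i 'search_string'`. Removing the tags before grepping matters because Word can split a single word across several tags.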
Solution 5:[5]
The open-source command-line utility crgrep will search most MS document formats (I'm the author).
Solution 6:[6]
Have you tried with awk '/Some|Word|In|Word/' document.docx ?
Solution 7:[7]
If it's not too many files, you can write a script that incorporates something like catdoc: http://manpages.ubuntu.com/manpages/gutsy/man1/catdoc.1.html , by looping over each file, performing a catdoc and grep, storing that in a bash variable, and outputting it if it's satisfactory.
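A minimal sketch of that loop, assuming catdoc is installed. search_doc_dir is a hypothetical helper name, and file names containing newlines are not handled:

```shell
#!/bin/bash
# Sketch: list the .doc files under a directory whose text contains a pattern.
# Assumes catdoc is installed; search_doc_dir is a hypothetical helper name.
search_doc_dir() {
    local pattern="$1" dir="${2:-.}" file text
    find "$dir" -type f -name '*.doc' |
    while IFS= read -r file; do
        # Store the converted text in a variable; skip files catdoc rejects.
        text=$(catdoc "$file" 2>/dev/null) || continue
        # Print the file name only when the text matches (case-insensitive).
        if printf '%s\n' "$text" | grep -qi -- "$pattern"; then
            printf '%s\n' "$file"
        fi
    done
}
```

Calling `search_doc_dir wombat ~/Documents` would print one matching file per line. Keeping the text in a variable means each file is converted once, even if you later want to run more than one check on it.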
Solution 8:[8]
If you have a program called antiword installed, you can use this command:
find . -iname "*.doc" -exec sh -c 'antiword "$1" 2>/dev/null | grep -q "string_to_search" && echo "$1"' sh {} \;
Replace "string_to_search" in the above command with your text. The command prints the name(s) of the files containing "string_to_search".
The command is not perfect: it behaves oddly on small files (the result can be unreliable), because for some reason antiword prints this text:
"I'm afraid the text stream of this file is too small to handle."
if the file is small (whatever that means).
Solution 9:[9]
The best solution I came upon was to use unoconv to convert the Word documents to HTML. It also has a .txt output, but that dropped content in my case.
Solution 10:[10]
I've found a way of searching Word files (doc and docx) that uses the preprocessor functionality of ripgrep.
This depends on the following being installed:
- ripgrep (more information about the preprocessor here)
- LibreOffice
- docx2txt
- this catdoc2 script, which I've added to my $PATH:
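#!/bin/bash
# Convert a .doc file to UTF-8 text with LibreOffice and print it to stdout
temp_dir=$(mktemp -d)
# Remove the temporary directory on exit or interruption
trap 'rm -rf "$temp_dir"' 0 2 3 15
libreoffice --headless --convert-to "txt:Text (encoded):UTF8" --outdir "${temp_dir}" "$1" 1>/dev/null
cat "${temp_dir}/$(basename -s .doc "$1").txt"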
The command pattern for a one-level recursive search is:
$ rg --pre <preprocessor> --glob <glob with filetype> <search string>
Example:
$ ls *
one:
a.docx
two:
b.docx c.doc
$ rg --pre docx2txt --glob '*.docx' This
two/b.docx
1:This is file b.
one/a.docx
1:This is file a.
$ rg --pre catdoc2 --glob '*.doc' This
two/c.doc
1:This is file c.
Solution 11:[11]
Here's the full script I use on macOS (Catalina, Big Sur, Monterey). It's based on Ralph's suggestion, but uses the built-in textutil for .doc files.
#!/bin/bash
searchInDoc() {
# in .doc
find "$DIR" -name "*.doc" |
while read -r i; do
textutil -stdout -cat txt "$i" | grep --color=auto -iH --label="$i" "$PATTERN"
done
}
searchInDocx() {
for i in "$DIR"/*.docx; do
#extract
docx2txt.sh "$i" 1> /dev/null
#point, grep, remove
# swap only the trailing .docx extension for .txt
txtExtracted="${i%.docx}.txt"
grep -iHn "$PATTERN" "$txtExtracted"
rm "$txtExtracted"
done
}
askPrompts() {
local i
for i in DIR PATTERN; do
#prompt
printf "\n%s to search: \n" "$i"
#read & assign
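read -e -r REPLY
# assign without eval so spaces and quotes in the answer are safe;
# a leading ~ is expanded by hand because the shell no longer does it
printf -v "$i" '%s' "${REPLY/#\~/$HOME}"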
done
}
makeLogs() {
local i
for i in results errors; do
# extract dir for log name
dirNAME="${DIR##*/}"
# set var
printf -v "${i}LOG" '%s' "$HOME/$i-$PATTERN-$dirNAME.log"
local VAR="${i}LOG"
# warn if it already exists
if [ -f "${!VAR}" ]; then
printf "WARNING: %s will be overwritten.\n" "${!VAR}"
fi
# touch file
touch "${!VAR}"
done
}
checkDocx2txt() {
#see if soft exists
if ! command -v docx2txt.sh 1>/dev/null; then
printf "\nWARNING: docx2txt is required.\n"
printf "Use \e[3mbrew install docx2txt\e[0m.\n\n"
exit 1
else
printf "\n~~~~~~~~~~~~~~~~~~~~~~~~\n"
printf "Welcome to scandocs macOS.\n"
printf "~~~~~~~~~~~~~~~~~~~~~~~~\n"
fi
}
parseLogs() {
# header
printf "\n------\n"
printf "Scandocs finished.\n"
# results
if [ ! -s "$resultsLOG" ]; then
printf "But no results were found."
printf "\"%s\" did not match in \"%s\"" "$PATTERN" "$DIR" > "$resultsLOG"
else
printf "See match results in %s" "$resultsLOG"
fi
# errors
if [ ! -s "$errorsLOG" ]; then
rm -f "$errorsLOG"
else
printf "\nWARNING: there were some errors. See %s" "$errorsLOG"
fi
# footer
printf "\n------\n\n"
}
#the soft
checkDocx2txt
askPrompts
makeLogs
{
searchInDoc
searchInDocx
} 1>"$resultsLOG" 2>"$errorsLOG"
parseLogs
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
