Search MS Word files in a directory for specific content in Linux

I have a directory structure full of MS Word files and I have to search the directory for a particular string. Until now I was using the following commands to search the files in a directory:

find . -exec grep -li 'search_string' {} \;

find . -name '*' -print | xargs grep 'search_string'

But this search doesn't work for MS Word files.

Is it possible to do string search in MS word files in Linux?



Solution 1:[1]

I'm a translator and know next to nothing about scripting, but I was so annoyed that grep can't scan inside Word .doc files that I worked out this little shell script, which uses catdoc and grep to search a directory of .doc files for a given input string.

You need to install the catdoc and docx2txt packages.

#!/bin/bash
echo -e "\n
Welcome to scandocs. This will search .doc AND .docx files in this directory for a given string. \n
Type in the text string you want to find... \n"
read -r response
find . -name "*.doc" |
    while read -r i; do
        catdoc "$i" | grep --color=auto -iH --label="$i" "$response"
    done
find . -name "*.docx" |
    while read -r i; do
        docx2txt < "$i" | grep --color=auto -iH --label="$i" "$response"
    done

All improvements and suggestions welcome!

Solution 2:[2]

Here's a way to use "unzip" to print the entire contents to standard output, then pipe to "grep -q" to detect whether the desired string is present in the output. It works for docx format files.

#!/bin/bash
PROG=$(basename "$0")

if [ $# -lt 2 ]
then
  echo "Usage: $PROG string file.docx [file.docx...]"
  exit 1
fi

findme="$1"
shift

for file in "$@"
do
  unzip -p "$file" | grep -q "$findme" && echo "$file"
done

Save the script as "inword" and search for "wombat" in three files with:

$ ./inword wombat file1.docx file2.docx file3.docx
file2.docx

Now you know file2.docx contains "wombat". You can get fancier by adding support for other grep options. Have fun.
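As a hedged sketch of that "get fancier" idea, here is one way to forward extra grep options: any leading arguments that start with "-" are collected and passed through to grep. The function name `inword` comes from the answer above, but the option-collecting convention is my own addition, not part of the original script.

```shell
# hypothetical extension of the "inword" idea: leading arguments that start
# with "-" (e.g. -i for case-insensitive search) are forwarded to grep
inword() {
    local opts=()
    while [ $# -gt 0 ] && [ "${1#-}" != "$1" ]; do
        opts+=("$1"); shift
    done
    local findme="$1"; shift
    local file
    for file in "$@"; do
        # print the filename whenever the archive's contents match
        unzip -p "$file" 2>/dev/null | grep -q "${opts[@]}" -- "$findme" && echo "$file"
    done
}
```

Usage would then look like `inword -i WOMBAT file1.docx file2.docx`.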

Solution 3:[3]

The more recent versions of MS Word intersperse a NUL byte (ASCII 0) between each letter of the text — which suggests the text is stored as UTF-16, with two bytes per character. I have written my own MS Word search utilities that insert a NUL between each character of the search string, and that works fine. Clumsy, but OK. A lot of questions remain: perhaps the junk characters are not always the same, and more tests need to be done. It would be nice if someone wrote a utility that took all this into account. On my Windows machine the same files respond well to searches. We can do it!
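That NUL-interspersing is exactly what UTF-16 Latin text looks like to a byte-oriented tool: each letter is followed by a zero byte. Assuming the junk bytes really are NULs, a quick alternative to padding the search string is to delete them with tr before grepping. The sample file below is fabricated for illustration:

```shell
# simulate UTF-16-style text: a NUL byte after each letter
printf 'w\0o\0m\0b\0a\0t\0' > sample.bin
# deleting the NULs makes the text visible to ordinary grep again
tr -d '\000' < sample.bin | grep -o 'wombat'
```

The pipeline prints `wombat`, which plain `grep wombat sample.bin` would not find as a text match.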

Solution 4:[4]

In a .doc file the text is generally present and can be found by grep, but that text is broken up and interspersed with field codes and formatting information so searching for a phrase you know is there may not match. A search for something very short has a better chance of matching.

A .docx file is actually a zip archive collecting several files together in a directory structure (try renaming a .docx to .zip then unzipping it!) -- with zip compression it's unlikely that grep will find anything at all.
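You can see this for yourself: the body text of a .docx lives in the archive member word/document.xml, so extracting just that member gives grep uncompressed XML to search. The demo file below is a hypothetical stand-in built with Python's zipfile module, not a real Word document:

```shell
# build a minimal .docx-like archive (hypothetical stand-in for a real file)
python3 - <<'EOF'
import zipfile
with zipfile.ZipFile('demo.docx', 'w') as z:
    z.writestr('word/document.xml', '<w:document><w:t>wombat</w:t></w:document>')
EOF

# extract only the document body to stdout, then grep the plain XML
unzip -p demo.docx word/document.xml | grep -o 'wombat'
```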

Solution 5:[5]

The open-source command-line utility crgrep will search most MS document formats (I'm the author).

Solution 6:[6]

Have you tried awk '/Some|Word|In|Word/' document.docx ?

Solution 7:[7]

If it's not too many files you can write a script that incorporates something like catdoc: http://manpages.ubuntu.com/manpages/gutsy/man1/catdoc.1.html — loop over each file, perform a catdoc and grep, store the result in a bash variable, and output it if it's satisfactory.
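A minimal sketch of that loop, assuming catdoc is installed. The function name `scan_docs` and its two arguments (a directory and a pattern) are my own, not part of the original answer:

```shell
# sketch of the loop described above; catdoc does the actual text extraction
scan_docs() {
    local f match
    for f in "$1"/*.doc; do
        [ -e "$f" ] || continue                 # skip when the glob matched nothing
        match=$(catdoc "$f" | grep -i "$2")     # store the matching lines
        [ -n "$match" ] && printf '%s:\n%s\n' "$f" "$match"
    done
    return 0
}
```

Calling `scan_docs ./reports wombat` would print each matching file name followed by its matching lines.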

Solution 8:[8]

If you have installed program called antiword you can use this command:

find . -iname "*.doc" | xargs -I{} bash -c 'if antiword "$1" | grep -q "string_to_search"; then echo "$1"; fi' _ {}

Replace "string_to_search" in the command above with your text. The command prints the name(s) of the files that contain "string_to_search".

The command is not perfect: it behaves oddly on small files (the result can be untrustworthy), because on files below a certain size antiword prints

"I'm afraid the text stream of this file is too small to handle."

instead of the extracted text.

Solution 9:[9]

The best solution I came upon was to use unoconv to convert the word documents to html. It also has a .txt output, but that dropped content in my case.

http://linux.die.net/man/1/unoconv

Solution 10:[10]

I've found a way of searching Word files (doc and docx) that uses the preprocessor functionality of ripgrep.

This depends on the following being installed:

  • ripgrep (more information about the preprocessor here)
  • LibreOffice
  • docx2txt
  • this catdoc2 script, which I've added to my $PATH:
#!/bin/bash

# convert a .doc to plain text via headless LibreOffice and print it to stdout
temp_dir=$(mktemp -d)
trap 'rm -rf "$temp_dir"' 0 2 3 15

libreoffice --headless --convert-to "txt:Text (encoded):UTF8" --outdir "$temp_dir" "$1" 1>/dev/null
cat "$temp_dir/$(basename -s .doc "$1").txt"

The command pattern for a one-level recursive search is:

$ rg --pre <preprocessor> --glob '<glob with filetype>' <search string>

Example:

$ ls *
one:
a.docx

two:
b.docx  c.doc
$ rg --pre docx2txt --glob '*.docx' This
two/b.docx
1:This is file b.

one/a.docx
1:This is file a.
$ rg --pre catdoc2 --glob '*.doc' This
two/c.doc
1:This is file c.

Solution 11:[11]

Here's the full script I use on macOS (Catalina, Big Sur, Monterey). It's based on Ralph's suggestion, but uses the built-in textutil for .doc files.

#!/bin/bash

searchInDoc() {
    # in .doc
    find "$DIR" -name "*.doc" |
        while read -r i; do
            textutil -stdout -cat txt "$i" | grep --color=auto -iH --label="$i" "$PATTERN"
        done
}

searchInDocx() {
    for i in "$DIR"/*.docx; do
        #extract
        docx2txt.sh "$i" 1> /dev/null
        #point, grep, remove
        txtExtracted="$i"
        txtExtracted="${txtExtracted//.docx/.txt}"
        grep -iHn "$PATTERN" "$txtExtracted"
        rm "$txtExtracted"
    done
}

askPrompts() {
    local i
    for i in DIR PATTERN; do
        #prompt
        printf "\n%s to search: \n" "$i"
        #read & assign
        read -e -r REPLY
        printf -v "$i" '%s' "$REPLY"
    done
}

makeLogs() {
    local i
    for i in results errors; do
        
        # extract dir for log name
        dirNAME="${DIR##*/}"

        # set var
        printf -v "${i}LOG" '%s/%s-%s-%s.log' "$HOME" "$i" "$PATTERN" "$dirNAME"
        local VAR="${i}LOG"

        # warn if it already exists
        if [ -f "${!VAR}" ]; then
            printf "WARNING: %s will be overwritten.\n" "${!VAR}"
        fi

        # touch file
        touch "${!VAR}"
    done
}

checkDocx2txt() {
    # check that docx2txt is installed
    if ! command -v docx2txt.sh 1>/dev/null; then
        printf "\nWARNING: docx2txt is required.\n"
        printf "Use \e[3mbrew install docx2txt\e[0m.\n\n"
        exit
    else
        printf "\n~~~~~~~~~~~~~~~~~~~~~~~~\n"
        printf "Welcome to scandocs macOS.\n"
        printf "~~~~~~~~~~~~~~~~~~~~~~~~\n"
    fi
}

parseLogs() {
    # header
    printf "\n------\n"
    printf "Scandocs finished.\n"

    # results
    if [ ! -s "$resultsLOG" ]; then
        printf "But no results were found."
        printf "\"%s\" did not match in \"%s\"" "$PATTERN" "$DIR" > "$resultsLOG" 
    else
        printf "See match results in %s" "$resultsLOG"
    fi

    # errors
    if [ ! -s "$errorsLOG" ]; then
        rm -f "$errorsLOG"
    else
        printf "\nWARNING: there were some errors. See %s" "$errorsLOG"
    fi

    # footer
    printf "\n------\n\n"
}



# main flow
checkDocx2txt
askPrompts
makeLogs
{
    searchInDoc
    searchInDocx
} 1>"$resultsLOG" 2>"$errorsLOG"
parseLogs

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ijoseph
Solution 2 DanB
Solution 3 Dan
Solution 4 Stephen P
Solution 5 Craig
Solution 6 Marjan Nikolovski
Solution 7 Arcymag
Solution 8
Solution 9 jtpereyda
Solution 10
Solution 11 Tyler2P