Unzip to pipe and then run pdfinfo on the files in the stream

I want to unzip a LOT of files, run pdfinfo to get the page count for each file, and then sum those page counts.

I came across a command that will sum the page counts of all PDFs in a directory.

find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'

I then thought to feed that pipeline from unzip -p:

unzip -p '*.zip' | find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'

However, it's not working as I expect it to. I suspect that my unzip stream is interacting poorly with find.

Any thoughts?



Solution 1:[1]

Here is a way to do it that doesn't write anything to the filesystem. This code should work even if any of the filenames in the zip files contain embedded whitespace. The code assumes that filenames ending in "pdf" are valid PDF files.
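First, the reason the original attempt fails: find never reads its standard input, so everything unzip -p writes into the pipe is silently discarded, and find just rescans whatever PDFs already sit on disk. A quick demonstration, using a throwaway directory:

```shell
tmp=$(mktemp -d)            # scratch directory containing no PDFs
cd "$tmp"
# find ignores stdin entirely, so the piped text is discarded and find
# only scans the (empty) directory: this prints nothing.
echo "this text is discarded" | find . -name '*.pdf'
```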

This is the test zip file I'm going to use. Note that the first filename in the zip archive, "zlib 3.pdf", contains a space character.

$ unzip -l aaa.zip 
Archive:  aaa.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    19318  2018-02-19 22:49   zlib 3.pdf
   442780  2018-02-28 15:32   file2.pdf
      757  2018-02-28 15:22   try.sh
---------                     -------
   462855                     3 files

It turns out that pdfinfo can read from stdin, so the commands below show how to get the number of pages from a PDF stored in a zip without writing anything to disk.

$ unzip -p aaa.zip file2.pdf | pdfinfo - | grep Pages
Pages:          94

$ unzip -p aaa.zip "zlib 3.pdf" | pdfinfo - | grep Pages
Pages:          2

For this to work, though, you need to know the names of the PDF files stored in the zip archive.

The next step, then, is to get a list of the PDF files and the names of the zip files they are stored in. That's what this code does:

for zip in *.zip ; do 
    echo "$zip"
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        echo "  '$pdf'" 
    done  
done 

That outputs this for me:

aaa.zip
  'zlib 3.pdf'
  'file2.pdf'

Finally, add the call to pdfinfo and the awk snippet that totals the page counts:

for zip in *.zip ; do 
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        unzip -p "$zip" "$pdf" | pdfinfo - | grep Pages | sed -e "s/Pages:\s*//g"
    done  
done | awk '{ sum += $1;} END { print sum; }'

That outputs 96 for my test zip file.
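As an aside, the grep | sed | awk chain can be collapsed into a single awk invocation that both matches the Pages: lines and sums their second field. A self-contained sketch, with pdfinfo-style output hard-coded in place of the real pipeline:

```shell
# Two fake pdfinfo outputs stand in for the unzip/pdfinfo calls; one awk
# pass filters the "Pages:" lines and accumulates the counts.
printf 'Title: a\nPages: 94\nTitle: b\nPages: 2\n' |
awk '/^Pages:/ { sum += $2 } END { print sum }'
# prints 96
```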

Solution 2:[2]

If disk space is your main concern, this will probably help (note that it prints one total per zip archive rather than a grand total):

for zip in *.zip ; do
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf ; do
        unzip -qo "$zip" "$pdf"
        pdfinfo "$pdf" | sed -n "s/Pages:\s*//p"
        rm "$pdf"
    done | paste -s -d+ - | bc
done

Solution 3:[3]

I had a similar need: extracting .FLAC audio files from a zip archive and converting them to .OPUS on the fly. This worked for me. First I had to make a separate text file of the filenames to extract from each zip archive; there is no easy way around this, since piping works but doesn't pass the file names along. Once you have the list, you just extract/convert each FLAC by name from the zip file, so you know what to name each OPUS file.

This relies on the unzip -p option to pipe output.

for zip in *.zip
  do 
    zipinfo -1 "$zip" | grep 'flac$' > "$zip"_flacs.txt
    printf 'zip: %s\n' "$zip"
    cat "$zip"_flacs.txt | while IFS= read -r flac
      do 
        printf 'extracting %s\n' "$flac"
        unzip -p "$zip" "$flac" | ffmpeg -i - -ab 256k "${flac%.*}.opus"
      done
  done

Don't forget to extract everything else.

for zip in *.zip; do unzip "$zip" -x '*.flac' '*.mp4'; done

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Ljm Dullaart
Solution 3 fieldlab