Unzip to pipe and then run pdfinfo on the files in the stream
I want to unzip a LOT of files, run pdfinfo to get the page count for each file, and then sum those page counts.
I came across a command that sums the page counts of all PDFs in a directory:
find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'
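To see what the grep | sed | awk stages do in isolation, you can feed them simulated pdfinfo output (the page counts below are made up; the \s in sed assumes GNU sed):

```shell
# Three fake "Pages:" lines standing in for three pdfinfo runs;
# sed strips the label, awk sums the remaining numbers.
printf 'Pages: 3\nPages: 10\nPages: 2\n' \
    | sed -e 's/Pages:\s*//g' \
    | awk '{ sum += $1; } END { print sum; }'
# prints 15
```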
I then thought to pipe that into unzip -p:
unzip -p '*.zip' | find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'
However, it's not working as I expect. I suspect that my unzip stream is interacting poorly with find (which walks the filesystem rather than reading from stdin).
Any thoughts?
Solution 1:[1]
Here is a way to do it that doesn't write anything to the filesystem. This code works even if filenames in the zip files contain embedded whitespace. The code assumes that filenames ending in "pdf" are valid PDF files.
This is the test zip file I'm going to use. Note that the first filename in the zip archive, "zlib 3.pdf", contains a space character.
$ unzip -l aaa.zip
Archive:  aaa.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    19318  2018-02-19 22:49   zlib 3.pdf
   442780  2018-02-28 15:32   file2.pdf
      757  2018-02-28 15:22   try.sh
---------                     -------
   462855                     3 files
It turns out that pdfinfo can read from stdin, so the command below shows how to get the number of pages from a pdf stored in a zip without writing anything to disk.
$ unzip -p aaa.zip file2.pdf | pdfinfo - | grep Pages
Pages: 94
$ unzip -p aaa.zip "zlib 3.pdf" | pdfinfo - | grep Pages
Pages: 2
For this to work though, you need to know the names of the PDF files stored in the zip archive.
The next step then is to get a list of the PDF files and the names of the zip files they are stored in. That's what this code does:
for zip in *.zip ; do
    echo "$zip"
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        echo "    '$pdf'"
    done
done
For my test archive, that outputs:
aaa.zip
    'zlib 3.pdf'
    'file2.pdf'
Finally, add the call to pdfinfo and the awk snippet that totals the page counts:
for zip in *.zip ; do
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        unzip -p "$zip" "$pdf" | pdfinfo - | grep Pages | sed -e "s/Pages:\s*//g"
    done
done | awk '{ sum += $1; } END { print sum; }'
That outputs 96 for my test zip file.
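As an aside, the grep | sed pair can be folded into awk itself, since awk can both match the Pages: line and pick out its numeric field. A sketch of that summing step, fed the two counts from the example above:

```shell
# Match "Pages:" lines and sum their second field in one awk program.
printf 'Pages: 94\nPages: 2\n' | awk '/^Pages:/ { sum += $2; } END { print sum; }'
# prints 96
```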
Solution 2:[2]
If disk space is your main concern, this will probably help:
for zip in *.zip ; do
    # note: the $(...) word-splits, so this breaks on filenames with spaces
    for pdf in $(unzip -l "$zip" | grep 'pdf$' | cut -c31-) ; do
        unzip "$zip" "$pdf"
        pdfinfo "$pdf" | sed -n "s/Pages:\s*//p"
        rm "$pdf"
    done | paste -s -d+ - | bc
done
Solution 3:[3]
I had a similar need: extract .FLAC audio files from a zip archive and convert them to .OPUS on the fly. This worked for me. First I make a separate text file of the filenames to extract from each zip archive; there is no easy way around this, since piping works but doesn't pass file names along. Once you have the list, you extract/convert each FLAC by name from the zip file, so you know what to name each OPUS file.
This relies on unzip's -p option to pipe output.
for zip in *.zip
do
    zipinfo -1 "$zip" | grep flac > "$zip"_flacs.txt
    printf 'zip: %s\n' "$zip"
    cat "$zip"_flacs.txt | while IFS= read -r flac
    do
        printf 'extracting %s\n' "$flac"
        unzip -p "$zip" "$flac" | ffmpeg -i - -ab 256k "${flac%.*}.opus"
    done
done
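The ${flac%.*} parameter expansion strips the shortest suffix matching .* (here, the extension), so the .opus output keeps the original base name. The path below is a made-up example:

```shell
# Strip the extension and append .opus; the rest of the path is untouched.
flac='album/track 01.flac'
echo "${flac%.*}.opus"
# prints album/track 01.opus
```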
Don't forget to extract everything else (the exclusion patterns are quoted so unzip, not the shell, expands them):
for zip in *.zip; do unzip "$zip" -x '*.flac' '*.mp4'; done
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Ljm Dullaart |
| Solution 3 | fieldlab |
