Compare PDF Content With Ruby

I am writing a Ruby script/app that helps me compile LaTeX to (at least) PDF. One feature I want is for it to run pdflatex repeatedly until the PDF converges (as it should, I guess).

The idea is to compare the PDF generated in one iteration against the one from the previous iteration using their fingerprints. In particular, I currently use Digest::MD5.file(.).

The problem is that this never converges. A (hopefully the) culprit is the PDF's timestamp, which pdflatex sets with at least one-second resolution. Since a pdflatex run typically takes longer than one second, the result keeps changing. That is, I expect the PDFs to be equal up to the timestamp(s) after some point. This assumption might be wrong; hints appreciated.

What can I do about this? My basic ideas so far:

  • Use a library capable of doing the job
  • Strip metadata away and only hash PDF content
  • Overwrite timestamps by a fixed value before comparing
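
The third idea could be sketched as follows. This is only a rough sketch under assumptions: it assumes the volatile fields are stored uncompressed in the file as /CreationDate (D:...), /ModDate (D:...) and a trailer /ID [<...> <...>] entry, and it blanks them out before hashing. The names VOLATILE_PDF_FIELDS, normalize_pdf_bytes and normalized_pdf_digest are made up for illustration.

```ruby
require "digest"

# Assumed patterns for fields that change on every run. If the PDF stores
# them differently (e.g. compressed), these will simply not match.
VOLATILE_PDF_FIELDS = [
  %r{/(?:CreationDate|ModDate)\s*\(D:[^)]*\)},
  %r{/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]}
].freeze

# Returns a copy of the raw PDF bytes with the volatile fields removed.
def normalize_pdf_bytes(data)
  VOLATILE_PDF_FIELDS.reduce(data) { |bytes, pattern| bytes.gsub(pattern, "") }
end

# MD5 fingerprint of a PDF file, ignoring the fields matched above.
def normalized_pdf_digest(path)
  Digest::MD5.hexdigest(normalize_pdf_bytes(File.binread(path)))
end
```

Two iterations would then be compared with normalized_pdf_digest("old.pdf") == normalized_pdf_digest("new.pdf") instead of comparing Digest::MD5.file results directly.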

Do you have more ideas or even solutions? Solutions should only use free software that runs on Linux. Solutions that use only Ruby are preferred, but using external software is perfectly acceptable.

By the way, I do not exactly know how PDF is encoded but I suspect that merely comparing the contained text won't work for me since only graphics or links might change in later iterations.

Solution 1:[1]

[Disclaimer: I'm the author of Identikal]

For a project we had a requirement to compare two PDFs in pure Ruby. We ended up writing a gem called identikal. This gem compares two unencrypted PDF files and returns true if they are identical and false otherwise.

Once you install the gem you can compare two PDFs as shown below:

$ identikal file_a.pdf file_b.pdf
true

Solution 2:[2]

This isn't an answer to your question, but are you familiar with latexmk? It's a Perl script that does exactly what you're after, but achieves it in a very different way. It examines the .log and .aux files left behind by each tex run and applies heuristics to decide what needs to happen in each case (which may be more complicated than simply re-running tex: makeindex or xindy may need to be run as well).

You could either mimic its usage (although with 3546 sloc, I don't particularly recommend it) or simply call it from your Ruby script/app.
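
Calling it from Ruby could look roughly like this. It is a minimal sketch that assumes latexmk is installed and on the PATH; latexmk_command and compile_with_latexmk are hypothetical helper names.

```ruby
# Argument list for latexmk; -pdf asks it to build a PDF with pdflatex.
def latexmk_command(tex_file)
  ["latexmk", "-pdf", tex_file]
end

# Runs latexmk and returns true on success, false on a non-zero exit
# status, or nil if the command could not be executed at all.
def compile_with_latexmk(tex_file)
  system(*latexmk_command(tex_file))
end
```

Passing the arguments as separate strings avoids going through a shell, so file names with spaces need no extra quoting.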

Solution 3:[3]

Since a latex run does not have access to its previous runs and depends only (besides system parameters such as the current time) on the text files generated (such as tex, aux, bib, ...), the resulting PDF file converges once all those text files converge (disregarding the dependency on system parameters such as time).

In short, you should check the convergence of the text files (tex, aux, bib, ...) rather than the convergence of the pdf file.

  1. Make directory A, where you run latex.
  2. Make directory B, where you keep a copy of the text files resulting from the previous latex run.
  3. Run latex within A
  4. If the contents of all the files in B are the same as the contents of the corresponding files in A, then stop. Otherwise, copy all the text files generated in A (aux, bib, ...) to B, excluding the original tex file if you know that it didn't change. You can also exclude log from the copy list. And then, return to 3.
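
A variation on the steps above, sketched in Ruby: instead of copying files to a second directory B, it keeps MD5 digests of the generated files in memory and compares them after each run. The list of extensions, the names auxiliary_digests and run_until_stable, and the max_runs limit are all assumptions for illustration.

```ruby
require "digest"

# Extensions of generated files to watch; an assumption, adjust per project.
WATCHED_EXTENSIONS = %w[aux toc lof lot bbl idx out].freeze

# Maps each watched file name in dir to the MD5 digest of its contents.
def auxiliary_digests(dir)
  pattern = File.join(dir, "*.{#{WATCHED_EXTENSIONS.join(',')}}")
  Dir.glob(pattern).sort.to_h do |path|
    [File.basename(path), Digest::MD5.file(path).hexdigest]
  end
end

# Calls the block (one latex run) until the watched files stop changing.
# Returns the number of runs performed, or nil if max_runs was reached
# without the files becoming stable.
def run_until_stable(dir, max_runs: 5)
  previous = auxiliary_digests(dir)
  max_runs.times do |i|
    yield
    current = auxiliary_digests(dir)
    return i + 1 if current == previous
    previous = current
  end
  nil
end
```

It might be used as run_until_stable(".") { system("pdflatex", "-interaction=nonstopmode", "main.tex") }. The max_runs cap guards against documents whose auxiliary files never settle.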

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jahangir
Solution 2 mbauman
Solution 3