
Converting from DJVU to PDF

Posted July 25, 2007 – 10:58 am by Yakov Shafranovich in Programming

One of the more mundane tasks that faces every publishing business like mine is data conversion. Recently, I have been involved in a major project which seeks to make several hundred titles available in print-on-demand format. Unfortunately, the library that scanned these titles did not use PDF - rather they used a more obscure format called DjVu (see Wikipedia for more information). This format was invented by AT&T Labs (which also invented VNC). It claims to compress scanned pages better than PDF, though in an unusual way. Unlike PDF, which stores most documents in one layer, DjVu actually uses 3 layers - background, foreground and mask. The mask layer usually has the text, the background has the picture of the page it was scanned from and the foreground has the rest. When the scan is originally encoded, the encoder decides which parts of the page go into which layer.

However, in the printing business DjVu is not used - rather everything needs to be in PDF. So in this post I will outline how I was able to successfully convert DjVu files to PDF using freely available tools. But first, here are some things that DID NOT work:
1. Converting DjVu to Postscript and then to PDF - takes too long.
2. Converting only the foreground layer or the mask layer in DjVu - loses some of the data.

Here is the software that is needed for the conversion to take place:
1. DjVuLibre.
2. LibTiff.

If you are running Windows, then you will need Cygwin and a Cygwin version of DjVuLibre. The compiled Windows version does not include TIFF support (although you can get a build that includes TIFF support from a site in Russia). LibTiff comes in a native Windows version.

Here are the conversion steps:
1. Convert DjVu file to TIFF using ddjvu.
2. Convert TIFF to PDF using tiff2pdf.

Assuming the input DjVu file is called “input.djvu”, here are the commands:

ddjvu -verbose -format=tiff input.djvu output.tif
tiff2pdf -o output.pdf output.tif
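
With several hundred titles to process, the two commands can be wrapped in a small shell function and run over a whole directory. The following is only a sketch: the function name convert_djvu_dir is made up for illustration, and it assumes that ddjvu and tiff2pdf are on the PATH and that the intermediate TIFF can be deleted once the PDF has been written.

```shell
#!/bin/sh
# Sketch: convert every .djvu file in a directory to PDF via an intermediate TIFF.
# convert_djvu_dir is a hypothetical helper name, not part of DjVuLibre or LibTiff.
convert_djvu_dir() {
    dir=${1:-.}
    for src in "$dir"/*.djvu; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$src" ] || continue
        base=${src%.djvu}
        # Step 1: render the DjVu document (all layers) to TIFF.
        ddjvu -format=tiff "$src" "$base.tif" || return 1
        # Step 2: wrap the TIFF in a PDF.
        tiff2pdf -o "$base.pdf" "$base.tif" || return 1
        # Assumption: the TIFF is only an intermediate and can be removed.
        rm -f "$base.tif"
    done
}
```

Calling convert_djvu_dir /path/to/scans would leave one PDF next to each DjVu file. The function stops at the first file that fails so that a broken scan does not go unnoticed in a long batch.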

The ddjvu utility has an option (-mode) to convert specific layers. One common mistake is to convert only the mask layer or the foreground layer. Technically speaking, the mask layer is the one that should have the actual text, but in practice I have seen that the DjVu encoder occasionally puts portions of the text in the background layer. Thus, if you only take the foreground or mask layers, you will lose those bits in the background. If your specific files don’t have that issue, then you should use the layer switch since it reduces file size and increases readability.
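
To check whether a particular file is safe to convert from a single layer, it can help to produce both versions and compare them side by side. The sketch below uses the -mode option of ddjvu (color, black, mask, foreground or background); the function name djvu_layer_to_pdf and the output naming scheme are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: convert one DjVu file to PDF using a chosen ddjvu render mode.
# djvu_layer_to_pdf is a hypothetical helper name.
djvu_layer_to_pdf() {
    src=$1
    # color renders all layers; mask, foreground and background pick one.
    mode=${2:-color}
    case $mode in
        color|black|mask|foreground|background) ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    base=${src%.djvu}
    # The mode is put in the output name so several runs can be compared.
    ddjvu -format=tiff -mode="$mode" "$src" "$base-$mode.tif" || return 1
    tiff2pdf -o "$base-$mode.pdf" "$base-$mode.tif" || return 1
    rm -f "$base-$mode.tif"
}
```

For example, running djvu_layer_to_pdf input.djvu color and then djvu_layer_to_pdf input.djvu mask gives input-color.pdf and input-mask.pdf. If the mask version shows all of the text, the smaller file is probably good enough; if pieces are missing, stay with the full conversion.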
