How do I make a PDF from a book?

While eBooks are great for travel, I love the feel of holding a physical book. The five bookshelves at Turtle House are all sagging with the weight of the books we have shelved on them. Some aspects of physical books, however, can really hold us back:

  • Searching: If I want to find where a particular book mentioned a particular family, the index might or might not be able to help me!
  • Lending: In normal times, lending a friend a book in the same city is as easy as dropping it off. During the COVID-19 pandemic, though, that is a very iffy proposition. Also, many of my friends live in other countries.
  • Translating: If I am using an Afrikaans book as a source, I will probably need to run it through Google Translate. That can mean a fair amount of typing to find the right section.

To resolve these problems, I’ve produced PDFs from some of the older books I have on hand. Some of the steps, however, can be a bit complex, so I thought I would share some of the ways I have addressed them.

To photograph or to scan?

Flatbed scanners are frequently found in offices, but they seem less common in homes. Today one can easily acquire a high-resolution scanner for less than $100; essentially any scanner on the market can achieve a resolution of 600 dpi (dots per inch), which is probably the highest dpi at which you would want to scan a book page. My Canon CanoScan 8600F may date from 2007, but it’s still capable of 4800 dpi! If you have a flatbed scanner, it’s the most common route to book scanning for these reasons:

  • Even lighting: The scanner will move a light bar slowly across the document, with the CCD (detector) moving in sync. As a result, all parts of the page will be uniformly lit in the resulting image.
  • Sharp focus: The CCD knows that the image it’s trying to capture is just above the glass, and so very small details can still be clear.
  • High resolution: To help you visualize how much detail a 600 dpi scan captures, I am showing a sentence from a book on my shelf:
A sentence from Mrs. Frisby and the Rats of NIMH, by Robert C. O’Brien (1971) shows the detail of 600 dpi (when image is viewed at full size).

I would offer a bit of advice for getting top performance from your scanner. First off, if the lid has any weight to it, that can help you ensure that as much of your image is flat on the glass as possible. Any parts of the page that aren’t flat on the glass will likely be darker and less focused. Some scanner lids are on hinges that can be raised so that the lid is pressing more or less uniformly across the book.

In the top scan, only one side of the book is wearing the weight of the scanner lid. On the bottom, both sides are carrying that weight for a more uniform scan. The difference is that the hinge has been lifted for the bottom image.

In most cases, the text pages of books are entirely in black and white. By scanning them in grayscale, you can save quite a lot of space (one byte per pixel instead of three bytes). At 600 dpi, it is quite likely that you will be able to resolve the black dots used to half-tone printed black-and-white photographs. That is less likely to be true at 300 dpi or less. It is okay to use a color scan only on the pages that actually use color! Mixing grayscale and color images won’t cause problems for the later steps.
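To put rough numbers on that saving, here is a small Python sketch of the arithmetic. The 6 x 9 inch page size is my own assumption for illustration, and the figures are for raw pixels, before any file format compresses them.

```python
def scan_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Uncompressed size, in bytes, of one scanned page."""
    pixels_wide = round(width_in * dpi)
    pixels_high = round(height_in * dpi)
    return pixels_wide * pixels_high * bytes_per_pixel


# Hypothetical 6 x 9 inch page scanned at 600 dpi.
gray = scan_bytes(6, 9, 600, 1)    # grayscale: one byte per pixel
color = scan_bytes(6, 9, 600, 3)   # color: three bytes per pixel
print(f"grayscale {gray / 1e6:.1f} MB, color {color / 1e6:.1f} MB")
```

Dropping from 600 dpi to 300 dpi cuts either figure to a quarter, since both the width and the height in pixels are halved.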

When scanning books, in particular, you are likely to encounter real challenges when the book binding does not allow the pages to be opened completely flat (as seen in the areas closest to the middle in the image below). The effect this has on the resulting scan can be very problematic, depending on the extent to which the book lifts away from the glass and how closely the publisher has crowded the text to the inner edges of the page. The following image from Cassell’s Concise French-English Dictionary illustrates the problem nicely.

If the middle of the book is raised from the glass, it will appear darker and unfocused. Image from Cassell’s French-English Dictionary (1968).

The raised part of the text is further from the glass, so it receives less light, is not at the right distance for sharp focus, and receives a distorting curve from perspective. In the image above, I have added some other problems that may affect image quality. The book has been held in straight alignment by abutting the lower edge of the pages against one edge of the scanner surface. As a result, the bottoms of those pages are cropped slightly. Next, if I am scanning hundreds of pages, I want each scan to go quickly. If I’m doing just one scan, I can use the “Preview” feature to select the part of the flatbed that contains my book. If I’m doing a whole book, I’m likely to skip the Preview step and just scan the entire flatbed. That increases the likelihood that I’ll have a black or white region surrounding my pages. It’s also quite likely that the image has been turned by 90 degrees. We’ll deal with both these problems in the next section.

Some books are simply not going to be a good fit for a flatbed scanner. You might be pressed into using a camera or <shudder> even a cell phone to photograph a book. Why might you choose this strategy?

  1. You may have only a short time with the book in question, or it may be so many pages that scanner processing would simply take too long (the best case for me is around a minute a page with my flatbed).
  2. The size of pages is too large; some coffee-table books, for example, are much bigger than A4/Letter size, and your scanner may not have a big enough bed to capture a whole page.
  3. You don’t have a scanner or can’t bring the book to it (person sitting in a hidden alcove at the special collections library, I am looking at you).

This is the set of things I worry about for photographing book pages:

  • I don’t have enough pixels. If I use my Canon EOS-M 100, I have a 24 megapixel sensor (6000 x 4000 pixels). If a two-page spread fills 80% of the width of the frame and each page measures 6.5″ wide, I am effectively producing an image of roughly 370 dpi: (6000 pixels x 80%) / (6.5″ x 2 pages per image).
  • I don’t have enough light. Ideally, I would be able to photograph the book with a great light source such as the sun. If I am relying on a lamp inside, I am going to need a large sensor to capture as much as I can. This is the category where having at least a “mirrorless” or better camera is important. A “point-and-shoot” or cell phone is going to have a much smaller sensor to capture light, meaning it will need a higher ISO to make an effective photo. How steady can you hold your phone while touching the button on-screen? See the image below for an example of the blurring that results from a camera shot of an open book.
  • My pages are trapezoids. Your only chance to get a perfectly rectangular page is to position the camera on the vector pointing directly up from the center of the page. We’re going to miss that, generally, so your pages will be photographed a bit to one side or the other. If you shoot both pages in one image, they’ll look like a butterfly, as in the example below.
This snapshot from my Samsung J-4’s 13 MP camera illustrates the reduced resolution and light characteristics of a mobile phone. Image from Oudtshoorn and Its Farms (1913) by Goodlonton.
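The pixel arithmetic in the first worry is easy to repeat for your own camera and book. This is only a sketch: the function name and its parameters are mine, and it ignores lens softness and focus, so treat the result as a best case.

```python
def effective_dpi(sensor_px_wide, frame_fraction, page_width_in, pages_per_shot=2):
    """Approximate dots per inch captured across the width of a page."""
    usable_px = sensor_px_wide * frame_fraction
    return usable_px / (page_width_in * pages_per_shot)


# Two 6.5 inch pages filling 80% of a 6000 pixel wide frame.
print(round(effective_dpi(6000, 0.80, 6.5)))
# Shooting one page at a time doubles the figure.
print(round(effective_dpi(6000, 0.80, 6.5, pages_per_shot=1)))
```

The first figure comes out just under 370 dpi. Shooting single pages is the cheapest way to get back toward scanner-like resolution.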

I offer a couple of more “advanced” strategies for this type of photography. It may be that the pages of the book are bending a fair bit when opened; this bending will complicate text recognition at a later step. Try to find a small pane of glass (such as the glass sheet found in a photo frame). Lay the glass with its edge in the middle of the book, covering the individual page you want to photograph. Now that page should be mostly flat. If your light source is in the wrong position, you may now need to worry about glare / reflections! Once you have a book cradle, the glass, and perhaps a weight on the opposite page to hold it out of the way, you have what we call “infrastructure.” It may become simpler to shoot all the even pages and then all the odd pages once you have this cumbersome stuff in place. It’s okay to shoot out of order; we can straighten that out later.
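Straightening out an out-of-order shoot is mostly bookkeeping. The sketch below assumes you shot one page per image, kept the odd-page files and the even-page files as two separate lists in shooting order, and did not skip any pages. The camera file names are hypothetical.

```python
import os


def page_names(odd_shots, even_shots, first_odd_page=1, width=3):
    """Map camera files, shot as all odds then all evens, to page-numbered names."""
    names = {}
    for i, shot in enumerate(odd_shots):
        ext = os.path.splitext(shot)[1]
        names[shot] = f"p{first_odd_page + 2 * i:0{width}d}{ext}"
    for i, shot in enumerate(even_shots):
        ext = os.path.splitext(shot)[1]
        names[shot] = f"p{first_odd_page + 1 + 2 * i:0{width}d}{ext}"
    return names


# Hypothetical camera file names: two odd pages first, then two even pages.
plan = page_names(["IMG_0001.jpg", "IMG_0002.jpg"],
                  ["IMG_0003.jpg", "IMG_0004.jpg"])
for old, new in plan.items():
    print(old, "->", new)
```

I would print and check the plan before renaming anything. Once it looks right, each pair can be passed to os.rename.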

Image manipulation

Whether your images are coming from a scanner or a camera, you will almost assuredly have some image manipulation to handle. For the last several years, I have made considerable use of Paint.NET, a free bitmap editor for Microsoft Windows that handles almost everything I need. Some folks may be scandalized that I would choose something so simple and would argue forcefully for GIMP or other software. This is an area with a huge number of good options, so find the one you want.

The following breaks down the most common steps you’ll need.

I use Paint.NET software every week.

Reorientation and Rotation

Frequently our cameras and cell phones get confused about which side of a picture is “up.” Helpfully, almost every bitmap editor in existence can manage a 90 degree or 180 degree rotation. It’s likely, though, that some images will require a smaller tilt this way or that to keep the lines of text straight (but heaven help you if you’re dealing with trapezoids from an off-angle photograph). For many software tools, a tilt of 1 degree or even 1/2 a degree is a different option than a 90 degree turn. In Paint.NET, you can handle fine tilts in Layers: Rotate while the large scale turns are in Image: Rotate. While turning by 90 degrees does not involve image detail loss, you don’t want to repeatedly adjust your image by 1 degree increments. If you don’t get the fine rotation right on your first try, undo it and then try a different amount of rotation.
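One way to land closer on the first try is to measure the tilt rather than guess it. Most editors show the pixel position of the cursor, so you can note the coordinates at the two ends of a line of text and convert them to an angle. This is a sketch of that idea, not a feature of Paint.NET. The sample coordinates are made up, and whether you then rotate by the positive or the negative angle depends on your editor's convention.

```python
import math


def tilt_degrees(x1, y1, x2, y2):
    """Angle of the line through two pixel positions, relative to horizontal."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


# Hypothetical measurement: a line of text drops 20 pixels over a 2000 pixel run.
print(f"{tilt_degrees(100, 500, 2100, 520):.2f} degrees")
```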

Cropping

It seems that bitmap editors have come to a common mechanism for cropping the image. Select the tool that looks like a dotted line in a rectangle. Click down on a point in the image that represents the upper-left corner, then drag down to the lower-right corner. Let go of the button. Now use the “Crop” option (Image: Crop to Selection in Paint.NET).

Saving and Renaming

If you told your scanner that you wanted to capture a grayscale image of the page, the software probably uses a byte for each pixel of the image; if you specified a color scan, it probably uses three bytes for each pixel. Saving a 600 dpi 24-bit (three byte) version of the image will eventually take quite a lot of space, especially if you are using a format like BMP or TIFF without compression. I frequently save in PNG format at first, expecting to reduce the size of the files at a later step.

Books that contain color images are not printed in millions of colors, though they may appear that way at a distance. A 600 dpi scan of a page should be able to resolve the three or four different colors used for “offset printing” of color images. At 600 dpi, you can generally use a reduced palette format like GIF to save the images with very little loss of quality. At 300 dpi, your image is less likely to separate the dots of ink.

I would recommend that you save each image with a name expressing its position in the book, such as p035.png; the leading zero is important to ensure that p35 does not erroneously end up between p349 and p350! Let the operating system sort them for you. It’s also easier in this way to detect that you had multiple images of a particular page.
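If you already have a folder of unpadded names, the leading zeros can be added after the fact. The sketch below shows the sorting problem and a helper that fixes it. The helper name and the three-digit width are my own choices, so widen it for a book that runs past 999 pages.

```python
import re


def pad_name(name, width=3):
    """Rewrite 'p35.png' as 'p035.png' so plain alphabetical sorting works."""
    match = re.fullmatch(r"p(\d+)(\.\w+)", name)
    if match is None:
        return name  # leave unrecognized names alone
    return f"p{int(match.group(1)):0{width}d}{match.group(2)}"


unpadded = ["p4.png", "p35.png", "p349.png", "p350.png"]
print(sorted(unpadded))                       # p35 lands between p349 and p350
print(sorted(pad_name(n) for n in unpadded))  # true page order
```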

ImageMagick can pull off some serious wizardry to help you!

Palette Reduction and Normalizing

If you are working with a black and white page of text, you’ll want letters to be as dark as possible and their background to be as light as possible. In many cases, this would require fiddling with the “brightness and contrast” or “gamma curves” for an image. I frequently make use of ImageMagick to batch process a large number of scanned images all at once. Its “-normalize” option in “mogrify” mode causes the software to push the light and dark parts of the image away from each other. It can be convenient to read images in a lossless bitmap format (whether TIFF or BMP or PNG) while writing normalized images in a reduced palette format such as GIF.
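A batch run might look like the sketch below, which builds the mogrify command in Python so it can be inspected before anything is touched. It assumes ImageMagick is installed and that your pages are named like p035.png. On ImageMagick 7 installs the program may need to be called as "magick mogrify", which is what the prefix parameter (my own addition) is for.

```python
import glob
import shutil
import subprocess


def normalize_cmd(files, out_format="gif", prefix=("mogrify",)):
    """Build an ImageMagick command that normalizes each file and writes it in out_format."""
    return [*prefix, "-normalize", "-format", out_format, *files]


pages = sorted(glob.glob("p*.png"))   # assumes the p035.png naming scheme
cmd = normalize_cmd(pages)
print(" ".join(cmd))                  # inspect the command before running it

# Run only when there is something to process and ImageMagick is on the PATH.
if pages and shutil.which(cmd[0]):
    subprocess.run(cmd, check=True)
```

With "-format gif", mogrify should write a new p035.gif beside each PNG rather than overwriting the original, but I would still try it on a copy of the folder first.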

Concatenating to PDF

A directory of images is less convenient for reading than a PDF that stitches them all together. Most Windows computers will have the ability to “print” a directory of images to PDF without any added software. This option can have some undesirable outcomes, though, such as cropping bits off the edges and padding them to A4 or Letter size. Again, I have found ImageMagick to be quite helpful. Its “convert” mode with the “-adjoin” option can create a PDF from a series of images, even specifying the DPI at which it should be viewed.
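Here is a hedged sketch of that convert call, again built in Python so the page order is explicit. The output name book.pdf and the 600 dpi value are assumptions; use the resolution you actually scanned at. On Windows a bare "convert" can resolve to an unrelated system tool, so there I would pass prefix=("magick",) for an ImageMagick 7 install.

```python
import glob
import shutil
import subprocess


def pdf_cmd(images, pdf_path, dpi=600, prefix=("convert",)):
    """Build an ImageMagick command that joins images, in order, into one PDF."""
    return [*prefix, *images, "-density", str(dpi), "-adjoin", pdf_path]


images = sorted(glob.glob("p*.gif"))  # zero-padded names sort into page order
cmd = pdf_cmd(images, "book.pdf")
print(" ".join(cmd))

# Run only when there are pages to join and the program is on the PATH.
if images and shutil.which(cmd[0]):
    subprocess.run(cmd, check=True)
```

Setting the density to match the scan is what keeps each page at its true physical size in the PDF instead of being stretched to fit a Letter or A4 sheet.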

The magic of Optical Character Recognition and compression

At this stage you will likely notice two things you don’t like about your PDF. First, it’s quite likely to be large, especially if you stayed at 600 dpi. The second is that it shows an image of text, but the text that the image represents cannot be copied as text on the clipboard. The fine folks at OCRmyPDF have created software that can manage both tasks for you. It is most at home on a computer running Linux, though the project documents installation on other platforms as well. The free software uses the “Tesseract” library to recognize text from an image, and one can even specify the language appearing on-page so that it can recognize words and accented letters more easily. Since PDF format supports some highly-compressed ways to represent each page, OCRmyPDF is able to recompress images much more than GIF can offer. I frequently see the PDF that emerges from the software come out three or four times smaller than the input PDF. The “sidecar” option can even kick out a text file storing all the text inferred from the entire PDF!
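A typical run can be sketched as below. The file names are placeholders, and "afr" is the Tesseract language code I would expect to use for an Afrikaans book; the matching Tesseract language pack has to be installed for it to work.

```python
import os
import shutil
import subprocess


def ocr_cmd(in_pdf, out_pdf, language="eng", sidecar=None):
    """Build an OCRmyPDF command with a language hint and an optional text sidecar."""
    cmd = ["ocrmypdf", "-l", language]
    if sidecar is not None:
        cmd += ["--sidecar", sidecar]
    return cmd + [in_pdf, out_pdf]


cmd = ocr_cmd("book.pdf", "book-ocr.pdf", language="afr", sidecar="book.txt")
print(" ".join(cmd))

# Run only when the input exists and OCRmyPDF is installed.
if os.path.exists("book.pdf") and shutil.which("ocrmypdf"):
    subprocess.run(cmd, check=True)
```

Writing to a new output file keeps the image-only PDF around, which makes it easy to compare file sizes and to redo the OCR with a different language setting.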

Communicating to others

To this point I have detailed the method to produce a high-quality PDF from a book you own (or that you have a fair-use right to copy in support of your academic research). Distributing that PDF to other people is quite a different topic. It is definitely worth your while to learn about copyright and intellectual property before you send a copy of a PDF to someone else. Consider this quote from the page I linked there: “Everything published after 1977 is protected for the duration of the author’s life and another 70 years after their death.”

With that said, I want to mention Library Genesis, an international service to make full-length copies of books available in electronic format. My most common use of “LibGen” is that I have a copy of a book that I own in physical format that I want to take with me on a trip. Given that airlines are now squeezing passengers out of checking baggage for the cheap seats, cutting back on the weight of my baggage has become quite important. Rather than bring the physical copy of the book with me, I might put the PDF or EPUB version on my cell phone; both these formats are frequently available for books at LibGen. The service also offers the ability for users to upload new works anonymously. You as an individual still bear the responsibility of ensuring the documents you upload are not covered by copyright!
