Reading Old Books with Tesseract


Our current project, in conjunction with the NYPL, concerns the transcription of digital archives. Because the particular resource we are working with is typeset, we will be using OCR (Optical Character Recognition) on the digital images to assist the transcriptionist. A bitmap image such as a .png, .jpg, or .tiff is just information about the color of each pixel, and it takes some interesting programming to work out what the pixels in an image mean.

The two basic types of OCR processing are Matrix Matching and Feature Extraction. Matrix Matching has a lower computational cost and works best with reproducible typefaces. The computer metaphorically overlays a stencil of each letter on a grouping of pixels and records the letter with the closest match. A phone bill could be scanned quickly and well with a Matrix Matching engine. Feature Extraction works much more like the human visual system and searches for, well, features. It looks for edges, monochrome fields, line intersections, and other such topography. Feature Extraction is more versatile than Matrix Matching for unusual typefaces, different sizes of the same type, or uneven backgrounds.
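To make the stencil idea concrete, here is a minimal sketch of Matrix Matching in plain Python. It is a toy, not how any production engine is implemented: the 3x5 letter templates, the `match_score` and `classify` helper names, and the "count agreeing pixels" scoring rule are all assumptions made up for illustration.

```python
# Toy matrix matching: compare a glyph bitmap against stored templates
# pixel by pixel and pick the template with the most agreeing pixels.
# The 3x5 templates below are invented for illustration only.

TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "T": ["###",
          ".#.",
          ".#.",
          ".#.",
          ".#."],
}


def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def classify(glyph, templates=TEMPLATES):
    """Return (letter, score) for the closest matching template."""
    return max(
        ((letter, match_score(glyph, tmpl)) for letter, tmpl in templates.items()),
        key=lambda pair: pair[1],
    )


# A slightly damaged "L": one pixel of the stem is missing.
noisy_l = ["#..",
           "...",
           "#..",
           "#..",
           "###"]

letter, score = classify(noisy_l)
print(letter, score)  # prints "L 14" (14 of 15 pixels agree)
```

The damaged glyph still lands on "L" because only one pixel disagrees with the stencil. The same rigidity is why this approach struggles when the typeface, size, or background changes: the stencil no longer lines up with the pixels.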

For this project we will be using Tesseract, an OCR engine developed by HP and made open source in 2005. It is a feature-detecting engine with a couple of optimization options. Before embedding a language-specific hidden Markov model or training a convolutional neural network with an evolutionary optimization algorithm such as Particle Swarm Optimization, there are some more basic steps you can take to improve your OCR results.

  • First, use a good-resolution copy. 300 dpi is about the minimum requirement to ride the ride. Grayscale or color is better than black and white.
  • Tesseract does a lot of background and contrast adjustment itself, so trying to anticipate what it wants is not very likely to help much.
  • If the background of the image is known to be unevenly aged, setting the background adjustments to "tile" local adjustments may work better than averaging across the entire image.
  • Tesseract has different language files, so be sure to use the one appropriate to your particular document. This will be important for Tesseract to anticipate what characters it might come across.
  • It is possible to train Tesseract on a particular font by correcting and saving early attempts at OCR. A good tutorial is located here
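As a rough sketch of how the resolution and language-file advice above might look in practice, the snippet below builds a call to the `tesseract` command-line program from Python. The file names (`page_001.png`, `page_001`) and the helper names are hypothetical. The `-l` flag selects the language file. The `--dpi` flag is assumed to be available, which is true of Tesseract 4 and later but may not be true of older installs.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", dpi=None):
    """Build the argument list for a Tesseract command-line call."""
    # -l picks the language data file, e.g. "eng" for English.
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if dpi is not None:
        # --dpi tells Tesseract the scan resolution (assumes Tesseract 4+).
        cmd += ["--dpi", str(dpi)]
    return cmd


def run_ocr(image_path, output_base, lang="eng", dpi=None):
    """Run Tesseract; the recognized text is written to output_base + '.txt'."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang, dpi),
        check=True,
    )
    return output_base + ".txt"


# Hypothetical file names, shown only to illustrate the resulting command.
print(build_tesseract_command("page_001.png", "page_001", lang="eng", dpi=300))
# prints ['tesseract', 'page_001.png', 'page_001', '-l', 'eng', '--dpi', '300']
```

Keeping the command construction separate from the call that runs it makes it easy to check what will be executed before pointing it at a whole folder of scans.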