Becoming Digital: Preparing Historical Materials for the Web

How to Make Text Digital: Scanning, OCR, and Typing

Whether you envision simple page images or elaborately marked-up text, you will begin the transformation from analog to digital by scanning or digitally photographing the original text. For many digital projects, scanning will turn out to be one of the easiest tasks that you do. Operating a flatbed scanner is not much harder than using a photocopier. Put down the document, press a button (on your computer or your scanner), and you’re done. (At least with that one page; the instructions from there become more like shampooing: Lather, Rinse, Repeat.) Moreover, for a modest-sized text-based project, you can get by with a low-end scanner of the sort that currently sells for less than $100. (Consumer digital cameras that capture at least three megapixels of data can work equally well, although they tend to be slower to set up and harder to frame precisely over a page or book.)28

Measures of scanning quality such as resolution (the density of information that the scanner samples, generally expressed in dots per inch, dpi) and bit depth (the amount of information gathered from one dot, which generally ranges from 1 bit per dot for black-and-white images to 24 bits per dot for high-quality color) matter more for scans of images than texts—and thus we explain them in greater depth in the next section. Nevertheless, there are some general rules for scanning texts. If you plan to lift the text off the page using optical character recognition (OCR) software (more on that shortly) rather than displaying the scans as page images, you need only 1-bit black-and-white scans, although you should probably scan at a fairly high resolution of 300 to 600 dpi. If you plan to display page images, most experts recommend high-resolution, high-quality (perhaps 300 dpi and 24-bit color) files for archiving purposes. But you can achieve even this quality with an entry-level scanner.29
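To see what these two measures imply in practice, consider the arithmetic of file size: the number of pixels scales with the square of the resolution, and each pixel carries bit-depth bits. Here is a minimal back-of-the-envelope sketch in Python; the page dimensions and settings are illustrative assumptions on our part, not archival recommendations:

    def uncompressed_scan_size(width_in, height_in, dpi, bit_depth):
        """Estimate the uncompressed size, in bytes, of one scanned page.

        Pixels = (width_in * dpi) * (height_in * dpi); each pixel stores
        bit_depth bits, and 8 bits make a byte.
        """
        pixels = (width_in * dpi) * (height_in * dpi)
        return pixels * bit_depth / 8

    # An 8.5 x 11 inch page at the settings discussed above:
    ocr_scan = uncompressed_scan_size(8.5, 11, 300, 1)    # 1-bit black and white
    archival = uncompressed_scan_size(8.5, 11, 300, 24)   # 24-bit color master
    print(f"1-bit, 300 dpi:  {ocr_scan / 1_048_576:.1f} MB")   # about 1.0 MB
    print(f"24-bit, 300 dpi: {archival / 1_048_576:.1f} MB")   # about 24.1 MB

The twenty-four-fold difference between those two settings shows why the choice of bit depth matters far more for archival page images than for scans destined only for OCR.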

Nevertheless, you might want to spend more for a faster scanner or a model that includes an automatic sheet feeder. Automatic sheet feeders greatly expedite and simplify scanning because they can reach speeds of twenty-five pages per minute, compared to the two to three pages per minute you can manage by swapping pages by hand or by flipping a book’s pages and repositioning it on the scanner’s surface. Of course, you can’t use them with rare or fragile materials. But projects that don’t care about saving the originals “disbind” or “guillotine” the books for auto-feeding, vastly accelerating the process while making book lovers like Nicholson Baker cringe.30

In general, as one handbook puts it, “scale matters—a lot” for digitizing projects.31 If your project is small, it matters little if you scan in a time-consuming way; scanning will remain a relatively insignificant part of your overall effort. But if you are scanning thousands of pages, you need to consider your equipment choices carefully, plan your workflow, and contemplate whether a professional service might be more economical.

Some more specialized projects require considerably more expensive equipment, and as a result, it is often more economical to outsource such work (discussed later in the chapter). Projects that start with microfilm rather than texts, for example, need expensive microfilm scanners. Many rare books cannot be opened more than 120 degrees and would be damaged by being placed on a flatbed scanner. The University of Virginia’s Early American Fiction Project, for example, has digitized 583 volumes of rare first editions using overhead digital cameras and book cradles specially designed for rare books. The Beowulf project required a high-end digital camera manufactured for medical imaging.32 Such approaches are common only in well-funded and specialized projects. What is more remarkable is how very inexpensive equipment can produce very high-quality results for most ordinary projects.

So far, however, we have only discussed digital “photocopies.” How do we create machine-readable text that can be used either separately or in conjunction with these page images? Those who like computers and technology will find the process known as optical character recognition (OCR)—through which a piece of software converts the picture of letters and words created by the scanner into machine-readable text—particularly appealing because of the way it promises to take care of matters quickly, cheaply, and, above all, automatically. Unfortunately, technology is rarely quite so magical. Even the best OCR software programs have limitations. They don’t, for example, do well with non-Latin characters, small print, certain fonts, complex page layouts or tables, mathematical or chemical symbols, or most texts from before the nineteenth century. Forget handwritten manuscripts.33 And even without these problems, the best OCR programs will still make mistakes.
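In practice, running OCR over a scan takes only a few lines of code. The following minimal sketch uses the open-source Tesseract engine through the pytesseract wrapper, which we assume here purely for illustration; the commercial packages discussed below, such as OmniPage and PrimeOCR, have their own interfaces, and the file name is hypothetical:

    # Requires: pip install pillow pytesseract, plus a local Tesseract install.
    from PIL import Image
    import pytesseract

    # "page_001.png" stands in for a scanned page image (e.g., a 300 dpi scan).
    image = Image.open("page_001.png")

    # Convert the picture of letters and words into machine-readable text.
    text = pytesseract.image_to_string(image)
    print(text)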

But when the initial texts are modern and in reasonably good shape, OCR does surprisingly well. JSTOR, the scholarly journal repository, claims an overall accuracy of 97 percent to 99.95 percent for some journals. A study based on the Making of America project at Michigan found that about nine out of ten OCRed pages had 99 percent or higher character accuracy without any manual correction. A Harvard project that measured search accuracy rather than character accuracy concluded that uncorrected OCR resulted in successful searches 96.6 percent of the time, with the rate for twentieth-century texts (96.9 percent) only slightly higher than that for nineteenth-century works (95.1 percent). To be sure, both of these projects used PrimeOCR, the most expensive OCR package on the market, which claims that it makes only 3 errors in every 420 characters scanned, an accuracy rate of 99.3 percent. But Jim Zwick, who has personally scanned and OCRed tens of thousands of pages for his Anti-Imperialism website, reports good success with the less pricey OmniPage, and conventional (and very inexpensive) OCR packages like OmniPage claim to achieve 98-99 percent accuracy, although that depends a great deal on the quality of the original; 95-99 percent seems like a more conservative range for automated processing. Even with the most sophisticated software, it is hard to get better than 80 to 90 percent accuracy on texts with small fonts and complex layouts, like newspapers. Keep in mind, as well, that the programs generally measure character accuracy but not errors in typography or layout; hence you could have 100 percent character accuracy but have a title that has lost its italics and footnotes that merge ungracefully into the text.34
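Accuracy figures like these are computed by comparing the OCR output against a carefully proofread transcription. Here is a minimal sketch of that comparison using the standard edit-distance measure (our assumption; projects differ in exactly what they count as an error):

    def character_accuracy(ground_truth: str, ocr_output: str) -> float:
        """Character accuracy = 1 - (edit distance / ground-truth length).

        Edit distance counts the insertions, deletions, and substitutions
        needed to turn the OCR output into the proofread transcription.
        """
        m, n = len(ground_truth), len(ocr_output)
        prev = list(range(n + 1))  # classic dynamic-programming Levenshtein
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if ground_truth[i - 1] == ocr_output[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + cost)  # substitution
            prev = curr
        return 1 - prev[n] / m

    print(character_accuracy("a date which will live in infamy",
                             "a datc whlch will live in infamy"))  # ~0.94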

Moreover, you will spend a lot of time and money finding and correcting those little errors (whether 3 or 8 out of 400 characters), even though good programs offer some automated methods for locating and snuffing out what they euphemistically (and somewhat comically) call “suspicious characters.” After all, at roughly 2,000 characters per page, a three-hundred-page book OCRed at 99 percent accuracy would have about 6,000 errors. We would very roughly estimate that it increases digitization costs eight to ten times to move from uncorrected OCR to 99.995 percent accuracy (the statistical equivalent of perfection). From an outside vendor, it might cost 20 cents (half for the page image and half for the OCR) to machine digitize a relatively clean book page of text; getting 99.995 percent might cost you $1.50-2.00 per page.35
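That arithmetic is easy to reproduce. The sketch below turns the figures quoted above into a rough estimator; the 2,000-characters-per-page figure and the illustrative per-page costs are our assumptions, not a vendor’s quote:

    def correction_estimate(pages, accuracy, chars_per_page=2000,
                            raw_cost=0.20, corrected_cost=1.75):
        """Rough digitization arithmetic based on the figures in the text.

        chars_per_page and both per-page costs are illustrative assumptions;
        adjust them to match actual vendor quotes.
        """
        residual_errors = pages * chars_per_page * (1 - accuracy)
        return {"residual_errors": round(residual_errors),
                "raw_ocr_cost": pages * raw_cost,
                "corrected_cost": pages * corrected_cost}

    print(correction_estimate(pages=300, accuracy=0.99))
    # {'residual_errors': 6000, 'raw_ocr_cost': 60.0, 'corrected_cost': 525.0}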

Given that uncorrected OCR is so good and making it better costs so much more, why not just leave it in its “raw” form? Actually, many projects like JSTOR do just that. Because JSTOR believes that the “appearance of typographical and other errors could undermine the perception of quality that publishers have worked long and hard to establish,” they display the scanned page image and then use the uncorrected OCR only as an invisible search file. This means that if you search for “Mary Beard,” you will be shown all the pages where her name appears, but you will have to scan the page images to find the specific spot on the page. Visually impaired and learning-disabled users complain that the absence of machine-readable text makes it harder for them to use devices that read articles aloud. Making of America, which is less shy about showing its warts, also allows you to display the uncorrected OCR, which not only helps those with limited vision but also makes it possible to copy and paste text, find a specific word quickly, and assess the quality of the OCR.36
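An “invisible search file” of this kind is, at bottom, an index built from the uncorrected OCR that maps each word to the page images on which it appears. Here is a minimal sketch of that data structure (our illustration of the general idea, not JSTOR’s actual implementation):

    from collections import defaultdict

    def build_page_index(pages):
        """Map each word in the uncorrected OCR to the pages it appears on.

        `pages` is a list of OCR text strings, one per page image.
        """
        index = defaultdict(set)
        for page_num, text in enumerate(pages, start=1):
            for word in text.lower().split():
                index[word.strip('.,;:"()')].add(page_num)
        return index

    ocr_pages = ["Mary Beard wrote...", "...as Beard argued,"]  # illustrative OCR
    index = build_page_index(ocr_pages)
    print(sorted(index["beard"]))  # [1, 2]: the reader is shown these page images

Because searchers see only the page image, an OCR error in such an index costs at most a missed search rather than a visible blemish on the page.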

For some projects, uncorrected OCR—even when 99 percent accurate—is not good enough. In important literary or historical texts, a single word makes a great deal of difference. It would not do for an online version of Franklin D. Roosevelt’s request for a declaration of war to begin “Yesterday, December 1, 1941—a date which will live in infamy—the United States of American was suddenly and deliberately attacked by naval and air forces of the Empire of Japan,” even though we could proudly describe that sentence as having a 99.3 percent character accuracy. One solution is to check the text manually. We have done that ourselves on many projects, but it adds significantly to the expense (and requires checking on the person who did the checking). A skilled worker can generally correct only six to ten pages per hour.37

Although it seems counterintuitive, studies have concluded that correcting even a small number of OCR errors can wind up costing more than typing the document from scratch. Alexander Street Press, which puts a premium on accuracy and insists on carefully tagged data, has all of its documents “rekeyed”—that is, manually typed in without the help of OCR software. They (and most others working with historical and literary materials) particularly favor the triple-keying procedure used by many overseas digital conversion outfits: two people type the same document; then a third person reviews the discrepancies identified by a computer. Calculations of the relative cost of OCR versus rekeying will vary greatly depending on the quality of the original document, the level of accuracy sought, and especially what the typist gets paid. Still, for projects that need documents with close to 100 percent accuracy, typing is probably best.38 This can be true even on relatively small projects. We wound up hiring a local typist on some of our projects when we realized how much time we were spending on scanning, OCR, and making corrections, especially for documents with poor originals. Manually correcting OCR probably makes sense only on relatively small-scale projects, and especially for texts that yield particularly clean OCR. Keep in mind, too, that if you use a typist, you don’t need to invest in hardware or software or spend time learning new equipment and programs. Despite our occasional euphoria over futuristic technologies like OCR, sometimes tried-and-true methods like typing are more effective and less costly.
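The computer’s role in double or triple keying is simply to flag the spots where independently typed versions disagree so that the third person can adjudicate. Here is a minimal sketch of that discrepancy check, using Python’s standard difflib (our illustration of the idea, not any vendor’s actual workflow):

    import difflib

    def flag_discrepancies(keying_a: str, keying_b: str):
        """Yield the spans where two independently typed versions disagree."""
        matcher = difflib.SequenceMatcher(None, keying_a, keying_b)
        for op, a1, a2, b1, b2 in matcher.get_opcodes():
            if op != "equal":
                yield (op, keying_a[a1:a2], keying_b[b1:b2])

    a = "a date which will live in infamy"
    b = "a date which will live in infamy."
    for disagreement in flag_discrepancies(a, b):
        print(disagreement)  # ('insert', '', '.'), left for a third keyer to resolve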

28 Most consumer digital cameras, however, “do not have sufficient resolution for archival capture of cultural heritage materials.” The lenses in these cameras “are designed for capturing three-dimensional scenes and may introduce distortions to flat materials.” But these problems are generally not important for the basic capture of texts. See Western States Digital Imaging, 15.

29 Sitts, ed., Handbook for Digital Projects, 96, 98–99. Scanning in grayscale may improve OCR in some cases and is required for simple readability when, for example, photos are present. For scanning guidelines, see, for example, Steven Puglia and Barry Roginski, NARA Guidelines for Digitizing Archival Materials for Electronic Access (College Park, Md.: National Archives and Records Administration, 1998), ↪link 3.29a; California Digital Library, Digital Image Format Standards (Oakland: California Digital Library, 2001), ↪link 3.29b.

30 Sitts, ed., Handbook for Digital Projects, 115–116; Nicholson Baker, Double Fold: Libraries and the Assault on Paper (New York: Random House, 2001). Note that you should select a scanner based on actual optical resolution and not “interpolated resolution,” which is a method of increasing resolution through a mathematical algorithm. And make sure that the scanner transfers data quickly (e.g., through FireWire or USB 2.0). See Western States Digital Imaging, 12–14.

31 Sitts, ed., Handbook for Digital Projects, 123. Another approach, which is being piloted at Stanford University but is currently feasible only for very large projects, uses an expensive robot that can automatically scan 1,000 pages per hour. John Markoff, “The Evelyn Wood of Digitized Book Scanners,” New York Times (12 May 2003), C1.

32 Kendon Stubbs and David Seaman, “Introduction to the Early American Fiction Project,” Early American Fiction, March 1997, ↪link 3.32a; “Equipment and Vendors,” Early American Fiction, 2003, ↪link 3.32b; NINCH Guide, 40–41. Microfilm scanners run from $650 to $53,000 and up, depending on scanning quality and other processing capabilities. Generally, projects using cameras for digital images use “digital scan back” cameras, which attach a scanning array in place of a film holder on a 4″ x 5″ camera. Western States Digital Imaging, 15.

33 NINCH Guide, 46; Sitts, ed., Handbook for Digital Projects, 130–131. As demand for non-Latin character recognition has grown, so has the amount of available software that recognizes non-Latin and other stylized text. OmniPage, for example, lists 119 languages it supports. See ↪link 3.33a. “Unconstrained machine translation of handwriting appears particularly far off, and may be unachievable.” “FAQ,” RLG DigiNews 8.1 (15 February 2004), ↪link 3.33b.

34 “Why Images?” JSTOR, ↪link 3.34a; Douglas A. Bicknese, Measuring the Accuracy of the OCR in the Making of America (Ann Arbor: University of Michigan, 1998), ↪link 3.34b; LDI Project Team, Measuring Search Retrieval Accuracy of Uncorrected OCR: Findings from the Harvard-Radcliffe Online Historical Reference Shelf Digitization Project (Cambridge, Mass.: Harvard University Library, 2001), ↪link 3.34c; “Product Pricing,” Prime Recognition, ↪link 3.34d. As the University of Michigan report points out, the additional costs for PrimeOCR are most readily justified by higher volumes of digitizing such as at Michigan, which digitizes millions of pages per year. University of Michigan Digital Library Services, Assessing the Costs of Conversion, 27. Our understanding of this and a number of other points was greatly aided by Roy’s conversation with David Seaman, 10 May 2004.

35 These very rough figures are based on conversations with vendors and digitizers and on University of Michigan Digital Library Services, Assessing the Costs of Conversion. For a good discussion of the complexity of pricing digitization, see Dan Pence, “Ten Ways to Spend $100,000 on Digitization” (paper presented at The Price of Digitization: New Cost Models for Cultural and Educational Institutions, New York City, 8 April 2003), ↪link 3.35a. A vendor who works in the United States told us he could automatically OCR a typescript page for about 10 cents (at 95 to 98 percent accuracy) but that the price would rise to $1.70 to $2.50 for fully corrected text. Michael Lesk maintains that “you can get quotes down to 4 cents a page or $10/book if you’re willing to disbind, ship to a lower-cost country, and you’re not so fussy about the process and quality.” Lesk, “Short Report” (paper presented at The Price of Digitization: New Cost Models for Cultural and Educational Institutions, New York City, 8 April 2003), ↪link 3.35b. Brewster Kahle is working on getting the cost of scanning a book down to $10 as part of a massive project to digitize all the books in the Library of Congress. Matt Marshall, “Internet Archivist Has Modest Goal: Store Everything,” SiliconValley.com (4 August 2004), available at ↪link 3.35c.

36 “Executive Notes,” JSTORNews 8.1 (February 2004), ↪link 3.36a; Bicknese, Measuring the Accuracy of the OCR in the Making of America; “Why Images?” JSTOR. JSTOR hand keys bibliographic citation information, key words, and abstracts. “The Production Process,” JSTOR, ↪link 3.36b. JSTOR also makes available TIFFs and PDFs, which can be used with more sophisticated assistive technologies. JSTOR probably also limits users to page views for legal reasons that have to do with its rights to make use of content that is still under copyright; reformatted content might be considered a new publication and might require JSTOR to go back to authors for permission.

37 Chapman, “Working with Printed Text and Manuscripts,” in Sitts, ed., Handbook for Digital Projects, 114.