A Halloween Adventure: OCR Horror Story

(Featured Image: “Ichabod Crane pursued by the Headless Horseman”, by F.O.C. Darley, 1849)

This week I examined different means of digital text analysis and, in keeping with the spirit of Halloween, I chose for my sample text Washington Irving’s classic American gothic horror story: The Legend of Sleepy Hollow.

Cover of Harvard University’s 1892 Copy of The Legend of Sleepy Hollow, Google Books

I selected an 1893 printing of Rip Van Winkle and The Legend of Sleepy Hollow from Google Books because it was the oldest version I could find on the site and the only one whose plain text I could view without purchasing it as an e-book. I wanted the oldest version because older type is the most likely to give OCR trouble, and I wanted to see the results in the plain text. For those who do not know, OCR (Optical Character Recognition) is a way to convert images of handwritten or typed words into digital text. So when you look at a particularly old text on Google Books, such as this edition of The Legend of Sleepy Hollow, you are seeing a scanned image of the original page. Attached to that scanned image is the plain text, which is produced when the scanned image is put through OCR. This underlying plain text is what makes the search function work, because the search actually looks for the selected words and phrases in the plain text, not in the scanned image. So if you want to be able to search documents like those on Google Books, it is very important for that plain text to be accurate.
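To make the point about search concrete, here is a minimal Python sketch. It is not how Google Books works internally; the page records, file names, sample text, and the `search_pages` helper are all made up for illustration. Each page pairs a scan with its OCR text, and the search only ever looks at the text.

```python
# Hypothetical digitized pages: each pairs a scanned image with its OCR text.
# The file names and text below are invented for illustration.
pages = [
    {"number": 1, "scan": "page_001.png", "ocr_text": "the headless horseman rode on"},
    {"number": 2, "scan": "page_002.png", "ocr_text": "tlu headless horseman rode on"},
]


def search_pages(pages, query):
    """Return the numbers of pages whose OCR text contains the query."""
    query = query.lower()
    return [page["number"] for page in pages if query in page["ocr_text"].lower()]


# Both scans show the same printed words, but the OCR text for page 2
# misread "the" as "tlu", so a search for "the headless" misses that page.
print(search_pages(pages, "the headless"))
print(search_pages(pages, "horseman"))
```

In this toy example a reader looking at the two scans would see identical sentences, yet the search only finds one of them, because the scan itself is never consulted.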

Left: Scanned image of 1893 copy of The Legend of Sleepy Hollow.   Right: OCR plain text of the same page.


The trouble with OCR is that when you run it on scanned images of older typefaces, it often does not recognize words accurately. So in the plain text you sometimes end up with “shculdtrs” instead of “shoulders”, “hcnu” instead of “how”, and “tlu” instead of “the”. (All examples are from the plain text of The Legend of Sleepy Hollow.) Really, the only way to ensure that the plain text is accurate is to have somebody go through the plain text and the original scan side by side and correct the plain text manually. In undergrad I worked for my university’s library as a “digitization technician”, or as I like to call it, “scanning monkey”, so I can assure you that this process is time-consuming but necessary for documents to be searchable. To the credit of whoever at Google was responsible for digitizing this edition of Ichabod Crane’s adventure, the plain text was almost entirely accurate. The places where words were unrecognizable came from captions on images, not from the narrative text.
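As a rough illustration of why manual checking is still needed, here is a small Python sketch that flags words in OCR output that are not in a word list and asks the standard library’s `difflib` for a close match. The word list, the sample line, and the `flag_suspect_words` helper are my own stand-ins, not part of any digitization workflow I know of; a real check would use a full dictionary.

```python
import difflib
import re

# A tiny stand-in word list; a real check would use a full dictionary.
KNOWN_WORDS = ["the", "how", "shoulders", "rider", "turned", "his", "narrow"]


def flag_suspect_words(ocr_text, known_words=KNOWN_WORDS):
    """Return (word, suggestions) pairs for words missing from the word list."""
    known = set(known_words)
    flagged = []
    for word in re.findall(r"[a-z]+", ocr_text.lower()):
        if word not in known:
            # Ask difflib for the single closest known word, if any is close enough.
            suggestions = difflib.get_close_matches(word, known_words, n=1, cutoff=0.6)
            flagged.append((word, suggestions))
    return flagged


# An invented line that reuses two of the misreadings quoted above.
sample = "tlu rider turned his narrow shculdtrs"
for word, suggestions in flag_suspect_words(sample):
    print(word, "->", suggestions or "no close match, check the scan")
```

With this word list, the long misreading “shculdtrs” is close enough to “shoulders” to get a suggestion, but a short one like “tlu” shares too few letters with “the” to be matched automatically. A list like this can point a person to the trouble spots, but somebody still has to look at the scan.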

Voyant’s imaging of The Legend of Sleepy Hollow text.

I also put The Legend of Sleepy Hollow’s plain text through Voyant Tools’ text analyzer, which gives the user several options for visualizing the content of the text. I am personally a fan of the word cloud, which shows the most frequently used words in the text. While this is visually pleasing, the more useful related tool is the “contexts” tool. This tool allows the user to select a frequently used term or phrase and shows the sentences it was used in. This could help a scholar understand why a certain term appears so frequently in the text, as well as the different meanings that may have been attached to the same word. To finish, I put four of the most common words from The Legend of Sleepy Hollow through Google’s Ngram Viewer and HathiTrust’s Bookworm. Both of these sites show how frequently a selected word was used over time. Unsurprisingly, all of the words I selected saw a general decrease in usage over the twentieth century.
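For anyone curious what is going on underneath a word cloud and a “contexts” view, here is a minimal Python sketch of the two ideas: counting the most frequent words, and listing each occurrence of a term with a few words on either side. This is not Voyant’s actual implementation; the stop-word list, the sample text, and the `top_words` and `contexts` helpers are invented for illustration.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list; real tools ship much longer ones.
STOP_WORDS = {"the", "of", "a", "and", "in", "to", "his", "was"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(text, n=3):
    """Return the n most frequent words, ignoring stop words."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(words).most_common(n)


def contexts(text, term, window=3):
    """List each occurrence of term with up to `window` words on either side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# An invented sample, not a quotation from Irving.
sample = (
    "Ichabod rode past the old church. The horseman followed Ichabod "
    "to the bridge, and Ichabod fled."
)
print(top_words(sample, n=1))
for line in contexts(sample, "ichabod", window=2):
    print(line)
```

The word cloud is essentially the first function drawn with bigger type for bigger counts, while the second function is what lets a reader see how a frequent word is actually being used.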

Top: HathiTrust’s Bookworm   Bottom: Google’s Ngram Viewer

For my own research as a historian, I could see myself using these tools to compare the language used in similar types of publications from different eras, such as newspapers, and using this information to inform museum exhibits or community projects. As a public historian, however, I cannot really see these tools being useful for a public audience. The word clouds are fun and it might be neat for someone to play around with the Ngram Viewer for a few minutes, but I don’t think these tools would hold somebody’s attention for very long or really help a public audience engage more deeply with the text. For the most part, I think these tools are more useful for literary and text scholars.
