This week, my LIS 7850 Digital Libraries class participated in a project designed to “introduce you to the issues surrounding access to text-based materials through free-text entries.”
The exercise involved transcribing poems from the Detroit City Poets Project, in order to make the text accessible for searching, since the original digital objects are images. If an image contains text, that text needs to be recreated as actual text in a searchable metadata field, to allow for full-text searching of the content of that image.
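To make that idea concrete, here is a minimal sketch of how transcribed text can make an image findable. It is only an illustration, not how the Detroit City Poets Project or any particular digital library system actually works: the image IDs, the placeholder transcription text, and the function names (`tokenize`, `build_index`, `search`) are all hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(transcriptions):
    """Map each word to the set of image IDs whose transcription contains it."""
    index = defaultdict(set)
    for image_id, text in transcriptions.items():
        for word in tokenize(text):
            index[word].add(image_id)
    return index

def search(index, query):
    """Return sorted IDs of images whose transcription contains every query word."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

# Hypothetical image IDs and placeholder text (not the actual poems).
transcriptions = {
    "poem_001a": "placeholder line about smoke over the water",
    "poem_001b": "placeholder line about the night shift",
    "poem_002a": "placeholder line about a long road at night",
}
index = build_index(transcriptions)
print(search(index, "night"))
```

Without the transcription field, the search has nothing to match against, because the words exist only as pixels in the JPEG. With it, a query for a word returns the IDs of every image whose transcription contains that word.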
Each class member was assigned approximately 4 images of poems (ID numbers provided) and asked to transcribe the text from the JPEG images into a *.txt file. The two poems I was assigned (each in two parts, for a total of four files) were: “Factories along the river” and “Chant on US-80.”
This exercise was much easier and less time-consuming than Exercise 1. Whereas Exercise 1 (creating descriptive metadata for about 11 images) took at least 2 hours, I finished Exercise 2 in less than 30 minutes. The poems were short, and the text was clear. It was quite easy to transcribe the text and finish the assignment quickly.
Actually, I was a little confused about why these poems are being manually transcribed at all. (I’m not complaining: I’m always up for an easy assignment!)
First, why are these poems being digitized as image files at all? Perhaps they are scans from a book or publication. We did not receive a lot of background information about this particular project, and I was unable to find it online.
Second, I would think that such clear, crisp, machine-typed text would have made a good job for optical character recognition (OCR) software. Maybe this type of software was unavailable to the project managers. Or maybe they are OCR’ing most of it, but we students needed a project.
Most text transcription projects are not this simple, however. I have participated in several digitization projects (most of them directed towards genealogists) in which the original documents contained text but were digitized as scanned images, thus needing transcription metadata to make the text accessible. However, in most of these cases, the original text was not typed but handwritten—and not always neatly!
When I worked at the Greene County Room (Xenia, Ohio), I participated in a project to digitize early (1869-1909) county death records. These death records were recorded in old-fashioned ledger books, with a single record per line. My role was as one of the transcribers of the names, so that users could search for a particular name in the metadata and be directed to any/all images containing that name. That was a sizeable project, and some of those court clerks had some pretty bad handwriting!
Bad handwriting on handwritten records is not limited to Greene County, Ohio, of course. I have also recently participated in transcription projects through Family Search (LDS Church) Indexing. Family Search has a wide variety of free genealogical records available online, which are made more accessible thanks to the work of thousands of volunteer indexers. If you are interested in indexing and transcription, this is a good way to get some free practice (while helping to provide a free service). There are a variety of projects to choose from, at any level of experience, and in many languages. You can do as much or as little as you want; there’s no pressure.
And finally, another project I’m working on that involves text transcription and indexing is the 1875 Montgomery County (OH) Atlas. I digitized this atlas last year, and it is perfectly usable in its current state, which is why it is already publicly available on our web site. However, the atlas includes plat maps for every township: plat maps that show property lines and property owners. The names are handwritten (of course) and are written in many directions (some are practically upside down). I would like to transcribe the names on each map, so that when users search our site (yes, the entire site) for a certain name, they will find plat maps listing property owners with that name (if, of course, there are any). I did manage to finish a similar project with a 1938 Montgomery County Plat Map book. But the 1875 atlas is much bigger, and it also includes several pages of straight text (biographical sketches, county history, etc.), which I hope to OCR rather than transcribe. Although, with all the errors I noticed in a test page that I OCR’d, it might be just as fast to simply sit down and transcribe them by hand!