Crowdsourcing for Document Transcription

It’s a fast-paced, electronic world out there, and so much stuff is immediately available “online” now that library patrons are coming to expect it as “a given”. So libraries, archives, and other cultural organizations are putting more and more information online every day, particularly through digital imaging of photographs and documents.

As many of you reading this blog probably already know, creating quality, useful digital images is a big job. It’s really not as simple as just scanning something and slapping it up onto the Internet. There’s only so much you can do, both time- and money-wise, so it requires prioritizing, making sure you do it right the first time, actually digitizing the item (in whatever way is necessary/appropriate), and then creating the metadata. The metadata (descriptive information about the image: title, description, subjects, file format, dates, etc.) is key to making the image findable, because let’s face it, neither Google nor anyone else has perfected the ability to search for an image without using words. To put it bluntly, if you have a picture of a cat, somewhere attached to that image needs to be a text record with the word “cat” in it, or else it won’t come up in the search results when someone searches for “cat”. (Sometimes, people searching for “cat” still won’t find it, depending on where/how they are searching or how your metadata is done, but all the possible reasons for that are a subject for another time.)
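To make the “cat” point concrete, here is a minimal sketch in Python. The records, file names, and the `search` function are all made up for illustration; real search engines and digital collection software are far more elaborate, but the basic limitation is the same: a search can only match words that exist in the metadata.

```python
# Hypothetical metadata records: both images show a cat,
# but only the first one says so in its text.
records = [
    {"file": "img001.tif", "title": "Cat on a porch", "subjects": ["cat", "porch"]},
    {"file": "img002.tif", "title": "Untitled", "subjects": []},
]


def search(records, term):
    """Return the files whose title or subject words include the search term."""
    term = term.casefold()
    hits = []
    for record in records:
        # Only the words in the metadata are searchable, never the pixels.
        text = " ".join([record["title"], *record["subjects"]]).casefold()
        if term in text.split():
            hits.append(record["file"])
    return hits


print(search(records, "cat"))
```

Only `img001.tif` comes back. The second image might be the best cat photo in the collection, but without the word in its record, nobody searching for “cat” will ever see it.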

Bottom line is: metadata is pretty darn important! And if you thought scanning took a long time, just wait until you start messing with your metadata. Depending on whether you are doing it from scratch right then or if you have pre-existing metadata that you can use, even simple metadata can take longer per image than the scan did.

But God help you if you are digitizing documents that need to be transcribed and are handwritten, which rules out OCR (optical character recognition, where a computer transcribes for you, with varying degrees of accuracy). It’s not always necessary to transcribe documents; sometimes the simple metadata is enough to get the documents into the proverbial “hands” of the people who need them.

But what if you do want to transcribe? It might be every word on the page, such as letters or diaries, or maybe it is just the names, such as with birth, death, marriage, or other records where name is the most important access point. I have participated in a number of such projects, and it can definitely get time-consuming.

So, where do you find the time, the resources, the manpower?

One solution is “crowdsourcing,” which is defined by Wikipedia as “the act of outsourcing tasks, traditionally performed by an employee or contractor, to an undefined, large group of people or community (a “crowd”), through an open call” (“Crowdsourcing,” Wikipedia). Crowdsourcing is an awesome way of getting things accomplished, because for one thing, it’s FREE. And you’d be surprised how many people are happy to do projects for free, particularly if they can do the work online, at their leisure.

A rather well-known example of crowdsourcing is the FamilySearch Indexing Project, which invites volunteers to transcribe names and other information from records useful to genealogists. I have participated in this one, and I can tell you that they have a nifty client program for downloading records in batches, entering metadata into forms, and then submitting your work. Each record is transcribed by multiple people (2 or 3), and a computer compares the submitted entries, looking for discrepancies. If there are differences, another (more seasoned) transcriber reviews the records again to determine the correct entries.
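I don’t know how FamilySearch actually implements the comparison step, but the general idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `reconcile` function, the field names, and the rule that differences in capitalization or spacing don’t count as discrepancies.

```python
def normalize(value):
    """Ignore case and extra whitespace when comparing entries."""
    return " ".join(value.split()).casefold()


def reconcile(entries):
    """Compare independent transcriptions of one record, field by field.

    Returns a dict of fields every transcriber agreed on, plus a list of
    fields that differ and should go to a more experienced reviewer.
    """
    agreed = {}
    disputed = []
    fields = sorted({field for entry in entries for field in entry})
    for field in fields:
        values = [normalize(entry.get(field, "")) for entry in entries]
        if len(set(values)) == 1:
            # All transcribers match: keep the first one's spelling.
            agreed[field] = entries[0].get(field, "").strip()
        else:
            disputed.append(field)
    return agreed, disputed


# Two hypothetical volunteers transcribing the same record.
entries = [
    {"surname": "Miller", "given": "Anna"},
    {"surname": "miller ", "given": "Anne"},
]
agreed, disputed = reconcile(entries)
print(agreed, disputed)
```

In this example the surname is accepted automatically, while the given name (“Anna” vs. “Anne”) is flagged for a human to look at again.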

FamilySearch is a large-scale operation with a sophisticated software system going on. It’s great, but it’s a little intimidating. I work in a small department; I’m the only archivist. I look at FamilySearch, and I don’t exactly think to myself, “Oh yeah, we could totally do that.”

But today, I saw a project that made me think, “Well…yeah…maybe we could.” It’s not as elaborate, but it looks like it gets the job done.

I’m talking about the Civil War Diaries Transcription Project at the University of Iowa Libraries. A friend of mine at the Clark County (OH) Historical Society posted about this project on the society’s Facebook page (you should “like” them, by the way), and I was just so excited to see a library doing this!

But what is the big deal? Lots of libraries, past and present, have done transcription projects. Most libraries have volunteers. Most libraries have web pages, most of which contain some kind of web submission form. Many libraries have digital image collections, a lot of them including historic items. Ah, but when you combine all of these things together, that’s where you get the genius and awesomeness that is the University of Iowa’s project. They’ve brought all those elements together: Images are served up using CONTENTdm digital collection management software, and a simple web form displayed next to the image allows volunteers to enter and submit the transcription.

University of Iowa Libraries has taken a project that they want to get done, and created an easy, convenient way for anyone who wants to help, to actually be able to help.

Volunteers for the University of Iowa Civil War Transcription Project don’t need to come to the library, which eliminates the need to come during open hours, find a parking space (which can be tough on a university campus), or even live near the library geographically. They don’t need to wear gloves, use only pencil, or have any special knowledge about handling documents. They don’t need to register as volunteers or even give their names (providing a name is optional). The work can be completed from anywhere, anytime, by anyone with Internet access. (All submissions are reviewed prior to being posted.)

The University of Iowa project is especially exciting to me because we also use CONTENTdm for our digital collections at the library where I work. It makes me curious how exactly they implemented this. I might just email them and ask. It can’t hurt to ask!

Update: I now see that the U. Iowa project was featured on the AHA’s blog today…maybe that’s where my friend saw it. 🙂


One response to “Crowdsourcing for Document Transcription”

  1. There was an article written about this in the Iowa City Press-Citizen newspaper: check it out.
