Tuesday, March 29, 2011

Making unintelligible words intelligible

Have you ever purchased tickets online and had to decipher one of those scrambled, hard-to-read words before proceeding to the checkout phase of your purchase? That funny word is called a "captcha," and it's intended to ensure you are a human and not a machine that is programmed to game the system. Only a human being can recognize these words and then re-enter them from a computer keyboard.

It turns out that this process is doing more than just separating you from a machine. This article in today's New York Times explains that these captchas serve a second purpose: every one of them is a word from an old text that an OCR (Optical Character Recognition) program was unable to recognize. Such words are siphoned off and presented as captchas. When you type the word, your effort is funneled into a sophisticated computer program that compares the letters you type with the letters typed by others for the same word and runs a few other quality-control checks (like looking at the words before and after the unknown word to establish some kind of context). Finally, the computer determines the identity of the previously unknown word. The accuracy rate of this method is higher than that of an individual typist doing purposeful verification of the words, and it costs almost nothing. Millions of words are sorted out this way every day.
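
For the technically curious, the "compare what different people typed" step can be pictured with a few lines of Python. This is only a rough sketch of the idea, not the actual program the article describes: the function name resolve_word and the two thresholds (min_votes and min_agreement) are made up for illustration, and the real system presumably does a good deal more, including the context checks mentioned above.

```python
from collections import Counter
from typing import Iterable, Optional


def resolve_word(
    transcriptions: Iterable[str],
    min_votes: int = 3,           # assumed threshold, not from the article
    min_agreement: float = 0.75,  # assumed threshold, not from the article
) -> Optional[str]:
    """Return the agreed-upon reading of a word, or None if there is no consensus yet."""
    # Normalize so that "Deed " and "deed" count as the same answer.
    cleaned = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(cleaned) < min_votes:
        return None  # not enough people have typed this word yet

    # Find the most common answer and how many people gave it.
    answer, count = Counter(cleaned).most_common(1)[0]
    if count / len(cleaned) >= min_agreement:
        return answer
    return None  # typists disagree; the word would need more readers


if __name__ == "__main__":
    print(resolve_word(["deed", "Deed", "deed ", "dead"]))  # deed
    print(resolve_word(["deed", "dead"]))                   # None (too few votes)
    print(resolve_word(["deed", "dead", "deal", "deed"]))   # None (no clear majority)
```

The point of the sketch is that no single typist has to be right. A word is only accepted once enough independent people agree on it, which is one plausible reason the combined result can beat a lone proofreader.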


Jeff Welch said...

I read something about this a few years ago. I wonder how much it costs. The Plymouth Registry of Deeds used OCR to digitize its 1955-1970 indices, but is now in the "third step" of correcting them with registry employees. Since they've lost about 30% of their staff since FY07 due to budget cuts, the process has been pretty slow.

I wonder if this is something that the tech fund could pay for in all the registries, at least where the indices are typed and not handwritten. (Then again, there could be collective bargaining implications...)

Dick said...

About ten years ago we thoroughly investigated using OCR to digitize the paper indexes. Back then we found it wasn't practical, but I'm sure the technology has improved since then. My interest is in OCR'ing the documents themselves, which would make them searchable. While the index is the proper tool to use in a title search, there are many words in the body of documents that aren't in the index (street names in the property description, for instance) that would be very valuable if searchable. I look to Google Books as an example of what we can become.