Well, I installed reCAPTCHA on this blog last week. I was sick of receiving tons of spam for the kind of sleaze I would rather not think about. I picked reCAPTCHA on the recommendation of a colleague. This past week, a form that we set up for a client was generating so much spam that our (legitimate) servers were being blacklisted all over the globe. Finally, we put in the captcha (over objections from more sales- and marketing-oriented minds) and it, combined with changing the static IP for our mail server, got us unblacklisted. So, it worked at work, and I am happy to say that I have not had any spam in "awaiting moderation" on this blog since installing it. I am equally sure that it will see use in a website that I am currently building.

The other cool thing about reCAPTCHA, besides the fact that it is an excellent captcha in its own right, is the somewhat novel method used to generate and verify its images. From their website:


"reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.

But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
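
To make the two-word trick concrete, here is a toy sketch in Python of how I imagine the bookkeeping could work. This is not reCAPTCHA's actual code or API: the class name, the method names, and the vote thresholds (`min_votes`, `min_agreement`) are all made up for illustration. The only part taken from the description above is the idea itself: check the known word, and if it matches, count the user's reading of the unknown word as one vote.

```python
from collections import Counter


class TwoWordCaptcha:
    """Toy model of the known-word / unknown-word scheme (hypothetical names)."""

    def __init__(self, min_votes=3, min_agreement=0.75):
        # Assumed thresholds; the real service does not publish these here.
        self.min_votes = min_votes
        self.min_agreement = min_agreement
        self.votes = {}  # unknown word id -> Counter of submitted readings

    @staticmethod
    def _normalize(text):
        return text.strip().lower()

    def submit(self, known_answer, known_guess, unknown_id, unknown_guess):
        """Return True if the user passes the captcha.

        The user is judged only on the known word. If they get it right,
        their reading of the unknown word is recorded as one vote.
        """
        if self._normalize(known_guess) != self._normalize(known_answer):
            return False
        tally = self.votes.setdefault(unknown_id, Counter())
        tally[self._normalize(unknown_guess)] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed reading, or None if confidence is still too low."""
        tally = self.votes.get(unknown_id)
        if not tally:
            return None
        total = sum(tally.values())
        word, count = tally.most_common(1)[0]
        if total >= self.min_votes and count / total >= self.min_agreement:
            return word
        return None


captcha = TwoWordCaptcha()
captcha.submit("morning", "Morning", "scan-17", "upon")   # passes, vote recorded
captcha.submit("morning", "mourning", "scan-17", "spam")  # fails, vote ignored
captcha.submit("morning", "morning", "scan-17", "upon")
captcha.submit("morning", "morning", "scan-17", "Upon ")
print(captcha.resolved("scan-17"))
```

The design point worth noticing is that a wrong answer on the known word throws away the whole submission, so a spam bot guessing at random never gets to vote on what the unknown word says.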

Cool, huh? It also occurs to me that the uses of this could go well beyond aiding in the OCRing of a bunch of documents. If their OCR software is using neural networks (and today, whose isn't?), the amount of training data that could wind up in their particular network is nothing short of astounding. It would be nice if we could see the end result! The project itself is being run by Carnegie Mellon, so I'm sure that if anything truly interesting comes of it, something will be published. That said, the site doesn't seem to contain any references to the influence this could have on artificial intelligence and character recognition, so I can't even be sure whether they are trying to observe the pattern matching or whether it is just a bright idea to improve on existing QA methods for OCR.

Now, on Friday no less, you get a twofer: captcha recommendation and rambling on a tangent about the AI involved. But, that is how the MCS's mind works.