Preventing Automated Collection Through CAPTCHA

A CAPTCHA is the image, often containing scrambled words or letters, that appears when you fill out a Web form. The name stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.” It prevents robots from automatically submitting Web forms without any human interaction.

BrightPlanet cannot submit forms or harvest content from sites that use CAPTCHA because our harvester is not an actual human. In this post, you’ll learn how CAPTCHA works and why our harvester and others do not support harvesting content from CAPTCHA-protected sites.

How Does CAPTCHA Work?

CAPTCHA works by presenting the user with a question that a human can answer easily but a computer cannot easily solve. Since the server receiving the form submission knows the answer to the question, it compares your answer against the expected one and allows the form to be submitted if there is a match.

When CAPTCHA first began appearing, developers would use simple images filled with text or numbers that the user could easily identify. The problem was that extracting text from an image was very easy for an automated machine through optical character recognition (OCR). That is why we have extra hash marks, squished characters, words written on a crooked line, and letters turned at weird angles today. Those are the things that make automated character recognition much more difficult.

Evolution of CAPTCHA

In recent years, CAPTCHA has continued to get more complicated by posing more analytical problems, like asking the user to pick an image that doesn’t belong, solve a basic math question, or answer a trivia question. These techniques further require a human to respond, as computers are not well equipped to answer random questions.

Automatically Harvesting CAPTCHA

BrightPlanet’s harvest engine is automated, not human, which means it collects large amounts of data from many sites at once. CAPTCHA forms prevent the Deep Web Harvester from collecting any data because it cannot answer the CAPTCHA question.

Over the years, we have often been asked about CAPTCHA, but we can’t process data that requires this type of human intervention, and in general neither can others. Some custom script-based harvest engines, like those leveraging Kapow, do allow their scripts to stop when they hit a step that requires human intervention, present a human with the problem to answer, and then proceed with the script. This works great if you can build a custom script and process the sites one at a time, but it is very labor intensive.
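The stop, ask a human, then continue pattern described above might look roughly like the following Python sketch. It does not reflect Kapow's actual API or any real harvest engine; `Step`, `run_script`, and the `ask_human` callback are hypothetical names made up for illustration, and the steps only record what would happen rather than touching a real site.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One step of a hypothetical site-specific harvest script."""
    name: str
    captcha_prompt: Optional[str] = None  # set only when a human is needed


def run_script(steps: list[Step], ask_human: Callable[[str], str]) -> list[str]:
    """Run the steps in order, pausing for a human wherever a CAPTCHA appears."""
    log = []
    for step in steps:
        if step.captcha_prompt is not None:
            # The script blocks here until a person supplies the answer.
            answer = ask_human(step.captcha_prompt)
            log.append(f"{step.name}: human answered {answer!r}")
        else:
            log.append(f"{step.name}: automated")
    return log
```

Passing Python's built-in `input` as `ask_human` would pause at the terminal on each CAPTCHA step. Because every such pause needs a person, the cost grows with the number of sites, which is why this approach is labor intensive.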

Interested in what the Deep Web Harvester can capture? Check out our blog post on the Deep Web harvesting process.

Photo: Ryan Ruppe (flickr)