security risks for online services by relying on reCAPTCHA
Benjamin Wegener - October 22, 2010
revised version October 27, 2010
A description of major flaws in Google's reCAPTCHA service and a proof of concept that exploits them with freely available tools, achieving a recognition rate of approximately 17%.
Many popular websites use the CAPTCHAs (a contrived acronym for “completely automated public Turing test to tell computers and humans apart”1) provided by reCAPTCHA to prevent automated subscription and registration, or unauthorized and extensive use of their offers and services. reCAPTCHA was originally developed by Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham and Manuel Blum of the Computer Science Department of Carnegie Mellon University in Pittsburgh, PA, to help digitize books by using as CAPTCHAs scanned words that the optical character recognition (OCR) software could not recognize2. In September 2009 Google acquired the system to use it on a wider range of projects such as Google Books3. A challenge served by reCAPTCHA consists of two words: one already known to the system and one to be identified by the human solver. The unknown word is given to many humans in a short period of time, and the most frequent answer is taken as its solution. However, reCAPTCHA only checks whether the known word is typed in correctly, and even then it tolerates a variety of accidental errors.
2 Major Flaws
1 http://en.wikipedia.org/wiki/CAPTCHA
2 Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum (2008). "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" (PDF). Science 321 (5895): 1465–1468.
3 "Teaching computers to read: Google acquires reCAPTCHA". Google. Retrieved 2009-09-16.
As mentioned before, only the already known word is used to check whether the user is human. This allows a dictionary attack on the challenge, because the known word is likely to be more common than the unknown one. The attack takes the phrase recognized by the OCR software and compares it to a word list; if the recognized word does not occur in the list, the most likely replacement from the list is used instead. Since reCAPTCHA mostly draws on English books and sources for digitization, a large and comprehensive word list can be built from the decompressed archive of the English Wikipedia4 using a simple algorithm that extracts all words separated by punctuation marks and orders them alphabetically. A user is allowed to enter 32 wrong answers before security measures are taken: a CAPTCHA consisting of two known words is generated, which must be answered perfectly. After that, the IP address is blocked for a brief period of time. The puzzle is monochromatic, and apart from a simple wave distortion no other effects are applied to the phrase. Because of the sources, the font is mostly a serif type, so OCR software does not need to be trained on other fonts.
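The word list construction and the dictionary lookup described in this section can be sketched roughly as follows. This is only an illustration, not the code used for the proof of concept: the function names `build_word_list` and `correct_ocr_word` are hypothetical, and the notion of "most likely replacement" is assumed here to mean the closest string match as computed by Python's standard `difflib` module, which the text does not specify.

```python
import difflib
import re


def build_word_list(text):
    """Extract all words separated by non-letter characters,
    lowercase them, drop duplicates and sort alphabetically."""
    words = {w.lower() for w in re.split(r"[^A-Za-z]+", text) if w}
    return sorted(words)


def correct_ocr_word(ocr_word, word_list, cutoff=0.6):
    """Return the OCR result if it is a known word; otherwise return
    the closest entry of the word list (assumption: string similarity
    stands in for 'most likely replacement'). If nothing is similar
    enough, the OCR result is returned unchanged."""
    candidate = ocr_word.lower()
    if candidate in set(word_list):
        return candidate
    matches = difflib.get_close_matches(candidate, word_list, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# Tiny stand-in for the Wikipedia dump text.
sample_text = "The quick brown fox; jumps over the lazy dog."
words = build_word_list(sample_text)

print(words)
print(correct_ocr_word("qu1ck", words))  # OCR confused 'i' with '1'
```

For a list as large as the one derived from Wikipedia, a linear similarity scan per word would be slow; an index (for example by word length or by character n-grams) would likely be needed in practice.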
To attack these security weaknesses, the image is transformed by some simple algorithms:
3.1 adjustment of contrast
By using different contrast intensities for each word, two problems are dealt with at once: the distinction between normal and bold fonts, and the removal of “dirty spots” or artifacts left over by OCR. To this end, the density of black and white pixels in the area of each word is adjusted separately to a ratio of 1.5 white pixels to each black pixel.
3.2 resizing using hq2x
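The contrast adjustment of section 3.1 can be illustrated with a small sketch. It assumes that the adjustment is a per-word binarization whose threshold is chosen so that the result contains about 1.5 white pixels per black pixel, that is, roughly 40% black pixels; the text does not say how the ratio is actually reached, and the function name `binarize_to_ratio` and the list-of-lists image representation are hypothetical.

```python
def binarize_to_ratio(gray, white_to_black=1.5):
    """Binarize one word region so that white:black is close to the
    requested ratio.

    gray: 2D list of grayscale values (0 = black, 255 = white).
    Returns a 2D list containing only 0 (black) and 255 (white).
    """
    pixels = sorted(p for row in gray for p in row)
    # 1.5 white pixels per black pixel means 1 / (1 + 1.5) = 40% black.
    black_fraction = 1.0 / (1.0 + white_to_black)
    black_count = int(round(black_fraction * len(pixels)))
    # The darkest `black_count` pixels become black. Pixels sharing the
    # threshold value also turn black, so ties can shift the ratio.
    threshold = pixels[black_count - 1] if black_count > 0 else -1
    return [[0 if p <= threshold else 255 for p in row] for row in gray]


# Assumed usage: call once per word region, since each word gets
# its own contrast setting.
region = [[0, 10, 20, 30, 40],
          [50, 60, 70, 80, 90]]
result = binarize_to_ratio(region)
black = sum(v == 0 for row in result for v in row)
white = sum(v == 255 for row in result for v in row)
print(black, white)
```

Choosing the threshold from the sorted pixel values adapts it to each word: a bold word and a thin word end up with a comparable share of black pixels, which is the effect described in section 3.1.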
The open source program hq2x5 uses a pixel art scaling algorithm to enlarge pixel-based imagery, which suits the purpose here perfectly: applied after the contrast adjustment, it yields better overall quality.
3.3 distortion removal
4 http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
5 http://en.wikipedia.org/wiki/Hqx
An algorithm to determine the average center line of the phrase served...