Doesn’t matter what kind of intelligence you have — be it artificial or natural — after this detailed analysis no captcha will be an obstacle. At the end of the article, you can find the simplest and most effective workaround solution.

CAPTCHA is a completely automated public Turing test to tell computers and humans apart by automatically setting up specific tasks that are difficult for computers but simple for human. This technology has become the security standard used to prevent automatic voting, registration, spam, brute-force attacks on websites, etc.

1. Categories of captcha

Existing captchas are divided into three categories: text, graphic and audio/video. Below we will look at how various captchas are generated and what successes now are with their bypassing. Do not scold for the quality of images — we took them from scientific publications, to which we provide links =) A full list of publications taken for analysis is given at the end of the article.

1.1. Text captcha

Text captchas are the most commonly used, but due to their simple structure they are also the most vulnerable. This type of captcha usually requires recognition of a geometrically distorted sequence of letters and numbers.

To increase security, various protection mechanisms are used, which can be divided into anti-segmentation and anti-recognition. The first group of mechanisms is aimed at complicating the process of characters separation, while the second group — at recognizing the characters themselves. Fig. 1 shows examples of various approaches to captchas defence.

Fig. 1: Captcha defence types

1.1.1. Hollow symbols

In the captcha-creating strategy “hollow symbols”, contour lines are used to form each symbol.

Fig. 2. Hollow captcha

Such characters are difficult to segment, but they are easily visible to people. Unfortunately, this mechanism is not as secure as expected. In Gao’s research [1], the convolutional neural network successfully recognizes from 36% to 89% of images (depending on the type of distortion and the training sample).

1.1.2. Crowing characters together

Crowing characters together (CCT) complicate segmentation, but also reduce the readability for the user. That is, even human sometimes can not successfully solve such a captcha.

Fig. 3. Overlap and CCT

Researchers from China and Pakistan managed to crack the CCT with a probability of 27.1% to 53.2% [2].

1.1.3. Background noises

Fig. 4. Background noises

Google’s reCAPTCHA, using images from Street View, breaks in 96% of cases [3].

1.1.4. Two-level structure

The two-level structure is a vertical combination of two horizontal captchas, which complicates the segmentation of the image.

Fig. 5. Two-level structure

Gao [4] proposed a segmentation approach for separating captcha images both vertically and horizontally, and achieved 44.6% success (9 seconds per image) using a convolutional neural network.

1.2. Captcha based on image

1.2.1. Selection based captcha

In the case of captcha based on selection, users must select the correct answers according to the prompt provided to this captcha. This is the simplest image based captcha form. For example, you need to highlight all the cars, all traffic signs or all traffic lights among the presented images.

Fig. 6. Various examples of image captchas, based on selection

Gaulle [5] proposed using the support vector machine (SVM) to distinguish images of cats and dogs in Asirra captcha with a probability of successful recognition of 82.7%.

Gao’s team [6] used OpenCV to detect faces in FR-CAPTCHA. It was possible to obtain a detection probability from 8% to 42% with image processing in less than 14 seconds. FaceDCAPTCH was recognized with a probability of 48% on average in 6.2 seconds.

Columbia University employees beat reCAPTCHA and Facebook CAPTCHA with a probability of 70.78% and 83.5%, respectively.

1.2.2. Click-based Captcha

In 2008, Richard Chow with his colleagues [7] first proposed click-based captcha. It requires users to click on symbols that are on a complex background in accordance with the prompt, as it shown on Fig. 7.

Fig. 7. Click-based captcha

Such click based captcha have two protective mechanisms: anti-detection and anti-recognition. Proper character recognition with the development of machine learning is no longer a difficult task. Therefore, almost all protection mechanisms are designed to prevent attackers from correctly identifying characters.

1.2.3. Drag-and-Drop Captcha

Dragging-based captcha determines whether the user is a person by analyzing the mouse trail, pointer movement speed, and response time.

Fig. 8. Drag-and-Drop Captcha

Users need to rotate the image of the subject so that it will take it’s a natural position. For example, rotate the image of the table until it is on its legs. It is simple for humans, but difficult for bots.

1.3. Audio/video captcha

1.3.1 Audio captcha

This captcha is usually considered as an alternative to visual captcha in the case of users with vision impairments. Listeners are invited to complete the task on the basis of what they heard, for example, to determine a specific sound, for example, the sound of a bell or piano [8].

Fig. 9. Audio captcha

There is another type of audio-based captcha in which users are required to not only listen, but to pronounce. For example, Gao [9] proposed a sound captcha (Fig. 9), in which the user should read out a sentence selected randomly from a book. The generated audio file is analyzed to determine if the user is human.

But audio captcha is cracked too: scientists from Stanford University have learned to crack audio captcha with a probability of 75%.

1.3.2 Video-captcha

In the video-captcha a video file is provided to users, and they must select a sentence that describes the movement of the person in the video.

Fig. 10. Summary table. Types of captcha

Japanese researchers used a solution based on HMM (hidden Markov model) and obtained an accuracy of 31.75%.

Let us now examine how exactly neural networks are used to crack captcha.

2. Neural networks with DenseNet and DFCR architectures

In 2017, Gao Huang, Zhuang Liu and others [10] built 4 deep convolutional neural networks with an architecture now called DenseNet. Dense blocks of neural networks alternated with skip-connection layers (Fig. 10). The input of each layer in the block was the union of the output of all previous layers. This distinguished the new architecture from the traditional at that time neural networks, where the layers were connected in series.

Fig. 11. DenseNet with three dense blocks [11]

The DenseNet architecture has several advantages: it solves the dispersion problem and effectively uses the properties of all previous convolutional layers, reducing the computational complexity of network parameters and demonstrating good classification performance.

An example of a variation of the DenseNet architecture is the DFCR neural network. The original captcha images of size 224x224 were passed through the convolution layer and then were combined into pools to output images of size 56x56. After that, 4 “dense” blocks were alternately connected, alternating with transition layers (Fig. 11). The structure of the transition layer made it possible to reduce the dimension of the feature map and speed up the calculations.

Fig. 12: DFCR architecture by the example of image recognition with the characters «W52S» [11]

Further, the feature maps were used to check the correspondence of the map and class. The values in each feature map were summed to obtain the average value, which was taken as the class value and displayed in the corresponding softmax classification layer.

Experiments show that DFCR not only preserves the main advantages of DenseNet, but also reduces memory consumption. In addition, the recognition accuracy of captcha with background noise and superimposed characters is higher than 99.9% [11].

#machine-learning #captcha #algorithms