Finding a certain piece of text inside a document represents an important feature nowadays. This is widely used in many practical things that we regularly do in our everyday lives, such as searching for something on Google or even plagiarism. In small texts, the algorithm used for pattern matching doesn’t require a certain complexity to behave well. However, big processes like searching the word ‘cake’ in a 300 pages book can take a lot of time if a naive algorithm is used.

The naive algorithm

Before, talking about KMP, we should analyze the inefficient approach for finding a sequence of characters into a text. This algorithm slides over the text one by one to check for a match. The complexity provided by this solution is O (m * (n — m + 1)), where m is the length of the pattern and n the length of the text.

Find all the occurrences of string pat in string txt (naive algorithm).

#include <iostream>
	#include <string>
	#include <algorithm>
	using namespace std;

	string pat = "ABA"; // the pattern
	string txt = "CABBCABABAB"; // the text in which we are searching

	bool checkForPattern(int index, int patLength) {
	    int i;
	    // checks if characters from pat are different from those in txt
	    for(i = 0; i < patLength; i++) {
	        if(txt[index + i] != pat[i]) {
	            return false;
	        }
	    }
	    return true;
	}

	void findPattern() {
	    int patternLength = pat.size();
	    int textLength = txt.size();

	    for(int i = 0; i <= textLength - patternLength; i++) {
	        // check for every index if there is a match
	        if(checkForPattern(i,patternLength)) {
	            cout << "Pattern at index " << i << "\n";
	        }
	    }

	}

	int main() 
	{
	    findPattern();
	    return 0;
	}
view raw
main6.cpp hosted with ❤ by GitHub

KMP approach

This algorithm is based on a degenerating property that uses the fact that our pattern has some sub-patterns appearing more than once. This approach is significantly improving our complexity to linear time. The idea is when we find a mismatch, we already know some of the characters in the next searching window. This way we save time by skip matching the characters that we already know will surely match. To know when to skip, we need to pre-process an auxiliary array prePos in our pattern. prePos will hold integer values that will tell us the count of characters to be jumped. This supporting array can be described as the longest proper prefix that is also a suffix.

#programming #data-science #coding #kmp-algorithm #algorithms #algorithms

KMP — Pattern Matching Algorithm
2.35 GEEK