Wednesday, January 18, 2012

Rabin–Karp Algorithm

Rabin–Karp algorithm is yet another algorithm to do pattern search in a string (also see KMP algorithm). The problem is given a target string s and some pattern p (p is also a string), how to efficiently decide if s contains p as substring.

The key of Rabin-Karp algorithm is to use hash, more specifically, rolling hash. Then with hash, when checking the match of a substring of  s  to the pattern p, we don't need to compare each character of s and  p. Instead, we just compare hash(substring of s) and hash(p). If the hash values are not the same, the substring doesn't match with the pattern. If same, we still need to further inspect every character of  the substring and the pattern, since hash function can have collisions. However, this further check is not expected to be often.

Then the problem is how to efficiently calculate the hash of all the substrings of  s with a specific length. The rolling hash just kicks in here. One simple example of rolling hash is to use the sum of all the ASCII value of the characters in the substring as the hash value, e.g., s[i] + s[i+1] + ... + s[i+m-1] for the substring s[i... i+m-1]. Then for the substring s[i+1... i+m], we can reuse the hash value we just calculated for the substring s[i... i+m-1]. Assume it is h, then the hash value for substring s[i+1... i+m] is just h - s[i] + s[i+m]. In practice, more sophisticated rolling hash function is used. For example, using some large prime, the hash of "hi" can be ascii("h")*101^1 + ascii("i")*101^0

If there is just one pattern, Rabin–Karp algorithm is not as good as KMP algorithm. Rabin–Karp algorithm is more useful in multi-pattern match. It is used together with a bloom filter. A bloom filter can be created to represent multi-pattern. Then if a substring has hash value hit in the bloom filter,  it is probably there is a match.




No comments:

Post a Comment