Tuesday, June 28, 2011

KMP algorithm

KMP algorithm stands for Knuth–Morris–Pratt algorithm. The problem it wants to solve is to search for a word W in a text string S efficiently. The followings only give the main idea of this algorithm:

  1. The essence of the algorithm is to try to avoid comparing a character more than once. Assume we have two cursors m and i. m points to the start matching position for W within S. n points to the current location within W where we fail (we only had a partial match). To search for W in the rest of S, we don't need to start the next match check beginning at m+1, we can choose the position in a smart way.
  2. To be smart, we need the help from an auxiliary table T . T has the same length as W. T[i] means if the match fails at the position i, how much we need to look back (reset the start cursor). Thus, m will be reset as m+i-T[i]. Moreover, we don't need to start matching process from the new start position, instead, we can just start from the new i that equals toT[i].
  3. This table is mainly for those patterns which contain some non-prefix substring that are also some prefix of the patterns. For example,  in "ABCDABD", substring starting at position 4, "AB", is also the prefix "AB". On the contrary, for "ABC", there are no need to fall back.
  4. The time complexity is O(n+k), where n and k are the length of S and W, respectively.
The following gives the table building algorithm:
    table[0] = 0;
        table[1] = 0;
    
        int cnd = 0;
    
        for(int n =2; n<p.size();)
        {
           if(p[n-1] == p[cnd])
           {
              cnd++;
              table[n] = cnd;
              n++;
           }
           else if(cnd>0)
           {
             // This is the complicated part, if we can't find 
             // a match,  we need to gradually fall back, 
             // to look at shorter string to see if we can 
             // find a match.
             cnd = table[cnd];
           }
           else
           {
              cnd = 0;
              table[n] = cnd;
              n++;
           }
    
        }
    
    

    No comments:

    Post a Comment