Tuesday, July 5, 2011

Longest Common Subsequence

Problem: Longest common subsequence is the subsequence that appear in multiple sequences (usually two) simultaneously. For example, the longest common sequence of "13486" and "23861' is "386". The diff program is based on this algorithm. Besides, this algorithm is frequently used in bioinformatic research.


Solution: Still we will use DP, but different from longest Increasing subsequence, the solution we have is O(n*m), where n and m are the length of two sequence. If we have more than two sequence, the time complexity will be O(n1*n2*n3*...). The following gives the algorithm for two sequences:

  1. look at the last elements A[n], B[m] of two sequences, if A[n] == B[m], we know the longest common subsequence (LCS) between  A[n] and B[m],  LCS(A[1:n], B[1:m]) = LCS(A[1:n-1], B[1:m-1]) +1. 
  2. if A[n] <> B[m], then  LCS(A[1:n], B[1:m]) = MAX(LCS(A[1:n-1], B[1:m]), LCS(A[1:n], B[1:m-1])).
  3. then we will have a matrix for LCS, after filling up the matrix, we will get the answer.
We can further do some optimization.  One is to optimize the space, the LCS matrix. If most diffs center on the middle of sequence, we can trim the common part first.  Also to be fast, we don't need to compare each character, we can hash the string or the line. Then we just compare the hash value sequences.


More: The solution to LCS problem can be easily applied to Edit Distance problem. Given two string Xi and Yj, and a number of operations such as copy(), delete(), insert(), etc. Find the optimal way to transform Xi to Yj. Each operation has a cost and our goal is to minimize the total cost of the transformation. Here only gives the recursive relations:

  1. Define T(i,j) as the total cost to transform Xi to Yj (one direction). There are several recursive relations.
  2. T(i,j)  = T(i-1j-1) + cost(copy), if X[i] == Y[j]
  3. T(i,j)  = T(i-1, j-1) + cost(replace), if X[i] <> Y[j]
  4. T(i,j)  = T(i-1, j) + cost(delete)
  5. T(i,j)  = T(ij-1) + cost(insert)
  6. find T(i,j)  as the minimum T(i,j) in step 2,3,4,5

No comments:

Post a Comment