Skip to main content

Longest Common Subsequence

The Longest Common Subsequence (LCS) problem is a classic computer science problem used to find the longest subsequence common to all sequences in a set of sequences (often just two sequences). A subsequence is a sequence derived from another sequence where some elements might be deleted without changing the order of the remaining elements.

Formal Definition

Given two sequences, X = x_1, x_2, ..., x_m and Y = y_1, y_2, ..., y_n, find a maximum-length common subsequence of X and Y.

Properties

  • The LCS problem is a typical example of a problem that can be solved using dynamic programming.
  • LCS is useful in data comparison applications, such as diff tools, bioinformatics for DNA sequencing, and version control systems.

Dynamic Programming Approach

The solution involves creating a 2D array dp where dp[i][j] represents the length of the LCS of the subarrays X[0:i] and Y[0:j].

Recurrence Relation

  • If x_i == y_j: dp[i][j]=dp[i1][j1]+1dp[i][j] = dp[i-1][j-1] + 1
  • Otherwise: dp[i][j]=max(dp[i1][j],dp[i][j1])dp[i][j] = \max (dp[i-1][j], dp[i][j-1])

Time Complexity

The time complexity of this approach is O(m×n)O(m \times n), where mm and nn are the lengths of the two sequences. This is because every cell in the m×nm \times n table needs to be filled exactly once.

Space Complexity

The space complexity is also O(m×n)O(m \times n) due to the 2D array used for storing intermediate results. However, it can be optimized to O(min(m,n))O(\min(m, n)) by using only the last and current row of the table.

Implementation

def LCS(X, Y):
m, n = len(X), len(Y)
dp = [[0] * (n + 1) for _ in range(m + 1)]

for i in range(1, m + 1):
for j in range(1, n + 1):
if X[i-1] == Y[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])

return dp[m][n]

Applications

  • Version control systems (like Git): Used to reconcile different versions of files.
  • Bioinformatics: Used to compare DNA and protein sequences.
  • Data comparison: To find similarities between different sets of data.