Longest Common Subsequence: LeetCode Dynamic Programming

by Jhon Lennon 57 views

Let's dive into the Longest Common Subsequence (LCS) problem, a classic challenge often encountered on LeetCode and in dynamic programming studies. This problem isn't just about finding any common sequence; it's about identifying the longest sequence that appears in the same order within two given strings. Understanding LCS is crucial because it serves as a foundation for solving more complex sequence alignment problems in bioinformatics, text comparison, and data compression.

Understanding the Problem

The Longest Common Subsequence (LCS) problem asks us to find the longest sequence of characters that appear in the same order in both of the given strings. Unlike substrings, the characters in a subsequence don't need to be consecutive. For example, if we have text1 = "abcde" and text2 = "ace", the longest common subsequence is "ace" with a length of 3. The goal is to write an algorithm that efficiently determines this length.

This problem can be approached using dynamic programming. We build a table where each cell (i, j) stores the length of the LCS of the first i characters of text1 and the first j characters of text2. By filling this table correctly, we can find the LCS length for the entire strings.

Dynamic Programming Approach

The heart of solving the Longest Common Subsequence (LCS) problem efficiently lies in dynamic programming. Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems, solving each of those subproblems just once, and storing their solutions. This avoids recomputing the same solutions over and over, leading to significant efficiency gains.

Constructing the DP Table

To solve the LCS problem, we construct a 2D array (or table), often named dp, where dp[i][j] represents the length of the LCS of the first i characters of text1 and the first j characters of text2. The dimensions of this table are (n+1) x (m+1), where n is the length of text1 and m is the length of text2. We add 1 to each dimension to accommodate the base case of empty prefixes.

Base Case

The first row and first column of the dp table are initialized to 0. This represents the LCS of an empty string with any prefix of the other string, which is always an empty sequence.

n = len(text1)
m = len(text2)
dp = [[0] * (m + 1) for _ in range(n + 1)]

Recurrence Relation

The crux of the dynamic programming solution is the recurrence relation that fills the dp table. For each cell dp[i][j], we consider two cases:

  1. If text1[i-1] and text2[j-1] are equal, it means the current characters in both strings match. Therefore, we can extend the LCS found so far by including this character. The value of dp[i][j] is then dp[i-1][j-1] + 1.

  2. If text1[i-1] and text2[j-1] are not equal, it means the current characters do not match. In this case, the LCS is the longer of the LCS that we can get by either excluding the current character from text1 or excluding the current character from text2. The value of dp[i][j] is then max(dp[i-1][j], dp[i][j-1]).

Filling the Table

We iterate through the dp table, filling each cell based on the recurrence relation. The order of iteration is important; we need to compute the values in a way that ensures the required subproblems are already solved. Typically, we iterate row by row, from left to right.

for i in range(1, n + 1):
    for j in range(1, m + 1):
        if text1[i-1] == text2[j-1]:
            dp[i][j] = dp[i-1][j-1] + 1
        else:
            dp[i][j] = max(dp[i-1][j], dp[i][j-1])

Result

After filling the entire table, the value in the bottom-right cell, dp[n][m], contains the length of the LCS of the entire strings text1 and text2.

return dp[n][m]

Python Code Implementation

Here's a complete Python implementation of the Longest Common Subsequence (LCS) algorithm using dynamic programming, as described above. This code is well-commented to help you understand each step.

def longestCommonSubsequence(text1: str, text2: str) -> int:
    """Finds the length of the longest common subsequence of two strings."""
    n = len(text1)
    m = len(text2)

    # Initialize a DP table with dimensions (n+1) x (m+1) and fill with 0s.
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    # Iterate through the DP table, starting from index 1.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # If the characters match, increment the LCS length by 1.
            if text1[i-1] == text2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            # If the characters don't match, take the maximum LCS length from the adjacent cells.
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    # The bottom-right cell contains the length of the LCS.
    return dp[n][m]

# Example usage:
text1 = "abcde"
text2 = "ace"
print(f"The length of the LCS is: {longestCommonSubsequence(text1, text2)}")  # Output: 3

text1 = "abc"
text2 = "abc"
print(f"The length of the LCS is: {longestCommonSubsequence(text1, text2)}")  # Output: 3

text1 = "abc"
text2 = "def"
print(f"The length of the LCS is: {longestCommonSubsequence(text1, text2)}")  # Output: 0

Code Explanation

  1. Function Definition: The longestCommonSubsequence function takes two strings, text1 and text2, as input and returns an integer representing the length of their LCS.
  2. Initialization: We create a 2D array dp of size (n+1) x (m+1) filled with zeros. This table will store the lengths of the LCS for different prefixes of the input strings. The extra row and column are initialized to 0, representing the case where one of the strings is empty.
  3. Iteration: We iterate through the dp table starting from the second row and column (index 1). For each cell dp[i][j], we compare the characters text1[i-1] and text2[j-1].
  4. Matching Characters: If the characters match, it means we can extend the LCS found so far. We update dp[i][j] to be dp[i-1][j-1] + 1, which is the LCS length of the previous prefixes plus one for the current matching character.
  5. Non-Matching Characters: If the characters do not match, we take the maximum LCS length that we can obtain by either excluding the current character from text1 or excluding the current character from text2. We update dp[i][j] to be max(dp[i-1][j], dp[i][j-1]).
  6. Result: After filling the entire table, the value in the bottom-right cell dp[n][m] represents the length of the LCS of the entire strings text1 and text2. We return this value.

Complexity Analysis

Understanding the time and space complexity of the Longest Common Subsequence (LCS) algorithm is crucial for assessing its efficiency and suitability for different problem sizes.

Time Complexity

The algorithm uses a nested loop to fill the dp table, where the outer loop iterates n times (length of text1) and the inner loop iterates m times (length of text2). Inside the inner loop, we perform constant-time operations: comparing characters and updating the dp table. Therefore, the time complexity is O(n*m), where n and m are the lengths of the input strings.

Space Complexity

The algorithm uses a 2D array dp of size (n+1) x (m+1) to store the lengths of the LCS for different prefixes of the input strings. Therefore, the space complexity is O(n*m). In practice, this space complexity can be a concern for very large input strings. However, there are some techniques to optimize the space complexity, such as using only two rows of the dp table at a time, which reduces the space complexity to O(min(n, m)). However, this optimization makes the code harder to read and maintain.

Applications of Longest Common Subsequence

The Longest Common Subsequence (LCS) problem has a wide range of applications in computer science and bioinformatics. Its ability to find similarities between sequences makes it a valuable tool in various domains.

Bioinformatics

In bioinformatics, LCS is used to compare DNA sequences and identify similarities between them. By finding the longest common subsequence of two DNA sequences, researchers can infer evolutionary relationships and identify conserved regions that may have important biological functions.

Text Comparison

LCS is also used in text comparison tools to find the similarities between two documents. For example, it can be used to identify plagiarism by finding the longest common subsequence between a student's paper and a collection of source documents. Similarly, it can be used to track changes in a document over time by finding the longest common subsequence between different versions of the document.

Data Compression

LCS can be used in data compression algorithms to identify redundant data. By finding the longest common subsequence between two files, we can store only the differences between the files, which can significantly reduce the storage space required. This technique is used in version control systems like Git to store the history of changes to a file efficiently.

Conclusion

The Longest Common Subsequence (LCS) problem is a fundamental problem in computer science with applications in bioinformatics, text comparison, and data compression. The dynamic programming approach provides an efficient way to solve this problem with a time complexity of O(n*m) and a space complexity of O(n*m). Understanding the LCS problem and its applications is a valuable skill for any computer scientist or software engineer. By mastering the dynamic programming technique used to solve this problem, you can tackle more complex sequence alignment and optimization problems in various domains. So, keep practicing, and happy coding, guys!