KMP Algorithm for Longest Prefix Suffix problem

 The Longest Prefix Suffix (LPS) problem is central to the Knuth-Morris-Pratt (KMP) algorithm for pattern matching. Here's a detailed explanation:

Problem: Longest Prefix Suffix (LPS)

For a given string s, the LPS array (or prefix table) at position i stores the length of the longest prefix of s[0:i] that is also a suffix of s[0:i], excluding the entire string itself.

Example

For the string s = "abab", the LPS array is [0, 0, 1, 2]:

  1. LPS[0] = 0 because "a" has no prefix that is also a suffix.
  2. LPS[1] = 0 because "ab" has no prefix that matches a suffix.
  3. LPS[2] = 1 because "a" is the longest prefix of "aba" that is also a suffix.
  4. LPS[3] = 2 because "ab" is the longest prefix of "abab" that is also a suffix.

Importance in the KMP Algorithm

The KMP algorithm uses the LPS array to avoid redundant comparisons when a mismatch occurs during pattern matching. Specifically:

  1. If a mismatch occurs after j matches, instead of restarting the comparison from the beginning of the pattern, the algorithm uses the LPS array to shift the pattern intelligently.
  2. The LPS value at position j-1 determines the next position to check.

The Longest Prefix Suffix (LPS) array is a fundamental component of the Knuth-Morris-Pratt (KMP) algorithm, which is used for efficient pattern matching in strings. The LPS array captures the lengths of the longest proper prefixes of a pattern that are also suffixes, enabling the KMP algorithm to skip unnecessary comparisons during the search process.

Definition

For a given pattern P of length n, the LPS array is an array lps where lps[i] represents the length of the longest proper prefix of the substring P[0:i] that is also a suffix of this substring. A proper prefix is a prefix that is not equal to the entire string.

Construction of the LPS Array

The LPS array can be constructed in linear time using the following algorithm:

  1. Initialization:

    • Create an array lps of length n and initialize all elements to 0.
    • Set lps[0] to 0 because a single character has no proper prefix.
    • Initialize two pointers: len (length of the current longest prefix suffix) to 0 and i (current position in the pattern) to 1.
  2. Iteration:

    • While i < n, perform the following checks:
      • Match: If P[i] == P[len]:
        • Increment len by 1.
        • Set lps[i] to len.
        • Increment i by 1.
      • Mismatch: If P[i] != P[len]:
        • If len is not 0, update len to lps[len - 1]. This step utilizes the previously computed LPS values to avoid unnecessary comparisons.
        • If len is 0, set lps[i] to 0 and increment i by 1.

Example

Consider the pattern P = "ABABAC":

  1. Initialization:

    • lps = [0, 0, 0, 0, 0, 0]
    • len = 0, i = 1
  2. Iteration:

    • At i = 1: P[1] = 'B' and P[0] = 'A' (mismatch)
      • len is 0, so set lps[1] = 0 and increment i to 2.
    • At i = 2: P[2] = 'A' and P[0] = 'A' (match)
      • Increment len to 1, set lps[2] = 1, and increment i to 3.
    • At i = 3: P[3] = 'B' and P[1] = 'B' (match)
      • Increment len to 2, set lps[3] = 2, and increment i to 4.
    • At i = 4: P[4] = 'A' and P[2] = 'A' (match)
      • Increment len to 3, set lps[4] = 3, and increment i to 5.
    • At i = 5: P[5] = 'C' and P[3] = 'B' (mismatch)
      • len is not 0, so update len to lps[2] = 1.
      • Now, P[5] = 'C' and P[1] = 'B' (mismatch)
        • Update len to lps[0] = 0.
      • Set lps[5] = 0 and increment i to 6.

The final LPS array is [0, 0, 1, 2, 3, 0].


MOST IMPORTANT POINT:-

In the construction of the Longest Prefix Suffix (LPS) array for the Knuth-Morris-Pratt (KMP) algorithm, the step where len is updated to lps[len - 1] during a mismatch is crucial for efficiently computing the LPS values.

Context

While building the LPS array, two pointers are maintained:

  • i: Iterates through the pattern.
  • len: Tracks the length of the current longest proper prefix which is also a suffix.

When a mismatch occurs between pattern[i] and pattern[len], the algorithm needs to determine the next potential value for len without redundant comparisons.

The Step: len = lps[len - 1]

If len is not zero and a mismatch occurs, the algorithm updates len to lps[len - 1]. This adjustment leverages previously computed LPS values to find the next longest proper prefix that could potentially match the suffix ending at pattern[i - 1].

Why This Works:

  1. Avoiding Redundant Comparisons: By setting len = lps[len - 1], the algorithm skips over prefixes that have already been matched and found to be proper prefixes and suffixes. This prevents re-evaluating these segments, enhancing efficiency.

  2. Utilizing Precomputed Information: The LPS array stores lengths of the longest proper prefixes that are also suffixes for all sub-patterns. By referencing lps[len - 1], the algorithm quickly identifies the next potential prefix length that could align with the current suffix, based on prior computations.

  3. Ensuring Correctness: This step ensures that all possible proper prefixes are considered in decreasing order of length, maintaining the correctness of the LPS array construction.

Illustrative Example:

Consider the pattern P = "ABABC":

  • At i = 4 (P[4] = 'C'), suppose len = 2.
  • The current prefix of length 2 is "AB", and the suffix of length 2 ending at P[3] is also "AB".
  • Since P[4] ('C') does not match P[2] ('A'), a mismatch occurs.
  • Instead of resetting len to 0, the algorithm updates len to lps[len - 1], which is lps[1] = 0.
  • This adjustment allows the algorithm to consider the next possible proper prefix length without redundant comparisons.

By updating len to lps[len - 1], the algorithm efficiently narrows down the search for the longest proper prefix that is also a suffix, ensuring that all potential matches are considered systematically.

For a more in-depth understanding, you might find the following video helpful:

Comments