Class LongestCommonSubsequence

java.lang.Object
org.apache.commons.text.similarity.LongestCommonSubsequence
All Implemented Interfaces:
BiFunction<CharSequence,CharSequence,Integer>, ObjectSimilarityScore<CharSequence,Integer>, SimilarityScore<Integer>

public class LongestCommonSubsequence extends Object implements SimilarityScore<Integer>
A similarity algorithm indicating the length of the longest common subsequence between two strings.

The Longest common subsequence algorithm returns the length of the longest subsequence that two strings have in common. Two strings that are entirely different, return a value of 0, and two strings that return a value of the commonly shared length implies that the strings are completely the same in value and position. Note. Generally this algorithm is fairly inefficient, as for length m, n of the input CharSequence's left and right respectively, the runtime of the algorithm is O(m*n).

As of version 1.10, a more space-efficient of the algorithm is implemented. The new algorithm has linear space complexity instead of quadratic. However, time complexity is still quadratic in the size of input strings.

The implementation is based on Hirschberg's Longest Commons Substring algorithm (cited below).

For further reading see:

  • Lothaire, M. Applied combinatorics on words. New York: Cambridge U Press, 2005. 12-13
  • D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences," CACM, 1975, pp. 341--343.
Since:
1.0
  • Constructor Details

  • Method Details

    • apply

      public Integer apply(CharSequence left, CharSequence right)
      Computes the longest common subsequence similarity score of two CharSequence's passed as input.

      This method implements a more efficient version of LCS algorithm which has quadratic time and linear space complexity.

      This method is based on newly implemented algorithmB(CharSequence, CharSequence). An evaluation using JMH revealed that this method is almost two times faster than its previous version.

      Specified by:
      apply in interface BiFunction<CharSequence,CharSequence,Integer>
      Specified by:
      apply in interface ObjectSimilarityScore<CharSequence,Integer>
      Specified by:
      apply in interface SimilarityScore<Integer>
      Parameters:
      left - first character sequence
      right - second character sequence
      Returns:
      length of the longest common subsequence of left and right
      Throws:
      IllegalArgumentException - if either String input null
    • logestCommonSubsequence

      Deprecated.
      Deprecated as of 1.2 due to a typo in the method name. Use longestCommonSubsequence(CharSequence, CharSequence) instead. This method will be removed in 2.0.
      Computes the longest common subsequence between the two CharSequence's passed as input.

      Note, a substring and subsequence are not necessarily the same thing. Indeed, abcxyzqrs and xyzghfm have both the same common substring and subsequence, namely xyz. However, axbyczqrs and abcxyzqtv have the longest common subsequence xyzq because a subsequence need not have adjacent characters.

      For reference, we give the definition of a subsequence for the reader: a subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements.

      Parameters:
      left - first character sequence
      right - second character sequence
      Returns:
      the longest common subsequence found
      Throws:
      IllegalArgumentException - if either String input null
    • longestCommonSubsequence

      Computes the longest common subsequence between the two CharSequence's passed as input.

      This method implements a more efficient version of LCS algorithm which although has quadratic time, it has linear space complexity.

      Note, a substring and subsequence are not necessarily the same thing. Indeed, abcxyzqrs and xyzghfm have both the same common substring and subsequence, namely xyz. However, axbyczqrs and abcxyzqtv have the longest common subsequence xyzq because a subsequence need not have adjacent characters.

      For reference, we give the definition of a subsequence for the reader: a subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements.

      Parameters:
      left - first character sequence
      right - second character sequence
      Returns:
      the longest common subsequence found
      Throws:
      IllegalArgumentException - if either String input null
      Since:
      1.2
    • longestCommonSubstringLengthArray

      Deprecated.
      Deprecated as of 1.10. A more efficient implementation for calculating LCS is now available. Use longestCommonSubsequence(CharSequence, CharSequence) instead to directly calculate the LCS. This method will be removed in 2.0.
      Computes the lcsLengthArray for the sake of doing the actual lcs calculation. This is the dynamic programming portion of the algorithm, and is the reason for the runtime complexity being O(m*n), where m=left.length() and n=right.length().
      Parameters:
      left - first character sequence
      right - second character sequence
      Returns:
      lcsLengthArray