Carleton University
Technical Report TR-14
January 1983
A Common Basis for Similarity Measures Involving Two Strings
R.L. Kashyap & B.J. Oommen
Abstract
Many numerical indices which quantify the similarity and dissimilarity between a pair of strings, X and Y, have been defined in the literature. Some of these include the length of their longest Common subsequence (LLCS(X,Y)), the length of their Shortest common supersequence (LSCSCX,Y)), and their Generalized Levenshtein Distance (GLD(X,V)). Some non-numerical indices relating the strings are the set of their common subsequences, the set of their common supersequences and the set of their shuffles. In this paper, we consider an abstract measure between X and V, written as D(X,V), defined in terms of two abstract operators i and 8 and a binary function d(•,•) whose arguments are symbols of an alphabet A. Depending on the various concrete operators used fort and@ and the specific function used for d(·,·), all the quantities discussed above can be seen to be particular cases of D(X,Y). We have presented an algorithm to recursively compute D(X,V), which can serve to be a common scheme to compute all these quantities. Many new results are obtained using this abstract formulation, such as an explicit linear relationship between the LLCS and the LSCS between two strings.
