Python Tutorial - FuzzyWuzzy Python Library

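In this tutorial, we will learn how to match strings with the third-party fuzzywuzzy library for Python and how to measure their similarity, using a range of examples.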

Introduction

Python provides several ways to compare two strings. The main ones are listed below.

Using regular expressions
Simple comparison
Using difflib
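
As a rough illustration of the three approaches listed above, the sketch below applies each of them to the same pair of strings. The sample strings and variable names are assumptions chosen for this example only.

```python
import difflib
import re

a = "Welcome to javatiku"
b = "welcome to javatiku"

# Simple comparison: exact, case-sensitive equality
exact = a == b

# Regular expression: match the whole string, ignoring case
regex_match = re.fullmatch(re.escape(a), b, flags=re.IGNORECASE) is not None

# difflib: similarity ratio between 0.0 and 1.0
similarity = difflib.SequenceMatcher(None, a, b).ratio()

print(exact)        # False
print(regex_match)  # True
print(similarity)
```

Only the difflib result is a graded score; the other two answer yes or no.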

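There is another approach that works well for comparison, called fuzzywuzzy. It is effective at recognizing two strings that refer to the same thing but are written slightly differently. Sometimes we also need a program that can recognize misspellings automatically.

Fuzzy string matching is the process of finding strings that approximately match a given pattern. It uses the Levenshtein distance to calculate the difference between sequences.

The library can help link databases that lack a common key, for example joining two tables on a company name that is written differently in each table.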
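
To make the table-linking idea concrete, here is a minimal sketch that uses only the standard library's difflib. The company names, the `link_names` helper, and the `cutoff` value are all hypothetical and chosen for illustration.

```python
import difflib

# Hypothetical company names as they appear in two separate tables
table_a = ["Acme Corporation", "Globex Inc.", "Initech LLC"]
table_b = ["ACME Corp", "Globex Incorporated", "Initech"]

def link_names(left, right, cutoff=0.5):
    """Map each name in `left` to its closest name in `right`, or None."""
    lowered = {name.lower(): name for name in right}
    links = {}
    for name in left:
        match = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
        links[name] = lowered[match[0]] if match else None
    return links

for left_name, right_name in link_names(table_a, table_b).items():
    print(left_name, "->", right_name)
```

A real join would usually need a stricter cutoff and a manual review of borderline matches.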

Example

Let's look at the following examples.

Str1 = "Welcome to javatiku"  
Str2 = "Welcome to javatiku"  
Result = Str1 == Str2  
print(Result)

Output:

True

The code above returns True because the strings match exactly (100%). What happens if we change Str2?

Str1 = "Welcome to javatiku"  
Str2 = "welcome to javatiku"  
Result = Str1 == Str2  
print(Result)

Output:

False

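Here the code returns False. To the human eye the strings look almost identical, but to the interpreter they are different. We can work around this by converting both strings to lowercase.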

Str1 = "Welcome to javatiku"  
Str2 = "welcome to javatiku"  
Result = Str1.lower() == Str2.lower()  
print(Result)

Output:

True

However, if the characters themselves differ, for example by an extra punctuation mark, we run into another problem.

Str1 = "Welcome to javatiku."  
Str2 = "Welcome to javatiku"  
Result = Str1.lower() == Str2.lower()  
print(Result)

Output:

False

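To handle this kind of problem we need a more capable tool for comparing strings, and fuzzywuzzy is well suited to scoring how similar two strings are.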
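
For context, a common workaround before turning to fuzzy matching is to normalize both strings first. The sketch below uses a hypothetical `normalize` helper that lowercases the text and strips ASCII punctuation. It fixes the punctuation case above but still fails on a genuine typo, which is where a similarity score becomes useful.

```python
import string

def normalize(text):
    """Lowercase the text and drop ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

# Case and trailing punctuation no longer matter
print(normalize("Welcome to javatiku.") == normalize("welcome to javatiku"))  # True

# A transposed pair of letters still breaks exact comparison
print(normalize("Welcome to javatiku") == normalize("Welcome to javatkiu"))   # False
```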

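Levenshtein Distance

The Levenshtein distance measures the distance between two sequences of characters. It is the minimum number of single-character edits needed to turn one string into the other. An edit can be an insertion, a deletion, or a substitution.

Example -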

import numpy as np

def levenshtein_distance(s1, t1, ratio_calculation=False):

    # Initialize matrix of zeros
    rows = len(s1) + 1
    cols = len(t1) + 1
    calc_distance = np.zeros((rows, cols), dtype=int)

    # Fill the first column and first row with the prefix lengths
    for i in range(1, rows):
        calc_distance[i][0] = i
    for k in range(1, cols):
        calc_distance[0][k] = k

    for col in range(1, cols):
        for row in range(1, rows):
            if s1[row-1] == t1[col-1]:
                cost = 0  # Matching characters cost nothing
            elif ratio_calculation:
                cost = 2  # For the ratio, a substitution counts as a deletion plus an insertion
            else:
                cost = 1
            calc_distance[row][col] = min(calc_distance[row-1][col] + 1,      # Cost of deletions
                                 calc_distance[row][col-1] + 1,          # Cost of insertions
                                 calc_distance[row-1][col-1] + cost)     # Cost of substitutions

    # The bottom-right cell holds the distance between the full strings
    distance = calc_distance[rows-1][cols-1]
    if ratio_calculation:
        # Computation of the Levenshtein distance ratio
        Ratio = ((len(s1) + len(t1)) - distance) / (len(s1) + len(t1))
        return Ratio
    else:
        return "The strings are {} edits away".format(distance)

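We will apply the function above to the earlier example that compares "Welcome to javatiku" and "welcome to javatiku". The two strings are very similar, so the Levenshtein distance between them is small.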

Str1 = "Welcome to javatiku"  
Str2 = "welcome to javatiku"  
Distance = levenshtein_distance(Str1,Str2)  
print(Distance)  
Ratio = levenshtein_distance(Str1,Str2,ratio_calculation = True)  
print(Ratio)
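
If NumPy is not available, the same distance can be computed with plain lists. The following is a minimal sketch that keeps only two rows of the table at a time; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(s, t):
    """Levenshtein distance using two rows instead of a full matrix."""
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))                           # 3
print(edit_distance("Welcome to javatiku", "welcome to javatiku"))  # 1
```

The result is the same as with a full table, but memory use grows with the length of one string rather than with the product of both lengths.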

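FuzzyWuzzy Package

The name of this library is a little odd and funny, but the library itself is useful. It offers several different methods for comparing two strings, and each one returns a score out of 100. To use the library, we first need to install it in our Python environment.

Installation

We can install the library with the pip command.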

pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0

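Now enter the following command and press Enter. The python-Levenshtein package is optional: without it, fuzzywuzzy falls back to the slower pure-Python matcher in difflib.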

pip install python-Levenshtein

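Let's look at the following methods of the fuzzywuzzy library.

Fuzz Module

The fuzz module compares two given strings at a time. Each of its comparison methods returns a score out of 100.

Fuzz.ratio()

This is one of the most important methods of the fuzz module. It compares the two strings and scores them according to how closely they match. Let's look at the following example.

Example -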

from fuzzywuzzy import fuzz  
Str1 = "Welcome to javatiku"  
Str2 = "welcome to javatiku"  
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())  
print(Ratio)

Output:

100

As the code above shows, fuzz.ratio() returns a score of 100 because the two strings are identical once both are converted to lowercase.
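
To show roughly where such a score comes from, here is a minimal sketch built on the standard library's difflib, which fuzzywuzzy itself falls back to when python-Levenshtein is not installed. The helper name `simple_ratio` is hypothetical, and this is an approximation of the idea rather than the library's own code. The score is 2*M/T scaled to 0-100, where M is the number of matching characters and T is the combined length of both strings.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity score from 0 to 100, computed as 2*M/T (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

print(simple_ratio("welcome to javatiku", "welcome to javatiku"))  # 100
print(simple_ratio("welcome to javatiku", "welcome to javatkiu"))  # high, but below 100
```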

Fuzz.partial_ratio()

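The fuzzywuzzy library provides another powerful method, partial_ratio(). It handles more complex comparisons such as substring matching. Let's look at the following example.

Example -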

#importing the module from the fuzzywuzzy library  
from fuzzywuzzy import fuzz  
  
str1 = "Welcome to javatiku"  
str2 = "javatiku"
Ratio = fuzz.ratio(str1.lower(),str2.lower())
Ratio_partial = fuzz.partial_ratio(str1.lower(),str2.lower())
print(Ratio)
print(Ratio_partial)

Output:

59
100

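Explanation:

The partial_ratio() method can detect substrings, so it reports 100% similarity here. It follows "best partial" logic: given a shorter string of length k and a longer string of length m, the algorithm finds the best-matching substring of length k in the longer string.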
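
The "best partial" idea can be sketched as a brute-force sliding window. This is a simplification written for illustration with a hypothetical `simple_partial_ratio` helper; the library picks its candidate windows differently, so scores may not always agree.

```python
from difflib import SequenceMatcher

def simple_partial_ratio(a, b):
    """Best score of the shorter string against every same-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    k = len(shorter)
    if k == 0:
        return 0
    best = 0.0
    # Slide a window of length k across the longer string
    for start in range(len(longer) - k + 1):
        window = longer[start:start + k]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)

print(simple_partial_ratio("welcome to javatiku", "javatiku"))  # 100
```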

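Fuzz.token_sort_ratio()

The methods above do not always give an accurate result: if the words in the two strings appear in a different order, the score drops.

The fuzzywuzzy module provides a solution for this. Let's look at the following example.

Example -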

str1 = "united states v. nixon"  
str2 = "Nixon v. United States"  
Ratio = fuzz.ratio(str1.lower(),str2.lower())  
Ratio_Partial = fuzz.partial_ratio(str1.lower(),str2.lower())  
Ratio_Token = fuzz.token_sort_ratio(str1,str2)  
print(Ratio)  
print(Ratio_Partial)  
print(Ratio_Token)

Output:

59
74
100

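Explanation:

In the code above we used the token_sort_ratio() method, which has an advantage over partial_ratio() here. It splits each string into tokens, sorts the tokens alphabetically, joins them back together, and then compares the results. There is another case to consider, though: what if the strings differ greatly in length?

Let's look at the following example.

Example -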
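
The sort-then-compare step can be sketched with the standard library as follows. The helpers `sort_tokens` and `simple_token_sort_ratio` are hypothetical names, and the preprocessing (lowercasing and replacing non-alphanumeric characters with spaces) is an assumption about what a reasonable cleanup looks like.

```python
import re
from difflib import SequenceMatcher

def sort_tokens(text):
    """Lowercase, replace non-alphanumerics with spaces, then sort the words."""
    words = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(sorted(words))

def simple_token_sort_ratio(a, b):
    """Compare two strings after sorting their tokens."""
    return round(100 * SequenceMatcher(None, sort_tokens(a), sort_tokens(b)).ratio())

print(sort_tokens("Nixon v. United States"))  # nixon states united v
print(simple_token_sort_ratio("united states v. nixon", "Nixon v. United States"))  # 100
```

Because both strings reduce to the same sorted token string, word order no longer affects the score.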

str1 = "The supreme court case of Democratic vs Congress"  
str2 = "Congress v. Democratic"  
Ratio = fuzz.ratio(str1.lower(),str2.lower())  
Partial_Ratio = fuzz.partial_ratio(str1.lower(),str2.lower())  
Token_Sort_Ratio = fuzz.token_sort_ratio(str1,str2)  
Token_Set_Ratio = fuzz.token_set_ratio(str1,str2)  
print(Ratio)  
print(Partial_Ratio)  
print(Token_Sort_Ratio)  
print(Token_Set_Ratio)

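Output:

In the code above we used another method, fuzz.token_set_ratio(). It performs a set operation that separates the tokens the two strings have in common from the rest, and then runs pairwise ratio() comparisons.

The sorted intersection of the tokens is always identical on both sides, so the score rises when that intersection makes up a larger share of the full string or when the remaining tokens are more similar to each other.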
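
The set-based comparison described above can be sketched like this. It is an illustrative approximation built on difflib with hypothetical helper names, not the library's implementation, so its scores may differ from fuzz.token_set_ratio().

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Set of lowercase alphanumeric words in the text."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def score(a, b):
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_token_set_ratio(a, b):
    tokens_a, tokens_b = tokens(a), tokens(b)
    common = " ".join(sorted(tokens_a & tokens_b))   # sorted intersection
    only_a = " ".join(sorted(tokens_a - tokens_b))   # remainder of a
    only_b = " ".join(sorted(tokens_b - tokens_a))   # remainder of b
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    # Pairwise comparisons; the best one is the final score
    return max(score(common, combined_a),
               score(common, combined_b),
               score(combined_a, combined_b))

print(simple_token_set_ratio("The supreme court case of Democratic vs Congress",
                             "Congress v. Democratic"))  # 95
```

The shorter string consists almost entirely of the shared tokens "congress democratic", so this sketch scores the pair highly despite the large difference in length.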

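The fuzzywuzzy package also provides the process module, which lets us find the strings in a list of choices that are most similar to a given string. Let's look at the following example.

Example -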

from fuzzywuzzy import process  
strToMatch = "Hello Good Morning"  
givenOpt = ["hello","Hello Good","Morning","Good Evenining"]  
ratios = process.extract(strToMatch,givenOpt)  
print(ratios)  
# We can choose the string that has highest matching percentage  
high = process.extractOne(strToMatch,givenOpt)  
print(high)

Output:

[('hello', 90), ('Hello Good', 90), ('Morning', 90), ('Good Evenining', 59)]
('hello', 90)

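The code above returns the match score for each string in the given list, and extractOne() returns the choice with the highest score.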
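
Conceptually, this kind of lookup scores every choice and keeps the best ones. The sketch below shows that idea with a plain difflib scorer and hypothetical helper names. Its scores will differ from the output above, because fuzzywuzzy's process functions use fuzz.WRatio as their default scorer.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Case-insensitive similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def simple_extract(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, best first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

def simple_extract_one(query, choices):
    """Return the single best (choice, score) pair, or None if there are no choices."""
    results = simple_extract(query, choices, limit=1)
    return results[0] if results else None

options = ["hello", "Hello Good", "Morning", "Good Evenining"]
print(simple_extract("Hello Good Morning", options))
print(simple_extract_one("Hello Good Morning", options))
```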

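Fuzz.WRatio

The fuzz module also provides WRatio, which usually gives better results than the simple ratio. It handles upper and lower case, punctuation, and some other factors, and it is the default scorer used by the process module. Let's look at the following example.

Example -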

from fuzzywuzzy import fuzz
print(fuzz.WRatio('good morning', 'Good Morning'))
print(fuzz.WRatio('good morning!!!','good Morning'))

Output:

100
100

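Conclusion

In this tutorial, we discussed how to match strings and how to measure their similarity. The examples are simple, but they are enough to show how a computer treats strings that do not match exactly. Many real-world applications, such as spell checking and DNA sequence matching in bioinformatics, rely on fuzzy string matching.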
