Python Tutorial: The difflib Module in Python

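In this tutorial, we look at the difflib module in the Python programming language. We discuss what the module does and work through examples based on its classes and functions.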

Let's get started.

Understanding the Python difflib Module

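difflib is a module in Python's standard library that provides classes and functions for comparing sequences. It reports the result of a comparison in a human-readable format, using deltas to show the differences clearly.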

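difflib is most often used to compare sequences of strings, but it can compare sequences of any other type as long as the elements are hashable. An object is hashable if its hash value never changes during its lifetime.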
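To illustrate the point about hashable elements, here is a minimal sketch that compares two lists of tuples rather than two strings. The variable names (route_a, route_b) and the data are made up for this illustration; it relies on SequenceMatcher and its ratio() method, both of which are introduced in the next section.

```python
from difflib import SequenceMatcher

# Hypothetical data: two sequences of hashable items (tuples), not strings
route_a = [(0, 0), (1, 0), (2, 0), (2, 1)]
route_b = [(0, 0), (1, 0), (1, 1), (2, 1)]

matcher = SequenceMatcher(a=route_a, b=route_b)
similarity = matcher.ratio()
print(similarity)  # three matching items out of eight in total: 2 * 3 / 8 = 0.75

# Unhashable elements (lists, for instance) are rejected with a TypeError
try:
    SequenceMatcher(a=[[0, 0]], b=[[0, 0]])
    unhashable_ok = True
except TypeError:
    unhashable_ok = False
print(unhashable_ok)  # False
```

Tuples work here because they are hashable; the same comparison with inner lists fails as soon as the matcher tries to index the second sequence.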

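The most commonly used classes in difflib are Differ and SequenceMatcher. The module also offers helper classes and functions for more specific tasks. The following sections cover some of them.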

Understanding the SequenceMatcher Class

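Let's start with a fairly self-explanatory part of difflib: the SequenceMatcher class. SequenceMatcher compares two given strings and returns data describing how similar they are. Let's try it with the ratio() method, which returns the similarity as a float between 0 and 1. Here is an example: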

Example:

# importing the difflib library and SequenceMatcher class  
import difflib  
from difflib import SequenceMatcher  
  
# defining the strings  
str_1 = "Welcome to javatiku"  
str_2 = "Welcome to Python tutorial"  
  
# using the SequenceMatcher() function  
my_seq = SequenceMatcher(a = str_1, b = str_2)  
  
# printing the result  
print("First String:", str_1)  
print("Second String:", str_2)  
print("Sequence Matched:", my_seq.ratio())

Output:

First String: Welcome to javatiku
Second String: Welcome to Python tutorial
Sequence Matched: 0.5333333333333333

Explanation:

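In the code snippet above, we first import the difflib module together with the SequenceMatcher class. We then define the two strings to compare. Next, we create a variable that holds a SequenceMatcher instance built from the two arguments a and b. The constructor actually accepts more parameters: its first positional parameter is isjunk, which defaults to None, followed by a and b (and an optional autojunk flag).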

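Because isjunk comes first, we pass the strings as keyword arguments so that each one is bound to the intended parameter, for example SequenceMatcher(a=str_1, b=str_2). The positional equivalent is SequenceMatcher(None, str_1, str_2).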

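Once the SequenceMatcher has both sequences, we call the ratio() method mentioned earlier and print its result. ratio() returns 2.0 * M / T as a float, where M is the number of matching elements and T is the total number of elements in both sequences. That is all it takes to compare two simple strings and measure their similarity.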

Note: ratio() is only one of the methods of the SequenceMatcher class. See the official Python documentation for the others, which perform different sequence operations.
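As an illustration of those other methods, the sketch below uses get_matching_blocks() and get_opcodes() on two short, made-up strings. It also rebuilds the ratio() value by hand as 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. The strings and variable names are assumptions chosen for this example.

```python
from difflib import SequenceMatcher

first, second = "abcd", "bcde"  # hypothetical inputs
matcher = SequenceMatcher(a=first, b=second)

# get_matching_blocks() returns Match(a, b, size) triples; the final triple
# is a zero-length sentinel, so it adds nothing to the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(first) + len(second)
manual_ratio = 2 * matched / total
print(matched, total, manual_ratio)  # 3 8 0.75

# get_opcodes() describes how to turn `first` into `second`
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(first[i1:i2]), "->", repr(second[j1:j2]))
```

The opcodes read as a small edit script: delete 'a', keep 'bcd', insert 'e'.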

Understanding the Differ Class

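The Differ class complements SequenceMatcher: instead of measuring similarity, it takes sequences of text lines and reports the differences between them. What sets Differ apart is that it produces deltas, which make the differences easier for a human to read.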

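For example, a line that appears only in the second sequence is prefixed with '+ '.

As you may have guessed, a line that appears only in the first sequence is prefixed with '- '.

A line that is identical in both sequences is prefixed with two spaces ('  '). A line starting with '? ' appears in neither input; it is a hint that marks which characters changed within the line above it. Note that Differ does not provide ratio(); that method belongs to SequenceMatcher.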
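The prefixes described above can be seen in a small sketch. The two lists of lines are invented for this illustration: one line is removed, one is added, and two stay the same.

```python
from difflib import Differ

# Hypothetical input: each list item stands for one line of text
before = ["apple", "banana", "cherry"]
after = ["apple", "cherry", "date"]

delta = list(Differ().compare(before, after))
for line in delta:
    print(line)
# Prints:
#   apple
# - banana
#   cherry
# + date
```

Unchanged lines carry the two-space prefix, "banana" exists only in the first list, and "date" exists only in the second.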

Let's consider the following example to see how the Differ class works.

Example:

# importing the difflib module and Differ class  
import difflib  
from difflib import Differ  
  
# defining the strings  
str_1 = "They would like to order a soft drink"  
str_2 = "They would like to order a corn pizza"  
  
# using the splitlines() function  
lines_str1 = str_1.splitlines()  
lines_str2 = str_2.splitlines()  
  
# using the Differ() and compare() function  
dif = difflib.Differ()  
my_diff = dif.compare(lines_str1, lines_str2)  
  
# printing the results  
print("First String:", str_1)  
print("Second String:", str_2)  
print("Difference between the Strings")  
print('\n'.join(my_diff))

Output:

First String: They would like to order a soft drink
Second String: They would like to order a corn pizza
Difference between the Strings
- They would like to order a soft drink
?                            ^ ^^ ^^ ^^

+ They would like to order a corn pizza
?                            ^ ^^ ^ ^^^

Explanation:

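In the code snippet above, we first import the difflib module together with the Differ class. We then define the two strings to compare and call the splitlines() method on each of them.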

Syntax:

lines_str1 = str_1.splitlines()  
lines_str2 = str_2.splitlines()

splitlines() returns a list of lines, so the strings are compared line by line instead of character by character.

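After creating a Differ instance (dif), we create another variable that stores the result of its compare() method, which takes the two lists of lines as arguments and returns a generator of delta lines.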

Syntax:

my_diff = dif.compare(lines_str1, lines_str2)

Finally, we call print() and join the lines in my_diff with newline characters so that the output is easier to read.
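For multi-line text, one possible variation is to keep the line endings when splitting, so that the delta can be joined with an empty string instead of a newline. The sketch below assumes two small made-up texts; it also shows what happens when plain strings are passed to compare() without splitting, in which case every character is treated as its own "line".

```python
from difflib import Differ

# Hypothetical multi-line inputs
old_text = "alpha\nbeta\ngamma\n"
new_text = "alpha\ngamma\ndelta\n"

# keepends=True keeps the trailing "\n" on every line
old_lines = old_text.splitlines(keepends=True)
new_lines = new_text.splitlines(keepends=True)

delta = "".join(Differ().compare(old_lines, new_lines))
print(delta, end="")
# Prints:
#   alpha
# - beta
#   gamma
# + delta

# Without splitting, a string is iterated character by character
char_delta = list(Differ().compare("ab", "a"))
print(char_delta)  # ['  a', '- b']
```

Keeping the line endings avoids the extra blank lines that appear when delta lines that already end in a newline are joined with another newline.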

Understanding the get_close_matches Function

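difflib provides another simple but powerful tool: the get_close_matches function. It does exactly what its name suggests: it takes a target string and a list of candidates and returns the candidates that match the target most closely. In outline, a call looks like this: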

Syntax:

get_close_matches(word, possibilities, n=3, cutoff=0.6)

As the syntax above shows, get_close_matches() accepts four arguments, but only the first two are required.

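The first argument is the target word, the one we want matches for. The second is the list of candidate strings, passed either as a literal list or as a variable that refers to one. The third argument, n, limits how many matches are returned. The last argument, cutoff, sets how similar a candidate must be to the target to be included in the result.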

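With only the first two arguments, the function uses the default cutoff of 0.6 (on a scale from 0 to 1) and the default result limit of 3. Let's consider the following example to see how the function works.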

Example:

# importing the difflib module and get_close_matches method  
import difflib  
from difflib import get_close_matches  
  
# using the get_close_matches method  
my_list = get_close_matches('mas', ['master', 'mask', 'duck', 'cow', 'mass', 'massive', 'python', 'butter'])  
  
# printing the list  
print("Matching words:", my_list)

Output:

Matching words: ['mass', 'mask', 'master']

Explanation:

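In the code snippet above, we import the difflib module and the get_close_matches function, and then call get_close_matches() on a list in which several items share characters with the target. The function returns only three words, even though a fourth, 'massive', is also similar to 'mas'. Now let's set the result limit n and the cutoff explicitly in the following example: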

Example:

# importing the difflib module and get_close_matches method  
import difflib  
from difflib import get_close_matches  
  
# using the get_close_matches method  
my_list = get_close_matches(  
    'mas',  
    ['master', 'mask', 'duck', 'cow',  
    'mass', 'massive', 'python', 'butter'],  
    n = 4,  
    cutoff = 0.6  
    )  
  
# printing the list  
print("Matching words:", my_list)

Output:

Matching words: ['mass', 'mask', 'master', 'massive']

Explanation:

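In the example above, we ask for up to four words similar to 'mas'. The cutoff behaves exactly as before, because the value we pass, 0.6, is the same as the default. We can change this parameter to make the matching stricter or more lenient: the closer the value is to 1, the stricter the constraint.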
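To see the effect of a stricter cutoff, the sketch below reuses the same word list with two higher cutoff values. The values 0.65 and 0.8 are arbitrary choices for this illustration. The scores in the comments follow the ratio 2 * M / T for each candidate compared with 'mas'.

```python
from difflib import get_close_matches

words = ['master', 'mask', 'duck', 'cow', 'mass', 'massive', 'python', 'butter']

# Scores against 'mas': mass 6/7, mask 6/7, master 6/9, massive 6/10
medium = get_close_matches('mas', words, n=4, cutoff=0.65)
strict = get_close_matches('mas', words, n=4, cutoff=0.8)

print("cutoff 0.65:", medium)  # ['mass', 'mask', 'master']
print("cutoff 0.8:", strict)   # ['mass', 'mask']
```

Even though n allows four results, 'massive' (0.6) drops out at a cutoff of 0.65, and 'master' (about 0.67) drops out at 0.8.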

Understanding the unified_diff and context_diff Functions

difflib contains two functions that work in the same way: unified_diff and context_diff. The only major difference between them is the format of the result.

unified_diff takes two lists of lines and returns a delta listing every line that was removed from or added to the first list.

Let's consider the following example to better understand how this function works.

Example:

# importing the required modules  
import sys  
import difflib  
from difflib import unified_diff  
  
# defining the string variables  
str_1 = ['Mark\n', 'Henry\n', 'Richard\n', 'Stella\n', 'Robin\n', 'Employees\n']  
str_2 = ['Arthur\n', 'Joseph\n', 'Stacey\n', 'Harry\n', 'Emma\n', 'Employees\n']  
  
# using the unified_diff() function  
sys.stdout.writelines(unified_diff(str_1, str_2))

Output:

--- 
+++ 
@@ -1,6 +1,6 @@
-Mark
-Henry
-Richard
-Stella
-Robin
+Arthur
+Joseph
+Stacey
+Harry
+Emma
 Employees

Explanation:

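In the code snippet above, we import the required modules and define two variables, each holding a list of words with one word per line. We then call unified_diff() to report which lines would have to be removed from the first list and which lines from the second list would have to be added to turn one into the other; the lists themselves are not modified. In the result, removed lines are prefixed with - and added lines with +. The last word, "Employees", appears in both lists, so it is shown with a leading space instead of a - or + prefix.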
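The empty "---" and "+++" header lines in the output can be filled in with the optional fromfile and tofile arguments of unified_diff(). The sketch below is a smaller variant of the example, with only one changed line; the file names old.txt and new.txt are placeholders invented for this illustration, and no files are read.

```python
from difflib import unified_diff

old = ['Mark\n', 'Henry\n', 'Employees\n']
new = ['Mark\n', 'Harry\n', 'Employees\n']

# fromfile/tofile only label the header lines; they are not opened
diff_text = ''.join(unified_diff(old, new, fromfile='old.txt', tofile='new.txt'))
print(diff_text, end='')
# Prints:
# --- old.txt
# +++ new.txt
# @@ -1,3 +1,3 @@
#  Mark
# -Henry
# +Harry
#  Employees
```

Here the unchanged lines around the change are kept as context, each with a leading space.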

context_diff works much like unified_diff. However, instead of showing what was inserted into and deleted from the original, it marks the lines that changed with a ! prefix.

Let's consider the following example to see how this function works.

Example:

# importing the required modules  
import sys  
import difflib  
from difflib import context_diff  
  
# defining the string variables  
str_1 = ['Mark\n', 'Henry\n', 'Richard\n', 'Stella\n', 'Robin\n', 'Employees\n']  
str_2 = ['Arthur\n', 'Joseph\n', 'Stacey\n', 'Harry\n', 'Emma\n', 'Employees\n']  
  
# using the context_diff() function  
sys.stdout.writelines(context_diff(str_1, str_2))

Output:

*** 
--- 
***************
*** 1,6 ****
! Mark
! Henry
! Richard
! Stella
! Robin
  Employees
--- 1,6 ----
! Arthur
! Joseph
! Stacey
! Harry
! Emma
  Employees

Explanation:

In the example above, we use context_diff to compare the same two lists. In the result, every changed line is marked with the '!' prefix.
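The '!' prefix is used for lines that were replaced; a line that was purely added is marked with '+ ' instead (and a purely deleted one with '- '). The sketch below, using made-up lists and placeholder file names, has one replaced line and one added line to show both markers.

```python
from difflib import context_diff

old = ['Mark\n', 'Henry\n', 'Employees\n']
new = ['Mark\n', 'Harry\n', 'Employees\n', 'Interns\n']

# old.txt and new.txt are hypothetical labels for the header lines
lines = list(context_diff(old, new, fromfile='old.txt', tofile='new.txt'))
print(''.join(lines), end='')
# Prints:
# *** old.txt
# --- new.txt
# ***************
# *** 1,3 ****
#   Mark
# ! Henry
#   Employees
# --- 1,4 ----
#   Mark
# ! Harry
#   Employees
# + Interns
```

The first block shows the affected region of the old list and the second block shows the same region of the new list.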
