Python Tutorial - The Wikipedia Module in Python
In this article, we discuss the Wikipedia module for Python and show how to use it from a Python script to fetch various kinds of information from Wikipedia.
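Introduction
The internet is one of the most important sources of information. With an internet connection, a vast body of knowledge is only a step away, so it is worth knowing how to collect the right information from the right source. Retrieving information from web sources in an automated way is called web scraping. We have all used Wikipedia; it is a treasure trove of information.
Wikipedia is one of the largest reference sites on the internet. It is a free, multilingual encyclopedia maintained by a community of volunteer editors through a wiki-based editing system.
The third-party wikipedia package for Python (referred to here as the Wikipedia module) fetches data from Wikipedia pages. It lets us retrieve and parse information from Wikipedia. In short, it behaves like a small web scraper that can fetch only a limited amount of data. Before we can use it, we need to install it locally.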
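Installation
This module wraps the official Wikipedia API. As a first step, we install the Wikipedia module with pip. Type the following command in a terminal:
$ pip install wikipedia
The command above installs the module on the system. Next, we import it with the following statement.
import wikipedia
We are now ready to extract data from Wikipedia.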
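Before going further, it can help to confirm that the package is importable in the current environment. The following is a minimal sketch that uses only the standard library; the helper name is_installed is hypothetical and is not part of the wikipedia package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("wikipedia"):
    print("wikipedia is available")
else:
    print("wikipedia is missing; install it with: pip install wikipedia")
```

If the second message appears, the pip command was most likely run for a different Python interpreter or virtual environment than the one running the script.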
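Getting Started with the Wikipedia Module
The Wikipedia module contains various built-in functions that help us get the information we need.
Searching Titles and Results
The search() function takes a query string as its argument and returns a list of article titles that match the query. Let's look at the following example.
Example -
import wikipedia
# Searching for a title
print(wikipedia.search("India"))
Output: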
['India', 'Constitution of India', 'Demographics of India', 'Languages of India', 'Republic Day (India)', 'Government of India', 'Economy of India', 'History of India', 'The Times of India', 'List of prime ministers of India']
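As the output above shows, the function returns the matching title along with related results. We can limit the number of titles returned by passing a value for the results parameter. Consider the following example.
Example -
import wikipedia
# Searching for a title, limited to four results
print(wikipedia.search("India", results=4))
Output: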
['India', 'Constitution of India', 'Demographics of India', 'Languages of India']
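The code above prints four results because we requested only four.
Suggestions
As the name implies, the suggest() function returns a suggested Wikipedia title for the query, or None if it finds no suggestion. Let's look at the following example.
Example -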
import wikipedia
print(wikipedia.suggest("Coronavrdsf"))
Output:
None
In the code above, we searched for a badly misspelled form of "coronavirus". suggest() returns None because it found no suggestion for that query.
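search() and suggest() can be combined so that a misspelled query still has a chance of producing results. The sketch below is one possible way to do that, not something the module provides: search_with_fallback is a hypothetical helper, and its search and suggest parameters exist only so the logic can be tried without network access. By default they fall back to wikipedia.search and wikipedia.suggest.

```python
def search_with_fallback(query, limit=5, search=None, suggest=None):
    """Search for titles; if nothing matches, retry with the suggested spelling."""
    if search is None or suggest is None:
        import wikipedia  # imported lazily so the helper can be exercised offline
        search = search or wikipedia.search
        suggest = suggest or wikipedia.suggest

    titles = search(query, results=limit)
    if titles:
        return titles

    # No direct hits: ask for a suggested title and search again.
    suggestion = suggest(query)
    if suggestion is None:
        return []
    return search(suggestion, results=limit)
```

With the package installed and a network connection, a call such as search_with_fallback("India", limit=4) would use the real Wikipedia functions.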
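Article Summary
The Wikipedia module provides the summary() function, which returns a plain-text summary of an article. It takes the page title and, optionally, a sentences argument, and returns the summary as a string. Consider the example below.
Example -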
import wikipedia
print(wikipedia.summary("Rohit Sharma", sentences=4))
Output:
Rohit Gurunath Sharma (born 30 April 1987) is an Indian international cricketer who plays for Mumbai in domestic cricket and captains Mumbai Indians in the Indian Premier League as a right-handed batsman and an occasional right-arm off break bowler. He is the vice-captain of the Indian national team in limited-overs formats.
Outside cricket, Sharma is an active supporter of animal welfare campaigns. He is the official Rhino Ambassador for WWF-India and is a member of People for the Ethical Treatment of Animals (PETA).
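The summary of the given title is printed; the sentences argument controls how many sentences the summary contains.
Remember that summary() raises a DisambiguationError if the title is ambiguous, that is, if it matches a disambiguation page rather than a single article. (If no page matches at all, it raises a PageError instead.) Let's look at the following example.
Example -
import wikipedia
print(wikipedia.summary("key"))
Output: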
Traceback (most recent call last):
File "C:/Users/DEVANSH SHARMA/PycharmProjects/MyPythonProject/pillow_image.py", line 194, in <module>
print(wikipedia.summary("key"))
File "C:\Users\DEVANSH SHARMA\PycharmProjects\MyPythonProject\venv\lib\site-packages\wikipedia\util.py", line 28, in __call__
ret = self._cache[key] = self.fn(*args, **kwargs)
File "C:\Users\DEVANSH SHARMA\PycharmProjects\MyPythonProject\venv\lib\site-packages\wikipedia\wikipedia.py", line 231, in summary
page_info = page(title, auto_suggest=auto_suggest, redirect=redirect)
File "C:\Users\DEVANSH SHARMA\PycharmProjects\MyPythonProject\venv\lib\site-packages\wikipedia\wikipedia.py", line 276, in page
return WikipediaPage(title, redirect=redirect, preload=preload)
File "C:\Users\DEVANSH SHARMA\PycharmProjects\MyPythonProject\venv\lib\site-packages\wikipedia\wikipedia.py", line 299, in __init__
self.__load(redirect=redirect, preload=preload)
raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Key" may refer to:
Key (cryptography)
Key (lock)
Key (map)
typewriter
test
Cay
Key, Alabama
Key, Ohio
Key, West Virginia
Keys, Oklahoma
Florida Keys
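In a script, it is usually better to catch these exceptions than to let the traceback end the program. The following is a minimal sketch under a few assumptions: safe_summary is a hypothetical helper, its wiki parameter exists only so a stand-in module can be passed for offline testing, and the returned messages are illustrative. The exception classes wikipedia.exceptions.DisambiguationError (with its options list) and wikipedia.exceptions.PageError come from the module itself.

```python
def safe_summary(title, sentences=2, wiki=None):
    """Return a summary, or a readable message if the title cannot be resolved."""
    if wiki is None:
        import wikipedia as wiki  # imported lazily so the helper can be tested offline

    try:
        return wiki.summary(title, sentences=sentences)
    except wiki.exceptions.DisambiguationError as err:
        # err.options holds the candidate titles listed in the traceback above
        candidates = ", ".join(err.options[:5])
        return f"'{title}' is ambiguous; it may refer to: {candidates}"
    except wiki.exceptions.PageError:
        return f"No Wikipedia page matches '{title}'"
```

Calling safe_summary("key") with the real module would then return the first few candidate titles instead of raising, and the caller can pick a more specific title such as "Key (cryptography)".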
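Extracting Metadata for a Title
We can fetch the full plain-text content of a Wikipedia page, excluding images, tables, and similar elements. The page object exposes this text through its content attribute. Let's look at the following example.
Example -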
import wikipedia
print(wikipedia.page("Sachin Tendulkar").content)
Output:
Sachin Ramesh Tendulkar ( (listen); born 24 April 1973) is an Indian former international cricketer who served as captain of the Indian national team. He is widely regarded as one of the greatest batsmen in the history of cricket. He is the highest run scorer of all time in International cricket. Considered as the world's most prolific batsman of all time, he is the only player to have scored one hundred international centuries, the first batsman to score a double century in a One Day International (ODI), the holder of the record for the most runs in both Test and ODI cricket, and the only player to complete more than 30,000 runs in international cricket. In 2013, he was the only Indian cricketer included in an all-time Test World XI named to mark the 150th anniversary of Wisden Cricketers' Almanac.
............
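The content string is one long block of plain text in which section headings appear on their own lines, wrapped in equals signs (for example "== History =="). The sketch below shows one way to split such a string into sections with a regular expression; split_sections is a hypothetical helper, and the sample text is made up for illustration rather than taken from a real page.

```python
import re

# A heading line looks like "== Title ==" or "=== Subtitle ===".
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(content):
    """Return (heading, text) pairs; the lead text before any heading gets None."""
    sections = []
    heading = None
    start = 0
    for match in HEADING_RE.finditer(content):
        sections.append((heading, content[start:match.start()].strip()))
        heading = match.group(2)
        start = match.end()
    sections.append((heading, content[start:].strip()))
    return sections


sample = "Lead text.\n\n\n== History ==\nOld days.\n\n\n=== Early life ===\nBorn.\n"
for title, text in split_sections(sample):
    print(title, "->", text)
```

With a real page, the same function could be applied to wikipedia.page("Sachin Tendulkar").content.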
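Getting Complete Wikipedia Page Data
The Wikipedia module lets us fetch a complete Wikipedia page with the page() function. It returns a page object that gives access to the page content, categories, coordinates, images, links, and other metadata. Let's look at the following example.
Example -
import wikipedia
# Create a Wikipedia page object ("object" would shadow the Python built-in, so use another name)
page_object = wikipedia.page("America")
# html is a method: without parentheses this prints the bound method, not the markup;
# call page_object.html() to request the page's HTML
print(page_object.html)
# Print the title
print(page_object.original_title)
# Print the first 20 links on the page
print(page_object.links[0:20])
Output:
<bound method WikipediaPage.html of <WikipediaPage 'United States'>>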
United States
['.as', '.com', '.edu', '.gov', '.gu', '.mil', '.mp', '.net', '.org', '.pr', '.um', '.us', '.vi', '100th meridian west', '117th United States Congress', '1790 United States Census', '1800 United States Census', '1810 United States Census', '1820 United States Census', '1830 United States Census']
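A page object carries several other attributes besides links. As a rough sketch, the helper below gathers a few of them into a dictionary. page_overview and its max_items parameter are hypothetical names chosen for illustration; title, url, links, categories, and images are attributes of the page object returned by page(). The helper accepts any object that has those attributes.

```python
def page_overview(page, max_items=3):
    """Collect a few commonly used attributes of a page object into a dict."""
    return {
        "title": page.title,
        "url": page.url,
        "link_count": len(page.links),
        "first_links": page.links[:max_items],
        "first_categories": page.categories[:max_items],
        "first_images": page.images[:max_items],
    }
```

With the package installed, page_overview(wikipedia.page("India")) would return such a dictionary for the India article; each attribute is fetched over the network the first time it is accessed, so the call may take a moment.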
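Customizing the Page Language
We can change the language in which pages are fetched; the default is English. Use the set_lang() function to change it. Each language has a standard prefix code (for example, "hi" for Hindi), which is passed as the function's argument. Let's look at the following example.
Example -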
import wikipedia
wikipedia.set_lang("hi")
print(wikipedia.summary("Python"))
Output:
[Hindi-language summary of the Python article. The Devanagari text was lost to a character-encoding error in this copy; only the Latin-script fragments survive, including "General Purpose and High Level Programming language", "Guido van Rossum", "1991", "standard library", and "code readability".]
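As the code above shows, the summary is now fetched from the Hindi-language edition of Wikipedia. We can use set_lang() to switch to any language edition that Wikipedia offers.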
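Because set_lang() changes a module-wide setting, every later call is affected until the language is changed again. The sketch below shows one way to fetch a single summary in another language and then switch back. summary_in_language is a hypothetical helper, the wiki parameter exists only for offline testing, and restoring to "en" assumes the script was using the default English edition beforehand.

```python
def summary_in_language(title, lang, sentences=1, default_lang="en", wiki=None):
    """Fetch a summary from another language edition, then restore the default."""
    if wiki is None:
        import wikipedia as wiki  # imported lazily so the helper can be tested offline

    wiki.set_lang(lang)
    try:
        return wiki.summary(title, sentences=sentences)
    finally:
        # Assumption: the caller was using default_lang before this call.
        wiki.set_lang(default_lang)
```

The module also provides wikipedia.languages(), which returns a dictionary mapping language prefixes to language names and can be used to check a prefix before passing it to set_lang().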
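Conclusion
We have covered the main concepts of accessing the Wikipedia API from Python code. We also discussed how to retrieve various kinds of information, such as page titles, summaries, and categories, and how to extract data from the web.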