Python Tutorial - Tokenizers in Python
As we all know, an enormous amount of text data is available on the internet. However, most of us may not be familiar with the methods for processing that text data. We also know that handling the words of a language is a tricky task in machine learning, because machines work with numbers rather than raw letters.
So how do we process and clean text data in order to build a model? To answer this question, let's explore some of the fascinating concepts of Natural Language Processing (NLP).
Solving an NLP problem is a multi-stage process. We first have to clean the unstructured text data before we can move on to the modeling stage. Data cleaning involves a few key steps, listed below (a minimal sketch of these steps follows the list):
- Word tokenization
- Predicting the part of speech of each token
- Text lemmatization
- Identifying and removing stop words, and so on.
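Here is a minimal, hypothetical sketch of these cleaning steps using the NLTK library (covered later in this tutorial). The example sentence and the exact data packages to download are assumptions for illustration, and the details may vary between NLTK versions:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK data packages
nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
nltk.download('wordnet')                     # lemmatizer dictionary
nltk.download('stopwords')                   # stop-word lists

text = "The striped bats were hanging on their feet."
tokens = nltk.word_tokenize(text)            # 1. word tokenization
tags = nltk.pos_tag(tokens)                  # 2. part of speech of each token
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower()) for t in tokens]            # 3. lemmatization
stop_words = set(stopwords.words('english'))
cleaned = [t for t in lemmas if t.isalpha() and t not in stop_words]  # 4. stop-word removal
print(cleaned)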
In the following tutorial, we will cover the basics of tokenization. We will understand what tokenization is and why it is essential for Natural Language Processing (NLP). We will also look at some of the distinct ways of performing tokenization in Python.
Understanding Tokenization
Tokenization means splitting a large body of text into smaller fragments called tokens. These fragments, or tokens, are useful for finding patterns and are considered the base step for stemming and lemmatization. Tokenization also helps in replacing sensitive data elements with non-sensitive ones.
Natural Language Processing (NLP) is used to build applications such as text classification, sentiment analysis, intelligent chatbots, and language translation. Understanding the patterns present in the text is therefore important for these purposes.
For now, think of stemming and lemmatization as the main steps for cleaning text data with NLP. Tasks such as text classification or spam filtering use NLP together with deep learning libraries such as Keras and TensorFlow.
Understanding the Importance of Tokenization in NLP
To understand the importance of tokenization, let's take English as an example. Pick any sentence and keep it in mind while reading the following section.
Before we can process natural language, we have to identify the words that make up a string of characters. This is why tokenization is the most fundamental step in Natural Language Processing (NLP).
This step is necessary because the actual meaning of the text can only be interpreted by analyzing each word it contains.
Now, let's take the following string as an example:
My name is Jamie Clark.
After performing tokenization on the above string, we would get the following output:
['My', 'name', 'is', 'Jamie', 'Clark']
There are many uses for tokenized text. For instance, we can use the tokenized form to (see the sketch after this list):
- Count the total number of words in the text.
- Count the frequency of a word, i.e., the total number of times a particular word occurs.
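A minimal sketch of both counts, using the tokens above and Python's built-in collections.Counter:
from collections import Counter

tokens = ['My', 'name', 'is', 'Jamie', 'Clark']
print(len(tokens))       # total number of words: 5
print(Counter(tokens))   # frequency of each word, e.g. Counter({'My': 1, 'name': 1, ...})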
Now, let's look at some of the ways of performing tokenization for Natural Language Processing (NLP) in Python.
Some Methods for Performing Tokenization in Python
There are various distinct ways of performing tokenization on text data. Some of them are described below.
Tokenization Using the split() Function in Python
The split() function is one of the basic methods for splitting a string. It returns a list of strings after breaking the given string at the separator. By default, split() breaks the string at every whitespace character; however, we can specify the separator ourselves if needed.
Let's consider the following example:
Example 1.1: Word Tokenization Using the split() Function
my_text = """Let's play a game, Would You Rather! It's simple, you have to pick one or the other. Let's get started. Would you rather try Vanilla Ice Cream or Chocolate one? Would you rather be a bird or a bat? Would you rather explore space or the ocean? Would you rather live on Mars or on the Moon? Would you rather have many good friends or one very best friend? Isn't it easy though? When we have less choices, it's easier to decide. But what if the options would be complicated? I guess, you pretty much not understand my point, neither did I, at first place and that led me to a Bad Decision."""
print(my_text.split())
Output:
["Let's", 'play', 'a', 'game,', 'Would', 'You', 'Rather!', "It's", 'simple,', 'you', 'have', 'to', 'pick', 'one', 'or', 'the', 'other.', "Let's", 'get', 'started.', 'Would', 'you', 'rather', 'try', 'Vanilla', 'Ice', 'Cream', 'or', 'Chocolate', 'one?', 'Would', 'you', 'rather', 'be', 'a', 'bird', 'or', 'a', 'bat?', 'Would', 'you', 'rather', 'explore', 'space', 'or', 'the', 'ocean?', 'Would', 'you', 'rather', 'live', 'on', 'Mars', 'or', 'on', 'the', 'Moon?', 'Would', 'you', 'rather', 'have', 'many', 'good', 'friends', 'or', 'one', 'very', 'best', 'friend?', "Isn't", 'it', 'easy', 'though?', 'When', 'we', 'have', 'less', 'choices,', "it's", 'easier', 'to', 'decide.', 'But', 'what', 'if', 'the', 'options', 'would', 'be', 'complicated?', 'I', 'guess,', 'you', 'pretty', 'much', 'not', 'understand', 'my', 'point,', 'neither', 'did', 'I,', 'at', 'first', 'place', 'and', 'that', 'led', 'me', 'to', 'a', 'Bad', 'Decision.']
Explanation:
In the above example, we used the split() method to break the paragraph into smaller fragments, i.e., words. Similarly, we can split the paragraph into sentences by passing a separator as an argument to split(). Since a sentence usually ends with a full stop '.', we can use '. ' as the separator to split the string.
Let's consider this in the following example:
Example 1.2: Sentence Tokenization Using the split() Function
my_text = """Dreams. Desires. Reality. There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality. Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us. We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try."""
print(my_text.split('. '))
Output:
['Dreams', 'Desires', 'Reality', 'There is a fine line between dream to become a desire and a desire to become a reality but expectations are way far then the reality', 'Nevertheless, we live in a world of mirrors, where we always want to reflect the best of us', 'We all see a dream, a dream of no wonder what; a dream that we want to be accomplished no matter how much efforts it needed but we try.']
Explanation:
In the above example, we passed the full stop followed by a space ('. ') as the argument to split(), so the paragraph is split at the end of each sentence. A major drawback of split() is that it accepts only one separator at a time, so we can split the string on only one delimiter. Moreover, split() does not treat punctuation marks as separate fragments.
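A quick check on a short, made-up string makes this drawback visible: the comma and the exclamation mark stay attached to the neighbouring words instead of becoming tokens of their own.
print("Hello, world! How are you?".split())
# ['Hello,', 'world!', 'How', 'are', 'you?']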
Tokenization Using Regular Expressions (RegEx) in Python
Before moving on to the next method, let's briefly go over regular expressions. A regular expression, also known as RegEx, is a special sequence of characters that lets us find or match other strings or sets of strings, using that sequence as a pattern.
To work with regular expressions (RegEx), Python provides the re library, which is part of the standard library.
Let's consider the following examples of word and sentence tokenization using the RegEx approach in Python.
Example 2.1: Word Tokenization Using the RegEx Method in Python
import re
my_text = """Joseph Arthur was a young businessman. He was one of the shareholders at Ryan Cloud's Start-Up with James Foster and George Wilson. The Start-Up took its flight in the mid-90s and became one of the biggest firms in the United States of America. The business was expanded in all major sectors of livelihood, starting from Personal Care to Transportation by the end of 2000. Joseph was used to be a good friend of Ryan."""
my_tokens = re.findall(r"[\w']+", my_text)
print(my_tokens)
Output:
['Joseph', 'Arthur', 'was', 'a', 'young', 'businessman', 'He', 'was', 'one', 'of', 'the', 'shareholders', 'at', 'Ryan', "Cloud's", 'Start', 'Up', 'with', 'James', 'Foster', 'and', 'George', 'Wilson', 'The', 'Start', 'Up', 'took', 'its', 'flight', 'in', 'the', 'mid', '90s', 'and', 'became', 'one', 'of', 'the', 'biggest', 'firms', 'in', 'the', 'United', 'States', 'of', 'America', 'The', 'business', 'was', 'expanded', 'in', 'all', 'major', 'sectors', 'of', 'livelihood', 'starting', 'from', 'Personal', 'Care', 'to', 'Transportation', 'by', 'the', 'end', 'of', '2000', 'Joseph', 'was', 'used', 'to', 'be', 'a', 'good', 'friend', 'of', 'Ryan']
Explanation:
In the above example, we used the findall() method of the re library. This method finds all the substrings that match the pattern given as its argument and stores them in a list.
Furthermore, "\w" matches any word character, i.e., an alphanumeric character (letter or digit) or the underscore (_), and "+" means one or more occurrences. We therefore used the pattern [\w']+ so that the program keeps matching word characters (and apostrophes) until it encounters any other character.
Now, let's look at sentence tokenization using the RegEx method in Python.
Example 2.2: Sentence Tokenization Using the RegEx Method in Python
import re
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""
my_sentences = re.compile('[.!?] ').split(my_text)
print(my_sentences)
Output:
['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America', 'The product became so successful among the people that the production was increased', 'Two new plant sites were finalized, and the construction was started', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories', 'Many popular magazines were started publishing Critiques about him.']
Explanation:
In the above example, we used the compile() function of the re library with the pattern '[.!?] ' and then called the split() method on the compiled pattern to break the string apart. As a result, the program splits the text whenever it encounters one of these characters followed by a space.
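The same result can also be obtained without compile(), by calling re.split() directly with the pattern; this is purely a stylistic alternative:
import re

my_sentences = re.split('[.!?] ', my_text)
print(my_sentences)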
Tokenization Using the Natural Language Toolkit in Python
The Natural Language Toolkit, also known as NLTK, is a library written in Python. The NLTK library is widely used for symbolic and statistical natural language processing and works well with text data.
The Natural Language Toolkit (NLTK) is a third-party library that can be installed from a command shell or terminal with the following command:
$ pip install --user -U nltk
To verify that the installation was successful, we can import the nltk library in a program and run it, as shown below:
import nltk
If the program does not raise an error, the library has been installed successfully. Otherwise, it is recommended to repeat the installation steps and consult the official documentation for more details.
The Natural Language Toolkit (NLTK) provides a module named tokenize, which covers two sub-categories: word tokenization and sentence tokenization.
- Word tokenization: the word_tokenize() method splits a string into tokens, i.e., words.
- Sentence tokenization: the sent_tokenize() method splits a string or paragraph into sentences.
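Note that both methods rely on NLTK's pre-trained Punkt tokenizer models, which are downloaded separately from the library itself. If the calls below raise a LookupError, downloading the punkt data package once usually resolves it (the exact package name may differ slightly between NLTK versions):
import nltk

nltk.download('punkt')   # one-time download of the Punkt tokenizer models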
Let's consider some examples based on these two methods.
Example 3.1: Word Tokenization Using the NLTK Library in Python
from nltk.tokenize import word_tokenize
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""
print(word_tokenize(my_text))
Output:
['The', 'Advertisement', 'was', 'telecasted', 'nationwide', ',', 'and', 'the', 'product', 'was', 'sold', 'in', 'around', '30', 'states', 'of', 'America', '.', 'The', 'product', 'became', 'so', 'successful', 'among', 'the', 'people', 'that', 'the', 'production', 'was', 'increased', '.', 'Two', 'new', 'plant', 'sites', 'were', 'finalized', ',', 'and', 'the', 'construction', 'was', 'started', '.', 'Now', ',', 'The', 'Cloud', 'Enterprise', 'became', 'one', 'of', 'America', "'s", 'biggest', 'firms', 'and', 'the', 'mass', 'producer', 'in', 'all', 'major', 'sectors', ',', 'from', 'transportation', 'to', 'personal', 'care', '.', 'Director', 'of', 'The', 'Cloud', 'Enterprise', ',', 'Ryan', 'Cloud', ',', 'was', 'now', 'started', 'getting', 'interviewed', 'over', 'his', 'success', 'stories', '.', 'Many', 'popular', 'magazines', 'were', 'started', 'publishing', 'Critiques', 'about', 'him', '.']
Explanation:
In the above program, we imported the word_tokenize() method from the tokenize module. As a result, the method splits the string into individual tokens and stores them in a list. Notably, this method treats full stops and other punctuation marks as separate tokens.
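A quick comparison on the same short, made-up string used earlier shows how this differs from the plain split() approach:
from nltk.tokenize import word_tokenize

print("Hello, world! How are you?".split())
# ['Hello,', 'world!', 'How', 'are', 'you?']
print(word_tokenize("Hello, world! How are you?"))
# ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']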
Example 3.2: Sentence Tokenization Using the NLTK Library in Python
from nltk.tokenize import sent_tokenize
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him."""
print(sent_tokenize(my_text))
Output:
['The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America.', 'The product became so successful among the people that the production was increased.', 'Two new plant sites were finalized, and the construction was started.', "Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care.", 'Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories.', 'Many popular magazines were started publishing Critiques about him.']
Explanation:
In the above program, we imported the sent_tokenize() method from the tokenize module. As a result, the method splits the paragraph into individual sentences and stores them in a list.
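Unlike naively splitting on '. ', sent_tokenize() uses a pre-trained model that usually handles abbreviations correctly. On a short, made-up string, for example, we would expect the periods in 'Mr.' and 'U.S.' not to trigger a sentence break:
from nltk.tokenize import sent_tokenize

print(sent_tokenize("Mr. Smith moved to the U.S. in 1995. He founded a firm there."))
# Expected: two sentences, with 'Mr.' and 'U.S.' kept inside the first one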
Conclusion
In this tutorial, we learned about the concept of tokenization and its role in the overall Natural Language Processing (NLP) pipeline. We also covered several methods for tokenizing a given text or string in Python, at both the word level and the sentence level.