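This article demonstrates an extractive summarization process using a simple word frequency approach, implemented in Python. Before we start, note that we will not spend much effort here on data preprocessing, tokenization, normalization, and the like (much as last time), nor will we introduce any of the libraries that perform these tasks easily and effectively. I want to focus on the steps of text summarization and gloss over other important concepts. I plan to follow up on this article with several more, increasing the complexity of our natural language processing tasks as we go.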
from collections import Counter
from string import punctuation
# The stop_words module was removed from recent scikit-learn releases; the list lives in .text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words
Next, we need some text to test our summarization technique on. I manually copied and pasted this article from CNN, but feel free to find your own:
# https://www.cnn.com/2019/11/26/politics/judiciary-committee-hearing/index.html
text = """
The House Judiciary Committee has invited President Donald Trump or his counsel to participate in the panel's first impeachment hearing next week as the House moves another step closer to impeaching the President.
The committee announced that it would hold a hearing December 4 on the "constitutional grounds for presidential impeachment," with a panel of expert witnesses testifying.
House Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday notifying him of the hearing and inviting the President or his counsel to participate, including asking questions of the witnesses.
"I write to ask if you or your counsel plan to attend the hearing or make a request to question the witness panel," the New York Democrat wrote.
In the letter, Nadler said the hearing would "serve as an opportunity to discuss the historical and constitutional basis of impeachment, as well as the Framers' intent and understanding of terms like 'high crimes and misdemeanors.' "
"We expect to discuss the constitutional framework through which the House may analyze the evidence gathered in the present inquiry," Nadler added. "We will also discuss whether your alleged actions warrant the House's exercising its authority to adopt articles of impeachment."
The Judiciary Committee hearing is the latest sign that House Democrats are moving forward with impeachment proceedings against the President following the two-month investigation led by the House Intelligence Committee into allegations that Trump pushed Ukraine to investigate his political rivals while a White House meeting and $400 million in security aid were withheld from Kiev.
The hearing announcement comes as the Intelligence Committee plans to release its report summarizing the findings of its investigation to the House Judiciary Committee soon after Congress returns from its Thanksgiving recess next week.
Democratic aides declined to say what additional hearings they will schedule as part of the impeachment proceedings.
The Judiciary Committee is expected to hold multiple hearings related to impeachment, and the panel would debate and approve articles of impeachment before a vote on the House floor.
The aides said the first hearing was a "legal hearing" that would include some history of impeachment, as well as evaluating the seriousness of the allegations and the evidence against the President.
Nadler asked Trump to respond by Sunday on whether the White House wanted to participate in the hearings, as well as who would act as the President's counsel for the proceedings. The letter was copied to White House Counsel Pat Cipollone.
"""
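def tokenizer(s):
    # Lowercased tokens; punctuation stays attached to the words
    tokens = []
    # Split on any whitespace so the newlines between paragraphs also separate tokens
    for word in s.split():
        tokens.append(word.strip().lower())
    return tokens

def sent_tokenizer(s):
    # Naive sentence split on periods
    sents = []
    for sent in s.split('.'):
        sents.append(sent.strip())
    return sents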
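Note that we are using "importance" here as a synonym for relative word frequency in the document: we divide each word's number of occurrences by the number of occurrences of the most frequent word in the document. Does such high frequency equal real importance? Assuming so is naive, but it is also the simplest way to introduce the concept of text summarization. Interested in challenging our assumption of "importance" here? Try something like TF-IDF or word embeddings.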
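As a toy illustration of that normalization, here is a minimal sketch using three of the counts that appear in the word count output of this article. The names sample_counts and relative_freq are only for illustration.

```python
# Three counts taken from the article's word-count output
sample_counts = {"house": 10, "committee": 7, "judiciary": 5}

# Divide every count by the largest count, so the most frequent word scores 1.0
max_count = max(sample_counts.values())
relative_freq = {word: count / max_count for word, count in sample_counts.items()}

print(relative_freq)
# {'house': 1.0, 'committee': 0.7, 'judiciary': 0.5}
```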
tokens = tokenizer(text)
sents = sent_tokenizer(text)
['the', 'house', 'judiciary', 'committee', 'has', 'invited', 'president', 'donald', 'trump', 'or', 'his', 'counsel', 'to', 'participate', 'in', 'the', "panel's", 'first', 'impeachment', 'hearing', 'next', 'week', 'as', 'the',
'house', 'moves', 'another', 'step', 'closer', 'to', 'impeaching', 'the', 'president.', 'the', 'committee', 'announced', 'that', 'it', 'would', 'hold', 'a', 'hearing', 'december', '4', 'on', 'the', '"constitutional', 'grounds', 'for',
'the', 'white', 'house', 'wanted', 'to', 'participate', 'in', 'the', 'hearings,', 'as', 'well', 'as', 'who', 'would', 'act', 'as', 'the', "president's", 'counsel', 'for', 'the', 'proceedings.', 'the', 'letter', 'was', 'copied', 'to',
'white', 'house', 'counsel', 'pat', 'cipollone.']
["The House Judiciary Committee has invited President Donald Trump or his counsel to participate in the panel's first impeachment hearing next week as the House moves another step closer to impeaching the President", 'The committee
announced that it would hold a hearing December 4 on the "constitutional grounds for presidential impeachment," with a panel of expert witnesses testifying', 'House Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday
seriousness of the allegations and the evidence against the President', "Nadler asked Trump to respond by Sunday on whether the White House wanted to participate in the hearings, as well as who would act as the President's counsel for the
proceedings", 'The letter was copied to White House Counsel Pat Cipollone', '']
def count_words(tokens):
    word_counts = {}
    for token in tokens:
        # Skip stop words and bare punctuation tokens
        if token not in stop_words and token not in punctuation:
            if token not in word_counts.keys():
                word_counts[token] = 1
            else:
                word_counts[token] += 1
    return word_counts
word_counts = count_words(tokens)
{'house': 10,
'judiciary': 5,
'committee': 7,
'invited': 1,
'president': 3,
"president's": 1,
'proceedings.': 1,
'copied': 1,
'pat': 1,
'cipollone.': 1}
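Notice that tokens such as 'proceedings.' and 'cipollone.' keep their trailing period, so a word and its punctuated form are counted as different keys. The following is a minimal sketch of one way to tidy this up, assuming it is acceptable to strip leading and trailing punctuation from every token. The names count_words_stripped and SAMPLE_STOP_WORDS are hypothetical and not part of the article's code, and the tiny stop word set is only there to keep the example self-contained.

```python
from collections import Counter
from string import punctuation

# Hypothetical, deliberately tiny stop-word set used only for this illustration
SAMPLE_STOP_WORDS = {"the", "to", "of", "a", "was"}

def count_words_stripped(tokens, stop_words=SAMPLE_STOP_WORDS):
    # Remove punctuation from both ends of each token, e.g. 'president.' -> 'president'
    cleaned = (token.strip(punctuation) for token in tokens)
    # Drop tokens that are empty after stripping, as well as stop words
    return Counter(t for t in cleaned if t and t not in stop_words)

sample_tokens = ["the", "president", "president.", "proceedings.",
                 "proceedings", '"constitutional', "president's"]
sample_counts = count_words_stripped(sample_tokens)
print(sample_counts["president"], sample_counts["proceedings"])
```

Punctuation inside a token is left alone, so "president's" is still counted separately from "president".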
def word_freq_distribution(word_counts):
    freq_dist = {}
    max_freq = max(word_counts.values())
    for word in word_counts.keys():
        freq_dist[word] = (word_counts[word]/max_freq)
    return freq_dist
freq_dist = word_freq_distribution(word_counts)
{'house': 1.0,
'judiciary': 0.5,
'committee': 0.7,
'invited': 0.1,
'president': 0.3,
"president's": 0.1,
'proceedings.': 0.1,
'copied': 0.1,
'pat': 0.1,
'cipollone.': 0.1}
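Next we use the frequency distribution we generated to score our sentences. This is simply a matter of summing the scores of each word in a sentence and keeping that total. Our function accepts a max_len argument; sentences with max_len or more words are left unscored.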
def score_sentences(sents, freq_dist, max_len=40):
    sent_scores = {}
    for sent in sents:
        words = sent.split(' ')
        for word in words:
            if word.lower() in freq_dist.keys():
                # Only score sentences shorter than max_len words
                if len(words) < max_len:
                    if sent not in sent_scores.keys():
                        sent_scores[sent] = freq_dist[word.lower()]
                    else:
                        sent_scores[sent] += freq_dist[word.lower()]
    return sent_scores
sent_scores = score_sentences(sents, freq_dist)
{"The House Judiciary Committee has invited President Donald Trump or his counsel to participate in the panel's first impeachment hearing next week as the House moves another step closer to impeaching the President": 6.899999999999999,
'The committee announced that it would hold a hearing December 4 on the "constitutional grounds for presidential impeachment," with a panel of expert witnesses testifying': 2.8000000000000007,
'House Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday notifying him of the hearing and inviting the President or his counsel to participate, including asking questions of the witnesses': 5.099999999999999,
'"I write to ask if you or your counsel plan to attend the hearing or make a request to question the witness panel," the New York Democrat wrote': 2.5000000000000004,
'In the letter, Nadler said the hearing would "serve as an opportunity to discuss the historical and constitutional basis of impeachment, as well as the Framers\' intent and understanding of terms like \'high crimes and misdemeanors': 3.300000000000001,
'\' "\n"We expect to discuss the constitutional framework through which the House may analyze the evidence gathered in the present inquiry," Nadler added': 2.7,
'"We will also discuss whether your alleged actions warrant the House\'s exercising its authority to adopt articles of impeachment': 1.6999999999999997,
'The hearing announcement comes as the Intelligence Committee plans to release its report summarizing the findings of its investigation to the House Judiciary Committee soon after Congress returns from its Thanksgiving recess next week': 5.399999999999999,
'Democratic aides declined to say what additional hearings they will schedule as part of the impeachment proceedings': 1.3,
'The Judiciary Committee is expected to hold multiple hearings related to impeachment, and the panel would debate and approve articles of impeachment before a vote on the House floor': 4.300000000000001,
'The aides said the first hearing was a "legal hearing" that would include some history of impeachment, as well as evaluating the seriousness of the allegations and the evidence against the President': 2.8000000000000007,
"Nadler asked Trump to respond by Sunday on whether the White House wanted to participate in the hearings, as well as who would act as the President's counsel for the proceedings": 3.5000000000000004,
'The letter was copied to White House Counsel Pat Cipollone': 2.2}
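Because each score is a plain sum, a sentence with more words has more chances to accumulate points, which tends to favor long sentences. As a rough sketch of one possible alternative, and not something the approach above does, the sum could be divided by the sentence's word count. The function name score_sentences_avg is hypothetical.

```python
def score_sentences_avg(sents, freq_dist, max_len=40):
    # Variant of sum-based scoring that averages over all words in the sentence
    sent_scores = {}
    for sent in sents:
        words = sent.split()
        # Skip empty sentences and those with max_len or more words
        if not words or len(words) >= max_len:
            continue
        matched = [freq_dist[w.lower()] for w in words if w.lower() in freq_dist]
        if matched:
            # Words missing from freq_dist contribute zero to the average
            sent_scores[sent] = sum(matched) / len(words)
    return sent_scores

toy_freq = {"house": 1.0, "committee": 0.7}
short_sent = "The House committee"
long_sent = "The House committee met again today after a long recess"
avg_scores = score_sentences_avg([short_sent, long_sent], toy_freq)
print(avg_scores)
```

Both toy sentences would receive the same summed score, since they contain the same two scored words, while the averaged version ranks the shorter one higher.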
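Now that we have scored our sentences for relative importance, all that is left is to select (that is, the "extract" in "extractive summarization") the top k sentences to represent the summary of the article. This function takes the sentence scores we generated above, along with a value k that determines how many of the highest-scoring sentences to use. It returns a string summary made of the top sentences joined together, as well as the scores of the sentences used in the summary.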
def summarize(sent_scores, k):
    top_sents = Counter(sent_scores)
    summary = ''
    scores = []

    top = top_sents.most_common(k)
    for t in top:
        summary += t[0].strip()+'. '
        scores.append((t[1], t[0]))
    return summary[:-1], scores
summary, summary_sent_scores = summarize(sent_scores, 3)
The House Judiciary Committee has invited President Donald Trump or his
counsel to participate in the panel's first impeachment hearing next week as
the House moves another step closer to impeaching the President. The hearing
announcement comes as the Intelligence Committee plans to release its report
summarizing the findings of its investigation to the House Judiciary Committee
soon after Congress returns from its Thanksgiving recess next week. House
Judiciary Chairman Jerry Nadler sent a letter to Trump on Tuesday notifying
him of the hearing and inviting the President or his counsel to participate,
including asking questions of the witnesses.
for score in summary_sent_scores: print(score[0], '->', score[1], '\n')
6.899999999999999 -> The House Judiciary Committee has invited President
Donald Trump or his counsel to participate in the panel's first impeachment
hearing next week as the House moves another step closer to impeaching the President
5.399999999999999 -> The hearing announcement comes as the Intelligence Committee
plans to release its report summarizing the findings of its investigation to
the House Judiciary Committee soon after Congress returns from its Thanksgiving
recess next week
5.099999999999999 -> House Judiciary Chairman Jerry Nadler sent a letter to
Trump on Tuesday notifying him of the hearing and inviting the President or
his counsel to participate, including asking questions of the witnesses
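The summary above lists sentences from highest to lowest score, which is not necessarily the order in which they appear in the article. If the original order matters, one option is sketched below. It assumes sent_scores was built by iterating over the sentences in document order, so that the dictionary's insertion order (guaranteed in Python 3.7+) reflects the article. The name summarize_in_order is hypothetical.

```python
import heapq

def summarize_in_order(sent_scores, k):
    # Select the k highest-scoring sentences
    top = set(heapq.nlargest(k, sent_scores, key=sent_scores.get))
    # Re-emit them in dictionary (document) order rather than score order
    ordered = [sent.strip() for sent in sent_scores if sent in top]
    if not ordered:
        return ''
    return '. '.join(ordered) + '.'

toy_scores = {"A first sentence": 1.0, "B second sentence": 3.0, "C third sentence": 2.0}
ordered_summary = summarize_in_order(toy_scores, 2)
print(ordered_summary)
# B second sentence. C third sentence.
```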
An improvement on our baseline approach: using TF-IDF weights instead of simple word frequencies
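As a rough idea of what TF-IDF weighting could look like here, the sketch below scores sentences with a hand-rolled TF-IDF. It rests on several assumptions that are not from the article: each sentence is treated as its own "document", term frequency is the count divided by sentence length, and inverse document frequency is the plain unsmoothed log(N / df). Libraries such as scikit-learn use smoothed variants, so their numbers would differ. The name tfidf_sentence_scores is hypothetical.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sents, stop_words=frozenset()):
    # Tokenize each sentence into lowercased words, dropping stop words
    docs = [[w.lower() for w in sent.split() if w.lower() not in stop_words]
            for sent in sents]
    n_docs = len(docs)

    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))

    scores = {}
    for sent, words in zip(sents, docs):
        if not words:
            continue
        term_counts = Counter(words)
        # Sum of tf * idf over the distinct words of the sentence
        scores[sent] = sum(
            (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in term_counts.items()
        )
    return scores

toy_sents = ["the house votes", "the house waits", "the senate votes today"]
tfidf_scores = tfidf_sentence_scores(toy_sents, stop_words={"the"})
print(tfidf_scores)
```

Unlike raw frequency, a word that occurs in every sentence gets a weight of zero under this formulation, so sentences are rewarded for words that are distinctive rather than merely common.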
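Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.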