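Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Author: AI研习社-译站

2020-09-30 10:04

Lead: BERT performs slightly better than the earlier models and correctly identifies somewhat more Tech news than they do.

Bilingual original (subtitle team): NLP之文本分类：「Tf-Idf、Word2Vec和BERT」三种模型比较

English original: Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Translation: 雷锋字幕组 (关山, wiige)

Summary

In this article, I will use NLP and Python to explain 3 different strategies for multiclass text classification: the old-fashioned Bag-of-Words (with Tf-Idf), the famous Word Embedding (with Word2Vec), and the cutting-edge Language Models (with BERT).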

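NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied to classifying text data. Text classification is the problem of assigning categories to text data according to its content.

There are different techniques to extract information from raw text data and use it to train a classification model. This tutorial compares the old-school Bag-of-Words approach (used with a simple machine learning algorithm), the popular Word Embedding model (used with a deep learning neural network), and the state-of-the-art Language Models (used with transfer learning from attention-based transformers), which have completely revolutionized the NLP landscape.

I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example (link to the full code below).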

mdipietro09/DataScience_ArtificialIntelligence_Utils

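I will use the "News Category Dataset", which provides news headlines from 2012 to 2018 obtained from HuffPost. The task is to assign each headline to the right category, so this is a multiclass classification problem (link to the dataset below).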

News Category Dataset

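In particular, I will go through:

Setup: import packages, read data, preprocessing, partitioning.
Bag-of-Words: feature engineering, feature selection and machine learning with scikit-learn, testing and evaluation, explainability with lime.
Word Embedding: fitting a Word2Vec with gensim, feature engineering and deep learning with tensorflow/keras, testing and evaluation, explainability with the Attention mechanism.
Language Models: feature engineering with transformers, transfer learning from a pre-trained BERT with transformers and tensorflow/keras, testing and evaluation.

Setup

First of all, we need to import the following libraries: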

## for data
import json
import re
import pandas as pd
import numpy as np
## for plotting
import matplotlib.pyplot as plt
import seaborn as sns
## for text preprocessing
import nltk
## for bag-of-words
from sklearn import feature_extraction, feature_selection, model_selection, naive_bayes, pipeline, manifold, preprocessing, metrics
## for explainer
from lime import lime_text
## for word embedding
import gensim
import gensim.downloader as gensim_api
## for deep learning
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
## for bert language model
import transformers

The dataset is contained in a JSON file, so I will first read it into a list of dictionaries with json and then transform it into a pandas DataFrame.

lst_dics = []
with open('data.json', mode='r', errors='ignore') as json_file:
    for dic in json_file:
        lst_dics.append( json.loads(dic) )
## print the first one
lst_dics[0]
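The loop above calls json.loads once per line, which implies a file with one JSON object per line rather than a single JSON array. A minimal sketch of that parsing pattern, assuming a hypothetical two-record sample held in memory (the field names mirror the columns used later; the values are made up):

```python
import io
import json

# Hypothetical sample in "one JSON object per line" form; values are invented.
sample = io.StringIO(
    '{"category": "TECH", "headline": "New phone released"}\n'
    '{"category": "POLITICS", "headline": "Senate passes bill"}\n'
)

# Parse each non-empty line into its own dictionary.
records = [json.loads(line) for line in sample if line.strip()]
print(records[0]["category"], len(records))
```

Calling json.load on the whole file would fail for this layout, because the content as a whole is not one valid JSON document.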

The original dataset contains over 30 categories, but for the purposes of this tutorial I will work with a subset of 3: Entertainment, Politics, and Tech.

## create dtf
dtf = pd.DataFrame(lst_dics)
## filter categories
dtf = dtf[ dtf["category"].isin(['ENTERTAINMENT','POLITICS','TECH']) ][["category","headline"]]
## rename columns
dtf = dtf.rename(columns={"category":"y", "headline":"text"})
## print 5 random rows
dtf.sample(5)

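The class distribution shows that the dataset is imbalanced: the proportion of Tech news is very small compared with the other categories, which will make it hard for the model to recognize Tech news.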
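One quick way to check for imbalance is to count the labels and look at each class's share. The sketch below uses a hypothetical list of ten labels in place of the real dtf["y"] column, so the printed numbers are illustrative only:

```python
from collections import Counter

# Hypothetical labels; in the article the real ones live in dtf["y"].
labels = ["POLITICS"] * 6 + ["ENTERTAINMENT"] * 3 + ["TECH"] * 1

counts = Counter(labels)
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print classes from most to least frequent with their share of the data.
for label, n in counts.most_common():
    print(f"{label:<14}{n:>3}  {shares[label]:.0%}")
```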

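Before explaining and building the models, I will give an example of preprocessing: cleaning the text, removing stop words, and applying lemmatization. I will write a function and apply it to the whole dataset.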

def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    '''
    Preprocess a string.
    :parameter
        :param text: string - text to clean
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''
    ## clean (convert to lowercase, remove punctuation and special characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## Tokenize (convert from string to list)
    lst_text = text.split()

    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in
                    lst_stopwords]

    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## Lemmatisation (convert the word into root word)
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## back to string from list
    text = " ".join(lst_text)
    return text

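The function removes a set of words from the corpus if one is given. We can create a list of generic English stop words with nltk (and edit it by adding or removing words).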

lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords

Now I will apply the function to the whole dataset and store the result in a new column named "text_clean", so that you can choose to work with either the raw corpus or the preprocessed text.

dtf["text_clean"] = dtf["text"].apply(lambda x:
          utils_preprocess_text(x, flg_stemm=False, flg_lemm=True,
          lst_stopwords=lst_stopwords))
dtf.head()

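If you are interested in deeper text analysis and preprocessing, you can check out this article. I will split the dataset into a training set (70%) and a test set (30%) in order to evaluate the models' performance.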

## split dataset
dtf_train, dtf_test = model_selection.train_test_split(dtf, test_size=0.3)
## get target
y_train = dtf_train["y"].values
y_test = dtf_test["y"].values

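Let's get started!

Bag-of-Words

The Bag-of-Words model is simple: it builds a vocabulary from a corpus of documents and counts how many times each word appears in each document. In other words, each word in the vocabulary becomes a feature, and a document is represented by a vector with the same length as the vocabulary (a "bag of words"). For example, 3 sentences can be represented with this approach as a feature matrix of shape number of documents x length of vocabulary.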
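As a rough illustration of that representation, the sketch below builds the count matrix for three hypothetical sentences in plain Python (they are not the sentences from the original figure). It uses whitespace tokenization only; the article's actual pipeline relies on scikit-learn's vectorizers.

```python
# Three made-up sentences, used only to illustrate the shape of the matrix.
sentences = [
    "I like this article",
    "I like Python",
    "this article explains Python",
]

tokenized = [s.lower().split() for s in sentences]

# The vocabulary is the sorted set of distinct words across all documents.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word, cell = word count.
matrix = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Even in this toy case most cells are zero, which hints at the sparsity discussed next.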

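As you can imagine, this approach causes a significant dimensionality problem: the more documents you have, the larger the vocabulary, so the feature matrix will be a huge sparse matrix. Therefore, the Bag-of-Words model is usually preceded by substantial preprocessing (word cleaning, stop word removal, stemming/lemmatization) aimed at reducing the dimensionality problem.

Term frequency is not necessarily the best representation for text. In fact, some common words occur very frequently in the corpus yet have little predictive power over the target variable. To address this problem, there is an advanced variant of Bag-of-Words that uses the term frequency-inverse document frequency (Tf-Idf) instead of simple counts. Basically, the value of a word increases proportionally to its count in a document, but is inversely proportional to how frequently the word appears across the corpus.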
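To make the weighting concrete, here is a small worked example using the textbook definition, tf-idf = count x ln(N / df), where N is the number of documents and df is the number of documents containing the word. The corpus is hypothetical. Note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical tokenized documents.
docs = [
    ["the", "senate", "vote"],
    ["the", "new", "phone"],
    ["the", "movie", "premiere"],
]
n_docs = len(docs)

def idf(word):
    # Document frequency: number of documents that contain the word.
    df = sum(1 for doc in docs if word in doc)
    return math.log(n_docs / df)

def tfidf(word, doc):
    # Raw count in the document, scaled by how rare the word is in the corpus.
    return doc.count(word) * idf(word)

print(tfidf("the", docs[0]))     # word present in every document
print(tfidf("senate", docs[0]))  # word present in a single document
```

A word that appears in every document gets an idf of ln(1) = 0, so it contributes nothing, while a word confined to one document keeps a positive weight.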

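Let's start with Feature Engineering, the process of creating features by extracting information from the data. I will use the Tf-Idf vectorizer with a limit of 10,000 words (so the vocabulary length will be 10k), capturing unigrams (i.e. "new" and "york") and bigrams (i.e. "new york"). The code for the classic count vectorizer is given as well:

## Count (classic BoW); max_features is assumed to match the Tf-Idf line below
vectorizer = feature_extraction.text.CountVectorizer(max_features=10000, ngram_range=(1,2))
## Tf-Idf (advanced variant of BoW)
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))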

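Now I will use the vectorizer on the preprocessed corpus of the training set to extract a vocabulary and create the feature matrix.

corpus = dtf_train["text_clean"]
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The feature matrix X_train has a shape of 34,265 (number of documents in the training set) x 10,000 (length of the vocabulary), and it is very sparse:

sns.heatmap(X_train.todense()[:,np.random.randint(0,X_train.shape[1],100)]==0, vmin=0, vmax=1, cbar=False).set_title('Sparse Matrix Sample')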

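Random sample from the feature matrix (non-zero values in black)

To find the position of a certain word, look it up in the vocabulary:

word = "new york"
dic_vocabulary[word]

If the word exists in the vocabulary, this command returns a number N, meaning that the Nth feature of the matrix is that word.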

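In order to drop some columns and reduce the dimensionality of the matrix, we can carry out some Feature Selection, the process of selecting a subset of relevant variables. I will proceed as follows:

treat each category as binary (for example, the "Tech" category is 1 for Tech news and 0 for the others);
perform a Chi-Square test to determine whether a feature and the (binary) target are independent;
keep only the features with a certain p-value from the Chi-Square test.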
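The test in the steps above can be sketched on a single word with a plain 2x2 contingency table (word present or absent versus document in the class or not). The counts below are hypothetical. This is only an illustration of the idea: scikit-learn's chi2 computes its statistic from summed feature values per class, so its scores will not match this table-based version.

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-square statistic and p-value for the table [[a, b], [c, d]].

    a: docs in the class that contain the word
    b: docs outside the class that contain the word
    c: docs in the class without the word
    d: docs outside the class without the word
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected under independence of word and class.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: the word is concentrated in the class.
stat, p = chi2_2x2(40, 10, 60, 390)
# Hypothetical counts: the word is spread exactly as independence predicts.
independent_stat, independent_p = chi2_2x2(10, 40, 90, 360)
print(stat, p, independent_stat, independent_p)
```

With a limit of 0.95 on 1 - p, as in the code that follows, a feature is kept when p is below 0.05, that is, when independence from the class is unlikely.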

y = dtf_train["y"]
## get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
X_names = vectorizer.get_feature_names()
p_value_limit = 0.95
dtf_features = pd.DataFrame()
for cat in np.unique(y):
    chi2, p = feature_selection.chi2(X_train, y==cat)
    ## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    dtf_features = pd.concat([dtf_features, pd.DataFrame(
                   {"feature":X_names, "score":1-p, "y":cat})])
    dtf_features = dtf_features.sort_values(["y","score"],
                    ascending=[True,False])
    dtf_features = dtf_features[dtf_features["score"]>p_value_limit]
X_names = dtf_features["feature"].unique().tolist()

This reduces the number of features from 10,000 to 3,152 by keeping the most statistically relevant ones. Let's print some:

for cat in np.unique(y):
   print("# {}:".format(cat))
   print("  . selected features:",
         len(dtf_features[dtf_features["y"]==cat]))
   print("  . top features:", ",".join(
dtf_features[dtf_features["y"]==cat]["feature"].values[:10]))
   print(" ")

We can refit the vectorizer on the corpus by giving this new set of words as input. That will produce a smaller feature matrix and a shorter vocabulary.

vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

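The new feature matrix X_train has a shape of 34,265 (number of documents in the training set) x 3,152 (length of the given vocabulary). Let's see whether the matrix is less sparse:

Random sample from the new feature matrix (non-zero values in black)

It's time to train a machine learning model and test it. I recommend a Naive Bayes algorithm: a probabilistic classifier that makes use of Bayes' Theorem, a rule that uses probability to make predictions based on prior knowledge of conditions that might be related. This algorithm is well suited to such large datasets because it considers each feature independently, calculates the probability of each category, and then predicts the category with the highest probability.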

classifier = naive_bayes.MultinomialNB()
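To make the idea behind the classifier concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. The toy corpus and the helper names (log_posterior, predict) are hypothetical, and this is not scikit-learn's implementation; it only shows the "score every class independently, pick the highest" logic described above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training headlines with their labels.
train = [
    ("senate vote bill", "POLITICS"),
    ("president election vote", "POLITICS"),
    ("movie premiere star", "ENTERTAINMENT"),
    ("singer album star", "ENTERTAINMENT"),
]

class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocabulary = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    # log P(class) + sum of log P(word | class), with Laplace smoothing.
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for word in text.split():
        if word in vocabulary:  # words never seen in training are skipped
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total + alpha * len(vocabulary)))
    return score

def predict(text):
    # The predicted class is the one with the highest log posterior.
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("vote on the bill"))
print(predict("new album from the star"))
```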

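I'm going to train this classifier on the feature matrix and then test it on the transformed test set. To that end, I need to build a scikit-learn pipeline: a sequential application of a list of transformations followed by a final estimator. Putting the Tf-Idf vectorizer and the Naive Bayes classifier in a pipeline makes it easy to transform and predict the test data.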

## pipeline
model = pipeline.Pipeline([("vectorizer", vectorizer),
                           ("classifier", classifier)])
## train classifier
model["classifier"].fit(X_train, y_train)
## test
X_test = dtf_test["text_clean"].values
predicted = model.predict(X_test)
predicted_prob = model.predict_proba(X_test)

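We can now evaluate the performance of the Bag-of-Words model with the following metrics:

Accuracy: the fraction of predictions the model got right.
Confusion matrix: a summary table that breaks down the number of correct and incorrect predictions by class.
ROC: a plot of the true positive rate against the false positive rate at various threshold settings. The area under the curve (AUC) is the probability that the classifier ranks a randomly chosen positive observation higher than a randomly chosen negative one.
Precision: the fraction of retrieved instances that are relevant, TP / (TP + FP).
Recall: the fraction of relevant instances that are retrieved, TP / (TP + FN).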
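The definitions above can be checked by hand on a small binary example. The labels and scores below are hypothetical, and a threshold of 0.5 is assumed for turning scores into predictions. AUC is computed directly from its ranking interpretation, as the share of (positive, negative) pairs that are ordered correctly.

```python
# Hypothetical true labels and predicted scores for 8 observations.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed threshold: 0.5

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC as the fraction of (positive, negative) pairs ranked correctly;
# ties count as half a correct pair.
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0
          for p in pos for n in neg) / (len(pos) * len(neg))

print(accuracy, precision, recall, auc)
```

In the multiclass setting of this article, the same quantities are computed per class in a one-vs-rest fashion.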

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
auc = metrics.roc_auc_score(y_test, predicted_prob,
                            multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))

## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues,
            cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes,
       yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3,
              label='{0} (area={1:0.2f})'.format(classes[i],
                              metrics.auc(fpr, tpr))
               )
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05],
          xlabel='False Positive Rate',
          ylabel="True Positive Rate (Recall)",
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3,
               label='{0} (area={1:0.2f})'.format(classes[i],
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall',
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

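The Bag-of-Words model classified 85% of the test set correctly (accuracy of 0.85), but it struggles to recognize Tech news (only 252 predicted correctly).

Let's try to understand why the model assigns news to a certain category and assess how explainable these predictions are. The lime package can help us build an explainer. As an illustration, I will take one observation from the test set and see what the model predicts and why: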

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))
## show explanation
explainer = lime_text.LimeTextExplainer(class_names=
             np.unique(y_train))
explained = explainer.explain_instance(txt_instance,
             model.predict_proba, num_features=3)
explained.show_in_notebook(text=txt_instance, predict_proba=False)

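That makes sense: the words "Clinton" and "GOP" pointed the model in the right direction (Politics news), even though the word "Stage" is more common in Entertainment news.

Word Embedding

Word Embedding is the collective name for feature learning techniques in which words from the vocabulary are mapped to vectors of real numbers. These vectors are calculated from the probability distribution of each word appearing before or after another. To put it another way, words that share a context usually appear together in the corpus, so they will also be close in the vector space. For example, the 3 sentences from the previous example can be embedded in a 2D vector space.

In this tutorial, I'm going to use the first model of this family: Google's Word2Vec (2013). Other popular Word Embedding models are Stanford's GloVe (2014) and Facebook's FastText (2016).

Word2Vec produces a vector space, typically of several hundred dimensions, containing each unique word in the corpus, such that words sharing common contexts in the corpus are located close to one another in the space. The embeddings can be produced in two different ways: starting from a single word to predict its context (Skip-gram), or starting from the context to predict a word (Continuous Bag-of-Words).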
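The difference between the two approaches shows up in how training examples are built from a sentence. The sketch below generates only the (input, target) pairs for a fixed window; the function names are hypothetical, and the actual Word2Vec training (the network, negative sampling, randomly shrunk windows) is not shown.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """(context words, target) pairs: the neighbours predict the center word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["new", "york", "city", "mayor"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram yields one example per (word, neighbour) combination, whereas CBOW yields one example per position with all neighbours grouped as the input.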

In Python, you can load a pre-trained Word Embedding model from gensim-data like this:

nlp = gensim_api.load("word2vec-google-news-300")

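Instead of using a pre-trained model, I am going to fit my own Word2Vec on the training corpus with gensim. Before fitting the model, the corpus needs to be transformed into a list of lists of n-grams. In this particular case, I'll try to capture unigrams ("york"), bigrams ("new york"), and trigrams ("new york city").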

corpus = dtf_train["text_clean"]
## create list of lists of unigrams
lst_corpus = []
for string in corpus:
   lst_words = string.split()
   lst_grams = [" ".join(lst_words[i:i+1])
               for i in range(0, len(lst_words), 1)]
   lst_corpus.append(lst_grams)
## detect bigrams and trigrams
## (gensim >= 4 expects a str delimiter, i.e. delimiter=" ")
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
                 delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
            delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

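When fitting the Word2Vec, you need to specify:

the target size of the word vectors; I'll use 300;
the window, i.e. the maximum distance between the current and the predicted word within a sentence; I'll use the mean length of the texts in the corpus;
the training algorithm; I'll use skip-grams (sg=1), as in general it gives better results.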

## fit w2v
## (gensim >= 4 renames size to vector_size and iter to epochs)
nlp = gensim.models.word2vec.Word2Vec(lst_corpus, size=300,
            window=8, min_count=1, sg=1, iter=30)

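We now have our embedding model, so we can pick any word from the corpus and transform it into a 300-dimensional vector.

word = "data"
nlp.wv[word].shape

We can even visualize a word and its context in a lower-dimensional space (2D or 3D) by applying a dimensionality reduction algorithm (e.g. TSNE).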

word = "data"
fig = plt.figure()
## word embedding
tot_words = [word] + [tupla[0] for tupla in
                 nlp.wv.most_similar(word, topn=20)]
X = nlp.wv[tot_words]
## t-SNE to reduce dimensionality from 300 to 3
tsne = manifold.TSNE(perplexity=40, n_components=3, init='pca')
X = tsne.fit_transform(X)
## create dtf
dtf_ = pd.DataFrame(X, index=tot_words, columns=["x","y","z"])
dtf_["input"] = 0
dtf_.loc[word, "input"] = 1
## plot 3d
from mpl_toolkits.mplot3d import Axes3D
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dtf_[dtf_["input"]==0]['x'],
           dtf_[dtf_["input"]==0]['y'],
           dtf_[dtf_["input"]==0]['z'], c="black")
ax.scatter(dtf_[dtf_["input"]==1]['x'],
           dtf_[dtf_["input"]==1]['y'],
           dtf_[dtf_["input"]==1]['z'], c="red")
ax.set(xlabel=None, ylabel=None, zlabel=None, xticklabels=[],
       yticklabels=[], zticklabels=[])
for label, row in dtf_[["x","y","z"]].iterrows():
    x, y, z = row
    ax.text(x, y, z, s=label)

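That's pretty cool, but how can word embedding help predict the news category? The word vectors can be used as weights in a neural network. This is how:

First, transform the corpus into padded sequences of word ids to get a feature matrix.
Then, create an embedding matrix so that the vector of the word with id N is located in the Nth row.
Finally, build a neural network with an embedding layer that weighs every word in the sequences with the corresponding vector.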
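The three steps above can be sketched in plain Python on a tiny example. The corpus, the 3-dimensional vectors, and the helper to_padded_ids are all hypothetical; the id convention (0 for padding, 1 for out-of-vocabulary words) follows the one described later in this section, and the article's own code uses the Keras tokenizer instead.

```python
# Hypothetical corpus of n-gram lists and a short maximum length.
corpus = [["new york", "mayor", "vote"], ["phone", "vote"]]
maxlen = 5

# Step 1: word ids. Ids start at 2 because 0 is reserved for padding
# and 1 for out-of-vocabulary words.
word_index = {}
for doc in corpus:
    for word in doc:
        word_index.setdefault(word, len(word_index) + 2)

def to_padded_ids(doc):
    # Unknown words map to 1; short sequences are padded with 0 at the end.
    ids = [word_index.get(word, 1) for word in doc][:maxlen]
    return ids + [0] * (maxlen - len(ids))

X = [to_padded_ids(doc) for doc in corpus]

# Step 2: embedding matrix, where row N holds the vector of the word with id N.
# Made-up 3-dimensional vectors (the article uses 300 dimensions).
vectors = {"new york": [0.1, 0.2, 0.3], "vote": [0.4, 0.5, 0.6]}
dim = 3
embeddings = [[0.0] * dim for _ in range(len(word_index) + 2)]
for word, idx in word_index.items():
    if word in vectors:
        embeddings[idx] = vectors[word]

# Step 3: embedding lookup. Each id in a sequence selects one row.
embedded = [[embeddings[idx] for idx in row] for row in X]

print(X)
print(embedded[0])
```

Each document ends up as a maxlen x dim matrix, which is what the embedding layer hands to the rest of the network.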

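Let's start again with Feature Engineering, using tensorflow/keras to transform the same preprocessed corpus given to the Word2Vec (the list of lists of n-grams) into a list of sequences: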

## tokenize text
tokenizer = kprocessing.text.Tokenizer(lower=True, split=' ',
                     oov_token="NaN",
                     filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(lst_corpus)
dic_vocabulary = tokenizer.word_index
## create sequence
lst_text2seq= tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_train = kprocessing.sequence.pad_sequences(lst_text2seq,
                    maxlen=15, padding="post", truncating="post")

The feature matrix X_train has a shape of 34,265 x 15 (number of sequences x maximum sequence length). Let's visualize it:

sns.heatmap(X_train==0, vmin=0, vmax=1, cbar=False)
plt.show()

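Feature matrix (34,265 x 15)

Every text in the corpus is now an id sequence of length 15. For instance, if a text has 10 tokens, the sequence is made of 10 ids followed by 5 zeros, which are the padding elements (while the id of a word that is not in the vocabulary is 1). Let's print how a text from the training set has been transformed into a padded sequence of ids: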

i = 0

## list of text: ["I like this", ...]
len_txt = len(dtf_train["text_clean"].iloc[i].split())
print("from: ", dtf_train["text_clean"].iloc[i], "| len:", len_txt)

## sequence of token ids: [[1, 2, 3], ...]
len_tokens = len(X_train[i])
print("to: ", X_train[i], "| len:", len(X_train[i]))

## vocabulary: {"I":1, "like":2, "this":3, ...}
print("check: ", dtf_train["text_clean"].iloc[i].split()[0],
      " -- idx in vocabulary -->",
      dic_vocabulary[dtf_train["text_clean"].iloc[i].split()[0]])

print("vocabulary: ", dict(list(dic_vocabulary.items())[0:5]), "... (padding element, 0)")

Remember to apply the same feature engineering to the test set as well:

corpus = dtf_test["text_clean"]
## create list of n-grams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)
## detect common bigrams and trigrams using the fitted detectors
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## text to sequence with the fitted tokenizer
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_test = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15,
             padding="post", truncating="post")

X_test (14,697 x 15)

We now have X_train and X_test, so the next step is to create the embedding matrix that will be used as the weight matrix in the neural network classifier.

## start the matrix (length of vocabulary x vector size) with all 0s
embeddings = np.zeros((len(dic_vocabulary)+1, 300))
for word,idx in dic_vocabulary.items():
    ## update the row with vector
    try:
        embeddings[idx] = nlp.wv[word]
    ## if word not in model then skip and the row stays all 0s
    except KeyError:
        pass

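This code generates a matrix of shape 22,338 x 300 (length of the vocabulary extracted from the corpus x vector size). It can be navigated by word id, which can be obtained from the vocabulary.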

word = "data"
print("dic[word]:", dic_vocabulary[word], "|idx")
print("embeddings[idx]:", embeddings[dic_vocabulary[word]].shape,
      "|vector")

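It's finally time to build the deep learning model. I'm going to use the embedding matrix in the first Embedding layer of the neural network, which will then be trained to classify the news. Each id in the input sequence is used as an index to access the embedding matrix. The output of this Embedding layer is a 2D matrix holding the word vector of each word id in the input sequence (sequence length x vector size), as for the sentence "I like this article".

My neural network is structured as follows:

an Embedding layer that, as described above, takes the sequences as input and the word vectors as weights;
a simple Attention layer that won't affect the predictions but captures the weights of each instance, which allows us to build a nice explainer (it isn't necessary for the predictions, only for explainability, so you can skip it). The Attention mechanism was presented in this paper (2014) as a way for sequence models (i.e. LSTM) to work out which parts of a long text are actually relevant;
two layers of bidirectional LSTM to model the order of the words in a sequence in both directions;
two final dense layers that predict the probability of each news category.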

## code attention layer
def attention_layer(inputs, neurons):
    x = layers.Permute((2,1))(inputs)
    x = layers.Dense(neurons, activation="softmax")(x)
    x = layers.Permute((2,1), name="attention")(x)
    x = layers.multiply([inputs, x])
    return x

## input
x_in = layers.Input(shape=(15,))
## embedding
x = layers.Embedding(input_dim=embeddings.shape[0],
                     output_dim=embeddings.shape[1],
                     weights=[embeddings],
                     input_length=15, trainable=False)(x_in)
## apply attention
x = attention_layer(x, neurons=15)
## 2 layers of bidirectional lstm
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2,
                         return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2))(x)
## final dense layers
x = layers.Dense(64, activation='relu')(x)
y_out = layers.Dense(3, activation='softmax')(x)
## compile
model = models.Model(x_in, y_out)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()
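To see what the Permute, Dense(softmax), Permute, multiply chain in attention_layer computes, here is a plain-Python sketch for a single sequence. The function name attention, the tiny 2 x 2 input, and the zero-valued kernel and bias are hypothetical; in the real layer the kernel and bias are learned, and the computation runs on batches of 15 x 300 matrices.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(inputs, kernel, bias):
    """inputs: T x D (time steps x features); kernel: T x T; bias: length T."""
    T, D = len(inputs), len(inputs[0])
    weighted = [[0.0] * D for _ in range(T)]
    for d in range(D):
        # Permute((2,1)): look at one feature across all time steps.
        column = [inputs[t][d] for t in range(T)]
        # Dense: linear map over the time axis.
        scores = [sum(column[s] * kernel[s][t] for s in range(T)) + bias[t]
                  for t in range(T)]
        # Softmax: the weights over time steps sum to 1 for this feature.
        weights = softmax(scores)
        # Permute back and multiply element-wise with the original inputs.
        for t in range(T):
            weighted[t][d] = inputs[t][d] * weights[t]
    return weighted

inputs = [[1.0, 2.0], [3.0, 4.0]]            # 2 time steps, 2 features
zero_kernel = [[0.0, 0.0], [0.0, 0.0]]       # untrained, hypothetical weights
zero_bias = [0.0, 0.0]
print(attention(inputs, zero_kernel, zero_bias))
```

With all-zero parameters every time step gets the same weight (0.5 here), so the output is simply the input scaled by 1/T; training moves those weights toward the positions that matter for the prediction.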

Now we can train the model, and check its performance on a subset of the training set used for validation before testing it on the actual test set.

## encode y
dic_y_mapping = {n:label for n,label in
                 enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])

## train
training = model.fit(x=X_train, y=y_train, batch_size=256,
                     epochs=10, shuffle=True, verbose=0,
                     validation_split=0.3)

## plot loss and accuracy
## (named lst_metrics so that the sklearn "metrics" module is not shadowed)
lst_metrics = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)

ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in lst_metrics:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()

ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in lst_metrics:
     ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()

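Nice! In some epochs the accuracy reached 0.89. To complete the evaluation of the Word Embedding model, let's predict the test set and compare the same metrics used before (the code for the metrics is the same as before).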

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in
             predicted_prob]

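The model performs about as well as the previous one. In fact, it also struggles to classify Tech news.

But is it explainable as well? Yes, it is! I put an Attention layer in the neural network to extract the weights of each word and understand how much they contributed to classifying an instance. So I will try to use the Attention weights to build an explainer (similar to the one in the previous section):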

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
### 1. preprocess input
lst_corpus = []
for string in [re.sub(r'[^\w\s]','', txt_instance.lower().strip())]:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
X_instance = kprocessing.sequence.pad_sequences(
              tokenizer.texts_to_sequences(lst_corpus), maxlen=15,
              padding="post", truncating="post")

### 2. get attention weights
layer = [layer for layer in model.layers if "attention" in
         layer.name][0]
func = K.function([model.input], [layer.output])
weights = func(X_instance)[0]
weights = np.mean(weights, axis=2).flatten()

### 3. rescale weights, remove null vector, map word-weight
weights = preprocessing.MinMaxScaler(feature_range=(0,1)).fit_transform(np.array(weights).reshape(-1,1)).reshape(-1)
weights = [weights[n] for n,idx in enumerate(X_instance[0]) if idx
           != 0]
dic_word_weigth = {word:weights[n] for n,word in
                   enumerate(lst_corpus[0]) if word in
                   tokenizer.word_index.keys()}

### 4. barplot
top = 10  ## number of words to plot (assumed value; "top" is not defined in the article)
if len(dic_word_weigth) > 0:
   dtf_weights = pd.DataFrame.from_dict(dic_word_weigth, orient='index',
                                columns=["score"])
   dtf_weights.sort_values(by="score",
           ascending=True).tail(top).plot(kind="barh",
           legend=False).grid(axis='x')
   plt.show()
else:
   print("--- No word recognized ---")

### 5. produce html visualization
text = []
for word in lst_corpus[0]:
    weight = dic_word_weigth.get(word)
    if weight is not None:
         text.append('<b><span>' + word + '</span></b>')
    else:
         text.append(word)
text = ' '.join(text)

### 6. visualize on notebook
print("\033[1m"+"Text with highlighted words")
from IPython.core.display import display, HTML
display(HTML(text))

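Just like before, the words "clinton" and "gop" activated the neurons of the model, but this time "high" and "benghazi" were also considered slightly relevant for the prediction.

Language Models

Language Models, or Contextualized/Dynamic Word Embeddings, overcome the biggest limitation of the classic Word Embedding approach, polysemy: a word with several meanings (e.g. "bank" or "stick") is represented there by a single vector, so its senses cannot be told apart. One of the first popular models was ELMO (2018), which doesn't apply a fixed embedding but, using a bidirectional LSTM, looks at the entire sentence and then assigns an embedding to each word.

Then came Transformers: a new modeling technique presented in Google's paper Attention is All You Need (2017), which demonstrated that sequence models (like LSTM) can be entirely replaced by Attention mechanisms, even obtaining better performance.

Google's BERT (Bidirectional Encoder Representations from Transformers, 2018) combines ELMO-style context embeddings with several Transformers, and it is bidirectional (which was a big novelty for Transformers). The vector BERT assigns to a word is a function of the entire sentence; therefore, a word can have different vectors depending on the context. Let's try it with "bank river":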

txt = "bank river"
## bert tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
## bert model
nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased')
## return hidden layer with embeddings
input_ids = np.array(tokenizer.encode(txt))[None,:]
embedding = nlp(input_ids)
embedding[0][0]

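If the input text is changed to "bank money", the vectors that BERT returns change, including the one for "bank".

In order to complete a text classification task, you can use BERT in 3 different ways:

train it from scratch and use it as a classifier;
extract the word embeddings and use them in an embedding layer (as I did with Word2Vec);
fine-tune the pre-trained model (transfer learning).

I'm going with the third option and doing transfer learning from a pre-trained lighter version of BERT, called Distil-BERT (66 million parameters instead of 110 million).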

## distil-bert tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

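As usual, some Feature Engineering is needed before fitting the model, but this time it is a little trickier. To illustrate what has to be done, let's take the sentence "I like this article" again: it must be transformed into 3 vectors (Ids, Mask, Segment), giving a shape of 3 x sequence length.

First of all, we need to select the maximum sequence length. This time I'm going to choose a much larger number (i.e. 50) because BERT splits unknown words into sub-tokens until it finds a known unigram. For example, given a made-up word like "zzdata", BERT would split it into ["z", "##z", "##data"]. Moreover, we have to insert special tokens into the input text and then generate masks and segments. Finally, we put everything together in a tensor to get the feature matrix, which has a shape of 3 (ids, masks, segments) x number of documents in the corpus x sequence length.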
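The sub-token splitting and the mask/segment construction described above can be sketched without the transformers library. The toy vocabulary and the helper names (wordpiece, encode) are hypothetical, the splitting is a simplified greedy longest-match in the style of WordPiece, and the id lookup is left out because it needs the real BERT vocabulary.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into known sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, maxlen):
    # Add the special tokens around the sub-tokens.
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens += wordpiece(word, vocab)
    tokens = tokens[:maxlen - 1] + ["[SEP]"]
    # Mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(tokens) + [0] * (maxlen - len(tokens))
    tokens = tokens + ["[PAD]"] * (maxlen - len(tokens))
    # Segments: the counter increases after each [SEP], as in the article's code.
    segments, current = [], 0
    for token in tokens:
        segments.append(current)
        if token == "[SEP]":
            current += 1
    return tokens, mask, segments

# Toy vocabulary, chosen so that "zzdata" itself is unknown.
vocab = {"z", "##z", "##data", "data", "i", "like"}
print(wordpiece("zzdata", vocab))
print(encode("I like zzdata", vocab, maxlen=8))
```

Because one word can expand into several sub-tokens, the sequence length budget has to be larger than the number of words, which is why 50 is used here instead of 15.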

Please note that I'm using the raw text as corpus here (so far I've been using the text_clean column).

corpus = dtf_train["text"]
maxlen = 50

## add special tokens
## (int() replaces np.int, which was removed in NumPy 1.24)
maxqnans = int((maxlen-20)/2)
corpus_tokenized = ["[CLS] "+
             " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '',
             str(txt).lower().strip()))[:maxqnans])+
             " [SEP] " for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(
           txt.split(" "))) for txt in corpus_tokenized]

## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]

## generate idx
idx = [tokenizer.encode(seq.split(" ")) for seq in txt2seq]

## generate segments
segments = []
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
             i += 1
    segments.append(temp)

## feature matrix
X_train = [np.asarray(idx, dtype='int32'),
           np.asarray(masks, dtype='int32'),
           np.asarray(segments, dtype='int32')]

The feature matrix X_train has a shape of 3 x 34,265 x 50. We can pick one observation from the feature matrix and check it:

i = 0
print("txt: ", dtf_train["text"].iloc[i])
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train[0][i].tolist()])
print("idx: ", X_train[0][i])
print("mask: ", X_train[1][i])
print("segment: ", X_train[2][i])

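Running the same code on dtf_test["text"] gives X_test.

Now I'm going to build the deep learning model with transfer learning from the pre-trained BERT. Basically, I will summarize the output of BERT into one vector with Average Pooling and then add two final Dense layers to predict the probability of each news category.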
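Average pooling here simply means taking the mean of the token vectors along the sequence axis, so that each document is summarized by a single vector. A minimal sketch with a hypothetical 3 x 4 output (the base BERT and Distil-BERT models produce 768 hidden units per token, and the sequences in this article have 50 tokens):

```python
def average_pool(token_vectors):
    """Mean over the sequence axis: one vector per document."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

# Hypothetical model output for one document: 3 tokens x 4 hidden units.
bert_out = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = average_pool(bert_out)
print(pooled)
```

Note that a plain mean like this one, and like GlobalAveragePooling1D called without a mask, also averages over padding positions.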

Here is the code for the original version of BERT (remember to redo the feature engineering with the right tokenizer):

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
segments = layers.Input((50,), dtype="int32", name="input_segments")
## pre-trained bert
nlp = transformers.TFBertModel.from_pretrained("bert-base-uncased")
bert_out, _ = nlp([idx, masks, segments])
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)
## compile
model = models.Model([idx, masks, segments], y_out)
for layer in model.layers[:4]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

Here the lighter Distil-BERT is used in place of BERT:

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
## pre-trained bert with config
config = transformers.DistilBertConfig(dropout=0.2,
           attention_dropout=0.2)
config.output_hidden_states = False
nlp = transformers.TFDistilBertModel.from_pretrained(
           'distilbert-base-uncased', config=config)
bert_out = nlp(idx, attention_mask=masks)[0]
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)
## compile
model = models.Model([idx, masks], y_out)
for layer in model.layers[:3]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()
