Text Classification with BERT in PyTorch
How to leverage a pre-trained BERT model from Hugging Face to classify the text of news articles
Back in 2018, Google developed a powerful Transformer-based machine learning model for NLP applications that outperformed previous language models on different benchmark datasets. This model is called BERT.
In this post, we're going to use a pre-trained BERT model from Hugging Face for a text classification task. As you might already know, the main goal of the model in a text classification task is to categorize a text into one of the predefined labels or tags.
Specifically, we're going to use the pre-trained BERT model to classify whether the text of a news article belongs to the sport, politics, business, entertainment, or tech category.
But before we dive into the implementation, let's briefly talk about the concept behind BERT.
What is BERT?
BERT is an acronym for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.
The BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder encapsulates two sub-layers: a self-attention layer and a feed-forward layer.
There are two different BERT models:
BERT base, which consists of 12 layers of Transformer encoders, 12 attention heads, a hidden size of 768, and 110M parameters.
BERT large, which consists of 24 layers of Transformer encoders, 16 attention heads, a hidden size of 1024, and 340M parameters.
There are at least two reasons why BERT is a powerful language model:
It is pre-trained on unlabeled data extracted from BooksCorpus, which has 800M words, and from Wikipedia, which has 2,500M words.
As the name suggests, it is pre-trained by utilizing the bidirectional nature of the encoder stacks. This means that BERT learns information from a sequence of words not only from left to right, but also from right to left.
BERT Input and Output
The BERT model expects a sequence of tokens (words) as input. In each sequence of tokens, there are two special tokens that BERT expects as input:
[CLS]: This is the first token of every sequence, and it stands for classification token.
[SEP]: This is the token that lets BERT know which token belongs to which sequence. This special token is mainly important for a next sentence prediction task or a question-answering task. If we only have one sequence, then this token will be appended to the end of the sequence.
To make this clearer, let's say we have a text consisting of the following short sentence:
As a first step, we need to transform this sentence into a sequence of tokens (words), and this process is called tokenization.
Although we have tokenized our input sentence, we need to do one more step. We need to reformat that sequence of tokens by adding the [CLS] and [SEP] tokens before using it as an input to our BERT model.
Luckily, we only need one line of code to transform our input sentence into the sequence of tokens that BERT expects, as we have seen above. We will use BertTokenizer to do this, and you can see how we do this later on.
It is also important to note that the maximum sequence length that can be fed into the BERT model is 512 tokens. If a sequence has fewer than 512 tokens, we can use padding to fill the unused token slots with the [PAD] token. If a sequence is longer than 512 tokens, then we need to truncate it.
And that's all that BERT expects as input.
The BERT model will then output an embedding vector of size 768 for each of the tokens. We can use these vectors as an input for different kinds of NLP applications, whether it is text classification, next sentence prediction, Named-Entity-Recognition (NER), or question-answering.
For a text classification task, we focus our attention on the embedding vector output of the special [CLS] token. This means that we're going to use the embedding vector of size 768 from the [CLS] token as an input for our classifier, which will then output a vector whose size equals the number of classes in our classification task.
Below is the illustration of the input and output of the BERT model.
Text Classification with BERT
Now we're going to jump into the main topic: classifying text with BERT. In this post, we're going to use the BBC News Classification dataset. If you want to follow along, you can download the dataset on Kaggle.
This dataset is already in CSV format and it has 2126 different texts, each labeled under one of 5 categories: entertainment, sport, tech, business, or politics.
Let's take a look at what the dataset looks like.
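A quick look with pandas might be done roughly like this (the CSV filename below is a placeholder for wherever you saved the Kaggle download):

```python
import pandas as pd

# Placeholder path: point this at the CSV file downloaded from Kaggle
df = pd.read_csv('bbc-text.csv')
print(df.head())
```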
As you can see, the dataframe only has two columns: category, which will be our label, and text, which will be our input data for BERT.
Preprocessing Data
As you might already know from the previous section, we need to transform our text into the format that BERT expects by adding [CLS] and [SEP] tokens. We can do this easily with the BertTokenizer class from Hugging Face.
First, we need to install the Transformers library via pip:
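```bash
pip install transformers
```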
To make it easier for us to understand the output that we get from BertTokenizer, let’s use a short text as an example.
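A minimal sketch of this step is shown below; the example sentence is just a hypothetical stand-in, and the bert-base-cased checkpoint is discussed further down:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Hypothetical example sentence used only for illustration
example_text = 'I will watch Memento tonight'

bert_input = tokenizer(example_text, padding='max_length', max_length=10,
                       truncation=True, return_tensors="pt")

print(bert_input['input_ids'])
print(bert_input['token_type_ids'])
print(bert_input['attention_mask'])
```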
Here is the explanation of the BertTokenizer parameters above:
padding: to pad each sequence to the maximum length that you specify.
max_length: the maximum length of each sequence. In this example we use 10, but for our actual dataset we will use 512, which is the maximum sequence length allowed for BERT.
truncation: if True, then the tokens in each sequence that exceed the maximum length will be truncated.
return_tensors: the type of tensors that will be returned. Since we're using PyTorch, we use pt. If you use TensorFlow, you need to use tf.
The outputs that you see from the bert_input variable above are necessary for our BERT model later on. But what do those outputs mean?
The first row is input_ids, which is the id representation of each token. We can actually decode these input ids into the actual tokens as follows:
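(Sketch, continuing the tokenizer example above; the decoded output shown in the comment is approximate.)

```python
# Decode the first (and only) sequence back into tokens, including the special tokens
example_tokens = tokenizer.decode(bert_input.input_ids[0])
print(example_tokens)
# e.g. [CLS] I will watch Memento tonight [SEP] [PAD] [PAD]
```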
As you can see, BertTokenizer takes care of all of the necessary transformations of the input text so that it's ready to be used as an input for our BERT model. It adds the [CLS], [SEP], and [PAD] tokens automatically. Since we specified the maximum length to be 10, there are only two [PAD] tokens at the end.
The second row is token_type_ids, which is a binary mask that identifies which sequence a token belongs to. If we only have a single sequence, then all of the token type ids will be 0. For a text classification task, token_type_ids is an optional input for our BERT model.
The third row is attention_mask, which is a binary mask that identifies whether a token is a real word or just padding. If the token is [CLS], [SEP], or any real word, then the mask would be 1. Meanwhile, if the token is just padding or [PAD], then the mask would be 0.
As you might notice, we use a pre-trained BertTokenizer from the bert-base-cased model. This pre-trained tokenizer works well if the text in your dataset is in English.
If you have datasets in different languages, you might want to use bert-base-multilingual-cased. Specifically, if your dataset is in German, Dutch, Chinese, Japanese, or Finnish, you might want to use a tokenizer pre-trained specifically on these languages. You can check the name of the corresponding pre-trained tokenizer here.
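Swapping checkpoints is a one-line change, for example:

```python
# For non-English data, load the multilingual checkpoint instead
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
```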
To sum up, below is the illustration of what BertTokenizer does to our input sentence.
Dataset Class
Now that we know what kind of output we will get from BertTokenizer, let's build a Dataset class for our news dataset, which will serve as the class that generates our news data.
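A minimal sketch of such a Dataset class might look like the following; the label-to-id ordering and the 512 maximum length are assumptions consistent with the description that follows:

```python
import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Maps each category name in the dataframe to an integer label id (ordering is arbitrary)
labels = {
    'business': 0,
    'entertainment': 1,
    'sport': 2,
    'tech': 3,
    'politics': 4,
}

class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):
        # Convert the category column to label ids and tokenize every text up front
        self.labels = [labels[label] for label in df['category']]
        self.texts = [tokenizer(text, padding='max_length', max_length=512,
                                truncation=True, return_tensors="pt")
                      for text in df['text']]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Return a (tokenized text, label) pair for the given index
        return self.texts[idx], np.array(self.labels[idx])
```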
In the above implementation, we define a variable called labels, which is a dictionary that maps each category in the dataframe to the id representation of our label. Notice that we also call BertTokenizer in the __init__ function above to transform our input texts into the format that BERT expects.
After defining the dataset class, let's split our dataframe into training, validation, and test sets with a proportion of 80:10:10.
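One straightforward way to do this split is shown below (the random seed is arbitrary, and df is the dataframe loaded earlier):

```python
# Shuffle the dataframe, then slice it positionally into 80% train, 10% validation, 10% test
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
train_end, val_end = int(0.8 * len(df_shuffled)), int(0.9 * len(df_shuffled))

df_train = df_shuffled.iloc[:train_end]
df_val = df_shuffled.iloc[train_end:val_end]
df_test = df_shuffled.iloc[val_end:]

print(len(df_train), len(df_val), len(df_test))
```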
Model Building
So far, we have built a dataset class to generate our data. Now let's build the actual model using a pre-trained BERT base model, which has 12 layers of Transformer encoders.
If your dataset is not in English, it would be best to use the bert-base-multilingual-cased model. If your data is in German, Dutch, Chinese, Japanese, or Finnish, you can use the model pre-trained specifically on these languages. You can check the name of the corresponding pre-trained model here.
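A minimal sketch of such a classifier might look like the following; the dropout rate is an arbitrary choice, while the _ and pooled_output names match the description that follows:

```python
from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):

    def __init__(self, dropout=0.5):
        super(BertClassifier, self).__init__()
        # 12-layer BERT base encoder (swap in bert-base-multilingual-cased for non-English data)
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 5)  # 768-dim [CLS] embedding -> 5 news categories
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):
        # With return_dict=False, BERT returns (all token embeddings, pooled [CLS] embedding)
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)
        return final_layer
```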
As you can see from the code above, the BERT model outputs two variables:
The first variable, which we named _ in the code above, contains the embedding vectors of all of the tokens in the sequence.
The second variable, which we named pooled_output, contains the embedding vector of the [CLS] token. For a text classification task, it is enough to use this embedding as an input for our classifier.
We then pass the pooled_output variable into a linear layer with a ReLU activation function. At the end of the linear layer, we have a vector of size 5, each element of which corresponds to a category of our labels (sport, business, politics, entertainment, and tech).
Training Loop
Now it's time for us to train the model. The training loop will be a standard PyTorch training loop.
We train the model for 5 epochs with Adam as the optimizer and a learning rate of 1e-6. We also need to use categorical cross entropy as our loss function since we're dealing with multi-class classification.
It is recommended that you use a GPU to train the model, since the BERT base model contains 110 million parameters.
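The training code itself is not reproduced here, but a standard loop under the configuration above might look roughly like the sketch below; the batch size and the reuse of the Dataset class and the df_train/df_val split from the earlier sketches are assumptions:

```python
import torch
from torch import nn
from torch.optim import Adam

def train(model, train_data, val_data, learning_rate, epochs):
    # Wrap the dataframes in the Dataset class defined earlier
    train_dataloader = torch.utils.data.DataLoader(Dataset(train_data), batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(Dataset(val_data), batch_size=2)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    for epoch in range(epochs):
        total_loss_train, total_acc_train = 0.0, 0

        for train_input, train_label in train_dataloader:
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)
            batch_loss = criterion(output, train_label.long())
            total_loss_train += batch_loss.item()
            total_acc_train += (output.argmax(dim=1) == train_label).sum().item()

            model.zero_grad()
            batch_loss.backward()
            optimizer.step()

        # Validation pass: no gradient updates
        total_loss_val, total_acc_val = 0.0, 0
        with torch.no_grad():
            for val_input, val_label in val_dataloader:
                val_label = val_label.to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)
                total_loss_val += criterion(output, val_label.long()).item()
                total_acc_val += (output.argmax(dim=1) == val_label).sum().item()

        print(f'Epoch {epoch + 1} '
              f'| Train Loss: {total_loss_train / len(train_data):.3f} '
              f'| Train Accuracy: {total_acc_train / len(train_data):.3f} '
              f'| Val Loss: {total_loss_val / len(val_data):.3f} '
              f'| Val Accuracy: {total_acc_val / len(val_data):.3f}')

EPOCHS = 5
LR = 1e-6
model = BertClassifier()
train(model, df_train, df_val, LR, EPOCHS)
```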
After 5 epochs with the above configuration, you'll get the following output as an example:
Obviously you might not get loss and accuracy values similar to the screenshot above due to the randomness of the training process. If you haven't got a good result after 5 epochs, try increasing the number of epochs to, say, 10, or adjusting the learning rate.
Evaluate Model on Test Data
Now that we have trained the model, we can use the test data to evaluate the model’s performance on unseen data. Below is the function to evaluate the performance of the model on the test set.
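The sketch below follows the same pattern as the training loop above, again assuming the Dataset class and the df_test split from the earlier sketches:

```python
import torch

def evaluate(model, test_data):
    test_dataloader = torch.utils.data.DataLoader(Dataset(test_data), batch_size=2)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    total_acc_test = 0
    with torch.no_grad():
        for test_input, test_label in test_dataloader:
            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)
            total_acc_test += (output.argmax(dim=1) == test_label).sum().item()

    print(f'Test Accuracy: {total_acc_test / len(test_data):.3f}')

evaluate(model, df_test)
```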
After running the code above, I got an accuracy of 0.994 on the test data. The accuracy that you get will obviously differ slightly from mine due to the randomness of the training process.
Conclusion
Now you know how we can leverage a pre-trained BERT model from Hugging Face for a text classification task. I hope this post helps you get started with BERT.
One thing to remember is that we can use the embedding vectors from BERT not only for a sentence or text classification task, but also for more advanced NLP applications such as question answering, next sentence prediction, or Named-Entity-Recognition (NER) tasks.
You can find all of the code snippets demonstrated in this post in this notebook.