Customer Churn Prediction with Text and Interpretability

Zhao Qianni, Wuhan Institute of Technology

Date: 2022-11-18 | Language pair: English to Chinese | Category: Artificial Intelligence | Word count: 2302

Predicting if and understanding why customers want to leave.
By Daniel Herkert, Tyler Mullenbach
The code repository accompanying this blog post can be found here.
Customer churn, the loss of current customers, is a problem faced by a wide range of companies. When trying to retain customers, it is in a company’s best interest to focus its efforts on customers who are more likely to leave, but companies need a way to detect customers who are likely to leave before they have decided to leave. Users prone to churn often leave clues to their disposition in user behavior and customer support chat logs, which can be detected and understood using Natural Language Processing (NLP) tools.
Here, we demonstrate how to build a churn prediction model that leverages both text and structured data (numerical and categorical), which we call a bi-modal model architecture. We use Amazon SageMaker to prepare, build, and train the model. Detecting customers who are likely to churn is only part of the battle; finding the root cause is an essential part of actually solving the issue. Since we are interested not only in the likelihood of a customer churning but also in the driving factors, we complement the prediction model with an analysis of feature importance for both text and non-text inputs. The code for this post can be found here.
In this solution, we focus on Amazon SageMaker, which is used to prepare the data, train the churn prediction model, and evaluate and interpret the trained model. We use Amazon SageMaker to store the training data and model artifacts, and Amazon CloudWatch to log the data preparation and model training outputs (Fig. 1).
State-of-the-art natural language models are harder to interpret than simpler models like linear regression. These interpretability issues can impede business adoption despite the models’ top-of-the-line performance. In this post, we demonstrate some methods for extracting understanding from NLP models. We use the BERT sentence encoder [2][3] for processing the text inputs, and we provide a way to attribute the model predictions to the input features. While there are different approaches to interpreting language models, we choose an ablation analysis of a subset of relevant keywords, which scales easily to the entire dataset and hence provides us with a global interpretation of the predictions of the language model.
Exploring and Preparing Data
To mimic a churn dataset with both structured and text data, we use the Kaggle Customer Churn Prediction 2020 dataset for the structured data and combine it with a synthetic text dataset created using GPT-2 [1]. The dataset comprises 21 columns with features including categorical ones (State, International Plan, VoiceMail Plan), numerical ones (Account Length, Area Code, etc.), and one text column holding the chat logs between customer and agent generated with GPT-2.
The following image shows an excerpt of the data.
To prepare the data for modeling, we use one-hot encoding to transform the categorical feature values into numeric form and impute missing numerical feature values with their corresponding mean. Since this post’s focus is on predicting with and interpreting language models, we won’t spend more time on exploring or engineering the categorical and numerical features. Instead, we will focus on the customer-agent interactions.
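As a rough illustration of these two preparation steps, the sketch below one-hot encodes a categorical column and mean-imputes a numerical one using only the Python standard library. The function names and the toy values are assumptions made for this example; the post itself does not show its preprocessing code.

```python
from statistics import mean


def one_hot(values):
    """Return the sorted category list and one binary column per category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones.

    Also returns a 0/1 indicator column marking which entries were imputed.
    """
    fill = mean(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    indicator = [1.0 if v is None else 0.0 for v in values]
    return filled, indicator


# Toy rows (illustrative values, not taken from the dataset).
categories, plan_columns = one_hot(["yes", "no", "yes"])
minutes, minutes_missing = impute_mean([120.0, None, 180.0])
```

Each category becomes its own binary column, and the indicator column records where a value had to be filled in.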
As previously mentioned, the chat logs were generated with GPT-2 using a sample set of manually created customer-agent conversations. Here is an excerpt from a customer-agent conversation generated by GPT-2.
While the GPT-2 generated conversations have less breadth than actual conversations and (as seen above) sometimes fail to make perfect sense, we believe that, in the absence of a public customer-agent dataset, this generated data is a reasonable way to obtain a large customer-agent interaction dataset concerning churn.
We prepare the textual features on Amazon SageMaker by transforming each chat log into a vector representation using a pre-trained Sentence-BERT encoder (SBERT) from the Hugging Face models repository [4]. The Hugging Face repository provides open-source, pretrained natural language models that can be used, as in our case, to encode text without any further training of the model. SBERT is a modification of the pre-trained BERT network that uses the following network architecture to derive semantically meaningful sentence embeddings.
A pair of sentences is encoded using BERT, each independently from the other, before applying a pooling operation to generate a fixed size sentence embedding per input sentence. As part of the bifold structure, BERT is fine-tuned by updating the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine-similarity.
SBERT is trained using a combination of objectives including classification and regression. For the objective of classifying the relation between two sentences (Fig. 2a), the two sentence embeddings are concatenated with their element-wise difference and multiplied by trainable weights before being passed into a classification layer. For the regression objective (Fig. 2b), the cosine similarity between the two sentence embeddings is calculated.
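To make the pooling step and the two objectives more concrete, here is a minimal sketch on toy vectors. It assumes mean pooling (the text above only says "a pooling operation") and builds the classification input as the concatenation of u, v and |u - v|, as described in the Sentence-BERT paper [3]. The function names are made up for this example, and no trainable weights or BERT model are involved.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into one fixed-size sentence embedding."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (regression objective input)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classification_features(u, v):
    """Concatenate u, v and |u - v|, the input to the classification layer."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


# Toy 2-dimensional token vectors standing in for BERT outputs.
u = mean_pool([[1.0, 0.0], [3.0, 0.0]])
v = mean_pool([[0.0, 1.0], [0.0, 5.0]])
similarity = cosine_similarity(u, v)
features = classification_features(u, v)
```

In the real model the vectors would come from BERT's token outputs, and the concatenated features would be multiplied by a trained weight matrix before a softmax.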
The benefit of using SBERT over other embedding techniques (such as InferSent, Universal Sentence Encoder) is that it is more efficient and achieves better results in most semantic similarity tasks [3].
Since BERT is built to encode word pieces, there is little to no preprocessing required for our text data. We can directly transform each chat log into a 768-dimensional semantically meaningful embedding vector.
Applying the aforementioned preprocessing steps to all categorical, numerical, and textual features results in the data now being encoded in numeric form so it can be processed by our neural network.
Creating a bi-modal ML model
The network architecture consists of three fully-connected layers and a Sigmoid activation function for binary classification (churn/no churn). First, the encoded categorical/numerical data is fed into a fully-connected layer before it is concatenated with the encoded textual data. The concatenated data is then fed into a second and third fully-connected layer before applying the Sigmoid activation for binary classification. The first fully-connected layer serves to reduce the dimensionality of the sparse categorical/numerical input data (see more details below). The second and third fully-connected layers serve as decoders of the encoded data in order to classify the inputs into churn/no churn. We call the architecture bi-modal because it takes both structured and unstructured data, i.e., categorical/numerical and textual data, as inputs in order to generate predictions.
The following diagram (Fig. 3) illustrates the model’s bi-modal architecture.
As discussed in the previous section, the categorical data has been transformed into numerical values using one-hot encoding, where each category of a feature is represented by binary values in a separate column. This leads to sparsity in the resulting encoded data (many zeros) when a feature has many categories, as with the feature ‘State’, which has 51 categories (including the District of Columbia). Additionally, the indicator columns that are created when imputing missing values of numerical features also contribute to the sparsity of the encoded data. The first fully-connected layer serves to reduce this sparsity, resulting in more efficient model training.
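The forward pass of such a bi-modal network could be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the layer widths (`n_reduced`, `n_hidden`), the ReLU activations between layers, the random initialization, and the class name `BiModalNet` are all invented for this example. Only the three fully-connected layers, the concatenation point, the 768-dimensional text embedding, and the final sigmoid come from the description above.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class BiModalNet:
    """Forward pass only; layer sizes and ReLU are illustrative assumptions."""

    def __init__(self, n_structured, n_text=768, n_reduced=16, n_hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_structured, n_reduced))
        self.b1 = np.zeros(n_reduced)
        self.w2 = rng.normal(0.0, 0.1, (n_reduced + n_text, n_hidden))
        self.b2 = np.zeros(n_hidden)
        self.w3 = rng.normal(0.0, 0.1, (n_hidden, 1))
        self.b3 = np.zeros(1)

    def forward(self, x_structured, x_text):
        # Layer 1 compresses the sparse one-hot/numerical features.
        reduced = relu(x_structured @ self.w1 + self.b1)
        # Concatenate with the sentence embedding of each chat log.
        joint = np.concatenate([reduced, x_text], axis=1)
        # Layers 2 and 3 map the joint representation to a churn probability.
        hidden = relu(joint @ self.w2 + self.b2)
        return sigmoid(hidden @ self.w3 + self.b3).ravel()


net = BiModalNet(n_structured=70)
rng = np.random.default_rng(1)
probs = net.forward(rng.random((4, 70)), rng.normal(size=(4, 768)))
```

The structured features are compressed before the concatenation, so the dense text embedding is not mixed with a wide, mostly-zero block of one-hot columns.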
We train the model on Amazon SageMaker using the SGD (stochastic gradient descent) optimizer and the BCE (binary cross entropy) loss function, and achieve a performance of 0.98 AUC on the test dataset after around 8–10 epochs. For comparison, training a model with just categorical and numerical data achieves a performance of 0.93 AUC, which is 0.05 (about 5%) lower than when the text data is used (Fig. 4). We would expect a similar improvement with real data, since the model with text data has more information with which to make a decision.
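For readers less familiar with the two quantities mentioned here, the following sketch computes the binary cross entropy loss and the AUC by hand on toy values. The function names and the example labels and scores are assumptions for illustration; the pairwise AUC shown is the textbook definition, which is quadratic in the number of samples and therefore not how one would compute it on a large dataset.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean BCE loss; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def roc_auc(y_true, y_score):
    """AUC as the share of (churn, no-churn) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = churn) and predicted churn probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
loss = binary_cross_entropy(labels, scores)
```

An AUC of 0.98 therefore means that a randomly chosen churner receives a higher score than a randomly chosen non-churner about 98% of the time.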
Important Features
However, in order to prevent customers from churning, it is not sufficient to know how likely the churn event is. Additionally, we need to find out what the driving factors are so that preventive actions can be taken.
Categorical and numerical features
We will train an XGBoost model [5] on the categorical and numerical data using the predicted labels of the trained neural network as target values to find out which categorical/numerical features contribute most to our model’s predictions. The built-in method for relative feature importance allows us to get an overview of the most important features which you can see in Figure 5.
From the above illustration, we can see that the top three features for determining whether a customer will churn include number of vmail messages, number of customer service calls, as well as not having an international plan (x2_no). The features x0_xx indicate the state.
Textual features
Now let’s focus on the customer-agent conversations and try to attribute the model’s predictions to its textual input features. While different approaches to interpreting deep learning models for natural language exist, their primary focus seems to be on explaining each prediction individually. For example, the Captum library [6] implements techniques based on gradients or SHAP values for the assessment of each token/word in the text sequence. While these methods provide interpretability on a local level, they don’t easily scale to the entire dataset, which limits their use for a global interpretation of the trained model’s predictions.
Our approach to interpreting the trained model that uses SBERT-encoded textual features works well for both local interpretability (single chat logs) and global interpretability (entire dataset), and it scales efficiently. It consists of the following steps, which we perform on Amazon SageMaker. We first reduce the text to a subset of keywords using part-of-speech (POS) tagging and semantic similarity matching to churn. Then we perform an ablation analysis to determine the marginal contribution of each keyword to the model prediction. Finally, we combine semantic similarity, marginal contribution, and keyword frequency into a single score, which allows us to rank the keywords and surface the ones most relevant to churn.
The flow chart below (Fig. 6) illustrates our approach to extracting the keywords:
The approach starts with the raw text conversation. We reduce the size of our text body by focusing on a subset of candidate keywords, which we obtain by applying several token filters, as you can see in Figure 6, step 1. We apply spaCy’s POS tagging and keep only adjectives, verbs, and nouns [7]. Then we remove stop words, lowercase the tokens, and lemmatize them.
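The token filters of step 1 might look roughly like the sketch below. To keep it self-contained, it assumes the tagging has already happened upstream and that each token arrives as a `(text, pos, lemma)` triple using the universal POS tag names (`ADJ`, `VERB`, `NOUN`). The function name, the triple format, and the tiny stop-word list are assumptions for this example; in practice the library's own stop-word list would be used.

```python
KEEP_POS = {"ADJ", "VERB", "NOUN"}
# Tiny illustrative stop-word list; a real run would use the library's list.
STOP_WORDS = {"be", "have", "do", "the", "a", "i"}


def candidate_keywords(tagged_tokens):
    """Filter (text, pos, lemma) triples down to lowercased keyword lemmas."""
    keywords = []
    for text, pos, lemma in tagged_tokens:
        if pos not in KEEP_POS:
            continue  # keep only adjectives, verbs, and nouns
        token = lemma.lower()
        if token in STOP_WORDS:
            continue  # drop stop words
        keywords.append(token)
    return keywords


# Hand-tagged toy tokens (not produced by an actual tagger run).
tagged = [
    ("The", "DET", "the"),
    ("plan", "NOUN", "plan"),
    ("has", "VERB", "have"),
    ("Frustrating", "ADJ", "frustrating"),
    ("problems", "NOUN", "problem"),
]
candidates = candidate_keywords(tagged)
```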
Next, we rank the candidate keywords in order of semantic similarity to the class outcomes; here we will focus on churn (Fig. 6, step 2). Specifically, we encode each keyword using pre-trained SBERT and calculate each keyword’s cosine similarity to the average embedding of all SBERT-encoded chat logs that result in churn. This allows us to rank the keywords by similarity to churn, which further reduces the candidates to a subset of keywords that are relevant to churn. From the above illustration you can see that ranking keywords by semantic similarity already gives us important insights into why customers may be churning. Many of the keywords, including ‘cancel’, ‘frustrated’, or ‘unhappy’, indicate a negative sentiment.
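The ranking in step 2 can be sketched as follows, with toy 2-dimensional vectors standing in for the 768-dimensional SBERT embeddings. The function names and all vector values are assumptions chosen for illustration; in the actual pipeline both the keywords and the chat logs would be encoded by the same pre-trained SBERT model.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rank_by_churn_similarity(keyword_vectors, churn_log_vectors):
    """Sort keywords by cosine similarity to the average churn embedding."""
    centroid = mean_vector(churn_log_vectors)
    scored = {kw: cosine_similarity(vec, centroid)
              for kw, vec in keyword_vectors.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Toy 2-d vectors standing in for 768-d SBERT embeddings.
churn_logs = [[1.0, 0.0], [1.0, 0.2]]
keywords = {"cancel": [0.9, 0.1], "weather": [0.0, 1.0], "unhappy": [0.7, 0.7]}
ranking = rank_by_churn_similarity(keywords, churn_logs)
```

Because every keyword is compared against a single averaged vector, the cost grows linearly with the number of candidate keywords.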
In addition to semantic similarity to churn, we would like to further quantify the impact of keywords by measuring their marginal contribution to the prediction of the model (Fig. 6, step 3). We embed the chat logs with and without the relevant keywords (where they occur) and measure the average prediction difference. For example, the keyword ‘cancel’ occurred 171 times across all churn chat logs, and removing it reduces the model’s churn prediction by 4.18%, on average, across the 171 instances.
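A minimal sketch of this ablation step is shown below. It assumes a hypothetical `predict` callable that maps a chat log to a churn probability; in the post that would mean embedding the text with SBERT and running the trained network, whereas here a toy stand-in is used so the example runs on its own. Matching is done on exact whole words, which simplifies the lemma-based keywords described above.

```python
import re


def remove_keyword(text, keyword):
    """Delete whole-word occurrences of keyword and tidy the whitespace."""
    pattern = rf"\b{re.escape(keyword)}\b"
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return " ".join(stripped.split())


def marginal_contribution(keyword, chat_logs, predict):
    """Average drop in predicted churn probability when keyword is removed.

    Only logs that contain the keyword are considered. Returns the average
    drop and the number of logs it was computed over.
    """
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    diffs = [predict(log) - predict(remove_keyword(log, keyword))
             for log in chat_logs if pattern.search(log)]
    if not diffs:
        return 0.0, 0
    return sum(diffs) / len(diffs), len(diffs)


def toy_predict(text):
    # Stand-in for the real model: scores 0.9 if "cancel" appears, else 0.5.
    return 0.9 if "cancel" in text.lower() else 0.5


logs = ["I want to cancel my plan", "Please CANCEL it now", "How do I add a line?"]
drop, count = marginal_contribution("cancel", logs, toy_predict)
```

Running the same loop over every candidate keyword gives one contribution per keyword for the whole dataset, which is what makes the interpretation global rather than per-prediction.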
Finally, we merge all three scores, semantic similarity, marginal contribution, and keyword frequency, into one joint metric to achieve our final ranking of important keywords. The joint metric is calculated by setting the individual metrics to the same scale (via range-bound Min/Max Scaler) and calculating a weighted average.
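The joint metric could be computed along these lines. The post does not state the weights it uses, so the equal weights below are an assumption, as are the function names and the toy metric values.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def joint_scores(similarity, contribution, frequency, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three min-max scaled metrics, per keyword."""
    scaled = [min_max_scale(m) for m in (similarity, contribution, frequency)]
    total_weight = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total_weight
            for row in zip(*scaled)]


# Toy metric values per keyword (illustrative only).
keywords = ["cancel", "spam", "plan"]
scores = joint_scores(
    similarity=[3.0, 2.0, 1.0],
    contribution=[4.0, 2.0, 0.0],
    frequency=[20.0, 30.0, 10.0],
)
ranking = [kw for _, kw in sorted(zip(scores, keywords), reverse=True)]
```

Scaling first matters because the raw metrics live on very different ranges (a cosine similarity, a probability difference, and a count), so an unscaled average would be dominated by frequency.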
Results
The following table (Fig. 7) shows the 20 most important keywords for predicting churn. They include ‘voicemail’, ‘cancel’, ‘spam’, ‘turnover’, ‘frustrated’, and ‘unhappy’ which are indicative of poor customer satisfaction or other issues.
To provide more context around the keywords and to help better explain churn events, we added functionality to query the phrases in which the keywords were used in the customer-agent conversation. For example, the keyword ‘spam’ was used when customers complained about being flooded “with emails and phone calls, spamming me with thousands of phony invoices.” Original chat logs with mention of spam:
“I just got some spam messages last night, and today it’s been getting a lot of texts that I ‘don’t have my SIM card’ and I need my SIM card.”
“TelCom started to flood me with emails and phone calls, spamming me with thousands of phony invoices.”
“Basically, I’m getting a lot of spam calls every day from a guy named Michael who’s calling from a really weird number.”
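A keyword-in-context query like the one that produced these excerpts could be sketched as below. The function name, the naive sentence splitting, and the prefix match (so that ‘spam’ also finds ‘spamming’) are assumptions for this example, and the toy logs are only loosely modeled on the quotes above.

```python
import re


def phrases_with_keyword(chat_logs, keyword):
    """Return the sentences from chat_logs that mention keyword (case-insensitive)."""
    # Prefix match at a word boundary, so inflected forms are found as well.
    pattern = re.compile(rf"\b{re.escape(keyword)}", re.IGNORECASE)
    hits = []
    for log in chat_logs:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", log.strip()):
            if pattern.search(sentence):
                hits.append(sentence)
    return hits


# Toy chat logs (illustrative only).
logs = [
    "My bill is wrong. TelCom started spamming me with thousands of phony invoices.",
    "I would like to upgrade my plan.",
]
hits = phrases_with_keyword(logs, "spam")
```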
Understanding the keywords and their context will allow us to prescribe actions to address customer churn. For example, some customers seem to be receiving a high volume of spam calls and are therefore deciding to leave the service. We could devise a plan to reduce the number of spam calls, which could in turn lower customer churn.
Alternatively, we could gain more insight by categorizing the different keywords or churn phrases into distinct topics and formulating actions based on those topics. However, given the rather narrow conversation topics in our synthetic dataset, we found that this didn’t help in our case.
Conclusion
In this post, we showed how incorporating text data based on customer-agent interactions with traditional customer account data can improve performance of predicting customer churn. Furthermore, we introduced an approach that enables us to learn insights from the text, in particular, which keywords are most indicative of customer churn. Given our focus on global interpretability of the language model, our approach efficiently scales to the entire dataset which means we are able to understand the main drivers of churn across all customer-agent conversations. All data transformation steps, as well as model training, evaluation, and interpretation steps were performed on Amazon SageMaker.
References:
[1] Language Models are Unsupervised Multitask Learners, Radford, Wu, et al., 2019
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin, Chang, et al., 2019
[3] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers, Gurevych, 2019
[4] Hugging Face Sentence Transformers
[5] XGBoost model
[6] Captum library for model interpretability
[7] spaCy library for text processing
