Logistic Regression from First Principles in Python
Gain a better understanding of your model’s underlying assumptions
Starting with nothing but a data set and three assumptions, we derive and implement a basic logistic regression in Python. The goal is to better understand the underlying assumptions of the model.
Define the Problem
We use data from the Vega movie dataset. The goal is to predict whether a movie will be successful on Rotten Tomatoes given its net profitability. In real life, this is a little backwards, since a movie usually has reviews before its profitability is known. Let’s imagine we work at Rotten Tomatoes and are trying to predict what will happen if we backfill the site with older movies that have known box office performances but no reviews.
First, we install the data dependencies.
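The exact package list here is an assumption, matching the imports used later in the post:

    pip install vega_datasets pandas altair numpy scikit-learn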
The Rotten Tomatoes site uses scores over 75 for their Certified Fresh rating, so let’s stick with that number for what we call “good” movies, or good_rt. For simplicity, we only keep “G” rated movies.
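A sketch of that preprocessing, assuming the vega_datasets column names (MPAA_Rating, Rotten_Tomatoes_Rating, Worldwide_Gross, Production_Budget):

    from vega_datasets import data

    # Load the movies data; keep G-rated movies with known ratings and financials.
    df = data.movies()
    df = df[df["MPAA_Rating"] == "G"].dropna(
        subset=["Rotten_Tomatoes_Rating", "Worldwide_Gross", "Production_Budget"]
    )
    df["profit"] = df["Worldwide_Gross"] - df["Production_Budget"]
    df["good_rt"] = (df["Rotten_Tomatoes_Rating"] > 75).astype(int)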
The dataframe now looks like:
We plot and inspect the data using Altair.
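One way to draw that chart; the encodings and the Title column name are our own assumptions:

    import altair as alt

    alt.Chart(df).mark_circle(size=60).encode(
        x="profit",
        y="good_rt",
        tooltip=["Title", "profit", "good_rt"],
    )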
Lower profit movies seem less likely to be good. The goal now is to make a function that takes in a profit and returns a probability of being good. If that probability is over 50% for some movie, we say the function predicts that movie is “good”.
Eyeballing the above graph, we should expect the following characteristics from our function:
For negative profits, the function should return a number less than 50%.
For very high profits, the function should return a number over 50%.
For lower (positive) profits, the function should return about 50%.
To improve on these guesses, we want to use existing data to find an optimum “profit” cutoff.
Get Started with Predictions
The most drop-dead simple starting approach is a linear function that creates the probability by multiplying the profit, x, by a constant weight, m, and adding a constant, b.
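In symbols, with prob the probability returned:

    \mathrm{prob} = m x + b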
We could pick m and b to get reasonable responses for net profits between 0 and a billion dollars. However, this breaks because dollars are unbounded. A successful enough movie would push the probability m*x+b greater than 1.
Assumption 1: “Odds” are better than probability for dealing with numbers greater than 1
The ratio of something happening versus it not happening is called the odds, and is natural terminology at any race track or sports betting arena. If a coin is twice as likely to land on heads than tails, we say the odds are 2 to 1. It is easier to say the odds on a horse in a race are 9 to 2 than to say there’s an 81.818% probability of the horse losing.
Since the net profit of a movie can be arbitrarily high, just like the odds of a bet, we assume it is appropriate to relate the profit with the odds of a good Rotten Tomatoes score.
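Written out, with p the probability of a good score, the assumed relation is:

    \frac{p}{1 - p} = m x + b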
This way, any arbitrarily high profit, x, can be matched with a corresponding probability close to one. Plug 0.99999 in for probability to see why.
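With p = 0.99999, the odds side becomes 0.99999 / 0.00001 = 99,999, so even an enormous value of m*x + b still corresponds to a probability just shy of one.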
There is a second problem. The net profit of a movie, like 102 Dalmatians, can be negative if the movie cost more money to film than it made at the box office. We need a way to make sure the right side of the equation is always positive no matter how negative x is.
Assumption 2: Exponentiating a list of ordered numbers is the simplest way to make them all positive without losing order.
Try making every number in this sorted list positive while keeping it sorted:
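For concreteness, take five hypothetical profits, in millions:

    [-2, -1, 0, 1, 2]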
One way to make numbers positive is taking the absolute value of each item:
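Applied to the hypothetical list, that gives:

    [2, 1, 0, 1, 2]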
The problem is the order changed. Say these numbers represent movie profits in millions. The movie that made a $2M profit and the movie that lost $2M now share the same value and are indistinguishable, an undesirable outcome for sure!
Instead, take e^x for each movie profit x.
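For the hypothetical list, that gives (rounded to two decimals):

    [0.14, 0.37, 1.00, 2.72, 7.39]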
The five movies are still in the same order but we have successfully made all the values positive.
At this point you may be asking: why exponentiate instead of using any other function, like 1+tanh(x), to make negative numbers positive? That would work fine! We are trying to solve the profit-to-movie-rating-probability problem in the simplest way possible, and have to make these numbers positive somehow. If we assume that e^x is the best choice, we can go on to derive the logistic regression. Another assumption may lead to a valid, but different, result. That’s why exponentiation is an assumption of the logistic regression.
Our relation between profit and probability now looks like:
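Exponentiating the linear part keeps the right side positive for any x:

    \frac{p}{1 - p} = e^{mx + b}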
Finally, we’re ready to do math to create the function taking in a net profit and returning a probability!
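Starting from the relation above, multiply both sides by (1 - p), collect the p terms, and solve:

    p = \frac{e^{mx+b}}{1 + e^{mx+b}} = \frac{1}{1 + e^{-(mx+b)}}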
In Python:
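A direct translation (the function and argument names are our own):

    import numpy as np

    def prob(x, m, b):
        # Probability that a movie with net profit x is "good" on Rotten Tomatoes.
        return 1 / (1 + np.exp(-(m * x + b)))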
The function is done! We haven’t said so until now, but we have derived the logistic regression using only two assumptions! The only thing remaining is to plug in the optimal values for m and b, which is called “fitting” the function. But wait… what are the best values for m and b? That question turns out to be so important that we answer it as a continuation of the derivation.
Plug in Data to Find Optimal Values
To find the best values for the logistic regression, we start with random values and see what happens when m and b are tweaked. But first, we define “what happens” by considering what it means to correctly guess movie goodness.
When Movie 1, with a high Rotten Tomatoes score, is evaluated by our function, the probability the prediction is right is prob(x1). When evaluating the sequel Movie 2, with a low Rotten Tomatoes score, the probability is 1-prob(x2). The probability for correctly predicting both together is prob(x1)*(1-prob(x2)).
Generally, the probability of getting several predictions right is the product of each prediction’s probability. For example, take two movies from the data:
The Adventures of Elmo in Grouchland, which lost $5M but has a high Rotten Tomatoes rating.
Muppets From Space, which lost $7M and has a low Rotten Tomatoes rating.
Using arbitrary values for m and b, the probability of predicting these two movies correctly is:
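A sketch of that computation; the starting values here are hypothetical placeholders (chosen only so the combined probability lands near the figure quoted next):

    m, b = 1e-6, 0.6  # arbitrary, hypothetical starting guesses

    elmo, muppets = -5e6, -7e6  # net profits of the two movies
    combined = prob(elmo, m, b) * (1 - prob(muppets, m, b))
    print(combined)  # ~0.012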
A combined probability of 1.2%. Let’s see if tweaking m improves this:
27%! We’re predicting better! Let’s do it again and see what happens:
Hmmm, down to 20%. Maybe we could have moved b instead?
A slight improvement, but what if we… you get the idea. We could keep nudging m and b in either direction until every change makes the probability lower, meaning we arrived at the optimal values. This is tedious and will be even harder with more movies from the data. Worse yet, if there were more than two parameters, such as by taking an additional movie feature into account, we would have to tweak all of them this way.
Instead, let’s create one giant probability with all the data and then find a better way to optimize it. To do this, first combine the low and high Rotten Tomatoes outcomes into a single probability statement:
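With rt standing for the 0/1 label (1 for a good Rotten Tomatoes score, 0 otherwise), the combination works in three steps:

    P(\text{correct} \mid rt = 1) = \mathrm{prob}(x)
    P(\text{correct} \mid rt = 0) = 1 - \mathrm{prob}(x)
    P(\text{correct}) = \mathrm{prob}(x)^{rt} \, \big(1 - \mathrm{prob}(x)\big)^{1 - rt}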
To see why the third equation above works, notice that when rt = 1 the exponent on the right term makes it equal 1, so it goes away, and when rt = 0 the left term goes away.
Now we can calculate the probability for good and bad movies at the same time. Note we are still using arbitrarily chosen values for m and b.
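A sketch of that loop, reusing the column names assumed earlier:

    total_prob = 1.0
    for x, rt in zip(df["profit"], df["good_rt"]):
        p = prob(x, m, b)
        total_prob *= p**rt * (1 - p)**(1 - rt)
    print(total_prob)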
We now have the probability of getting every data point right at once. It’s time to find the best values for m and b in a smarter way than manual tweaking.
Maximize the Probability
The total probability in the for loop above is represented this way in math:
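Here y_i is the 0/1 label and x_i the profit of movie i:

    \text{Probability} = \prod_{i=1}^{n} \mathrm{prob}(x_i)^{y_i} \, \big(1 - \mathrm{prob}(x_i)\big)^{1 - y_i}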
The prob(x) function, the single movie probability, is given by:
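    \mathrm{prob}(x) = \frac{1}{1 + e^{-(mx + b)}}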
To get the optimal value for m (or b) we take the derivative of the total probability with respect to m (or b), set the equation to zero and solve, right? Unfortunately, there are two problems:
Taking the derivative of a whole bunch of things multiplied together suuuccckkks.
Even once that giant derivative is taken, it turns out we can’t find a closed-form solution for m (or b) so we can’t get the optimal value this way.
But all is not lost. While we can’t analytically solve this, we can still find the optimal values for m and b. We could try to take the (terrible) derivative of the probability equation, then take a small step in that direction and repeat until we are at the peak!
This works. But remember, that derivative of a bunch of things multiplied is hard. In practice, taking one more step first makes things much easier.
Assumption 3: To maximize a function, first take the log of that function if it makes the math easier.
Justification: for any positive function with peaks and valleys, taking the log everywhere lowers the heights of the peaks and valleys, but it doesn’t remove any of them or shift them left or right. So to find the value of m that maximizes the probability function, maximize the log of the function instead; the peak sits at the same m.
Again, you might be thinking: why the log? Why not the square or square root or cosh or any other function? And again you’d be right. We are free to pick any reasonable function that doesn’t affect where the probability peak is. We are assuming the log makes our work easier than the other options.
Here’s why. The log of a product of terms is equal to the sum of the terms’ logs:
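    \ln(a \cdot b \cdot c) = \ln(a) + \ln(b) + \ln(c)

Applied to the big product, the log turns the likelihood into a sum:

    \ln[\text{Probability}] = \sum_{i=1}^{n} \Big[\, y_i \ln \mathrm{prob}(x_i) + (1 - y_i) \ln\big(1 - \mathrm{prob}(x_i)\big) \Big]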
For a single movie i with profit of x and Rotten Tomatoes classification of y:
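    \ln[\text{Probability}_i] = y \ln \mathrm{prob}(x) + (1 - y) \ln\big(1 - \mathrm{prob}(x)\big)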
[It also turns out that sums of logs are easier for computers to get right than products of many tiny probabilities.]
Now take a deep breath, take the derivative with respect to m, plugging in the prob(x) function where necessary:
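Writing p for prob(x) and using the chain-rule fact that \partial p / \partial m = x \, p (1 - p):

    \frac{\partial}{\partial m} \Big[ y \ln p + (1 - y) \ln(1 - p) \Big]
        = \frac{y}{p} \, x p (1 - p) - \frac{1 - y}{1 - p} \, x p (1 - p)
        = x \big[ y (1 - p) - (1 - y) p \big]
        = x (y - p)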
In a seemingly miraculous cancellation of terms, the derivative (the direction to tweak m) is given by x*(y-prob(x)), the net profit of a movie times the Rotten Tomatoes goodness minus the probability the function is correct.
Redoing the calculation with the derivative taken with respect to b, the only difference in the result is that the leading x becomes a 1, giving y - prob(x).
The final step after calculating the derivatives for all the data points is to take some tiny fraction of them to add back to the parameters m and b. This will slightly improve our total probability. Repeating the process, the probability will eventually stop getting better and we have optimal values of m and b.
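A minimal sketch of the update loop; the learning rate and step count are hypothetical and may need tuning (the rate must be tiny because the m-gradient is scaled by profits in the hundreds of millions):

    profits = df["profit"].to_numpy()
    labels = df["good_rt"].to_numpy()

    m, b = 0.0, 0.0
    learning_rate = 1e-19  # hypothetical
    for _ in range(100_000):
        p = prob(profits, m, b)  # vectorized over all movies
        m += learning_rate * np.sum(profits * (labels - p))
        b += learning_rate * np.sum(labels - p)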
Running the loop to convergence ends with m = 2.815e-09 and b = 0.0001839. To sanity-check the result, run the same regression with Scikit Learn and compare:
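A hedged version of that check; the very large C effectively disables scikit-learn’s default regularization so the two fits are comparable:

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(C=1e9)
    clf.fit(df[["profit"]], df["good_rt"])
    print(clf.coef_, clf.intercept_)  # m and b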
giving m = 2.813e-09, and b = -6.792e-17.
Summary
We set out to predict the probability that a movie will be successful on Rotten Tomatoes given its net profit, which we now have:
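Substituting the fitted values (at this scale, b is effectively zero):

    \mathrm{prob}(x) = \frac{1}{1 + e^{-\,2.8 \times 10^{-9} \, x}}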
We made three reasonable assumptions, justifying them along the way. We are free to stray from them, for example by picking alternatives to the log, but we will lose the simple derivative result x*(y-prob(x)) and we won’t arrive at the logistic regression.
Any serious application of the logistic regression should rely on packages like Scikit-learn instead of our home-rolled variety. Not only is it easier to read, write and debug, but it’s also packed with features like coefficient optimization, support for multiple variables, support for pipelines, and so much more.
Jargon and Next Steps
We derived the logistic regression from first principles. However, it is helpful to associate the concepts from above with standard terminology.
Logistic Function: f(x) = 1/[1 + exp(-x)]
Logit Function: f(x) = ln[x/(1-x)]
Likelihood: The equation with the big ∏ starting with Probability = ...
Log Likelihood: The equation starting with ln[Probability]=...
Gradient: The derivative with respect to m (or b)
Gradient Ascent: Adding a small fraction of the gradient back to m (or b)
For follow-up work, check out the Logistic Regression from Scratch in Python post in the references below, where a NumPy-based approach derives a multiple-variable logistic regression in about 20 lines of code. Try coding up a two-dimensional extension yourself and play with the plotting code in the references to get an intuition for the meaning of the coefficients.
References
Derivation of logistic regression: https://web.stanford.edu/class/archive/cs/cs109/cs109.1178/lectureHandouts/220-logistic-regression.pdf
Discussion of the logit transformation: https://data.princeton.edu/wws509/notes/c3.pdf#page=6
Logistic Regression from Scratch in Python: https://beckernick.github.io/logistic-regression-from-scratch/
Plotting the decision boundary of a logistic regression model: https://scipython.com/blog/plotting-the-decision-boundary-of-a-logistic-regression-model/
LaTeX generated with https://latexeditor.lagrida.com/
All the Code
Did you scroll all the way down to the bottom just to get the full code to copy/paste and see if this works for you locally? If so, enjoy :-)
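A consolidated sketch under the same assumptions as the snippets above (vega_datasets column names; hypothetical learning rate and step count):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from vega_datasets import data

    # Load the movie data; keep G-rated movies with known ratings and financials.
    df = data.movies()
    df = df[df["MPAA_Rating"] == "G"].dropna(
        subset=["Rotten_Tomatoes_Rating", "Worldwide_Gross", "Production_Budget"]
    )
    df["profit"] = df["Worldwide_Gross"] - df["Production_Budget"]
    df["good_rt"] = (df["Rotten_Tomatoes_Rating"] > 75).astype(int)

    def prob(x, m, b):
        # The derived logistic function: net profit in, probability of "good" out.
        return 1 / (1 + np.exp(-(m * x + b)))

    # Fit by gradient ascent on the log likelihood.
    profits = df["profit"].to_numpy()
    labels = df["good_rt"].to_numpy()
    m, b = 0.0, 0.0
    learning_rate = 1e-19  # hypothetical; tiny because the m-gradient scales with huge profits
    for _ in range(100_000):
        p = prob(profits, m, b)
        m += learning_rate * np.sum(profits * (labels - p))
        b += learning_rate * np.sum(labels - p)
    print("home-rolled:", m, b)

    # Sanity check against scikit-learn (large C ~ no regularization).
    clf = LogisticRegression(C=1e9)
    clf.fit(df[["profit"]], df["good_rt"])
    print("scikit-learn:", clf.coef_[0][0], clf.intercept_[0])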