As we know, for the soft-margin support vector machine we usually end up solving the dual problem below, whose optimum is characterized by the KKT conditions:
\[\begin{aligned} \max_\alpha\quad &f(\alpha)=\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^\top x_j\\ \text{s.t.}\quad &\sum_{i=1}^m\alpha_iy_i=0\\ &0\leqslant\alpha_i\leqslant C,\quad i=1,\dots,m\\ \end{aligned}\]where $m$ is the number of training samples and $y_i\in\{-1,+1\}$ is the label of the $i$-th sample.
In Zhou Zhihua's *Machine Learning*, beyond telling the reader that this problem is solved with the SMO algorithm, no further details are given. This post is a brief introduction to the SMO algorithm for solving the problem above.
SMO is in fact a heuristic algorithm: pick two variables $\alpha_i$ and $\alpha_j$, hold all the other parameters fixed, and the problem reduces to a quadratic program in just two variables. After finding the pair $\alpha_i$, $\alpha_j$ that maximizes the objective, fix them, pick another two variables, and repeat until the objective value converges.
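Spelled out, each step solves the following two-variable subproblem, with everything outside the chosen pair $(\alpha_i,\alpha_j)$ frozen:
\[\max_{\alpha_i,\alpha_j}\ f(\alpha)\qquad\text{s.t.}\quad \alpha_iy_i+\alpha_jy_j=-\sum_{k\neq i,j}\alpha_ky_k,\quad 0\leqslant\alpha_i,\alpha_j\leqslant C\]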
One might ask: wouldn't it be even more convenient to free only a single variable and fix all the others? The answer lies in the equality constraint of this optimization problem:
\[\sum_{i=1}^m\alpha_i y_i=0\]If the other $m-1$ variables are fixed, then every variable is in effect fixed and there is nothing left to optimize, so at least two variables must be selected.
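Indeed, with $\alpha_2,\dots,\alpha_m$ held fixed, the equality constraint already determines $\alpha_1$ completely (multiply both sides by $y_1$ and use $y_1^2=1$):
\[\alpha_1=-y_1\sum_{i=2}^m\alpha_iy_i\]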
Choosing a good pair of variables is an art in itself, but we will not cover it here.
Without loss of generality, take $\alpha_1$ and $\alpha_2$ as the active variables and rewrite the objective:
\[\begin{aligned} f(\alpha_1,\alpha_2)&=\sum_{i=1}^m\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^\top x_j\\ &=\alpha_1+\alpha_2+\sum_{i=3}^m\alpha_i-\frac12\big[\alpha_1^2y_1^2x^\top_1x_1+2\alpha_1y_1\alpha_2y_2x_1^\top x_2+\alpha_2^2y_2^2x_2^\top x_2\\ &\quad+2\alpha_1y_1\sum_{i=3}^m\alpha_iy_ix_1^\top x_i+2\alpha_2y_2\sum_{i=3}^m\alpha_iy_ix^\top_2x_i+\sum_{i=3}^m\sum_{j=3}^m\alpha_i\alpha_j y_iy_jx^\top_ix_j \big]\\ \end{aligned}\]Here we introduce a shorthand for the inner products $x_i^\top x_j$:
\[K=[x_i^\top x_j]_{m\times m}\]so that $K_{ij}$ conveniently denotes an inner product. If instead we set
\[K=[\phi(x_i)^\top\phi(x_j)]_{m\times m}\]the same notation extends to kernel functions, which keeps the expressions uniform. By the properties of kernels and inner products, both versions of $K$ are symmetric matrices.
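As a small illustration, the matrix $K$ can be precomputed once up front. This is a minimal sketch; the helper name `gram_matrix` and the RBF example are my own, not from the text:

```python
import numpy as np

def gram_matrix(X, kernel=None):
    """Build K = [k(x_i, x_j)]_{m x m}. With kernel=None this is the
    plain inner-product matrix K = X X^T; any kernel function
    phi(x_i)^T phi(x_j) can be plugged in instead."""
    if kernel is None:
        return X @ X.T
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            K[i, j] = K[j, i] = kernel(X[i], X[j])  # K is symmetric
    return K

# Example: a Gaussian (RBF) kernel in place of the plain inner product
rbf = lambda u, v, gamma=0.5: np.exp(-gamma * np.sum((u - v) ** 2))
```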
With this notation, the objective becomes:
\[\begin{aligned} f(\alpha_1,\alpha_2)&=\alpha_1+\alpha_2+\sum_{i=3}^m\alpha_i-\frac12\big[\alpha_1^2y_1^2K_{11}+2\alpha_1y_1\alpha_2y_2K_{12}+\alpha_2^2y_2^2K_{22}\\ &\quad+2\alpha_1y_1\sum_{i=3}^m\alpha_iy_iK_{1i}+2\alpha_2y_2\sum_{i=3}^m\alpha_iy_iK_{2i}+\sum_{i=3}^m\sum_{j=3}^m\alpha_i\alpha_j y_iy_jK_{ij}\big]\\ \end{aligned}\]Since $\alpha_3$ through $\alpha_m$ are fixed and $y_i^2=1$, this simplifies to:
\[\begin{aligned} f(\alpha_1,\alpha_2)&=\alpha_1+\alpha_2-\frac12\big[\alpha_1^2K_{11}+2\alpha_1y_1\alpha_2y_2K_{12}+\alpha_2^2K_{22}\\ &\quad+2\alpha_1y_1\sum_{i=3}^m\alpha_iy_iK_{1i}+2\alpha_2y_2\sum_{i=3}^m\alpha_iy_iK_{2i}\big]+\mathrm{const}\\ \end{aligned}\]where $\mathrm{const}$ collects the terms involving neither $\alpha_1$ nor $\alpha_2$; it plays no role in locating the maximizer, so we drop it, i.e. we let $f(\alpha_1,\alpha_2)\gets f(\alpha_1,\alpha_2)-\mathrm{const}$. Next, because the last $m-2$ variables are frozen, the equality constraint gives:
\[\begin{aligned} \alpha_1y_1+\alpha_2y_2&=-\sum_{i=3}^m\alpha_iy_i=\zeta\\ \alpha_1y_1&=\zeta-\alpha_2y_2\\ \alpha_1&=y_1(\zeta-\alpha_2y_2)\\ \end{aligned}\]We call this constant $\zeta$ rather than $C$, since $C$ already denotes the box bound; the last step multiplies both sides by $y_1$, using $1/y_1=y_1$ because $y_1^2=1$. Substituting this expression for $\alpha_1$ turns $f(\alpha_1,\alpha_2)$ into a function of $\alpha_2$ alone:
\[\begin{aligned} f(\alpha_2)&=y_1(\zeta-\alpha_2y_2)+\alpha_2-\dfrac{1}{2}\big[ (\zeta-\alpha_2y_2)^2K_{11}+2(\zeta-\alpha_2y_2)\alpha_2y_2K_{12}+\alpha_2^2K_{22}+\\ &\qquad 2(\zeta-\alpha_2y_2)\sum_{i=3}^m\alpha_iy_iK_{1i}+2\alpha_2y_2\sum_{i=3}^m\alpha_iy_iK_{2i} \big] \end{aligned}\]Taking the derivative of $f(\alpha_2)$ and setting it to zero:
\[\begin{aligned} f'(\alpha_2)&=-y_1y_2+1-\dfrac{1}{2}\big[ -2(\zeta-\alpha_2y_2)y_2K_{11}+2\zeta y_2K_{12}-4\alpha_2K_{12}+2\alpha_2K_{22}\\ &\quad-2y_2\sum_{i=3}^m\alpha_iy_iK_{1i}+2y_2\sum_{i=3}^m\alpha_iy_iK_{2i} \big]\\ &=1-y_1y_2+\zeta y_2K_{11}-\alpha_2K_{11}-\zeta y_2K_{12}+2\alpha_2K_{12}-\alpha_2K_{22}+y_2\sum_{i=3}^m\alpha_iy_iK_{1i}\\ &\qquad-y_2\sum_{i=3}^m\alpha_iy_iK_{2i} \\ &=0 \end{aligned}\]we obtain the equation:
\[\begin{aligned} (K_{11}-2K_{12}+K_{22})\alpha_2&=1-y_1y_2+\zeta y_2K_{11}-\zeta y_2K_{12}+y_2\sum_{i=3}^m\alpha_iy_iK_{1i}-y_2\sum_{i=3}^m\alpha_iy_iK_{2i}\\ &=y_2\Big(y_2-y_1+\zeta K_{11}-\zeta K_{12}+\sum_{i=3}^m\alpha_iy_iK_{1i}-\sum_{i=3}^m\alpha_iy_iK_{2i}\Big)\\ \end{aligned}\]Substituting $\zeta=-\sum_{i=3}^m\alpha_iy_i$ here would already give a closed-form solution for $\alpha_2$, but that computation is expensive, and an iterative scheme suits a computer far better. We therefore set $\zeta=\alpha_1^{\text{old}}y_1+\alpha_2^{\text{old}}y_2$ (the equality constraint also holds at the current iterate) and solve for the new value $\alpha_2^{\text{new}}$:
\[\begin{aligned} (K_{11}-2K_{12}+K_{22})\alpha_2^\text{new} &=y_2\bigg(y_2-y_1+(\alpha_1^{\text{old}}y_1+\alpha_2^\text{old}y_2)K_{11}-\\&\qquad(\alpha_1^{\text{old}}y_1+\alpha_2^\text{old}y_2)K_{12}+\sum_{i=3}^m\alpha_iy_iK_{1i}-\sum_{i=3}^m\alpha_iy_iK_{2i}\bigg)\\ \end{aligned}\]Now recall the decision function of the support vector machine:
\[f(x)=\sum_{i=1}^m\alpha_iy_ix_i^\top x+b\]so that
\[\begin{aligned} f(x_1)&=\sum_{i=1}^m\alpha_iy_iK_{1i}+b\\ f(x_2)&=\sum_{i=1}^m\alpha_iy_iK_{2i}+b\\ \end{aligned}\]where $f$ is evaluated at the current (old) multipliers. These two identities let us rewrite the sums on the right-hand side of the iteration equation.
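With the Gram matrix precomputed, each such evaluation is a single dot product. A small sketch; the helper name `decision` is mine, and 0-based indexing is used, so sample 1 maps to index 0:

```python
import numpy as np

def decision(K, y, alpha, b, i):
    """f(x_i) = sum_j alpha_j y_j K_{ij} + b, read off the i-th
    column of the precomputed (symmetric) Gram matrix."""
    return float(np.dot(alpha * y, K[:, i]) + b)
```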
Substituting them in:
\[\begin{aligned} (K_{11}-2K_{12}+K_{22})\alpha_2^\text{new} &=y_2\bigg(y_2-y_1+(\alpha_1^{\text{old}}y_1+\alpha_2^\text{old}y_2)K_{11}-\\&\qquad(\alpha_1^{\text{old}}y_1+\alpha_2^\text{old}y_2)K_{12}+ f(x_1)-\alpha_1^\text{old}y_1K_{11}-\alpha_2^\text{old}y_2K_{12}-b\\ &\qquad-f(x_2)+\alpha_1^\text{old}y_1K_{12}+\alpha_2^\text{old}y_2K_{22}+b \bigg)\\ &=y_2\bigg(f(x_1)-y_1-\big(f(x_2)-y_2\big)+\alpha^\text{old}_2y_2(K_{11}-2K_{12}+K_{22})\bigg)\\ \end{aligned}\]Now write $E_1\gets f(x_1)-y_1$ and $E_2\gets f(x_2)-y_2$ for the two prediction errors, and let $\eta=K_{11}-2K_{12}+K_{22}$. The update then reads
\[\alpha_2^\text{new}=\dfrac{y_2}{\eta}(E_1-E_2)+\alpha_2^\text{old}\]But we must not celebrate too early, because we have ignored the range constraints on $\alpha_i$:
\[0\leqslant\alpha_i\leqslant C,\quad i=1,2\]Together with $\alpha_1y_1+\alpha_2y_2=\text{const}$, this confines $(\alpha_1^\text{new}, \alpha_2^\text{new})$ to a line segment inside the square $[0,C]^2$. The true update rule for $\alpha_2$ is therefore:
\[\alpha_2^\text{new}\gets \begin{cases} H,&\alpha_2^\text{new}>H\\ \alpha_2^\text{new},&L\leqslant\alpha_2^\text{new}\leqslant H \\ L,&\alpha_2^\text{new}<L\\ \end{cases}\]where $L$ and $H$ are the lower and upper bounds on $\alpha_2$ obtained from the box and equality constraints:
\[L=\max(0,\ \alpha_2^{\text{old}}-\alpha_1^{\text{old}}),\quad H=\min(C,\ C+\alpha_2^{\text{old}}-\alpha_1^{\text{old}})\qquad\text{if }y_1\neq y_2\]
\[L=\max(0,\ \alpha_1^{\text{old}}+\alpha_2^{\text{old}}-C),\quad H=\min(C,\ \alpha_1^{\text{old}}+\alpha_2^{\text{old}})\qquad\text{if }y_1=y_2\]
Using the equality constraint once more, we also obtain the update for $\alpha_1$:
\[\alpha_1^\text{new}=\alpha_1^\text{old}+y_1y_2(\alpha_2^\text{old}-\alpha_2^\text{new})\]This solves the subproblem for the active pair $\alpha_1$, $\alpha_2$; we then select another two variables and repeat, until the objective value converges. That is the core flow of the SMO algorithm. Some pieces are not covered here, notably how the pair of variables is chosen and how the bias $b$ is updated, and the former clearly affects the speed of convergence.
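To tie everything together, here is a minimal sketch of the whole procedure in code. Assumptions worth flagging: the function names `take_step` and `smo` are mine; the random choice of the second variable and the bias update rule are the standard "simplified SMO" choices, not something derived in this post.

```python
import numpy as np

def take_step(alpha, b, i, j, K, y, C):
    """One two-variable update as derived above: compute the errors,
    take the unclipped step on alpha_j, clip it to [L, H], then move
    alpha_i along the constraint line. Mutates alpha; returns (b, changed)."""
    Ei = np.dot(alpha * y, K[:, i]) + b - y[i]   # E_1 = f(x_1) - y_1
    Ej = np.dot(alpha * y, K[:, j]) + b - y[j]   # E_2 = f(x_2) - y_2
    eta = K[i, i] - 2 * K[i, j] + K[j, j]
    if eta <= 0:                                  # degenerate pair, skip it
        return b, False
    # Feasible segment [L, H] from the box and equality constraints
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if L >= H:
        return b, False
    ai_old, aj_old = alpha[i], alpha[j]
    alpha[j] = np.clip(aj_old + y[j] * (Ei - Ej) / eta, L, H)
    if abs(alpha[j] - aj_old) < 1e-8:             # no real progress
        alpha[j] = aj_old
        return b, False
    alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
    # Standard SMO bias update (keeps f consistent for the updated pair)
    b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] - y[j] * (alpha[j] - aj_old) * K[i, j]
    b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] - y[j] * (alpha[j] - aj_old) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = (b1 + b2) / 2
    return b, True

def smo(X, y, C=1.0, max_passes=20, seed=0):
    """Naive driver: sweep over i, pick j at random (real SMO uses the
    |E_i - E_j| heuristic instead), stop after a pass with no change."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    K = X @ X.T                                   # or gram_matrix(X, rbf)
    alpha, b = np.zeros(m), 0.0
    for _ in range(max_passes):
        changed = 0
        for i in range(m):
            j = int(rng.choice([k for k in range(m) if k != i]))
            b, ok = take_step(alpha, b, i, j, K, y, C)
            changed += ok
        if changed == 0:
            break
    return alpha, b
```

As a quick smoke test on a linearly separable toy set, the recovered classifier should reproduce the labels:

```python
X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = smo(X, y, C=1.0)
w = (alpha * y) @ X                # recover the primal weight vector
print(np.sign(X @ w + b))          # should match y
```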