Shengbin's Studio.

Support Vector Machine Implementation With Python

2018/03/01

In this post, an SVM classifier is implemented with both a "Hard Margin" and a "Soft Margin". The hard margin is used to classify separable data, while the soft margin is used to classify inseparable data. In addition, kernels can be used in this SVM classifier: it is implemented with a linear kernel and a Gaussian kernel (RBF kernel).

SVM Classifier

We will use the quadratic program solver CVXOPT to solve the Lagrangian dual of the SVM. A tutorial on CVXOPT can be found here (CVXOPT).

import numpy as np
import matplotlib.pyplot as plt
import cvxopt
from cvxopt import matrix
from cvxopt import solvers
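
As a quick, self-contained sketch of the solvers.qp interface (the numbers here are purely illustrative and not part of the SVM itself), a tiny quadratic program in CVXOPT's standard form looks like this:

## Illustrative only: minimize (1/2) x^T P x + q^T x  subject to  Gx <= h, Ax = b
P = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))
q = matrix(np.array([-2.0, -4.0]))
G = matrix(-np.identity(2))            ## encodes x >= 0 as -x <= 0
h = matrix(np.zeros((2, 1)))
A = matrix(np.array([[1.0, 1.0]]))     ## encodes x1 + x2 = 1
b = matrix(1.0)
sol = solvers.qp(P, q, G, h, A, b)
print(np.ravel(sol['x']))              ## the optimal x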

Hard Margin SVM

The primal problem of the hard-margin SVM is

$$\min_{\lambda, \lambda_0} \ \frac{1}{2}\|\lambda\|^2 \quad \text{s.t.} \quad y_i(\lambda^{\top}\mathbf{x_i} + \lambda_0) \geq 1, \ \forall i$$

The Lagrangian of the SVM is

$$L(\lambda, \lambda_0, \alpha) = \frac{1}{2}\|\lambda\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(\lambda^{\top}\mathbf{x_i} + \lambda_0) - 1 \right], \quad \alpha_i \geq 0$$

The KKT conditions are

$$\frac{\partial L}{\partial \lambda} = 0 \Rightarrow \lambda = \sum_{i=1}^n \alpha_i y_i \mathbf{x_i}, \qquad \frac{\partial L}{\partial \lambda_0} = 0 \Rightarrow \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i \left[ y_i(\lambda^{\top}\mathbf{x_i} + \lambda_0) - 1 \right] = 0$$

The dual problem is

$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle \mathbf{x_i}, \mathbf{x_j} \rangle \quad \text{s.t.} \quad \alpha_i \geq 0 \ \forall i, \quad \sum_{i=1}^n \alpha_i y_i = 0$$

This dual problem can be solved by a quadratic program solver. In CVXOPT's standard form it is

$$\min_{\alpha} \ \frac{1}{2}\alpha^{\top}\mathbf{P}\alpha + \mathbf{q}^{\top}\alpha \quad \text{s.t.} \quad \mathbf{G}\alpha \preceq \mathbf{h}, \quad \mathbf{A}\alpha = b$$

where $P_{ij} = y_i y_j \langle \mathbf{x_i}, \mathbf{x_j} \rangle$, $\mathbf{q} = -\mathbf{1}$, $\mathbf{G} = -\mathbf{I}$, $\mathbf{h} = \mathbf{0}$, $\mathbf{A} = \mathbf{y}^{\top}$, and $b = 0$.

The optimal values $\alpha_i^{\star}$ can be obtained by solving the above quadratic program using CVXOPT.
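
As a sketch (assuming X is an (n, f) array of training points and y an array of labels in {+1, -1}; the helper name is only for illustration, and the full class below does the same work), the dual maps onto CVXOPT's standard form like this:

def solve_hard_margin_dual(X, y):
    ## Build the QP matrices for the hard-margin dual and solve with CVXOPT.
    n = X.shape[0]
    K = X @ X.T                                      ## Gram matrix of the linear kernel
    P = matrix((np.outer(y, y) * K).astype(np.double))  ## P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones((n, 1)))
    G = matrix(-np.identity(n))                      ## -alpha_i <= 0
    h = matrix(np.zeros((n, 1)))
    A = matrix(y.reshape(1, n).astype(np.double))    ## sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                        ## the optimal alphas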

Making Predictions Using the Linear Kernel

For the linear kernel, we make predictions by

$$\hat{y} = \mathrm{sign}\left( \langle \lambda^{\star}, \mathbf{x} \rangle + \lambda_0^{\star} \right)$$

We can compute $\lambda^{\star}$ by

$$\lambda^{\star} = \sum_{i=1}^n \alpha_i^{\star} y_i \mathbf{x_i}$$

Then we can use $\lambda^{\star}$ and the support vectors to compute $\lambda_0^{\star}$ by

$$\lambda_0^{\star} = y_s - \langle \lambda^{\star}, \mathbf{x_s} \rangle \quad \text{for a support vector } (\mathbf{x_s}, y_s)$$

In this implementation, $\lambda_0^{\star}$ is computed from every support vector, and the average is taken as the final $\lambda_0^{\star}$, as in the sketch below.
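
A minimal sketch of this recovery step (the variable and function names are illustrative; alphas are the solved dual values over all training points):

def linear_weights(alphas, X, y, tol=1e-5):
    ## Keep only the support vectors (alpha_i noticeably greater than zero).
    sv = alphas > tol
    a_sv, x_sv, y_sv = alphas[sv], X[sv], y[sv]
    w = (a_sv * y_sv) @ x_sv               ## lambda* = sum_i alpha_i* y_i x_i
    w_0 = np.mean(y_sv - x_sv @ w)         ## average lambda_0* over all support vectors
    return w, w_0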

Gaussian Kernel (RBF)

For a non-linear kernel, if we do not know how the kernel maps features into the new feature space, $\lambda^{\star}$ cannot be computed by $\lambda^{\star} = \sum_{i=1}^n \alpha_i^{\star} y_i \mathbf{x_i^{\mathcal{H_k}}}$, because the mapped points $\mathbf{x_i^{\mathcal{H_k}}}$ are unknown.
However, all we need to know is the value of the inner products between the $\mathbf{x_i^{\mathcal{H_k}}}$ in the new feature space.
For solving the dual problem,

$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle \mathbf{x_i^{\mathcal{H_k}}}, \mathbf{x_j^{\mathcal{H_k}}} \rangle,$$

we just need to know the values of $\langle \mathbf{x_i^{\mathcal{H_k}}}, \mathbf{x_j^{\mathcal{H_k}}} \rangle$.
For making a prediction,

$$\hat{y} = \mathrm{sign}\left( \sum_{i=1}^n \alpha_i^{\star} y_i \langle \mathbf{x_i^{\mathcal{H_k}}}, \mathbf{x^{\mathcal{H_k}}} \rangle + \lambda_0^{\star} \right),$$

we again only need the inner products of $\mathbf{x}$ with the support vectors in the new feature space.
Thus, for the SVM with a Gaussian kernel, all we need is the kernel value of each pair of vectors, $K(\mathbf{x_i}, \mathbf{x_j}) = \exp\left(-\frac{\|\mathbf{x_i} - \mathbf{x_j}\|^2}{2\sigma^2}\right)$, which plays the role of this inner product. We can then use these values to solve the Lagrangian and make predictions, as sketched below.
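
A small sketch of kernelized prediction (assuming the support vectors sv_x, their labels sv_y, the dual values a, and the intercept w_0 are already available; the names mirror the class below):

def rbf_kernel(x1, x2, sigma=5.0):
    ## Gaussian (RBF) kernel value of two vectors.
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

def rbf_predict(x, sv_x, sv_y, a, w_0, sigma=5.0):
    ## Predict the label of a single point using only kernel evaluations.
    score = sum(a_j * y_j * rbf_kernel(x_j, x, sigma)
                for a_j, y_j, x_j in zip(a, sv_y, sv_x))
    return np.sign(score + w_0)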

Soft Margin

The primal problem of the soft-margin SVM is

$$\min_{\lambda, \lambda_0, \xi} \ \frac{1}{2}\|\lambda\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(\lambda^{\top}\mathbf{x_i} + \lambda_0) \geq 1 - \xi_i, \quad \xi_i \geq 0 \ \forall i$$

The Lagrangian form of this primal is

$$L(\lambda, \lambda_0, \xi, \alpha, \mu) = \frac{1}{2}\|\lambda\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i(\lambda^{\top}\mathbf{x_i} + \lambda_0) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i$$

The dual problem is

$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle \mathbf{x_i}, \mathbf{x_j} \rangle \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C \ \forall i, \quad \sum_{i=1}^n \alpha_i y_i = 0$$

The only difference from the hard-margin dual is the constraint $0 \leq \alpha_i \leq C \ \forall i$. To solve this quadratic program, all we need to change are the matrices $\mathbf{G}$ and $\mathbf{h}$, as sketched below.
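
A minimal sketch of the soft-margin constraint matrices for a given C (the helper name is only illustrative; the class below builds the same matrices inline):

def soft_margin_constraints(n, C):
    ## Encode 0 <= alpha_i <= C for the soft-margin dual.
    G = matrix(np.vstack((-np.identity(n),      ## -alpha_i <= 0
                          np.identity(n))))     ##  alpha_i <= C
    h = matrix(np.vstack((np.zeros((n, 1)),
                          C * np.ones((n, 1)))))
    return G, h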

The full Python implementation of the SVM classifier is shown below.

class SVM:
    ## default kernel is linear
    ## default sigma for gaussian kernel is 5.0
    def __init__(self, kernel='linear', margin='hard', C=1, sigma=5):
        self.k = kernel
        self.a = None
        self.sv_x = None
        self.sv_y = None
        self.w = None
        self.w_0 = None
        self.sigma = sigma
        self.margin = margin
        self.C = C

    def kernel(self, x1, x2):
        if self.k == 'linear':
            return np.dot(x1, x2)
        elif self.k == 'gaussian':
            return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * (self.sigma ** 2)))

    def train(self, X, y):
        caches = {}
        ## y is (1, -1)
        self.train_x = X  ## point to the training data
        self.train_y = y  ## point to the training data
        n, f = X.shape
        P = np.zeros((n, n))
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                k_ij = self.kernel(X[i], X[j])
                P[i, j] = y[i] * y[j] * k_ij
                K[i, j] = k_ij
        P = matrix(P)
        q = matrix(-1 * np.ones((n, 1)))
        if self.margin == 'hard':
            G = matrix(-1 * np.identity(n))
            h = matrix(np.zeros((n, 1)))
        if self.margin == 'soft':
            G = matrix(np.vstack((-1 * np.identity(n), 1 * np.identity(n))))
            h = matrix(np.vstack((np.zeros((n, 1)), self.C * np.ones((n, 1)))))

        A = matrix(y.reshape(1, n).astype(np.double))  ## cvxopt requires a float matrix
        b = matrix(0.0)
        sol = solvers.qp(P, q, G, h, A, b)
        a = np.ravel(sol['x'])

        sv_idx = np.where(a > 1e-5)[0]
        self.sv_x = X[sv_idx]
        self.sv_y = y[sv_idx]
        self.a = a[sv_idx]

        if self.k == 'linear':
            self.w = np.dot((self.a * self.sv_y), self.sv_x)
            self.w_0 = np.mean(self.sv_y - np.dot(self.sv_x, self.w))
            print("The computed intercept term is {0:f}".format(self.w_0))
        else:
            ## can not compute w explicitly for a non-linear kernel
            w_0s = np.zeros_like(self.a)
            for i in range(len(self.a)):
                first_term = np.dot(self.a * self.sv_y, K[sv_idx, sv_idx[i]])
                w_0s[i] = self.sv_y[i] - first_term
            self.w_0 = np.mean(w_0s)
            print("The computed intercept term is {0:f}".format(self.w_0))

        if self.margin == 'hard' and self.evaluate_acc(X, y) < 1.0:
            raise Exception('Hard margin, but data not separable')

        print("The rate of support vectors: {0:0.6f}".format(float(len(self.a)) / n))
        caches['sol'] = sol
        caches['K'] = K
        caches['sv_idx'] = sv_idx
        return caches

    def predict_no_bias(self, x):
        if self.k == 'linear':
            return np.dot(x, self.w)
        else:
            n, f = x.shape
            preds = []
            for i in range(n):
                pred = 0
                for j in range(len(self.a)):
                    pred += self.sv_y[j] * self.a[j] * self.kernel(self.sv_x[j], x[i])
                preds.append(pred)
            return np.array(preds)

    def predict(self, x):
        ## add bias to the prediction
        pred = self.predict_no_bias(x)
        return np.sign(pred + self.w_0)

    def evaluate_acc(self, x, y):
        n, f = x.shape
        pred = self.predict(x)
        acc = np.sum(pred == y) / len(y)
        return acc

    def plot_Roc(self, x_input, y_input):
        n, f = x_input.shape
        pred = self.predict_no_bias(x_input)
        ## sort points by their (bias-free) decision value
        p = np.argsort(pred, axis=0)
        x = x_input[p]
        y = y_input[p]  ## labels after sorting
        pre = pred[p]   ## predictions without bias after sorting
        min_dis = min(pred)
        max_dis = max(pred)

        true_lbs = np.sum(y == 1)
        false_lbs = np.sum(y == -1)

        b = np.linspace(min_dis, max_dis, 200)
        b = b[::-1]
        tprs = []
        fprs = []

        for i, t in enumerate(b):
            TP = 0
            FP = 0
            for j in range(n):
                if pre[j] > t:
                    if y[j] == 1:
                        TP += 1
                    if y[j] == -1:
                        FP += 1

            tprs.append(TP / true_lbs)
            fprs.append(FP / false_lbs)
        auc = np.trapz(tprs, fprs)
        plt.figure(figsize=(5, 5))
        plt.plot(fprs, tprs, color='darkorange', label='ROC curve: AUC={0:0.8f}'.format(auc))
        plt.xlabel('False Positive Rate', fontsize=12)
        plt.ylabel('True Positive Rate', fontsize=12)
        plt.legend(loc="lower right", fontsize=12)
        plt.show()
        return tprs, fprs
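
A minimal usage sketch on synthetic data (the toy data and parameter values are purely illustrative):

## Toy example: two Gaussian blobs with labels in {+1, -1}.
np.random.seed(0)
X_pos = np.random.randn(50, 2) + np.array([2.0, 2.0])
X_neg = np.random.randn(50, 2) - np.array([2.0, 2.0])
X = np.vstack((X_pos, X_neg))
y = np.hstack((np.ones(50), -np.ones(50)))

clf = SVM(kernel='linear', margin='soft', C=1.0)
clf.train(X, y)
print("Training accuracy:", clf.evaluate_acc(X, y))
clf.plot_Roc(X, y)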

That’s it.
