Support Vector Machines: From Hard Margin to Soft Margin


Harsimranjit Singh

Posted on August 12, 2024


Support Vector Machines (SVMs) are powerful tools in machine learning for classifying data and predicting values. They are popular in fields like bioinformatics and financial forecasting because they handle complex data problems well. This article explains SVMs in a simple way, covering the maximal margin classifier, support vectors, the hard-margin formulation, slack variables, hinge loss, and the soft-margin formulation.

What is an SVM?

At its core, an SVM performs classification by finding the hyperplane that best divides a dataset into classes. The unique aspect of SVMs lies in their ability to find the optimal hyperplane that maximizes the margin, that is, the distance between the hyperplane and the nearest data points from each class. A larger margin helps the SVM make better predictions on new data because the decision boundary is clearer. The SVM does this with the help of support vectors.

Support Vectors

Support vectors are the subset of training points that are directly involved in constructing the optimal separating hyperplane. These points are crucial because they lie on the edge of the margin and determine the position and orientation of the hyperplane.


Mathematically, if you have a dataset with class labels y_i \in \{-1, +1\} and feature vectors x_i, the support vectors satisfy the condition:

y_i (w \cdot x_i + b) = 1

Where does this equation come from?

The hyperplane in d-dimensional space is defined by the equation:

w \cdot x + b = 0

where:

  • w is the normal vector (it is perpendicular to the hyperplane).
  • b is the bias term (it shifts the hyperplane away from the origin).

For a given data point x_i with class label y_i:

  • if y_i = +1, the data point belongs to the positive class.
  • if y_i = -1, the data point belongs to the negative class.

We want the hyperplane to correctly classify these data points, so we need to ensure that:

  • Points from the positive class are on one side of the hyperplane. For points on the positive side: w \cdot x_i + b \geq 1
  • Points from the negative class are on the other side. For points on the negative side: w \cdot x_i + b \leq -1

To combine these constraints into a single form, we use the class label y_i. The combined constraint can be written as:

y_i (w \cdot x_i + b) \geq 1

Here's why:

  • if y_i = +1, the constraint becomes w \cdot x_i + b \geq 1, which ensures correct classification of positive-class points.
  • if y_i = -1, the constraint becomes w \cdot x_i + b \leq -1, which ensures correct classification of negative-class points.

A quick numeric check of the combined constraint is sketched below.
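This is a minimal sketch using an illustrative (made-up) weight vector w and bias b, not values from a trained model:

import numpy as np

# Illustrative hyperplane parameters (hypothetical, for demonstration only)
w = np.array([1.0, 1.0])
b = -3.0

# One point from each class, with labels in {-1, +1}
points = np.array([[3.0, 2.0],   # intended positive-class point
                   [0.5, 1.0]])  # intended negative-class point
labels = np.array([1, -1])

# The single constraint y_i * (w . x_i + b) >= 1 covers both classes at once
values = labels * (points @ w + b)
print(values)        # [2.  1.5]
print(values >= 1)   # [ True  True]

Both values are at least 1, so both points satisfy the combined constraint.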

Margin Calculation

In SVMs, the margin is the distance between the hyperplane and the closest data points from each class (the support vectors).

To calculate the margin, we use the following formula:

\text{Margin} = \frac{2}{||w||}

Where does this formula come from?

The perpendicular distance d from a point x_i to the hyperplane is:

d = \frac{|w \cdot x_i + b|}{||w||}

For the support vectors, this distance is exactly

d = \frac{1}{||w||}

because support vectors lie on the boundaries of the margin, where

|w \cdot x_i + b| = 1

The margin is the distance between the two boundary hyperplanes

w \cdot x_i + b = 1
w \cdot x_i + b = -1

so the total margin is:

\text{Margin} = \frac{2}{||w||}
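These quantities are easy to compute directly. A minimal sketch, again with illustrative (made-up) values for w and b:

import numpy as np

# Illustrative hyperplane parameters (hypothetical, not from a trained model)
w = np.array([1.0, 1.0])
b = -3.0

def distance_to_hyperplane(x, w, b):
    # Perpendicular distance from x to the hyperplane w . x + b = 0
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

print(1 / np.linalg.norm(w))   # distance from a support vector to the hyperplane
print(2 / np.linalg.norm(w))   # full margin between the two boundary hyperplanes
print(distance_to_hyperplane(np.array([3.0, 2.0]), w, b))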

Understanding Hard Margin SVM

The term "hard margin" comes from the fact that the algorithm requires all data points to be classified correctly with a margin of at least 1, i.e. y_i (w \cdot x_i + b) \geq 1. In other words, there is no allowance for misclassification. This strict requirement is why it is called a "hard" margin.

Formulating the Hard Margin SVM

1. Objective function

The goal of the hard-margin SVM is to maximize the margin between the two classes. As we previously discussed:

\text{Margin} = \frac{2}{||w||}

To maximize the margin, we need to minimize its reciprocal:

\text{Minimize } \frac{1}{2} ||w||^2

Why square the norm? Because squaring makes the objective smooth and differentiable, which makes it easier to compute gradients and apply gradient-based optimization methods.
Minimizing the squared norm is equivalent to minimizing the norm itself: minimizing ||w||^2 always leads to the same optimal w as minimizing ||w||.

2. Constraints:

The constraints ensure that each point is correctly classified and lies on or outside the margin boundary:
y_i (w \cdot x_i + b) \geq 1

3. Hard Margin SVM Optimization Problem:

Putting it all together, the hard-margin SVM optimization problem is:
\text{Minimize } \frac{1}{2} ||w||^2 \text{ subject to } y_i (w \cdot x_i + b) \geq 1, \ \forall i

Now we need to solve this optimization problem to find w and b. A small sketch of how this can be done with an off-the-shelf solver follows.
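This is a minimal sketch, assuming the cvxpy library is available and using a tiny made-up dataset that happens to be linearly separable (a hard-margin solution only exists in that case):

import numpy as np
import cvxpy as cp

# Tiny made-up dataset that is linearly separable
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

w = cp.Variable(2)
b = cp.Variable()

# Minimize 1/2 ||w||^2 subject to y_i (w . x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [y[i] * (X[i] @ w + b) >= 1 for i in range(len(y))]
cp.Problem(objective, constraints).solve()

print('w =', w.value, 'b =', b.value)
print('margin =', 2 / np.linalg.norm(w.value))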

Problem with Hard Margin

While hard-margin SVMs are effective for linearly separable data, they come with certain limitations.

They fail when the data contain outliers or points that cannot be classified correctly.

When two points are outliers, the hard-margin SVM fails to find a decision boundary: it tries to classify all the points correctly, but no hyperplane can satisfy the constraints for those two points.

To tackle this, the soft-margin SVM is used.


Soft Margin SVM

While the hard-margin SVM works well with linearly separable data, it struggles with datasets containing outliers or overlapping classes. To address these limitations, the soft-margin SVM introduces a concept called "slack variables".


Slack Variables

A slack variable \xi_i is introduced for each data point to measure the degree to which that point is allowed to violate the margin constraint. Slack variables allow for some misclassification and margin violations. For each data point i, the slack variable \xi_i is non-negative and represents the extent of the constraint violation.

  • if \xi_i = 0, the data point is correctly classified and lies on or outside the margin.
  • if 0 < \xi_i < 1, the data point is correctly classified but lies within the margin.
  • if \xi_i \geq 1, the data point is misclassified.

Soft-Margin SVM: Objective Function and Constraints

Objective Function
The soft-margin SVM modifies the objective function to incorporate slack variables. The goal is now to balance maximizing the margin against minimizing the classification error. The new objective function is:

\text{Minimize } \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i

where:

  • \frac{1}{2} ||w||^2 still represents the margin-maximization term.
  • C is the regularization parameter that controls the trade-off between the margin size and the classification error.
  • \xi_i are the slack variables representing the extent of the violation for each data point.

Constraints
The constraints are modified to account for slack variables:

y_i (w^T x_i + b) \geq 1 - \xi_i
\xi_i \geq 0

where:

  • The term 1 - \xi_i allows some flexibility by permitting data points to lie within or beyond the margin.
  • \xi_i \geq 0 ensures that slack variables are non-negative, representing non-negative deviations from the ideal margin.

A sketch of this optimization problem in code follows.
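This is a minimal sketch, assuming the cvxpy library is available; it writes out the objective and constraints above directly, using the small dataset from the worked example later in this article:

import numpy as np
import cvxpy as cp

# The small dataset used in the worked example below; it is not linearly
# separable, so slack variables are needed
X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]], dtype=float)
y = np.array([1, 1, -1, -1, 1])
C = 1.0
n = len(y)

w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)   # one slack variable per data point

# Minimize 1/2 ||w||^2 + C * sum(xi_i) subject to y_i (w . x_i + b) >= 1 - xi_i
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [y[i] * (X[i] @ w + b) >= 1 - xi[i] for i in range(n)]
cp.Problem(objective, constraints).solve()

print('w =', w.value, 'b =', b.value)
print('slack variables =', np.round(xi.value, 2))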

What is \xi_i?

Hinge Loss
The hinge loss function is a crucial concept for understanding the role of \xi_i in SVMs. It measures the cost associated with misclassifications and margin violations. The hinge loss for a data point (x_i, y_i) is defined as:

L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)), \quad \text{where } f(x_i) = w \cdot x_i + b

At the optimum, the slack variable is exactly this loss: \xi_i = \max(0, 1 - y_i f(x_i)).

Intuition behind the Hinge Loss Function

1. Decision Function:

In SVMs, the decision function is f(x_i) = w \cdot x_i + b, where w is the weight vector, x_i is the input feature vector, and b is the bias term.

2. Margin and Classification:

  • For a correctly classified point we want y_i f(x_i) \geq 1. This ensures the point is on the correct side of the margin.
  • For a point that violates the margin or is misclassified, y_i f(x_i) < 1.

3. Hinge Loss:

The hinge loss penalizes points that do not satisfy the margin requirement y_i f(x_i) \geq 1. The loss is defined as:

L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))

  • Correct classification with margin:
    when y_i f(x_i) \geq 1,
    L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)) = 0
    There is no penalty because the point is correctly classified and lies on or outside the margin.

  • Margin violation:
    when 0 \leq y_i f(x_i) < 1,
    L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)) = 1 - y_i f(x_i)
    The point is correctly classified but lies within the margin, resulting in a penalty proportional to its distance from the margin.

  • Misclassification:
    when y_i f(x_i) < 0,
    L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)) = 1 - y_i f(x_i)
    The point is misclassified, resulting in a penalty that increases as the point moves further from the correct side of the decision boundary.

Explaining Hinge Loss with an Example

Let's use a small dataset consisting of five points:

  • (1,2) with label 1
  • (2,2) with label 1
  • (2.5,2.5) with label -1
  • (3,1) with label -1
  • (3,3) with label 1

Let's fit a linear SVM with a soft margin (C = 1).

From the trained model:

  • w = [0.67, 0.67]
  • b = -2

Hinge Loss Calculation

For each point, we calculate the hinge loss:

  1. Point (1,2):
    • y = 1
    • f(x) = 0.67 × 1 + 0.67 × 2 - 2 = 1.00
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - 1 × 1.00) = 0.00
    • Correctly classified and outside the margin
  2. Point (2,2):
    • y = 1
    • f(x) = 0.67 × 2 + 0.67 × 2 - 2 = 0.50
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - 1 × 0.50) = 0.50
    • Correctly classified but within the margin
  3. Point (2.5,2.5):
    • y = -1
    • f(x) = 0.67 × 2.5 + 0.67 × 2.5 - 2 = 1.33
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 + 1.33) = 2.33
    • Misclassified
  4. Point (3,1):
    • y = -1
    • f(x) = 0.67 × 3 + 0.67 × 1 - 2 = -1.00
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - (-1) × (-1.00)) = 0.00
    • Correctly classified and outside the margin
  5. Point (3,3):
    • y = 1
    • f(x) = 0.67 × 3 + 0.67 × 3 - 2 = 2.00
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - 1 × 2.00) = 0.00
    • Correctly classified and outside the margin
The following code fits the model, plots the decision boundary and margin lines, and computes the hinge loss for each point:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Small toy dataset from the worked example above
X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

# Fit a linear soft-margin SVM
model = SVC(kernel='linear', C=1)
model.fit(X, y)
w = model.coef_[0]
b = model.intercept_[0]

# Decision boundary: w[0]*x + w[1]*y + b = 0  ->  y = -(w[0]*x + b) / w[1]
x_plot = np.linspace(0, 4, 100)
y_plot = -(w[0] * x_plot + b) / w[1]

# Margin boundaries: w . x + b = +1 and w . x + b = -1
y_margin1 = y_plot + 1 / w[1]
y_margin2 = y_plot - 1 / w[1]

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', s=50)
plt.plot(x_plot, y_plot, 'k-', label='Decision Boundary')
plt.plot(x_plot, y_margin1, 'k--', label='Margin')
plt.plot(x_plot, y_margin2, 'k--')

# Label each point with its coordinates
for i in range(len(X)):
    plt.annotate(f'({X[i][0]},{X[i][1]})', (X[i][0], X[i][1]))

plt.annotate('Misclassified Point', xy=(2.5, 2.5), xytext=(3, 1.5),
             arrowprops=dict(facecolor='red', shrink=0.05),
             fontsize=10, color='red')
plt.annotate('Correctly Classified\nBut Within Margin', xy=(2, 2), xytext=(0.5, 2.5),
             arrowprops=dict(facecolor='blue', shrink=0.05),
             fontsize=10, color='blue')

plt.xlim(0, 4)
plt.ylim(0, 4)
plt.title('Soft Margin SVM')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

# Hinge loss for a single point: max(0, 1 - y * f(x)) with f(x) = w . x + b
def hinge_loss_point(x, y_true, w, b):
    f_x = np.dot(w, x) + b
    return max(0, 1 - y_true * f_x)

for i in range(len(X)):
    x_i = X[i]
    y_i = y[i]
    loss = hinge_loss_point(x_i, y_i, w, b)
    print(f'Point {i+1}: x={x_i}, y={y_i}, f(x)={np.dot(w, x_i) + b:.2f}, Hinge Loss={loss:.2f}')


Short Summary

  • Correctly classified and outside the margin: no penalty, like Point (1,2), Point (3,1), and Point (3,3).
  • Correctly classified but within the margin: penalty proportional to the distance from the margin, like Point (2,2).
  • Misclassified: high penalty, increasing with the distance from the correct side of the boundary, like Point (2.5,2.5).

Constraint for Soft-Margin

y_i (w^T x_i + b) \geq 1 - \xi_i
\xi_i \geq 0

Why 1 - \xi_i?

Let's take simple examples of the different cases:

  1. Point A: correctly classified and outside the margin
    • Slack variable: \xi_A = 0
    • Constraint: y_A (w^T x_A + b) \geq 1 - \xi_A, i.e. \geq 1
    • Satisfied: the constraint is met with \xi_A = 0, so there is no margin violation.
  2. Point B: correctly classified but within the margin (0 < \xi_B < 1)
    • Slack variable: \xi_B = 0.5
    • Constraint: y_B (w^T x_B + b) \geq 1 - \xi_B, i.e. \geq 0.5
    • Satisfied: the point is within the margin but still classified correctly; the slack variable \xi_B = 0.5 allows this.
    • Interpretation
      • Without the slack variable, the constraint would be y_B (w^T x_B + b) \geq 1, meaning x_B would have to be classified correctly and lie at least 1 unit of functional margin away from the decision boundary.
      • With the slack variable, the requirement is relaxed from 1 to 0.5, so the point can lie within the margin and still satisfy the constraint.
  3. Point C: misclassified
    • Slack variable: \xi_C = 1
    • Constraint: y_C (w^T x_C + b) \geq 1 - \xi_C, i.e. \geq 0
    • Satisfied: the point is misclassified, but the slack variable \xi_C = 1 relaxes the constraint enough to allow this.

A numeric check of these three cases on the example dataset is sketched below.
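This is a minimal sketch that recomputes the slack variables from the scikit-learn model fitted earlier, using the relation \xi_i = \max(0, 1 - y_i f(x_i)):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

model = SVC(kernel='linear', C=1).fit(X, y)
w, b = model.coef_[0], model.intercept_[0]

# At the optimum the slack equals the hinge loss: xi_i = max(0, 1 - y_i * f(x_i))
f = X @ w + b
xi = np.maximum(0, 1 - y * f)

for point, label, slack in zip(X, y, xi):
    if slack == 0:
        case = 'outside the margin (like Point A)'
    elif slack < 1:
        case = 'within the margin but correctly classified (like Point B)'
    else:
        case = 'misclassified (like Point C)'
    print(point, label, round(slack, 2), case)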

Real Objective Function

In the soft-margin SVM, we aim to balance two objectives:

  1. Maximizing the margin (which is equivalent to minimizing \frac{1}{2} ||w||^2).
  2. Minimizing the classification error (which is related to the slack variables).

So the full problem is:

\text{Minimize } \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i

subject to

y_i (w^T x_i + b) \geq 1 - \xi_i
\xi_i \geq 0

The soft-margin SVM problem is a quadratic programming problem: it has a quadratic objective function and linear constraints. To solve it efficiently, it is often converted to its dual form using Lagrange multipliers.
The dual form is:

\max_{\alpha} \left( \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j) \right)

subject to 0 \leq \alpha_i \leq C and \sum_{i=1}^n \alpha_i y_i = 0.

Derivation

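A compact sketch of the standard derivation via the Lagrangian (KKT details omitted):

L(w, b, \xi, \alpha, \mu) = \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i (w^T x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i, \quad \alpha_i \geq 0, \ \mu_i \geq 0

Setting the partial derivatives to zero gives:

\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^n \alpha_i y_i x_i
\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^n \alpha_i y_i = 0
\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow C - \alpha_i - \mu_i = 0 \Rightarrow 0 \leq \alpha_i \leq C

Substituting these back into the Lagrangian eliminates w, b, and \xi and yields the dual problem stated above.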

Regularization Parameter (C) in SVM

The regularization parameter C in SVM plays a crucial role in the bias-variance tradeoff:

  • High C value:
    • Effect: a high C value aims to minimize the training error by allowing fewer misclassifications on the training set. This reduces bias but increases variance, making the model more prone to overfitting.
    • Result: the decision boundary becomes very tight around the support vectors, potentially fitting noise in the training data.
  • Low C value:
    • Effect: a low C value allows more misclassifications on the training set, which increases bias but reduces variance. The model is less flexible and smoother.
    • Result: the decision boundary is smoother and less sensitive to the training data, which can prevent overfitting but might lead to underfitting.

A small code sketch illustrating this trade-off follows.
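This is a minimal sketch on the toy dataset from the worked example; the exact numbers depend on the data and solver, but it shows how C affects the fitted margin:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    w = model.coef_[0]
    margin = 2 / np.linalg.norm(w)
    # Larger C -> fewer tolerated violations and typically a narrower margin;
    # smaller C -> more tolerated violations and a wider margin
    print(f'C={C}: margin={margin:.2f}, support vectors={len(model.support_)}')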

Conclusion

Support Vector Machines offer a robust approach for classification tasks, excelling in scenarios where a clear decision boundary is essential. By understanding key concepts such as support vectors, the margin, slack variables, and hinge loss, practitioners can effectively leverage SVMs for a wide range of machine learning applications.
