Support Vector Machines: From Hard Margin to Soft Margin
Harsimranjit Singh
Posted on August 12, 2024
Support Vector Machines are powerful tools in machine learning for classifying data and predicting values. They are popular in fields like bioinformatics and financial forecasting because they handle complex data problems well. This article aims to explain SVMs in a simple way, covering the maximal margin classifier, support vectors, the hard-margin and soft-margin formulations, slack variables, and the hinge loss.
What is an SVM?
An SVM performs classification at its core by finding the hyperplane that best divides a dataset into classes. The unique aspect of SVMs lies in their ability to find the optimal hyperplane that maximizes the margin, the distance between the hyperplane and the nearest data points from each class. A larger margin helps the SVM make better predictions on new data because the decision boundary is clearer. The SVM does this with the help of support vectors.
Support Vectors
Support vectors are the subset of training data points that are directly involved in constructing the optimal separating hyperplane. These points are crucial because they lie on the edge of the margin and determine the position and orientation of the hyperplane.
Mathematically, if you have a dataset with class labels {-1, +1} and feature vectors x_i, the support vectors satisfy the condition:
y_i (w · x_i + b) = 1
that is, they lie exactly on the margin boundaries.
Where does this equation come from?
The hyperplane in d-dimensional space is defined by the equation:
w · x + b = 0
where:
- w is the normal vector (perpendicular to the hyperplane).
- b is the bias term (the offset from the origin).
For a given data point x_i with class label y_i:
- if w · x_i + b > 0, the data point belongs to the positive class.
- if w · x_i + b < 0, the data point belongs to the negative class.
We want the hyperplane to correctly classify these data points, so we need to ensure that:
- Points from the positive class are on one side of the hyperplane. For points on the positive side: w · x_i + b ≥ +1
- Points from the negative class are on the other side. For points on the negative side: w · x_i + b ≤ -1
To combine these constraints into a single form, we use the class label y_i; the constraint can be written as:
y_i (w · x_i + b) ≥ 1
Here's why:
- if y_i = +1, the constraint becomes w · x_i + b ≥ 1, which ensures correct classification for positive-class points.
- if y_i = -1, the constraint becomes w · x_i + b ≤ -1, which ensures correct classification for negative-class points.
A short NumPy check of this combined constraint follows below.
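To make the combined constraint concrete, here is a minimal NumPy sketch; the hyperplane parameters w, b and the points are made up for illustration, not taken from a trained model.

import numpy as np

# Hypothetical hyperplane parameters (not from a trained model)
w = np.array([1.0, 1.0])
b = -3.0

# A few made-up points with their class labels
X = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 2.0], [2.5, 2.5]])
y = np.array([-1, -1, 1, 1])

# Combined constraint: y_i * (w . x_i + b) >= 1
functional_margins = y * (X @ w + b)
print(functional_margins)        # [2. 1. 2. 2.]
print(functional_margins >= 1)   # True everywhere -> every point respects the margin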
Margin Calculation
In SVMs, the margin is the distance between the hyperplane and the closest data points from each class (the support vectors).
To calculate the margin, we use the following formula:
margin = 2 / ||w||
Where does this formula come from?
The perpendicular distance d from a point x_0 to the hyperplane w · x + b = 0 is:
d = |w · x_0 + b| / ||w||
For the support vectors, the distance from the hyperplane is exactly 1 / ||w||. This is because support vectors lie on the boundaries of the margin, where y_i (w · x_i + b) = 1 and therefore |w · x_i + b| = 1. Therefore:
d = 1 / ||w||
Now we need the distance between the two margin boundaries, w · x + b = +1 and w · x + b = -1. Each lies at a distance of 1 / ||w|| from the separating hyperplane, so the distance between them, the margin, is:
margin = 2 / ||w||
The short sketch below computes these distances for a hypothetical w and b.
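A tiny sketch of these two formulas, reusing the hypothetical w, b and points from the previous snippet:

import numpy as np

# Hypothetical hyperplane parameters
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 2.0], [2.5, 2.5]])

# Perpendicular distance of each point to the hyperplane: |w . x + b| / ||w||
distances = np.abs(X @ w + b) / np.linalg.norm(w)
print(distances)                  # [1.41 0.71 1.41 1.41]

# Margin: distance between the boundaries w . x + b = +1 and w . x + b = -1
print(2 / np.linalg.norm(w))      # 2 / sqrt(2) ≈ 1.41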
Understanding Hard Margin SVM
The term "Hard Margin" comes from the fact that the algorithm requires all data points to be classified with a margin of at least 1. In other words, there are no allowances for misclassification. These strict requirements are why it's called a "hard" margin
Formulating the Hard Margin SVM
1. Objective function
The goal of the hard-margin SVM is to maximize the margin between the two classes. As we previously discussed, the margin is 2 / ||w||.
To maximize the margin, we can equivalently minimize ||w||; in practice we minimize (1/2) ||w||².
Why square the norm? Because it provides smoothness and differentiability, which makes it easier to compute gradients and perform optimization using gradient-based methods.
Minimizing the squared norm is equivalent to minimizing the norm, because minimizing (1/2) ||w||² always leads to the same optimal w as minimizing ||w||.
2. Constraints:
The constraints ensure that each point is correctly classified and lies at least at a (functional) distance of 1 from the hyperplane:
y_i (w · x_i + b) ≥ 1 for all i = 1, …, n
3. Hard Margin SVM Optimization Problem:
Putting it all together, the hard-margin SVM optimization problem is:
minimize (1/2) ||w||²
subject to y_i (w · x_i + b) ≥ 1 for all i = 1, …, n
Now we need to solve this optimization problem to find the optimal w and b.
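scikit-learn (which the post uses below) has no dedicated hard-margin mode, but a very large C makes the soft-margin penalty so severe that, on linearly separable data, the fit behaves like a hard-margin SVM. A minimal sketch on a made-up separable dataset:

import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A huge C approximates the hard-margin SVM (no violations tolerated)
model = SVC(kernel='linear', C=1e10)
model.fit(X, y)

w = model.coef_[0]
b = model.intercept_[0]
print('w =', w, ', b =', b)
print('margin width =', 2 / np.linalg.norm(w))
print('support vectors:\n', model.support_vectors_)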
Problem with Hard Margin
While hard-margin SVMs are effective for linearly separable data, they come with certain limitations:
They fail in the presence of outliers and misclassified data.
Imagine a dataset in which two points are outliers. In this scenario, the hard-margin SVM fails to produce a decision boundary at all: it tries to classify every point correctly, but no hyperplane can satisfy the constraints for these two points.
To tackle this, the soft-margin SVM is used.
Soft Margin SVM
While the hard-margin SVM works well with linearly separable data, it struggles with datasets containing outliers or overlapping classes. To address these limitations, the soft-margin SVM introduces a concept called "slack variables".
Slack Variables
Slack variables are introduced to measure the degree to which a data point is allowed to violate the margin constraints. They allow for some misclassification and margin violations. For each data point i, the slack variable ξ_i is non-negative and represents the extent of the constraint violation:
- if ξ_i = 0, the data point is correctly classified and lies on or outside the margin.
- if 0 < ξ_i < 1, the data point is correctly classified but lies within the margin.
- if ξ_i ≥ 1, the data point is misclassified.
The sketch below illustrates these three cases numerically.
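At the optimal solution, the slack of each point equals max(0, 1 - y_i (w · x_i + b)). Here is a tiny sketch of the three cases, using made-up decision values f(x_i) rather than a fitted model:

import numpy as np

# Made-up decision values f(x_i) and labels y_i for three points
f_x = np.array([1.7, 0.4, -0.8])
y = np.array([1, 1, 1])

# Slack: how far each point falls short of the margin requirement y_i * f(x_i) >= 1
xi = np.maximum(0, 1 - y * f_x)
print(xi)   # [0.  0.6 1.8] -> on/outside margin, inside margin, misclassified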
Soft-Margin SVM: Objective Function and Constraints
Objective Function
The soft-margin SVM modifies the objective function to incorporate slack variables. The goal is now to balance maximizing the margin against minimizing the classification error. The new objective function is:
minimize (1/2) ||w||² + C Σ_i ξ_i
where:
- (1/2) ||w||² still represents the margin-maximization term.
- C is the regularization parameter that controls the trade-off between the margin size and the classification error.
- ξ_i are the slack variables representing the extent of the margin violation for each data point.
Constraints
The constraints are modified to account for the slack variables:
y_i (w · x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i
where:
- The term 1 - ξ_i allows some flexibility by permitting data points to lie within or beyond the margin.
- ξ_i ≥ 0 ensures that the slack variables are non-negative, representing non-negative deviations from the ideal margin.
The sketch below evaluates this objective and these slacks on a small example.
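Given a fitted model, the objective can be evaluated directly. The sketch below reuses the five-point toy dataset from the worked example later in the post; sklearn's SVC minimizes an equivalent problem internally, so this is only an illustration of the formula:

import numpy as np
from sklearn.svm import SVC

# Toy dataset reused from the worked example below
X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

C = 1.0
model = SVC(kernel='linear', C=C).fit(X, y)
w, b = model.coef_[0], model.intercept_[0]

# Slack variables at the solution: xi_i = max(0, 1 - y_i * (w . x_i + b))
xi = np.maximum(0, 1 - y * (X @ w + b))

# Soft-margin objective: (1/2) * ||w||^2 + C * sum(xi)
objective = 0.5 * np.dot(w, w) + C * xi.sum()
print('slacks:', xi)
print('objective value:', objective)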
What is ξ_i?
Hinge Loss
The hinge loss function is a crucial concept for understanding the role of ξ_i in SVMs. It measures the cost associated with misclassifications and margin violations. The hinge loss for a data point (x_i, y_i) is defined as:
L_i = max(0, 1 - y_i (w · x_i + b))
Intuition behind the Hinge Loss Function
1. Decision Function:
In SVMs, the decision function is
f(x) = w · x + b
where w is the weight vector, x is the input feature vector, and b is the bias term.
2. Margin and Classification:
- For a correctly classified point, we want y_i · f(x_i) ≥ 1. This ensures the point is on the correct side of the margin.
- For a misclassified point, y_i · f(x_i) < 0.
3. Hinge Loss:
The hinge loss penalizes points that do not satisfy the margin requirement y_i · f(x_i) ≥ 1. The loss is defined as:
L_i = max(0, 1 - y_i · f(x_i))
Correct Classification with Margin:
L_i = 0 when y_i · f(x_i) ≥ 1
This means there is no penalty because the point is correctly classified and lies outside the margin.
Margin Violation:
L_i = 1 - y_i · f(x_i) when 0 < y_i · f(x_i) < 1
This means the point is correctly classified but lies within the margin, resulting in a penalty proportional to its distance from the margin.
Misclassification:
L_i = 1 - y_i · f(x_i) ≥ 1 when y_i · f(x_i) ≤ 0
This means the point is misclassified, resulting in a penalty that increases as the point moves further away from the correct side of the decision boundary.
Explaining Hinge Loss with an Example
Let's use a small dataset consisting of five points:
(1,2) with label 1
(2,2) with label 1
(2.5,2.5) with label -1
(3,1) with label -1
(3,3) with label 1
Let's fit a linear SVM model with a soft margin (C = 1). The fitted parameters are approximately:
- w = [0.67, 0.67]
- b = -2
Hinge Loss Calculation
For each point, we calculate the hinge loss max(0, 1 - y · f(x)):

Point (1,2):
- y = 1
- f(x) = 0.67 × 1 + 0.67 × 2 - 2 = 1.00
- Hinge loss: max(0, 1 - 1 × 1.00) = 0.00
- Correctly classified and outside the margin

Point (2,2):
- y = 1
- f(x) = 0.67 × 2 + 0.67 × 2 - 2 = 0.50
- Hinge loss: max(0, 1 - 1 × 0.50) = 0.50
- Correctly classified but within the margin

Point (2.5,2.5):
- y = -1
- f(x) = 0.67 × 2.5 + 0.67 × 2.5 - 2 = 1.33
- Hinge loss: max(0, 1 - (-1) × 1.33) = max(0, 1 + 1.33) = 2.33
- Misclassified

Point (3,1):
- y = -1
- f(x) = 0.67 × 3 + 0.67 × 1 - 2 = -1.00
- Hinge loss: max(0, 1 - (-1) × (-1.00)) = 0.00
- Correctly classified and outside the margin

Point (3,3):
- y = 1
- f(x) = 0.67 × 3 + 0.67 × 3 - 2 = 2.00
- Hinge loss: max(0, 1 - 1 × 2.00) = 0.00
- Correctly classified and outside the margin
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
# Toy dataset from the worked example above
X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

# Fit a linear soft-margin SVM and extract the learned weight vector and bias
model = SVC(kernel='linear', C=1)
model.fit(X, y)
w = model.coef_[0]
b = model.intercept_[0]
# Decision boundary: w[0]*x + w[1]*y + b = 0  =>  y = -(w[0]*x + b) / w[1]
x_plot = np.linspace(0, 4, 100)
y_plot = -(w[0] * x_plot + b) / w[1]

# Margin boundaries: w[0]*x + w[1]*y + b = +1 and -1
y_margin1 = -(w[0] * x_plot + b - 1) / w[1]
y_margin2 = -(w[0] * x_plot + b + 1) / w[1]
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', s=50)
plt.plot(x_plot, y_plot, 'k-', label='Decision Boundary')
plt.plot(x_plot, y_margin1, 'k--', label='Margin')
plt.plot(x_plot, y_margin2, 'k--')
# Label each point with its coordinates
for i in range(len(X)):
    plt.annotate(f'({X[i][0]},{X[i][1]})', (X[i][0], X[i][1]))

plt.annotate('Misclassified Point', xy=(2.5, 2.5), xytext=(3, 1.5),
             arrowprops=dict(facecolor='red', shrink=0.05),
             fontsize=10, color='red')
plt.annotate('Correctly Classified\nBut Within Margin', xy=(2, 2), xytext=(0.5, 2.5),
             arrowprops=dict(facecolor='blue', shrink=0.05),
             fontsize=10, color='blue')
plt.xlim(0, 4)
plt.ylim(0, 4)
plt.title('Soft Margin SVM')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
def hinge_loss_point(x, y, w, b):
    """Hinge loss max(0, 1 - y * f(x)) for a single point."""
    f_x = np.dot(w, x) + b
    return max(0, 1 - y * f_x)

for i in range(len(X)):
    x_i = X[i]
    y_i = y[i]
    loss = hinge_loss_point(x_i, y_i, w, b)
    print(f'Point {i+1}: x={x_i}, y={y_i}, f(x)={np.dot(w, x_i) + b:.2f}, Hinge Loss={loss:.2f}')
Short Summary
- Correctly classified and outside the margin: no penalty, e.g. points (1,2), (3,1) and (3,3).
- Correctly classified but within the margin: a penalty proportional to the distance from the margin, e.g. point (2,2).
- Misclassified: a high penalty that increases with the distance from the correct side of the boundary, e.g. point (2.5,2.5).
Constraint for Soft-Margin
Why is the constraint y_i (w · x_i + b) ≥ 1 - ξ_i?
Let's look at simple examples of the different cases:
Point A: Correctly classified and outside the margin
- Slack variable: ξ_A = 0
- Constraint: y_A (w · x_A + b) ≥ 1 - 0 = 1
- Satisfied: the constraint is met because ξ_A is 0, so there is no margin violation.
Point B: Correctly classified but within the margin
- Slack variable: ξ_B = 0.5
- Constraint: y_B (w · x_B + b) ≥ 1 - 0.5 = 0.5
- Satisfied: the point is within the margin but still classified correctly. The slack variable of 0.5 allows this.
Interpretation
- Without slack variables: the constraint would be y_B (w · x_B + b) ≥ 1. This means the point x_B would need to be classified correctly and lie at least 1 unit (in functional-margin terms) away from the decision boundary.
- With slack variables: the slack variable of 0.5 relaxes the constraint. It changes the requirement from 1 to 0.5, meaning the point can be within the margin and still satisfy the relaxed constraint.
Point C: Misclassified
- Slack variable: ξ_C = 1.5
- Constraint: y_C (w · x_C + b) ≥ 1 - 1.5 = -0.5
- Satisfied: the point is misclassified (it lies on the wrong side of the decision boundary), but the slack variable of 1.5 relaxes the constraint enough to allow this.
A quick numeric check of all three cases is sketched below.
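Here, hypothetical functional margins y_i (w · x_i + b) stand in for points A, B and C:

import numpy as np

# Hypothetical functional margins for points A (outside), B (inside), C (misclassified)
y_f = np.array([1.3, 0.5, -0.5])
xi = np.array([0.0, 0.5, 1.5])   # corresponding slack variables

# Relaxed soft-margin constraint: y_i * (w . x_i + b) >= 1 - xi_i
print(y_f >= 1 - xi)   # [ True  True  True]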
Real Objective Function
In the soft-margin SVM, we aim to balance two objectives:
- Maximizing the margin, which is equivalent to minimizing (1/2) ||w||².
- Minimizing the classification error, which is related to the slack variables Σ_i ξ_i.
Putting both together, we need to
minimize (1/2) ||w||² + C Σ_i ξ_i
subject to y_i (w · x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.
The Soft Margin SVM problem is a type of quadratic programming problem. It involves a quadratic objective function and linear constraints. To solve the problem efficiently, it is often useful to convert it to its dual form using Lagrange multipliers.
The dual form of this problem is:
maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
subject to 0 ≤ α_i ≤ C for all i, and Σ_i α_i y_i = 0.
Derivation
Introduce Lagrange multipliers α_i ≥ 0 for the classification constraints and μ_i ≥ 0 for ξ_i ≥ 0, and form the Lagrangian:
L(w, b, ξ, α, μ) = (1/2) ||w||² + C Σ_i ξ_i - Σ_i α_i [y_i (w · x_i + b) - 1 + ξ_i] - Σ_i μ_i ξ_i
Setting the derivatives with respect to w, b and ξ_i to zero gives w = Σ_i α_i y_i x_i, Σ_i α_i y_i = 0, and C - α_i - μ_i = 0 (which, together with α_i, μ_i ≥ 0, implies 0 ≤ α_i ≤ C). Substituting these back into the Lagrangian eliminates w, b and ξ and yields the dual problem above.
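For reference, scikit-learn's SVC solves this dual problem. After fitting, dual_coef_ stores y_i α_i for the support vectors (the points with α_i > 0), so the box constraint and the stationarity condition w = Σ_i α_i y_i x_i can be checked directly. A sketch on the toy dataset:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

C = 1.0
model = SVC(kernel='linear', C=C).fit(X, y)

# dual_coef_ holds y_i * alpha_i for each support vector
alphas = np.abs(model.dual_coef_[0])
print('support vector indices:', model.support_)
print('alpha values:', alphas)
print('box constraint 0 <= alpha <= C satisfied:', np.all((alphas >= 0) & (alphas <= C)))

# Stationarity: w = sum_i alpha_i * y_i * x_i (summed over the support vectors)
w_from_dual = (model.dual_coef_[0][:, None] * model.support_vectors_).sum(axis=0)
print('w from dual:', w_from_dual)
print('w from coef_:', model.coef_[0])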
Regularization Parameter (C) in SVM
The regularization parameter C in SVM plays a crucial role in the bias-variance tradeoff:
High C value:
- Effect: A high C value aims to minimize the training error by allowing fewer misclassifications on the training set. This reduces bias but increases variance, making the model more prone to overfitting.
- Result: The decision boundary becomes very tight around the support vectors, potentially fitting noise in the training data.
Low C value:
- Effect: A low C value allows more misclassifications on the training set, which increases bias but reduces variance. The model is less flexible and smoother.
- Result: The decision boundary is smoother and less sensitive to the training data, which can prevent overfitting but might lead to underfitting.
The sketch below compares the two regimes on the toy dataset.
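A minimal sketch of this trade-off (the exact numbers are only illustrative): as C decreases, the margin 2 / ||w|| typically widens and more points end up as support vectors.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    w = model.coef_[0]
    print(f'C={C:>6}: margin width = {2 / np.linalg.norm(w):.2f}, '
          f'support vectors = {len(model.support_)}')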
Conclusion
Support Vector Machines offer a robust approach for classification tasks, excelling in scenarios where a clear decision boundary is essential. By understanding key concepts such as support vectors, the margin, slack variables, and hinge loss, practitioners can effectively leverage SVMs for a wide range of machine learning applications.