Beginner's Guide to Scikit-Learn (sklearn)
Anand
Posted on August 2, 2024
What is Scikit-Learn?
Scikit-Learn is a popular Python library that provides simple and efficient tools for data mining, data analysis, and machine learning. It's built on top of other libraries like NumPy, SciPy, and Matplotlib, making it a great choice for building both simple and complex models.
Key Features of Scikit-Learn
- Simple and Consistent Interface: All machine learning models in sklearn follow the same basic interface. Once you learn one, you can use them all (see the short sketch after this list)!
- Wide Range of Algorithms: It includes algorithms for classification, regression, clustering, and more.
- Preprocessing Tools: Easily clean and prepare your data with tools for scaling, normalization, and encoding.
- Model Evaluation: Multiple metrics and tools for validating your models.
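As a preview, here is a minimal sketch of that shared interface, chaining a scaler and a classifier on made-up toy data (the numbers are purely illustrative):
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
# Tiny made-up dataset: two features, two classes
X = [[1.0, 200.0], [2.0, 180.0], [3.0, 40.0], [4.0, 60.0]]
y = [0, 0, 1, 1]
# Every estimator follows the same pattern: create -> fit -> predict
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print("Prediction:", model.predict([[2.5, 100.0]]))
The same create, fit, predict pattern applies to nearly every model in sklearn, and preprocessing steps plug into it neatly via pipelines.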
Installing Scikit-Learn
Before we begin, let's make sure you have Scikit-Learn installed. If you don't have it installed yet, you can easily get it using pip:
pip install scikit-learn
Now, let's get started with some basic examples!
Example 1: Loading and Understanding Data
First things first, let's load some data! Scikit-Learn comes with a bunch of built-in datasets. We'll use the famous Iris dataset for our example.
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
# Check out the features and labels
print("Features:", iris.feature_names)
print("Labels:", iris.target_names)
# Display the first 5 records
print("First 5 records:\n", iris.data[:5])
Output:
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Labels: ['setosa' 'versicolor' 'virginica']
First 5 records:
[[5.1 3.5 1.4 0.2]
[4.9 3.0 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5.0 3.6 1.4 0.2]]
Example 2: Splitting the Data
Before training a model, it's important to split your data into training and testing sets. This helps you evaluate the performance of your model on unseen data.
from sklearn.model_selection import train_test_split
# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))
Output:
Training set size: 120
Testing set size: 30
Example 3: Building a Simple Classifier
Let's build a simple classification model using the k-Nearest Neighbors (k-NN) algorithm. It's a great algorithm for beginners!
from sklearn.neighbors import KNeighborsClassifier
# Initialize the model
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model
knn.fit(X_train, y_train)
# Make predictions on the test set
predictions = knn.predict(X_test)
# Display predictions
print("Predictions:", predictions)
print("Actual Labels:", y_test)
Output:
Predictions: [0 1 2 1 1 0 2 1 1 2 2 1 0 0 2 1 1 1 2 0 0 0 2 2 0 1 2 0 0 1]
Actual Labels: [0 1 2 1 1 0 2 1 1 2 2 1 0 0 2 1 1 1 2 0 0 0 2 2 0 1 2 0 0 1]
Example 4: Evaluating the Model
Now, let's evaluate our model's performance using accuracy, one of the simplest metrics.
from sklearn.metrics import accuracy_score
# Calculate the accuracy
accuracy = accuracy_score(y_test, predictions)
print("Model Accuracy:", accuracy)
Output:
Model Accuracy: 1.0
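Accuracy is just one number. If you want a more detailed picture, sklearn.metrics also provides a confusion matrix and a per-class report; here is a short, optional follow-up using the same predictions:
from sklearn.metrics import confusion_matrix, classification_report
# How many samples of each class were predicted as each class
print(confusion_matrix(y_test, predictions))
# Precision, recall, and F1-score for each flower species
print(classification_report(y_test, predictions, target_names=iris.target_names))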
Machine Learning with Scikit-Learn
Let's explore some more machine learning techniques using Scikit-Learn. We'll look at examples of Regression, Clustering, and Dimensionality Reduction. These are key concepts in machine learning, and Scikit-Learn makes it super easy to implement them. Let's dive in!
Example 1: Linear Regression
Linear Regression predicts a continuous value, such as house prices or temperature. It's one of the simplest and most widely used regression techniques.
Problem Statement:
Let's model the relationship between a person's BMI and their weight, and use it to predict weight from BMI.
from sklearn.linear_model import LinearRegression
import numpy as np
# Sample data: BMI and corresponding weights
X = np.array([[18.5], [24.9], [30.0], [35.0], [40.0]]) # BMI
y = np.array([60, 70, 80, 90, 100]) # Weight in kg
# Initialize and train the model
model = LinearRegression()
model.fit(X, y)
# Predict weight for a BMI of 28.0
predicted_weight = model.predict([[28.0]])
print("Predicted weight for BMI 28.0:", predicted_weight[0], "kg")
Output:
Predicted weight for BMI 28.0: 76.845 kg
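Under the hood, LinearRegression fits a straight line, weight ≈ slope × BMI + intercept. You can peek at the fitted values through the model's standard coef_ and intercept_ attributes:
# Inspect the fitted line: weight ≈ slope * BMI + intercept
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)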
Example 2: K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm used to group similar data points into clusters. It's useful when you want to identify patterns or groupings in your data.
Problem Statement:
Group customers based on their spending habits.
from sklearn.cluster import KMeans
# Sample data: Annual Income and Spending Score
X = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [19, 40], [20, 76]])
# Initialize the model with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
# Predict the cluster for a new customer with income 18 and spending score 50
cluster = kmeans.predict([[18, 50]])
print("Cluster for new customer:", cluster[0])
Output:
Cluster for new customer: 1
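The cluster number itself (0 or 1) is arbitrary; what matters is which customers end up grouped together. To see what the fitted groups look like, you can inspect KMeans' standard cluster_centers_ and labels_ attributes:
# Coordinates of the two cluster centres (income, spending score)
print("Cluster centres:\n", kmeans.cluster_centers_)
# Which cluster each of the six sample customers was assigned to
print("Labels:", kmeans.labels_)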
Example 3: Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique. It's often used to reduce the number of features in a dataset while retaining most of the variance (information).
Problem Statement:
Reduce the dimensionality of the Iris dataset to 2 components.
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
# Initialize PCA with 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Print the reduced feature set
print("Reduced feature set:\n", X_reduced[:5])
Output:
Reduced feature set:
[[-2.68412563 0.31939725]
[-2.71414169 -0.17700123]
[-2.88899057 -0.14494943]
[-2.74534286 -0.31829898]
[-2.72871654 0.32675451]]
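How much information did we keep? PCA's standard explained_variance_ratio_ attribute reports the fraction of variance captured by each component; for the Iris data, the first two components retain roughly 97-98% of it.
# Fraction of the original variance captured by each component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())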
Conclusion
Congrats! You've just built and evaluated your first machine learning model using Scikit-Learn. As you can see, Scikit-Learn makes it easy to get started with machine learning, thanks to its simple and consistent interface.
These examples are just the tip of the iceberg, and the more you practice, the better you'll get. Keep exploring, try out different datasets and algorithms, and most importantly, have fun! Machine learning is a vast and exciting field, and Scikit-Learn is a fantastic tool for exploring it one step at a time.
Happy coding!
NOTE: If you're excited to learn more, don't hesitate to experiment with other algorithms in sklearn. The possibilities are endless!