How to identify IoT devices with machine learning

In today's connected world, the Internet of Things (IoT) has revolutionised the way we live and work. From optimising transportation and healthcare systems to transforming agriculture and various industries, IoT technology has undoubtedly brought efficiency, convenience, and automation into our lives.

However, as with any technological advancement, the IoT brings new risks alongside its benefits. Malicious actors are quick to exploit vulnerabilities in IoT devices, employing various tactics to compromise security. One such concerning tactic involves infiltrating an organisation's network by adding a malicious IoT device to it.

This is where the critical need for identifying and authenticating connected IoT devices becomes apparent. Ensuring the security of your network hinges on accurately recognising the multitude of devices connected to it.

In this article, we will explore IoT device identification using machine learning techniques. We'll also see the benefits that machine learning brings to the table as a solution for this pivotal task.

Prerequisites

To get the most out of this blog, you should:

Possess a fundamental understanding of machine learning.
Be familiar with Python data science libraries like pandas, NumPy, and scikit-learn.
Be familiar with IoT (Internet of Things) concepts and devices.
Have a basic knowledge of cybersecurity terms.

If you are missing some of these, don't worry: this article introduces the basic concepts you need to follow the tutorial.

What are IoT devices?

IoT (Internet of Things) devices are specialised computing hardware equipped with sensors and the capability to connect to the Internet or other networks. These devices collect and transmit data, often in real-time, to facilitate remote monitoring, control, and data analysis.
Examples of such devices include smartwatches, security cameras, TVs, medical devices, and various others. Unlike smartphones and personal computers, IoT devices are designed for specific functions within the IoT ecosystem; for instance, an IoT device like a security camera, which is designed for surveillance and monitoring purposes, cannot be used to watch movies.

Importance of identifying IoT devices in a network

1. Device type recognition

Device type recognition can be used to prevent susceptible devices from connecting to a network. Knowing the sort of device connected to a network will aid in enforcing security within the network. Below are instances of how device type recognition can help enforce security:

Smart speakers: Smart speakers, such as Amazon Echo and Google Home, can be used to control IoT devices in the home, such as lights, thermostats, and locks. Device type recognition can be used to prevent unauthorised access to these devices. For example, a network administrator could create a rule that only allows smart speakers from a specific manufacturer to connect to the network.

Industrial sensors: Industrial sensors are used to collect data from machines and processes in factories and other industrial settings. Device type recognition can be used to identify and track these sensors, which can help improve efficiency and productivity.

Medical devices: Medical devices, such as pacemakers and insulin pumps, are used to monitor and treat patients. Device type recognition can be used to prevent unauthorised access to these devices, which can help protect patient safety.

Vehicles: Vehicles, such as cars and trucks, are increasingly connected to the internet. Device type recognition can be used to identify and track vehicles, which can help improve traffic safety and security.

Device type recognition is a powerful tool that can be used to improve the security and efficiency of IoT networks. By identifying the type of IoT device, network administrators can create rules and policies that protect these devices from attack and misuse.
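As a minimal sketch of the kind of rule described above, the snippet below checks a recognised device against an allowlist of device types and manufacturers. The policy table, the function name `is_allowed`, and the vendor names are all hypothetical and only illustrate the idea; a real network would enforce such a rule in its access-control tooling.

```python
# Hypothetical policy table: (device type, manufacturer) pairs allowed to join.
ALLOWED_DEVICES = {
    ("smart_speaker", "ExampleCorp"),
    ("thermostat", "ExampleCorp"),
}


def is_allowed(device_type: str, manufacturer: str) -> bool:
    """Return True if the recognised device matches an allowlist entry."""
    return (device_type, manufacturer) in ALLOWED_DEVICES


# A smart speaker from the approved vendor is accepted, another vendor is not.
print(is_allowed("smart_speaker", "ExampleCorp"))
print(is_allowed("smart_speaker", "OtherVendor"))
```

The rule is only as good as the recognition step that feeds it, which is why the rest of this article focuses on identifying the device type reliably.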

2. Malicious IoT device identification

Hackers use malicious IoT devices to launch attacks on organisations because IoT devices are often poorly secured. If a malicious IoT device is connected to a network of IoT devices, it can disrupt the operation of the other devices.
The risks posed by malicious IoT devices are significant. These devices can be used to steal data, disrupt critical infrastructure, or even cause physical harm.
Some of the common threats involving malicious or compromised IoT devices include:

Botnets: A botnet is a network of infected IoT devices that are controlled by an attacker. Botnets can be used to launch DDoS attacks, steal data, or spread malware.

Trojans: A trojan is a type of malware that disguises itself as a legitimate file or programme. When the user opens the file or runs the programme, the trojan is installed on the device and can then steal data or install other malware.

Ransomware: Ransomware is a type of malware that encrypts the victim's files and demands a ransom payment in order to decrypt them. Ransomware can be deployed on IoT devices, such as security cameras or smart locks.

Firmware attacks: Firmware is the software that controls the operation of an IoT device. Firmware attacks can be used to take control of an IoT device and modify its behaviour.

Building an IoT device detector with machine learning

Machine learning is applied in various industries, with cybersecurity being no exception.

Machine learning can be used as a network security solution to identify the types of IoT devices.
Machine learning algorithms can be trained on data from known IoT devices to learn the characteristics of these devices. This data can include the device's MAC address, IP address, communication patterns, and other features. Once the machine learning algorithm has been trained, it can be used to identify new IoT devices that connect to the network.

Machine learning can be used to identify IoT devices in a number of ways, including:

Device fingerprinting: This involves collecting data from the device, such as its MAC address, IP address, and communication patterns, and then using machine learning algorithms to identify the device's manufacturer and model.

Network traffic analysis: This involves analysing the network traffic that the device generates to identify its type. For example, a machine learning algorithm could be trained to identify the network traffic patterns of different types of IoT devices, such as security cameras, smart thermostats, and smart speakers.

Behavioural analysis: This involves analysing the behaviour of the device to identify its type. For example, a machine learning algorithm could be trained to identify the patterns of behaviour that are common for different types of IoT devices, such as devices that are constantly sending data or devices that are only active at certain times of day.
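To make the idea of traffic and behaviour features more concrete, here is a small sketch that turns a packet log into per-device features with pandas. The packet log, its column names (`device`, `timestamp`, `bytes`), and the chosen features are assumptions made for illustration; they are not taken from the Kaggle dataset used later.

```python
import pandas as pd

# Hypothetical packet log; values and column names are made up for illustration.
packets = pd.DataFrame({
    "device": ["cam", "cam", "cam", "sensor", "sensor"],
    "timestamp": [0.0, 0.5, 1.0, 0.0, 60.0],  # seconds
    "bytes": [1200, 1300, 1250, 80, 90],
})

packets = packets.sort_values(["device", "timestamp"])

# Seconds since the same device's previous packet (NaN for its first packet).
packets["gap"] = packets.groupby("device")["timestamp"].diff()

# One row of summary features per device.
features = packets.groupby("device").agg(
    packet_count=("bytes", "count"),
    mean_bytes=("bytes", "mean"),
    mean_gap=("gap", "mean"),
)
print(features)
```

In this toy log the camera sends large packets every half second while the sensor sends small packets once a minute, which is the kind of difference a classifier can learn from.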

In this tutorial, we will follow a step-by-step guide to building a machine learning IoT device detector using datasets generated from IoT device network traffic analysis. Identifiers used in device fingerprinting, such as MAC and IP addresses, can be spoofed fairly easily, whereas a device's traffic patterns are generally harder to fake, making them a more reliable data source for building this solution. Without further ado, let's get started!

Data collection

This step involves gathering data generated from network traffic. You can construct an experimental smart home network comprising various IoT devices to generate network traffic data. Alternatively, for the sake of this tutorial, we will use a dataset available on Kaggle. It consists of IoT device network traffic data generated by other researchers.

Read the data

The dataset consists of two parts, the training data and the test data, both stored in CSV format. Let's read the data with pandas.



import pandas as pd
train_data = pd.read_csv("iot_device_train.csv") # read training data
test_data = pd.read_csv("iot_device_test.csv") # read testing data
train_data.head() # display first five rows of the training data

Explore the data



train_target = train_data["device_category"].unique()
train_target



test_target = test_data["device_category"].unique()
test_target
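Rather than comparing the two arrays by eye, the label sets can be compared with a set difference. The sketch below uses two tiny stand-in frames (`train_demo` and `test_demo`, with made-up rows) so that it runs on its own; with the real data you would use `train_data` and `test_data` instead.

```python
import pandas as pd

# Tiny stand-in frames; the real ones come from the CSV files read earlier.
train_demo = pd.DataFrame({"device_category": ["camera", "water_sensor", "tv"]})
test_demo = pd.DataFrame({"device_category": ["camera", "tv"]})

train_labels = set(train_demo["device_category"])
test_labels = set(test_demo["device_category"])

# Categories that appear in one split but not in the other.
only_in_train = train_labels - test_labels
only_in_test = test_labels - train_labels

print("Only in training data:", only_in_train)
print("Only in test data:", only_in_test)
```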

Observation: water_sensor appears in the training data but not in the test data. There are two ways to handle this:

Remove all rows labelled water_sensor from the training data.
Join the training and test data to make a new DataFrame.

We want the model to detect water sensors as well, since they are another IoT device it may encounter. We therefore combine the training and test data and split the combined data again later, before training.



new_data = pd.concat([train_data,test_data],ignore_index=True) # Combine train and test data

Feature engineering

It is useful to remove features with very low variance, because features whose values are almost constant contribute little to the model's learning process. We can do this with the scikit-learn class VarianceThreshold, here with a threshold of 0.1.



# Import the VarianceThreshold class from scikit-learn
from sklearn.feature_selection import VarianceThreshold 

# Create an instance of VarianceThreshold with a threshold of 0.1
var_thresh = VarianceThreshold(threshold=0.1)

# Drop the target column ("device_category") from your dataset
X = new_data.drop(["device_category"], axis=1)

# Assign the target column ("device_category") to the variable y
y = new_data["device_category"]

# Fit the VarianceThreshold instance to your feature data
var_thresh.fit(X)

# Create a list of columns to remove based on low variance
col_to_remove = [column for column in X.columns 
                 if column not in X.columns[var_thresh.get_support()]]

# Drop the columns with low variance from your feature data
X = X.drop(col_to_remove, axis=1)
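To see what the selector does, here is a toy example with made-up columns. It assumes nothing about the real dataset; it only shows that columns whose variance does not exceed the threshold are dropped while a genuinely varying column is kept.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame with illustrative column names.
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1],                 # variance 0
    "almost_constant": [0.0, 0.0, 0.0, 0.1],  # variance well below 0.1
    "varying": [1.0, 5.0, 9.0, 13.0],         # variance 20
})

selector = VarianceThreshold(threshold=0.1)
selector.fit(toy)

# Population variance (ddof=0) is what the selector compares to the threshold.
print(toy.var(ddof=0))

# Names of the columns that survive the filter.
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Note that variance depends on a feature's scale, so a single threshold treats features measured in different units differently; 0.1 is the value used in this tutorial, not a universal choice.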

Data pre-processing (Standardization)

Standardization rescales each feature so that it has a mean of 0 and a standard deviation of 1. It does not change the shape of a feature's distribution, but it puts all features on a comparable scale, which benefits many machine learning algorithms. Tree-based models such as random forests are largely insensitive to feature scale, so here the step is mainly good practice. We can achieve this by using the scikit-learn class called StandardScaler.



# Import the StandardScaler class from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create an instance of the StandardScaler
scaler = StandardScaler()

# Use the scaler to standardize (normalize) the data
X_scaled = scaler.fit_transform(X)
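Under the hood, each value x is replaced by z = (x - mean) / std, computed per feature. The short check below, on a made-up single-feature array, confirms that StandardScaler matches this formula when the population standard deviation is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up, skewed feature values.
values = np.array([[1.0], [2.0], [3.0], [10.0]])

scaled = StandardScaler().fit_transform(values)

# Manual standardization with the population standard deviation (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

The scaled values keep the same ordering and skew as the originals; only their location and spread change.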

Building the model

The algorithm we will use to build this solution is a random forest classifier. A Random Forest is an ensemble learning technique. It combines multiple decision trees, where each tree is trained independently and makes predictions. The predictions from all individual trees are then aggregated to make a final prediction. This ensemble approach improves accuracy and reduces the risk of overfitting, making Random Forests a powerful tool for this task.



# Import the train_test_split function from scikit-learn
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

# Import the RandomForestClassifier from scikit-learn
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier instance (fixed seed so results are reproducible)
rf = RandomForestClassifier(random_state=0)

# Fit the Random Forest model to the training data
rf.fit(X_train, y_train)
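The aggregation step described above can be checked directly. In scikit-learn, a random forest's class probabilities are the average of the probabilities predicted by its individual trees. The sketch below demonstrates this on synthetic data from `make_classification`, which stands in for the IoT dataset; the names `X_demo`, `y_demo`, and `forest` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data standing in for the IoT traffic features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_average = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# The forest's own probabilities should match the per-tree average.
print(np.allclose(tree_average, forest.predict_proba(X_demo)))
```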

Evaluating the model (Accuracy score)

After building the model, it is important to evaluate the model's performance to see how well it performs on test data and to gain an idea of how effectively the model will detect new, unseen IoT devices.
Let's start by evaluating the model's accuracy.



# Import the accuracy_score function from sklearn.metrics
from sklearn.metrics import accuracy_score

# Use the trained random forest model to make predictions on the test data
y_pred = rf.predict(X_test)

# Calculate the accuracy score by comparing predicted labels (y_pred) with actual labels (y_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy score to evaluate the model's performance
print(f"Accuracy: {accuracy}")

The model reaches an accuracy of about 91% on the held-out test set, meaning it correctly classified roughly 91 out of every 100 test samples. Accuracy on devices the model meets in production may differ.
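A single accuracy figure hides which device categories the model confuses with each other. One way to look closer is a confusion matrix and a per-class report. The sketch below runs on synthetic data (`X_demo`, `y_demo`) so that it is self-contained; with the tutorial's variables you would pass `y_test` and `y_pred` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the IoT dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred)
print(cm)

# Precision, recall, and F1 score for each class.
print(classification_report(y_te, pred))
```

Off-diagonal entries in the matrix show which categories are mistaken for one another, which matters when some device types are rarer or more security-critical than others.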

Evaluating the model (Cross-validation)

A single accuracy score from one train/test split doesn't tell us how well the model will perform in the real world. It is also important to cross-validate the model to get a more robust estimate of its performance on unseen data.
Cross-validation splits the training data into k subsets, or folds, typically 5 to 10. The model is trained on k - 1 folds and tested on the remaining fold, and this is repeated until every fold has served as the test fold once. Comparing the scores across folds helps reveal whether the model is overfitting the training data.
Overfitting occurs when the model learns the training data too well but struggles to generalise to new, unseen data.
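As a minimal illustration of how k-fold splitting works, the sketch below splits ten made-up sample indices into five folds and counts how often each sample lands in a test fold. It uses scikit-learn's `KFold` directly; `cross_val_score` performs an equivalent split internally (stratified by class for classifiers).

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up sample indices standing in for rows of the training data.
samples = np.arange(10)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample is used for testing.
test_counts = np.zeros(len(samples), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")

# Every sample is tested exactly once across the five folds.
print(test_counts)
```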



# Import the necessary library for cross-validation
from sklearn.model_selection import cross_val_score

# Define the number of folds for cross-validation (usually 5-10)
k = 10

# Perform cross-validation on the Random Forest model
scores = cross_val_score(rf, X_train, y_train, cv=k, scoring='accuracy')

# Calculate the mean accuracy score across all folds
mean_accuracy = scores.mean()

# Calculate the standard deviation of accuracy scores
std_accuracy = scores.std()

# Print the accuracy scores for each fold
print(scores)

# Print the mean accuracy score
print(f"Mean Accuracy: {mean_accuracy:.2f}")

# Print the standard deviation of accuracy scores
print(f"Standard Deviation of Accuracy: {std_accuracy:.2f}")

Overall, the model has an average accuracy of approximately 86% across the cross-validation folds, with a relatively low standard deviation of 0.03. This indicates that its performance is consistent across different subsets of the data, which suggests that the model is not badly overfitting to any particular split.
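One refinement worth considering: in this tutorial the scaler is fitted on the full dataset before splitting, so the test rows influence the scaling statistics. A common alternative is to wrap the scaler and the classifier in a scikit-learn pipeline, so that each cross-validation fold fits the scaler on its own training portion only. The sketch below shows this on synthetic data; it is a suggested variation, not the setup that produced the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the IoT dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# The scaler is re-fitted inside every fold, on that fold's training rows only.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(cv_scores)
print(f"Mean: {cv_scores.mean():.2f}, std: {cv_scores.std():.2f}")
```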

Deployment

A machine learning model only delivers value once it is deployed to make inferences in the real world. This IoT device detection model is no different: it must be deployed where it can detect new IoT devices based on their network traffic analysis data.
There are a few different places where organisations can deploy this model to detect new IoT devices in their network:

Network perimeter: This is the boundary between the organisation's network and the outside world. The model can be deployed at the network perimeter to identify new devices that are trying to connect to the network.

Intrusion detection systems (IDS): IDSs are used to monitor network traffic for malicious activity. The model can be integrated with IDSs to identify new IoT devices that are behaving in a suspicious manner.

Security information and event management (SIEM) systems: SIEM systems collect and store security logs from different sources. The model can be integrated with SIEM systems to identify new IoT devices that are generating unusual traffic.
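Whichever integration point is chosen, the trained model first has to be saved and then loaded by the system that will use it. A minimal sketch with joblib follows. The file name `iot_detector.joblib`, the temporary directory, and the synthetic data are assumptions for illustration; how the model is wired into an IDS or SIEM depends on the product and is not shown.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the IoT traffic features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Bundle scaling and classification so the deployed artefact is a single object.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=25, random_state=0),
).fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for this demo.
model_path = os.path.join(tempfile.mkdtemp(), "iot_detector.joblib")
joblib.dump(model, model_path)

# Later, inside the monitoring system: load the model and classify new traffic.
loaded = joblib.load(model_path)
new_traffic = X_demo[:3]  # stands in for features of newly observed devices
print(loaded.predict(new_traffic))
```

Only load model files from sources you trust, since joblib files are pickle-based and can execute code when loaded.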

Final thoughts

IoT device identification using machine learning is a crucial aspect of modern cybersecurity. As the IoT ecosystem continues to grow, so do the potential risks associated with unsecured devices. Machine learning offers a powerful solution to accurately identify and monitor these devices within a network.

Through this tutorial, we've explored the significance of IoT device recognition and how machine learning can be leveraged for this task. We've learned about the importance of data gathering, preprocessing, and feature selection. We've also delved into the application of the Random Forest Classifier, a robust algorithm for this purpose.

Additionally, we discussed the importance of model evaluation through techniques like cross-validation and the limitations of accuracy as a sole performance metric. Finally, we emphasised the necessity of deploying such models in real-world scenarios to bolster network security.

As technology continues to advance and the IoT landscape evolves, staying ahead of potential threats becomes increasingly critical. By implementing machine learning-based IoT device identification, organisations can enhance their cybersecurity posture and protect against malicious actors seeking to exploit vulnerabilities in connected devices.

I hope you found this tutorial informative and valuable. If you have any questions or comments, please don't hesitate to reach out to me on LinkedIn or via email at victorkingoshimua@gmail.com. I look forward to hearing from you and engaging in discussions about machine learning in cybersecurity. Thank you for reading!
