Convolutional Neural Networks: How Does AI Understand Images?

Have you ever asked yourself how AI systems can understand images? For example, how can they tell whether the animal in a picture is a cat or a dog?

Recently, I’ve been learning about deep learning through a TensorFlow course by IBM on edX, and I’ve been really amazed by how powerful Convolutional Neural Networks (CNNs) are in recognizing images.

If you're a developer, no matter your background, I think you’ll find CNNs both interesting and very useful.

In this article, I want to guide you through the basics of how CNNs work, breaking down each layer to show you how they can classify images.

What Is a Convolutional Neural Network (CNN)?

A CNN is a type of artificial neural network specifically designed to process and classify visual data.

CNNs are great at capturing patterns like edges, textures, and shapes, which makes them perfect for tasks like identifying objects in pictures or distinguishing between a cat and a dog.

The Building Blocks of a CNN

A CNN is made up of several layers, each serving a different purpose:

1. Convolutional Layer

Every image you see is made up of pixels. A pixel is the smallest part of a digital image, and each one is stored as one or more numbers.

In a grayscale image, each pixel has a value between 0 and 255, representing a single channel of intensity. For a color image (RGB), each pixel has three channels (Red, Green, Blue), with each channel holding a value between 0 and 255.
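If you'd like to see this as data, here is a minimal sketch of how the two kinds of images might look as arrays. I'm using NumPy here purely for illustration, and the pixel values are made up.

```python
import numpy as np

# A made-up 2x2 grayscale image: one intensity value (0-255) per pixel.
gray = np.array([[0, 128],
                 [200, 255]], dtype=np.uint8)

# A made-up 2x2 RGB image: three values (red, green, blue) per pixel.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

print(gray.shape)  # (height, width)
print(rgb.shape)   # (height, width, channels)
```

The grayscale image has one number per pixel, while the RGB image carries an extra axis of length 3 for its channels.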

In this layer, a CNN uses something called a filter or kernel (a small matrix of numbers) that moves over the image, applying a dot product operation to the pixels underneath it to detect specific patterns, as shown in the image below:

Typically, we initialize the kernels with random values, and during the training phase those values are updated step by step until the kernels detect patterns that are useful for the task.
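To make the dot product step concrete, here is a small sketch of a kernel applied at a single position. The patch and the kernel values are hand-picked assumptions for illustration; in a real CNN the kernel would start out random and be learned.

```python
import numpy as np

# A made-up 3x3 patch of pixels: dark on the left, bright on the right.
patch = np.array([[0, 0, 255],
                  [0, 0, 255],
                  [0, 0, 255]], dtype=float)

# A hand-picked kernel that responds to vertical edges (dark -> bright).
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# Multiply matching positions and add everything up: one dot product.
response = np.sum(patch * kernel)
print(response)  # 765.0
```

A large positive response means the patch looks a lot like the pattern the kernel is tuned to. A flat patch, where all pixels are equal, would give 0 with this kernel.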

Why Sliding Matters

Because the same filter slides across the whole image, the CNN can detect a pattern wherever it appears, not just in one spot.

Stacking convolutional layers then builds a spatial hierarchy of patterns: filters in the early layers capture basic features like edges, while filters in deeper layers combine them into more complex patterns, such as shapes or objects.

For example, applying different kernels to the digit 2 can highlight various patterns:

As you can see in the image above, applying those filters to an image highlights its edges and creates a new image that represents the detected features.

In CNN terms, this new image is called a feature map, and the process that produces it is known as convolution.
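The sliding process described above can be sketched in a few lines of NumPy. This is a simplified version under my own assumptions (a single channel, no padding, and a step of one pixel), not how TensorFlow implements it internally. The helper name `convolve2d` and the image values are made up for this example.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, step of 1) and
    return the feature map of dot products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Made-up 5x5 image: three dark columns on the left, two bright on the right.
image = np.array([[0, 0, 0, 9, 9]] * 5, dtype=float)

# Hand-picked kernel that responds to vertical edges.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

feature_map = convolve2d(image, vertical_edge)
print(feature_map)
# [[ 0. 27. 27.]
#  [ 0. 27. 27.]
#  [ 0. 27. 27.]]
```

The feature map is 0 where the kernel sits on a flat region and 27 wherever its window spans the dark-to-bright edge. Note that a 5x5 image and a 3x3 kernel give a 3x3 feature map, because the kernel only fits in three positions along each axis.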

2. ReLU Activation

After the convolution step, where the CNN identifies patterns like edges or textures, we need to decide which patterns are important. This is where ReLU (Rectified Linear Unit) comes in.

What Does ReLU Do?

Think of ReLU as a cleanup step for the feature map. After the convolution layer looks for patterns, some of the values in the feature map are helpful (positive values, where the filter responded to its pattern) and some are not useful (negative values).

The ReLU layer takes the output from the convolution layer and simply changes all the negative values to zero, as shown in the image below:

By doing this, ReLU discards the unhelpful parts and keeps only the meaningful patterns. But ReLU’s impact goes beyond just setting negatives to zero: it introduces non-linearity into the model.
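The zeroing step itself is tiny. Here is a minimal sketch with made-up feature map values:

```python
import numpy as np

def relu(x):
    """Replace every negative value with zero; keep the rest unchanged."""
    return np.maximum(x, 0)

# Made-up feature map values.
feature_map = np.array([[3.0, -1.5],
                        [-0.2, 7.0]])

activated = relu(feature_map)
print(activated)
# [[3. 0.]
#  [0. 7.]]
```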

What Does Non-Linearity Mean?

If you were to only draw straight lines, you’d miss out on capturing the curves and complexities of real-world objects. A linear function sees only straightforward, flat changes, like a gradual slope. Stacking more linear layers does not help either: several linear layers in a row behave exactly like a single linear layer.

For example, picture a shadow or a gradient on an image. A linear function would only see it as a flat, dull line of change.

ReLU, by cutting off negative values, highlights differences more sharply, helping the CNN to “see” and react to those changes in more dynamic ways. In other words, it can start capturing rounded shapes, complex textures, or variations.

So, ReLU helps the CNN "break out" of basic, straight-line thinking, allowing the model to process images in a richer, more detailed way, much like how our brains can see the fine differences between shadows, colors, and textures.
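Here is a small sketch of why the non-linearity matters. The weights `W1` and `W2` and the input `x` are hand-picked assumptions, chosen so the effect is easy to see.

```python
import numpy as np

# Hand-picked weights for two tiny layers and one input vector.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([2.0, 1.0])

# Two linear layers in a row...
linear_stack = W2 @ (W1 @ x)
# ...give the same result as one linear layer with combined weights.
collapsed = (W2 @ W1) @ x

# With ReLU in between, the two layers no longer collapse into one.
with_relu = W2 @ np.maximum(W1 @ x, 0)

print(linear_stack, collapsed, with_relu)  # [0.] [0.] [1.]
```

With these particular weights, the purely linear version outputs 0 for every input, while the ReLU version computes the absolute difference between the two inputs, something no single linear layer can do.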

3. Pooling Layer

Next up is the Pooling layer, which is used to reduce the spatial dimensions (height and width) of the feature maps, as shown in the image above.

This layer helps to make the CNN more computationally efficient: smaller feature maps mean less computation (and fewer parameters) in the layers that follow, while the strongest responses are kept so the model focuses on the most important features.

How does it work?

The most common pooling method is Max-Pooling, which slides a small window (e.g., 2x2) across the feature map and keeps only the highest value in each window, creating a smaller, downsampled version.
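As a rough sketch, Max-Pooling with a 2x2 window could look like this. I'm assuming the window moves two steps at a time so windows don't overlap, which is a common setup. The helper name `max_pool2d` and the values are made up.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: the window moves `size` steps at a time.
    Rows and columns that don't fill a whole window are dropped."""
    out_h = feature_map.shape[0] // size
    out_w = feature_map.shape[1] // size
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size]
            pooled[i, j] = window.max()
    return pooled

# Made-up 4x4 feature map.
feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 1],
                        [0, 2, 9, 5],
                        [1, 1, 3, 7]])

pooled = max_pool2d(feature_map)
print(pooled)
# [[6 2]
#  [2 9]]
```

The 4x4 feature map shrinks to 2x2, so the next layer has a quarter as many values to process.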

4. Fully-Connected Layer

After the convolution and pooling layers have extracted the relevant features from the image, the final step is classification. This is done by the Fully-Connected layer.

How does it actually work?

1. Flattening

First, the high-level features (like shapes and patterns) extracted by the previous layers are flattened into a single, long list of numbers:

Why do we need a flattening layer? So the network can consider all the features together, which helps in making a final decision or classification.
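In code, flattening is just a reshape. Here is a minimal sketch, assuming a made-up stack of two 2x2 feature maps:

```python
import numpy as np

# Made-up pooled output: shape (height, width, number of feature maps).
pooled = np.arange(8).reshape(2, 2, 2)

# Flatten everything into one long list of numbers.
flat = pooled.flatten()

print(pooled.shape)  # (2, 2, 2)
print(flat.shape)    # (8,)
```

No values are changed or lost here. The same 8 numbers are simply laid out in a single row so that every neuron in the next layer can see all of them.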

2. Processing Through Neurons

Once flattened, the data is passed through the Fully-Connected (FC) layers. Each neuron in these layers is connected to every neuron in the previous layer. This means the FC layer can look at all the features at once and understand how they relate to each other.

Each neuron receives inputs from all the neurons in the previous layer. These inputs are multiplied by weights, which are numbers that tell the network how important each feature is. These weights are learned during training to improve the network's accuracy.

The weighted inputs are then summed up and a small value called a bias is added. The bias shifts the neuron's activation threshold, so the neuron can respond more or less easily regardless of how large the weighted sum is.

This flexibility helps the model adjust better to the data.
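The weighted sum plus bias can be written as a single matrix operation. Here is a minimal sketch for a layer with two neurons. The `features`, `weights`, and `biases` values are hand-picked for illustration; in a real network, the weights and biases are learned during training.

```python
import numpy as np

# Made-up flattened features coming from the previous step.
features = np.array([0.5, 2.0, 1.0])

# One row of weights per neuron (2 neurons, 3 inputs each).
weights = np.array([[1.0, 0.5, -1.0],
                    [-2.0, 1.0, 0.0]])

# One bias per neuron.
biases = np.array([0.5, -1.0])

# Each neuron: multiply inputs by weights, sum them up, add the bias.
scores = weights @ features + biases
print(scores)  # [1. 0.]
```

For the first neuron, that is 0.5 * 1.0 + 2.0 * 0.5 + 1.0 * (-1.0) + 0.5 = 1.0. In practice an activation function such as ReLU is usually applied to these sums in the hidden FC layers.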

3. Decision Making with the Softmax Activation Function

In the final step, the last FC layer outputs a set of scores, one for each possible class. The Softmax function is then applied to these scores, converting them into probabilities. The class with the highest probability is selected as the network’s final prediction.
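To see how scores become probabilities, here is a minimal Softmax sketch. The class labels and the scores are made up for this example.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

classes = ["cat", "dog"]       # hypothetical class labels
scores = np.array([2.0, 0.5])  # made-up scores from the last FC layer

probs = softmax(scores)
prediction = classes[int(np.argmax(probs))]

print(probs.round(2))  # [0.82 0.18]
print(prediction)      # cat
```

Softmax exponentiates each score and divides by the total, so higher scores get a larger share of the probability while every class still gets a value between 0 and 1.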

In summary, the Fully-Connected layers integrate all extracted features, perform complex transformations, and make the final classification decision using probabilities.

Conclusion ✨

At its core, a CNN is just a series of numbers and math, but those numbers allow it to "see" and interpret images in incredible ways. I hope you now have a good idea of the deep learning world, especially CNNs.

As a bonus, if you’re looking to dive deeper, feel free to check out my Deep Learning with TensorFlow GitHub repo, which contains Jupyter Notebook labs from the "Deep Learning with TensorFlow" course by IBM on edX, and my Machine Learning with Python repo, featuring labs from IBM's Coursera course.

If you have any questions or just want to chat more about it, feel free to reach out. Thanks for reading 😊.
