Building Real-time Object Detection on Live-streams

Spring-0 (spring93)

Posted on November 30, 2024


Artificial Intelligence (AI), or more specifically object detection, is a fascinating topic that opens a gateway to a wide variety of projects and ideas. I recently came across YOLO (You Only Look Once) from Ultralytics, which is a fast, accurate, and very easy-to-implement object detection model. In this post I will walk you through my process of building real-time object detection on live streams.

I have built this to work with RTSP (Real-Time Streaming Protocol) and HLS (HTTP Live Streaming) streams.

What Makes YOLO Special?

For starters, YOLO is extremely fast and excels at real-time object detection because of how it works internally, which is very different from other models such as R-CNN.

Algorithms like Faster R-CNN use a Region Proposal Network to find regions of interest, then perform detection on those regions over multiple stages. YOLO does it all in a single pass, hence the name "You Only Look Once".

In addition, YOLO requires very little training data for fine-tuning because its first 20 convolutional layers come pre-trained on the ImageNet dataset.

Now The Project!

Now that all the terminology is out of the way, I can dive into how I set up real-time object detection using YOLO.

First step is to install the required dependencies:

  • torch (install the CUDA-enabled build if you plan on using your GPU)
  • opencv-python for video processing
  • ultralytics for the YOLO model

Since I have an NVIDIA graphics card, I used CUDA to run the model on my GPU (which is much faster than the CPU).
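The basic install is a single pip command (the CUDA-enabled torch build, if you want it, is installed separately following PyTorch's own instructions):

pip install ultralytics opencv-python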

First, we load the YOLO model. I used YOLOv11 trained on the COCO (Common Objects in Context) dataset.

from ultralytics import YOLO

model = YOLO(r"C:\path\to\your\yolo_model.pt")
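If you want to use your GPU, the Ultralytics API lets you move the model to a device or pass a device with each predict call. A minimal sketch, assuming torch is installed with CUDA support:

import torch

# fall back to CPU when no CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# alternatively, pass it per call: model(frame, device=device)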

Next, we capture the stream using opencv-python, read each frame in a loop, and run that frame through our YOLO model. Very straightforward.

import cv2

STREAM_URL = "rtsp://your-stream-url"  # your RTSP or HLS stream URL

video_cap = cv2.VideoCapture(STREAM_URL)
cv2.namedWindow("Detection Output", cv2.WINDOW_NORMAL)

while True:
    ret, frame = video_cap.read()  # read the next frame from the capture
    if not ret:
        break

    results = model(frame)  # get predictions on the frame from the YOLO model

    cv2.imshow("Detection Output", frame)  # display the frame

    if cv2.waitKey(1) == ord("q"):  # quit on "q" key press
        break

# Don't forget to quit gracefully!
video_cap.release()
cv2.destroyAllWindows()

That's it! This will print the predictions to your terminal, but what if you want to draw the bounding boxes, for example?

In results = model(frame), results is a list of predictions YOLO has made, and each of these predictions carries additional data such as bounding box coordinates, confidence scores, and class labels.

With this you can loop through the results list and draw whatever data you want to display from the predictions onto your frame.

Here is an example where I drew bounding boxes around the predictions:

    for box in results[0].boxes.xywh.tolist():
        center_x, center_y, width, height = box
        x1 = int(center_x - width / 2)  # top left x
        y1 = int(center_y - height / 2)  # top left y
        x2 = int(center_x + width / 2)  # bottom right x
        y2 = int(center_y + height / 2)  # bottom right y

        # rectangle parameters: frame, point1, point2, BGR color, thickness
        cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 0, 0), 2)
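If you also want labels and confidence scores, the boxes object exposes those as well. Here is a rough sketch using the xyxy, conf, and cls attributes together with model.names (it goes inside the while loop, just like the snippet above):

boxes = results[0].boxes
for xyxy, conf, cls in zip(boxes.xyxy.tolist(), boxes.conf.tolist(), boxes.cls.tolist()):
    x1, y1, x2, y2 = map(int, xyxy)  # corner coordinates in pixels
    label = f"{model.names[int(cls)]} {conf:.2f}"  # e.g. "person 0.87"
    cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 0, 0), 2)
    cv2.putText(frame, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)

Ultralytics also provides results[0].plot(), which returns a copy of the frame with boxes and labels already drawn, if you would rather not do it by hand.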

You can find the full code on my GitHub here.

Demo

For this demo, I used yt-dlp to get the direct stream URL from a YouTube livestream like so:

yt-dlp -g https://www.youtube.com/watch?v=VIDEO_ID
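If you want to do this from inside the script instead, one option (a sketch, assuming yt-dlp is installed and on your PATH, with VIDEO_ID as a placeholder) is to call yt-dlp via subprocess and hand the resulting URL to cv2.VideoCapture:

import subprocess
import cv2

youtube_url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder livestream URL

# -g makes yt-dlp print the direct media URL(s) instead of downloading
output = subprocess.run(
    ["yt-dlp", "-g", youtube_url],
    capture_output=True, text=True, check=True,
).stdout

stream_url = output.strip().splitlines()[0]  # take the first URL printed
video_cap = cv2.VideoCapture(stream_url)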

With the following detection classes:

  • person
  • bicycle
  • car
  • motorcycle
  • bus
  • truck
  • cat
  • dog
  • sports ball

I purposely omitted labels and confidence scores to reduce clutter.
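To restrict detections to just those classes, the Ultralytics predict call accepts a classes argument with COCO class IDs. Rather than hardcoding the IDs, here is a small sketch that looks them up from model.names (the wanted set mirrors the list above):

# keep only the classes used in this demo
wanted = {"person", "bicycle", "car", "motorcycle", "bus",
          "truck", "cat", "dog", "sports ball"}
class_ids = [i for i, name in model.names.items() if name in wanted]

# predictions are filtered to these class IDs
results = model(frame, classes=class_ids)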

And that's that. Thanks for reading :)
