How to create a YOLOv8-based object detection web service using Python, Julia, Node.js, JavaScript, Go and Rust
Andrey Germanov
Posted on May 13, 2023
Table of contents
Introduction
YOLOv8 deployment options
Export YOLOv8 model to ONNX
Explore object detection on image using ONNX
Prepare the input
Run the model
Process the output
Intersection over Union
Non-maximum Suppression
Create a web service on Python
Setup the project
Prepare the input
Run the model
Process the output
Create a web service on Julia
Setup the project
Prepare the input
Run the model
Process the output
Create a web service on Node.js
Setup the project
Prepare the input
Run the model
Process the output
Create a web service on JavaScript
Setup the project
Prepare the input
Run the model and process the output
Create a web service on Go
Setup the project
Prepare the input
Run the model
Process the output
Create a web service on Rust
Setup the project
Prepare the input
Run the model
Process the output
Conclusion
Introduction
This is the second part of my article about the YOLOv8 neural network. In the previous article, I provided a practical introduction to this model and its common API. Then I showed how to create a web service that detects objects in images using Python and the official YOLOv8 library based on PyTorch.
In this article, I am going to show how to work with the YOLOv8 model at a low level, without PyTorch and the official API. This opens a lot of new deployment opportunities. Using the concepts and examples of this post, you will be able to create AI-powered object detection services that use ten times fewer resources, and you will be able to create these services not only in Python, but in most other programming languages. In particular, I will show how to create a YOLOv8-powered web service in Julia, Node.js, JavaScript, Go and Rust.
As a base, we will use the web service developed in the previous article, which is available in this repository. We will just rewrite the backend of this web service in different languages. That is why you should read the first article before continuing with this one.
YOLOv8 deployment options
The YOLOv8 neural network was initially created using the PyTorch framework and is distributed as a set of ".pt" files. We used the Ultralytics API to train these models or make predictions based on them. To run them, an environment with Python and PyTorch is required.
PyTorch is a great framework to design, train and evaluate neural network models. In addition, it has tools to prepare or even generate datasets for training the models, and many other great utilities. However, we do not need all this in production. If we talk about YOLOv8, all that you need in production is to run the model with an input image and receive the resulting bounding boxes. However, YOLOv8 is implemented in Python. Does it mean that all programmers who want to use this great object detector must become Python programmers? Does it mean that they must rewrite their applications in Python or integrate them with Python code? Fortunately not. The Ultralytics API has a great export function to convert any YOLOv8 model to a format that can be used by external applications.
The following formats are supported at the moment:
Format | format Argument
---|---
TorchScript | torchscript
ONNX | onnx
OpenVINO | openvino
TensorRT | engine
CoreML | coreml
TF SavedModel | saved_model
TF GraphDef | pb
TF Lite | tflite
TF Edge TPU | edgetpu
TF.js | tfjs
PaddlePaddle | paddle
For example, CoreML is a neural network format that can be used in iOS applications running on iPhone.
Using the links in this table, you can read an overview of each of these formats.
The most interesting of them for us today is ONNX, which is a lightweight runtime created by Microsoft that can be used to run neural network models on a wide range of platforms and programming languages. It is not a framework; it is just a shared library written in C. It is only 16 MB in size for Linux, but it has interface bindings for most programming languages, including Python, PHP, JavaScript, Node.js, C++, Go and Rust. It has a simple API, and if you have written ONNX code to run a model in one programming language, it will not be difficult to rewrite it for another, as we will see today.
To follow the sections starting from this one, you need to have Python and Jupyter Notebook installed.
Export YOLOv8 model to ONNX
First, let's load the YOLOv8 model and export it to the ONNX format to make it usable. Run a Jupyter notebook and execute the following code in it.
from ultralytics import YOLO
model = YOLO("yolov8m.pt")
model.export(format="onnx")
In the code above, you loaded the medium-sized YOLOv8 model for object detection and exported it to the ONNX format. This model is pretrained on the COCO dataset and can detect 80 object classes.
After running this code, you should see the exported model in a file with the same name and the .onnx extension. In this case, you will see the yolov8m.onnx file in the folder where you ran this code.
Before writing a web service based on ONNX, let's discover how this library works in Jupyter Notebook to understand the main concepts.
Explore object detection on image using ONNX
Now that you have a model, let's use ONNX to work with it. For simplicity, we will start with Python, because we already have a Python web application that uses the PyTorch and Ultralytics APIs, so it will be easier to move it to ONNX.
Install the ONNX runtime library for Python by running the following command in your Jupyter notebook:
!pip install onnxruntime
and import it:
import onnxruntime as ort
We set the ort alias for it. Remember this abbreviation, because in other programming languages you will often see ort instead of "ONNX runtime".
The ort module is the root of the ONNX API. The main object of this API is the InferenceSession, which is used to instantiate a model and run predictions with it. Model instantiation works very similarly to what we did before with Ultralytics:
model = ort.InferenceSession("yolov8m.onnx", providers=['CPUExecutionProvider'])
Here we loaded the model, but from the ".onnx" file instead of ".pt". And now it's ready to run.
And this is the moment when the similarities between Ultralytics and ONNX end. If you remember, with Ultralytics you just ran: outputs = model.predict("image_file")
and received the result. The smart predict method did the following for you automatically:
1. Read the image from file
2. Convert it to the format of the YOLOv8 neural network input layer
3. Pass it through the model
4. Receive the raw model output
5. Parse the raw model output
6. Return structured information about detected objects and their bounding boxes
The ONNX session object has a similar method, run, but it implements only steps 3 and 4. Everything else is up to you, because ONNX does not know that this is the YOLOv8 model. It does not know which input this neural network expects to receive or what the raw output of this model means. This is a universal API for any kind of neural network; it does not know about concrete use cases like object detection on images.
In terms of ONNX, a neural network is a black box that receives a multidimensional array of float numbers as an input and transforms it into another multidimensional array of numbers. It does not know which numbers should be in the input or what the numbers in the output mean. So, what can we do with it?
Fortunately, things are not that bad, and there is something we can investigate. The shapes of the input and output layers of a neural network are fixed: they are defined when the neural network is created, and information about them is stored in the model.
The ONNX session object has a helpful method get_inputs() to get information about the inputs that this model expects to receive, and get_outputs() to get information about the outputs that the model returns after processing the inputs.
Let's get the inputs first:
inputs = model.get_inputs();
len(inputs)
1
Here we got the array of inputs and displayed the length of this array. The result is obvious: the network expects to get a single input. Let's get it:
input = inputs[0]
The input object has three fields: name
, type
and shape
. Let's get these values for our YOLOv8 model:
print("Name:",input.name)
print("Type:",input.type)
print("Shape:",input.shape)
And this is the output that you will get:
Name: images
Type: tensor(float)
Shape: [1, 3, 640, 640]
Here is what we can learn from this:
- The name of the expected input is images, which is obvious: the YOLOv8 model receives images as input.
- The type of the input is a tensor of float numbers. A tensor can have many definitions, but from the practical point of view that matters for us now, it is a multidimensional array of float numbers. So, we can deduce that we need to convert our image to a multidimensional array of float numbers.
- The shape shows the dimensions of this tensor. Here you see that this array should be four-dimensional: a single image (1) that contains 3 matrices of 640x640 float numbers. What numbers should be in these matrices? The color components. As you probably know, each color pixel has Red, Green and Blue components, and each component can have a value from 0 to 255. You can also deduce that the image must be resized to 640x640. Finally, there should be 3 matrices: one 640x640 matrix with the red component of each pixel, one for green and one for blue.
Now you have enough observations to understand what needs to be done in the code to prepare the input data.
Prepare the input
We need to load an image, resize it to 640x640, extract information about the Red, Green and Blue components of each pixel and construct 3 matrices of intensities of the appropriate colors.
Let's do it using the Pillow Python package, which we already used before. Ensure that it's installed:
!pip install pillow
For example, we will use the cat_dog.jpg
image, that we used in the previous article:
Let's load and resize it:
from PIL import Image
img = Image.open("cat_dog.jpg")
img_width, img_height = img.size;
img = img.resize((640,640))
First, you imported the Image class from the Pillow library. Then you created the img object from the cat_dog.jpg file. Then you saved the original size of the image to the img_width and img_height variables, which will be needed later. Finally, you resized it, providing the new size as a (640,640) tuple.
Now we need to extract each color component of each pixel and construct 3 matrices from them. But here we have one thing that can lead to inconsistencies in the future. A pixel can have four color channels: Red, Green, Blue and Alpha. The alpha channel describes the transparency of a pixel. We do not need the Alpha channel in the image for YOLOv8 predictions. Let's remove it:
img = img.convert("RGB");
An image with an Alpha channel has the "RGBA" color model. With this line, you converted it to "RGB" and thus removed the alpha channel.
Now it's time to create the 3 matrices of color channel values. We could do this manually, but Python has great interoperability between libraries. The NumPy library, which is usually used to work with multidimensional arrays, can load a Pillow image object as an array as simply as this:
import numpy as np
input = np.array(img)
Here, you imported NumPy and just loaded the image to the input
NumPy array. Let's see the shape of this array now:
input.shape
(640, 640, 3)
Almost fine, but the dimensions are in the wrong order. We need to put 3 at the beginning. The transpose function can switch the dimensions of a NumPy array:
input = input.transpose(2,0,1)
input.shape
(3,640,640)
The numbering of dimensions starts from 0. So, we had 0=640, 1=640, 2=3. Then, using the transpose function, we moved dimension number 2 to the first place and received the shape (3,640,640).
But we need to add one more dimension to the beginning to make it (1,3,640,640). The reshape function can do this:
input = input.reshape(1,3,640,640)
Now we have the correct input shape, but if you try to look at the contents of this array, for example the red component of the first pixel:
input[0,0,0,0]
you'll probably see an integer:
71
but float numbers are required. Moreover, as a rule, the numbers for machine learning must be scaled, e.g. to a range from 0 to 1. Knowing that a color value can be in the range from 0 to 255, we can scale all pixels to the 0-1 range by dividing them by 255.0. NumPy allows doing this in a single line of code:
input = input/255.0
input[0,0,0,0]
0.2784313725490196
In the code above, you divided all numbers in the array and displayed the first of them: the red color component intensity of the first pixel. So, this is how the input data should look.
Run the model
Now, before running the prediction process, let's see which output the YOLOv8 model should return. As said above, this can be done using the get_outputs() method of the ONNX session object. The result of this method has the same type as the result of get_inputs(), because, as I said before, the only job of a neural network is to transform one array of numbers provided as input into another array of numbers. So, let's see the shape of the output of the pretrained YOLOv8 model:
outputs = model.get_outputs()
output = outputs[0]
print("Name:",output.name)
print("Type:",output.type)
print("Shape:",output.shape)
Name: output0
Type: tensor(float)
Shape: [1, 84, 8400]
ONNX is a universal platform to run neural networks of any kind. That is why it assumes that the network can have many inputs and many outputs, and it accepts an array of inputs and returns an array of outputs, even if these arrays have only a single item. YOLOv8 has a single output, which is the first item of the outputs object.
Here you see that the output has the name output0, it is also a tensor of float numbers, and its shape is [1, 84, 8400], which means that this is a single 84x8400 matrix nested in a single-item array. In practice, it means that the YOLOv8 network returns 8400 bounding boxes, and each bounding box has 84 parameters. It is a little bit inconvenient that each bounding box is a column here, not a row; this is a technical requirement of the neural network algorithm. It would be better to transpose it to 8400x84, so it becomes clear that there are 8400 rows that match detected objects and that each row is a bounding box with 84 parameters.
We will discuss why there are so many parameters for a single bounding box later. First, we should run the model to get the data for this output. We have everything for this now.
To run a prediction with the YOLOv8 model, we need to execute the run
method, which has the following signature:
model.run(output_names,inputs)
- output_names - the array of names of the outputs that you want to receive. For the YOLOv8 model, it will be an array with a single item.
- inputs - the dictionary of inputs that you pass to the network in the format {name: tensor}, where name is the name of an input and tensor is the image data array that we prepared before.
To run the prediction for the data that you prepared, you can run the following:
outputs = model.run(["output0"], {"images":input})
len(outputs)
1
As you saw earlier, the only output of this model has the name output0, and the name of the only input is images. The data tensor for the input is what you prepared in the input variable.
If everything went well, it will display that the length of the received outputs array is 1, which means that you have only a single output. However, if you receive an error saying that the input must be in float format, convert it to float32 using the following line:
input = input.astype(np.float32)
and then run again.
Now we are close to the most interesting part of the work: processing the output.
Process the output
There is only a single output, so we can extract it from outputs:
output = outputs[0]
output.shape
(1, 84, 8400)
So, as you see, it returned the output with the correct shape. As the first dimension has only a single item, we can just extract it:
output = output[0]
output.shape
(84, 8400)
We reduced it to a matrix with 84 rows and 8400 columns. As I said before, it has a transposed form which is not very convenient to work with, so let's transpose it:
output = output.transpose()
output.shape
(8400, 84)
Now it's clearer: 8400 rows with 84 parameters each. 8400 is the maximum number of bounding boxes that the YOLOv8 model can detect, and it returns 8400 rows for any image, regardless of how many objects are really detected on it, because the output of a neural network is fixed and defined during the network design. It can't be variable. So, it returns 8400 rows every time, but most of these rows contain just garbage. How do we detect which of these rows have meaningful data and which of them are garbage? To do that, we need to look at the 84 parameters that each of these rows has.
The first 4 elements are the coordinates of the bounding box, and all the others are the probabilities of all object classes that this model can detect. The pretrained model used in this tutorial can detect 80 object classes, which is why each bounding box has 84 parameters: 4+80. If you use another model that is, for example, trained to detect 3 object classes, then it will have 7 parameters in a row: 4+3.
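In other words, for this 80-class model each output row can be read as follows (a short annotation sketch):
# row layout for the pretrained COCO model (80 classes)
# row[0:4]  -> xc, yc, w, h : the box center, width and height, in 640x640 coordinates
# row[4:84] -> the probabilities of classes 0..79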
For example, let's display row number 0:
row = output[0]
print(row)
[ 5.1182 8.9662 13.247 19.459 2.5034e-06 2.0862e-07 5.6624e-07 1.1921e-07 2.0862e-07 1.1921e-07 1.7881e-07 1.4901e-07 1.1921e-07 2.6822e-07 1.7881e-07 1.1921e-07 1.7881e-07 4.1723e-07 5.6624e-07 2.0862e-07 1.7881e-07 2.3842e-07 3.8743e-07 3.2783e-07 1.4901e-07 8.9407e-08
3.8743e-07 2.9802e-07 2.6822e-07 2.6822e-07 2.3842e-07 2.0862e-07 5.9605e-08 2.0862e-07 1.4901e-07 1.1921e-07 4.7684e-07 2.6822e-07 1.7881e-07 1.1921e-07 8.9407e-08 1.4901e-07 1.7881e-07 2.6822e-07 8.9407e-08 2.6822e-07 3.8743e-07 1.4901e-07 2.0862e-07 4.1723e-07 1.9372e-06 6.5565e-07
2.6822e-07 5.3644e-07 1.2815e-06 3.5763e-07 2.0862e-07 2.3842e-07 4.1723e-07 2.6822e-07 8.3447e-07 8.9407e-08 4.1723e-07 1.4901e-07 3.5763e-07 2.0862e-07 1.1921e-07 5.9605e-08 5.9605e-08 1.1921e-07 1.4901e-07 1.4901e-07 1.7881e-07 5.9605e-08 8.9407e-08 2.3842e-07 1.4901e-07 2.0862e-07
2.9802e-07 1.7881e-07 1.1921e-07 2.3842e-07 1.1921e-07 1.1921e-07]
Here you see that this row represents a bounding box with coordinates [5.1182, 8.9662, 13.247, 19.459]. These values are the coordinates of the center of this bounding box, its width and its height:
x_center = 5.1182
y_center = 8.9662
width = 13.247
height = 19.459
Let's slice out these variables from the row:
xc,yc,w,h = row[:4]
All other values are the probabilities that the detected object belongs to each of the 80 classes. So, given that array numbering starts from 0, item number 4 contains the probability that the object belongs to class 0 (2.5034e-06), item number 5 contains the probability that the object belongs to class 1 (2.0862e-07), and so on.
Now let's remove all the garbage and parse this row into the format that we got in the previous article: [x1,y1,x2,y2,class_label,probability].
To calculate the coordinates of the bounding box corners, you can use the following formulas:
x1 = xc-w/2
y1 = yc-h/2
x2 = xc+w/2
y2 = yc+h/2
but there is a very important reminder: do you remember that we scaled the image to 640x640 at the beginning? It means that these coordinates are returned under the assumption that the image has this size. To get the coordinates of this bounding box for the original image, we need to scale them in proportion to the dimensions of the original image. We saved the original width and height to the img_width and img_height variables, and to scale the corners of the bounding box, we need to modify the formulas:
x1 = (xc - w/2) / 640 * img_width
y1 = (yc - h/2) / 640 * img_height
x2 = (xc + w/2) / 640 * img_width
y2 = (yc + h/2) / 640 * img_height
Then you need to find the object class with the maximum probability. You could do this in a loop, iterating over items 4 to 83 of this array and selecting the index of the item with the maximum probability value, but NumPy has convenient methods for this:
prob = row[4:].max()
class_id = row[4:].argmax()
print(prob, class_id)
2.503395e-06 0
The first line returns the maximum value of the subarray from item 4 to the end of the row. The second line returns the index of the element with this maximum value. So, here you see that the first probability is the maximum one, which means that this bounding box belongs to class 0.
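If you later port this code to a language without a NumPy-like library, the same result can be obtained with a plain loop; a minimal sketch in pure Python:
# find the best class without NumPy: scan the 80 probabilities manually
prob, class_id = 0, 0
for index in range(4, len(row)):
    if row[index] > prob:
        prob = row[index]
        class_id = index - 4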
To replace the class ID with a class label, you need an array of the classes that the model can predict. For this model, these are the 80 classes of the COCO dataset. Here they are:
yolo_classes = [
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
"traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
"sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
"suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
"skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
"bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
"cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
"clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]
If you use a custom-trained model, you can get this array from the YAML file that was used for training. You can read about the YAML files used to train YOLOv8 models in my previous article.
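If you go that route, the class list can also be read from the file programmatically; a minimal sketch, assuming a hypothetical data.yaml with a "names" field and the pyyaml package installed:
import yaml

with open("data.yaml") as f:
    data = yaml.safe_load(f)
# "names" may be a plain list or an {id: label} dictionary, depending on how the file was written
names = data["names"]
yolo_classes = list(names.values()) if isinstance(names, dict) else list(names)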
Then you can just get a class label by ID:
label = yolo_classes[class_id]
This is how you should parse each row of YOLOv8 model output.
However, this probability is too low, because 2.503395e-06 = 2.503395 / 1000000 = 0.000002503. So, this bounding box is probably just garbage that should be filtered out. I recommend filtering out all bounding boxes with a probability less than 0.5.
Let's wrap all the row parsing code above in a function, so we can parse any row this way:
def parse_row(row):
xc,yc,w,h = row[:4]
x1 = (xc-w/2)/640*img_width
y1 = (yc-h/2)/640*img_height
x2 = (xc+w/2)/640*img_width
y2 = (yc+h/2)/640*img_height
prob = row[4:].max()
class_id = row[4:].argmax()
label = yolo_classes[class_id]
return [x1,y1,x2,y2,label,prob]
Now you can write code that parses all rows of the output and filters them:
boxes = [row for row in [parse_row(row) for row in output] if row[5]>0.5]
len(boxes)
20
Here I used Python list comprehensions. The inner comprehension:
[parse_row(row) for row in output]
parses each row and returns an array of parsed rows in the format [x1,y1,x2,y2,label,prob],
and then the outer comprehension keeps only the rows whose probability is greater than 0.5:
[row for row in <parsed_rows> if row[5]>0.5]
where <parsed_rows> stands for the result of the inner comprehension.
After this, len(boxes) shows that only 20 boxes are left after filtering. Much closer to the expected result than 8400, but still too many, because the image contains only one cat and one dog. Curious what else was detected? Let's print this data:
[261.28302669525146, 95.53291285037994, 461.15666942596437, 313.4492515325546, 'dog', 0.9220365]
[261.16701192855834, 95.61400711536407, 460.9202187538147, 314.0579136610031, 'dog', 0.92195505]
[261.0219168663025, 95.50403118133545, 460.9265221595764, 313.81584787368774, 'dog', 0.9269446]
[260.7873046875, 95.70514416694641, 461.4101188659668, 313.7423722743988, 'dog', 0.9269207]
[139.5556526184082, 169.4101345539093, 255.12585411071777, 314.7275745868683, 'cat', 0.8986903]
[139.5316062927246, 169.63674533367157, 255.05698356628417, 314.6878091096878, 'cat', 0.90628827]
[139.68495998382568, 169.5753903388977, 255.12413234710692, 315.06962299346924, 'cat', 0.88975877]
[261.1445414543152, 95.70124578475952, 461.0543995857239, 313.6095304489136, 'dog', 0.926944]
[260.9405124664307, 95.77976751327515, 460.99450263977053, 313.57664155960083, 'dog', 0.9247296]
[260.49400663375854, 95.79500484466553, 461.3895306587219, 313.5762457847595, 'dog', 0.9034922]
[139.59658827781678, 169.2822597026825, 255.2673086643219, 314.9018738269806, 'cat', 0.88215613]
[139.46405625343323, 169.3733571767807, 255.28112654685975, 314.9132820367813, 'cat', 0.8780577]
[139.633131980896, 169.65343713760376, 255.49261894226075, 314.88970375061035, 'cat', 0.8653987]
[261.18754177093507, 95.68838310241699, 461.0297842025757, 313.1688747406006, 'dog', 0.9215225]
[260.8274451255798, 95.74608707427979, 461.32597131729125, 313.3906273841858, 'dog', 0.9093932]
[260.5131794929504, 95.89693665504456, 461.3481791496277, 313.24405217170715, 'dog', 0.8848127]
[139.4986301422119, 169.38371658325195, 255.34583129882813, 314.9019331932068, 'cat', 0.836439]
[139.55282192230223, 169.58951950073242, 255.61378440856933, 314.92880630493164, 'cat', 0.87574947]
[139.65414333343506, 169.62119138240814, 255.79856758117677, 315.1192432641983, 'cat', 0.8512477]
[139.86577434539797, 169.38782274723053, 255.5904968261719, 314.77193105220795, 'cat', 0.8271704]
All these boxes have a high probability, and their coordinates overlap each other. Let's draw these boxes on the image to see why this happens.
The PIL package has the ImageDraw module, which allows drawing rectangles and other figures on top of images. Let's load the image and create a drawing object for it:
from PIL import ImageDraw
img = Image.open("cat_dog.jpg")
draw = ImageDraw.Draw(img)
and draw each bounding box on the image using the created draw
object in a loop:
for box in boxes:
x1,y1,x2,y2,class_id,prob = box
draw.rectangle((x1,y1,x2,y2),None,"#00ff00")
img
This code draws a green rectangle for each bounding box and displays the resulting image, which will look like this:
It draws all these 20 boxes on top of each other, so they look like just 2 boxes. As a human, you can see that all these 20 boxes belong to the same 2 objects. However, the neural network is not a human: it thinks that it found 20 different cats and dogs that overlap each other, because it is theoretically possible for different objects on the image to overlap. Perhaps it sounds crazy, but this is how it works.
It's up to you to select which of these boxes should stay and which should be filtered out. How can you do this? On the one hand, you could select the box with the highest probability for the dog and the box with the highest probability for the cat and remove all the others. However, this solution does not work in all cases, because an image can contain several dogs and several cats at the same time. You need a general-purpose algorithm that removes all boxes that overlap each other too closely. Fortunately, this algorithm already exists, and it's called Non-maximum suppression. These are the steps that you should implement to make it work:
1. Create an empty resulting array that will contain the list of boxes that you want to keep.
2. Start a loop.
3. From the source boxes array, select the box with the highest probability and move it to the resulting array.
4. Compare the selected box with each other box from the source array and remove all of them that overlap the selected one too much.
5. If the source array contains more boxes, move to step 2 and repeat.
After the loop finishes, the source boxes array will be empty, and the resulting array will contain only distinct boxes. Now let's understand how to implement step 4: how to compare two boxes and find out whether they overlap each other too much. For this, we will use another algorithm - "Intersection over Union", or IoU. This algorithm is actually a formula:
IoU = Area of Intersection / Area of Union
The idea of this algorithm is:
- Calculate the area of intersection of two boxes.
- Calculate the area of their union.
- Divide first by second.
The closer the result is to 1, the more the two boxes overlap. You can see this visually: the closer the area of the intersection of two boxes is to the area of their union, the more they look like the same box. In the left example below the formula, the boxes overlap, but not too much; the IoU in this case could be about 0.3. These two boxes can definitely be treated as different objects, even though they overlap. In the second example, it's clear that the area of the intersection is much closer to the area of the union; the IoU here is perhaps about 0.8, and it's highly likely that one of these boxes should be removed. Finally, the boxes in the right example cover almost the same area, and definitely only one of them should stay.
Now let's implement both IoU and Non-Maximum suppression in code.
Intersection over union
1 Calculate the area of intersection
def intersection(box1,box2):
box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
x1 = max(box1_x1,box2_x1)
y1 = max(box1_y1,box2_y1)
x2 = min(box1_x2,box2_x2)
y2 = min(box1_y2,box2_y2)
return (x2-x1)*(y2-y1)
Here, we calculate the area of the intersection rectangle using its width (x2-x1) and height (y2-y1).
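Note that if the two boxes do not overlap at all, x2-x1 or y2-y1 becomes negative, and this simple formula can return a misleading value (two negative factors multiply into a positive number). A more defensive variant, shown here only as a sketch, clamps the overlap to zero:
def intersection_clamped(box1, box2):
    box1_x1, box1_y1, box1_x2, box1_y2 = box1[:4]
    box2_x1, box2_y1, box2_x2, box2_y2 = box2[:4]
    # clamp the width and height of the overlap to zero when the boxes are disjoint
    w = max(0, min(box1_x2, box2_x2) - max(box1_x1, box2_x1))
    h = max(0, min(box1_y2, box2_y2) - max(box1_y1, box2_y1))
    return w * h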
2 Calculate the area of union
def union(box1,box2):
box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
return box1_area + box2_area - intersection(box1,box2)
3 Divide first by second
def iou(box1,box2):
return intersection(box1,box2)/union(box1,box2)
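For a quick sanity check of these three functions, here is a worked example with two hypothetical boxes:
box1 = [0, 0, 10, 10]
box2 = [5, 5, 15, 15]
print(intersection(box1, box2))  # 25: the overlap is the 5x5 square between (5,5) and (10,10)
print(union(box1, box2))         # 175: 100 + 100 - 25
print(iou(box1, box2))           # ~0.14: well below 0.7, so both boxes would be kept by the filter below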
Non-maximum suppression
So, we have an array of boxes in the boxes variable, and we need to keep only distinct items in it, using the iou function we created as the criterion of difference. Let's say that if the IoU of two boxes is less than 0.7, then both of them should stay. Otherwise, the one with the lower probability should be removed. Let's implement it:
boxes.sort(key=lambda x: x[5], reverse=True)
result = []
while len(boxes)>0:
result.append(boxes[0])
boxes = [box for box in boxes if iou(box,boxes[0])<0.7]
For convenience, in the first line, we sorted all boxes by probability in reverse order to move the boxes with the highest probabilities to the top.
Then the code defines the array for the resulting boxes. In a loop, it puts the first box (which is the box with the highest probability) into the resulting array, and on the next line it overwrites the boxes array, keeping only the boxes whose IoU with the selected box is less than 0.7.
It continues doing that in a loop until the boxes array contains no items.
After running it, you can print the result
array:
print(result)
[
[261.0219168663025, 95.50403118133545, 460.9265221595764, 313.81584787368774, 'dog', 0.9269446],
[139.5316062927246, 169.63674533367157, 255.05698356628417, 314.6878091096878, 'cat', 0.90628827]
]
Now it has just 2 items, as it should. Non-maximum suppression did its magic and kept the best boxes for the cat and the dog, those with the highest probabilities.
So, finally, you did it! Can you see how much code you had to write instead of a single model.predict() line in the Ultralytics API? However, now you know how it really works, and awareness of these algorithms makes you independent of the PyTorch environment. Now you can create applications that use YOLOv8 models in any programming language supported by ONNX, and I will show you how to do this.
In the next sections, we will refactor the object detection web service written in the previous article to use ONNX instead of PyTorch. We will rewrite it in Python, Julia, Node.js, JavaScript, Go and Rust.
The first section, with Python, defines the project structure, the functions and their relations; then we will rewrite all these functions in other programming languages without changing the structure of the project.
The Python section is recommended for everyone; then you can move on to the sections related to your chosen language. Using the defined project structure and algorithms, you will be able to write this web service in any other language that supports ONNX.
I assume that you are familiar with the languages that you choose and have all the required IDEs and tools to write, compile and run the code. I will focus only on ONNX and the algorithms described above, and will not teach you how to program in these languages. Furthermore, I will not dive into their standard libraries. However, I will provide links to the API docs of all external packages and frameworks that we will use, and you should either know the APIs of these libraries or be able to learn them from that documentation.
Create a web service on Python
Setup the project
We will use the project created in the previous article as a base. You can get it from this repository.
Create a new folder and copy the following files to it from the project above:
- index.html - frontend
- object_detector.py - backend
- requirements.txt - list of external dependencies
Also copy the ONNX model yolov8m.onnx that you exported at the beginning of the article.
Then, open the requirements.txt file and replace the ultralytics dependency with onnxruntime. Also, add the numpy package to the list; it will be used to convert the image to an array. Finally, the list of dependencies should look like this:
onnxruntime
flask
waitress
pillow
numpy
Ensure that all these packages are installed: you can install them one by one using pip, or, better, install them all at once:
pip install -r requirements.txt
We will not change the frontend, so index.html will stay the same. The only file that we will change is object_detector.py, where we will rewrite the object detection code that previously used the Ultralytics API so that it uses the ONNX runtime.
Let's make a few changes to the structure of this file:
import onnxruntime as ort
from flask import request, Flask, jsonify
from waitress import serve
from PIL import Image
import numpy as np
import json
app = Flask(__name__)
def main():
serve(app, host='0.0.0.0', port=8080)
@app.route("/")
def root():
with open("index.html") as file:
return file.read()
@app.route("/detect", methods=["POST"])
def detect():
buf = request.files["image_file"]
boxes = detect_objects_on_image(buf.stream)
return jsonify(boxes)
def detect_objects_on_image(buf):
model = YOLO("best.pt")
results = model.predict(buf)
result = results[0]
output = []
for box in result.boxes:
x1, y1, x2, y2 = [
round(x) for x in box.xyxy[0].tolist()
]
class_id = box.cls[0].item()
prob = round(box.conf[0].item(), 2)
output.append([
x1, y1, x2, y2, result.names[class_id], prob
])
return output
main()
If you compare this listing with the original object_detector.py, you'll see that I removed the ultralytics package and added the line that imports the ONNX runtime: import onnxruntime as ort. Also, I've imported numpy as np.
Then, I moved the code that runs the web server into the main function and put it at the beginning. Finally, main() is called on the last line.
We will not change the routes inside the main function, so the root and detect functions will remain the same. We will rewrite only detect_objects_on_image to use the ONNX runtime instead of Ultralytics. The implementation will be more complex than before, but you already know everything needed if you followed the previous sections of this article.
We will split the detect_objects_on_image function into three parts:
- Prepare the input
- Run the model
- Process the output
Each phase will go into a separate function, which detect_objects_on_image will call. Replace the content of this function with the following:
def detect_objects_on_image(buf):
input, img_width, img_height = prepare_input(buf)
output = run_model(input)
return process_output(output,img_width,img_height)
def prepare_input(buf):
pass
def run_model(input):
pass
def process_output(output,img_width,img_height):
pass
- In the first line, the prepare_input function receives the uploaded file content, converts it to the input array and returns it. In addition, it returns the original dimensions of the image, img_width and img_height, which will be used later to scale the detected bounding boxes.
- Then, the run_model function receives the input and runs the ONNX session with it. It returns the output, which is an array of shape (1,84,8400).
- Finally, the output is passed to the process_output function along with the original image size (img_width, img_height). This function returns the array of bounding boxes. Each item of this array has the following format: [x1,y1,x2,y2,class_label,prob].
Let's write these functions one by one.
Prepare the input
The prepare_input
function uses the code that you have written in the Prepare the input section. This is how it looks:
def prepare_input(buf):
img = Image.open(buf)
img_width, img_height = img.size
img = img.resize((640, 640))
img = img.convert("RGB")
input = np.array(img)
input = input.transpose(2, 0, 1)
input = input.reshape(1, 3, 640, 640) / 255.0
return input.astype(np.float32), img_width, img_height
- This code loads the image and saves its size to the img_width and img_height variables.
- Then it resizes it, removes the transparency by converting to RGB, and converts it to a tensor of pixels by loading it as an np.array().
- Then it transposes and reshapes the array to convert it from the (640,640,3) shape to the (1,3,640,640) shape and divides all values by 255.0 to scale them and make the array compatible with the ONNX model input format.
- Finally, it returns the input array converted to the Float32 data type along with the original img_width and img_height. It's important to convert to np.float32 here, because by default Python uses double as the type for floating point numbers, but the ONNX runtime model requires Float32.
Run the model
In this function you can reuse the code that we wrote in the Run the model section.
def run_model(input):
model = ort.InferenceSession("yolov8m.onnx", providers=['CPUExecutionProvider'])
outputs = model.run(["output0"], {"images":input})
return outputs[0]
First, you load the model from the yolov8m.onnx file and then use the run method to process the input and return the outputs. Finally, the function returns the first output, which is an array of shape (1,84,8400).
Now, it's time to process and convert this output to the array of bounding boxes.
Process the output
The code to process the output will include the functions from the Process the output section to filter out all overlapping boxes using the "Intersection over Union" algorithm. It will also use the array of YOLO classes to obtain the label for each detected object. You can just copy/paste this code from the appropriate places:
def iou(box1,box2):
return intersection(box1,box2)/union(box1,box2)
def union(box1,box2):
box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
return box1_area + box2_area - intersection(box1,box2)
def intersection(box1,box2):
box1_x1,box1_y1,box1_x2,box1_y2 = box1[:4]
box2_x1,box2_y1,box2_x2,box2_y2 = box2[:4]
x1 = max(box1_x1,box2_x1)
y1 = max(box1_y1,box2_y1)
x2 = min(box1_x2,box2_x2)
y2 = min(box1_y2,box2_y2)
return (x2-x1)*(y2-y1)
yolo_classes = [
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
"traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
"sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
"suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
"skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
"bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
"cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
"clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]
This is the iou function and its dependencies, used to calculate the Intersection over Union coefficient. Also, there is the array of YOLO classes that the model can detect.
Now, having all that, you can implement the process_output
function:
def process_output(output, img_width, img_height):
output = output[0].astype(float)
output = output.transpose()
boxes = []
for row in output:
prob = row[4:].max()
if prob < 0.5:
continue
class_id = row[4:].argmax()
label = yolo_classes[class_id]
xc, yc, w, h = row[:4]
x1 = (xc - w/2) / 640 * img_width
y1 = (yc - h/2) / 640 * img_height
x2 = (xc + w/2) / 640 * img_width
y2 = (yc + h/2) / 640 * img_height
boxes.append([x1, y1, x2, y2, label, prob])
boxes.sort(key=lambda x: x[5], reverse=True)
result = []
while len(boxes) > 0:
result.append(boxes[0])
boxes = [box for box in boxes if iou(box, boxes[0]) < 0.7]
return result
- The first two lines convert the output shape from (1,84,8400) to (8400,84), which is 8400 rows with 84 columns. They also convert the values of the array from np.float32 to the float data type, which is required to serialize the result to JSON later.
- The first loop goes through the rows. For each row, it calculates the probability of this prediction and skips the row if the probability is less than 0.5.
- For rows that passed the probability check, it determines the class_id of the detected object and the text label of this class, using the yolo_classes array.
- Then it calculates the corner coordinates of the bounding box using the coordinates of its center, its width and its height. It also scales them to the original image size using the img_width and img_height parameters.
- Then it appends the calculated bounding box to the boxes array.
- The last part of the function filters the detected boxes using the "Non-maximum suppression" algorithm. It filters out all boxes that overlap the box with the highest probability, using the iou function to determine the overlap criterion value.
- Finally, all boxes that passed the filter are returned as the result array.
That is it for Python implementation.
If everything is implemented without mistakes, you can run this web service this way:
python object_detector.py
then open http://localhost:8080 in a web browser, and it should work exactly the same as the original service implemented using the PyTorch version of the YOLOv8 model.
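Besides the browser, you can also test the /detect endpoint directly; a minimal sketch using the requests package (assuming it is installed and that cat_dog.jpg lies in the current folder):
import requests

with open("cat_dog.jpg", "rb") as f:
    # the field name must match the one used by the frontend and the Flask handler
    response = requests.post("http://localhost:8080/detect", files={"image_file": f})
print(response.json())  # a list of [x1, y1, x2, y2, label, prob] items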
The ONNX runtime is a low-level library, so it requires much more code to make the model work. However, the solution built this way is better for production deployment, because it requires about 10 times less disk space.
You can find the whole project with comments in this GitHub repository.
The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. It does not include any error processing or exception handling. These tasks depend on real use cases, and it's up to you to implement them for your projects.
We used only the small subset of the ONNX runtime Python API required for basic operations. The full reference is available here.
If you followed this guide step by step and implemented this web service in Python, then by now you know the foundational algorithm of how the ONNX runtime works in general and are ready to try implementing it in other languages.
In the sections below, we will implement the same project with the same functions in other programming languages. If you are curious, you can read all the next sections, or move directly to the language that interests you the most.
Create a web service on Julia
Julia is a modern programming language well suited for data science and machine learning. It combines simple syntax with superfast runtime performance. It is sometimes called the future of machine learning and the most natural replacement for Python in this field.
Julia has good libraries for machine learning and deep learning. You can read my articles that introduce these libraries for creating and running classical machine learning models and neural networks.
Furthermore, having a binding to the ONNX runtime library, you can use any machine learning model created in Python, including neural networks created in PyTorch and TensorFlow. YOLOv8 is not an exception, and you can run models exported to the ONNX format in Julia.
Below, we will implement the same object detection project in Julia.
Setup the project
Enter the Julia REPL by running the following command:
julia
In the REPL, switch to pkg
mode by pressing the ]
key and then, enter this command:
generate object_detector
This command will create a folder object_detector
and will generate the new project in it.
Enter the shell mode by pressing the ;
key and move to the project folder by running the following command:
cd object_detector
Return to the pkg
mode by pressing Esc
and then press the ]
key. Then execute this command to activate the project:
activate .
Then you need to install the dependencies that will be used: the ONNX runtime, the Images package and the Genie web framework.
add ONNXRunTime
add Images
add Genie
- ONNXRunTime - this is the Julia binding for the ONNX runtime library.
- Images - this is the Julia Images package, which we will use to read images and convert them to pixel color arrays.
- Genie - this is a web framework for Julia, similar to Flask in Python.
Then you can exit the Julia REPL by pressing Ctrl+D
.
Open the project folder to see what is there:
- src - the folder with Julia source code
- Project.toml - the project properties file
- Manifest.toml - the project package cache file
Also, it already generated the template source code file object_detector.jl
in the src
folder. In this file we will do all the work. However, before we start, copy the index.html
and the yolov8m.onnx
files from the Python project to the root of this project. The frontend will be the same.
After you've done that, open the src/object_detector.jl
, erase all content from it and add the following boilerplate code:
using Images, ONNXRunTime, Genie, Genie.Router, Genie.Requests, Genie.Renderer.Json
function main()
route("/") do
String(read("index.html"))
end
route("/detect", method=POST) do
buf = IOBuffer(filespayload()["image_file"].data)
json(detect_objects_on_image(buf))
end
up(8080, host="0.0.0.0", async=false)
end
function detect_objects_on_image(buf)
input, img_width, img_height = prepare_input(buf)
output = run_model(input)
return process_output(output, img_width,img_height)
end
function prepare_input(buf)
end
function run_model(input)
end
function process_output(output, img_width, img_height)
end
main()
This is a template of the whole application. You can compare this with the Python project and see that it has almost the same structure.
- First, you import the dependencies, including the ONNX runtime, the Genie web framework and the Images library.
- Then, in the main function, you create two endpoints: one for the main index.html page and one, /detect, which receives the image file and passes it to the detect_objects_on_image function. Then you start the web server on port 8080, which serves these two endpoints.
- The detect_objects_on_image function has exactly the same content as the Python one. It prepares the input from the image, passes it through the model, processes the model output and returns the array of bounding boxes.
- Then the processed output is returned to the client as JSON.
In the next sections we will implement prepare_input
, run_model
and process_output
functions one by one.
Prepare the input
function prepare_input(buf)
img = load(buf)
img_height, img_width = size(img)
img = imresize(img,(640,640))
img = RGB.(img)
input = channelview(img)
input = reshape(input,1,3,640,640)
return Float32.(input), img_width, img_height
end
- This code loads the image and saves its size to the img_width and img_height variables.
- Then it resizes it, removes the transparency by converting to RGB, and converts it to a tensor of pixels using the channelview function.
- Then it reshapes the array to the (1,3,640,640) shape that is required by the ONNX model.
- Finally, it returns the input array converted to the Float32 data type along with the original img_width and img_height.
Run the model
function run_model(input)
model = load_inference("yolov8m.onnx")
outputs = model(Dict("images" => input))
return outputs["output0"]
end
This code is almost the same as the corresponding Python code.
First, you load the model from the yolov8m.onnx file and then run this model to process the input and return the outputs. Finally, it returns the first output, which is an array of shape (1,84,8400).
Now, it's time to process and convert this output to the array of bounding boxes.
Process the output
The process_output function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Julia. Add them to your code below the process_output function:
function iou(box1,box2)
return intersection(box1,box2) / union(box1,box2)
end
function union(box1,box2)
box1_x1,box1_y1,box1_x2,box1_y2 = box1[1:4]
box2_x1,box2_y1,box2_x2,box2_y2 = box2[1:4]
box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
return box1_area + box2_area - intersection(box1,box2)
end
function intersection(box1,box2)
box1_x1,box1_y1,box1_x2,box1_y2 = box1[1:4]
box2_x1,box2_y1,box2_x2,box2_y2 = box2[1:4]
x1 = max(box1_x1,box2_x1)
y1 = max(box1_y1,box2_y1)
x2 = min(box1_x2,box2_x2)
y2 = min(box1_y2,box2_y2)
return (x2-x1)*(y2-y1)
end
Also, include the array of YOLOv8 class labels, which will be used to convert class IDs to text labels:
yolo_classes = [
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
"traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
"sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
"suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
"skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
"bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
"cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
"clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]
Now, it's time to write the process_output
function:
function process_output(output, img_width, img_height)
output = output[1,:,:]
output = transpose(output)
boxes = []
for row in eachrow(output)
prob = maximum(row[5:end])
if prob < 0.5
continue
end
class_id = Int(argmax(row[5:end]))
label = yolo_classes[class_id]
xc,yc,w,h = row[1:4]
x1 = (xc-w/2)/640*img_width
y1 = (yc-h/2)/640*img_height
x2 = (xc+w/2)/640*img_width
y2 = (yc+h/2)/640*img_height
push!(boxes,[x1,y1,x2,y2,label,prob])
end
boxes = sort(boxes, by = item -> item[6], rev=true)
result = []
while length(boxes)>0
push!(result,boxes[1])
boxes = filter(box -> iou(box,boxes[1])<0.7,boxes)
end
return result
end
Like the Python version, it consists of three parts.
- The first two lines convert the output array from the (1,84,8400) shape to (8400,84).
- The first loop goes through the rows. For each row, it calculates the probability of this prediction and skips the row if the probability is less than 0.5.
- For rows that passed the probability check, it determines the class_id of the detected object and the text label of this class, using the yolo_classes array.
- Then it calculates the corner coordinates of the bounding box from the coordinates of its center, its width and its height. It also scales them to the original image size using the img_width and img_height parameters.
- Then it appends the calculated bounding box to the boxes array.
- The last part of the function filters the detected boxes using the "Non-maximum suppression" algorithm. It filters out all boxes that overlap the box with the highest probability, using the iou function to determine the overlap criterion value.
- Finally, all boxes that passed the filter are returned as the result array.
That is it for the Julia implementation.
If everything is implemented without mistakes, you can run this web service from the project folder using the following command:
julia src/object_detector.jl
then open http://localhost:8080 in a web browser, and it should work exactly the same as the Python version.
The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. It does not include any error processing or exception handling. These tasks depend on real use cases, and it's up to you to implement them for your projects.
We used only the small subset of the ONNX runtime Julia API required for basic operations. The full reference is available here.
You can find the source code of the Julia project in this repository.
Create a web service on Node.js
Node.js needs no introduction. It is the most used platform for developing server-side JavaScript applications, including backends for web services. Obviously, it would be great to be able to use neural networks in it. Fortunately, the ONNX runtime for Node.js opens the door to all machine learning models trained in PyTorch, TensorFlow and other frameworks. YOLOv8 is not an exception. In this section, I will show how to rewrite our object detection web service in Node.js using the ONNX runtime.
Setup the project
Create a new folder for the project, like object_detector, open it and run:
npm init
to create a new Node.js project. After answering all the questions about the project, install the required dependencies:
npm i --save onnxruntime-node
npm i --save express
npm i --save multer
npm i --save sharp
- onnxruntime-node - The Node.js library for ONNX Runtime
- express - Express.js web framework
- multer - Middleware for Express.js to handle file uploads
- sharp - An image processing library
We are not going to change the frontend, so you can copy the index.html
file from the previous project as is to the folder of this project. Also, copy the model file yolov8m.onnx
.
Create an object_detector.js
file in which you will write the whole backend. Add the following boilerplate code to it:
const ort = require("onnxruntime-node");
const express = require('express');
const multer = require("multer");
const sharp = require("sharp");
const fs = require("fs");
function main() {
const app = express();
const upload = multer();
app.get("/", (req,res) => {
res.end(fs.readFileSync("index.html", "utf8"))
})
app.post('/detect', upload.single('image_file'), async function (req, res) {
const boxes = await detect_objects_on_image(req.file.buffer);
res.json(boxes);
});
app.listen(8080, () => {
console.log('Server is listening on port 8080')
});
}
async function detect_objects_on_image(buf) {
const [input,img_width,img_height] = await prepare_input(buf);
const output = await run_model(input);
return process_output(output,img_width,img_height);
}
async function prepare_input(buf) {
}
async function run_model(input) {
}
async function process_output(output, img_width, img_height) {
}
main()
- In the first block of require lines, you import all the required external modules: ort for the ONNX runtime, express for the web framework, multer to support file uploads in the Express framework, sharp to load the uploaded file as an image and convert it to an array of pixel colors, and fs to read static files.
- In the main function, it creates a new Express web application in the app variable and instantiates the upload middleware for it.
- Then it defines two routes: the root route, which reads and returns the content of the index.html file, and the /detect route, which is used to get the uploaded file, pass it to the detect_objects_on_image function and return the bounding boxes of detected objects to the client.
- The detect_objects_on_image function looks almost the same as in the Python and Julia projects: first it converts the uploaded file to an array of numbers, passes it to the model, processes the output and returns the array of detected objects.
- Then come the function stubs for all these actions.
- Finally, the main() function is called to start the web server on port 8080.
The project is ready, and it's time to implement the prepare_input
, run_model
and process_output
functions one by one.
Prepare the input
We will use the Sharp library to load the image as an array of pixel colors. However, JavaScript does not have packages like NumPy that support multidimensional arrays. All arrays in JavaScript are flat. We can make an "array of arrays", but it's not a true multidimensional array with a shape. For example, we can't make an array with shape (3,640,640), which would mean an array of 3 matrices: the first one for reds, the second one for greens and the third one for blues. Instead, the ONNX runtime for JavaScript requires a flat array with 3*640*640=1228800 elements, in which the reds go at the beginning, the greens go next and the blues go at the end. This is the result that the prepare_input function should return. Now let's do it step by step.
First, let's do the same actions with the image as we did in the other languages:
async function prepare_input(buf) {
const img = sharp(buf);
const md = await img.metadata();
const [img_width,img_height] = [md.width, md.height];
const pixels = await img.removeAlpha()
.resize({width:640,height:640,fit:'fill'})
.raw()
.toBuffer();
- It loads the file as an image using sharp.
- It saves the original image dimensions to img_width and img_height.
- On the next line, it uses a chain of operations to:
  - remove the transparency channel,
  - resize the image to 640x640,
  - return the image as a raw array of pixels to a buffer.
Sharp also can't return a matrix of pixels, because there are no matrices in JavaScript. That is why you now have the pixels array, a single-dimensional array of image pixels. Each pixel consists of 3 numbers: R, G, B. There are no rows and columns; pixels just go one after another. To convert it to the required format, you need to split it into 3 arrays: an array of reds, an array of greens and an array of blues, and then concatenate these 3 arrays into one in which the reds go first, the greens go next and the blues go at the end.
The next image shows what you need to do with the pixels
array and return from the function:
The first step is to create 3 arrays for reds, greens and blues:
const red = [], green = [], blue = [];
Then, traverse the pixels
array and collect numbers to appropriate arrays:
for (let index=0; index<pixels.length; index+=3) {
red.push(pixels[index]/255.0);
green.push(pixels[index+1]/255.0);
blue.push(pixels[index+2]/255.0);
}
This loop jumps from pixel to pixel with step=3. On each iteration, pixels[index] is the red component of the current pixel, pixels[index+1] is the green component and pixels[index+2] is the blue one. As you see, we divide the components by 255.0 to scale them and put them to the appropriate arrays.
The only thing left to do after this is to concatenate these arrays in the correct order and return the result along with img_width and img_height.
Here is a full code of the prepare_input
function:
async function prepare_input(buf) {
const img = sharp(buf);
const md = await img.metadata();
const [img_width,img_height] = [md.width, md.height];
const pixels = await img.removeAlpha()
.resize({width:640,height:640,fit:'fill'})
.raw()
.toBuffer();
const red = [], green = [], blue = [];
for (let index=0; index<pixels.length; index+=3) {
red.push(pixels[index]/255.0);
green.push(pixels[index+1]/255.0);
blue.push(pixels[index+2]/255.0);
}
const input = [...red, ...green, ...blue];
return [input, img_width, img_height];
}
There are probably less resource-consuming ways to convert the pixels array to the required form without temporary arrays (you can try your own options), but I wanted to keep this implementation simple and easy to follow.
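For example, here is a minimal single-pass sketch (untested, and not the code used in the rest of this article) that writes each channel directly into a preallocated Float32Array instead of building temporary arrays; the ordering of the result is the same as with the concatenation above:
const input = new Float32Array(3*640*640);
for (let index=0, pixel=0; index<pixels.length; index+=3, pixel++) {
    input[pixel] = pixels[index]/255.0;                 // reds occupy positions 0..409599
    input[640*640+pixel] = pixels[index+1]/255.0;       // greens occupy positions 409600..819199
    input[2*640*640+pixel] = pixels[index+2]/255.0;     // blues occupy positions 819200..1228799
}
return [input, img_width, img_height];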
Now, let's run this input through the YOLOv8 model using the ONNX runtime.
Run the model
The code of the run_model
function follows:
async function run_model(input) {
const model = await ort.InferenceSession.create("yolov8m.onnx");
input = new ort.Tensor(Float32Array.from(input),[1, 3, 640, 640]);
const outputs = await model.run({images:input});
return outputs["output0"].data;
}
- On the first line, we load the model from the yolov8m.onnx file.
- On the second line, we prepare the input array. The ONNX runtime requires converting it to an internal ort.Tensor object. The constructor of this object requires the flat array of numbers, converted to Float32, and the shape that this array should have, which is [1,3,640,640] as usual.
- On the third line, we run the model with the constructed tensor and receive the outputs.
- Finally, we return the data of the first output. In the JavaScript version, we have to specify the name of this output instead of its index. The name of the YOLOv8 output, as you have seen in the beginning of this article, is output0.
As a result, the function returns an array with the (1,84,8400) shape, which you can think of as an 84x8400 matrix. However, JavaScript does not support matrices, so the output comes back as a single-dimension array: the numbers are ordered as in an 84x8400 matrix, but stored as a flat array of 705600 items. You can't transpose it, and you can't traverse it by rows in a loop, because you have to address each item by its absolute position. But do not worry, in the next section we will learn how to deal with this.
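One more note before moving on: ort.InferenceSession.create loads and initializes the whole model file, so creating the session on every request is wasteful. A possible variant (just a sketch, not part of the original service) creates it once and reuses it for all subsequent requests:
let model = null;
async function run_model(input) {
    // create the session once and reuse it for all subsequent requests
    if (!model) {
        model = await ort.InferenceSession.create("yolov8m.onnx");
    }
    input = new ort.Tensor(Float32Array.from(input),[1, 3, 640, 640]);
    const outputs = await model.run({images:input});
    return outputs["output0"].data;
}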
Process the output
The code of the process_output
function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to JavaScript. Add them to your code below the process_output
function:
function iou(box1,box2) {
return intersection(box1,box2)/union(box1,box2);
}
function union(box1,box2) {
const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
const box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1)
const box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1)
return box1_area + box2_area - intersection(box1,box2)
}
function intersection(box1,box2) {
const [box1_x1,box1_y1,box1_x2,box1_y2] = box1;
const [box2_x1,box2_y1,box2_x2,box2_y2] = box2;
const x1 = Math.max(box1_x1,box2_x1);
const y1 = Math.max(box1_y1,box2_y1);
const x2 = Math.min(box1_x2,box2_x2);
const y2 = Math.min(box1_y2,box2_y2);
return (x2-x1)*(y2-y1)
}
Also, you will need to find the YOLO class label by ID, so add the yolo_classes
array to your code:
const yolo_classes = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat',
'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase',
'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
];
Now let's implement the process_output
function. As mentioned above, the function receives the output as a flat array ordered as an 84x8400 matrix. When working in Python, we had NumPy to transform it to 8400x84 and then traverse it row by row in a loop. Here we can't transform it this way, so we need to traverse it by columns.
boxes=[];
for (index=0;index<8400;index++) {
}
Moreover, you do not have row and column indexes, only absolute indexes. You can only virtually reshape this flat array to an 84x8400 matrix in your head and use this representation to calculate the absolute indexes from those "virtual rows" and "virtual columns".
Let's display how the output
array looks to clarify this:
Here we virtually reshaped the output array with 705600 items to an 84x8400 matrix. It has 8400 columns with indexes from 0 to 8399 and 84 rows with indexes from 0 to 83. The absolute indexes of the items are written inside the cells. Each detected object is represented by a column in this matrix. The first 4 rows of each column, with indexes from 0 to 3, are the coordinates of the bounding box of the appropriate object: x_center, y_center, width and height. The cells in the other 80 rows, from 4 to 83, contain the probabilities that the object belongs to each of the 80 YOLO classes.
I drew this table to understand how to calculate the absolute index of any item in it, knowing its row and column indexes. For example, how do you calculate the index of the first greyed item, which stands in row 2 and column 2 and is the bounding box width of the third detected object? If you think about this a little, you will find that you need to multiply the row index by the length of the row (8400) and add the column index. Let's check it: 8400*2+2=16802. Now, let's calculate the index of the item below it, which is the height of the same object: 8400*3+2=25202. Bingo! Matched again! Finally, let's check the bottom gray box, which is the probability that the object in column 8398 belongs to class 79 (toothbrush): 8400*83+8398=705598. Great, so you have a formula to calculate the absolute index: 8400*row_index+column_index.
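To make this formula concrete, here is a tiny hypothetical helper (not used in the final code below):
// read the item at a "virtual" (row, column) position of the 84x8400 matrix
const at = (row, col) => output[8400*row + col];
// for example, at(2, 2) reads output[16802], the width of the third detected object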
Let's return to our empty loop. Assuming that the index loop counter is the index of the current column and that the coordinates of the bounding box are located in rows 0-3 of that column, we can extract them this way:
boxes=[];
for (index=0;index<8400;index++) {
const xc = output[8400*0+index];
const yc = output[8400*1+index];
const w = output[8400*2+index];
const h = output[8400*3+index];
}
Then you can calculate the corners of the bounding box and scale them to the size of the original image:
const x1 = (xc-w/2)/640*img_width;
const y1 = (yc-h/2)/640*img_height;
const x2 = (xc+w/2)/640*img_width;
const y2 = (yc+h/2)/640*img_height;
Now, similarly, you need to get the probabilities of the object, which go in rows from 4 to 83, find the biggest of them and its index, and save these values to the prob and class_id variables. You can write a nested loop that traverses rows from 4 to 83 and saves the highest value and its index:
let class_id = 0, prob = 0;
for (let col=4;col<84;col++) {
if (output[8400*col+index]>prob) {
prob = output[8400*col+index];
class_id = col - 4;
}
}
It works fine, but I'd better rewrite this in a functional way:
const [class_id,prob] = [...Array(80).keys()]
.map(col => [col, output[8400*(col+4)+index]])
.reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
- The first line, [...Array(80).keys()], generates a range array with numbers from 0 to 79.
- Then the map function constructs an array of [class_id, probability] pairs, one for each class_id.
- The reduce function reduces this array to the single item that contains the maximum probability and its class id.
- This item is finally returned and destructured to the class_id and prob variables.
Then, having the maximum probability and class_id, you can either skip this object if the probability is less than 0.5, or find the label of its class.
Here is the final code that processes and collects the bounding boxes to the boxes array:
let boxes = [];
for (let index=0;index<8400;index++) {
const [class_id,prob] = [...Array(80).keys()]
.map(col => [col, output[8400*(col+4)+index]])
.reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
if (prob < 0.5) {
continue;
}
const label = yolo_classes[class_id];
const xc = output[index];
const yc = output[8400+index];
const w = output[2*8400+index];
const h = output[3*8400+index];
const x1 = (xc-w/2)/640*img_width;
const y1 = (yc-h/2)/640*img_height;
const x2 = (xc+w/2)/640*img_width;
const y2 = (yc+h/2)/640*img_height;
boxes.push([x1,y1,x2,y2,label,prob]);
}
The last step is to filter the boxes
array using "Non-maximum suppression", to exclude all overlapping boxes from it. This code is close to the Python implementation:
boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
const result = [];
while (boxes.length>0) {
result.push(boxes[0]);
boxes = boxes.filter(box => iou(boxes[0],box)<0.7);
}
- We sort the boxes by probability in reverse order to put the boxes with the highest probability to the top
- In a loop, we put the box with the highest probability to
result
- Then we filter out all boxes that overlap the selected box too much (all boxes that have IoU>0.7 with this box)
That's all! For convenience, here is a full code of the process_output
function:
function process_output(output, img_width, img_height) {
let boxes = [];
for (let index=0;index<8400;index++) {
const [class_id,prob] = [...Array(80).keys()]
.map(col => [col, output[8400*(col+4)+index]])
.reduce((accum, item) => item[1]>accum[1] ? item : accum,[0,0]);
if (prob < 0.5) {
continue;
}
const label = yolo_classes[class_id];
const xc = output[index];
const yc = output[8400+index];
const w = output[2*8400+index];
const h = output[3*8400+index];
const x1 = (xc-w/2)/640*img_width;
const y1 = (yc-h/2)/640*img_height;
const x2 = (xc+w/2)/640*img_width;
const y2 = (yc+h/2)/640*img_height;
boxes.push([x1,y1,x2,y2,label,prob]);
}
boxes = boxes.sort((box1,box2) => box2[5]-box1[5])
const result = [];
while (boxes.length>0) {
result.push(boxes[0]);
boxes = boxes.filter(box => iou(boxes[0],box)<0.7);
}
return result;
}
If you would like to work with this output in a more convenient "Pythonic" way, there is the NumJS library that emulates NumPy in JavaScript. You can use it to physically reshape the output to 84x8400, then transpose it to 8400x84 and then traverse the detected objects by row.
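If you go that route, a rough sketch could look like this (it assumes the numjs package with its nj.array, reshape, transpose and tolist helpers; I have not used it in this project, so treat it only as a starting point):
const nj = require("numjs");
// reshape the flat 705600-item output to 84x8400, then transpose it to 8400x84
const matrix = nj.array(Array.from(output)).reshape(84, 8400).transpose();
// each row now describes one detected object: [xc, yc, w, h, prob_class_0, ..., prob_class_79]
const rows = matrix.tolist();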
However, the option described in this section, working with the single-dimension array as if it were a matrix, is the most efficient, because we get all the values we need without additional array transformations. I think that installing an additional external dependency is overkill for this case.
That is it for Node.js implementation. If you wrote everything correctly, then you can start this web service by running the following command:
node object_detector.js
and open http://localhost:8080
in a web browser.
The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run the YOLOv8 models using the ONNX runtime. It does not include any error processing and exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.
We used only a small subset of the ONNX runtime JavaScript API required for basic operations. The full reference is available here.
You can find a source code of Node.js object detector web service in this repository.
Create a web service on JavaScript
Could you ever imagine that you can write all the code for an object detector right in the HTML page? Using the ONNX library for JavaScript, you can process the image right in the frontend, without sending it to any server. Furthermore, you can reuse most of the code that we wrote for Node.js, because the underlying ONNX runtime API is the same.
Setup the project
You can reuse the frontend from Node.js project. Create a new folder and copy the index.html
and yolov8m.onnx
files to it.
Then, open the index.html
and add the JavaScript library for ONNX runtime to the head section of the HTML:
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
This library exposes the ort global variable, which is the root of the ONNX runtime API. You can use it to instantiate and run models the same way as we used the ort variable in the Node.js project.
Perhaps by the moment you read this, the URL of the library will have changed, so check the official documentation for installation instructions.
This is the index.html file that you should have in the beginning:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>YOLOv8 Object Detection</title>
<style>
canvas {
display:block;
border: 1px solid black;
margin-top:10px;
}
</style>
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
</head>
<body>
<input id="uploadInput" type="file"/>
<canvas></canvas>
<script>
const input = document.getElementById("uploadInput");
input.addEventListener("change",async(event) => {
const data = new FormData();
data.append("image_file",event.target.files[0],"image_file");
const response = await fetch("/detect",{
method:"post",
body:data
});
const boxes = await response.json();
draw_image_and_boxes(event.target.files[0],boxes);
})
function draw_image_and_boxes(file,boxes) {
const img = new Image()
img.src = URL.createObjectURL(file);
img.onload = () => {
const canvas = document.querySelector("canvas");
canvas.width = img.width;
canvas.height = img.height;
const ctx = canvas.getContext("2d");
ctx.drawImage(img,0,0);
ctx.strokeStyle = "#00FF00";
ctx.lineWidth = 3;
ctx.font = "18px serif";
boxes.forEach(([x1,y1,x2,y2,label]) => {
ctx.strokeRect(x1,y1,x2-x1,y2-y1);
ctx.fillStyle = "#00ff00";
const width = ctx.measureText(label).width;
ctx.fillRect(x1,y1,width+10,25);
ctx.fillStyle = "#000000";
ctx.fillText(label, x1, y1+18);
});
}
}
</script>
</body>
</html>
To run the ONNX runtime in a browser, you need to serve the content of this folder from a web server. For example, you can use the web server embedded in VS Code to open the index.html page.
When it works, let's load the image and prepare an input array from it.
Prepare the input
The user loads the image by using the file upload field to select an image file. This process is implemented in the change event listener:
input.addEventListener("change",async(event) => {
const data = new FormData();
data.append("image_file",event.target.files[0],"image_file");
const response = await fetch("/detect",{
method:"post",
body:data
});
const boxes = await response.json();
draw_image_and_boxes(event.target.files[0],boxes);
})
In this code, you used fetch to post the file from the event.target.files[0] variable to the backend. Then the backend returns the array of bounding boxes, which is decoded to the boxes array.
However, in this version, we will not have a backend to upload the image to. We will write all the code here, in the index.html file, including detect_objects_on_image and all other functions. So you need to remove this fetch call and just pass the file to the detect_objects_on_image function:
input.addEventListener("change",async(event) => {
const boxes = await detect_objects_on_image(event.target.files[0]);
draw_image_and_boxes(event.target.files[0],boxes);
})
Then, define the detect_objects_on_image
function, which is the same as in Node.js example:
async function detect_objects_on_image(buf) {
const [input,img_width,img_height] = await prepare_input(buf);
const output = await run_model(input);
return process_output(output,img_width,img_height);
}
The only difference here is that buf is a File object that the user selected in the upload field. You need to load this file as an image in the browser and convert it to an array of pixels. The most common way to load an image in HTML and JavaScript is using the HTML5 canvas object. It exposes the image as a flat array of pixel colors, almost the same way as the Sharp library did in the Node.js version. We will do this work in the prepare_input function:
async function prepare_input(buf) {
const img = new Image();
img.src = URL.createObjectURL(buf);
img.onload = () => {
const [img_width,img_height] = [img.width, img.height]
const canvas = document.createElement("canvas");
canvas.width = 640;
canvas.height = 640;
const context = canvas.getContext("2d");
context.drawImage(img,0,0,640,640);
const imgData = context.getImageData(0,0,640,640);
const pixels = imgData.data;
}
}
- The HTML5 Canvas element can draw HTML images; that is why we need to load the file into an Image() object first.
- Then, before drawing it on the canvas, we need to ensure that the image is loaded. That is why all the following code is written in the onload() event handler of the image object, which executes only after the image is loaded.
- We save the original image size to img_width and img_height.
- Then we create a canvas object and set its size to 640x640, because this is the size required by the YOLOv8 model.
- Then we get the HTML5 drawing context of the created canvas to draw the image on it. The drawImage method allows drawing and resizing at the same time, which is why we set the size of the image on the canvas to 640x640.
- Then getImageData() is used to get the ImageData object with the image pixels.
- The only property of the ImageData object that we need is data, which contains the array of pixels.
Now you have the pixels array, a one-dimensional array of image pixels. Each pixel consists of 4 numbers that define the color components: R, G, B, A, where R=red, G=green, B=blue and A=transparency (alpha channel). There are no rows and columns in this array; pixels just go one after another. To convert it to the required format, you need to split it into 3 arrays: an array of reds, an array of greens and an array of blues, and then concatenate these 3 arrays into one in which the reds go first, the greens go next and the blues go at the end.
The next image shows what you need to do with the pixels
array and return from the function:
The first step is to create 3 arrays for reds, greens and blues:
const red = [], green = [], blue = [];
Then, traverse the pixels
array and collect numbers to appropriate arrays:
for (let index=0; index<pixels.length; index+=4) {
red.push(pixels[index]/255.0);
green.push(pixels[index+1]/255.0);
blue.push(pixels[index+2]/255.0);
}
This loop jumps from pixel to pixel with step=4. On each iteration, pixels[index] is the red component of the current pixel, pixels[index+1] is the green component and pixels[index+2] is the blue one. The fourth component of the color is skipped in this loop. As you see, we divide the components by 255.0 to scale them and put them to the appropriate arrays.
The only thing left to do after this is to concatenate these arrays in the correct order and return the result along with img_width and img_height. But we can't simply add a return to the prepare_input function here, because all this code lives inside an internal function, the onload event handler, and by writing return we would only return from this handler, not from the prepare_input function.
To handle this issue, we wrap the code of the prepare_input function in a Promise and return it. Then, inside the event handler, we call resolve([input, img_width, img_height]) to resolve that promise with the results that should be returned.
Here is a full code of the prepare_input
function:
async function prepare_input(buf) {
return new Promise(resolve => {
const img = new Image();
img.src = URL.createObjectURL(buf);
img.onload = () => {
const [img_width,img_height] = [img.width, img.height]
const canvas = document.createElement("canvas");
canvas.width = 640;
canvas.height = 640;
const context = canvas.getContext("2d");
context.drawImage(img,0,0,640,640);
const imgData = context.getImageData(0,0,640,640);
const pixels = imgData.data;
const red = [], green = [], blue = [];
for (let index=0; index<pixels.length; index+=4) {
red.push(pixels[index]/255.0);
green.push(pixels[index+1]/255.0);
blue.push(pixels[index+2]/255.0);
}
const input = [...red, ...green, ...blue];
resolve([input, img_width, img_height])
}
})
}
Run the model and process the output
This prepare_input function returns the input in exactly the same format as in the Node.js version. That is why all the other code, including the run_model, process_output, iou, intersection and union functions, can be copy/pasted as is from the Node.js project. Keep in mind that in the browser the model file is loaded by URL, so yolov8m.onnx must be reachable next to index.html, which is why we copied it to this folder.
After it's done, the JavaScript web service is finished!
Now you can use any web server to serve the index.html file and try this wonderful feature: running neural network models right in the web browser frontend.
The code that we developed here is oversimplified. It is intended only to demonstrate how to load and run the YOLOv8 models using the ONNX runtime. It does not include any error processing and exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.
We used only a small subset of the ONNX runtime JavaScript API required for basic operations. The full reference is available here.
You can find a source code of JavaScript object detector web service in this repository.
Create a web service on Go
Go is the first statically typed and compiled programming language in our journey. From my point of view, the greatest thing about Go is how you deploy apps written in it. You can compile all your code and its dependencies to a single binary executable, then just copy this file to a production server and run it. That is the whole deployment process in Go. You do not need to install any third-party dependencies to run Go programs, which is why Go applications are usually compact and convenient to update. Also, Go is faster than Python and JavaScript. It would definitely be great to have an opportunity to deploy neural networks this way. Fortunately, several ONNX runtime bindings exist that will help us achieve this goal.
Setup the project
Create a new folder, enter it and run:
go mod init object_detector
This command will initialize the object_detector
project in the current folder.
Install required external modules:
go get github.com/yalue/onnxruntime_go
go get github.com/nfnt/resize
- github.com/yalue/onnxruntime_go - ONNX runtime library bindings for Golang
- github.com/nfnt/resize - a library to resize images. (Perhaps you can find a more modern library, but I used this one because it works properly.)
Another thing for which I respect Go is that all the other modules we need, including the web framework and image processing functions, already exist in the standard library.
The ONNX module for Go provides the API, but does not contain the Microsoft ONNX runtime library itself. Instead, it has a function to specify the path where this library is located. Here you have two options: install the Microsoft ONNX runtime library to a well-known system path, or download the version for your operating system and put it in the project folder. For this project, I will go the second way, to make the project autonomous and independent of the operating system setup.
Go to the Releases page: https://github.com/microsoft/onnxruntime/releases and download the archive for your operating system. After it's done, extract the files from the archive and copy all files from the lib
subfolder to the project.
We are not going to change the frontend, that is why, just copy the index.html
file from one of the previous projects to current folder. Also, copy the yolov8m.onnx
model file.
By convention, the main file of a Go project should be named main.go. So, create this file and put the following boilerplate code in it:
package main
import (
"encoding/json"
"github.com/nfnt/resize"
ort "github.com/yalue/onnxruntime_go"
"image"
_ "image/gif"
_ "image/jpeg"
_ "image/png"
"io"
"math"
"net/http"
"os"
"sort"
)
func main() {
server := http.Server{
Addr: "0.0.0.0:8080",
}
http.HandleFunc("/", index)
http.HandleFunc("/detect", detect)
server.ListenAndServe()
}
func index(w http.ResponseWriter, _ *http.Request) {
file, _ := os.Open("index.html")
buf, _ := io.ReadAll(file)
w.Write(buf)
}
func detect(w http.ResponseWriter, r *http.Request) {
r.ParseMultipartForm(0)
file, _, _ := r.FormFile("image_file")
boxes := detect_objects_on_image(file)
buf, _ := json.Marshal(&boxes)
w.Write(buf)
}
func detect_objects_on_image(buf io.Reader) [][]interface{} {
input, img_width, img_height := prepare_input(buf)
output := run_model(input)
return process_output(output, img_width, img_height)
}
func prepare_input(buf io.Reader) ([]float32, int64, int64) {
}
func run_model(input []float32) []float32 {
}
func process_output(output []float32, img_width, img_height int64) [][]interface{} {
}
First, we import the required packages. Most of them come from the Go standard library:
- encoding/json - to encode the bounding boxes to JSON before sending the response
- github.com/nfnt/resize - to resize the image to 640x640
- ort "github.com/yalue/onnxruntime_go" - the ONNX runtime library, imported as ort
- image, image/gif, image/jpeg, image/png - the image library and the libraries that add support for images of different formats
- io - to read data from local files
- math - for the Max and Min functions
- net/http - to create and run a web server
- os - to open local files
- sort - to sort bounding boxes
Then, the main function defines two HTTP endpoints, index and detect, that are handled by the appropriate functions, and starts the web server that serves them on port 8080.
The index endpoint just returns the content of the index.html file.
The detect endpoint receives the uploaded image file and sends it to the detect_objects_on_image function, which passes it through the YOLOv8 model. Then it receives the array of bounding boxes, encodes it to JSON and returns this JSON to the frontend.
The detect_objects_on_image function is the same as in the previous projects. The only difference is the type of value that it returns, which is [][]interface{}. The detect_objects_on_image function should return an array of bounding boxes. Each bounding box is an array of 6 items (x1, y1, x2, y2, label, probability), and these items have different types. However, Go, as a strongly typed programming language, does not allow arrays with items of different types. But it has a special type, interface{}, which can hold a value of any type. It is a common trick in Go to define a variable using the interface{} type if it can have values of different types. That is why, to have an array of items of different types, you need to create an array of interfaces: []interface{}. Consequently, a bounding box is an array of interfaces, and the array of bounding boxes is an array of interface arrays: [][]interface{}.
Then there are stubs of prepare_input
, run_model
and process_output
functions defined. In the next sections, we will implement them one by one.
Prepare the input
To prepare the input for the YOLOv8 model, you need to load the image, resize it and convert it to a tensor of the (3,640,640) shape, where the first item is an array of the red components of image pixels, the second item is an array of greens and the last item is an array of blues. Furthermore, the ONNX library for Go requires you to provide this tensor as a flat array, i.e. to concatenate these three arrays one after another, as displayed in the next image.
So, let's load and resize the image first:
func prepare_input(buf io.Reader) ([]float32, int64, int64) {
img, _, _ := image.Decode(buf)
size := img.Bounds().Size()
img_width, img_height := int64(size.X), int64(size.Y)
img = resize.Resize(640, 640, img, resize.Lanczos3)
This code:
- loads the image,
- saves the size of the original image to the img_width and img_height variables,
- resizes it to 640x640 pixels.
Then you need to collect the pixel colors into separate arrays, which you should define first:
red := []float32{}
green := []float32{}
blue := []float32{}
Then you need to extract the pixels and their colors from the image. To do that, the img object has an .At(x,y) method that returns the color of the pixel at the specified point of the image. The color object returned by this method has an .RGBA() method that returns the four color components R, G, B, A as 16-bit numbers in the 0-65535 range. You need only R, G and B: dividing each of them by 257 brings it back to the usual 0-255 range (257*255=65535), and dividing by 255 then scales it to 0-1, which is exactly what the code below does.
Now, you have everything to traverse the image and collect pixel colors to created arrays:
for y := 0; y < 640; y++ {
for x := 0; x < 640; x++ {
r, g, b, _ := img.At(x, y).RGBA()
red = append(red, float32(r/257)/255.0)
green = append(green, float32(g/257)/255.0)
blue = append(blue, float32(b/257)/255.0)
}
}
- This code traverses all rows and columns of the image.
- It extracts the color components of each pixel and destructures them into the r, g and b variables.
- Then it scales these components and appends them to the appropriate arrays.
Finally, you need to concatenate these arrays to a single one in correct order:
input := append(red, green...)
input = append(input, blue...)
So, the input variable contains the input required for the ONNX runtime. Here is the full code of this function, which returns the input and the size of the original image that will be used later when processing the output from the model.
func prepare_input(buf io.Reader) ([]float32, int64, int64) {
img, _, _ := image.Decode(buf)
size := img.Bounds().Size()
img_width, img_height := int64(size.X), int64(size.Y)
img = resize.Resize(640, 640, img, resize.Lanczos3)
red := []float32{}
green := []float32{}
blue := []float32{}
for y := 0; y < 640; y++ {
for x := 0; x < 640; x++ {
r, g, b, _ := img.At(x, y).RGBA()
red = append(red, float32(r/257)/255.0)
green = append(green, float32(g/257)/255.0)
blue = append(blue, float32(b/257)/255.0)
}
}
input := append(red, green...)
input = append(input, blue...)
return input, img_width, img_height
}
Now, let's run it through the model.
Run the model
The run_model function does the same as in the Python example, but it is quite wordy because of Go language specifics:
func run_model(input []float32) []float32 {
ort.SetSharedLibraryPath("./libonnxruntime.so")
_ = ort.InitializeEnvironment()
inputShape := ort.NewShape(1, 3, 640, 640)
inputTensor, _ := ort.NewTensor(inputShape, input)
outputShape := ort.NewShape(1, 84, 8400)
outputTensor, _ := ort.NewEmptyTensor[float32](outputShape)
model, _ := ort.NewSession[float32]("./yolov8m.onnx",
[]string{"images"}, []string{"output0"},
[]*ort.Tensor[float32]{inputTensor},[]*ort.Tensor[float32]{outputTensor})
_ = model.Run()
return outputTensor.GetData()
}
- As written in the setup section, the Go ONNX library needs to know where the ONNX runtime library is located. You need to use ort.SetSharedLibraryPath() to specify the location of the main file of the ONNX runtime library and then initialize the environment with it. If you downloaded it manually, as suggested earlier, then just specify the name of the file. For Linux the file name is libonnxruntime.so, for macOS - libonnxruntime.dylib, for Windows - onnxruntime.dll. I work on Linux, so in this example I use the Linux library.
- Then, the library requires converting the input to its internal tensor format with the (1,3,640,640) shape.
- Then, the library also requires creating an empty structure for the output tensor and specifying its shape. The Go ONNX library does not return the output; it writes it to a variable that is defined in advance. Here we defined the outputTensor variable as a tensor with the (1,84,8400) shape that will be used to receive the data from the model.
- Then we create a model using the NewSession function, which receives both the arrays of input and output names and the arrays of input and output tensors.
- Then we run this model, which processes the input and writes the output to the outputTensor variable.
- The outputTensor.GetData() method returns the output data as a flat array of float numbers.
As a result, the function returns an array with the (1,84,8400) shape, which you can think of as an 84x8400 matrix. However, it returns the output as a single-dimension array: the numbers are ordered as in an 84x8400 matrix, but stored as a flat array of 705600 items. You can't transpose it, and you can't traverse it by rows in a loop, because you have to address each item by its absolute position. But do not worry, in the next section we will learn how to deal with this.
Process the output
The code of the process_output
function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Go. Add them to your code below the process_output
function:
func iou(box1, box2 []interface{}) float64 {
return intersection(box1, box2) / union(box1, box2)
}
func union(box1, box2 []interface{}) float64 {
box1_x1, box1_y1, box1_x2, box1_y2 := box1[0].(float64), box1[1].(float64), box1[2].(float64), box1[3].(float64)
box2_x1, box2_y1, box2_x2, box2_y2 := box2[0].(float64), box2[1].(float64), box2[2].(float64), box2[3].(float64)
box1_area := (box1_x2 - box1_x1) * (box1_y2 - box1_y1)
box2_area := (box2_x2 - box2_x1) * (box2_y2 - box2_y1)
return box1_area + box2_area - intersection(box1, box2)
}
func intersection(box1, box2 []interface{}) float64 {
box1_x1, box1_y1, box1_x2, box1_y2 := box1[0].(float64), box1[1].(float64), box1[2].(float64), box1[3].(float64)
box2_x1, box2_y1, box2_x2, box2_y2 := box2[0].(float64), box2[1].(float64), box2[2].(float64), box2[3].(float64)
x1 := math.Max(box1_x1, box2_x1)
y1 := math.Max(box1_y1, box2_y1)
x2 := math.Min(box1_x2, box2_x2)
y2 := math.Min(box1_y2, box2_y2)
return (x2 - x1) * (y2 - y1)
}
Also, you will need to find the YOLO class label by ID, so add the yolo_classes
array to your code:
var yolo_classes = []string{
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
"traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
"sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
"suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
"skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
"bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
"cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
"clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush",
}
Now let's implement the process_output
function. As mentioned above, the function receives the output as a flat array ordered as an 84x8400 matrix. When working in Python, we had NumPy to transform it to 8400x84 and then traverse it row by row in a loop. Here we can't transform it this way, so we need to traverse it by columns.
boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
}
Moreover, you do not have row and column indexes, only absolute indexes. You can only virtually reshape this flat array to an 84x8400 matrix in your head and use this representation to calculate the absolute indexes from those "virtual rows" and "virtual columns".
Let's display how the output
array looks to clarify this:
Here we virtually reshaped the output array with 705600 items to an 84x8400 matrix. It has 8400 columns with indexes from 0 to 8399 and 84 rows with indexes from 0 to 83. The absolute indexes of the items are written inside the cells. Each detected object is represented by a column in this matrix. The first 4 rows of each column, with indexes from 0 to 3, are the coordinates of the bounding box of the appropriate object: x_center, y_center, width and height. The cells in the other 80 rows, from 4 to 83, contain the probabilities that the object belongs to each of the 80 YOLO classes.
I drew this table to understand how to calculate the absolute index of any item in it, knowing its row and column indexes. For example, how do you calculate the index of the first greyed item, which stands in row 2 and column 2 and is the bounding box width of the third detected object? If you think about this a little, you will find that you need to multiply the row index by the length of the row (8400) and add the column index. Let's check it: 8400*2+2=16802. Now, let's calculate the index of the item below it, which is the height of the same object: 8400*3+2=25202. Bingo! Matched again! Finally, let's check the bottom gray box, which is the probability that the object in column 8398 belongs to class 79 (toothbrush): 8400*83+8398=705598. Great, so you have a formula to calculate the absolute index: 8400*row_index+column_index.
Let's return to our empty loop. Assuming that the index loop counter is the index of the current column and that the coordinates of the bounding box are located in rows 0-3 of that column, we can extract them this way:
boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
xc := output[index]
yc := output[8400+index]
w := output[2*8400+index]
h := output[3*8400+index]
}
Then you can calculate the corners of the bounding box and scale them to the size of the original image:
x1 := (xc - w/2) / 640 * float32(img_width)
y1 := (yc - h/2) / 640 * float32(img_height)
x2 := (xc + w/2) / 640 * float32(img_width)
y2 := (yc + h/2) / 640 * float32(img_height)
Now, similarly, you need to get the probabilities of the object, which go in rows from 4 to 83, find the biggest of them and its index, and save these values to the prob and class_id variables. You can write a nested loop that traverses rows from 4 to 83 and saves the highest value and its index:
class_id, prob := 0, float32(0.0)
for col := 0; col < 80; col++ {
if output[8400*(col+4)+index] > prob {
prob = output[8400*(col+4)+index]
class_id = col
}
}
Then, having the maximum probability and class_id, you can either skip this object if the probability is less than 0.5, or find the label of its class.
Here is the final code that processes and collects the bounding boxes to the boxes array:
boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
class_id, prob := 0, float32(0.0)
for col := 0; col < 80; col++ {
if output[8400*(col+4)+index] > prob {
prob = output[8400*(col+4)+index]
class_id = col
}
}
if prob < 0.5 {
continue
}
label := yolo_classes[class_id]
xc := output[index]
yc := output[8400+index]
w := output[2*8400+index]
h := output[3*8400+index]
x1 := (xc - w/2) / 640 * float32(img_width)
y1 := (yc - h/2) / 640 * float32(img_height)
x2 := (xc + w/2) / 640 * float32(img_width)
y2 := (yc + h/2) / 640 * float32(img_height)
boxes = append(boxes, []interface{}{float64(x1), float64(y1), float64(x2), float64(y2), label, prob})
}
The last step is to filter the boxes
array using "Non-maximum suppression", to exclude all overlapping boxes from it. This code does the same as the Python implementation, but looks slightly different because of the Go language specifics:
sort.Slice(boxes, func(i, j int) bool {
return boxes[i][5].(float32) > boxes[j][5].(float32)
})
result := [][]interface{}{}
for len(boxes) > 0 {
result = append(result, boxes[0])
tmp := [][]interface{}{}
for _, box := range boxes {
if iou(boxes[0], box) < 0.7 {
tmp = append(tmp, box)
}
}
boxes = tmp
}
- First, we sort the boxes by probability in descending order to put the boxes with the highest probability to the top.
- In a loop, we put the box with the highest probability to the result array.
- Then we create a temporary tmp array, and in the inner loop over all boxes we put to this array only the boxes that do not overlap the selected one too much (those that have IoU<0.7).
- Then we overwrite the boxes array with the tmp array. This way, we filter out all overlapping boxes from the boxes array.
- If some boxes remain after filtering, the loop continues until the boxes array becomes empty.
Finally, the result
variable contains all bounding boxes that should be returned.
That's all! For convenience, here is a full code of the process_output
function:
func process_output(output []float32, img_width, img_height int64) [][]interface{} {
boxes := [][]interface{}{}
for index := 0; index < 8400; index++ {
class_id, prob := 0, float32(0.0)
for col := 0; col < 80; col++ {
if output[8400*(col+4)+index] > prob {
prob = output[8400*(col+4)+index]
class_id = col
}
}
if prob < 0.5 {
continue
}
label := yolo_classes[class_id]
xc := output[index]
yc := output[8400+index]
w := output[2*8400+index]
h := output[3*8400+index]
x1 := (xc - w/2) / 640 * float32(img_width)
y1 := (yc - h/2) / 640 * float32(img_height)
x2 := (xc + w/2) / 640 * float32(img_width)
y2 := (yc + h/2) / 640 * float32(img_height)
boxes = append(boxes, []interface{}{float64(x1), float64(y1), float64(x2), float64(y2), label, prob})
}
sort.Slice(boxes, func(i, j int) bool {
return boxes[i][5].(float32) > boxes[j][5].(float32)
})
result := [][]interface{}{}
for len(boxes) > 0 {
result = append(result, boxes[0])
tmp := [][]interface{}{}
for _, box := range boxes {
if iou(boxes[0], box) < 0.7 {
tmp = append(tmp, box)
}
}
boxes = tmp
}
return result
}
If you would like to work with this output in a more convenient "Pythonic" way, there is the Gorgonia Tensor library that emulates features of NumPy in Go. You can use it to physically reshape the output to 84x8400, then transpose it to 8400x84 and then traverse the detected objects by row.
However, the option described in this section, working with the single-dimension array as if it were a matrix, is the most efficient, because we get all the values we need without additional array transformations. I think that installing an additional external dependency is overkill for this case.
That is it for Go implementation. If you wrote everything correctly, then you can start this web service by running the following command:
go run main.go
and open http://localhost:8080
in a web browser.
The code that we developed here is intended only to demonstrate how to load and run the YOLOv8 models using the ONNX runtime. I made it as simple as possible, and it does not include any details except working with ONNX. It does not include any resource management, error processing or exception handling. These tasks depend on real use cases, and it's up to you how to implement them for your projects.
The full reference of the Go library for ONNX runtime is available here.
You can find a source code of Go object detector web service in this repository.
Create a web service on Rust
This article cannot be complete without an example in a low-level language: a high-performance, efficient language in which developers manage memory themselves and do not rely on a garbage collector. I was thinking which one to choose, C++ or Rust. Finally, I decided to ask people and created the following poll in a LinkedIn group:
Regardless of the received results, I also analyzed the comments and understood that people most likely answered a different question than the one I asked. I did not ask "Which of these programming languages do you know?", or "Which of them do you like?" or "Which of them is the most popular?". Instead, I asked: "Which is better to learn TODAY to create NEW high performance server applications?".
In the end, I got only one valuable comment:
It was the only comment that received some likes, and I completely agree with it.
So the choice was made! We are going to create an object detection web service in Rust, the safest low-level programming language today.
Setup the project
Enter the command to create a new Rust project:
cargo new object_detector
This will create an object_detector
folder with a project template in it.
Go to this folder and open the Cargo.toml
file in it.
Write the following packages to the dependencies section
:
[dependencies]
image = "0.24.6"
ndarray = "0.15.6"
ort = "1.14.6"
serde = "1.0.84"
serde_derive = "1.0.84"
serde_json = "1.0.36"
rocket = "=0.5.0-rc.3"
- image - library for image processing.
- ndarray - multidimensional array support library.
- ort - ONNX runtime library.
- serde,serde_derive,serde_json - Serialization library to serialize data to JSON.
- rocket - Web framework.
Create a Rocket.toml
file which will contain configuration for the Rocket web server and add the following lines to it:
[global]
address = "0.0.0.0"
port = 8080
We are not going to change frontend, so copy the index.html
to the project. Also, copy the yolov8m.onnx
model.
Before continuing, ensure that the ONNX runtime is installed on your operating system, because the library integrated into the Rust package may not work correctly. To install it, you can download the archive for your operating system from here, extract it and copy the contents of the "lib" subfolder to the system libraries path of your operating system.
The main project file, main.rs, is already generated and located in the src subfolder. Open this file and add the following boilerplate code to it:
use std::{sync::Arc, path::Path, vec};
use image::{GenericImageView, imageops::FilterType};
use ndarray::{Array, IxDyn, s, Axis};
use ort::{Environment,SessionBuilder,tensor::InputTensor};
use rocket::{response::content,fs::TempFile,form::Form};
#[macro_use] extern crate rocket;
#[rocket::main]
async fn main() {
rocket::build()
.mount("/", routes![index])
.mount("/detect", routes![detect])
.launch().await.unwrap();
}
#[get("/")]
fn index() -> content::RawHtml<String> {
return content::RawHtml(std::fs::read_to_string("index.html").unwrap());
}
#[post("/", data = "<file>")]
fn detect(file: Form<TempFile<'_>>) -> String {
let buf = std::fs::read(file.path().unwrap_or(Path::new(""))).unwrap_or(vec![]);
let boxes = detect_objects_on_image(buf);
return serde_json::to_string(&boxes).unwrap_or_default()
}
fn detect_objects_on_image(buf: Vec<u8>) -> Vec<(f32,f32,f32,f32,&'static str,f32)> {
let (input,img_width,img_height) = prepare_input(buf);
let output = run_model(input);
return process_output(output, img_width, img_height);
}
fn prepare_input(buf: Vec<u8>) -> (Array<f32,IxDyn>, u32, u32) {
}
fn run_model(input:Array<f32,IxDyn>) -> Array<f32,IxDyn> {
}
fn process_output(output:Array<f32,IxDyn>,img_width: u32, img_height: u32) -> Vec<(f32,f32,f32,f32,&'static str, f32)> {
}
First block imports required modules:
-
image
- to process images -
ndarray
- to work with tensors -
ort
- ONNX runtime library -
rocket
- Rocket Web framework -
std
- some objects from Rust standard library
Then, in the main function we start the Rocket
web server and attach index
and detect
routes to it.
The index function serves the root of the service; it just returns the content of the index.html file as HTML.
The detect function serves the /detect endpoint. It receives the uploaded file, passes it to detect_objects_on_image, receives the array of bounding boxes, serializes it to JSON and returns this JSON string to the frontend.
The detect_objects_on_image function implements the same actions as the Python version. It converts the image to a multidimensional array of numbers, passes it to the ONNX runtime and processes the output. Finally, it returns the array of bounding boxes, where each bounding box is a tuple of (x1, y1, x2, y2, label, prob). Rust is a strongly typed language, so we have to specify the types of all values in this tuple. That is why it returns Vec<(f32,f32,f32,f32,&'static str,f32)>, which is a vector of bounding box tuples.
Then we define stubs for prepare_input
, run_model
and process_output
functions, that will be implemented one by one in the following sections.
Prepare the input
To prepare the input for the YOLOv8 model, you need to load the image, resize it and convert it to a tensor of the (1,3,640,640) shape, which is an array of a single image represented as three 640x640 matrices. The first matrix is an array of the red components of the image pixels, the second is an array of greens, and the last is an array of blues. We will use the ndarray library to construct this tensor and fill it with pixel color values. But first we need to load the image and resize it to 640x640:
let img = image::load_from_memory(&buf).unwrap();
let (img_width, img_height) = (img.width(), img.height());
let img = img.resize_exact(640, 640, FilterType::CatmullRom);
- In the first line, the image is loaded from the uploaded file buffer.
- Next, we save the original image width and height for future use.
- Finally, we resize the image to 640x640.
Then, let's construct the input array of required shape:
let mut input = Array::zeros((1, 3, 640, 640)).into_dyn();
This line creates a new 4-dimensional tensor filled with zeros.
Now, you need to get access to the image pixels and their color components. The img
object has a pixels()
method, which is an iterator for image pixels. You can use it to get access to each pixel in a loop:
for pixel in img.pixels() {
}
The pixel is a tuple with the items that we need:
- pixel.0 - the x coordinate of the pixel
- pixel.1 - the y coordinate of the pixel
- pixel.2 - the color object, whose first field is an array of 4 items [r,g,b,a]: the color components of the pixel.
Having this, you can fill the tensor input in a loop:
for pixel in img.pixels() {
let x = pixel.0 as usize;
let y = pixel.1 as usize;
let [r,g,b,_] = pixel.2.0;
input[[0, 0, y, x]] = (r as f32) / 255.0;
input[[0, 1, y, x]] = (g as f32) / 255.0;
input[[0, 2, y, x]] = (b as f32) / 255.0;
};
- First, we extract the x and y variables and convert them to a type that can be used as a tensor index.
- Then we destructure the color to the r, g and b variables.
- Finally, we put these pixel color components to the appropriate cells of the tensor. Notice that the y goes first and the x goes next. This is because in matrices, the first dimension is a row and the second is a column.
So, now you have an input prepared for the neural network. You need to return it from the function along with img_width and img_height. Here is the full source of the prepare_input function:
fn prepare_input(buf: Vec<u8>) -> (Array<f32,IxDyn>, u32, u32) {
let img = image::load_from_memory(&buf).unwrap();
let (img_width, img_height) = (img.width(), img.height());
let img = img.resize_exact(640, 640, FilterType::CatmullRom);
let mut input = Array::zeros((1, 3, 640, 640)).into_dyn();
for pixel in img.pixels() {
let x = pixel.0 as usize;
let y = pixel.1 as usize;
let [r,g,b,_] = pixel.2.0;
input[[0, 0, y, x]] = (r as f32) / 255.0;
input[[0, 1, y, x]] = (g as f32) / 255.0;
input[[0, 2, y, x]] = (b as f32) / 255.0;
};
return (input, img_width, img_height);
}
Now, it's time to pass this input through the YOLOv8 model.
Run the model
The run_model function is used to pass the input tensor through the model and return the output tensor. This is its source code:
fn run_model(input:Array<f32,IxDyn>) -> Array<f32,IxDyn> {
let input = InputTensor::FloatTensor(input);
let env = Arc::new(Environment::builder().with_name("YOLOv8").build().unwrap());
let model = SessionBuilder::new(&env).unwrap().with_model_from_file("yolov8m.onnx").unwrap();
let outputs = model.run([input]).unwrap();
let output = outputs.get(0).unwrap().try_extract::<f32>().unwrap().view().t().into_owned();
return output;
}
- First, it converts the input to the internal ONNX runtime tensor format.
- Then it creates the environment and instantiates the ONNX model in it from the yolov8m.onnx file.
- Then it runs the model with the input tensor and receives the array of outputs.
- Finally, it extracts the first output, transposes it with .t() and returns it.
The returned output is an Ndarray
tensor, so we can traverse it in a loop. Let's process it.
Process the output
The code of the process_output
function will use the Intersection over Union algorithm to filter out all overlapping boxes. It's easy to rewrite the iou, intersection and union functions from Python to Rust. Add them to your code below the process_output
function:
fn iou(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
return intersection(box1, box2) / union(box1, box2);
}
fn union(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
let (box1_x1,box1_y1,box1_x2,box1_y2,_,_) = *box1;
let (box2_x1,box2_y1,box2_x2,box2_y2,_,_) = *box2;
let box1_area = (box1_x2-box1_x1)*(box1_y2-box1_y1);
let box2_area = (box2_x2-box2_x1)*(box2_y2-box2_y1);
return box1_area + box2_area - intersection(box1, box2);
}
fn intersection(box1: &(f32, f32, f32, f32, &'static str, f32), box2: &(f32, f32, f32, f32, &'static str, f32)) -> f32 {
let (box1_x1,box1_y1,box1_x2,box1_y2,_,_) = *box1;
let (box2_x1,box2_y1,box2_x2,box2_y2,_,_) = *box2;
let x1 = box1_x1.max(box2_x1);
let y1 = box1_y1.max(box2_y1);
let x2 = box1_x2.min(box2_x2);
let y2 = box1_y2.min(box2_y2);
return (x2-x1)*(y2-y1);
}
Also, we will need to get labels for detected objects, so include this array of COCO class labels:
const YOLO_CLASSES:[&str;80] = [
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
"traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
"sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie",
"suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
"skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
"bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
"cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book",
"clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
];
Now let's start writing the process_output
function.
Let's define an array to which you will put collected bounding boxes:
let mut boxes = Vec::new();
The output from the YOLOv8 model is a tensor, and because run_model transposed the raw output with .t(), it has the shape [8400,84,1] instead of the [1,84,8400] we saw in the other languages. It's already ordered by rows, but it has an extra dimension at the end. Let's remove it:
let output = output.slice(s![..,..,0]);
This line extracts the (8400,84) matrix from the tensor, and we can traverse it along the first axis, i.e. by rows:
for row in output.axis_iter(Axis(0)) {
}
Here, the row is a single-dimension ndarray object that represents a row of 84 float numbers. It will be more convenient to convert it to a basic vector, so let's do it:
for row in output.axis_iter(Axis(0)) {
let row:Vec<_> = row.iter().map(|x| *x).collect();
}
The first 4 items of this array contain bounding box coordinates, and we can convert and scale them to x1,y1,x2,y2 now:
let xc = row[0]/640.0*(img_width as f32);
let yc = row[1]/640.0*(img_height as f32);
let w = row[2]/640.0*(img_width as f32);
let h = row[3]/640.0*(img_height as f32);
let x1 = xc - w/2.0;
let x2 = xc + w/2.0;
let y1 = yc - h/2.0;
let y2 = yc + h/2.0;
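For example (with hypothetical numbers), for a 1280×960 source image a raw output of xc = 320 and w = 64 scales to a centre of 320/640*1280 = 640 pixels and a width of 64/640*1280 = 128 pixels, which gives x1 = 640 - 64 = 576 and x2 = 640 + 64 = 704.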
Then, all items from 4 to 83 are the probabilities that this bounding box belongs to each of the 80 object classes. You need to find the maximum of these items and its index, which can be used as the ID of the object class. You can do this in a loop:
let mut class_id = 0;
let mut prob:f32 = 0.0;
for index in 4..row.len() {
if row[index]>prob {
prob = row[index];
class_id = index-4;
}
}
let label = YOLO_CLASSES[class_id];
Here we determined the maximum probability, the class_id of the object with the maximum probability, and the label of that object class.
It works fine, but I'd rather implement it in a functional way instead of a loop:
let (class_id, prob) = row.iter().skip(4).enumerate()
.map(|(index,value)| (index,*value))
.reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
let label = YOLO_CLASSES[class_id];
- This code gets an iterator over the row elements, starting from the 4th item.
- Then it maps the row items to tuples (class_id, prob).
- Then it reduces this sequence of tuples to the single tuple with the maximum prob.
- The resulting tuple is then destructured into the class_id and prob variables.
Finally, you can skip the row if its prob is below 0.5, or otherwise collect all values into a bounding box and push this bounding box to the boxes array.
Here is all the code we have so far, with the operations in the correct order:
let mut boxes = Vec::new();
let output = output.slice(s![..,..,0]);
for row in output.axis_iter(Axis(0)) {
let row:Vec<_> = row.iter().map(|x| *x).collect();
let (class_id, prob) = row.iter().skip(4).enumerate()
.map(|(index,value)| (index,*value))
.reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
if prob < 0.5 {
continue
}
let label = YOLO_CLASSES[class_id];
let xc = row[0]/640.0*(img_width as f32);
let yc = row[1]/640.0*(img_height as f32);
let w = row[2]/640.0*(img_width as f32);
let h = row[3]/640.0*(img_height as f32);
let x1 = xc - w/2.0;
let x2 = xc + w/2.0;
let y1 = yc - h/2.0;
let y2 = yc + h/2.0;
boxes.push((x1,y1,x2,y2,label,prob));
}
P.S. Actually, it's possible to implement all of this in a functional way instead of a loop. You can do it as homework.
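Since the full solution is left as homework, here is just one possible sketch of the collection step rewritten with iterator combinators (my own variant, not the article's official code; it assumes output is already the (8400,84) view produced by the slice above, uses the same img_width and img_height values, and the NMS step that follows stays unchanged):
// One possible functional variant of the collection loop (a sketch, not the repository code):
// filter_map returns None for low-probability rows and Some(bounding box) otherwise.
let mut boxes: Vec<(f32,f32,f32,f32,&'static str,f32)> = output
    .axis_iter(Axis(0))
    .filter_map(|row| {
        let row: Vec<f32> = row.iter().copied().collect();
        let (class_id, prob) = row.iter().skip(4).enumerate()
            .map(|(index,value)| (index,*value))
            .reduce(|accum, item| if item.1 > accum.1 { item } else { accum })?;
        if prob < 0.5 {
            return None;
        }
        let label = YOLO_CLASSES[class_id];
        let xc = row[0]/640.0*(img_width as f32);
        let yc = row[1]/640.0*(img_height as f32);
        let w = row[2]/640.0*(img_width as f32);
        let h = row[3]/640.0*(img_height as f32);
        Some((xc - w/2.0, yc - h/2.0, xc + w/2.0, yc + h/2.0, label, prob))
    })
    .collect();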
Finally, you need to filter the boxes array to exclude the boxes that overlap each other, using the Intersection over Union. The filtered boxes should be collected into the result array:
let mut result = Vec::new();
boxes.sort_by(|box1,box2| box2.5.total_cmp(&box1.5));
while boxes.len()>0 {
result.push(boxes[0]);
boxes = boxes.iter().filter(|box1| iou(&boxes[0],box1) < 0.7).map(|x| *x).collect()
}
- First, we sort boxes by probability in descending order to put the boxes with the highest probability at the top.
- Then, in a loop, we put the first box (the one with the highest probability) into the resulting array.
- Then, we overwrite the boxes array using a filter that keeps only those boxes whose iou value, compared with the selected box, is less than 0.7.
- If, after filtering, the boxes array still contains elements, the loop continues.
Finally, after the loop, the boxes array will be empty and the result will contain the bounding boxes of all distinct detected objects.
The result array should be returned by this function. Here is the whole code:
fn process_output(output:Array<f32,IxDyn>,img_width: u32, img_height: u32) -> Vec<(f32,f32,f32,f32,&'static str, f32)> {
let mut boxes = Vec::new();
let output = output.slice(s![..,..,0]);
for row in output.axis_iter(Axis(0)) {
let row:Vec<_> = row.iter().map(|x| *x).collect();
let (class_id, prob) = row.iter().skip(4).enumerate()
.map(|(index,value)| (index,*value))
.reduce(|accum, row| if row.1>accum.1 { row } else {accum}).unwrap();
if prob < 0.5 {
continue
}
let label = YOLO_CLASSES[class_id];
let xc = row[0]/640.0*(img_width as f32);
let yc = row[1]/640.0*(img_height as f32);
let w = row[2]/640.0*(img_width as f32);
let h = row[3]/640.0*(img_height as f32);
let x1 = xc - w/2.0;
let x2 = xc + w/2.0;
let y1 = yc - h/2.0;
let y2 = yc + h/2.0;
boxes.push((x1,y1,x2,y2,label,prob));
}
boxes.sort_by(|box1,box2| box2.5.total_cmp(&box1.5));
let mut result = Vec::new();
while boxes.len()>0 {
result.push(boxes[0]);
boxes = boxes.iter().filter(|box1| iou(&boxes[0],box1) < 0.7).map(|x| *x).collect()
}
return result;
}
That is it for the Rust web service. If everything is written correctly, you can start the web service by running the following command in the project folder:
cargo run
and open http://localhost:8080
in a web browser.
The code that we developed here is oversimplified. It's intended only to demonstrate how to load and run YOLOv8 models using the ONNX runtime. I made it as simple as possible, and it does not include any details beyond working with ONNX: there is no resource management, error processing or exception handling. These tasks depend on real use cases, and it's up to you how to implement them in your projects.
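As a minimal sketch of one way to start hardening this (my own assumption, not code from the article's repository), you could validate the output tensor shape up front and return a Result, so the HTTP handler can report a failure instead of panicking:
// Hypothetical hardening sketch (not part of the repository): check the model
// output shape before processing and let the caller handle the error.
fn process_output_checked(output: Array<f32,IxDyn>, img_width: u32, img_height: u32)
    -> Result<Vec<(f32,f32,f32,f32,&'static str,f32)>, String> {
    if output.ndim() != 3 {
        return Err(format!("unexpected model output shape: {:?}", output.shape()));
    }
    Ok(process_output(output, img_width, img_height))
}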
Full reference of Rust library for ONNX runtime available here.
You can find a source code of Rust object detector web service in this repository.
Conclusion
In this article I showed that even though the YOLOv8 neural network is created in Python, you can use it from other programming languages, because it can be exported to the universal ONNX format.
We explored the foundational algorithms used to prepare the input and process the output of the ONNX model, which are the same for all programming languages that have interfaces to the ONNX runtime.
After covering the main concepts, I showed how to create an object detection web service based on the ONNX runtime using Python, Julia, Node.js, JavaScript, Go and Rust. Each language has some differences, but in general the whole workflow follows the same algorithm.
You can apply this experience to any other neural networks created using PyTorch or TensorFlow (which covers most of the neural networks in existence), because each framework can export its models to ONNX.
There are ONNX runtime interfaces for other programming languages like Java, C# or C++ and for other platforms, including mobile phones. You can find the list of official bindings here.
Also, there are unofficial bindings for other languages, like PHP. It's a great way to integrate neural networks into WordPress websites.
I believe it won't be difficult to rewrite the projects that we created here in those other languages, if you know them, of course.
In the next article, I will show how to detect objects in a video in a web browser in real time. Follow me to be the first to know when I publish it.
You can find me on LinkedIn, Twitter, and Facebook to be the first to learn about new articles like this one and other software development news.
Have fun coding and never stop learning!