In-Browser object detection using YOLO and TensorFlow.js

skalskip

Piotr Skalski

Posted on March 4, 2020

In-Browser object detection using YOLO and TensorFlow.js

Some time ago, I spent several evenings playing around with state of the art object detection model called YOLO, which is certainly known to those who are interested in Machine Learning on a daily basis. Originally written in Darknet — open-source neural network framework — YOLO performs really well in the tasks of locating and recognizing objects on the pictures. Due to the fact that I have been interested in TensorFlow.js for a few weeks now, I decided to check how YOLO will handle the limitations of In-Browser computing. The entire source code, as well as my previous TF.js projects, can be found on GitHub. If you want to play with the demo version, visit the “I Learn Machne Learning” project website.

Alt Text

Old guns for now…

A few months ago, the third version of YOLO was released. I had the opportunity to test its capabilities in Python and I had a great hope that I could use it in my little project. After spending two days browsing through repositories, forums, and documentation, it turned out that it is not possible to do so right now. As described in the aforementioned article, to use the original YOLO model in your TensorFlow.js project you must first make a two-step conversion. The first of steps takes us from Darknet to TensorFlow / Keras and the second converts our model into a form understandable for TensorFlow.js. Unfortunately, due to the fact that YOLOv3 has introduced new layers to its architecture, and none of the most popular tools like Darkflow or YAD2K has yet to support their conversion to TensorFlow, we have to stick to old guns for now. In the future, I will definitely need to come back and change v2 for a newer model.

Let’s get our hands dirty

The procedure of connecting model with our application is pretty much standard and it was already described in detail in the first article of this series. However this time, there is much more dirty work waiting for us, involved mostly in data processing both before and after the prediction.

Alt Text

First of all, our model must be provided with a tensor of appropriate dimensions - [1, 416, 416, 1] to be exact. As it usually happens these values are related to the dimensions of training images and batch size. Such a square input is problematic because typically pictures are not cropped this way. Cutting images to meet the above condition, carries the risk of losing valuable data which may result in false recognition of objects in the picture. To limit this undesirable effect, we use the popular smartcrop library, which frames the photo by selecting the most interesting fragment. The picture below is an excellent example of the described mechanism and a successful prediction that would probably fail without this trick. Finally, we normalize the values of each pixel, so that they are between 0 and 1. The last point is particularly important to me, as I spend almost two hours looking for a bug causing my model to perform so badly. Better late than never…

Alt Text

Asa result of each prediction, the model returns a tensor with rather strange dimensions [1, 13, 13, 425]. These enigmatic numbers have been effectively exposed in this article, which perfectly explains what is happening under the hood of YOLO. I recommend it to anyone who would like to understand the meaning of this beautiful algorithm. Our task now is to convert this tensor into neat rectangles surrounding the objects in the pictures. This step is quite extensive and could easily be the subject of a separate article. Without going into too much detail, I will say that we will use techniques such as Intersect over Union and Non-Maxima Suppression to get rid of unlikely results and aggregate the remaining rectangles with high probabilities into bounding boxes of detected objects. I recommend viewing the source code, containing these calculations.

Alt Text

Inconsistency across different devices

After finishing the work on the alpha version, I decided to show off my new toy in front of my friends. In this way, quite accidentally, I discovered that the model can behave quite differently on different devices. The class of detected objects does not change but their probability values can change by up to several dozen percents. In the model shown below, the threshold value has been set to 0.5. This means that all objects with lower probabilities will be filtered out. This was the fate of the zebra in the lower left image, its probability dropped by over 25%. TensorFlow.js is still a young library and is struggling with certain problems - currently, there are several issues related to inconsistency on their GitHub. Apparently, it is not easy to make calculations identical on each device. I keep my fingers crossed for the TensorFlow.js team and I hope that they will solve all these problems.

Alt Text

Speed kills

Finally, I would like to write just a few words about one of the important aspects of web programming (though often overlooked) which is the speed of the application. After converting YOLO into a form understood by TF.js, over twenty files are created, which together weigh about 45 MB. Loading such a large amount of data on a slow 3G connection requires almost sacred patience. It is certainly worth paying attention to if we decided to use this type of solution in production.

In a few words

TensorFlow.js is still very young but it gives us developers and date scientists amazing possibilities. You should be aware of certain limitations which I mentioned, but it is worth giving TF.js a chance, because its real capabilities are in my opinion, unexplored.

💖 💪 🙅 🚩
skalskip
Piotr Skalski

Posted on March 4, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related