A simple yet effective association method that tracks objects by associating almost every detection box instead of only the high-score ones.
The goal of this blog is to cover ByteTrack and the techniques behind Multi-Object Tracking (MOT). We will also run YOLOv8 object detection with ByteTrack tracking on a sample video.
Multi-Object Tracking (MOT)
You might have heard of object detection: there are many algorithms, such as Faster R-CNN, SSD, and the various versions of YOLO, that detect objects with good accuracy. Multi-object tracking goes one step further. You pass in a video stream and, for each frame, you need to detect the objects and assign each one an "Object ID". If the same object is detected in the next frame, it must be assigned the same Object ID. There are many algorithms for MOT, such as SORT (Simple Online and Realtime Tracking), DeepSORT, StrongSORT, etc.
There are various methods used for object tracking as follows:
- Feature-based Tracking: This involves tracking an object based on its features, such as color, shape, texture, etc.
- Template Matching: As the name suggests, this method matches a pre-defined template against each frame of the video sequence.
- Correlation-based Tracking: This method computes the similarity between the target object and the candidate regions in subsequent frames.
- Deep learning-based Tracking: This method uses neural networks trained on large datasets to detect and track objects in real-time.
Now you should have a basic idea of MOT. Let’s jump to ByteTrack and try to understand why it tracks objects better than DeepSORT and similar algorithms.
ByteTrack
First we will look at the problems with previous MOT algorithms, then we will walk through the logic of ByteTrack.
Problems with other MOT Algorithms
- Low-confidence detection boxes: The first problem is that other MOT algorithms discard low-confidence detection boxes, while ByteTrack takes them into account as well. Why?
“What is reasonable is real; that which is real is reasonable.” — Hegel
That is, low-confidence detection boxes sometimes indicate the existence of objects, e.g., occluded objects. Filtering out these boxes causes irreversible errors in MOT, leading to non-negligible missed detections and fragmented trajectories.
Let’s understand it by example:
As you can see in frame t1, three tracklets are initialized because their detection scores are higher than 0.5. In frames t2 and t3, however, the score of one object drops from 0.8 to 0.4 and then from 0.4 to 0.1.
A thresholding mechanism eliminates these detection boxes, and the red tracklet disappears accordingly, as shown in Figure (b). But if we take every detection box into consideration, more false positives are introduced, e.g., the rightmost box in Figure (a). This brings us to the second problem.
- False-positive bounding boxes: The key observation is that similarity with the tracklets provides a strong cue for distinguishing objects from background among the low-score detection boxes.
For example, in Figure (c) two low-score detection boxes are matched to the tracklets through the motion-predicted boxes (dotted lines), so the objects are correctly recovered. The background box is removed because it has no matching tracklet.
So the matching process uses every detection box, from high scores to low scores. This simple and effective association method is called BYTE, named so because each detection box is a basic unit of a tracklet, like a byte in a computer program. First, the high-score detection boxes are matched to the tracklets based on motion or appearance similarity. A Kalman filter predicts each tracklet's location in the new frame, and the similarity between a predicted box and a detection box is computed using IoU or Re-ID feature distance. In the second matching step, the low-score detections are matched to the tracklets that are still unmatched (i.e., the tracklet in the red box) using the same motion similarity.
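To make the prediction step concrete, here is a minimal sketch of a Kalman filter predict step for a single coordinate under a constant-velocity motion model. The two-element state `[position, velocity]`, the time step `dt` of one frame, and the process-noise value `q` are assumptions chosen for illustration; real trackers keep a larger state that covers the whole bounding box, and they also run an update step that is not shown here.

```python
def kalman_predict(mean, cov, dt=1.0, q=0.01):
    """Predict step for a [position, velocity] state (constant velocity).

    mean: [x, v]
    cov:  2x2 covariance matrix as nested lists
    q:    illustrative process noise added to the diagonal (assumption)
    """
    x, v = mean
    # State transition: x' = x + v * dt, v' = v
    new_mean = [x + v * dt, v]

    # Covariance propagation: P' = F P F^T + Q, with F = [[1, dt], [0, 1]]
    p00, p01 = cov[0]
    p10, p11 = cov[1]
    n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + q
    return new_mean, [[n00, n01], [n10, n11]]


# A box centre at x = 100 moving 5 pixels per frame.
mean, cov = kalman_predict([100.0, 5.0], [[1.0, 0.0], [0.0, 1.0]])
print(mean)  # [105.0, 5.0]
```

The predicted position is what gets compared against the detections of the new frame; the growing covariance reflects that the prediction becomes less certain the longer a tracklet goes without a matched detection.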
Let’s now look at data association, which is the core of any MOT algorithm.
Data Association
Data association is the core of multi-object tracking: it first computes the similarities between tracklets and detection boxes, and then applies a matching strategy to pair them according to those similarities.
Similarity metrics: Location, motion, and appearance are the three important cues for association. SORT combines location and motion cues in a very simple way: it adopts a Kalman filter to predict the location of the tracklets in the next frame and then computes the IoU between the detection boxes and the predicted boxes as the similarity. Location and motion cues work well for short-range matching, but for long-range matching appearance similarity is more helpful. For example, an object that was occluded for a long time can be re-identified using appearance similarity, which is measured as the cosine similarity of Re-ID features. DeepSORT uses a standalone deep learning model to extract these appearance features.
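The two similarity measures mentioned above can be sketched in a few lines of plain Python. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for this example, and the Re-ID features are stand-ins for whatever embedding an appearance model would produce.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(feat_a, feat_b):
    """Cosine similarity of two feature vectors (e.g. Re-ID embeddings)."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


predicted_box = (0, 0, 10, 10)   # tracklet box predicted by the motion model
detection_box = (5, 0, 15, 10)   # detection in the current frame
print(iou(predicted_box, detection_box))       # half-overlapping boxes
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

IoU only needs box coordinates, which is why it is cheap and works well between consecutive frames; cosine similarity needs a feature extractor but does not depend on where the object reappears.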
Matching strategy: After the similarities are computed, the matching strategy assigns IDs to the objects. This can be done with the Hungarian algorithm or with greedy assignment. SORT matches the detection boxes to the tracklets in a single pass, while DeepSORT uses a cascaded matching strategy that first matches the detection boxes to the most recent tracklets and then to the lost ones.
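As an illustration of the simpler of the two strategies, here is a sketch of greedy assignment over a similarity matrix. The function name, the `min_similarity` cut-off of 0.3, and the matrix layout (rows are tracklets, columns are detections) are assumptions for this example. For an optimal assignment, SciPy's `scipy.optimize.linear_sum_assignment` solves the same problem and is a common choice in practice.

```python
def greedy_match(similarity, min_similarity=0.3):
    """Greedily pair tracklets and detections, best similarity first.

    similarity[i][j] is the similarity between tracklet i and detection j.
    Returns (matches, unmatched_tracklets, unmatched_detections).
    """
    n_tracks = len(similarity)
    # An empty matrix carries no detection count, so it is treated as zero.
    n_dets = len(similarity[0]) if similarity else 0

    # Every pair above the cut-off, sorted from most to least similar.
    candidates = sorted(
        (
            (score, i, j)
            for i, row in enumerate(similarity)
            for j, score in enumerate(row)
            if score >= min_similarity
        ),
        reverse=True,
    )

    matches, used_tracks, used_dets = [], set(), set()
    for _, i, j in candidates:
        if i in used_tracks or j in used_dets:
            continue  # each tracklet and each detection is used at most once
        matches.append((i, j))
        used_tracks.add(i)
        used_dets.add(j)

    unmatched_tracks = [i for i in range(n_tracks) if i not in used_tracks]
    unmatched_dets = [j for j in range(n_dets) if j not in used_dets]
    return matches, unmatched_tracks, unmatched_dets


similarity = [
    [0.90, 0.80],
    [0.85, 0.10],
]
print(greedy_match(similarity))  # ([(0, 0)], [1], [1])
```

The example also shows the weakness of the greedy approach: it takes the 0.90 pair first and leaves tracklet 1 unmatched, whereas an optimal assignment would pair tracklet 0 with detection 1 and tracklet 1 with detection 0.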
BYTE Algorithm
The input to the BYTE algorithm is a video sequence, a detector (Det), and a detection score threshold. The algorithm outputs the tracks T of the video, where each track contains the bounding box and the ID of an object in each frame.
For each frame in the video, we first predict the detection boxes and their scores using the detector Det. Then we split the detection boxes into Det(high) and Det(low) according to the detection score threshold.
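The split itself is a simple filter. In the sketch below the dictionary layout of a detection, the function name, and both threshold values are assumptions for illustration; the lower bound `low_thresh` is there so that near-zero scores are dropped rather than tracked.

```python
def split_detections(detections, high_thresh=0.5, low_thresh=0.1):
    """Split detections into high-score and low-score groups.

    detections: list of dicts with a "box" and a "score" key (assumed format).
    Detections scoring below low_thresh are discarded entirely.
    """
    det_high = [d for d in detections if d["score"] >= high_thresh]
    det_low = [d for d in detections if low_thresh <= d["score"] < high_thresh]
    return det_high, det_low


frame_detections = [
    {"box": (10, 10, 50, 80), "score": 0.82},   # clearly visible object
    {"box": (60, 12, 95, 78), "score": 0.35},   # possibly occluded object
    {"box": (200, 5, 210, 15), "score": 0.04},  # almost certainly noise
]
det_high, det_low = split_detections(frame_detections)
print(len(det_high), len(det_low))  # 1 1
```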
After separating the detection boxes, the Kalman filter is applied to predict the new location of each track in T in the current frame. The first association is performed between the high-score detection boxes and all the tracks. The second association is then performed between the low-score detection boxes and the tracks left unmatched after the first association.
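The two association rounds can be sketched as follows. This is an illustrative simplification, not the reference implementation: it uses IoU as the only similarity, greedy matching instead of the Hungarian algorithm, and assumed threshold values. The track boxes passed in are expected to be the Kalman-predicted boxes for the current frame, and all function and parameter names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_by_iou(track_boxes, track_ids, det_boxes, det_ids, min_iou):
    """Greedy IoU matching restricted to the given track and detection indices."""
    pairs = sorted(
        ((iou(track_boxes[t], det_boxes[d]), t, d) for t in track_ids for d in det_ids),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, t, d in pairs:
        if score < min_iou:
            break  # pairs are sorted, so nothing better follows
        if t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    left_t = [t for t in track_ids if t not in used_t]
    left_d = [d for d in det_ids if d not in used_d]
    return matches, left_t, left_d


def byte_associate(track_boxes, det_boxes, det_scores,
                   high_thresh=0.5, low_thresh=0.1, min_iou=0.3):
    """Two-stage association: high-score boxes first, then low-score boxes."""
    high = [i for i, s in enumerate(det_scores) if s >= high_thresh]
    low = [i for i, s in enumerate(det_scores) if low_thresh <= s < high_thresh]
    all_tracks = list(range(len(track_boxes)))

    # Stage 1: every track against the high-score detections.
    first, remaining_tracks, unmatched_high = match_by_iou(
        track_boxes, all_tracks, det_boxes, high, min_iou)

    # Stage 2: only the still-unmatched tracks against the low-score detections.
    # Low-score detections that match nothing are treated as background.
    second, lost_tracks, _ = match_by_iou(
        track_boxes, remaining_tracks, det_boxes, low, min_iou)

    # unmatched_high are candidates for starting new tracks.
    return first + second, lost_tracks, unmatched_high


predicted_tracks = [(0, 0, 10, 10), (20, 0, 30, 10)]
boxes = [(1, 0, 11, 10), (21, 0, 31, 10), (50, 50, 60, 60)]
scores = [0.9, 0.3, 0.2]  # visible object, occluded object, background
print(byte_associate(predicted_tracks, boxes, scores))
# ([(0, 0), (1, 1)], [], [])
```

In the example, track 1 would be lost under plain thresholding at 0.5 because its detection only scores 0.3. The second stage recovers it, while the low-score background box is dropped because no track overlaps it.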
The main highlight of BYTE is its flexibility: it is compatible with different association methods.
Performance
ByteTrack outperforms the SORT and DeepSORT algorithms. As reported in [1], ByteTrack reaches 76.6 MOTA (Multi-Object Tracking Accuracy), while SORT and DeepSORT reach 74.6 and 75.4 MOTA respectively.
By now you should understand the main concept of ByteTrack, which is fairly simple. Let’s apply it in a real-world project.
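For readers unfamiliar with the metric, MOTA is commonly defined (following the CLEAR MOT metrics) by summing the errors over all frames t and normalizing by the number of ground-truth objects. The symbols below are the conventional ones rather than notation taken from this article: FN is missed detections, FP is false positives, IDSW is identity switches, and GT is the number of ground-truth objects.

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

A higher value is better, and both recovering occluded objects (fewer FN) and rejecting background boxes (fewer FP) raise the score.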
ByteTrack with YOLOv8 Detector
In this section we will see how to use the YOLOv8 detector with ByteTrack to track vehicles on the road, and we will also count the incoming and outgoing vehicles.
Demo video: https://www.youtube.com/watch?v=_d2rv9Rq9uw
As you can see, every new vehicle is assigned an ID, shown together with its class name and detection probability. The "in" and "out" counters show the number of incoming and outgoing vehicles.
Let’s see the code for this implementation:
Here I have used the Ultralytics library to load a YOLOv8 model trained on the COCO dataset. The Supervision library is used for loading ByteTrack and for other vision tasks such as labelling, vehicle counting, etc.
You can run the script by passing a video as input:
python sv_bytetracker_yolo.py --source_weights_path yolov8m.pt --source_video_path test_video.mp4 --target_video_path test_pred.mp4 --confidence_threshold 0.1
You can remove the class filter from the code if you want to track other classes.
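The in/out counting seen in the demo relies on the stable IDs that the tracker provides. The sketch below shows one way such a counter could work with a horizontal line; it is not the Supervision implementation (that library ships its own `LineZone` utility), and the class name, the input format, and the convention that moving downward counts as "in" are all assumptions for this example.

```python
class LineCounter:
    """Counts tracked objects whose box centre crosses a horizontal line."""

    def __init__(self, line_y):
        self.line_y = line_y
        self.last_y = {}    # tracker ID -> centre y seen in the previous frame
        self.in_count = 0   # crossings from above the line to below (assumed "in")
        self.out_count = 0  # crossings from below the line to above (assumed "out")

    def update(self, tracked):
        """tracked: iterable of (tracker_id, (x1, y1, x2, y2)) for one frame."""
        for tracker_id, (x1, y1, x2, y2) in tracked:
            centre_y = (y1 + y2) / 2
            previous = self.last_y.get(tracker_id)
            if previous is not None:
                if previous < self.line_y <= centre_y:
                    self.in_count += 1
                elif previous >= self.line_y > centre_y:
                    self.out_count += 1
            self.last_y[tracker_id] = centre_y


counter = LineCounter(line_y=100)
counter.update([(1, (0, 80, 10, 90)), (2, (0, 120, 10, 130))])   # frame 1
counter.update([(1, (0, 100, 10, 110)), (2, (0, 90, 10, 100))])  # frame 2
print(counter.in_count, counter.out_count)  # 1 1
```

Without consistent IDs across frames, the counter could not tell whether a box below the line is a vehicle that just crossed or a different vehicle that was already there, which is why fragmented trajectories directly hurt counting accuracy.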
Applications
Now that we understand ByteTrack, here are some of the applications and industries where it can be used:
- Automobile industry: Tracking vehicles on the road for traffic analysis, e.g., detecting a vehicle going in the wrong direction or analyzing traffic movement at a four-way junction.
- Production industry: Counting and tracking items on a production line.
- Customer interaction in shopping: Tracking customer movement to learn which products or categories customers are most interested in, how long they hold a product, and whether they finally buy it or return it to the shelf.
- Enhanced customer experience: Recognizing when a customer appears confused or has been searching too long for a product.
Summary
Let’s summarize the points we have learned:
- There are various MOT algorithms, such as SORT, DeepSORT, ByteTrack, etc.
- There are various techniques for object tracking: feature-based tracking, template matching, correlation-based tracking, and deep learning-based tracking.
- ByteTrack takes low-score detections into consideration for object tracking, in addition to high-score detections.
- Data association is applied to every detection box.
- In data association, similarities are computed between tracklets and detection boxes, and a matching strategy then pairs them according to those similarities.
- Similarity can be computed using the IoU between detection boxes and the tracklet boxes predicted by the Kalman filter, or using Re-ID feature distance.
- For long-range matching, appearance similarity is useful.
- For the matching strategy, the Hungarian algorithm or greedy assignment can be used.
- BYTE first applies association to the high-score detection boxes and then to the low-score detection boxes.
References
[1] Yifu Zhang, Peize Sun, et al. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv:2110.06864, 01 Dec. 2021
[2] YOLOv8: https://github.com/ultralytics/ultralytics
[3] Supervision: https://github.com/roboflow/supervision
[4] Nicolai Wojke, Alex Bewley, Dietrich Paulus. Simple Online and Realtime Tracking with a Deep Association Metric. arXiv:1703.07402, 21 Mar. 2017
[5] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, Ben Upcroft. Simple Online and Realtime Tracking. arXiv:1602.00763, 02 Feb. 2016