AI development with Richard Petty Motorsports
Recent advances in deep learning, a branch of artificial intelligence (AI), have made image recognition possible at scale. Deep learning classifies objects in images by leveraging multiple layers of artificial neural networks, where each layer is responsible for extracting one or more features of the image. In this white paper, you will learn how WWT trained and implemented a neural network to identify, classify and sort images of NASCAR race cars to give NASCAR driver Bubba Wallace and Richard Petty Motorsports insight into the driving behavior of competitors during a race.
In this paper we explore the use of AI to perform an image sorting task for a use case in NASCAR. Currently, the Richard Petty Motorsports (RPM) team acquires over 10,000 images per race and needs to sort them in real time to find the ones that contain the RPM car. This task is time-consuming for an RPM team member and ties up valuable resources that could be deployed on more critical tasks. With AI, we can do this task quickly and with a low error rate. Beyond isolating the RPM car, the same approach can detect a large pool of cars and sort their images accordingly. Starting with an unlabeled data set of thousands of images across several races, we trained and implemented a neural network to identify, classify and sort images with a high degree of accuracy.
Recent advances in deep learning have made tasks such as image and speech recognition possible at scale. Deep learning excels at detecting or classifying objects in images by leveraging multiple layers of artificial neural networks, where each layer is responsible for extracting one or more features of the image. Moreover, the development of libraries like Keras and TensorFlow has allowed data scientists to accelerate their workflows and iterate rapidly in the pursuit of even higher model accuracy.
Deep learning models can learn features automatically from the data, which is generally only possible when a significant amount of training data is available. However, the number of open-source pre-trained models, APIs, algorithms and training tools available for image classification continues to grow. Deep learning researchers have trained a number of models with a significant amount of data and computational power, and anyone can use and apply these models for their own purposes free of charge. Being able to repurpose these models effectively to cater to different business needs is an incredible opportunity for organizations across the world.
In this paper we report how we developed a solution that can classify images of NASCAR cars captured during a race. During every NASCAR race, the high-resolution car images captured are shared in a common Dropbox folder with no labels. Normally, a human on the Richard Petty Motorsports (RPM) team will sort these images during the race to find the ones that contain the RPM race car. This is a highly time-consuming task.
Using neural networks, we developed a solution that will be implemented as part of a custom application for RPM which identifies the cars, classifies them and organizes the images into folders based on their car number. The solution will free up vital resources, enabling them to spend more time examining the images as opposed to sorting them.
Part of RPM’s roadmap is to integrate the results of this model into other parts of the application that their pit crew will use during a race. For example, automatically tagging photos with the car number in the image will enable the development team building the application to add features such as clicking on a driver name and having the relevant pictures appear.
Our raw dataset included 150,000 images from NASCAR races and practices over 14 weeks. The images are organized by race and placement of the camera on the racetrack. The training and testing sets were split chronologically, with the first 11 races in the training set.
All training was performed on an NVIDIA DGX-1 using two Tesla V100 GPUs. Eight GPUs are available in the DGX-1, but we used only two because the final training data set was small, at less than 10 GB. We used the following software and Python packages to perform the training and testing:
- Python 3.
- Jupyter Notebook.
Each of these software packages is fully open source. A Docker container was loaded on the DGX-1, which contained Python 3 and the packages listed above. Jupyter environments were employed to write and test the algorithms and models.
The neural network models were trained with the following specifications:
- Batch size: 32.
- Epochs: 4.
The total training time for the 8 models described in this paper was less than 2 hours. Each prediction of the cars within an image takes less than 2 seconds, making real-time inference during a race quite feasible.
The methodology was divided into four high-level steps:
- Detecting cars within an image – An image might contain objects other than cars, or multiple cars, which needed to be separated.
- Creating a labeled data set for training the model – To create a classification model, we needed a set of labeled images; creating it was a manual and iterative process.
- Training the neural networks – Multiple neural networks were trained on the labeled data set with car images and their respective numbers.
- Using ensemble techniques to improve accuracy – The final step required us to combine the different neural networks to increase the predictive power of the solution.
Some of the challenges we faced during the implementation included:
- Multiple cars appearing in an image.
- Time required to label the images.
- Identifying different cars with very similar designs (as shown in Figure 1).
Figure 1: Different cars show similar designs across races. Correctly classifying these is a challenge.
Detecting cars within image
The first step in the process was to detect and separate out individual cars in all the images (a single image could have multiple cars or no cars at all). To perform this task, we used the MobileNet-SSD model, which combines the Single Shot Detector (SSD) framework with the MobileNet architecture. The MobileNet-SSD model is fast and efficient, and it does not require large computational resources to accomplish the object detection task.
The model is pre-trained on the COCO (Common Objects in Context) dataset. COCO was an initiative to collect natural images, that is, images that reflect everyday scenes and provide contextual information. In an everyday scene, multiple objects can be found in the same image, and each should be labeled as a different object and segmented properly. The COCO dataset provides the labeling and segmentation of the objects in the images. It covers 91 object categories, including one for cars, with ~80K training images and ~40K validation images.
The process flow for car detection is illustrated in Figure 2. The MobileNet-SSD network takes a raw race image as input and produces the locations of objects within the image as well as a classification score indicating the identity of each object. The score shown, a value between 0 and 100%, represents the probability that the detected object is a car. As a quality assurance measure, we cropped only cars detected with at least 99% confidence to construct our training set for model development. However, we propose that this threshold could be lowered in a production setting without ill effect. If necessary, a dedicated neural network could be used to handle mis-detections.
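The thresholding and cropping step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the project: the detection format (a dict with `label`, `score` and a normalized `box`) and the function name `select_car_crops` are assumptions made for the example, and only the 99% threshold comes from the text.

```python
def select_car_crops(detections, image_width, image_height, min_score=0.99):
    """Return pixel crop boxes for confident car detections.

    Each detection is a hypothetical dict with keys "label", "score" and
    "box", where "box" is (x_min, y_min, x_max, y_max) in 0-1 coordinates.
    """
    crops = []
    for det in detections:
        # Keep only objects identified as cars with high confidence.
        if det["label"] != "car" or det["score"] < min_score:
            continue
        x_min, y_min, x_max, y_max = det["box"]
        # Convert to integer pixel coordinates, clamped to the image.
        left = max(0, int(x_min * image_width))
        top = max(0, int(y_min * image_height))
        right = min(image_width, int(x_max * image_width))
        bottom = min(image_height, int(y_max * image_height))
        if right > left and bottom > top:
            crops.append((left, top, right, bottom))
    return crops


# Illustrative detections for one 1000x800 image (made-up values).
detections = [
    {"label": "car", "score": 0.995, "box": (0.25, 0.25, 0.75, 0.5)},
    {"label": "car", "score": 0.80, "box": (0.5, 0.5, 1.0, 1.0)},
    {"label": "person", "score": 0.999, "box": (0.0, 0.0, 0.25, 0.5)},
]
crops = select_car_crops(detections, image_width=1000, image_height=800)
print(crops)
```

Lowering `min_score` would keep more crops at the cost of more mis-detections, which is the trade-off mentioned for a production setting.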
Figure 2: The car detection process. A raw image was processed by the MobileNet-SSD model to detect bounding boxes around cars. Each bounding box was cropped into a separate image to be classified.
Labeling car images with an iterative process
To train a model to classify the cars into different classes based on their number, we needed a set of labeled images. The process to label our dataset of ~64K cars was iterative.
A small web app was created to label and verify images in bulk. This web app was based on the Supervising-UI app built by the USC Data Science Group. The application was hosted using the Flask web framework in Python. It can be accessed by multiple users at the same time, making the process faster and more efficient.
Step 1: Start by labeling the images one at a time by hand, using the web app, which shows a cropped image of a car as in Figure 3.
Figure 3: The web application to label each image by hand. The user can input the car number in the input box and click ‘Submit’ to get the next image.
Step 2: After 7000 images were labeled, a model was trained to predict the labels for the remaining images. This model training process will be explained in the next section. The predictions of the model were then verified in bulk with a separate web app. In this application, the user only had to click on images that were incorrectly classified by the model. This verification was a faster process as 40 images could be checked at a time (Figure 4).
Figure 4: Web app to verify images in bulk. Using the app, the user can verify 40 images at a time, simply clicking on images that are incorrectly labeled.
Step 3: Some of the images that the model had classified incorrectly were labeled again by hand.
Step 4: Steps 2 and 3 were repeated until all the images had been labeled. Each new model was more accurate than the last, and the bulk verification process would classify more of the images. In this way the labeling process formed a cycle as in Figure 5. This combination of brute force labeling and model development is a framework that can be employed for any labeling task for a supervised learning problem.
Figure 5: The iterative process of labeling images, which begins by labeling the images by hand. After some number of iterations of training a model and verifying its predictions, the final images are labeled by hand to complete the process.
In all, 64,923 images were labeled for 44 distinct cars. Because some of the cars race infrequently, their representation in the labeled image set was sparse. To create a more balanced distribution of classes for model training, cars with fewer than 1000 unique images were grouped together in a common category called “Other”. The total counts of labeled images for each unique car number are shown in Figure 6. Random samples of 1000 images from each class were used for training and testing.
Figure 6: The graph shows the distribution of the labeled dataset. The x axis has the car numbers and the y axis has the count of images per car. Cars with fewer than 1000 images were combined to create the ‘Other’ category.
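The class balancing described above can be illustrated with a short sketch. The function names, the fixed random seed and the toy data are assumptions for the example; the paper specifies only the 1000-image threshold, the “Other” category and random sampling per class. The demo uses a threshold of 3 so that it runs on a handful of labels.

```python
import random
from collections import Counter


def group_rare_classes(labels, min_count=1000, other_label="Other"):
    """Relabel every class with fewer than min_count examples as other_label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other_label for lab in labels]


def sample_per_class(items, labels, per_class=1000, seed=0):
    """Draw up to per_class random items from each class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for item, lab in zip(items, labels):
        by_class.setdefault(lab, []).append(item)
    return {
        lab: rng.sample(members, min(per_class, len(members)))
        for lab, members in by_class.items()
    }


# Toy data: car numbers as labels, file names as items (all hypothetical).
labels = ["43"] * 5 + ["18"] * 4 + ["7"] * 2 + ["96"] * 1
items = [f"crop_{i:03d}.jpg" for i in range(len(labels))]

grouped = group_rare_classes(labels, min_count=3)
sampled = sample_per_class(items, grouped, per_class=3)
print(Counter(grouped))
print({lab: len(files) for lab, files in sampled.items()})
```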
Training convolutional neural networks to classify cars
Convolutional Neural Networks or CNNs are widely used for image and video recognition, recommender systems and natural language processing.
Our ~64K labeled images were organized by race, spanning 14 races in total. The fact that the car design, for any given car, could be different depending on the race was a key challenge the classification model should be able to account for. However, since the numbers on the car do not change from race to race, the neural network should be able to pick up on this similarity and use it for classification.
Images from races 1-11 were used for training the models and images from races 12-14 were used for validation. We split the dataset chronologically because it allows us to evaluate the robustness of the model to changes in car designs in new races.
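A chronological split of this kind is simple to express in code. The sketch below assumes each labeled image is a record with a `race` index from 1 to 14; the record layout and the function name are illustrative, not taken from the project.

```python
def chronological_split(records, last_training_race=11):
    """Split records by race index: early races train, later races validate."""
    train = [r for r in records if r["race"] <= last_training_race]
    validation = [r for r in records if r["race"] > last_training_race]
    return train, validation


# One hypothetical record per race, for illustration only.
records = [
    {"image": f"race{race:02d}_crop.jpg", "race": race, "car": "43"}
    for race in range(1, 15)
]
train, validation = chronological_split(records)
print(len(train), len(validation))
```

Unlike a random split, this keeps every image from races 12-14 out of training, so the validation score reflects how the model handles car designs it has not seen.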
We ensured that each car had the same number of examples in the training and test sets (that is, irrespective of the total number of images we had for each car, all cars would have 1000 training examples and 500 testing examples). Moreover, cars with fewer than 1000 images in total were combined in an “Other” category for classification, as cars with fewer training examples were very unlikely to be identified correctly by the model.
What are pre-trained models?
A pre-trained model is a model previously created to solve a similar use case. Instead of building a model from scratch to solve a similar problem, the model trained on another problem can be used as a starting point. For image recognition, there are multiple pre-trained models available in Keras which have been trained on the ImageNet dataset. Below is a summary of the accuracy of the different models:
Figure 7: Pre-trained models available in Keras and their accuracy on the ImageNet validation dataset. Top-N accuracy means that a prediction counts as “correct” when the correct class is among the N classes with the highest predicted probabilities. The parameter count is the total number of weights and biases across all layers.
What is ImageNet?
The ImageNet data set has been widely used to build various neural network architectures. Because it is a very large dataset (1.2M images), models trained on it can be leveraged as generalized models. The goal of these models is to correctly classify images into 1,000 separate object categories. These 1,000 image categories represent object classes that we come across in our day-to-day lives, such as dogs, cats, various household objects and vehicle types. Since one of the categories covered in this dataset is cars, using ImageNet made sense for the use case at hand.
The use of pre-trained models for new applications is based on the concept of transfer learning. A neural network gains knowledge or learns from data, and this knowledge is stored as the “weights” of the network. These weights are organized according to the different layers of the neural network architecture. We can ‘transfer’ this learning by using pre-trained weights as a starting point when training a network on a new use case. Depending on the amount of data available and the complexity of the problem being solved, one can choose to freeze the first few layers (that is, leave their weights unchanged during backpropagation) and train only the last few layers.
The two major benefits of using pre-trained models are:
- Less training time, as the majority of the layers have pre-defined weights
- No need for a large training data set, because the pre-trained model has already leveraged a large dataset
Models used and fine-tuning
We employed 4 models, each trained on the ImageNet data set: VGG16, VGG19, InceptionV3 and InceptionResNetV2. This methodology lets us leverage the learning of the existing models and then repurpose them for our specific use case. We chose to retrain the last 6 layers of the VGG models, as these models have fewer than 30 layers in total. After a few iterations with the Inception models, we decided to retrain all of their layers, as these are very deep networks and retraining all the layers improved performance. The idea behind fixing the initial layers of a convolutional neural network is that the initial layers have learned features common to generic images, like edges and basic shapes. The deeper layers of the network learn the specific designs and parts of objects. Retraining these layers focuses the model on distinguishing the specific race car designs that we need to classify.
Figure 8: The neural network architecture showing the process of fine-tuning pre-trained models. The boxed layers come from an existing model, and additional layers are added to train the new model to classify the specific images. Layers from the previous model may be trained as well, or their weights may be frozen.
We appended the following custom layers to the feature layers of each pre-trained model:
- Flatten Layer – to flatten the output of the feature layers
- Dense Layers – with a ReLU activation function
- Dropout Layer – with a dropout rate of 50% to prevent overfitting
- Dense Layer – with a softmax activation function and number of classes of cars in our data to specify the dimension of the output vector
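As a rough sketch of how such a head could be attached to a pre-trained VGG16 in Keras, consider the code below. The 50% dropout rate, the ReLU and softmax activations and the choice to retrain the last 6 VGG layers come from the text; the single dense layer of 256 units, the 224x224 input size, the Adam optimizer and the helper names `freeze_plan` and `build_classifier` are assumptions made for illustration. TensorFlow is imported inside the function so the freezing logic can be inspected without it installed.

```python
def freeze_plan(num_layers, num_trainable):
    """Return one flag per layer: True if the layer should be retrained."""
    num_trainable = min(num_trainable, num_layers)
    return [i >= num_layers - num_trainable for i in range(num_layers)]


def build_classifier(num_classes, num_trainable=6, dense_units=256,
                     weights="imagenet"):
    """Build a VGG16-based classifier with a custom head (requires TensorFlow)."""
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    # Feature layers only; the original ImageNet classifier head is dropped.
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))

    # Freeze all but the last num_trainable layers of the base network.
    plan = freeze_plan(len(base.layers), num_trainable)
    for layer, trainable in zip(base.layers, plan):
        layer.trainable = trainable

    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(dense_units, activation="relu"),  # size is an assumption
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Optimizer and loss are assumptions; the paper does not state them.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# The freezing pattern for a toy 5-layer network with 2 trainable layers.
print(freeze_plan(5, 2))
```

With TensorFlow installed, `build_classifier(num_classes=29)` would return a compiled model that could then be trained with the batch size of 32 and 4 epochs listed earlier.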
In addition to the original color images, we also trained each of the models using only grayscale images. The desired effect is to make these models more resilient to changes in the design of a car, as they must focus more on the shape of features like the number displayed. This results in 8 distinct models, using color and grayscale images with each of the 4 pre-trained models.
Improving accuracy with an ensemble model
As the final step we ensembled all the models to acquire an improved prediction. An ensemble model uses several “weak” predictions to create a stronger prediction that is less sensitive to possible overfitting of individual models. A simple ensemble takes the average of each prediction. A more sophisticated ensemble adds weights to the predictions of each model, based on the accuracy of the models or by training these weights for optimal performance. In general, ensembles tend to yield better results when there is a significant diversity among the models.
Stacking is a form of ensemble which involves training a learning algorithm to combine the predictions of several other learning algorithms. We used a simple form of stacking which involved taking the average of scores from each of the 8 models to get the final score.
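The averaging step can be sketched in a few lines. The function name, the dict-based score format and the example scores are assumptions for illustration; passing `weights` gives the weighted variant mentioned above, while the default reproduces the simple average.

```python
def ensemble_average(model_scores, weights=None):
    """Average per-class scores across models and pick the top class.

    model_scores is a list with one dict per model, mapping class -> score.
    """
    if weights is None:
        weights = [1.0] * len(model_scores)  # simple (unweighted) average
    total = sum(weights)
    averaged = {
        cls: sum(w * scores[cls] for w, scores in zip(weights, model_scores)) / total
        for cls in model_scores[0]
    }
    prediction = max(averaged, key=averaged.get)
    return prediction, averaged


# Made-up scores from three models for cars A, B and C.
model_scores = [
    {"A": 0.5, "B": 0.25, "C": 0.25},
    {"A": 0.25, "B": 0.5, "C": 0.25},
    {"A": 0.75, "B": 0.125, "C": 0.125},
]
prediction, averaged = ensemble_average(model_scores)
print(prediction, averaged)
```

Here the second model prefers car B, but the averaged scores still favor car A.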
Figure 9: Calculating the final score using an ensemble of different models. Here A, B and C represent specific cars, and each model determines a score for how likely the image is to belong to each class. The ensemble can be a simple average of scores (equal weight to each model) or a weighted average of scores, and the final prediction is the car with the maximum score.
An additional benefit of an ensemble approach is that agreement between individual ensemble components can be used as a measure of confidence in the overall score. When fewer models agree, this suggests we should be less confident in the final prediction. For this use case, it may be preferred to discard images where our confidence is low, as the accuracy of the prediction is more important than making an attempt for every image.
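One way to compute this agreement measure is to count how many models' top-scoring class matches the ensemble prediction. The sketch below assumes the same simple averaging ensemble and uses made-up scores; the function name and the idea of a fixed cut-off for discarding images are illustrative.

```python
def ensemble_with_agreement(model_scores):
    """Return the averaged prediction and how many models agree with it.

    model_scores is a list with one dict per model, mapping class -> score.
    """
    classes = model_scores[0].keys()
    averaged = {
        cls: sum(scores[cls] for scores in model_scores) / len(model_scores)
        for cls in classes
    }
    prediction = max(averaged, key=averaged.get)
    # A model "agrees" when its own top-scoring class is the ensemble's choice.
    agreeing = sum(
        1 for scores in model_scores if max(scores, key=scores.get) == prediction
    )
    return prediction, agreeing


# Made-up scores from four models for car numbers 43 and 18.
model_scores = [
    {"43": 0.75, "18": 0.25},
    {"43": 0.625, "18": 0.375},
    {"43": 0.875, "18": 0.125},
    {"43": 0.25, "18": 0.75},
]
prediction, agreeing = ensemble_with_agreement(model_scores)
keep_image = agreeing >= 4  # hypothetical cut-off: require all four models
print(prediction, agreeing, keep_image)
```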
The accuracy across models is shown in Figure 10. The overall accuracy of the ensemble prediction was 81%, without considering agreement of the individual models. The accuracy improves as the number of models which agree with the ensemble increases, as shown in Figure 11. For instance, when only 3 models agree, the accuracy of the prediction is less than 25%. However, when all 8 agree (which happens 60% of the time) the accuracy reaches 96%.
Figure 10: Accuracy of individual models on the validation data set. The ensemble of all 8 models performs better than any individual model.
Figure 11: Accuracy vs model agreement. The x-axis shows the number of models which agree with the ensemble. The blue bars display the accuracy in each case, and the orange bars display the percent of images in that category. Accuracy increases as the number of models that agree increases, which validates the effectiveness of an ensemble.
The solution comes together in a scoring engine that can be used to detect and classify cars in all future races. The engine accesses the raw images captured during any race and uses the MobileNet-SSD model to detect cars and create bounding boxes around each. The cropped images are then scored by each of the 8 models. Stacking is used to generate the ensemble prediction for the car number, and the images are then sorted into car-specific folders, which can be accessed by the RPM team in real time.
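The final sorting step might look like the sketch below, which copies each scored image into a folder named after its predicted car number. The folder naming scheme, the tuple format of `predictions` and the `min_agreement` value of 6 are assumptions for illustration; the demo uses placeholder files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def sort_into_car_folders(predictions, output_root, min_agreement=6):
    """Copy images into per-car folders, skipping low-agreement predictions.

    predictions is a list of (image_path, car_number, agreeing_models).
    """
    output_root = Path(output_root)
    written = []
    for image_path, car_number, agreeing in predictions:
        if agreeing < min_agreement:
            continue  # discard predictions the ensemble is unsure about
        car_dir = output_root / f"car_{car_number}"
        car_dir.mkdir(parents=True, exist_ok=True)
        target = car_dir / Path(image_path).name
        shutil.copy2(image_path, target)
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    raw = tmp / "raw"
    raw.mkdir()
    for name in ("img_001.jpg", "img_002.jpg"):
        (raw / name).write_bytes(b"placeholder")  # stand-in for image data
    predictions = [
        (raw / "img_001.jpg", "43", 8),  # all 8 models agree
        (raw / "img_002.jpg", "18", 3),  # too little agreement, skipped
    ]
    written = sort_into_car_folders(predictions, tmp / "sorted")
    sorted_names = [p.relative_to(tmp / "sorted").as_posix() for p in written]
print(sorted_names)
```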
The process flow for scoring new images is shown in Figure 12. The final model creates a bounding box around detected and classified cars. An example is shown in Figure 13.
Figure 12: Process for scoring images for the new races. Raw images are fed to the scoring engine, which detects bounding boxes, predicts the car label, and outputs the ensemble prediction. If enough models agree with the prediction, the image is written to a specified folder for each car. The end user can then easily find the images containing a given car. The cropped images are also saved and organized, so that further validation and model training can be done in the future.
Figure 13: Output of the detection and classification model with car label. This image shows Bubba Wallace driving the RPM car, number 43. The model creates a bounding box around each car and classifies them into one of the 29 classes.
We have highlighted the basic methodology we followed to modify pre-trained models for our specific image classification use case. This methodology helped us leverage the learning of the existing models and then repurpose the models to our specific use case. Even with a relatively small data set we were able to achieve an accurate model. Major benefits of using pre-trained models include a shorter training time, and a higher accuracy even with a small set of data.
Using a neural network first to detect the different cars in each image helped us to filter out irrelevant images. We benefitted from the fact that pre-trained neural networks have already been trained to accurately detect bounding boxes of cars, one of the categories in the COCO dataset.
Labeling a large dataset for building the neural network was a time-consuming and challenging task, but the iterative process of hand-labeling and using the intermediate models to predict labels made it relatively quick. A web app enabled multiple users to label and verify images simultaneously.
We also found that ensemble learning improves predictive performance by combining several models. Furthermore, the images on which all models agreed had a significantly higher accuracy (96%), allowing us to lower our error rate at the cost of discarding images with less agreement among the models. In this use case, quality is preferable to quantity, and it is helpful to know which predictions to trust the most. Our solution points to a common theme in AI across many business use cases: AI and neural network solutions can be enhanced by traditional machine learning techniques, either in pre-processing the data or in post-processing the results, as we have done here with an ensemble technique.
The current solution was implemented using on-premises infrastructure. As a next step, we are in the process of deploying this solution using platforms provided by Amazon, Google and Azure to assess the feasibility and ease of implementation in the cloud. Using cloud-based platforms like these, one can also connect the live feed of NASCAR images to the app and perform real-time predictions.
- Howard et al. MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv:1704.04861, 2017. https://arxiv.org/abs/1704.04861
- Liu et al. SSD: Single Shot MultiBox Detector, arXiv:1512.02325, 2016. https://arxiv.org/abs/1512.02325
- Simonyan, Karen and Zisserman, Andrew. Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2015. https://arxiv.org/abs/1409.1556
- Vikas Gupta. Keras tutorial: Fine-tuning using pre-trained models. 2018. https://www.learnopencv.com/keras-tutorial-fine-tuning-using-pre-trained-models/
- COCO Data Set. http://cocodataset.org/
- Keras applications. https://keras.io/applications/
- Flask web development framework. http://flask.pocoo.org/
- MobileNet-SSD. Caffe implementation of MobileNet SSD detection network. https://github.com/chuanqi305/MobileNet-SSD
- Supervising UI. USC Information Retrieval and Data Science Group. https://github.com/USCDataScience/supervising-ui