Over the last two years, WWT has been performing research related to machine learning and artificial intelligence through its AI R&D program. The people involved in the program come from different parts of WWT including management consulting, data science, engineering, architecture and software development. 

Moreover, the projects cover a wide variety of use cases from generative race car image models to damage classification on construction equipment using deep learning. To perform this diverse set of cutting-edge projects on a wide variety of data sets, an AI R&D platform was developed leveraging progressive, flexible, and adaptable processes and technologies.

Generally, the team focuses on one research problem at a time, with two to three researchers rotating every six weeks. This provides an opportunity for deep focus on the current research problem with input and ideas from multiple people. It also means that the team at large is both deeply engaged in the project at hand and looking forward to the research findings to come.

When the COVID-19 crisis hit the United States in early March 2020, the AI R&D team was most of the way through a multi-month project involving a novel application of reinforcement learning. Instead of continuing on the same path, the entire team was able to pivot and identify multiple research projects related to the COVID-19 crisis. This quick pivot involved steering a large team of data scientists from their current focus on the reinforcement learning initiative to multiple COVID-19 related projects involving a wide range of modeling paradigms. 

The modern capabilities of the AI R&D platform allowed data to be ingested quickly and gave the data scientists the right tools and technology for each problem set's unique requirements. The team ultimately tackled a diverse set of problems spanning natural language processing, image processing and pandemic forecasting.

Overall, WWT's AI R&D effort related to COVID-19 involved three separate projects:

  1. A Kaggle competition leveraging the COVID-19 Open Research Dataset (CORD-19)
  2. A multi-week hackathon initiated by India's Department of Science and Technology (which included three separate challenges)
  3. A video analytics research project to obtain more accurate temperature measurements of humans

In particular, the Kaggle competition involved a data mining application: searching through thousands of coronavirus-related academic research articles to answer critical research questions. The hackathon also had a data mining component, very similar to the Kaggle competition, along with two other challenges, namely a pandemic forecasting project and an activity detection project.

Finally, the video analytics project developed a novel technique for identifying body temperatures using image processing from an infrared (IR) camera and a visible light camera. Below we provide a more detailed description of three of the projects that the team worked on.

Data mining (Kaggle competition + one component of the Hackathon)

The data mining effort encompassed both the previously mentioned Kaggle competition and a subsequent hackathon, which shared the same overall task and dataset. Using the CORD-19 set of coronavirus-related academic articles, the problem involved searching through thousands of articles to answer specific research questions about the coronavirus. The team focused on a subset of the CORD-19 dataset that consisted of 29,000 academic abstracts.

In natural language processing (NLP), the task of querying through a large set of documents is often framed as a document similarity problem, which involves determining if two documents or sentences are similar to each other. Here we considered the similarity between a particular research query and each of the academic articles in the CORD-19 dataset.

This task was complicated by the fact that none of the data had traditional labels to give the team guidance on which particular articles might be more useful for particular queries. Without labels or expert medical knowledge, we took an ensemble approach by combining six state-of-the-art NLP techniques. Ensemble methods combine and average predictions from multiple models in order to produce a final prediction. 

The ensemble included methods such as topic modeling (see Figure 1), word2vec, and Google's signature NLP model, BERT. Each of these NLP methods transforms a particular abstract and query into a numerical representation so that the distance between the two can be calculated. Queries and abstracts with a smaller distance are considered more likely matches.

Figure 1: Example of topic modeling clusters using the CORD-19 data
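
The distance-and-average idea described above can be illustrated with a minimal sketch. It assumes each method has already turned the query and every abstract into vectors. The toy vectors, the function names and the choice of cosine distance are illustrative assumptions, not the team's actual implementation.

```python
import numpy as np

def cosine_distances(query_vec, abstract_vecs):
    """Cosine distance between one query vector and each row of abstract_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    a = abstract_vecs / np.linalg.norm(abstract_vecs, axis=1, keepdims=True)
    return 1.0 - a @ q

def ensemble_rank(query_vecs, abstract_vecs_per_method):
    """Average per-method distances; return abstract indices, best match first."""
    per_method = [
        cosine_distances(q, a)
        for q, a in zip(query_vecs, abstract_vecs_per_method)
    ]
    mean_distance = np.mean(per_method, axis=0)
    return np.argsort(mean_distance), mean_distance

# Toy data (hypothetical): two "methods" embed the same three abstracts
# in spaces of different dimension. Real embeddings would come from
# models such as topic models, word2vec or BERT.
query_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0, 0.0])]
abstract_vecs_per_method = [
    np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]),
]

ranking, mean_distance = ensemble_rank(query_vecs, abstract_vecs_per_method)
print(ranking)  # abstract indices ordered from closest to farthest
```

Averaging raw distances is the simplest way to combine methods. If the methods produce distances on very different scales, averaging ranks or normalized scores may be a better choice.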

During the hackathon, the developed method was brought into production through the use of Elasticsearch. Elasticsearch is an open-source, low-latency search engine that can drastically reduce search times when dealing with a large number of documents. Using Elasticsearch, the team was able to develop a Python Flask web application that allowed users to search through the 29,000 coronavirus-related abstracts in under ten seconds using multiple state-of-the-art NLP methods.

Pandemic forecasting

The pandemic forecasting task involved predicting the cumulative number of confirmed COVID-19 cases in various locations across different states in India, as well as the number of resulting fatalities for future dates. The team used a crowdsourced patient database containing daily state-level counts of confirmed, recovered and deceased COVID-19 cases in India.

An epidemic outbreak is a textbook example of S-shaped population growth: the number of infected cases grows at an increasing rate at first, then at a decreasing pace, and finally halts at a maximum population size. With this underlying assumption, multiple population growth functions, such as the 3-parameter logistic, 4-parameter Richards and 6-parameter Baranyi-Roberts functions, were fitted to actual data using non-linear least squares estimation (see Figure 2). Additionally, holdout validation was performed to test model stability over time.

Figure 2: Pandemic forecasting modeling pipeline
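
As a rough illustration of the curve-fitting and holdout steps, the sketch below fits a 3-parameter logistic function with SciPy's `curve_fit` (a non-linear least squares routine) and scores it on the last few days of a series. The data is synthetic, and the parameterization, starting guesses, helper names and error metric are assumptions made for this example, not the team's pipeline.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    """3-parameter logistic: K = final size, r = growth rate, t0 = inflection day."""
    return K / (1.0 + np.exp(-r * (t - t0)))

def fit_logistic(days, cases):
    """Fit the logistic curve by non-linear least squares."""
    # Starting guesses are heuristics chosen for this sketch.
    p0 = [cases.max() * 2.0, 0.1, np.median(days)]
    params, _ = curve_fit(logistic, days, cases, p0=p0, maxfev=10000)
    return params

def holdout_error(days, cases, holdout=7):
    """Fit on all but the last `holdout` days; return mean absolute % error on them."""
    params = fit_logistic(days[:-holdout], cases[:-holdout])
    predicted = logistic(days[-holdout:], *params)
    actual = cases[-holdout:]
    return float(np.mean(np.abs(predicted - actual) / actual) * 100.0)

# Synthetic cumulative case counts (hypothetical, noise-free) for demonstration.
days = np.arange(60, dtype=float)
cases = logistic(days, 1000.0, 0.2, 30.0)

K, r, t0 = fit_logistic(days, cases)
error_pct = holdout_error(days, cases, holdout=7)
print(f"K={K:.1f}, r={r:.3f}, t0={t0:.1f}, holdout error={error_pct:.2f}%")
```

Repeating `holdout_error` over several different cut-off dates gives a picture of how stable the fit is as new data arrives.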

The proposed model was evaluated on two states in India, Maharashtra and Telangana (see Figure 3). The state of Telangana, where the daily increase in confirmed cases has already reached its peak, proved to be a good fit, with stable validation results showing an error of less than 2 percent across all 7 holdout windows. Conversely, Maharashtra, still witnessing exponential growth, produced a wide range of errors across the holdout samples, up to 8 percent.

Figure 3: Cumulative confirmed cases in the states of Telangana and Maharashtra over time

Based on the best model for the above two states, a prediction was created for the maximum number of cases and the associated timeline in Indian states. This analysis allows for data-driven and informed policy planning for COVID-19 cases in India. The above outlined process can be automated and deployed for a continuous forecasting process. Additional research has also been done in estimating the case mortality rate using survival analysis.

Video analytics 

As COVID-19 continues to spread into a global pandemic, governments and organizations look for ways to safely detect potentially infected individuals. Current solutions for infection detection suffer from tradeoffs between accuracy and practicality. Lab testing and contact temperature readings provide the best results, but are impractical for large crowds and can put front line staff into contact with infected individuals. 

Thermal camera temperature readers and other long-distance sensors allow safe and quick detection but are often inaccurate. The COVID-19 video analytics (COVIDeo) project proposes a combination of the two approaches, providing a balance between accuracy and practicality.

The developed detection system relied on sampling the temperature of the eye canthus region. The eye canthus is the corner of the eye closest to the nose bridge and is considered the most accurate indicator of internal body temperature. Using an Infrared (IR) camera, we can sample the eye canthus temperature and get a close estimate of an individual's internal temperature, allowing us to determine whether an individual is displaying signs of a fever. 

While reading a temperature in infrared is not new, detecting the eye canthus in the infrared space is a more difficult problem. Infrared images tend to blur or outright remove features we see in visible light. As a result, existing biometric detection AI methods perform poorly on infrared images. In order to leverage the accuracy of existing facial detection models, we first detect an individual's eye canthus using a visible light camera and a pretrained eye detection model. We then transform the canthus location in the visible light image to the infrared image and sample the canthus temperature. 

Our system uses two camera modules, Intel's RealSense D435 and the FLIR E95, to capture the RGB-depth and infrared spaces, respectively. The RGB (visible) image has a resolution of 1280 x 720 pixels, while the thermal image is captured at 464 x 348 pixels. Depth information is essential for transforming the eye canthus detection from RGB to IR space. 
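
To see why depth matters for this mapping, consider a minimal sketch based on the standard pinhole camera model: a pixel in the RGB image is back-projected to a 3D point using its depth, moved into the IR camera's coordinate frame, and projected onto the IR image. The intrinsic matrices, the 5 cm camera offset and the function name below are hypothetical values chosen for illustration. They are not calibration data from the actual cameras, and the team's transformation may differ.

```python
import numpy as np

def rgb_pixel_to_ir_pixel(u, v, depth_m, K_rgb, K_ir, R, t):
    """Map an RGB pixel (u, v) with known depth (meters) to IR pixel coordinates."""
    # Back-project the pixel into a 3D point in the RGB camera frame.
    point_rgb = depth_m * (np.linalg.inv(K_rgb) @ np.array([u, v, 1.0]))
    # Rigid transform from the RGB camera frame to the IR camera frame.
    point_ir = R @ point_rgb + t
    # Project the 3D point onto the IR image plane.
    projected = K_ir @ point_ir
    return projected[0] / projected[2], projected[1] / projected[2]

# Hypothetical intrinsics: focal lengths and principal points are made up,
# with principal points placed at the centers of the two image sizes.
K_RGB = np.array([[900.0, 0.0, 640.0],
                  [0.0, 900.0, 360.0],
                  [0.0, 0.0, 1.0]])
K_IR = np.array([[450.0, 0.0, 232.0],
                 [0.0, 450.0, 174.0],
                 [0.0, 0.0, 1.0]])
# Hypothetical extrinsics: cameras are parallel, offset 5 cm along the x axis.
R = np.eye(3)
T = np.array([0.05, 0.0, 0.0])

# The same RGB pixel lands on different IR pixels depending on depth.
near = rgb_pixel_to_ir_pixel(640, 360, 1.0, K_RGB, K_IR, R, T)
far = rgb_pixel_to_ir_pixel(640, 360, 2.0, K_RGB, K_IR, R, T)
print(near, far)
```

Because the two cameras sit at different positions, the horizontal shift between the images shrinks as the subject moves farther away, so a fixed pixel offset would not be accurate at every distance.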

Figure 4: Overall data and modeling pipeline for the human temperature video analytics project

In the future, we plan on doing multiple large-scale tests using the current architecture in order to acquire more thermal data. Additional thermal data will enable the team to train state-of-the-art deep learning models on top of IR images. This process will eliminate the step of RGB to IR transformation and will make the system more cost effective.

Bringing it all together

Overall, the various COVID-19 related projects pursued at WWT allowed the team to develop and enhance skills in a wide variety of areas while helping push forward analytic-based research related to the crisis. In addition, the diverse nature of the data and the projects' requirements stressed the AI R&D platform's capabilities and services. 

The efforts of the WWT AI R&D team can help empower the analytic-based approach that many throughout the world have taken to combating the global COVID-19 crisis. While many of these projects will continue to be developed, the skills developed by the team on these projects can be applied in an assortment of future projects. Most importantly, the insights created across these five projects can have a positive impact on the scientific breakthroughs needed to get us through this challenging crisis.