In this white paper

We've recently worked with several customers in the industrial space to build predictive maintenance solutions which allow users to better operate and maintain their machinery and vehicles. Through these experiences we have defined a process beginning with use case definition and covering all aspects of our standard data science lifecycle, including discovery, model training and validation, production and model refresh.

Figure 1: Data science lifecycle

In the following sections, we will examine each aspect of that lifecycle in the context of predictive maintenance, providing lessons learned and best practices. We will examine the applicability of predictive maintenance to industrial machinery, considering common operational constraints, and propose how to successfully set up a scalable, sustainable predictive maintenance solution.

Technology advancements bring new predictive maintenance opportunities

The business drivers of predictive maintenance are not new, e.g., reducing costly repairs or increasing the uptime and availability of machines. The recent proliferation of predictive maintenance and related solutions can instead be attributed to technological advancements. Increases in processing power and network bandwidth mean that larger amounts of data are being collected, ingested and processed than ever before.

The implementation of artificial intelligence (AI) and machine learning (ML) models has become more widespread, and presenting data has become easier. A predictive maintenance model can take in data from sensors integrated into new machines (greenfield) or added to existing machines (brownfield). This wealth of information can be processed in central locations, which have been improved by distributed compute and streaming data architectures. The growth of the cloud provides elastic data storage and the ability to tap into new services and applications.

Algorithm development has grown alongside hardware and software developments. Some algorithms, which may be decades old, have seen a resurgence as computing power has caught up. The ability to run and train these models rapidly has led to many advances not just in the day-to-day running of data science but also in the models themselves. 

The increased availability of algorithms provides new ways to process data and predict trends. AI and ML models are leveraged by data scientists to better understand current baselines and predict future events. 

Ultimately, predictive maintenance recommendations need to be delivered to decision makers who can act to avoid adverse events. Improvements in the usability of front-end applications and the ease with which they can be built in a rapid, agile manner have contributed to acceptance by field operations.

When solutions are deployed in the manufacturing space, they can improve productivity, yield and availability for our clients. Next, we explore our predictive maintenance lifecycle. We aim to provide step-by-step guidelines of our recommended process to tailor your solution for success.

Use case definition 

Before developing a solution, we find it crucial to clearly define the business objectives upfront. The use case definition does not need to capture every detail, but it must act as a set of guiding principles for the modeling effort. As with other analytics use cases, this step is essential for predictive maintenance.

Specifically, time must be spent upfront with stakeholders to understand the existing maintenance situation and what effect machine failure or outages will have on the business operations. There should be an evaluation of the current uptime and maintenance schedules, as well as a determination of the priority metrics or KPIs (e.g., cost, uptime, number of events) to address, and what the expected gain may be. It is also helpful to have a model tying the predictive maintenance efforts to the expected gains, to provide a hypothesis which can then be proven or disproven.
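To make the hypothesis concrete, a simple back-of-the-envelope model can tie the expected event reduction to dollar gains. The sketch below is one way to frame it; every figure is an illustrative assumption, not client data.

```python
# Hypothetical model tying predictive maintenance efforts to expected gains.
# All figures are illustrative assumptions, not client data.

unplanned_events_per_year = 12        # current baseline of unplanned outages
avg_cost_per_event = 50_000           # repair cost plus lost production (USD)
expected_event_reduction = 0.30       # hypothesis: 30% fewer unplanned events
solution_annual_cost = 120_000        # platform, modeling and support (USD)

expected_savings = unplanned_events_per_year * avg_cost_per_event * expected_event_reduction
net_gain = expected_savings - solution_annual_cost

print(f"Expected annual savings: ${expected_savings:,.0f}")
print(f"Net gain (hypothesis to prove or disprove): ${net_gain:,.0f}")
```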

Discovery

Predictive maintenance models need to capture as many factors as they can to predict adverse events. Usually, the data required will be spread across your enterprise. A robust data engineering setup will have pipelines built horizontally and vertically to pull data into an analytics platform.

A good range can include data pulled from components and sensors at the machine level, shift and operator schedules, business-unit and financial data, and external information such as environmental conditions. In an increasingly AI-driven world, it is crucial to set up your platform to ingest unstructured data such as images, video and sound as well. Advanced techniques can be used to extract features from unstructured data that can then be analyzed alongside structured data. The Discovery step explores datasets that may have never been used before (e.g., log files) but could be useful to determine baseline operations.

Once ingested, a range of statistical and visualization techniques should be used to explore the data. Standard techniques like bivariate analysis, categorical classifications, ARIMA and power spectral density can help us understand the time series data better. A correlation matrix is a way to examine linkages between variables and show some initial indications of predictors. Power spectral analysis is another effective discovery tool: it converts a machine's time series data from the time domain to the frequency domain, allowing us to narrow in on the frequency ranges with the strongest variation.

Figure 2: Example power density spectral graph
Figure 3: Example correlation matrix
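As a minimal sketch of these exploration steps, the snippet below computes a correlation matrix and a Welch power spectral density estimate. The sensor names, random data and 100 Hz sampling rate are illustrative assumptions; in practice the data frame would come from the ingested sensor feeds.

```python
import numpy as np
import pandas as pd
from scipy import signal

# Hypothetical sensor data: one row per timestamp, one column per sensor.
df = pd.DataFrame({
    "vibration": np.random.randn(10_000),
    "temperature": np.random.randn(10_000),
    "pressure": np.random.randn(10_000),
})

# Correlation matrix: initial indication of linkages between variables.
print(df.corr())

# Power spectral density (Welch's method): move from time domain to frequency
# domain to spot dominant frequency ranges in a vibration signal.
freqs, psd = signal.welch(df["vibration"], fs=100.0, nperseg=1024)
print(f"Dominant frequency: {freqs[np.argmax(psd)]:.1f} Hz")
```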

Feature engineering

Once data is thoroughly explored, it needs to be engineered for an AI/ML model. An integral part of discovery is feature engineering. Complex machines can break down in different ways, with differing levels of severity, for different lengths of time and with different root causes. 

The datasets may need significant adjustments to handle these exceptions. This process of making raw datasets usable is known as feature engineering. For example, sensor readings are a common source of data in predictive maintenance solutions. The readings can be transformed into an almost unlimited number of features, e.g., the most recent value, the average over a time window, or the count of times the reading exceeded a threshold.
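The sketch below illustrates those three interpretations using pandas rolling windows. The file name, column names and the 90-degree threshold are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical raw sensor readings for one machine, indexed by timestamp.
readings = (
    pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])
    .set_index("timestamp")
    .sort_index()
)

features = pd.DataFrame(index=readings.index)
# Most recent value, used as-is.
features["temp_latest"] = readings["temperature"]
# Average over a trailing one-hour window.
features["temp_avg_1h"] = readings["temperature"].rolling("1h").mean()
# Count of times the reading exceeded a threshold in the trailing 24 hours.
features["temp_alerts_24h"] = (readings["temperature"] > 90).astype(int).rolling("24h").sum()
```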

For image and video analytics, feature engineering may involve the definition of ROIs (Regions of Interest) and design of image transformations (e.g. edge detection, corner detection, motion/speed detection), as well as selection of threshold values. For text analytics, common feature engineering techniques may involve word embedding (e.g. word2vec, doc2vec), frequency-based variables (e.g. tf-idf) or more advanced language modeling (e.g. Google BERT). 
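For text sources such as free-form maintenance logs, a minimal tf-idf example with scikit-learn might look like the following; the log entries are invented purely for illustration, and the resulting features can sit alongside the structured sensor features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical maintenance log entries written by operators.
logs = [
    "bearing noise reported during second shift",
    "hydraulic pressure dropped below threshold, reset performed",
    "routine inspection, no issues found",
]

# Frequency-based text features (tf-idf).
vectorizer = TfidfVectorizer(max_features=500, stop_words="english")
text_features = vectorizer.fit_transform(logs)
print(text_features.shape, vectorizer.get_feature_names_out()[:10])
```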

To illustrate, WWT was able to help one utility company set up their data effectively for modeling. The discovery lifecycle relies on domain and operations teams to tailor datasets to the right context. This process can also help to build trust and collaboration between analytics and operations teams. 

Ultimately, the inputs and outputs should be well understood by all parties, and a focus should be placed on what is most applicable in the field. Typically, it is wise to err on the side of caution and create a rich master dataset with many variables to examine. Discovery is also where we recommend identifying features for the model inputs (independent variables) and targets (dependent variables).

Model training

Once you have feature-rich datasets, you have a choice of techniques. The prediction you are able to make will depend on the data available, future modeling environment considerations and the operational context. In general, it is possible to estimate the expected life of a machine (either as remaining useful lifetime or as a probability of failure within a given time horizon), classify anomalous behavior to alert operators, or create survival models that can enhance classical 'x thousand hours' guidelines.

Figure 4: Example survival analysis output
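As a hedged sketch of such a survival model, the example below fits a Kaplan-Meier estimator with the lifelines library. The component lifetimes and failure flags are invented for illustration; real inputs would come from maintenance and failure records.

```python
import pandas as pd
from lifelines import KaplanMeierFitter

# Hypothetical component history: hours in service and whether a failure was observed
# (0 = still running or retired without failure, i.e. censored).
history = pd.DataFrame({
    "hours_in_service": [1200, 3400, 5100, 800, 4600, 2950],
    "failed":           [1,    1,    0,    0,   1,    0],
})

kmf = KaplanMeierFitter()
kmf.fit(durations=history["hours_in_service"], event_observed=history["failed"])

# Estimated probability that a component survives beyond 3,000 hours.
print(kmf.predict(3000))
```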

Models are trained using traditional methods like regression, clustering and classification algorithms, or advanced deep learning models like neural networks. Modern data science libraries vastly simplify the running of these models, and many deep learning techniques can be implemented with a few lines of code. 

The implication is that an effective team can train multiple models in parallel, determine the most suitable model (or an ensemble of models) for the use case at hand, and measure performance across a range of factors including accuracy and reliability. The age-old concept of "garbage in, garbage out" still applies, and data science skills are required to train and validate the models and accurately connect to data sources.

With increased use of Git repositories and team collaboration on cloud platforms, it has become easier to train complex deep learning models on faster computing systems like GPUs or at the edge. Classification models such as random forests or gradient boosting techniques like XGBoost can easily be applied and improved upon during the model training step with these hardware and collaboration tools available.
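A minimal scikit-learn sketch of this training step is shown below. The synthetic dataset, class weights and hyperparameters are placeholders rather than recommended settings; in practice X and y would be the engineered features and failure labels from Discovery.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in for an engineered predictive maintenance dataset (failures are rare, ~5%).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Random forest classifier; class_weight helps with the rarity of failure events.
model = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```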

When it comes to heavy machinery, it might be difficult to find historical failure events; there may be very few, or practically none at all. In that case, we recommend training an anomaly detection model instead of a classification model: it learns normal behavior and raises an alert when an anomaly occurs in real time.
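One way to sketch this, assuming feature vectors summarizing known-normal operation are available, is an Isolation Forest. The random data and the 1 percent contamination setting are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical features from known-normal operation: one row per time window.
normal_operation = np.random.randn(10_000, 8)

detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(normal_operation)

# Score incoming readings in (near) real time: -1 flags an anomaly, 1 is normal.
new_reading = np.random.randn(1, 8)
if detector.predict(new_reading)[0] == -1:
    print("Anomaly detected: raise alert for operator review")
```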

Recent developments in analytical modeling include the 'explainability' of models. This reduces the black-box nature of some techniques and benefits both data scientists, who can fine-tune model inputs and parameters, and front-line operators, who can use the additional output to make better decisions and mitigate failures.
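As one hedged example of explainability, the snippet below applies scikit-learn's permutation importance to a stand-in gradient boosting model; in practice it would be run against the trained predictive maintenance model and its validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in dataset and model for illustration.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much validation performance drops when each feature is shuffled,
# giving a model-agnostic view of which inputs drive the predictions.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```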

Each model class explains its decisions in different ways, and a recommendation engine layer can be deployed on top of the model output to weigh the value of each output for the end-user community. Finally, any decision maker should consider the resources at hand, e.g., the size of the data and the compute power needed, when identifying which models to promote through model validation, testing and production.

Model validation

Analytics modeling is a highly iterative process, even after a model or class of models has been trained. The overarching idea is to create initial models utilizing all the available inputs, and then to filter out those which are not contributing to model performance. This process of validating the model is performed in a stepwise fashion and can involve multiple techniques to evaluate performance, such as train/test splits and cross-validation.

Since predictive maintenance relates to field equipment, field testing can also be planned to test the model's results. Due to the iterative nature of this approach, it may also be necessary to return to Discovery and adjust our variables.

To ensure that model performance will be retained in production, and to better inform the model building process, it is important to partition datasets for validation. One of the main concerns in model building is overfitting — meaning a model may be too specifically trained on past events and not be resilient enough to deal with the underlying trends of the machine/environment that will continue in the future. Validation datasets should include out-of-sample and out-of-time records, used to test models for scenarios in which new machines are added or to look forward in time, respectively.
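A minimal sketch of building these partitions with pandas is shown below. The file name, column names and the 80/20 proportions are assumptions for illustration.

```python
import pandas as pd

# Hypothetical modeling dataset with machine identifiers and event timestamps.
data = pd.read_parquet("modeling_dataset.parquet")  # assumed columns: machine_id, timestamp, features, target

# Out-of-time partition: hold back the most recent records to test forward-looking performance.
cutoff = data["timestamp"].quantile(0.8)
in_time = data[data["timestamp"] <= cutoff]
out_of_time = data[data["timestamp"] > cutoff]

# Out-of-sample partition: hold back whole machines to simulate adding new equipment.
holdout_machines = data["machine_id"].drop_duplicates().sample(frac=0.2, random_state=42)
out_of_sample = in_time[in_time["machine_id"].isin(holdout_machines)]
train = in_time[~in_time["machine_id"].isin(holdout_machines)]
```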

Figure 5: Data partition schematic

Production

The model training and validation phases include the research and development of analytics models, but not the operational production of those models. Production will typically involve either informing or integrating with existing systems, or the creation of a new system for visibility and alerts.

Any model production step involves a significant data engineering effort: establishing the data pipeline, running the model at the desired frequency, generating the flag and reporting it as an alert. This requires collaboration among multiple parties within the organization: for example, machine engineers who know how the data is generated; data engineers who know how the data is sourced and where it is stored; solution architects who help set up the entire workflow; and integration specialists who set up the external tools, which may include both reports/dashboards and alert displays for front-line users. The entire ecosystem of data storage and data flow needs to be established for the model to be set up and integrated in a timely fashion.
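The sketch below shows one shape such a scheduled scoring job might take, run at the desired frequency by a scheduler such as cron or Airflow. The file paths, column names and the 0.8 risk threshold are illustrative assumptions.

```python
import joblib
import pandas as pd

def score_batch(feature_path: str, model_path: str, threshold: float = 0.8) -> pd.DataFrame:
    """Load the trained model, score the latest features and return machines to flag."""
    model = joblib.load(model_path)                  # previously trained and validated model
    features = pd.read_parquet(feature_path)         # latest engineered features per machine
    features["failure_risk"] = model.predict_proba(features.drop(columns=["machine_id"]))[:, 1]
    # Alerts above the threshold are handed off to dashboards or alerting systems.
    return features.loc[features["failure_risk"] >= threshold, ["machine_id", "failure_risk"]]

if __name__ == "__main__":
    alerts = score_batch("features_latest.parquet", "pdm_model.joblib")
    print(alerts)
```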

Previously, resolving inconsistencies between the production environment and the development environment took significant effort and time. The arrival of modern containerization utilities has significantly reduced the platform dependency of ML products and solutions. Moreover, with container orchestration tools such as Kubernetes, Docker Swarm or OpenShift, the dynamic provisioning and scaling of compute resources becomes feasible.

Constraints around environment size, the ability to centralize model operations and other operational context will define how the models can run in practice. The production vision should be a consideration in the 'Model Training' phase of work, mentioned previously. In complex environments, it is usually best to maintain as many common elements as possible in the production environment. Utilizing existing alarm systems, reports or visualizations to deliver recommendations can be very helpful to improve the acceptance and impact of predictive maintenance models.

The modeling environment may involve a large collection of historical data in order to capture patterns and make predictions. In contrast, and depending on the data and models, the production environment might involve only a small amount of compute to process an equation and determine output. More advanced machine and deep learning techniques may involve some 'edge processing', and it is possible to create a hybrid approach where rudimentary analytics are performed on the edge and more complex analyses are performed centrally.

Advances in GPU chips, such as those used in self-driving cars, have greatly improved edge processing power. In other applications, historical data may not be available or applicable. For real-time or near-real-time applications, data streaming is another engineering aspect that needs to be incorporated into model production; for those applications we often use data streaming technologies such as Kafka and Spark Streaming.
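As a hedged sketch of the streaming ingest side, the snippet below consumes sensor readings from a Kafka topic with the kafka-python client. The topic name, broker address and message schema are assumptions for illustration.

```python
import json
from kafka import KafkaConsumer  # kafka-python package

# Consume sensor readings from a Kafka topic for near-real-time scoring.
consumer = KafkaConsumer(
    "machine-sensor-readings",                       # assumed topic name
    bootstrap_servers=["broker:9092"],               # assumed broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    reading = message.value   # e.g. {"machine_id": "A12", "vibration": 0.7, ...}
    # Score the reading with the deployed model and publish an alert if needed.
    ...
```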

For both model training and model production, WWT has ample experience across a wide range of computing platforms, from individual computers to on-prem servers, public cloud, hybrid cloud and edge devices. WWT also has expertise in best practices throughout the data science development cycle, as captured in our MLOps operational methodology.

Model refresh

Once predictive maintenance is put into production, analytics performance monitoring should be put in place to ensure models are meeting the KPIs determined during use case definition. This feedback loop can be used to shape further development of features, linking us back to the Discovery and Model Training stages. Many algorithms will greatly benefit from the additional data generated during production. After a model reduces our machine failure rate, there may be new context and novel ways in which our systems might fail, which need to be continually captured.
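A minimal sketch of such monitoring is shown below. The outcomes file, column names and the 60 percent precision target are hypothetical; in practice the target would be the KPI agreed during use case definition.

```python
import pandas as pd

# Alerts joined with technician findings (hypothetical table and columns).
outcomes = pd.read_csv("alert_outcomes.csv", parse_dates=["alert_time"])

# Precision of alerts raised over the last 30 days.
recent = outcomes[outcomes["alert_time"] > outcomes["alert_time"].max() - pd.Timedelta(days=30)]
precision = recent["confirmed_failure"].mean()   # share of alerts confirmed by technicians

KPI_PRECISION = 0.60                             # illustrative target agreed with stakeholders
if precision < KPI_PRECISION:
    print(f"Alert precision {precision:.0%} below target {KPI_PRECISION:.0%}: trigger model refresh review")
```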

For one of our clients, a part of the machine was replaced six months after an anomaly detection model was implemented, and the model started giving multiple false positives. It is important to note that, perhaps more so than for other AI solutions, predictive maintenance depends on the physical attributes of machines, and thus will benefit from a refresh.

It is important to keep a feedback loop with the equipment engineering team, who are aware of the high-level logic of the model and can communicate with the support team in case any major hardware changes happen to the equipment.

Depending on production and operations, model updates can also take place at regular intervals. The update process is usually a shortened version of the data analytics lifecycle of modeling and production, where features are tested for their predictive value and, in this case, compared against features used in the initial or subsequent versions of the model. Some machine learning algorithms are predicated on the ability to consume the feedback loop and automatically identify new behavior patterns to improve predictions.

Throughout the model refresh phase, there will be a need to share and distribute knowledge between different teams who may have competing interests throughout the analytics lifecycle. We recommend investing in proper knowledge management and code documentation which will help ensure the continued development and transition of our models. Knowledge management will also support any automation efforts, e.g., new deployments, new sensors. As any predictive maintenance program matures, there will be much more scope to automate.

Closing thoughts

Our clients operate complex machinery, systems and components, which benefit from predictive maintenance applications. The benefits of increased uptime and reduced failure can generate value beyond operational time and dollars saved. A common result of initiating this kind of advanced analytics is known as the "halo effect": beyond the initial predictive maintenance scope, there may be additional benefits in user satisfaction, reduced turnover of resources and improvements in other semi-related activities as a focus is placed on more efficient operations.

Predictive maintenance applies to a range of verticals, such as mining, aerospace and manufacturing. Our successes implementing predictive maintenance applications in commercial and government organizations have created a wealth of knowledge around the subject, and we follow a clear direction to build a predictive maintenance solution. For example, WWT devised and implemented a solution to detect broken teeth on mining shovels, improving upon the industry standard. Read more about our approach, technology and results.

Looking further forward, it would likely benefit our clients to develop a framework of how to approach predictive maintenance, so standards and best practices can be defined to make the most of shared resources. This would create great potential to scale across the various machines and environments and achieve optimal return on an investment in predictive maintenance.