Creating an MLOps Solution Embedded With Security Hygiene
As more organizations incorporate machine learning into their existing data science stacks, security can become an afterthought. Traditional cybersecurity tools, such as anti-malware solutions and web application firewalls, are not enough on their own. The focus needs to expand to general security hygiene and to protection against attack scenarios, attack surfaces and attack vectors. This article presents security hygiene as a lens that developers at every stage of the MLOps cycle can adopt to ensure that models are protected against potential attack scenarios.
Attack scenarios describe how threats to data and systems, including Advanced Persistent Threats (APTs), relate specifically to MLOps. Attack vectors are the methods or pathways an attacker uses, such as ransomware, email attachments, SQL injection, man-in-the-middle attacks or denial-of-service attacks. Attack surfaces are the places where an attack can occur. They can be physical, such as a data center or a cloud service provider; human, targeted through social engineering or by threat actors; or digital, such as a misconfigured server or cloud service.
Security hygiene in the scope of MLOps considers all the attack vectors and surfaces, such as the model training infrastructure, databases, ports and patch levels.
Security hygiene starts as the data is prepared for ingestion: poorly labeled, disorganized data increases the chance of mistakes and errors in modeling, and over time turns the data lake into a data "swamp".
As with other data analytics solutions, attackers can attempt to reverse engineer the dataset and expose masked or confidential data. It is best practice to feed the model encrypted data from the beginning, rather than treating encryption as an afterthought applied to the output.
Before applying rules and deciding who should be allowed access to what, complete an airtight RACI (Responsible, Accountable, Consulted, Informed) matrix and define a review process for it. The review cadence should be influenced by the type of data in the MLOps solution and the type of data created from the trained model.
It is important to understand who should get access and how users should obtain it, and to revisit, by reviewing the RACI, whether users still need it. For example, if proper Identity and Access Management (IAM) policies are not applied to a pipeline and the Principle of Least Privilege (PoLP) is not followed, access to data can be manipulated and then blocked. Critical data types, such as PII or critical model training data, could also be deleted, hurting the training process and output.
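The access rules described above can be sketched as a deny-by-default check. This is a minimal illustration rather than a real IAM implementation: the role names, permission strings, `Grant` record and the 90-day review interval are all hypothetical and stand in for whatever the RACI and the platform's IAM service actually define.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical role-to-permission map: each role lists only what it needs.
ROLE_PERMISSIONS = {
    "data_engineer": {"raw_data:read", "raw_data:write"},
    "data_scientist": {"training_data:read", "model:write"},
    "ml_ops": {"model:read", "model:deploy"},
}

# Assumed review cadence; in practice this follows from the RACI review process.
REVIEW_INTERVAL = timedelta(days=90)


@dataclass
class Grant:
    user: str
    role: str
    last_reviewed: date


def is_allowed(grant: Grant, permission: str, today: date) -> bool:
    """Deny by default: allow only if the grant was reviewed recently
    and the role explicitly lists the requested permission."""
    if today - grant.last_reviewed > REVIEW_INTERVAL:
        return False  # stale grant: access lapses until it is reviewed again
    return permission in ROLE_PERMISSIONS.get(grant.role, set())
```

Because no role in this sketch lists a delete permission, a request such as `training_data:delete` is refused for every user, which is the kind of protection against deleted training data that the paragraph above calls for.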
Security must be at the forefront when deciding on the platform on which MLOps solutions are designed and built. Data scientists, engineers and vendors must collaborate and identify the possible attack vectors across the data, assets and pipeline to ensure that consistent protections are put in place throughout the process. All critical touch points for the data and model should be protected, with applicable vendor solutions aligned. In the case of an APT-type incident where the entire system is down, it is essential to have a backup plan to recover the model and data.
Good security hygiene should apply to all elements of the solution, with the ability to zoom in on specific areas, such as PoLP, wherever possible. For instance, the handoff between ML and Ops is typically error prone. More attention should be applied at the point of handoff for PoLP.
Adversarial attacks are another threat to MLOps solutions. These attacks take advantage of models' deep neural networks by inserting a deceptive input during the training process that developers might overlook. This poisons the ML system and disrupts model training: a small piece of bad data can snowball as the model trains, magnifying the issue.
It is important that the MLOps solution remains robust and resilient against adversarial attacks. More robust machine learning models and solutions can be developed through adversarial training, which is the insertion of noise or modified data into the model training process. In this way, the model learns to recognize false or maliciously augmented data and to prevent that data from impacting the core solution features. While this still leaves solutions vulnerable to black box attacks, in which the attacker can see only the model's inputs and output labels, a method known as ensemble adversarial learning can help prevent this. Ensemble adversarial learning is a method in which multiple classifiers are trained together and their weights combined to yield a final model. Prevention and detection are key in protecting the model in the building, training and output stages.
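To make adversarial training concrete, the sketch below trains a small logistic regression model on each clean sample and on a perturbed copy of it. The perturbation uses the fast gradient sign method (FGSM), which is one common way to generate modified data and is an assumption here, not something this article prescribes. The model, the function names and the hyperparameters are illustrative only; a production system would typically apply the same idea to a deep network through an ML framework.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x) -> float:
    """Probability that x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Shift each feature by eps in the direction that increases the log loss."""
    err = predict(w, b, x) - y  # gradient of the log loss w.r.t. the logit
    return [xi + eps * sign(err * wi) for xi, wi in zip(x, w)]


def train(data, eps=0.0, epochs=200, lr=0.1, seed=0):
    """SGD on log loss; when eps > 0, also train on an FGSM copy of each sample."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            batch = [(x, y)]
            if eps > 0:
                # The adversarial copy is built from the current parameters.
                batch.append((fgsm(w, b, x, y, eps), y))
            for xb, yb in batch:
                err = predict(w, b, xb) - yb
                w = [wi - lr * err * xi for wi, xi in zip(w, xb)]
                b -= lr * err
    return w, b
```

Calling `train(data, eps=0.5)` on a list of `(features, label)` pairs returns weights fitted to both the clean and the perturbed points, while `eps=0.0` gives ordinary training for comparison.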
An example of good practice is using a data pipeline to query and process data in place instead of moving data off the server. Moving data invites unsafe behaviors, such as transferring data samples by USB drive or email, or retaining data on a personal device beyond what policy allows for the sake of efficiency. These behaviors can lead to lost data, regulatory violations and more.
Security hygiene is a holistic process. While it is true that security hazards differ greatly between different parts of the MLOps process, it is essential to ensure that the flow is protected in its entirety.
While many security requirements are unique to specific cases, many common protective measures can be applied where suitable. One common measure is data classification. Data should be classified based on business sensitivity, usage, information dependencies, compliance requirements and other relevant factors to ensure that critical and/or confidential data and systems are handled properly throughout the MLOps process. Critical data elements must be protected during translation or consumption, and a variety of steps will achieve this.
The choice among protective measures such as encryption, masking and tokenization should depend on the classification of the data. Encryption should be used for the most critical and sensitive data, while less critical data may not require such steps, and masking or tokenizing may suffice.
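As a rough sketch of how classification can drive the choice of protection, the example below maps a classification level to a technique and implements masking and tokenization with the Python standard library. The classification labels, the `PROTECTION_BY_CLASS` mapping and the function names are hypothetical. Tokenization is shown as a keyed hash (HMAC-SHA256), which is only one possible approach; vault-based tokenization is another. Encryption is intentionally left unimplemented, on the assumption that it would be delegated to a vetted cryptography library or a managed key service.

```python
import hashlib
import hmac

# Hypothetical mapping from classification level to protection technique.
PROTECTION_BY_CLASS = {
    "restricted": "encrypt",
    "confidential": "tokenize",
    "internal": "mask",
    "public": "none",
}


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of the value."""
    if len(value) <= visible:
        return "*" * len(value)
    hidden = len(value) - visible
    return "*" * hidden + value[hidden:]


def tokenize(value: str, key: bytes) -> str:
    """Replace the value with a deterministic keyed token (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect(value: str, classification: str, key: bytes) -> str:
    """Apply the technique assigned to the value's classification level."""
    technique = PROTECTION_BY_CLASS[classification]
    if technique == "none":
        return value
    if technique == "mask":
        return mask(value)
    if technique == "tokenize":
        return tokenize(value, key)
    # Encryption is not sketched here; use a vetted library or key service.
    raise NotImplementedError("encrypt with a vetted cryptography library")
```

The same input and key always produce the same token, so tokenized columns can still be joined or grouped without exposing the underlying value.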
Another common measure is to implement a zero-trust policy. This policy covers two parts – data and access. When data moves across the environment, the data should always be checked or sampled for validation. Before granting access, the user, device and network security should always be validated.
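The two halves of a zero-trust policy can be sketched as two small checks. The first recomputes a SHA-256 digest when data arrives and compares it to the digest recorded at the source. The second grants access only when the user, device and network checks all pass. The `AccessRequest` fields and function names are hypothetical, and how each signal is actually established (credentials, device management, network controls) is outside this sketch.

```python
import hashlib
import hmac
from dataclasses import dataclass


def verify_payload(payload: bytes, expected_sha256: str) -> bool:
    """Recompute the digest on arrival instead of assuming the data is unchanged."""
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, expected_sha256)


@dataclass(frozen=True)
class AccessRequest:
    user_authenticated: bool  # assumed to come from the identity provider
    device_compliant: bool    # assumed to come from device management
    network_trusted: bool     # assumed to come from network controls


def grant_access(request: AccessRequest) -> bool:
    """Every signal must pass on every request; none is inferred from another."""
    return (
        request.user_authenticated
        and request.device_compliant
        and request.network_trusted
    )
```

A digest check detects changes in transit but says nothing about whether the content itself is valid, so it would normally sit alongside schema or sampling checks on the data.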
The last common security measure is redundancy. Redundancy is not always essential but should be considered by a company if the data and model are business critical. If so, the redundant copy should be created within a cyber vault that separates the physical architecture and digital assets from the production environment. This vault can then be activated in place of the production environment when that environment is compromised.
The exciting part of MLOps is how it encourages a security-first mindset. By tackling the whole lifecycle, we can assess security risks at various touch points. With security it is always important to over-emphasize, over-communicate, over-prepare and be over-cautious, so the highest values of efficiency and automation from MLOps are realized.