When a deployed ML model produces poor predictions, the cause can range from bugs that are typical of any program to problems that are specific to ML. Perhaps data skews and anomalies are causing model performance to degrade over time, or the data format is inconsistent between the model’s native interface and the serving API. If models aren’t monitored, they can fail silently.
When a model is embedded in an application, issues like these can create poor user experiences. If the model is part of an internal process, they can negatively affect business decision-making.
Software engineering has many processes, tools, and practices for ensuring software quality, all of which help make sure that software works in production as intended. These include software testing, verification and validation, and logging and monitoring.
In ML systems, the tasks of building, deploying, and operating the system present additional challenges that require additional processes and practices. ML systems are particularly data-dependent because they derive their decision-making logic from data automatically, and they’re dual training-serving systems, which makes them susceptible to training-serving skew. Models that drive automated decisions are also prone to going stale as the data they were trained on stops reflecting reality.
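To make training-serving skew concrete, here’s a minimal sketch (the feature, helper names, and values are hypothetical, not taken from the guidelines) of a parity check that catches the common case where the training pipeline and the serving path compute the same feature slightly differently:

```python
import math

# Hypothetical duplicate implementations of the same feature, as often happens
# when the batch (training) and online (serving) code paths are written separately.
def session_length_training(events):
    """Batch/training pipeline: average session length in minutes."""
    durations = [(e["end"] - e["start"]) / 60.0 for e in events]
    return sum(durations) / len(durations)

def session_length_serving(events):
    """Online/serving path: a subtly different rounding step introduces skew."""
    durations = [round((e["end"] - e["start"]) / 60.0) for e in events]
    return sum(durations) / len(durations)

def check_feature_parity(raw_input, tolerance=1e-6):
    """Fail fast if the two code paths disagree on the same raw input."""
    train_value = session_length_training(raw_input)
    serve_value = session_length_serving(raw_input)
    if not math.isclose(train_value, serve_value, abs_tol=tolerance):
        raise ValueError(
            f"training-serving skew: training={train_value:.4f}, serving={serve_value:.4f}"
        )

events = [{"start": 0, "end": 95}, {"start": 0, "end": 200}]
check_feature_parity(events)  # raises ValueError: the two implementations disagree
```

Running the same parity check over a sample of logged serving requests is one simple way to surface this class of bug before it silently degrades predictions.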
These additional challenges mean that you need different kinds of testing and monitoring for ML models and systems than you do for other software systems—during development, during deployment, and in production.
Based on its work with customers, Google has created a comprehensive collection of guidelines for each process in the MLOps lifecycle. The guidelines cover how to assess, ensure, and control the quality of your ML solutions, and the complete set is published on the Google Cloud site.
To give you an idea of what you can learn, here’s a summary of what the guidelines cover:
- Model development: These guidelines are about building an effective ML model for the task at hand by applying relevant data preprocessing, model evaluation, and model testing and debugging techniques.
- Training pipeline deployment: These guidelines discuss ways to implement a CI/CD routine that automates unit tests for model functions and integration tests for the training pipeline components (see the unit-test sketch after this list). The guidelines also help you apply an appropriate progressive delivery strategy for deploying the training pipeline.
- Continuous training: These guidelines provide recommendations for extending your automated training workflows with steps that validate the new input data for training and that validate the new output model produced after training (see the validation sketch after this list). The guidelines also suggest ways to track the metadata and the artifacts that are generated during the training process.
- Model deployment: These guidelines address how to implement a CI/CD routine that automates the process of validating the compatibility of the model and its dependencies with the target deployment infrastructure. These recommendations also cover how to test the deployed model service (see the smoke-test sketch after this list) and how to apply progressive delivery and online experimentation strategies to decide on a model’s effectiveness.
- Model serving: These guidelines concern ways to monitor the deployed model throughout its prediction serving lifetime to check for performance degradation and dataset drift (see the drift-monitoring sketch after this list). They also provide suggestions for monitoring the efficiency of the model service.
- Model governance: These guidelines concern setting model quality standards. They also cover techniques for implementing procedures and workflows to review and approve models for production deployment, as well as managing the deployed model in production.
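For the training pipeline deployment bullet above, here’s a rough sketch of the kind of unit test a CI/CD routine might run against a model function, written as pytest-style tests (the `scale_features` function and its expected values are hypothetical, not from the guidelines):

```python
import numpy as np

def scale_features(features, mean, std):
    """Hypothetical model function: standardize features using training statistics."""
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return (np.asarray(features, dtype=float) - mean) / std

def test_scale_features_centers_and_scales():
    features = np.array([[10.0, 2.0], [20.0, 4.0]])
    mean = np.array([15.0, 3.0])
    std = np.array([5.0, 1.0])
    np.testing.assert_allclose(
        scale_features(features, mean, std), [[-1.0, -1.0], [1.0, 1.0]]
    )

def test_scale_features_handles_zero_variance():
    # A constant feature should not produce division by zero or NaNs.
    np.testing.assert_allclose(
        scale_features([[3.0]], mean=np.array([3.0]), std=np.array([0.0])), [[0.0]]
    )
```

A CI/CD routine would run tests like these on every commit, alongside integration tests that execute the pipeline end to end on a small data sample.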
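For the continuous training bullet, here’s a minimal sketch of input data validation and a model-quality gate (the schema, column names, and thresholds are assumptions for illustration):

```python
import pandas as pd

# Hypothetical schema for new training data: column -> (min, max) allowed values.
EXPECTED_RANGES = {"age": (0, 120), "purchase_amount": (0.0, 10_000.0)}

def validate_training_data(df):
    """Return a list of data issues; an empty list means the batch passes validation."""
    issues = []
    for column, (low, high) in EXPECTED_RANGES.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
            continue
        if df[column].isna().any():
            issues.append(f"null values in column: {column}")
        if ((df[column] < low) | (df[column] > high)).any():
            issues.append(f"values outside [{low}, {high}] in column: {column}")
    return issues

def validate_new_model(new_metric, current_metric, tolerance=0.01):
    """Quality gate: only promote the new model if it isn't meaningfully worse."""
    return new_metric >= current_metric - tolerance

batch = pd.DataFrame({"age": [34, 52, 140], "purchase_amount": [19.99, 250.0, 73.5]})
print(validate_training_data(batch))                             # flags the out-of-range age
print(validate_new_model(new_metric=0.91, current_metric=0.92))  # True: within tolerance
```

In a real pipeline, a failed validation step would stop the run and alert the team rather than print to stdout.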
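For the model deployment bullet, here’s a sketch of a post-deployment smoke test of the model service (the endpoint URL, payload, and latency budget are hypothetical; adapt them to your serving API’s actual contract):

```python
import time
import requests

# Hypothetical endpoint and request body for a deployed prediction service.
ENDPOINT = "https://ml-serving.example.com/v1/models/churn:predict"
SAMPLE_PAYLOAD = {"instances": [{"age": 34, "purchase_amount": 19.99}]}

def smoke_test_prediction_service(endpoint=ENDPOINT, payload=SAMPLE_PAYLOAD, max_latency_s=2.0):
    """Send one known-good request and check status, response shape, and latency."""
    start = time.monotonic()
    response = requests.post(endpoint, json=payload, timeout=10)
    latency = time.monotonic() - start

    assert response.status_code == 200, f"unexpected status: {response.status_code}"
    body = response.json()
    assert "predictions" in body, f"missing 'predictions' key in response: {body}"
    assert len(body["predictions"]) == len(payload["instances"]), "prediction count mismatch"
    assert latency <= max_latency_s, f"latency {latency:.2f}s exceeds budget of {max_latency_s}s"
    return body["predictions"]
```

A test like this can gate a progressive rollout: promote the new model version to more traffic only after the smoke test passes against the canary deployment.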
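For the model serving bullet, here’s a sketch of a simple dataset-drift check using the population stability index, or PSI (the feature distributions and alert threshold are assumptions for illustration):

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between a training-time baseline and a window of serving-time values.

    A common rule of thumb (an assumption, tune it for your use case):
    PSI < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    edges = np.histogram_bin_edges(baseline, bins=bins)
    baseline_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    current_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor empty buckets to avoid log(0).
    baseline_pct = np.clip(baseline_pct, 1e-6, None)
    current_pct = np.clip(current_pct, 1e-6, None)
    return float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))

rng = np.random.default_rng(42)
baseline_scores = rng.beta(2, 5, size=50_000)   # feature distribution at training time
serving_scores = rng.beta(3, 4, size=5_000)     # shifted distribution observed in serving
psi = population_stability_index(baseline_scores, serving_scores)
print(f"PSI = {psi:.3f}")  # a value above ~0.25 would typically trigger an alert
```

Computed per feature over a sliding window of serving traffic, a metric like this can feed the monitoring dashboards and alerts that keep a deployed model from failing silently.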
For the full list of Google’s recommendations, see the document Guidelines for developing high-quality ML solutions.