The Machine Learning Testing Landscape

Daniel Angelov
Oct 12, 2022

Developing trustworthy machine learning models is the bottleneck for numerous machine learning applications, so quality assurance needs to become a core aspect of the development process. The community is already aware of that necessity, but information and tooling around ML testing are still scarce. In this article, we provide a comprehensive overview of the ML testing landscape and discuss various methods and available tools.

MLOps & Testing

Machine learning has transformed many industries and is breathing down the necks of all companies that have not yet embraced it. In the same way that digitalization in the last decade clearly differentiated those that embraced the change from those that did not, machine learning is now creating a new divide, setting apart the companies that use the power of advanced analytics and predictions to improve their processes and deliver new products to their customers.

So there is a wave of new teams developing AI-powered solutions and wondering how to transition their software development processes to support machine learning projects. MLOps questions around data, features, labels, models, testing, pipelines, and versioning keep popping up, but everyone agrees that taking machine learning to production is a challenging task and so the first few model versions are “expected” to fail. More experienced teams have already established their software stacks, often using popular specialized tools such as V7 or SuperAnnotate for data labeling, Weights and Biases for experimentation, SageMaker or VertexAI for training, and Arize or Fiddler for monitoring. Both types of teams, though, usually perform testing only by evaluating the model on static test sets, which yields little information about the real-world performance of the model.

Why ML Testing? 

Frankly, the current practices of developing ML models follow more or less the following workflow:

(1) divide up a dataset into subsets

(2) train on one until a satisfactory performance is reached on another

(3) deploy to the real world, expecting the system to fail

(4) wrap things up in a monitoring solution 

(5) wait for failures while managing the possible business fallback 

(6) use the few obtained flagged issues to infer some failure mode 

(7) collect new data (if the process allows for it) and repeat.

To top it all off, training machine learning models often feels like a game of whack-a-mole where fixing one issue introduces regressions on parts of the problem space where the model previously worked fine.

Training models often feels like a game of whack-a-mole where fixing one issue introduces regression on previously working parts of the problem space.

We tend to use machine learning for problems whose solutions are hard to describe and program explicitly. The only way to trust a machine to solve a problem that’s so ineffable that you’ve resorted to machine learning is to simply test everything. Extensively testing your machine learning models beyond just evaluating them on static test sets is absolutely crucial to gain confidence and to stop treating your end users as beta testers, which is still often the case.

If you are the consumer of an ML model, chances are you are a beta tester without knowing it!

More importantly, having a clear track record of discovering model vulnerabilities and fixing them before even thinking about deployment makes ML teams eager to deploy their latest and greatest models instead of dreading it. 

One of the main criteria in industry that team leads use to quantify the efficiency of a machine learning dev cycle is the time it takes for a team to go from ideation within a meeting to a deployed initial solution, obtaining some real-world performance results. This metric is often referred to as “time to market” or “time to production”. It’s so important because it describes how quickly an organization can try out new ideas. So what are the ways we can speed up this process, yet produce more reliable results by focusing on testing?

Thinking about testing in a more principled fashion (as other industries such as software engineering and semiconductors do) is key to tightening the feedback loop and increasing both the speed of iteration and the reliability of the resulting models.

So what are integration tests, functional tests, unit tests, or stress tests for AI? 

Those are definitely different from the corresponding tests around the integration of AI within a software system! Many ML teams attempt to shoehorn ML testing into established software QA frameworks, but ultimately realize there is a broader picture to be considered.

We are quite used to thinking about unit tests vs. integration tests vs. system tests when building a software system, but it is much less clear what the spectrum of testing methods for ML looks like. So let’s start our journey across the landscape of ML testing methods!

ML Testing Methods

The spectrum of ML testing methods.

When comparing different types of testing methods we should consider 3 important aspects:

  • Confidence. How much confidence do we gain by performing a particular test? For example, knowing that our model performs well on a small dataset gives us little confidence about its performance in the real world.
  • Complexity. How difficult is it to set up and perform a particular test? For example, it is relatively easy to ensure that your data does not contain missing values, but this on its own gives you little confidence in your final model.
  • Cost. How much does it cost you to set up and perform a particular test? If a test runs in seconds and it costs you a fraction of a cent to run it then you can probably perform it whenever any change occurs. In contrast, if it takes you several months of engineering and infrastructure time, then you should be much more careful.

Data Checks

The simplest set of tests within an ML pipeline makes sure that your datasets do not have missing or invalid data points and that the data is being processed correctly. Errors here range from erroneous data loading to using outdated values for normalizing images. For example, more than one team has shared the anecdote of a training pipeline that loads images using OpenCV (in BGR mode) while inference relies on Pillow (which loads images in RGB mode). Such bugs do not lead to very bad performance; they lead to a consistently underperforming model and are thus a lot harder to identify.

And what cannot be identified and tracked, cannot be fixed.
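
As a minimal sketch of such a data check, the snippet below compares what a training loader and an inference loader return for the same reference image. Both loaders are hypothetical stand-ins written in plain Python; in a real pipeline they would wrap calls such as OpenCV's cv2.imread (BGR order) and Pillow's Image.open (RGB order). The function names and the tiny reference image are assumptions made for illustration only.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]

# A tiny reference image in RGB order, used only to compare the two loaders.
REFERENCE_RGB: Image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]


def load_for_training(rgb_source: Image) -> Image:
    """Hypothetical training loader that returns pixels in BGR order."""
    return [[(b, g, r) for (r, g, b) in row] for row in rgb_source]


def load_for_inference(rgb_source: Image) -> Image:
    """Hypothetical inference loader that returns pixels in RGB order."""
    return [list(row) for row in rgb_source]


def bgr_to_rgb(image: Image) -> Image:
    """Reverse the channel order of every pixel."""
    return [[(r, g, b) for (b, g, r) in row] for row in image]


def loaders_agree(train_image: Image, inference_image: Image) -> bool:
    """True when both pipelines feed identical pixels to the model."""
    return train_image == inference_image


train_raw = load_for_training(REFERENCE_RGB)
inference_raw = load_for_inference(REFERENCE_RGB)

# Without a conversion step the mismatch is silent at runtime but easy to test.
print(loaders_agree(train_raw, inference_raw))
print(loaders_agree(bgr_to_rgb(train_raw), inference_raw))
```

Running a check like this on one known image in continuous integration is cheap, and it turns a silent accuracy loss into an explicit test failure.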

Another candidate for unit testing is data preprocessing - from text normalization and categorization of fields to transformation of images, cropping, and filtering of point clouds. A great practice is to try to come up with an inverse function f⁻¹ for any data transformation f in your pipeline and verify that f⁻¹(f(x)) == x (up to numerical tolerance for floating-point data). This can be done with established Python testing tools such as pytest and unittest. In addition, pandera can ensure your data follows a certain schema. For pipelines and reproducible ML, a great tool is ZenML.
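
The round-trip idea can be sketched as follows, assuming a simple min-max scaling step as the transformation f. The functions fit_min_max, scale, and unscale are illustrative names, not part of any library, and the test is written in the plain-function style that pytest discovers automatically.

```python
import math
from typing import List, Tuple


def fit_min_max(values: List[float]) -> Tuple[float, float]:
    """Return the (min, max) of the data, used as scaling parameters."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot scale constant data: min equals max")
    return low, high


def scale(values: List[float], low: float, high: float) -> List[float]:
    """The transformation f: map values into the range [0, 1]."""
    return [(v - low) / (high - low) for v in values]


def unscale(scaled: List[float], low: float, high: float) -> List[float]:
    """The inverse f^-1: map scaled values back to the original range."""
    return [s * (high - low) + low for s in scaled]


def test_scale_round_trip() -> None:
    x = [3.0, -1.5, 42.0, 0.0]
    low, high = fit_min_max(x)
    restored = unscale(scale(x, low, high), low, high)
    # Floating-point arithmetic is not exact, so compare with a tolerance.
    assert all(
        math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
        for a, b in zip(x, restored)
    )
```

Comparing with a tolerance rather than strict equality is a deliberate choice here: for floating-point transformations an exact == check would fail for reasons unrelated to the pipeline's correctness.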

Data Understanding

Understanding where the data comes from and what underlying mechanism generates it can provide useful insights into the expected variability the model has to handle. Moreover, various analysis methods can be employed to find important or interesting data points. This can be done through automatic flagging combined with some manual iteration and visualization. Unsupervised clustering, PCA, t-SNE, or metadata analysis are great starting points. Examples of good tools here are Tableau, Great Expectations, or one of the many data visualization tools such as D3, matplotlib, and holoviz.

The main problem with data exploration is that rare cases are hard to find, and thus hard to annotate as a special requirement, while the corresponding failure modes are even harder to collect in the real world. This in itself can be detrimental and lead to a poor understanding of model failures.
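
One simple form of automatic flagging is a metadata frequency count. The sketch below assumes each sample carries a dictionary of metadata and flags values that fall below a chosen share of the dataset; the function name, the "time_of_day" field, and the 5% threshold are all illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, List


def rare_metadata_values(
    records: List[Dict[str, Hashable]],
    key: str,
    min_fraction: float = 0.05,
) -> Dict[Hashable, int]:
    """Return metadata values whose share of the dataset is below min_fraction."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {
        value: count
        for value, count in counts.items()
        if count / total < min_fraction
    }


# Hypothetical dataset metadata: most samples were captured during the day.
records = (
    [{"time_of_day": "day"}] * 95
    + [{"time_of_day": "night"}] * 4
    + [{"time_of_day": "dusk"}] * 1
)

# Values flagged here are candidates for manual review or targeted collection.
print(rare_metadata_values(records, "time_of_day"))
```

A count like this only finds rare cases along dimensions that are already recorded as metadata, so it complements rather than replaces clustering and visual inspection.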

Model Evaluation

Evaluating a model on hold-out test sets is the type of testing that is taught in every single ML course and that every ML team does. Often teams also split their test sets into different scenarios in order to test for different failure modes. Figuring out which metrics are most relevant for the downstream task is one of the most difficult steps, since these metrics should be a proxy for the real-world performance of the model after it has been deployed. Moreover, examining model performance on individual data points can inform the next steps needed to improve the training dataset or the model itself. However, one should be careful when making decisions based on the observed failures, since they can be caused by either difficult or rare samples, and these need to be treated differently. Overall, evaluating model performance on carefully curated static datasets is a necessary step for testing a model before deploying it, but it is not sufficient to build more robust ML models.
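
Splitting the evaluation by scenario can be sketched as below, assuming each test example is tagged with a scenario name and that plain accuracy is the metric of interest. The tuple layout and the scenario names are assumptions for illustration; a real project would substitute its own metric and tagging scheme.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each example: (scenario tag, ground-truth label, model prediction).
Example = Tuple[str, str, str]


def accuracy_by_scenario(examples: List[Example]) -> Dict[str, float]:
    """Compute accuracy separately for every scenario tag."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for scenario, label, prediction in examples:
        total[scenario] += 1
        if label == prediction:
            correct[scenario] += 1
    return {scenario: correct[scenario] / total[scenario] for scenario in total}


# Hypothetical predictions from a detector on a small tagged test set.
examples = [
    ("day", "car", "car"),
    ("day", "car", "car"),
    ("night", "car", "none"),
    ("night", "car", "car"),
]

# The aggregate accuracy of 0.75 hides that every failure occurs at night.
print(accuracy_by_scenario(examples))
```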

Operational Domain

The hold-out test sets are meant to capture the diversity of the problem into a set of discrete points. However, these representative data points are often not sufficient to capture the long tail of events that our AI models can observe - that’s why deploying models to the real world is an iterative process, as data for new failures is uncovered and added to the test sets. We need a new strategy to describe the area of operation beyond the immediate data. Rather than using static data points and hoping to implicitly specify our requirements, explicitly listing a set of specifications (e.g. “every car within 100m needs to be detected”, or “intruders need to be detected regardless of the time of day”) that govern how the data can change suddenly makes the problem more tractable. Every rule describes how the data can vary, eventually defining an operational domain for the model. The notion of an operational domain and testing model capabilities with respect to it is a common method in safety-critical applications such as autonomous driving.

An important side effect of defining an operational domain in the context of the solved problem is that even non-ML experts understand what the requirements for the model are. On one hand, this not only enables the entire organization to have a common language to talk about model performance, but also reveals risks that might otherwise be perceived only by the ML team. On the other hand, the ML team gets to analyze failures and track regressions with respect to human-interpretable metrics, making areas for improvement easier to identify.
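
One possible way to codify such specifications is as named ranges that a sample's conditions must fall within. The sketch below is an assumption about how this could look, not a description of any particular tool; the OperationalDomain class and the "distance_m" and "hour_of_day" names are hypothetical and loosely mirror the two example rules quoted earlier.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OperationalDomain:
    """Allowed numeric range (inclusive) for each named condition."""

    ranges: Dict[str, Tuple[float, float]]

    def violations(self, conditions: Dict[str, float]) -> List[str]:
        """Names of conditions that are missing or outside their range."""
        violated = []
        for name, (low, high) in self.ranges.items():
            value = conditions.get(name)
            if value is None or not (low <= value <= high):
                violated.append(name)
        return violated

    def contains(self, conditions: Dict[str, float]) -> bool:
        """True when every condition lies inside the domain."""
        return not self.violations(conditions)


# Hypothetical domain for a car detector.
CAR_DETECTION_DOMAIN = OperationalDomain(
    ranges={
        "distance_m": (0.0, 100.0),  # every car within 100m must be detected
        "hour_of_day": (0.0, 24.0),  # regardless of the time of day
    }
)

print(CAR_DETECTION_DOMAIN.contains({"distance_m": 35.0, "hour_of_day": 23.5}))
print(CAR_DETECTION_DOMAIN.violations({"distance_m": 150.0, "hour_of_day": 3.0}))
```

Because the rules are named and human-readable, a failure can be reported as "outside the specified distance range" rather than as an index into a test set.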

Stress Testing

The hold-out dataset is just a small subset of points within the operational domain and of the data the AI model will see in production. Having decided on and codified the operational domain for a given task, we can explore the performance of the model across the entire domain. We can traverse the operational domain and test how well the model holds up in novel configurations that are not part of the available datasets, but that we expect the model to handle. This systematically identifies regions of poor performance, creating actionable outcomes that the team can directly use to improve the model and resolve the issue before deploying it.
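
Traversing a domain can be as simple as a grid sweep. The sketch below assumes a domain described by a few discrete values per condition and uses a made-up toy_detector in place of a real model; real stress testing would generate or transform actual inputs for each configuration and would typically search the domain more cleverly than an exhaustive grid.

```python
import itertools
from typing import Callable, Dict, List


def toy_detector(distance_m: float, brightness: float) -> bool:
    """Hypothetical model stand-in: detection degrades with distance and darkness."""
    return distance_m <= 40 + 100 * brightness


def stress_test(
    model: Callable[..., bool],
    domain_grid: Dict[str, List[float]],
) -> List[Dict[str, float]]:
    """Evaluate the model on every grid configuration and collect the failures."""
    names = list(domain_grid.keys())
    failures = []
    for values in itertools.product(*(domain_grid[name] for name in names)):
        config = dict(zip(names, values))
        if not model(**config):
            failures.append(config)
    return failures


# Hypothetical sampling of the operational domain.
grid = {
    "distance_m": [10.0, 60.0, 100.0],
    "brightness": [0.1, 0.5, 1.0],
}

for failure in stress_test(toy_detector, grid):
    print("model failed at", failure)
```

The returned list of failing configurations is the actionable outcome: it points to the region of the domain, here far objects in low light, where more data or model changes are needed.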

Stress testing increases development speed, consistency and ultimately the business value that can be extracted from a model.

Platforms such as Efemarai enable ML teams to stress test their models in a continuous integration fashion and collaborate when investigating issues.


Through stress testing and the creation of high-fidelity simulation, one aims to find alternative methods of representing the problem space. An advantage of simulations is being able to work on multiple levels of abstraction - from semantic scene representation to full-scale realistic renderings. This creates a library of scenarios used for acceptance testing and is a great way to test the entire system on pre-scripted cases. Building an operational domain for simulation, though, requires various simulation aspects to be considered, such as fidelity, scenarios, confidence in results, and the amount of computation. Both Unity and Nvidia have simulators that are targeted at robotics. DeepDrive is an open-source simulator for developing and testing self-driving cars. Simulation can also focus on predicting behavior - e.g. of crowds down to physics concepts.

Formal Methods

All of the above methods are empirical approximations that evaluate the model across an ever-expanding subset of the true underlying operational domain of the model, and so their results have a statistical nature. Using formal methods, it is possible to mathematically guarantee that a particular behavior of the AI model will be contained within a pre-set boundary. It is a powerful technique, but it is currently applicable only to the simplest models and neural networks. Training bigger models with verification in mind can expand the supported scales, but the question of setting up the domain of operation and productionizing it remains part of the equation. You can read more in the Introduction to Neural Network Verification book.
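
To make the idea of a guaranteed bound concrete, here is a minimal sketch of interval bound propagation, one of the simplest verification techniques, applied to a tiny two-layer ReLU network. The weights are made up for illustration, and the technique is chosen here only as an example; it yields bounds that are sound but often loose, which is one reason scaling to large models is hard.

```python
from typing import List, Tuple

Interval = Tuple[float, float]

# Made-up weights for a toy network: 2 inputs -> 2 hidden (ReLU) -> 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
B1 = [0.0, -0.25]
W2 = [[1.0, 1.0]]
B2 = [0.0]


def linear_bounds(
    weights: List[List[float]], bias: List[float], inputs: List[Interval]
) -> List[Interval]:
    """Propagate input intervals through y = Wx + b."""
    outputs = []
    for row, b in zip(weights, bias):
        low = high = b
        for w, (x_low, x_high) in zip(row, inputs):
            # A positive weight maps low to low; a negative weight swaps them.
            if w >= 0:
                low += w * x_low
                high += w * x_high
            else:
                low += w * x_high
                high += w * x_low
        outputs.append((low, high))
    return outputs


def relu_bounds(inputs: List[Interval]) -> List[Interval]:
    """ReLU is monotonic, so it can be applied to both ends of each interval."""
    return [(max(0.0, low), max(0.0, high)) for low, high in inputs]


def output_bounds(input_box: List[Interval]) -> List[Interval]:
    """Bounds that hold for every input inside the box, not just sampled ones."""
    hidden = relu_bounds(linear_bounds(W1, B1, input_box))
    return linear_bounds(W2, B2, hidden)


def forward(x: List[float]) -> List[float]:
    """Ordinary forward pass, for comparing concrete outputs with the bounds."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, B1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, B2)]


low, high = output_bounds([(0.0, 1.0), (0.0, 1.0)])[0]
print(f"output is guaranteed to lie in [{low}, {high}]")
```

Unlike the empirical tests above, the resulting interval covers every point in the input box at once, so a requirement such as "the output never exceeds 2.0" is proven rather than sampled.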


In this article, we have highlighted different strategies for testing machine learning models across axes such as confidence in the obtained result, the complexity of implementing the solution and maintaining it as part of the MLOps stack, and the cost of implementing and running the testing process.

The vast majority of companies are early in their adoption of these testing techniques, and we invite readers to take steps in the right direction to improve their development processes with a clear focus on QA and, as a result, start building better, more reliable AI models faster.