Testing & Quality Assurance (QA) for Data, ML Model and Code Pipelines

We have already touched on multiple stages of the machine learning lifecycle in previous blog posts and in our Next Generation Data & AI Journey powered by MLOps. In the last post, we talked about effective Data Science & Machine Learning pipelines. In this blog post, we dive a little deeper into the quality aspects of such pipelines, in particular:

Why quality gates are the precondition for automation
Different types of testing for your Machine Learning application and pipeline
How to understand and avoid technical debt
How to build quality gates for a data and machine learning pipeline

Setting the Context & Challenges for Testing ML Systems: Code, Data & Model

We will explore how to test code, data, and machine learning models to construct a machine learning system on which we can reliably iterate. Tests are a way for us to ensure that something works as intended, and we are incentivized to implement tests and discover sources of error as early in the (development) lifecycle as possible to decrease downstream costs and wasted time.

In other blog posts, we cover the concepts of a machine learning system and of data, model, and application pipelines in more detail. The following figure of our digital highway for end-to-end machine learning provides an overview.

Figure 1. Our detailed blueprint: A digital highway for end-to-end machine learning and effective MLOps.

Another way to approach this topic is to differentiate between traditional software tests and machine learning (ML) tests: software tests check the written logic, while ML tests check the learned logic. ML tests can be further split into testing and evaluation. Many ML developers are familiar with ML evaluation, where we train a model and evaluate its performance on an unseen validation set. This is done via metrics (e.g., accuracy) and visuals (e.g., a precision-recall curve).

On the other hand, ML testing involves checks on model behaviour. Pre-train tests, which can be run without trained parameters, check whether our written logic is correct. For example, is every classification probability between 0 and 1? Post-train tests check whether the learned logic behaves as expected. We will further explore the different types of testing below.
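
A pre-train test of this kind could be sketched as follows. This is a minimal illustration rather than a definitive test suite: the `softmax` helper is a hypothetical stand-in for a model's output layer, and random logits take the place of real model outputs, since no trained parameters are needed.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (hypothetical output layer)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def test_probabilities_are_valid():
    """Pre-train test: exercises only the written logic, no trained weights."""
    rng = random.Random(0)
    for _ in range(100):
        logits = [rng.uniform(-50, 50) for _ in range(5)]
        probs = softmax(logits)
        assert all(0.0 <= p <= 1.0 for p in probs)
        assert math.isclose(sum(probs), 1.0, rel_tol=1e-9)


test_probabilities_are_valid()
```

A post-train test would follow the same pattern but load trained parameters and assert on expected behaviour for known inputs.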

The Need for Quality Gates

While Figure 1 above illustrates different pipelines and quality gates, we have explained what they are and how to set them up in another blog post. The key benefit of building pipelines is to automate steps that were previously executed manually. An example is data cleaning. Usually, cleaning data begins with exploratory analysis right after the initial data extraction. The data itself and its properties, such as the underlying distributions, impurities, and accuracy, must be uncovered. Eventually, our understanding of the data is good enough that we can confidently make assumptions about it (here, data could be a single file or a continuous stream).

Consider the following example: given a dataset from the “DACH” region, we know with high confidence that the “Country” field should contain values that can be mapped to “de,” “at,” or “ch” (e.g., “DE” to “de”) or be empty.

The piece of code which we have initially written to transform our data experimentally can now be executed as part of a pipeline to automate the process. It is vital to notice here that not only the transformation should be automated, but also the assumptions which allow you to apply that transformation in the first place, as well as the quality of the data at the end of the transformation process. This is what we call “quality gates.”
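
A quality gate for the country example might look like the sketch below. It is an illustration under stated assumptions: records are plain dictionaries, the allowed values are the ISO 3166-1 alpha-2 codes for Germany, Austria, and Switzerland, and the function names `normalize_country` and `country_gate` are hypothetical.

```python
# Assumption: ISO 3166-1 alpha-2 codes for the DACH region.
ALLOWED_COUNTRIES = {"de", "at", "ch"}


def normalize_country(value):
    """Map raw values such as 'DE' or ' ch ' to a canonical code; empty becomes None."""
    if value is None or value.strip() == "":
        return None
    return value.strip().lower()


def country_gate(records):
    """Quality gate: fail fast if a record violates the 'Country' assumption."""
    cleaned = []
    for i, record in enumerate(records):
        country = normalize_country(record.get("Country"))
        if country is not None and country not in ALLOWED_COUNTRIES:
            raise ValueError(f"record {i}: unexpected country {record['Country']!r}")
        cleaned.append({**record, "Country": country})
    return cleaned
```

The transformation and the assumption that justifies it live in the same step, so a record with an unexpected country stops the pipeline instead of flowing downstream.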

Applying these quality gates allows you to build confidence in your pipeline. Confidence is an important ingredient of true automation, allowing you to shift from a manual and reactive system to a more “hands-off” one. Another reason quality gates are crucial for your ML systems is that your system will “fail fast,” meaning that data violating your assumptions will not propagate through the system and affect downstream functionalities or services. Hence, issues can be detected earlier and mitigated faster.

Different Types of Testing for Machine Learning Applications

Four major types of tests, complemented by regression tests, are used at different points in the development cycle, regardless of whether we are talking about traditional software or ML systems:

Unit tests: tests on individual components with a single responsibility (e.g., a function that filters a list).
Integration tests: tests on the combined functionality of individual components (e.g., data processing).
System tests: tests on the design of a system for expected outputs given inputs (e.g., training, inference, etc.).
Acceptance tests: tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT).
Regression tests: tests based on errors we've seen before to ensure new changes don't reintroduce them.

While ML systems are probabilistic in nature, they are composed of many deterministic components that can be tested similarly to traditional software systems. The distinction between testing traditional software and testing ML systems begins when we move from testing code to testing the data and models.

Figure 2: Four major types of tests.

The methods used to test can sensibly be grouped by which part of the system is tested (data pipeline, ML pipeline, code, application) and by whether the test relates to functional or non-functional aspects of the application. Functional in this context refers to the ability of a system to perform a predefined task. For a machine learning model, this could refer to the ability to classify an image. Non-functional here pertains to aspects that do not directly correspond to the ability to perform a predefined function, such as performance, latency, and security.

In Figure 3, we can see the different layers of functional testing. This figure is based on “The Practical Test Pyramid,” published on Martin Fowler's website. On the bottom layer, we can see where the tests specific to ML projects come into play. Both the data and the model tests focus on validating assertions in the manner of unit tests (with some exceptions); hence, the QA for these parts sits on the same layer as unit tests.

Figure 3: Hierarchies of functional tests based on Martin Fowler's practical test pyramid.

There are many other types of functional and non-functional tests, such as smoke tests (quick health checks), performance tests (load, stress), security tests, etc., but we can generalize all of these under the system tests above.

Testing your Data Pipeline

The data pipeline connects different input sources with data storage through ingestion and transformation workflows. Figure 4 shows the unified analytics architecture that includes these workflows. Consequently, there are three testable workflows that constitute the data pipeline: data ingestion/extraction, data transformation, and data storage.

Testing your data ingestion workflows can be difficult because, depending on the context, there can be many data sources with different technical and structural properties. Ideally, these data sources can be integrated as producers with tools like Kafka. More often than not, you will need to write code to ingest data from the source into downstream systems or workflows. This early in the lifecycle of an application, it is particularly important to have solid QA in place, as faults (e.g., incorrect data) can cause amplified errors in downstream systems. To improve QA for data ingestion:

Build unit tests for critical functions of your data ingestion.
Consider mocking your actual data source and testing the process of getting data from the mocked source to your downstream system. Mocking is helpful because it allows you to do integration tests without using peripheral systems.
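
As an illustration of mocking a data source, the sketch below uses `unittest.mock.Mock` from the Python standard library. The `ingest` function and the `source.fetch()` and `sink.write()` interfaces are hypothetical; a real connector will look different.

```python
from unittest.mock import Mock


def ingest(source, sink):
    """Pull records from a source and pass the valid ones to a downstream sink.

    `source.fetch()` and `sink.write()` are hypothetical interfaces.
    """
    records = source.fetch()
    valid = [r for r in records if r.get("id") is not None]
    sink.write(valid)
    return len(valid)


def test_ingest_skips_records_without_id():
    # The mocked source replaces the real peripheral system.
    source = Mock()
    source.fetch.return_value = [{"id": 1}, {"id": None}, {"id": 2}]
    sink = Mock()

    written = ingest(source, sink)

    assert written == 2
    sink.write.assert_called_once_with([{"id": 1}, {"id": 2}])


test_ingest_skips_records_without_id()
```

Because both ends are mocked, the test exercises the ingestion logic without a network connection or credentials.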

For data transformation, we face the challenge of having to normalize data from different inputs to fit our target system (schema on write) or to fit our on-demand needs (schema on read). Either way, the challenges for testing the data transformation workflow are similar. The following points should be part of your data transformation QA:

Finding missing values
Validating data shapes
Removing duplicates
Setting and validating legal data value ranges

Additionally, preprocessing functions are applied to normalize data to the required target format (e.g., dates, country codes, etc.). Each of these functions should be covered by its own unit test.
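
The four checks listed above could be combined into a single validation step, as in the following sketch. It assumes rows are dictionaries with hashable values and that every column named in `ranges` also appears in `required`; the function name `validate_rows` is hypothetical.

```python
def validate_rows(rows, required, ranges):
    """Return (clean_rows, issues) for a list of dict rows.

    required: column names every row must contain with a non-None value.
    ranges:   {column: (low, high)} inclusive legal value ranges.
    """
    issues = []
    clean = []
    seen = set()
    for i, row in enumerate(rows):
        # 1. Missing values
        missing = [c for c in required if row.get(c) is None]
        if missing:
            issues.append((i, f"missing values: {missing}"))
            continue
        # 2. Data shape: exactly the expected columns
        if set(row) != set(required):
            issues.append((i, "unexpected shape"))
            continue
        # 3. Legal value ranges
        out_of_range = [
            c for c, (low, high) in ranges.items() if not low <= row[c] <= high
        ]
        if out_of_range:
            issues.append((i, f"out of range: {out_of_range}"))
            continue
        # 4. Duplicates
        key = tuple(row[c] for c in required)
        if key in seen:
            issues.append((i, "duplicate"))
            continue
        seen.add(key)
        clean.append(row)
    return clean, issues
```

Returning the issues instead of raising on the first one is a design choice: a pipeline can then decide whether to reject the batch or continue with the clean rows.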

Figure 4: The Unified Analytics Architecture is a comprehensive reference model that enables organizations to streamline and simplify their data analytics workflows by consolidating multiple technical capabilities, layers, and operational best practices into a single architecture view.

Testing your Machine Learning Pipeline

The term testing for a machine learning model is somewhat ambiguous since the validation of a model according to target metrics such as accuracy is not considered to be a software test from the conventional software engineering perspective. We believe that perceiving the model evaluation as testing facilitates solving key challenges for quality assurance for ML pipelines.

The most natural form of QA for your machine learning pipeline is evaluating your model with regard to a target metric such as accuracy or loss. There are several ways to approach this evaluation:

Training and Validation Testing: In this rather basic type of evaluation, a portion of the data is used to train the model, and another part is used to validate its performance. This is typically done by splitting the data into a training set and a validation set.
Cross-validation Testing: In this type of testing, the data is split into multiple folds. In each round, the model is trained on all folds but one and tested on the held-out fold, so that every fold serves as the test set once. This helps to ensure that the model is not overfitting to a specific subset of the data.
Holdout Testing: In this type of testing, a portion of the data is set aside as a test set, and the model is trained on the remaining data. The performance of the model is then evaluated on the test set.
A/B Testing: In this type of testing, two or more different models are compared by testing them on the same data. This can help to determine which model is performing better.
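
To make cross-validation concrete, here is a minimal sketch in plain Python. It is not tied to any ML library; `fit` and `score` are hypothetical callables supplied by the caller, and the toy "model" simply predicts the mean of the training labels.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def cross_validate(xs, ys, fit, score, k=5):
    """Train on k-1 folds, score on the held-out fold, and return the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx])
        )
    return scores


def fit_mean(xs, ys):
    """Toy 'model': always predicts the mean of the training labels."""
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)


scores = cross_validate(list(range(20)), [2.0] * 20, fit_mean, mean_abs_error, k=5)
```

A large spread between the fold scores is a hint that the model's performance depends on which subset of the data it was trained on.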

Additionally, there are checks which do not pertain directly to a target metric such as accuracy or loss, but serve another purpose:

Online Testing: The model is tested in real-time on production data. This is usually done by deploying a candidate model in parallel to the production model (shadow deployment).
Fairness/Bias: This is closely related to the explainability of a machine learning model and evaluates on what basis a model has generated an output. Validating fairness and bias is particularly important when building systems with ethical implications.
Perturbation Testing: Perturbation testing validates the stability of a model by making subtle changes to the validation data, such as adding noise or removing samples, and then validating if the output of the model has changed reasonably.
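
A perturbation test could be sketched as follows. The `predict` function is a stand-in for a trained binary classifier with made-up weights, and the Gaussian noise level is an assumption that would have to be chosen per feature in practice.

```python
import random


def predict(features, weights=(0.8, -0.5), bias=0.1):
    """Stand-in for a trained binary classifier (weights are made up)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def perturbation_stability(samples, noise_std=0.01, seed=0):
    """Fraction of samples whose prediction is unchanged under small Gaussian noise."""
    rng = random.Random(seed)
    unchanged = 0
    for features in samples:
        noisy = [x + rng.gauss(0.0, noise_std) for x in features]
        unchanged += predict(features) == predict(noisy)
    return unchanged / len(samples)
```

A quality gate could then require, for example, that the stability on the validation set stays above a threshold agreed on for the use case.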

Testing your Machine Learning Application as a whole

Once the model has been built and validated, we can start containerizing the application. It is common practice to make a machine learning model available through an API. Hence, the container holds the machine learning model as well as an API serving requests. With a built container, we can start moving up the test pyramid shown in Figure 3. Concretely, this means testing our system in combination with other surrounding systems.

Imagine the following scenario: You are building a containerized API intended to serve your model's predictions as a service. In our case, we need to access another system (a validator) to check whether a request is legal as soon as an interaction occurs with our API. To be able to test our application, we deploy it to a test environment through the pipeline. After deployment, we first test whether our application can handle the data from the validator by simulating its responses. We simulate the validator's responses first because this is much cheaper and quicker to test. If we pass this test, we deploy the validator and run all defined test cases.
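
The first step of this scenario, simulating the validator, might be sketched like this. The `FakeValidator` class, its `check` method, the response format, and the status codes are all assumptions for illustration; the real validator's interface is not specified here.

```python
class FakeValidator:
    """Simulates the validator's responses (hypothetical interface)."""

    def __init__(self, response):
        self.response = response

    def check(self, request):
        if isinstance(self.response, Exception):
            raise self.response
        return self.response


def handle_request(request, validator, model):
    """Serve a prediction only if the validator approves the request."""
    try:
        verdict = validator.check(request)
    except TimeoutError:
        return {"status": 503, "error": "validator unavailable"}
    if not verdict.get("legal", False):
        return {"status": 403, "error": "request rejected"}
    return {"status": 200, "prediction": model(request["features"])}
```

Because the validator is passed in as a parameter, the same handler can be run against the simulated validator first and against the deployed one afterwards.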

Why Technical Debt Inhibits Automation

As mentioned before, most of the layers of a machine learning system stem from some experimental approach. In the very early stages, data is explored, and in later stages, different model architectures are trialed to obtain the most promising results. Due to the broader range of technical expertise needed compared to non-ML systems, more roles (which usually means more people) are involved in the process. Therefore, it is relatively easy to incur technical debt throughout the process of building an ML system.

“Technical Debt” is the complexity introduced through immature, incomplete, or inadequate code. Commonly, this happens due to design deficiencies, inadequate practices, or time constraints. ML systems add a layer of complexity compared to non-ML systems, as they consist not only of code but also of other components like the model and the data. In general, technical debt may be related to data dependencies, model complexity, and reproducibility, but it can also relate to pieces of code that do not have sufficient test coverage. In terms of automation, technical debt becomes a problem once it impedes your ability to integrate new features into your pipeline. This happens mainly for two reasons: missing test coverage and unrefactored code.

Consider the following example: You are a data engineer in a large enterprise. A new data source has been made available, and you would like to make sure this new source is integrated seamlessly into your centralized data lake. At first, you might consider evaluating the structure of the data coming from your data source. If you do this in, let’s say, a notebook, the code will be built in an ad-hoc way. That code will later have to be migrated into the team's “data-connectors” repository, which contains all the data integration code for data sources. Additionally, the code as well as the data should be covered by tests.

At that point, you have incurred technical debt. So, what can be done to mitigate that challenge? The main principle for reducing technical debt is to think automation first when coding: each piece of code should be built to be ready for automation. In practice, this means each functionality should be suitable to be run in a pipeline; hence, the moving parts of your functionality should not be hard-coded but rather injected according to the target environment in which you would like your code to run.

Here, consider this scenario: You are building a containerized API that is intended to serve your model's predictions as a service. Reflect on which variables can be dynamic when this API is being tested, integrated with peripheral systems, and deployed to a staging or production environment. In our case, we need to access a database to load information as soon as an interaction occurs with our API. Since we have separate databases for different environments, we need to make sure we can inject these variables (such as the DB URL) from the pipeline depending on which target system we deploy our API to. Additionally, the credentials should be stored either in a centralized secret management tool (e.g., Azure Key Vault) if we need the secret across multiple projects, or in encrypted form in the CI/CD tool (e.g., GitLab) if we only need the secret for this application.
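
One common way to inject such values is through environment variables that the pipeline sets per target environment. The sketch below assumes this approach; the variable names `DB_URL` and `DB_PASSWORD` and the `Settings` class are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_url: str
    db_password: str


def load_settings(env=os.environ):
    """Read environment-specific values injected by the pipeline.

    DB_URL and DB_PASSWORD are hypothetical variable names.
    """
    missing = [name for name in ("DB_URL", "DB_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing configuration: {', '.join(missing)}")
    return Settings(db_url=env["DB_URL"], db_password=env["DB_PASSWORD"])
```

Failing at startup when a value is missing acts as a small quality gate of its own: a misconfigured deployment is rejected before it serves a single request.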

Quality Gates in Action: Reproducibility

Reproducibility refers to the ability to recreate a result or analysis when re-running an experiment. In the context of a Machine Learning project, reproducibility is relevant not only to the Machine Learning model but also to many other aspects of the whole lifecycle. Each piece or workflow needs to be reproducible for the whole pipeline to be reproducible. Here is a comprehensive list of how each part affects reproducibility:

Code: The most prominent part. As with any traditional software project, we need to keep track of the different source code versions at any specific time. This allows us and every team member to pick up the work at any given state.
Data: The data will eventually change over time due to the flow of information or specific transformations applied to the dataset. Data versioning is crucial for reproducibility and beneficial for data auditability. It will allow us to determine whether and how a dataset has changed.
Model: Training ML models is an iterative process. We may need to try different models and hyperparameters to reach the desired performance. For that purpose, we need to track not only the model itself but also how it was created. Hence, we need to understand which data was used and what the resulting model parameters were. It can be helpful to store the training configuration, such as the learning rate. This can be done using a machine learning experiment tracking tool like MLflow.
Configuration: While it is possible to hard-code all of your parameters and file paths directly into your code, we prefer a dedicated configuration file. This makes it easier to reuse units of code and to change our configuration for different experiments while keeping our pipeline the same.
Environment: We need to remember that even a slightly different version of a Python package involved in one step may change the results in ways that can be hard to predict. We should make sure that the runtime environment is also reproducible. In the case of Python, for example, package versions can be pinned in a requirements.txt file.

Considering all the points above helps ensure that findings can be reproduced consistently. We need to get the same or very similar results as in “the lab environment” where the model was initially built. This ensures the soundness and integrity of an experiment and spares you a big surprise (usually a bad one) once you run your algorithms in production, where end users interact with them.

We can use reproducibility as a quality gate by validating the output of our system along the pipeline and comparing it to previous results.
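
Such a comparison could be implemented as in the following sketch. It assumes metrics are stored as plain name-to-value dictionaries and that a relative tolerance of 1% counts as "very similar"; both the function name and the tolerance are illustrative choices.

```python
import math


def reproducibility_gate(current, baseline, rel_tol=0.01):
    """Compare metrics of a re-run against stored baseline metrics.

    Returns the names of metrics that are missing or deviate by more than rel_tol.
    """
    failures = []
    for name, expected in baseline.items():
        actual = current.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            failures.append(name)
    return failures
```

An empty result means the gate passes; a pipeline step could fail the run whenever the returned list is non-empty.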

Despite adhering to all the best practices, there are still a few points that will affect your reproducibility:

Changes in the ML framework you build your model with: During the continuous development of an ML system, it is highly likely that you will upgrade a dependency. This might affect the behavior of the training process.
Randomness: Throughout the lifecycle of your ML application, there are multiple points where randomness is applied. Examples are sampling, introducing random noise, GPU settings, etc.
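
For the sampling case, randomness can be controlled by passing an explicit seed, as in this sketch based on Python's standard `random` module. The function name is hypothetical. Note that libraries such as NumPy or PyTorch maintain their own random number generators that need to be seeded separately, and that seeding alone does not remove nondeterminism caused by GPU settings.

```python
import random


def sample_training_rows(rows, fraction, seed):
    """Draw a reproducible random sample using a dedicated, seeded generator."""
    rng = random.Random(seed)
    k = int(len(rows) * fraction)
    return rng.sample(rows, k)
```

Storing the seed together with the rest of the training configuration allows the same sample to be drawn again in a later run.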

Continuous Quality Assurance powered by Monitoring, Observability & SRE

After the model is deployed to the production environment, we also need to apply monitoring and alerting for critical components of the model pipeline. This way, we can detect and respond to crucial failures or threats in real time and ensure that the model performs well by providing accurate predictions.

We need to continuously monitor the performance of our deployed models and their serving infrastructure while adopting SRE (Site Reliability Engineering) practices to ensure reliability, scalability, and maintainability.

SRE focuses on building and operating large-scale, highly available systems by applying principles from software engineering. Incorporate SRE principles, such as defining Service Level Objectives (SLOs) and Service Level Indicators (SLIs), to set clear expectations for your system's performance and reliability. Additionally, implement practices like error budgeting and automated testing to proactively identify and address potential issues, ultimately improving the overall stability and resilience of your ML model serving infrastructure. By carefully choosing the appropriate model-serving approach and implementing these best practices, including monitoring, observability, and SRE, you can ensure that your machine learning models are deployed efficiently and deliver high-quality results to users on time.
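
The relationship between an SLO, its SLI, and the resulting error budget can be illustrated with a small calculation. The sketch assumes an availability-style SLI (share of successful requests) and uses made-up numbers; the function name is hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return the measured SLI, the allowed failures, and the remaining budget."""
    allowed = (1.0 - slo_target) * total_requests
    sli = 1.0 - failed_requests / total_requests
    return {
        "sli": sli,
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        "slo_met": sli >= slo_target,
    }


# Example: a 99% SLO over 100,000 requests allows roughly 1,000 failures.
report = error_budget(slo_target=0.99, total_requests=100_000, failed_requests=400)
```

With 400 failed requests, about 600 failures remain in the budget for the period; once the budget is used up, the team would typically prioritize reliability work over new features.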

For the concept of Observability, please look at our previous blog post here. For more on SRE, please see here.

Continuous model training is another crucial aspect, which refers to continuously training machine learning models using the latest data available. By continuously training models, organizations can ensure they are always up-to-date with the latest information and provide accurate and effective predictions in real-time. As mentioned above, we track the model’s outputs and overall performance in the production environment in Continuous Monitoring. Hence, when performance degrades, the model can be retrained to improve accuracy and reliability.
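
A retraining trigger based on monitored performance might be sketched as follows. It assumes that ground-truth labels eventually become available for production predictions; the class name, window size, and accuracy threshold are illustrative assumptions.

```python
from collections import deque


class DegradationMonitor:
    """Flags retraining when rolling accuracy drops below a threshold."""

    def __init__(self, window=100, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window)  # keeps only the latest outcomes
        self.min_accuracy = min_accuracy

    def record(self, prediction, label):
        self.outcomes.append(prediction == label)

    def needs_retraining(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return sum(self.outcomes) / len(self.outcomes) < self.min_accuracy
```

In a pipeline, a positive result would start the training workflow, whose output then has to pass the same quality gates as any other candidate model.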

Machine Learning Architects Basel

Adopting data- and ML-driven approaches can be challenging and time-consuming if you are unfamiliar with the required data and software architectures, tools, and best practices. The same holds for managing end-to-end data and machine learning initiatives and operations, including assessing and implementing the required technologies and effective DataOps and MLOps workflows and practices. Collaboration with an experienced and independent partner could be a valuable option.

Machine Learning Architects Basel (MLAB) is a member of the Swiss Digital Network (SDN). Having pioneered the ‘Digital Highway for End-to-End Machine Learning & Effective MLOps’ we have created frameworks and reference models that combine our expertise in DataOps, Machine Learning, MLOps, and our extensive knowledge and experience in DevOps, SRE, and agile transformations.

If you want to learn more about how MLAB can aid your organization in creating long-lasting benefits by developing and maintaining reliable data and machine learning solutions, don't hesitate to contact us.

We hope you find this blog post informative and engaging. It is part of our Next Generation Data & AI Journey powered by MLOps.