After deploying our agentic system reliably in Part 1 (Link) and using OpenTelemetry to log its internals in Part 2 (Link), we now want to take a deeper look at testing our agentic system and ensuring its quality. To keep the post reasonably short, and because many principles from classical software testing also apply to agentic systems, we focus on the parts where classical software testing reaches its limits. As part of the Swiss Digital Network we do not treat testing and quality assurance (QA) as an isolated topic but as an integral part of a larger continuous software delivery and operations process. Our formalization of this process is what we call the "Digital Highway". As Machine Learning Architects Basel we created the "Digital Highway for MLOps" on top of the original "Digital Highway" for software. This "Digital Highway for MLOps" is the basis for our testing and QA approach to agentic systems.

Recap: MLAB Agent

In the last two blogposts we created a helpful agent for internal MLAB needs. The agent can search through MLAB websites, draft and send emails, and convert currencies. A high-level view of the system is depicted in Figure 1.

Figure 1: High-level architecture of the MLAB Agent. The diagram shows how the agent connects to internal MLAB resources and external services to perform tasks such as information retrieval, email drafting and sending, and currency conversion.

As can be seen in this picture the system consists of many moving parts that depend on each other, which increases its complexity. The architecture follows a Retrieval Augmented Generation (RAG) pattern, allowing the LLM to ground its responses in stored data while also performing actions through external APIs. The contents of the RAG database are ingested through a data ingestion subsystem.
If not handled correctly, this complexity leads to error accumulation, where errors that occur in one part of the system lead to more errors downstream. For example, when wrong data is ingested, this data gets consumed by the system and potentially influences the response of the overall system. Even if we are lucky and a user reports the error, we are still in the dark about where the error originated.
To overcome this, we need a new way of thinking about how we build and maintain the system, one that helps us detect and understand problems early, before they cascade into larger failures. Preferably we want to catch errors before even deploying the system. This can be achieved by introducing quality gates into the deployment process.

Designing Quality Gates

Quality gates are automated validation steps in your deployment pipeline that act as checkpoints before code progresses to the next stage. The pipeline typically moves code through multiple environments, with staging serving as the critical final validation ground before production. Staging should mirror your production environment as closely as possible, including the same infrastructure, database schemas, and external service integrations. This is where you typically run tests that validate the entire system working together, simulating real user journeys from start to finish. You need to run these tests in staging rather than earlier environments because staging is where all components are actually available and properly configured.
The deployment process for our MLAB Agent is shown in the picture below. To keep it simple we only use one pre-production environment. However, in critical production systems you may have multiple stages before the final deployment (e.g. a user acceptance testing stage). We split the code into two repositories: (1) the data ingestion pipeline, and (2) the agentic system. Both repositories have their own lifecycle and therefore also a separate deployment pipeline.

Figure 2: Simplified deployment pipeline for the MLAB Agent. The diagram illustrates how the system moves through a single pre-production (staging) environment before reaching production. Both components, the data ingestion pipeline and the agentic system, are managed in separate repositories, each with its own lifecycle and deployment workflow.

Locally we set up quality gates for standard code metrics, such as whether the unit and integration tests pass. We also introduce specific gates for the ingestion, retrieval, and agent execution. These tests use mocked dependencies so we can run them locally, without requiring a full execution environment.
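To give an idea of what such a locally runnable gate can look like, here is a minimal sketch of a retrieval test with a mocked vector store. The Retriever wrapper is a hypothetical stand-in for the actual retrieval component; the mocked query call mirrors the ChromaDB collection API.

from unittest.mock import MagicMock

# Hypothetical thin wrapper around a ChromaDB collection, defined here only so the
# test is self-contained; in the real repository this lives in the agent code.
class Retriever:
    def __init__(self, collection):
        self.collection = collection

    def retrieve(self, query: str, k: int = 5) -> list[dict]:
        result = self.collection.query(query_texts=[query], n_results=k)
        return [
            {"id": chunk_id, "text": text}
            for chunk_id, text in zip(result["ids"][0], result["documents"][0])
        ]

def test_retriever_returns_chunks_from_store():
    # Mock the ChromaDB collection so the test runs locally without a database
    mock_collection = MagicMock()
    mock_collection.query.return_value = {
        "ids": [["chunk_services_consulting_001"]],
        "documents": [["MLAB provides consulting services."]],
    }

    retriever = Retriever(collection=mock_collection)
    chunks = retriever.retrieve("What is MLAB's consulting offering?", k=1)

    mock_collection.query.assert_called_once_with(
        query_texts=["What is MLAB's consulting offering?"], n_results=1
    )
    assert chunks == [{"id": "chunk_services_consulting_001", "text": "MLAB provides consulting services."}]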
Most quality gates require test data to validate that your system behaves correctly. Unit and integration test suites, which are typical quality gates also used in classical software engineering, need specialized datasets tailored to their specific scope. Unit tests use small, focused inputs to validate individual functions. Integration tests need data that validates the correct interaction between components. End-to-end tests for agentic systems typically rely on a Golden Dataset, which contains representative examples with known correct outputs that validate the entire system's behavior in production-like scenarios.
There is extensive literature on classical software testing methodologies, and many established best practices for unit and integration testing apply equally well to agentic systems. Here, we will instead focus on the aspects that are specific to testing agentic systems, particularly around end-to-end validation where the non-deterministic and complex nature of AI agents requires different approaches than traditional software.

Golden Datasets

A golden dataset is a curated collection of reference examples that represent the expected behavior of the system. It serves as your ground truth: a definitive set of inputs paired with verified, high-quality outputs that your system should produce. For agentic systems, a golden dataset typically includes:

  • Input examples: Representative user queries or scenarios
  • Expected outputs: Verified, high-quality responses or actions
  • Context and metadata: Retrieved chunks, tool calls, reasoning steps
Using samples from the golden dataset we can query the system and, based on the execution trace and the output, calculate the metrics introduced later in this post. As we iterate on the system, updating prompts or switching models, the outputs for the golden dataset samples will change. The golden dataset therefore acts as a kind of regression test suite, helping us spot when changes improve or degrade performance.
Because the quality gates define hard limits on each metric and are checked against the golden dataset for each new version before deployment, we make sure that new versions are not significantly worse. However, because the underlying models are fuzzy and we only check a small share of all possible inputs, we cannot rely entirely on the golden dataset as a quality check but also need a good monitoring and observability strategy in place after the system is deployed. It is also important to verify that the contents of our golden dataset reflect the types of queries the users are entering and are kept in sync whenever the user behaviour changes (also known as Data Drift in ML literature). Both monitoring and observability as well as detecting data drift are addressed in the next blogpost of this journey.

Putting it into Practice

Now that we know about the importance of testing agentic systems and quality gates, let's put it into practice for the MLAB agent we develop during this journey. As a brief recap from our previous blogpost on deploying agents, we built an agent that goes beyond simple RAG question answering. It is equipped with multiple tools and a vector store containing embeddings of our website content, enabling it to perform diverse tasks ranging from information retrieval to email composition and currency conversion. Concretely, the MLAB agent has access to the following tools:

  • retrieve_information: Query the ChromaDB vector store for relevant information from the MLAB website about services, offerings, and blogpost contents
  • convert_currency: Perform real-time currency conversions
  • write_email_draft_tool: Compose professional email drafts
  • send_message: Send emails to specified recipients

Example 1: Simple information retrieval

{
    "id": "test_001",
    "query": "What is MLAB's consulting offering?",
    "evaluation_types": ["agent_behavior", "user_response"],
    "expected_tool_calls": [
      {
        "tool": "retrieve_information",
        "params": {"query": "MLAB consulting offering"},
        "validation": {
          "expected_chunk_ids": [
            "chunk_services_consulting_001",
            "chunk_services_ai_engineering_003",
            "chunk_about_expertise_002"
          ]
        }
      },
      {
        "tool": "grade_documents"
      },
      {
        "tool": "generate_response"
      }
    ],
    "agent_behavior": {
      "tool_sequence_matters": true,
      "required_tools": ["retrieve_information", "grade_documents", "generate_response"]
    },
    "user_response": {
      "success_criteria": "Comprehensive overview of MLAB's consulting services",
      "contains": ["AI Engineering", "Data Products", "MLOps"]
    },
    "metadata": {
      "category": "information_retrieval",
      "difficulty": "easy",
      "requires_tools": ["retrieve_information", "grade_documents", "generate_response"],
      "requires_external_api": false
    }
}

Example 2: Email generation with retrieval

{
  "id": "test_002",
  "query": "Send an email to person@example.com about MLAB's AI Engineering consulting. Keep it brief.",
  "evaluation_types": ["agent_behavior", "user_response"],
  "expected_tool_calls": [
    {
      "tool": "retrieve_information",
      "params": {"query": "AI Engineering consulting"},
      "validation": {
        "expected_chunk_ids": [
          "chunk_services_ai_engineering_001",
          "chunk_services_ai_engineering_002"
        ]
      }
    },
    {
      "tool": "write_email_draft_tool",
      "validation": {
        "contains_keywords": ["AI Engineering", "consulting", "MLAB"],
        "tone": "professional",
        "word_count_max": 150
      }
    },
    {
      "tool": "send_message",
      "params": {"recipient": "person@example.com"},
      "validation": {
        "recipient_matches": "person@example.com",
        "send_status": "success"
      }
    }
  ],
  "agent_behavior": {
    "tool_sequence_matters": true,
    "required_tools": ["retrieve_information", "write_email_draft_tool", "send_message"]
  },
  "user_response": {
    "success_criteria": "Confirmation message that email was sent successfully",
    "contains": ["sent", "person@example.com"]
  },
  "metadata": {
    "category": "email_generation",
    "difficulty": "medium",
    "requires_tools": ["retrieve_information", "write_email_draft_tool", "send_message"],
    "requires_external_api": false
  }
}

Here we manually defined a small number of golden dataset entries, which is always a good idea to get a grasp of the problem space as well as the capabilities and limitations of the envisioned system. However, GenAI tools can help to create larger datasets based on a couple of hand-made inputs (a small sketch of this follows below). Aim for 50 to 500 diverse, high-quality entries in your golden dataset. Based on the golden dataset we will calculate the metrics. As it would be too extensive to address all possible metrics, and the field evolves quickly, we will only show a handful of useful ones. In general, metrics and quality gates should be adapted to the concrete use case and the current state of the art. Frameworks like DeepEval [2] and TruLens [3] evolve alongside research and can (and should) be utilized in productionized agentic systems. For learning purposes we will look at some metrics in a level of detail that is not necessary when applying the previously mentioned frameworks.
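Coming back to generating additional entries: one possible way to bootstrap from the hand-made examples is to ask an LLM for query variations and review them manually before adding them to the golden dataset. A rough sketch, where the prompt wording and model choice are assumptions:

import json
from openai import OpenAI

def propose_query_variations(seed_queries: list[str], n_variations: int = 5) -> list[str]:
    """Ask an LLM for additional realistic user queries based on hand-written seeds.
    The proposals still need manual review before they enter the golden dataset."""
    client = OpenAI()
    prompt = (
        "Here are example user queries for an internal assistant:\n"
        + "\n".join(f"- {q}" for q in seed_queries)
        + f"\n\nPropose {n_variations} new, realistic queries in the same style. "
        "Respond with a JSON array of strings and nothing else."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # Assumes the model follows the format instruction; add error handling in practice
    return json.loads(response.choices[0].message.content)

seeds = [
    "What is MLAB's consulting offering?",
    "Send an email to person@example.com about MLAB's AI Engineering consulting. Keep it brief.",
]
print(propose_query_variations(seeds))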

Retrieval Performance

When evaluating how well our agent retrieves relevant information from the vector store, we need metrics that capture both the quality and completeness of the retrieved documents. Two fundamental metrics for this are precision@K and recall@K, which tell us how many of our retrieved documents are actually relevant and how many of the relevant documents we successfully found.
Let's walk through a concrete example using the first entry from our golden dataset. When a user asks "What is MLAB's consulting offering?", we expect the system to retrieve three specific chunks:

  • chunk_services_consulting_001
  • chunk_services_ai_engineering_003
  • chunk_about_expertise_002
These three chunks represent our ground truth: the documents we know contain the information needed to answer the query. Now suppose our system retrieves the top 5 documents (K=5) from the vector store: all three of the expected chunks and 2 additional documents. Now we can calculate (see the short sketch after this list):
  • Precision@5: the fraction of retrieved documents that are actually relevant.
    3 are relevant and 5 are retrieved, so: 3/5
  • Recall@5: the fraction of all relevant documents that we successfully retrieved.
    We got all three relevant documents, so: 3/3 = 1
In this example, our system achieved perfect recall but only moderate precision. It found everything it needed but also retrieved some irrelevant content. This trade-off is common in retrieval systems. Setting K too low might hurt recall (missing relevant documents), while setting K too high might hurt precision (including too much noise). For our MLAB agent, we use an average Recall@5 ≥ 0.80 for the whole golden dataset as a quality gate to ensure the system retrieves at least 80% of relevant documents before deployment. Missing too many relevant documents would result in incomplete or incorrect answers, making recall the critical metric for maintaining response quality. We track Precision@5 as a monitoring metric rather than a hard quality gate. Watching precision over time helps us understand whether we're retrieving too much irrelevant content. If precision consistently drops, this signals that we should consider adjusting K in a future version. A lower K might reduce noise, while a higher K could improve recall if we notice we're frequently missing relevant documents. This flexible approach lets us tune the system based on real-world performance data later on.

Automating Quality Gates

To ensure these quality gates are consistently enforced, we automate them within our deployment pipeline. The quality validation step runs on each deployment to staging, iterating through the golden dataset and calculating metrics before allowing promotion to production. This automation approach can be implemented with CI/CD tools like GitLab CI, GitHub Actions, or Jenkins. Here's an example implementation of a recall quality gate:

import json

def validate_retrieval_quality(
    golden_dataset_path: str,
    recall_threshold: float = 0.80,
    k: int = 5
) -> bool:
    """Validate retrieval performance against golden dataset."""
    
    with open(golden_dataset_path, 'r') as f:
        golden_dataset = json.load(f)
    
    recall_scores = []
    
    for entry in golden_dataset:
        retrieval_tool = next(
            (tool for tool in entry["expected_tool_calls"] 
              if tool["tool"] == "retrieve_information"), 
            None
        )
        if not retrieval_tool:
            continue  # skip golden dataset entries that do not contain a retrieval step
        
        expected_chunks = retrieval_tool["validation"]["expected_chunk_ids"]
        
        # run_agent(...) runs the agent and returns the execution trace
        execution_trace = run_agent(entry["query"])

        # get_retrieved_chunks_from_trace(...) extracts all retrieved chunk ids;
        # we only score the top-k of them for Recall@k
        retrieved_chunks = get_retrieved_chunks_from_trace(execution_trace)[:k]
        
        relevant_retrieved = set(retrieved_chunks).intersection(set(expected_chunks))
        recall = len(relevant_retrieved) / len(expected_chunks)
        recall_scores.append(recall)
    
    if not recall_scores:
        raise ValueError("Golden dataset contains no entries with a retrieval step.")

    avg_recall = sum(recall_scores) / len(recall_scores)
    print(f"Average Recall@{k}: {avg_recall:.3f} (threshold: {recall_threshold})")
    
    if avg_recall < recall_threshold:
        raise ValueError(
            f"Quality gate failed: Recall@{k} of {avg_recall:.3f} is below "
            f"threshold of {recall_threshold}."
        )
    
    return True

This validation function integrates into a GitLab CI pipeline as follows:

validate_quality:
  stage: validate
  image: python:3.11
  script:
    - pip install -r requirements.txt
    - python scripts/run_e2e_tests.py --env staging
  only:
    - main

If the quality metrics fall below the defined thresholds, the pipeline fails and prevents promotion to production.
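The scripts/run_e2e_tests.py entry point referenced in the pipeline could be a thin wrapper around the validation function. Below is a minimal sketch, assuming the function lives in a hypothetical quality_gates module and that a non-zero exit code is what fails the CI job; the --env handling is likewise an assumption.

import argparse
import sys

# Hypothetical module containing the validate_retrieval_quality function from above
from quality_gates import validate_retrieval_quality

def main() -> int:
    parser = argparse.ArgumentParser(description="Run end-to-end quality gates.")
    parser.add_argument("--env", default="staging", help="Environment to run the tests against")
    parser.add_argument("--golden-dataset", default="data/golden_dataset.json")
    args = parser.parse_args()

    print(f"Running quality gates against '{args.env}'")
    try:
        validate_retrieval_quality(args.golden_dataset)
    except ValueError as exc:
        # A failed quality gate must fail the CI job via a non-zero exit code
        print(exc)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())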

Runtime Quality Gates

Runtime Quality Gates represent a new paradigm in system reliability. Unlike traditional software systems that fail silently or catastrophically when something goes wrong, agents equipped with runtime quality gates can actively monitor their own execution and recover from poor performance. These quality gates operate during system runtime, continuously evaluating whether the agent's actions and outputs meet predefined quality thresholds. When an agent's execution trace falls below acceptable thresholds during runtime, the system can take corrective action rather than simply proceeding with flawed outputs. This self-correcting capability fundamentally distinguishes agents from classical software systems, where error handling typically involves static exception catching rather than dynamic quality assessment and recovery.

Runtime quality gates can also serve as a protection mechanism by blocking responses that fail to meet minimum quality standards. Rather than delivering potentially incorrect or irrelevant answers to users, the system can intercept poor outputs and return a generic fallback response such as "We could not answer your query. Please reword your question." This approach prioritizes user trust and system reliability over always providing an answer.
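A minimal sketch of this blocking behaviour is shown below; the generate_answer and passes_quality_gates callables are hypothetical placeholders for the agent's generation step and its metric checks.

from typing import Callable

FALLBACK_RESPONSE = "We could not answer your query. Please reword your question."

def answer_with_runtime_gate(
    query: str,
    generate_answer: Callable[[str], str],
    passes_quality_gates: Callable[[str, str], bool],
    max_retries: int = 1,
) -> str:
    """Generate an answer, but block it if it fails the runtime quality gates."""
    for _ in range(max_retries + 1):
        answer = generate_answer(query)
        if passes_quality_gates(query, answer):
            return answer
    # No attempt met the quality thresholds: prefer a safe fallback over a poor answer
    return FALLBACK_RESPONSE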

RAG Triad

The RAG Triad, developed by TruLens, provides a comprehensive framework of three key runtime quality gates. This triad addresses the fundamental challenge that RAG systems face: ensuring quality from retrieval through final answer generation (and potentially multiple steps in between). Each dimension captures a distinct aspect of system performance that must be monitored and optimized. This is illustrated in Figure 3.

Figure 3: The RAG Triad concept highlighting the connection between retrieval quality, grounding, and generation accuracy as described in source [1]

Context Relevance measures whether the retrieved chunks or documents are actually relevant to the user's query. For example, when a user asks "What is MLAB's MLOps consulting approach?", poor context relevance would mean retrieving chunks about MLAB's office location or general company history instead of MLOps methodology documents. This makes it impossible for downstream components to generate accurate answers regardless of how sophisticated the generation model is.

Groundedness evaluates whether the generated answer is faithful to the retrieved context. This dimension addresses the hallucination problem by checking if claims in the answer can be traced back to the source documents. For instance, if the retrieved context mentions "MLAB provides AI Engineering consulting" but the generated answer claims "MLAB offers AI Engineering with guaranteed project delivery in under 3 months and fixed pricing", the specific commitments about timeline and pricing are not present in the context. The added claims might be true but we cannot verify that from the provided context. Groundedness ensures that the system stays anchored to its sources rather than fabricating information.

Answer Relevance assesses whether the final response actually addresses the user's original question. A system might retrieve relevant context and generate a grounded answer that is nonetheless tangential to what the user asked. For example, when asked about MLAB's consulting rates, it might provide a perfectly grounded explanation of consulting services without mentioning pricing at all. This dimension ensures that the entire pipeline delivers value by directly answering the query rather than providing related but unhelpful information about something else.

Together, these three dimensions have been shown to provide good coverage of the overall quality of RAG systems. By evaluating context relevance, groundedness, and answer relevance independently, we can also diagnose exactly where the system succeeds or fails and target improvements accordingly.

Using LLM as a Judge

Traditional evaluation methods for language models struggle with a fundamental problem: exact matching fails to capture semantic equivalence. This limitation becomes especially clear when evaluating RAG systems using the metrics of the RAG Triad. Each dimension of the triad requires nuanced semantic understanding that classical matching approaches simply cannot provide.

Consider the Context Relevance evaluation. When assessing whether retrieved chunks are relevant to a query, exact matching is useless. A query like "What causes rain?" and a retrieved passage beginning with "Precipitation occurs when water vapor condenses" share no overlapping words, yet the passage is highly relevant. Similarly, embedding similarity alone fails because two passages might be semantically similar but address different aspects. "How does rain form?" and "How does snow form?" have high embedding similarity, yet a passage about snow formation would be irrelevant for a rain query.

The Groundedness dimension presents even greater challenges. Determining whether an answer is faithful to the source context requires understanding paraphrase, inference, and logical consistency. If the context states "The company's revenue increased by 15% in Q3" and the answer says "The firm saw strong growth last quarter," classical methods cannot verify this alignment. An LLM judge can recognize that these statements are semantically equivalent despite different wording.

For Answer Relevance, the challenge is determining whether the response actually addresses the user's question. The answer "Paris is known for the Eiffel Tower" is semantically related to "What is the capital of France?" but doesn't directly answer it. Exact matching misses this entirely, while embedding similarity would show high scores despite the answer being incomplete. The most robust way to handle these often nuanced differences would be to have a human in the loop. However, humans do not scale very well. Therefore most RAG evaluation metrics rely on utilizing LLMs as judges (aka. LLM-as-a-Judge) because they provide the contextual understanding and semantic reasoning needed to assess the dimensions of the triad accurately. The LLM judge mimics human evaluation by understanding intent, recognizing paraphrase, and detecting logical relationships that simpler methods cannot capture.
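To illustrate the pattern, here is a minimal sketch of an Answer Relevance judge built from a vanilla LLM call, analogous to the Groundedness check shown in the next section. The prompt wording, model choice, and scoring scale are assumptions.

from openai import OpenAI

ANSWER_RELEVANCE_PROMPT = """You are evaluating if an answer directly addresses the user's question.

Question: {question}

Answer: {answer}

Rate from 0.0 (does not address the question) to 1.0 (fully addresses the question).
Respond with just the number."""

def judge_answer_relevance(question: str, answer: str) -> float:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": ANSWER_RELEVANCE_PROMPT.format(question=question, answer=answer)}],
        max_tokens=10,
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

# The judge should score this tangential answer low, even though embedding similarity would be high
print(judge_answer_relevance("What is the capital of France?", "Paris is known for the Eiffel Tower."))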

Steering the Agent at Runtime

When RAG triad metrics fall below acceptable thresholds during execution, the agent can retry with modified strategies rather than returning poor results. This steering happens through conditional branching based on real-time metric evaluation. The system checks each metric after its corresponding stage and triggers recovery mechanisms when quality drops. Below is a concrete implementation of a Groundedness check using a vanilla LLM call. In practice you would use a framework like LangChain [4] to perform LLM calls and a library like TruLens for the LLM-as-a-Judge call.


from openai import OpenAI

GROUNDEDNESS_PROMPT = """You are evaluating if an answer is faithful to the source context.

Context: {context}

Answer: {answer}

Does the answer only contain claims that can be directly supported by the context?
Rate from 0.0 (completely unfaithful) to 1.0 (perfectly grounded).
Respond with just the number."""

def check_groundedness(context: str, answer: str, threshold: float = 0.7) -> tuple[bool, float]:
    """
    Evaluates if the answer is grounded in the context.
    Returns (passes_check, score).
    """
    client = OpenAI()
    
    prompt = GROUNDEDNESS_PROMPT.format(context=context, answer=answer)
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50,
        temperature=0
    )
    
    score = float(response.choices[0].message.content.strip())
    return score >= threshold, score

This implementation uses an LLM judge to evaluate Groundedness by directly asking whether claims in the answer are supported by the context. The function returns both a binary pass/fail decision and the raw score, allowing the agent to make informed decisions about whether to retry or accept the output. When the score falls below the threshold, the agent can attempt recovery by regenerating the answer with an explicit prompt instruction requiring it to ground all claims in the provided context, retrieving additional context to fill information gaps, or falling back to a generic response if grounding remains impossible.
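Putting the check and the recovery together, such a retry loop could look roughly like the sketch below. The generate_answer callable, which accepts optional feedback, is a hypothetical placeholder for the agent's generation step; it reuses the check_groundedness function defined above.

def generate_with_groundedness_gate(
    query: str,
    context: str,
    generate_answer,  # hypothetical: (query, context, feedback) -> answer string
    threshold: float = 0.7,
    max_retries: int = 2,
) -> str:
    """Regenerate the answer with explicit grounding feedback until it passes the gate."""
    feedback = None
    for _ in range(max_retries + 1):
        answer = generate_answer(query, context, feedback)
        passed, score = check_groundedness(context, answer, threshold)
        if passed:
            return answer
        # Feed the failure back into the next generation attempt
        feedback = (
            f"The previous answer scored {score:.2f} on groundedness. "
            "Only state facts that are directly supported by the provided context."
        )
    # Grounding remained impossible: fall back to a generic response
    return "We could not answer your query. Please reword your question."

The following example trace shows this kind of recovery in action: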


Query: "What AI Engineering services does MLAB offer?"
Retrieved Context: "MLAB provides AI Engineering consulting focused on building production-ready AI systems and MLOps implementation."

Generated Answer: "MLAB offers comprehensive AI Engineering services including consulting, model training, deployment automation, and 24/7 system monitoring with guaranteed 99.9% uptime."
Groundedness Check: FAIL (score: 0.4)
Reason: "Claims about 24/7 monitoring and 99.9% uptime guarantee not present in context"

Recovery Action: Retry generation with prompt including failure feedback
New Answer: "MLAB provides AI Engineering consulting focused on building production-ready AI systems and MLOps implementation."
Groundedness Check: PASS (score: 0.95)

Final Output: "MLAB provides AI Engineering consulting focused on building production-ready AI systems and MLOps implementation."
In this trace, the initial answer fabricated claims about 24/7 monitoring and a 99.9% uptime guarantee that are not found in the retrieved context. The Groundedness check caught this hallucination before the response reached the user. The agent then regenerated the answer with explicit instructions to only state facts directly from the context, together with the failure feedback from the LLM-as-a-Judge call, producing a grounded response that passes the quality gates. This self-correction prevents delivering unreliable information while still providing a useful answer, rather than resorting to a generic fallback response.

Ensuring High Data Quality

Last but definitely not least, we want to emphasize one other aspect: data quality. Any RAG-based system is only as good as the data it receives. All data that is ingested should be validated by applying data quality checks. These checks can include detecting duplicates, validating file formats and encoding, ensuring completeness of required fields, or identifying corrupted documents. The checks should be executed automatically inside your ingestion pipeline for all data entering the system. Data checks are a topic of their own, which we address in this blogpost.
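Still, to make the idea concrete, a few of these checks can be sketched in plain Python. The required fields and concrete rules are assumptions here and will differ per use case.

import hashlib

REQUIRED_FIELDS = {"id", "title", "content", "source_url"}  # assumed ingestion schema

def run_data_quality_checks(documents: list[dict]) -> list[str]:
    """Return a list of human-readable issues found in an ingestion batch."""
    issues = []
    seen_hashes = set()
    for doc in documents:
        doc_id = doc.get("id", "<missing id>")
        # Completeness: all required fields must be present and non-empty
        missing = [field for field in REQUIRED_FIELDS if not doc.get(field)]
        if missing:
            issues.append(f"{doc_id}: missing fields {sorted(missing)}")
            continue
        # Encoding: content must be clean text without replacement characters
        if "\ufffd" in doc["content"]:
            issues.append(f"{doc_id}: content contains invalid characters")
        # Duplicates: identical content should only be ingested once
        content_hash = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
        if content_hash in seen_hashes:
            issues.append(f"{doc_id}: duplicate content")
        seen_hashes.add(content_hash)
    return issues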

Conclusion

Testing agentic systems requires a fundamentally different approach than classical software testing. While traditional unit and integration tests remain valuable, they cannot capture the non-deterministic and complex behavior of AI agents. This is where quality gates, golden datasets, and runtime checks become essential.

In this post, we demonstrated how to implement quality gates at multiple stages of the deployment pipeline for the MLAB agent. By using golden datasets as regression test suites and enforcing thresholds on certain metrics, we ensure that new versions do not significantly degrade performance before reaching production. Runtime checks, particularly the RAG Triad framework, add another layer of protection by detecting and correcting issues like hallucinations and irrelevant responses during execution.

However, it is important to recognize that we have only scratched the surface. The metrics and techniques covered here represent a small subset of what is available, and the field of agentic system evaluation is evolving rapidly. New frameworks, methodologies, and quality measures emerge frequently as researchers and practitioners discover better ways to assess AI behavior. What works well today may be superseded by more sophisticated approaches tomorrow.

For production systems, you need to research and select metrics tailored to your specific use case. A customer support agent requires different quality measures than a code generation agent or a data analysis agent. Generic metrics provide a starting point, but true quality assurance demands deep understanding of your domain, your users' expectations, and the specific failure modes relevant to your application.

In our next post, we explore how to maintain visibility into your agent's performance through monitoring and observability practices, including detecting data drift when user queries diverge from your golden dataset assumptions.

Machine Learning Architects Basel

Machine Learning Architects Basel (MLAB) is a member of the Swiss Digital Network. We have created an effective MLOps framework that combines our expertise in DataOps, Machine Learning, and MLOps with our extensive knowledge and experience in DevOps, SRE, and agile transformations. If you want to learn more about how MLAB can help your organization create long-lasting benefits by developing and maintaining reliable data and machine learning solutions, don't hesitate to contact us.

References & Acknowledgements

  1. RAG Triad
  2. DeepEval
  3. TruLens
  4. LangChain