Introduction
In the rapidly evolving world of AI, agents powered by Large Language Models (LLMs) unlock everything from conversational support to automated content creation. Yet deploying them reliably in production remains a challenge. In this tutorial, we'll show how to stand up an LLM-driven agent on Azure AKS; every step maps 1-for-1 to AWS or GCP if that's your stack.

This tutorial is part of our AI-Engineering series and complements our earlier post on Introduction to AI Engineering.
Prerequisites
To follow along, you should be familiar with:
Azure services (Key Vault, Application Insights, Log Analytics, AKS)
We'll deploy everything on Azure, but identical concepts apply to other clouds.
Kubernetes concepts and management
AKS handles scaling and rollout. Alternatives such as Docker Swarm or Mesos work too, but the Kubernetes ecosystem makes life easier.
Terraform for Infrastructure-as-Code
Terraform lets us declare-not-click our infrastructure and reuse the code on any cloud.
Helm Charts
Helm bundles our Kubernetes resources into versioned, repeatable releases.
Overview of the Stack
We'll deploy our agent on an Azure AKS cluster and configure logging, tracing, and CI/CD via Helm Charts.
Features of our agent
The agent will be able to:
- Answer questions about our website
- Draft e-mails
- Convert currencies on the fly
- Gracefully respond to off-topic queries
Components of our agent
The agent relies on four building blocks:
- Data pipeline (crawler → embeddings → ChromaDB → Azure)
- Agent tools (retriever, e-mail draft/send)
- LLM planner
- Short-term memory
Concretely, our agent will look like this:
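To make the four building blocks concrete, here is a minimal sketch of how they could be wired together. The class, function, and tool names are our own illustrations, not the production code:

```python
from dataclasses import dataclass, field


@dataclass
class ShortTermMemory:
    """Keeps only the most recent conversation turns for the planner."""
    max_turns: int = 10
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})
        self.turns = self.turns[-self.max_turns:]


def retrieve_docs(query: str) -> str:
    """Tool stub: a real implementation would query the ChromaDB embeddings."""
    return f"[passages matching '{query}']"


def draft_email(topic: str) -> str:
    """Tool stub: a real implementation would ask the LLM to write the draft."""
    return f"Subject: {topic}\n\nDear customer, ..."


TOOLS = {"retriever": retrieve_docs, "email_draft": draft_email}


def plan(question: str) -> str:
    """Planner stub: in production the LLM decides which tool to invoke."""
    return "retriever" if "website" in question.lower() else "email_draft"


memory = ShortTermMemory(max_turns=4)
memory.add("user", "What does your website say about pricing?")
tool_name = plan(memory.turns[-1]["content"])
print(TOOLS[tool_name]("pricing"))  # the planner picked the retriever tool
```

The real planner replaces the keyword check with an LLM call, but the data flow — memory feeds the planner, the planner selects a tool — stays the same.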
Step-by-Step Tutorial
Step 1 – Set up AKS, Log Analytics & Container Registry
We create the cluster, enable logging, provision a private container registry, and push the agent image to it; RBAC permissions let AKS pull the image.
Step 2 – Deploy the agent
AKS handles container orchestration, auto-scaling, and load-balancing. We choose VM sizes optimised for our model's GPU and memory needs.
Step 3 – Integrate OpenAI models
Supply your OpenAI API key (or Azure OpenAI endpoint) and update the Helm values file.
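Inside the cluster, the key typically reaches the pod as an environment variable populated from a Kubernetes Secret (sourced from Key Vault via the Helm values). A minimal sketch of how the agent might resolve its model configuration at start-up — the variable and deployment names here are assumptions, not a fixed contract:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Resolved LLM endpoint settings for the agent."""
    api_key: str
    base_url: str
    model: str


def load_model_config(env=os.environ) -> ModelConfig:
    """Prefer an Azure OpenAI endpoint when configured, else the public API."""
    if env.get("AZURE_OPENAI_ENDPOINT"):
        return ModelConfig(
            api_key=env["AZURE_OPENAI_API_KEY"],
            base_url=env["AZURE_OPENAI_ENDPOINT"],
            model=env.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
        )
    return ModelConfig(
        api_key=env["OPENAI_API_KEY"],
        base_url="https://api.openai.com/v1",
        model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
    )
```

Keeping this resolution in one place means the Helm values file only has to set environment variables; the agent code never hard-codes an endpoint.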
Step 4 – Add observability with OpenTelemetry
Reliability starts with instrumenting the code itself. The Python snippet below wires OpenTelemetry into our agent so that every prompt, tool call, and model token is traced, logged, and counted. A lightweight Collector sidecar then streams those signals to Azure:
- Traces – each request becomes a trace with spans for plan-build → tool-call → LLM-compose, enabling slow-path replay in Application Insights.
- Logs – structured JSON (prompt, response, token count) land in Log Analytics for ad-hoc search.
- Metrics – token/sec, queue age, and GPU utilisation feed KEDA auto-scaling and SLO dashboards.
```python
import logging
import os

from opentelemetry import trace
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

from my_agent.utils import init_logger  # your helper

AZURE_CONNECTION_STRING = os.getenv("AZURE_MONITOR_CONNECTION_STRING")

# ---------- Tracing ----------
resource = Resource(attributes={SERVICE_NAME: "mlab-agent"})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer_provider = trace.get_tracer_provider()

if AZURE_CONNECTION_STRING:
    from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

    trace_exporter = AzureMonitorTraceExporter(
        connection_string=AZURE_CONNECTION_STRING
    )
    tracer_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    print("🔗 Azure Monitor trace exporter enabled.")
else:
    tracer_provider.add_span_processor(
        SimpleSpanProcessor(ConsoleSpanExporter())
    )
    print("🖥️ Console span exporter enabled for local dev.")

# ---------- Logging ----------
# Inject trace/span IDs into log records so logs correlate with traces.
LoggingInstrumentor().instrument(set_logging_format=True)

if AZURE_CONNECTION_STRING:
    from azure.monitor.opentelemetry.exporter import AzureMonitorLogExporter

    log_exporter = AzureMonitorLogExporter(
        connection_string=AZURE_CONNECTION_STRING
    )
    log_provider = LoggerProvider(resource=resource)
    log_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))
    # Route stdlib logging through OpenTelemetry into Log Analytics.
    otel_handler = LoggingHandler(level=logging.INFO, logger_provider=log_provider)
    logging.getLogger().addHandler(otel_handler)
```
With this in place, you can open Application Insights and:
- Expand a trace to see every agent decision step → was the Currency-Converter tool slow, or did the LLM skip it?
- Create an alert when P95 plan-build latency exceeds 500 ms.
- Drill into logs for any span ID to view the exact prompt and response.
We'll explore dashboards and auto-alerts in a dedicated monitoring post, but this wiring gives you live debugging from Day 1.
Conclusion
A reliable agent requires a solid backbone—elastic compute, IaC, observability, and CI/CD. With this stack, you can iterate safely and scale confidently. Questions or feedback? Contact us!
Machine Learning Architects Basel
Machine Learning Architects Basel (MLAB) is part of the Swiss Digital Network. We help customers deploy and scale data & AI products.