You're monitoring your LLM service wrong. Not because your tooling is broken — because you're applying the wrong mental model.
Web service observability is built on request rate, error rate, and latency (RED). P99 latency is your headline SLO. A p99 of 200ms means 99% of requests resolve within 200ms. That's a meaningful signal for a REST API.
For LLM inference it's mostly useless. A request to generate a 2000-token response can take 30 seconds. P99 of 30 seconds tells you nothing about whether users had a good experience. What matters is: how long did they wait before they saw the first word? Time to first token — TTFT — is your user-facing SLO. Everything else is operational.
The distinction matters because TTFT and end-to-end latency respond to completely different optimisations. High TTFT is usually a prompt problem or a queue problem. High end-to-end latency is usually a model size or hardware problem. If you're tuning the wrong thing, you'll ship improvements your users don't notice.
Local Platform Engineering Series
- Running Local Kubernetes with k3d: Fast, Ephemeral, and Kind to Your Battery
- Gateway API in Practice: From Ingress Migration to Envoy Debugging
- Multi-Tenant Observability: LGTM at Platform Scale
- Network Control with Cilium and Kyverno: Policies That Actually Work
- Observing LLM Inference: The Metrics That Actually Matter
- AI Tool Gateways: Sandboxing Agent Access in Kubernetes
The Four Metrics That Matter
Time to First Token (TTFT)
The gap between sending a request and receiving the first token. This is what the user experiences as responsiveness. A chatbot that takes 4 seconds to start responding feels broken even if it generates 80 tokens/second once it starts.
TTFT decomposes into:
- Queue wait time — how long the request waited before inference started
- Prompt evaluation time — time to process the input tokens (scales with prompt length)
A long TTFT is almost always either a queue problem (too many concurrent requests) or a prompt length problem (10,000-token system prompts will hurt TTFT regardless of hardware). The fix is different for each. You need the breakdown, not just the aggregate.
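To make the breakdown concrete, here is a minimal sketch that assumes you already have the two components as numbers (from server metrics or your own timers). The `TtftBreakdown` class and its field names are hypothetical, invented for this illustration; the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class TtftBreakdown:
    """Hypothetical container for the two TTFT components, in seconds."""

    queue_wait_s: float   # time the request waited before inference started
    prompt_eval_s: float  # time spent processing the input tokens

    @property
    def ttft_s(self) -> float:
        # TTFT is the sum of the two components described above.
        return self.queue_wait_s + self.prompt_eval_s

    def dominant_component(self) -> str:
        """Return 'queue' or 'prompt', whichever contributes more to TTFT."""
        if self.queue_wait_s > self.prompt_eval_s:
            return "queue"
        return "prompt"


# Illustrative numbers only: a request that mostly sat in the queue.
slow = TtftBreakdown(queue_wait_s=3.2, prompt_eval_s=0.4)
print(f"TTFT {slow.ttft_s:.1f}s, dominated by {slow.dominant_component()}")
```

Two requests with the same aggregate TTFT can land on opposite sides of this check, which is why the aggregate alone doesn't tell you what to fix.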
Tokens Per Second (TPS) / Time Per Output Token (TPOT)
The generation rate after the first token. This is what determines end-to-end latency for long responses. For interactive applications it matters less than TTFT. For batch processing, document summarisation, or code generation tasks where users are waiting for a complete output, it's the primary performance signal.
TPS is hardware-bound: GPU memory bandwidth, quantisation level, batch size. If TPS is low, you need better hardware or a smaller model — there's limited room to software-optimise it.
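The relationship between TPS, TPOT, and end-to-end latency is simple arithmetic. The sketch below assumes a steady generation rate after the first token, which real workloads only approximate; the function names are illustrative and the 0.5-second TTFT is a made-up input.

```python
def tpot_ms(tps: float) -> float:
    """Time per output token in milliseconds: the inverse of tokens per second."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return 1000.0 / tps


def estimate_e2e_seconds(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Rough end-to-end latency: wait for the first token, then stream the rest."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    remaining = max(output_tokens - 1, 0)  # the first token arrives at TTFT
    return ttft_s + remaining / tps


# A 2000-token response at 80 tokens/second with a 0.5 s TTFT.
print(f"TPOT: {tpot_ms(80):.1f} ms")
print(f"End-to-end: {estimate_e2e_seconds(0.5, 2000, 80):.1f} s")
```

Even with a fast TTFT, the long response is dominated by generation time, which is why end-to-end latency says so little about perceived responsiveness.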
Queue Depth
Pending requests. The leading indicator for both high TTFT and cascading failures. If queue depth stays at 0 under normal load and spikes during traffic bursts, you can safely auto-scale on it. If it's consistently above zero, you're running at capacity and TTFT is degrading for everyone.
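One way to tell the two situations apart is to look at a window of queue-depth samples rather than a single reading. This is a sketch only: `classify_queue` is a hypothetical helper, and the 80% cut-off for "consistently above zero" is an assumption you would tune for your own traffic.

```python
def classify_queue(samples: list[int], sustained_fraction: float = 0.8) -> str:
    """Classify a window of queue-depth samples.

    'idle'      : the queue never backed up
    'bursty'    : occasional spikes, a reasonable auto-scaling signal
    'saturated' : non-zero most of the time, i.e. running at capacity
    """
    if not samples:
        raise ValueError("need at least one sample")
    nonzero = sum(1 for depth in samples if depth > 0)
    if nonzero == 0:
        return "idle"
    if nonzero / len(samples) >= sustained_fraction:
        return "saturated"
    return "bursty"


# Illustrative windows of scraped queue depth.
print(classify_queue([0, 0, 7, 3, 0, 0, 0, 0, 0, 0]))
print(classify_queue([2, 3, 1, 4, 2]))
```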
Token Throughput
Total tokens generated per second across all requests. This is your infrastructure efficiency metric — it tells you how well you're utilising the GPU or CPU. A low throughput-per-dollar means your serving configuration needs tuning: batch size, tensor parallelism, context length.
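Throughput-per-dollar is easy to compute once you have an aggregate token rate and an hourly cost. The sketch below compares two serving configurations; the configuration names, token rates, and prices are invented for illustration and are not benchmarks.

```python
def tokens_per_dollar(tokens_per_second: float, hourly_cost_usd: float) -> float:
    """Tokens generated per dollar of instance cost."""
    if hourly_cost_usd <= 0:
        raise ValueError("hourly_cost_usd must be positive")
    return tokens_per_second * 3600 / hourly_cost_usd


# Hypothetical configurations: (aggregate tokens/second, hourly cost in USD).
configs = {
    "batch-1": (120.0, 2.0),
    "batch-8": (400.0, 2.0),
}

for name, (tps, cost) in configs.items():
    print(f"{name}: {tokens_per_dollar(tps, cost):,.0f} tokens per dollar")

best = max(configs, key=lambda name: tokens_per_dollar(*configs[name]))
print(f"Most efficient: {best}")
```

Efficiency and responsiveness pull in different directions here: a larger batch usually raises aggregate throughput while making each individual request wait longer, so read this number alongside TTFT.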
What Ollama Exposes
Ollama (covered in the series setup) exposes a Prometheus /metrics endpoint:
curl http://ollama.local:8080/metrics 2>/dev/null | grep ollama_
Key metrics:
- ollama_request_duration_seconds: full request latency histogram
- ollama_generate_duration_seconds: generation phase only
- ollama_prompt_eval_duration_seconds: prompt evaluation phase
- ollama_tokens_generated_total: cumulative token count
- ollama_pending_requests: current queue depth
The combination of generate_duration and prompt_eval_duration gives you the TTFT decomposition without instrumented clients. Apply a ServiceMonitor and these appear in Prometheus automatically.
# TTFT proxy: median prompt eval duration
histogram_quantile(0.50, rate(ollama_prompt_eval_duration_seconds_bucket[5m]))
# TPS (rolling 1 minute)
rate(ollama_tokens_generated_total[1m])
# Queue depth
ollama_pending_requests
Framework Support: Google ADK, LangChain, LangGraph
The choice of framework affects not just what you build but what observability you get for free.
Google ADK
Google ADK (Agent Development Kit) is designed for multi-agent systems. Out of the box it emits traces via OpenTelemetry with spans that cover:
- Agent invocations with model name and prompt token count
- Tool calls with input/output capture
- LLM calls with TTFT and total latency as span attributes
- Error events with full stack traces
ADK's OTel integration is configuration-based — set the exporter endpoint and you get traces in Tempo without instrumenting a single function:
from google.adk.telemetry import configure_otel
configure_otel(
    service_name="my-agent",
    otlp_endpoint="http://tempo.tracing:4318",
    export_traces=True,
)
What ADK doesn't give you: Prometheus metrics. If you want queue depth or TPS in Grafana, you need to either scrape the ADK metrics endpoint separately or push custom metrics via the Prometheus Python client alongside your traces.
LangChain
LangChain's observability story has improved significantly with LangSmith, but LangSmith is a managed service rather than something you run locally. For the self-hosted case, LangChain uses callbacks.
The built-in OpenAICallbackHandler captures token counts and costs. For structured tracing, use the opentelemetry-instrumentation-langchain package:
from opentelemetry.instrumentation.langchain import LangchainInstrumentor
LangchainInstrumentor().instrument()
This emits spans for chains, agents, and LLM calls with the following attributes on LLM spans:
- gen_ai.request.model
- gen_ai.usage.prompt_tokens
- gen_ai.usage.completion_tokens
- gen_ai.response.finish_reasons
What's missing: TTFT is not captured by default. The LLM call span covers the full request duration, not the time to first token. For streaming responses, you need to instrument the stream consumer yourself. Add a timing wrapper around the first chunk event if TTFT matters to your SLO.
LangGraph
LangGraph builds on LangChain and inherits its callback/instrumentation system. Where it adds value for observability is graph-level visibility: each node in the graph is a separate span, so you can see exactly which step in a multi-step agent workflow is slow.
from langgraph.graph import StateGraph
# Instrumentation is inherited from LangChain's OTel instrumentor
# Each node transition creates a child span
In Tempo, a LangGraph trace looks like a call tree with nodes like router, tool_executor, llm_call, output_formatter. When a graph is slow, you can immediately see which node is the bottleneck without log diving.
LangGraph's gap: it doesn't emit queue depth or concurrency metrics. If you're serving multiple concurrent graph executions, you need to wrap the execution endpoint with a semaphore and export the queue size as a custom metric.
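A minimal sketch of that wrapper, using only asyncio. The `TrackedLimiter` class and `run_graph` function are hypothetical names, and `asyncio.sleep` stands in for a real graph execution; exporting `waiting` as a gauge through your metrics client of choice is left out.

```python
import asyncio


class TrackedLimiter:
    """Concurrency limiter that tracks how many callers are queued."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self.waiting = 0  # callers queued behind the semaphore
        self.running = 0  # callers currently executing

    async def __aenter__(self) -> "TrackedLimiter":
        self.waiting += 1
        try:
            await self._sem.acquire()
        finally:
            # Decrement even if the wait is cancelled.
            self.waiting -= 1
        self.running += 1
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self.running -= 1
        self._sem.release()


async def run_graph(limiter: TrackedLimiter, seconds: float) -> None:
    # Stand-in for a real graph execution behind the limiter.
    async with limiter:
        await asyncio.sleep(seconds)


async def demo() -> int:
    limiter = TrackedLimiter(max_concurrent=2)
    tasks = [asyncio.create_task(run_graph(limiter, 0.1)) for _ in range(5)]
    await asyncio.sleep(0.01)        # let every task reach the limiter
    queued_now = limiter.waiting     # 5 submitted, 2 running, so 3 queued
    await asyncio.gather(*tasks)
    return queued_now


print("queued while saturated:", asyncio.run(demo()))
```

The `waiting` counter is the value you would publish as queue depth; `running` gives you concurrency for free.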
Instrumenting for TTFT with OpenTelemetry
None of the frameworks give you TTFT for streaming responses without custom instrumentation. Here is the pattern that works for all of them:
import time

from opentelemetry import trace

tracer = trace.get_tracer("llm-service")

def stream_with_ttft(client, prompt: str, model: str):
    """Yield chunks from a streaming call, recording TTFT and TPS on a span."""
    with tracer.start_as_current_span("llm.stream") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_chars", len(prompt))
        start = time.perf_counter()
        first_token_at: float | None = None
        token_count = 0
        try:
            for chunk in client.stream(prompt, model=model):
                if first_token_at is None and chunk:
                    # First non-empty chunk: record time to first token.
                    first_token_at = time.perf_counter()
                    span.set_attribute(
                        "llm.ttft_ms",
                        round((first_token_at - start) * 1000, 1),
                    )
                token_count += 1  # counts chunks, which approximates tokens
                yield chunk
        finally:
            # Runs even if the consumer stops reading the stream early.
            span.set_attribute("llm.output_tokens", token_count)
            elapsed = time.perf_counter() - start
            if elapsed > 0:
                span.set_attribute("llm.tps", round(token_count / elapsed, 1))

Wrap this around any streaming LLM call regardless of framework. The span attributes llm.ttft_ms and llm.tps appear in Tempo, filterable by model, service, or any dimension you add.
Setting SLOs for LLM Services
SLOs for LLM inference are different from web services because the latency distribution is multimodal — it depends heavily on input length and output length, which vary enormously across requests.
A practical starting point:
| Metric | SLO | Notes |
|---|---|---|
| TTFT | p95 < 500ms | Interactive chat use case |
| TPS | p50 > 10 t/s | Minimum for legible streaming |
| Queue depth | < 5 for 95% of time | Leading indicator for degradation |
| Error rate | < 0.5% | Includes context-length exceeded |
p95 < 500ms for TTFT is the threshold below which users don't consciously notice the wait. Above 1 second, users assume something is wrong. The 500ms target is achievable for 3B–7B models on modern CPU hardware with short-to-medium prompts; it requires a GPU for larger models or long system prompts.
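If you want to sanity-check a batch of measurements against the TTFT target before wiring up alerts, a nearest-rank percentile is enough. This is a sketch under stated assumptions: `percentile` and `meets_ttft_slo` are hypothetical helpers, the samples are invented, and Prometheus estimates quantiles from histogram buckets, so its numbers will differ slightly.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_ttft_slo(ttft_ms: list[float], target_ms: float = 500.0) -> bool:
    """True when p95 TTFT is under the target (500 ms, per the table above)."""
    return percentile(ttft_ms, 95) < target_ms


# Invented TTFT samples in milliseconds: 19 ordinary requests and one outlier.
ttft_samples = [
    120, 150, 180, 200, 210, 230, 250, 260, 280, 300,
    310, 330, 350, 370, 390, 410, 430, 450, 480, 2400,
]
print("p95 TTFT (ms):", percentile(ttft_samples, 95))
print("meets SLO:", meets_ttft_slo(ttft_samples))
```

A single slow request in twenty doesn't breach a p95 target; a second one would, which is the behaviour you want from a user-facing SLO.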
Set Grafana alerts on these, routed to the team that owns the inference service via the Alertmanager routing from part three.
What's Next
You can now see what your LLMs are doing. The final piece is controlling what they're allowed to do. Part six covers AI Tool Gateways — the proxy layer that sandboxes agent access to tools, limits blast radius, and creates an auditable record of everything an AI agent touched.