Observing LLM Inference: The Metrics That Actually Matter

You're monitoring your LLM service wrong. Not because your tooling is broken — because you're applying the wrong mental model.

Web service observability is built on request rate, error rate, and latency (RED). P99 latency is your headline SLO. A p99 of 200ms means 99% of requests resolve within 200ms. That's a meaningful signal for a REST API.

For LLM inference it's mostly useless. A request to generate a 2000-token response can take 30 seconds. P99 of 30 seconds tells you nothing about whether users had a good experience. What matters is: how long did they wait before they saw the first word? Time to first token — TTFT — is your user-facing SLO. Everything else is operational.

The distinction matters because TTFT and end-to-end latency respond to completely different optimisations. High TTFT is usually a prompt problem or a queue problem. High end-to-end latency is usually a model size or hardware problem. If you're tuning the wrong thing, you'll ship improvements your users don't notice.


The Four Metrics That Matter

Time to First Token (TTFT)

The gap between sending a request and receiving the first token. This is what the user experiences as responsiveness. A chatbot that takes 4 seconds to start responding feels broken even if it generates 80 tokens/second once it starts.

TTFT decomposes into:

  • Queue wait time — how long the request waited before inference started
  • Prompt evaluation time — time to process the input tokens (scales with prompt length)

A long TTFT is almost always either a queue problem (too many concurrent requests) or a prompt length problem (10,000-token system prompts will hurt TTFT regardless of hardware). The fix is different for each. You need the breakdown, not just the aggregate.
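To make the breakdown concrete, here is a minimal sketch that splits TTFT into its two components from three timestamps. The RequestTimings type, its field names, and the example numbers are hypothetical; the sketch assumes your server or gateway can record when a request was enqueued, when inference started, and when the first token was emitted.

```python
from dataclasses import dataclass


@dataclass
class RequestTimings:
    """Timestamps in seconds (hypothetical fields, e.g. from time.perf_counter())."""
    enqueued_at: float           # request accepted by the server
    inference_started_at: float  # request left the queue
    first_token_at: float        # first output token emitted


def decompose_ttft(t: RequestTimings) -> dict[str, float]:
    """Split TTFT into queue wait and prompt evaluation time."""
    queue_wait = t.inference_started_at - t.enqueued_at
    prompt_eval = t.first_token_at - t.inference_started_at
    return {
        "ttft_s": queue_wait + prompt_eval,
        "queue_wait_s": queue_wait,
        "prompt_eval_s": prompt_eval,
    }


def dominant_cause(t: RequestTimings) -> str:
    """Name the larger component so you know which fix to reach for."""
    parts = decompose_ttft(t)
    return "queue" if parts["queue_wait_s"] >= parts["prompt_eval_s"] else "prompt"


# Example: roughly 3.2 s in the queue, 0.8 s evaluating the prompt
timings = RequestTimings(enqueued_at=10.0, inference_started_at=13.2, first_token_at=14.0)
print(decompose_ttft(timings))
print(dominant_cause(timings))  # → queue
```

A "queue" result points at capacity or concurrency limits; a "prompt" result points at prompt length.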

Tokens Per Second (TPS) / Time Per Output Token (TPOT)

The generation rate after the first token. This is what determines end-to-end latency for long responses. For interactive applications it matters less than TTFT. For batch processing, document summarisation, or code generation tasks where users are waiting for a complete output, it's the primary performance signal.

TPS is largely bound by hardware and serving configuration: GPU memory bandwidth, quantisation level, batch size. If TPS is low, you usually need better hardware, a smaller model, or heavier quantisation; there's limited room to optimise it in application code.
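The relationship between these numbers is simple arithmetic: TPOT is the reciprocal of per-request TPS, and end-to-end latency is roughly TTFT plus one TPOT for every token after the first. A small sketch using the figures from earlier in this post (a 2000-token response, a 4-second TTFT, 80 tokens/second); the function names are illustrative.

```python
def tpot_seconds(tps: float) -> float:
    """Time per output token is the reciprocal of tokens per second."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return 1.0 / tps


def estimate_e2e_seconds(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Estimate end-to-end latency: TTFT, then one TPOT per remaining token."""
    if output_tokens < 1:
        return ttft_s
    return ttft_s + (output_tokens - 1) * tpot_seconds(tps)


# 2000 tokens at 80 tokens/second with a 4 s TTFT
print(round(estimate_e2e_seconds(4.0, 2000, 80.0), 1))  # → 29.0
```

In this example, halving TTFT saves 2 seconds, while doubling TPS saves about 12.5 seconds. Which one matters depends on whether users read the stream as it arrives or wait for the complete output.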

Queue Depth

Pending requests. The leading indicator for both high TTFT and cascading failures. If queue depth stays at 0 under normal load and spikes during traffic bursts, you can safely auto-scale on it. If it's consistently above zero, you're running at capacity and TTFT is degrading for everyone.
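One way to tell the two regimes apart is to look at what fraction of recent samples were non-zero. A minimal sketch, assuming you have a window of queue-depth samples (for example, one per scrape interval); the function name and the 0.5 threshold are hypothetical and would need tuning for your traffic.

```python
def classify_queue(samples: list[int], saturated_fraction: float = 0.5) -> str:
    """Classify a window of queue-depth samples.

    Returns "idle" if the queue was always empty, "bursty" if it was
    non-empty only occasionally, and "saturated" if it was non-empty for
    at least `saturated_fraction` of the window.
    """
    if not samples:
        raise ValueError("need at least one sample")
    nonzero = sum(1 for s in samples if s > 0)
    if nonzero == 0:
        return "idle"
    if nonzero / len(samples) >= saturated_fraction:
        return "saturated"
    return "bursty"


print(classify_queue([0, 0, 0, 7, 3, 0, 0, 0]))  # → bursty
print(classify_queue([2, 4, 3, 5, 1, 2, 6, 4]))  # → saturated
```

A bursty queue is a good candidate for auto-scaling. A saturated one means baseline capacity is too low.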

Token Throughput

Total tokens generated per second across all requests. This is your infrastructure efficiency metric — it tells you how well you're utilising the GPU or CPU. A low throughput-per-dollar means your serving configuration needs tuning: batch size, tensor parallelism, context length.
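Throughput-per-dollar is easy to compute once you know the aggregate token rate and what the node costs. A sketch with made-up numbers: the hourly price and the token rates below are placeholders, not benchmarks.

```python
def tokens_per_dollar(tokens_per_second: float, cost_per_hour: float) -> float:
    """Tokens generated per dollar spent, given aggregate throughput and node cost."""
    if cost_per_hour <= 0:
        raise ValueError("cost_per_hour must be positive")
    return tokens_per_second * 3600 / cost_per_hour


# Hypothetical comparison of two serving configurations on the same $2.00/hour node
baseline = tokens_per_dollar(tokens_per_second=400, cost_per_hour=2.0)
tuned = tokens_per_dollar(tokens_per_second=650, cost_per_hour=2.0)
print(f"baseline: {baseline:,.0f} tokens/$")  # → baseline: 720,000 tokens/$
print(f"tuned:    {tuned:,.0f} tokens/$")     # → tuned:    1,170,000 tokens/$
```

Tracking this number before and after a configuration change tells you whether the tuning paid off without changing hardware.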


What Ollama Exposes

Ollama (covered in the series setup) exposes a Prometheus /metrics endpoint:

curl http://ollama.local:8080/metrics 2>/dev/null | grep ollama_

Key metrics:

  • ollama_request_duration_seconds — full request latency histogram
  • ollama_generate_duration_seconds — generation phase only
  • ollama_prompt_eval_duration_seconds — prompt evaluation phase
  • ollama_tokens_generated_total — cumulative token count
  • ollama_pending_requests — current queue depth

ollama_prompt_eval_duration_seconds gives you the prompt-evaluation part of TTFT and ollama_generate_duration_seconds the generation phase, so you get a server-side latency breakdown without instrumented clients. Apply a ServiceMonitor and these appear in Prometheus automatically.

# TTFT proxy: median prompt eval duration
histogram_quantile(0.50, rate(ollama_prompt_eval_duration_seconds_bucket[5m]))

# TPS (rolling 1 minute)
rate(ollama_tokens_generated_total[1m])

# Queue depth
ollama_pending_requests
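
It helps to know what histogram_quantile is doing, because the answer is only as precise as your bucket boundaries. The sketch below is a simplified Python version of the estimate: find the bucket containing the target rank and interpolate linearly inside it. The bucket boundaries and counts are invented for illustration, and edge cases that Prometheus handles (such as native histograms) are left out.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation within the bucket
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical prompt-eval duration buckets (seconds, cumulative counts)
buckets = [(0.1, 20), (0.25, 60), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(round(estimate_quantile(0.50, buckets), 4))  # → 0.2125
print(round(estimate_quantile(0.95, buckets), 4))  # → 0.8125
```

With these buckets the p95 is reported as 0.8125 s, but the true value could sit anywhere between 0.5 and 1.0. If your SLO threshold is 500ms, make sure a bucket boundary sits at or near 0.5.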

Framework Support: Google ADK, LangChain, LangGraph

The choice of framework affects not just what you build but what observability you get for free.

Google ADK

Google ADK (Agent Development Kit) is designed for multi-agent systems. Out of the box it emits traces via OpenTelemetry with spans that cover:

  • Agent invocations with model name and prompt token count
  • Tool calls with input/output capture
  • LLM calls with TTFT and total latency as span attributes
  • Error events with full stack traces

ADK's OTel integration is configuration-based — set the exporter endpoint and you get traces in Tempo without instrumenting a single function:

from google.adk.telemetry import configure_otel
 
configure_otel(
    service_name="my-agent",
    otlp_endpoint="http://tempo.tracing:4318",
    export_traces=True,
)

What ADK doesn't give you: Prometheus metrics out of the box. If you want queue depth or TPS in Grafana, you need to expose them yourself, for example as custom metrics via the Prometheus Python client, and scrape that endpoint alongside your traces.

LangChain

LangChain's observability story has improved significantly with LangSmith, but LangSmith is a managed service rather than something you run locally. For the self-hosted case, LangChain uses callbacks.

The built-in OpenAICallbackHandler captures token counts and costs. For structured tracing, use the opentelemetry-instrumentation-langchain package:

from opentelemetry.instrumentation.langchain import LangchainInstrumentor
 
LangchainInstrumentor().instrument()

This emits spans for chains, agents, and LLM calls with the following attributes on LLM spans:

  • gen_ai.request.model
  • gen_ai.usage.prompt_tokens
  • gen_ai.usage.completion_tokens
  • gen_ai.response.finish_reasons

What's missing: TTFT is not captured by default. The LLM call span covers the full request duration, not the time to first token. For streaming responses, you need to instrument the stream consumer yourself. Add a timing wrapper around the first chunk event if TTFT matters to your SLO.

LangGraph

LangGraph builds on LangChain and inherits its callback/instrumentation system. Where it adds value for observability is graph-level visibility: each node in the graph is a separate span, so you can see exactly which step in a multi-step agent workflow is slow.

from langgraph.graph import StateGraph
# Instrumentation is inherited from LangChain's OTel instrumentor
# Each node transition creates a child span

In Tempo, a LangGraph trace looks like a call tree with nodes like router, tool_executor, llm_call, output_formatter. When a graph is slow, you can immediately see which node is the bottleneck without log diving.

LangGraph's gap: it doesn't emit queue depth or concurrency metrics. If you're serving multiple concurrent graph executions, you need to wrap the execution endpoint with a semaphore and export the queue size as a custom metric.
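A minimal sketch of that wrapper using asyncio. The ConcurrencyGate class and the run_graph coroutine are hypothetical stand-ins (in a real service, run_graph would await your compiled graph); the pending and in_flight counters are the values you would export as gauges through your metrics library.

```python
import asyncio


class ConcurrencyGate:
    """Limit concurrent executions and track how many callers are waiting."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self.pending = 0    # callers waiting for a slot (queue depth)
        self.in_flight = 0  # callers currently executing

    async def run(self, coro_fn, *args):
        self.pending += 1
        try:
            await self._sem.acquire()
        finally:
            self.pending -= 1
        self.in_flight += 1
        try:
            return await coro_fn(*args)
        finally:
            self.in_flight -= 1
            self._sem.release()


async def run_graph(request_id: int) -> int:
    """Hypothetical stand-in for one graph execution."""
    await asyncio.sleep(0.01)
    return request_id


async def main() -> tuple[int, list[int]]:
    gate = ConcurrencyGate(max_concurrent=2)
    tasks = [asyncio.create_task(gate.run(run_graph, i)) for i in range(5)]
    await asyncio.sleep(0)     # yield once so every task reaches the gate
    depth_seen = gate.pending  # 3 callers waiting while 2 execute
    results = await asyncio.gather(*tasks)
    return depth_seen, list(results)


depth, results = asyncio.run(main())
print(depth, results)  # → 3 [0, 1, 2, 3, 4]
```

The gate is created inside main() so the semaphore belongs to the running event loop.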


Instrumenting for TTFT with OpenTelemetry

As noted above, LangChain and LangGraph don't capture TTFT for streaming responses without custom instrumentation. Here is a pattern that works with any framework:

import time

from opentelemetry import trace

tracer = trace.get_tracer("llm-service")


def stream_with_ttft(client, prompt: str, model: str):
    """Yield chunks from a streaming LLM call, recording TTFT and TPS on a span."""
    with tracer.start_as_current_span("llm.stream") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_chars", len(prompt))

        start = time.perf_counter()
        first_token_at: float | None = None
        token_count = 0

        for chunk in client.stream(prompt, model=model):
            if first_token_at is None and chunk:
                # First non-empty chunk: record time to first token
                first_token_at = time.perf_counter()
                span.set_attribute(
                    "llm.ttft_ms",
                    round((first_token_at - start) * 1000, 1)
                )
            token_count += 1  # counts chunks; assumes one token per chunk
            yield chunk

        end = time.perf_counter()
        span.set_attribute("llm.output_tokens", token_count)
        # TPS is the generation rate after the first token, so exclude TTFT
        if first_token_at is not None and token_count > 1 and end > first_token_at:
            tps = (token_count - 1) / (end - first_token_at)
            span.set_attribute("llm.tps", round(tps, 1))

Wrap this around any streaming LLM call regardless of framework. The span attributes llm.ttft_ms and llm.tps appear in Tempo, filterable by model, service, or any dimension you add.


Setting SLOs for LLM Services

SLOs for LLM inference are different from web services because the latency distribution is multimodal — it depends heavily on input length and output length, which vary enormously across requests.

A practical starting point:

Metric      | SLO                 | Notes
TTFT        | p95 < 500ms         | Interactive chat use case
TPS         | p50 > 10 t/s        | Minimum for legible streaming
Queue depth | < 5 for 95% of time | Leading indicator for degradation
Error rate  | < 0.5%              | Includes context-length exceeded

A TTFT of 500ms is roughly the threshold below which users don't consciously notice the wait; above 1 second, users assume something is wrong. A p95 target of 500ms is achievable for 3B–7B models on modern CPU hardware with short-to-medium prompts; larger models or long system prompts require a GPU.

Set Grafana alerts on these, routed to the team that owns the inference service via the Alertmanager routing from part three.
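
Before wiring up alerts, it can be useful to check recorded samples against these targets offline. A sketch using only the standard library; the check_slos function and the sample data are hypothetical, and the thresholds mirror the starting-point SLOs listed above.

```python
from statistics import median, quantiles


def p95(values: list[float]) -> float:
    """95th percentile of observed samples (needs at least two values)."""
    return quantiles(values, n=100, method="inclusive")[94]


def check_slos(
    ttft_ms: list[float],
    tps: list[float],
    queue_depth: list[int],
    errors: int,
    requests: int,
) -> dict[str, bool]:
    """Return pass/fail per SLO for one evaluation window."""
    within_depth = sum(1 for d in queue_depth if d < 5) / len(queue_depth)
    return {
        "ttft_p95_under_500ms": p95(ttft_ms) < 500,
        "tps_p50_over_10": median(tps) > 10,
        "queue_depth_under_5_for_95pct": within_depth >= 0.95,
        "error_rate_under_0_5pct": errors / requests < 0.005,
    }


# Invented sample window: one slow request out of ten breaks the TTFT target
report = check_slos(
    ttft_ms=[180, 220, 240, 260, 300, 310, 350, 400, 450, 900],
    tps=[14.0, 12.5, 11.0, 9.5, 15.2],
    queue_depth=[0] * 38 + [6, 2],
    errors=1,
    requests=400,
)
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'VIOLATED'}")
```

In this window only the TTFT target fails: with ten samples, a single 900ms outlier pulls the p95 to 697.5ms. Small windows make percentile SLOs noisy, so evaluate over enough requests for the tail to be meaningful.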


What's Next

You can now see what your LLMs are doing. The final piece is controlling what they're allowed to do. Part six covers AI Tool Gateways — the proxy layer that sandboxes agent access to tools, limits blast radius, and creates an auditable record of everything an AI agent touched.

AI Tool Gateways: Sandboxing Agent Access in Kubernetes →