2026-05-14
Minimal tracing and costs for AI pipelines in production
Which metrics actually matter for LLM pipeline cost and reliability before buying enterprise observability.
Before you buy an “LLM observability” suite, you can already tell whether a pipeline survives load and what it costs with a handful of numbers captured in your code or structured logs. This post outlines an MVP metric set aligned with RED (rate, errors, duration), adapted to model calls.
A note on pricing: per-token rates change often, so pull current numbers from your vendor's pricing page (OpenAI, Anthropic, Google, etc.) and compute `estimated_cost` with their documented formula (`input_tokens * price_in + output_tokens * price_out` is typical). This post focuses on how to measure, not on fixed dollar amounts.
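As a sketch of that formula, with placeholder prices (the per-million-token rates below are invented for illustration, not any vendor's real pricing):

```python
# Cost from token counts. The rates below are PLACEHOLDERS, not real
# vendor prices -- always pull current numbers from the pricing page.
PRICE_PER_MTOK = {  # hypothetical rates, USD per 1M tokens
    "example-model": {"input": 3.00, "output": 15.00},
}

def estimated_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """input_tokens * price_in + output_tokens * price_out, prices per 1M tokens."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 2,000 input + 500 output tokens at the placeholder rates:
# 2000 * 3 / 1e6 + 500 * 15 / 1e6 = 0.006 + 0.0075 = 0.0135
print(estimated_cost("example-model", 2000, 500))
```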
What not to measure at first
Avoid these early anti-patterns:
- Thirty-chart dashboards before you can list a stable `request_id` linked to a minimal trace.
- "Approximate" token counts with no source: if the client hides usage, read the official response payload before inventing multipliers.
- Average latency only—p95 tells you far more about UX and timeouts.
- Full prompt logging in prod without retention policy—storage cost + privacy risk.
MVP metrics: definitions and units
For each model invocation (HTTP call to the vendor or your gateway) record:
| Field | Units | Notes |
|---|---|---|
| `duration_ms` | ms | Wall clock from send until full response (or stream end). |
| `input_tokens` / `output_tokens` | int | From the provider `usage` object when present. |
| `ok` | boolean | `false` on non-2xx responses or exceptions before the body completes. |
| `error_class` | short string | e.g. `rate_limit`, `timeout`, `invalid_request`, `provider_5xx`. |
| `model` | string | Exact model id sent to the API. |
| `step` | string | If the pipeline has multiple stages (retrieve, classify, generate). |
Use the same row shape for async jobs, recording a `job_id` in place of the synchronous HTTP request id.
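The row shape above can be captured with a small dataclass before it hits structured logs or a table; `LlmCall` and its defaults are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class LlmCall:
    # One row per model invocation; field names match the table above.
    request_id: str          # or job_id for async jobs
    model: str
    step: str
    duration_ms: int
    input_tokens: int
    output_tokens: int
    ok: bool
    error_class: Optional[str] = None  # rate_limit, timeout, ...

row = LlmCall(request_id="req-123", model="example-model", step="generate",
              duration_ms=840, input_tokens=1200, output_tokens=300, ok=True)
print(asdict(row))  # emit as a structured-log line or a DB insert
```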
Percentiles: compute p50 and p95 for duration_ms at least daily (SQL or metrics backend). p95 surfaces slow tails and flaky vendors before the mean moves.
Error rate: errors / attempts, grouped by model and step. A spike on a single step usually points to tool calling or schema validation failing, not to "the model in general".
Estimated vs measured cost
Measured cost = actual tokens × list price. Estimated cost might be a worst-case budget per request (declared max tokens × worst price). When measured systematically undershoots or overshoots estimates, verify:
- Caching or long stable prefixes changing real input tokens.
- Streaming truncations lowering output tokens.
- Hidden retries in the HTTP client multiplying billed calls.
Log `attempt` (1 = first try, 2 = first retry, and so on) so one user action does not look like one request when it is actually three.
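A minimal sketch of the comparison, with invented list prices, showing how hidden retries multiply measured cost against a single budgeted request:

```python
# Worst-case budget per request vs. measured cost from the usage payload.
# Prices are PLACEHOLDERS (USD per 1M tokens), not real vendor rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def budgeted_cost(input_tokens: int, max_output_tokens: int) -> float:
    # Worst case: the model emits every token allowed by the request cap.
    return (input_tokens * PRICE_IN + max_output_tokens * PRICE_OUT) / 1e6

def measured_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1e6

# Three attempts behind one user action -- log each with its own attempt number.
calls = [
    {"attempt": 1, "input_tokens": 1000, "output_tokens": 0},    # timeout
    {"attempt": 2, "input_tokens": 1000, "output_tokens": 0},    # rate_limit
    {"attempt": 3, "input_tokens": 1000, "output_tokens": 400},  # success
]
billed = sum(measured_cost(c["input_tokens"], c["output_tokens"]) for c in calls)
# One "request" was budgeted once but billed the input tokens three times.
print(billed)
```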
Sensible alerting
Three thresholds cover many teams:
- Daily spend: summed tokens × price exceeding a fixed daily budget.
- p95 latency beyond a contractual SLA (e.g. 8 s on a synchronous endpoint).
- Error rate above Y% for 15 minutes on the same `step`.
Avoid paging on a single 500: bucket by `error_class` and require a minimum volume.
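The error-rate threshold can be sketched as a pure function over recent rows; the 5% threshold and the minimum-volume guard below are illustrative defaults, not recommendations:

```python
from collections import defaultdict

def error_alerts(rows, threshold=0.05, min_volume=20):
    """Return steps whose error rate over the window breaches the threshold.

    rows: dicts shaped like the metrics table ({"step": ..., "ok": ...}).
    min_volume keeps a lone 500 from paging anyone.
    """
    stats = defaultdict(lambda: [0, 0])  # step -> [errors, attempts]
    for r in rows:
        stats[r["step"]][1] += 1
        if not r["ok"]:
            stats[r["step"]][0] += 1
    return [step for step, (errs, n) in stats.items()
            if n >= min_volume and errs / n > threshold]

rows = [{"step": "generate", "ok": i % 4 != 0} for i in range(40)]  # 25% errors
rows += [{"step": "retrieve", "ok": True} for _ in range(40)]       # healthy
print(error_alerts(rows))  # only the noisy step fires, never a lone 500
```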
When to adopt a dedicated product
Move to Langfuse, Helicone, LangSmith, etc. when at least one is true:
- You need distributed tracing across services (DB retrieve + worker + LLM) with automatic correlation.
- Non-engineers must explore traces without raw log access.
- You want versioned eval datasets co-located with traces.
Until then, an `llm_calls` table in Postgres/SQLite with the columns above plus an index on `created_at` answers most "why did cost spike yesterday?" questions.
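A minimal version of that table in SQLite (the DDL is a sketch; adapt types and the timestamp default for Postgres):

```python
import sqlite3

# In-memory SQLite stand-in for the llm_calls table described above.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE llm_calls (
    created_at    TEXT DEFAULT (datetime('now')),
    request_id    TEXT,
    model         TEXT,
    step          TEXT,
    duration_ms   INTEGER,
    input_tokens  INTEGER,
    output_tokens INTEGER,
    ok            INTEGER,   -- SQLite has no boolean type
    error_class   TEXT
)""")
con.execute("CREATE INDEX idx_llm_calls_created_at ON llm_calls (created_at)")

con.execute(
    "INSERT INTO llm_calls (request_id, model, step, duration_ms,"
    " input_tokens, output_tokens, ok) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("req-1", "example-model", "generate", 900, 1200, 300, 1))

tokens, = con.execute(
    "SELECT sum(input_tokens + output_tokens) FROM llm_calls").fetchone()
print(tokens)  # 1500
```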
Sample SQL
```sql
SELECT
  date_trunc('day', created_at) AS day,
  model,
  sum(input_tokens + output_tokens) AS tokens,
  avg(duration_ms) AS avg_ms,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms,
  sum(CASE WHEN ok THEN 0 ELSE 1 END)::float / count(*) AS error_rate
FROM llm_calls
WHERE created_at > now() - interval '7 days'
GROUP BY 1, 2
ORDER BY 1 DESC, tokens DESC;
```
(Adapt `percentile_cont` for SQLite—use a window approximation or export to BI.)
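One such approximation is nearest-rank: read the value at rank ceil(0.95 * n) from the ordered column. A sketch in SQLite with hypothetical durations:

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE llm_calls (duration_ms INTEGER)")
con.executemany("INSERT INTO llm_calls VALUES (?)",
                [(d,) for d in range(100, 2100, 100)])  # 20 sample durations

# SQLite lacks percentile_cont, so take the nearest-rank value:
# the row at rank ceil(0.95 * n) in ascending order.
(n,) = con.execute("SELECT count(*) FROM llm_calls").fetchone()
rank = math.ceil(0.95 * n)  # 19 of 20
(p95,) = con.execute(
    "SELECT duration_ms FROM llm_calls ORDER BY duration_ms LIMIT 1 OFFSET ?",
    (rank - 1,)).fetchone()
print(p95)  # 1900
```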
Align metrics with vendor usage and (optionally) OpenTelemetry
- Vendor payloads: when available, persist raw `usage` slices (prompt/completion/cached) so invoices reconcile with logs. OpenAI Chat/Responses objects expose `usage`; with prompt caching you may see `prompt_tokens_details.cached_tokens`—see the Chat object reference plus current caching guides. Do the same for Anthropic/Google: normalize into your `llm_calls.input_tokens` / `output_tokens` columns.
- OpenTelemetry GenAI: client span conventions (`gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.*`, …) are evolving—worth adopting if you already standardize on OTel. Start with the Gen AI span spec (development/experimental when researched; confirm the semconv version and `OTEL_SEMCONV_STABILITY_OPT_IN` on older SDKs).
- RED alignment: one "inference" span per model call with stable attributes; keep prompt bodies off prod spans unless you have a redaction pipeline.
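Normalizing a vendor usage payload can be one small function. The field names below follow the documented OpenAI Chat Completions `usage` object, but treat the exact shape as an assumption and verify against your SDK version:

```python
# Map an OpenAI-style usage payload onto the llm_calls columns.
# Assumed shape: {"prompt_tokens": ..., "completion_tokens": ...,
#                 "prompt_tokens_details": {"cached_tokens": ...}}
def normalize_usage(usage: dict) -> dict:
    details = usage.get("prompt_tokens_details") or {}
    return {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        "cached_tokens": details.get("cached_tokens", 0),
    }

raw = {"prompt_tokens": 1200, "completion_tokens": 300,
       "prompt_tokens_details": {"cached_tokens": 1024}}
print(normalize_usage(raw))
```

Write one such adapter per vendor so the `llm_calls` columns stay comparable across models.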
Summary
Track duration, tokens, outcome, and step per invocation; compute p95 and error rate; reconcile with published pricing. Everything else is polish, not a prerequisite for production sanity.