Hacker News | new | past | comments | ask | show | jobs | submit | akarshc's comments

I wrote this after building an AI gateway layer for ModelRiver to handle streaming responses, retries, and provider failover across LLM APIs. Phoenix supervision trees ended up being especially useful for isolating failures between model calls. Curious how others are structuring gateway layers in front of LLM providers.


I’ve worked on and shipped a few AI systems that reached real users.

This post isn’t about models or prompts. It’s about the things that kept breaking once AI moved off the happy path: async jobs, retries, silent failures, provider outages, cost blowups, and debugging without visibility.

I wrote this mostly as a way to document the mistakes I made and what I wish I had known earlier. Happy to answer questions or dig deeper into any of the failure modes.


This is one of the harder problems, and there isn’t a perfect answer.

The main thing we try to avoid is pretending mid-stream retries are the same as pre-request retries. Once a stream has started, we treat it as a sequence of events with checkpoints rather than a single opaque response. Retries are scoped to known safe boundaries, and anything ambiguous is surfaced explicitly instead of silently re-emitting tokens.

In other words, correctness is prioritized over pretending the stream is seamless. If we can’t guarantee no duplication, we make that visible rather than hide it.
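To make the checkpoint idea concrete, here's a minimal Python sketch (not our actual implementation; `flaky_stream` and the failure positions are stand-ins for a real provider stream):

```python
def flaky_stream(start, failures):
    # Hypothetical provider stream: yields tokens from `start`, dropping
    # the connection at each position in `failures` (consumed as they occur).
    for i in range(start, 10):
        if failures and failures[0] == i:
            failures.pop(0)
            raise ConnectionError("stream dropped")
        yield i

def stream_with_checkpoints(failures, max_retries=3):
    # Retry only from the last known-safe boundary so tokens are never
    # re-emitted; if retries run out, fail loudly instead of hiding it.
    emitted = 0      # checkpoint: tokens already delivered downstream
    attempts = 0
    while True:
        try:
            for tok in flaky_stream(emitted, failures):
                yield tok
                emitted += 1   # advance the checkpoint only after delivery
            return
        except ConnectionError:
            attempts += 1
            if attempts > max_retries:
                raise RuntimeError(f"gave up after {emitted} tokens")

tokens = list(stream_with_checkpoints([3, 6]))
# tokens == [0, 1, ..., 9] with no duplicates despite two mid-stream drops
```

The point is the checkpoint: the resume position advances only after a token has actually been delivered, so a retry can never silently re-emit.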


This was something we were careful about. The request and event models are intentionally close to what most providers already expose, rather than introducing a completely new abstraction.

Teams usually integrate it incrementally in front of existing calls. If you remove it, you’re mostly deleting the orchestration layer and keeping your provider integrations and client logic. You lose centralized retries and observability, but you’re not stuck rewriting your entire request model.

If adopting it requires a full rewrite, that’s usually a sign it’s being applied too broadly.


I’m one of the builders. Once AI requests moved beyond simple sync calls, we kept running into the same problems in production: retries hiding failures, async flows that were hard to reason about, frontend state drifting, and providers timing out mid-request.

This page breaks down the three request patterns we see teams actually using in production (sync, async, and event-driven async), how data flows in each case, and why we ended up favoring an event-driven approach for interactive, streaming apps.
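For anyone who wants the gist of the event-driven variant without reading the whole page, a toy Python sketch (the event names and worker are made up for illustration): instead of one opaque response, the worker pushes lifecycle events the client consumes as they arrive.

```python
import queue
import threading

def run_event_driven(prompt, events):
    # Hypothetical worker: emits lifecycle events rather than returning
    # a single response, so the client always knows the request's state.
    events.put(("accepted", prompt))
    for chunk in prompt.split():        # stand-in for streamed chunks
        events.put(("chunk", chunk))
    events.put(("done", None))

events = queue.Queue()
t = threading.Thread(target=run_event_driven,
                     args=("hello streaming world", events))
t.start()

received = []
while True:
    kind, payload = events.get()
    received.append(kind)
    if kind == "done":
        break
t.join()
# received == ["accepted", "chunk", "chunk", "chunk", "done"]
```

Because every state change is an explicit event, the frontend can't silently drift from what the backend thinks is happening.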

Happy to answer questions or go deeper on any part of the architecture.


Totally agree, that “what actually hit the wire?” view is critical once things go async.

ModelRiver already has this covered via request logs. Every request captures the full lifecycle: the exact payload sent to the provider, streaming chunks as they arrive, partial responses, errors, retries, and the final outcome. Even if a stream fails midway, you can still inspect what was sent and what came back before the failure.

So you can clearly tell whether the issue is payload shape, provider throttling, or a mid-stream failure, before any retry or failover logic kicks in. That wire-level visibility is core to how we approach debugging async AI requests.


If streaming behavior is still product-specific and changing fast, this adds friction. It only pays off once failure handling stabilizes and starts repeating across the system.


Queues work well before or after a request, but they’re awkward once a response is already streaming. This layer exists mainly to handle failures during a stream without spreading that logic across handlers, workers, and client code.


While building AI features that rely on real-time streaming responses, I kept running into failures that were hard to reason about once things went async.

Requests would partially stream, providers would throttle or fail mid-stream, and retry logic ended up scattered across background jobs, webhooks, and request handlers.

I built ModelRiver as a thin API layer that sits between an app and AI providers and centralizes streaming, retries, failover, and request-level debugging in one place.
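The core shape is simple. A rough Python sketch of the centralizing idea (the provider callables are stand-ins, not ModelRiver's actual API):

```python
def call_with_failover(providers, prompt, max_attempts_per_provider=2):
    # Hypothetical thin gateway: try each provider in order with bounded
    # retries, and raise with full context if everything fails, instead
    # of scattering this logic across handlers, workers, and client code.
    errors = []
    for call in providers:
        for attempt in range(max_attempts_per_provider):
            try:
                return call(prompt)
            except Exception as exc:
                errors.append((call.__name__, attempt, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")

# Stand-in provider callables
def primary(prompt):
    raise TimeoutError("provider timed out")

def fallback(prompt):
    return f"echo: {prompt}"

result = call_with_failover([primary, fallback], "hi")
# result == "echo: hi"; `primary` was retried, then failed over
```

Everything the app sees goes through one choke point, which is also what makes request-level logging and debugging tractable.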

It’s early and opinionated, and there are tradeoffs. Happy to answer technical questions or hear how others are handling streaming reliability in production AI apps.


In the age of AI, how I rewire my mind to pick up new programming languages.

