AI · May 5, 2025 · 7 min read

What I Learned Shipping Three AI Integrations in Six Months

Three AI features shipped. Two rewrites. One lesson that kept repeating itself. Here's what the demos never show you.


Three AI integrations shipped in six months. Two of them required a partial rewrite after the first production week. The third held up — but only because I made the mistakes on the first two first.

This is not a tutorial. This is what the integration demos never show you.

The streaming problem nobody warns you about

Every OpenAI and Anthropic demo streams tokens to a <div>. It looks effortless. In production, three things break that effortlessness immediately.

First, the connection drops. Mobile users, flaky hotel Wi-Fi, browser tabs that go background — they all interrupt a stream mid-sentence. If you are not handling stream interruptions with a retry or a graceful fallback state, you are showing users a half-rendered paragraph with no explanation. That is worse than no AI feature at all.
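Here is a minimal sketch of that handling, assuming a plain fetch-based stream. The /api/generate route and the callback names are placeholders, not any specific SDK's API:

```typescript
// Hypothetical sketch: read a token stream and surface interruptions
// explicitly instead of leaving a half-rendered paragraph on screen.
async function readStream(
  prompt: string,
  onToken: (chunk: string) => void,
  onInterrupted: (partial: string) => void,
): Promise<string> {
  let received = "";
  try {
    const res = await fetch("/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });
    if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      const chunk = decoder.decode(value, { stream: true });
      received += chunk;
      onToken(chunk);
    }
    return received;
  } catch (err) {
    // The stream died mid-sentence: hand the partial text to an explicit
    // "interrupted" state so the UI can offer a retry instead of silence.
    onInterrupted(received);
    throw err;
  }
}
```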

Second, React re-renders on every token. If your streaming text lives inside a component that also manages other state, every token triggers a re-render of more than just the text node. On a mid-range Android device, this is visible jitter within the first ten seconds. The fix is to isolate the streaming text into a dumb leaf component that has no siblings re-rendering alongside it.
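A sketch of that isolation, with subscribeToStream as a hypothetical helper that fires a callback once per token and returns an unsubscribe function. Because the token buffer lives inside the leaf, each token re-renders only this node, never its siblings:

```tsx
import { memo, useEffect, useState } from "react";

// Illustrative dumb leaf component for streaming text. The only state
// that changes per token is local to this component.
const StreamingText = memo(function StreamingText({
  subscribeToStream,
}: {
  subscribeToStream: (onToken: (chunk: string) => void) => () => void;
}) {
  const [text, setText] = useState("");
  useEffect(
    () => subscribeToStream((chunk) => setText((prev) => prev + chunk)),
    [subscribeToStream],
  );
  return <p>{text}</p>;
});
```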

Third, the Vercel AI SDK's useChat hook is the right tool, but it makes assumptions about your data shape that collide with custom system prompts. Read the source before you wire it to anything complex.
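For reference, the minimal wiring looks something like this. I am assuming the body option behaves as it did in the SDK versions I used; the hook's options have moved between releases, which is exactly why reading the source matters:

```tsx
"use client";
import { useChat } from "ai/react";

// A minimal sketch, not the SDK's canonical usage. Extra fields in `body`
// are sent with every request; the server route merges them into its own
// system prompt rather than trusting client-supplied prompt text.
export function ChatPanel({ documentId }: { documentId: string }) {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat",
    body: { documentId },
  });
  return (
    <form onSubmit={handleSubmit}>
      {messages.map((m) => (
        <p key={m.id}>{m.role}: {m.content}</p>
      ))}
      <input value={input} onChange={handleInputChange} />
    </form>
  );
}
```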

Token costs scale differently than you think

A feature that costs $4 in testing costs $400 at modest scale. That math is obvious. What is not obvious is that the costs spike before the users do.

The reason is prompt construction. In testing, you write a clean system prompt and a clean user message. In production, you start injecting context — user history, document excerpts, previous conversation turns. The context window fills up, and you stop noticing because the feature still works. The bill is the first sign something changed.

On one project, a document summarization feature went from 800 tokens per call in staging to 4,200 tokens per call in production. The difference was the boilerplate context I was prepending that I had forgotten about. A token counter logged at the API boundary would have caught this in the first deploy. I added one on the third.
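The logger does not need to be clever. A sketch, using the usage block the provider returns with each response (field names below follow Anthropic's input_tokens and output_tokens; OpenAI calls them prompt_tokens and completion_tokens):

```typescript
// Hedged sketch of a boundary logger. `Usage` mirrors the usage block
// chat APIs return; adapt the field names to your provider.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

function logTokenUsage(feature: string, usage: Usage): void {
  // An 800 -> 4,200 jump in input tokens per call shows up here on the
  // first deploy instead of on the monthly bill.
  console.log(
    JSON.stringify({
      event: "llm_call",
      feature,
      inputTokens: usage.input_tokens,
      outputTokens: usage.output_tokens,
    }),
  );
}
```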

The UX gap between a demo and a product

An AI demo has one happy path. A product has a user who types gibberish, submits an empty form, asks the same question four times expecting a different answer, and then complains it is slow.

The two UX decisions that actually mattered:

— Loading states that set expectations, not just spinners. "Analyzing your data" is better than a rotating circle. "This takes around 10 seconds for larger documents" did more to reduce abandonment than any performance optimization I made.

— A visible way to stop the generation. If a user realizes their prompt was wrong three seconds into a 20-second response, they need an escape. A stop button is not optional. It is the difference between a user who corrects their prompt and a user who closes the tab.
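Wiring that stop button is mostly AbortController plumbing. A sketch, with streamCompletion standing in for whatever signal-aware request function you already have:

```tsx
import { useRef, useState } from "react";

// Hypothetical signal-aware request function; any fetch-based streaming
// call that accepts an AbortSignal fits here.
declare function streamCompletion(opts: { signal: AbortSignal }): Promise<void>;

export function GenerateControls() {
  const controllerRef = useRef<AbortController | null>(null);
  const [generating, setGenerating] = useState(false);

  async function start() {
    controllerRef.current = new AbortController();
    setGenerating(true);
    try {
      await streamCompletion({ signal: controllerRef.current.signal });
    } catch {
      // An abort lands here; that is the expected path when the user stops.
    } finally {
      setGenerating(false);
    }
  }

  return generating ? (
    <button onClick={() => controllerRef.current?.abort()}>Stop</button>
  ) : (
    <button onClick={start}>Generate</button>
  );
}
```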

Rate limits are a product decision, not a technical one

Every AI API has rate limits. The question is not how to raise them — it is what to do when you hit them. I have seen three approaches on real projects:

— Queue the request silently and deliver the result when the slot opens. Works for async features. Terrible for anything conversational.

— Degrade gracefully to a cached or rule-based fallback. Requires building the fallback, which teams resist. Worth it.

— Show an honest error and ask the user to try again in a moment. Underestimated. Users accept this if the product is otherwise reliable.

The worst approach is to return a generic 500 and let the user think the feature is broken.
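The honest version is a few lines at the API boundary. The wording and the Retry-After handling below are illustrative, not from any particular SDK:

```typescript
// Sketch: catch the provider's 429 and turn it into a message a user
// can act on, instead of letting it bubble up as a generic 500.
async function callWithHonestFallback(run: () => Promise<Response>) {
  const res = await run();
  if (res.status === 429) {
    const retryAfter = res.headers.get("Retry-After");
    return {
      ok: false as const,
      userMessage: retryAfter
        ? `We're at capacity. Try again in ${retryAfter} seconds.`
        : "We're at capacity. Try again in a moment.",
    };
  }
  if (!res.ok) throw new Error(`Upstream error: ${res.status}`);
  return { ok: true as const, response: res };
}
```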

What I would do differently

On all three projects I underbuilt the observability layer. I knew what the AI was returning. I did not know how often it was returning something the user immediately deleted or regenerated. That data would have changed the prompts faster than any amount of offline testing.

Log the regeneration rate. It tells you more about prompt quality than any eval suite you can write before launch.
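A first pass can be as small as this; the event names and the in-memory store are stand-ins for whatever analytics pipeline you already run:

```typescript
// Minimal sketch: count regenerations against generations per feature.
const counts = new Map<string, { generated: number; regenerated: number }>();

function track(feature: string, event: "generated" | "regenerated"): void {
  const c = counts.get(feature) ?? { generated: 0, regenerated: 0 };
  c[event] += 1;
  counts.set(feature, c);
}

// A rising ratio means users are throwing away what the prompt produces.
function regenerationRate(feature: string): number {
  const c = counts.get(feature);
  return c && c.generated > 0 ? c.regenerated / c.generated : 0;
}
```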

Conclusion

The hard part of AI integrations is not the API call. It is the streaming edge cases, the token cost discipline, and the UX between the happy path and everything else. None of that shows up in the demo.

Summary

Streaming requires interruption handling and isolated React components or you get jitter and broken states. Token costs spike because of context injection, not feature scope — add a boundary logger early. UX needs explicit time expectations and a stop button, not just a spinner. Rate limit handling is a product decision: queue, degrade, or be honest. The highest-leverage observability metric is regeneration rate, not latency.