Vercel AI SDK with ProxyLLM: Provider Setup and Caveats
Configure createOpenAICompatible with ProxyLLM's base URL and OpenAI calls bill to a flat ChatGPT plan. Setup code, the streaming caveat, and per-environment keys.
The Vercel AI SDK reaches any OpenAI-compatible endpoint through the @ai-sdk/openai-compatible package. Create a provider with createOpenAICompatible, set baseURL to https://api.proxyllm.ai/v1 with a ProxyLLM key, and OpenAI-model calls run through Codex Hosted on your own ChatGPT subscription instead of per-token API billing. One caveat belongs in the first paragraph rather than the footnotes: the Codex lane returns complete responses, so streamText streams only when an API-key lane serves the call.
Here is the config, the caveat handled properly in code, and the cost math for a route that runs all day.
The provider config
npm install ai @ai-sdk/openai-compatible

// lib/providers.ts
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";

export const proxyllm = createOpenAICompatible({
  name: "proxyllm",
  baseURL: "https://api.proxyllm.ai/v1",
  apiKey: process.env.PROXYLLM_API_KEY,
});
Use it anywhere the SDK takes a model:
// app/api/summarize/route.ts
import { generateText } from "ai";
import { proxyllm } from "@/lib/providers";

export async function POST(req: Request) {
  const { document } = await req.json();

  // generateText awaits the complete result, which matches what the Codex lane returns.
  const { text } = await generateText({
    model: proxyllm("gpt-5"),
    prompt: `Summarize for the weekly digest:\n\n${document}`,
  });

  return Response.json({ summary: text });
}
Behind the provider, we serve OpenAI-model calls through Codex on your own ChatGPT account, log every request, and enforce the key’s budget cap. Connecting the account takes about five minutes with OpenAI’s device-code flow; the setup guide walks through it.
The streaming caveat, stated plainly
The Codex lane returns complete responses. streamText only streams when an API-key lane serves the call. That single fact should decide which SDK function each route uses:
- generateText and generateObject are a natural fit for the flat lane. They await a complete result anyway, which is exactly what the Codex lane returns.
- streamText still works, but a call served by Codex delivers the entire answer in one piece. A chat UI built on it loses its typing effect and shows a wait followed by the full message.
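To check which behavior a route is actually getting, one option is to count the chunks a stream delivers. The sketch below is a hypothetical helper, not part of the AI SDK: inspectDelivery and fakeStream are names made up for illustration, and the only assumption is that the stream is an AsyncIterable<string>, which is the shape of the textStream property on a streamText result.

```typescript
// Summary of how a text stream was delivered.
interface DeliveryReport {
  text: string;                  // full concatenated answer
  chunks: number;                // how many pieces arrived
  msToFirstChunk: number | null; // wait before the first piece, null if none arrived
}

// Drain a text stream and record chunk count and time to first chunk.
async function inspectDelivery(
  stream: AsyncIterable<string>,
): Promise<DeliveryReport> {
  const startedAt = Date.now();
  let text = "";
  let chunks = 0;
  let msToFirstChunk: number | null = null;

  for await (const chunk of stream) {
    if (chunks === 0) msToFirstChunk = Date.now() - startedAt;
    chunks += 1;
    text += chunk;
  }
  return { text, chunks, msToFirstChunk };
}

// Stand-in stream for local experiments (hypothetical, makes no network calls).
async function* fakeStream(parts: string[]): AsyncGenerator<string> {
  for (const part of parts) yield part;
}
```

In a route this would be called as inspectDelivery(result.textStream), where result comes from streamText. A chunk count of one after a long msToFirstChunk suggests the Codex lane served the call, though a very short answer can also arrive in a single chunk, so treat it as a hint rather than proof.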
The production pattern is two providers, one per job:
// lib/providers.ts
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
import { createOpenAI } from "@ai-sdk/openai";

// Flat lane: background work on your ChatGPT subscription
export const proxyllm = createOpenAICompatible({
  name: "proxyllm",
  baseURL: "https://api.proxyllm.ai/v1",
  apiKey: process.env.PROXYLLM_API_KEY,
});

// Streaming lane: user-facing chat on your own API key
export const openaiDirect = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// app/api/chat/route.ts
import { streamText } from "ai";
import { openaiDirect } from "@/lib/providers";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = streamText({ model: openaiDirect("gpt-5"), messages });
  // This is the AI SDK 4.x method name; AI SDK 5 renames it to toUIMessageStreamResponse().
  return result.toDataStreamResponse();
}
The split usually lands 80/20: summaries, enrichment, moderation, structured extraction, and queue jobs carry most of the volume and none of them need streaming. The chat surface that does need it is a small slice of spend on its own key. The same split in plain Node, outside the AI SDK, is covered in the Node.js base URL guide.
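If many routes share the two providers, it can help to keep the lane decision in one place. The following is a minimal sketch under that assumption; Lane, Lanes, and pickModel are hypothetical names, not AI SDK exports, and the helper only relies on each provider being callable with a model id.

```typescript
// A lane is any function that turns a model id into a model handle.
type Lane<M> = (modelId: string) => M;

interface Lanes<M> {
  flat: Lane<M>;      // Codex lane: complete responses, flat billing
  streaming: Lane<M>; // API-key lane: token-by-token streaming
}

// Choose the lane by whether the calling route needs streaming.
function pickModel<M>(
  lanes: Lanes<M>,
  needsStreaming: boolean,
  modelId: string,
): M {
  return needsStreaming ? lanes.streaming(modelId) : lanes.flat(modelId);
}
```

With the providers from lib/providers.ts, the intended call is pickModel({ flat: proxyllm, streaming: openaiDirect }, false, "gpt-5") for a background route and the same call with true for the chat route.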
What a background route costs both ways
A summarization route serving 5,000 requests a day at 2,000 input and 300 output tokens per call, on GPT-5 (OpenAI’s June 2026 list: $1.25 per million input, $10 per million output):
| Line | Arithmetic | Monthly |
|---|---|---|
| Calls | 5,000/day × 30 = 150,000 | |
| Input | 150,000 × 2,000 = 300M tokens × $1.25/M | $375 |
| Output | 150,000 × 300 = 45M tokens × $10/M | $450 |
| API total (GPT-5) | $375 + $450 | $825 |
| Flat path | ChatGPT Pro 5x $100 + ProxyLLM $129 | $229 |
A summarization route doing 5,000 calls a day runs about $825 a month at GPT-5 list prices and about $229 flat. On our planning estimates a $20 Plus window absorbs roughly $700 of API-equivalent work a month, which falls just short of this $825 workload, so Pro 5x is the honest tier here; treat both figures as estimates, since OpenAI tunes plan limits over time. Once a window is exhausted, requests fall back to a second connected account, then to your own API key, and the request log shows which lane served each call.
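To rerun the comparison with your own traffic, the table's arithmetic fits in a few lines. This is a sketch only: the type and function names are illustrative, the prices are the list figures quoted above, and a 30-day month is assumed as in the table.

```typescript
interface RouteLoad {
  callsPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
}

interface TokenPrices {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Monthly per-token cost in USD for one route.
function monthlyApiCost(load: RouteLoad, prices: TokenPrices, days = 30): number {
  const calls = load.callsPerDay * days;
  const inputCost =
    ((calls * load.inputTokensPerCall) / 1_000_000) * prices.inputPerMillion;
  const outputCost =
    ((calls * load.outputTokensPerCall) / 1_000_000) * prices.outputPerMillion;
  return inputCost + outputCost;
}

// The summarization route and GPT-5 list prices from the table.
const summarizeRoute: RouteLoad = {
  callsPerDay: 5_000,
  inputTokensPerCall: 2_000,
  outputTokensPerCall: 300,
};
const gpt5List: TokenPrices = { inputPerMillion: 1.25, outputPerMillion: 10 };

const apiMonthly = monthlyApiCost(summarizeRoute, gpt5List); // 375 + 450 = 825
const flatMonthly = 100 + 129; // ChatGPT Pro 5x + ProxyLLM = 229
```

Swapping in a different token profile shows quickly whether a route sits above or below the point where the flat path wins.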
Keys per environment
Vercel’s environment scoping maps cleanly onto per-key budgets. Generate three ProxyLLM keys, store each as PROXYLLM_API_KEY in the matching environment, and cap them differently: preview tight, staging moderate, production sized to real traffic. A preview deployment that starts looping retries hits its own small cap instead of eating production’s window, and the request log shows spend per environment without any tagging work on your side.
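Because each environment carries its own key, a missing variable is easiest to catch at startup rather than on the first request. Here is one way that check might look; requireProxyKey is a hypothetical helper, and the "local" label used when VERCEL_ENV is absent is an assumption made for the error message only.

```typescript
type Env = Record<string, string | undefined>;

// Return the ProxyLLM key for this deployment, or throw an error that
// names the environment so the misconfigured scope is obvious.
function requireProxyKey(env: Env): string {
  const key = env.PROXYLLM_API_KEY?.trim();
  if (!key) {
    // VERCEL_ENV is a Vercel system variable; "local" is an assumed fallback label.
    const where = env.VERCEL_ENV ?? "local";
    throw new Error(`PROXYLLM_API_KEY is not set for the "${where}" environment`);
  }
  return key;
}
```

In lib/providers.ts the provider config would then read apiKey: requireProxyKey(process.env), so a preview deployment without a key fails on boot instead of sending unauthenticated requests.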
Two scope notes round out the picture. The model surface on the flat lane is whatever Codex serves, which means chat models, so keep embed() calls and fine-tune jobs on your own OpenAI key; we pass that key through with no markup. The full compatibility rundown is in what works with Codex Hosted.
The condensed setup lives on the Vercel AI SDK integration page. If your AI routes already have a real OpenAI line item, the calculator maps it to a plan tier in thirty seconds.
Frequently asked questions
How do I point the Vercel AI SDK at a custom OpenAI-compatible endpoint?
Install @ai-sdk/openai-compatible and call createOpenAICompatible with a name, baseURL, and apiKey. Set baseURL to https://api.proxyllm.ai/v1 with a ProxyLLM key, and models from that provider work in generateText, streamText, and generateObject like any other AI SDK model.
Does streamText stream through ProxyLLM?
Only when an API-key lane serves the call. The Codex Hosted lane returns complete responses, so a streamText call served by Codex delivers the whole answer at once rather than token by token. Use generateText on the Codex lane and keep streaming UIs on an API-key lane.
Why did my streamText response arrive all at once?
The call was served by the Codex lane, which returns complete responses by design. Nothing is broken. Route user-facing streaming paths through an API-key lane and let background routes take the flat lane, where streaming does not matter.
Can preview, staging, and production use separate keys?
Yes, and they should. Generate one ProxyLLM key per Vercel environment, store each as PROXYLLM_API_KEY in the matching environment scope, and give preview a tight budget cap. The request log then breaks out spend per environment.