Model integration · Cerebras

Cerebras for latency-sensitive work.

Cerebras inference is fast on paper. Put cerebras/llama-3.1-70b behind the same endpoint as your other models, on your own key, and check whether the speed holds for your real prompts.

$129/month SaaS. Bring your own model keys. No inference markup.

Three steps to connect.

01

Pick the Cerebras speed lane

Cerebras sells raw inference speed. ProxyLLM routes to Cerebras-backed models through providers that expose them, using your own key. Storing a Cerebras key directly in ProxyLLM is future work.

02

One client surface

Send chat completions to https://api.proxyllm.ai/v1 with your ProxyLLM key and keep the same request shape used for every other model.
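A minimal sketch of what "same request shape" can look like: one helper builds the request body, and only the model string changes between lanes. The helper name `buildChatRequest` and the second model ID are hypothetical placeholders, not names taken from ProxyLLM.

```typescript
// Minimal sketch: one request shape, only the model ID varies.
// `buildChatRequest` is a hypothetical helper, and
// "other-provider/some-model" is a placeholder, not a real model ID.
interface ChatRequest {
  model: string;
  messages: { role: "user"; content: string }[];
}

function buildChatRequest(model: string, content: string): ChatRequest {
  return { model, messages: [{ role: "user", content }] };
}

const leadPrompt = "Score these leads from 1 to 5.";
const fastRequest = buildChatRequest("cerebras/llama-3.1-70b", leadPrompt);
const baselineRequest = buildChatRequest("other-provider/some-model", leadPrompt);
```

Either object should be usable as the argument to `client.chat.completions.create(...)` on the same client, which makes side-by-side comparisons a one-line change.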

03

Measure latency honestly

Headline tokens-per-second is not end-to-end latency: it leaves out network time, queueing, and time to first token. Read real timings in ProxyLLM request logs before you move a workflow to Cerebras.
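As a rough illustration of why throughput alone can mislead, the time for one call can be approximated as time to first token plus generation time. The formula is a simplification, the function name is hypothetical, and every number below is made up for illustration; none of them are measurements of Cerebras or any other provider.

```typescript
// Rough model (an assumption, not a ProxyLLM formula):
//   endToEndMs ≈ timeToFirstTokenMs + outputTokens * 1000 / tokensPerSecond
function estimateEndToEndMs(
  timeToFirstTokenMs: number,
  outputTokens: number,
  tokensPerSecond: number,
): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be > 0");
  }
  return timeToFirstTokenMs + (outputTokens * 1000) / tokensPerSecond;
}

// Hypothetical numbers for a short, 50-token answer.
// Lane A: 20x the throughput, but a slower first token.
const laneA = estimateEndToEndMs(600, 50, 2000); // 600 + 25 = 625 ms
// Lane B: modest throughput, quicker first token.
const laneB = estimateEndToEndMs(300, 50, 100); // 300 + 500 = 800 ms
```

In this made-up case a 20x throughput gap shrinks to roughly 1.3x end to end, because the short output is dominated by time to first token. Longer outputs shift the balance back toward throughput, which is why it helps to check against your real prompts.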

Fast inference, measured.

Call cerebras/ models where your configured provider exposes them. Your key, your logs, no markup.

client.ts
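import OpenAI from "openai";

// Point the standard OpenAI SDK at ProxyLLM and authenticate with your ProxyLLM key.
const client = new OpenAI({
  baseURL: "https://api.proxyllm.ai/v1",
  apiKey: "pk_live_...",
});

// Same chat-completions call as for any other model; only the model ID changes.
const r = await client.chat.completions.create({
  model: "cerebras/llama-3.1-70b",
  messages: [{ role: "user", content: "Score these leads from 1 to 5." }],
});

// Print the model's reply.
console.log(r.choices[0]?.message.content);

Codex Hosted · the main feature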

Run your AI workloads on your ChatGPT subscription.

ProxyLLM runs OpenAI's Codex for you, signed in with your own ChatGPT account. Your apps call one OpenAI-compatible endpoint and the work bills to your flat plan instead of per-token API pricing.

$129/month · normal SaaS pricing

Choose speed with logs.

ProxyLLM records latency, tokens, and failures on every call, so the fast lane earns its place with production data. $129/month flat.
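One way to let production data decide is to compare latency percentiles and failure rates per model. The sketch below assumes a hypothetical record shape (`model`, `latencyMs`, `ok`) and uses made-up sample values; ProxyLLM's actual log fields and export format are not described here, so treat the field names as placeholders.

```typescript
// Hypothetical log record; the real ProxyLLM log fields may differ.
interface CallRecord {
  model: string;
  latencyMs: number;
  ok: boolean;
}

// Nearest-rank percentile over an unsorted, non-empty sample; p in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index]!;
}

// Summarize one model: call count, failure rate, and p50/p95 of successful calls.
function summarizeLane(records: CallRecord[], model: string) {
  const calls = records.filter((r) => r.model === model);
  const okLatencies = calls.filter((r) => r.ok).map((r) => r.latencyMs);
  const failures = calls.length - okLatencies.length;
  return {
    calls: calls.length,
    failureRate: calls.length === 0 ? 0 : failures / calls.length,
    p50: okLatencies.length ? percentile(okLatencies, 50) : null,
    p95: okLatencies.length ? percentile(okLatencies, 95) : null,
  };
}

// Made-up sample data for illustration only.
const sampleRecords: CallRecord[] = [
  { model: "cerebras/llama-3.1-70b", latencyMs: 140, ok: true },
  { model: "cerebras/llama-3.1-70b", latencyMs: 100, ok: true },
  { model: "cerebras/llama-3.1-70b", latencyMs: 900, ok: false },
  { model: "cerebras/llama-3.1-70b", latencyMs: 160, ok: true },
  { model: "cerebras/llama-3.1-70b", latencyMs: 120, ok: true },
];

const laneSummary = summarizeLane(sampleRecords, "cerebras/llama-3.1-70b");
```

Failed calls are excluded from the percentiles and counted separately, so a lane that fails fast does not look quicker than it is.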