LlamaIndex Ingestion Without the Token Meter

LlamaIndex ingestion is a bulk LLM workload: three extractors over 10,000 chunks is 30,000 model calls. This guide covers the api_base setup and the worked math, metered versus flat.

LlamaIndex ingestion is a bulk LLM workload wearing a data-pipeline costume. Embedding 10,000 chunks costs about a dollar; running three metadata extractors over those same chunks is about 30,000 model calls, and a re-indexing schedule multiplies that forever. Pointing LlamaIndex’s OpenAI LLM at a flat lane with api_base moves exactly that bulk work onto your own ChatGPT subscription through Codex Hosted, while embeddings stay on your own key where they cost pennies.

Where LlamaIndex actually spends tokens

The reflex is to blame embeddings, and the reflex is wrong. Embedding models price at pennies per million tokens (live numbers at openai.com/api/pricing), so a full corpus embed is lunch money. The LLM steps are the bill:

  • Metadata extractors. TitleExtractor, SummaryExtractor, and QuestionsAnsweredExtractor each make roughly one call per chunk. Three extractors over 10,000 chunks is 30,000 calls before a single user asks a question.
  • Summary indexes. DocumentSummaryIndex and tree-summarize builds add per-document and per-level synthesis calls.
  • Re-indexing. A corpus rebuilt weekly bills its ingestion pipeline four to five times a month. Re-indexing is the silent multiplier in most RAG budgets.
  • Query-time synthesis. Refine and tree-summarize response modes make multiple LLM calls per query, each carrying retrieved context.
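
The call counts above can be sketched in a few lines. This is a rough estimator, not a LlamaIndex API: the IngestionPlan class and its field names are hypothetical, and it takes the "roughly one call per extractor per chunk" figure from the list above at face value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionPlan:
    """Hypothetical helper for estimating LLM call volume during ingestion."""

    chunks: int              # nodes produced by the splitter
    extractors: int          # LLM-backed metadata extractors in the pipeline
    rebuilds_per_month: int  # full re-index runs per month

    def calls_per_build(self) -> int:
        # Assumption: roughly one LLM call per extractor per chunk.
        return self.chunks * self.extractors

    def calls_per_month(self) -> int:
        # Every rebuild repeats the whole extractor pass.
        return self.calls_per_build() * self.rebuilds_per_month


plan = IngestionPlan(chunks=10_000, extractors=3, rebuilds_per_month=4)
print(plan.calls_per_build())  # 30000
print(plan.calls_per_month())  # 120000
```

Summary indexes and query-time synthesis add calls on top of this; the estimator only covers the extractor pass, which is the part that scales with every chunk and every rebuild.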

The setup

One constructor argument routes the LLM side; embeddings stay on your own OpenAI key:

import os

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(
    model="gpt-5",
    api_base="https://api.proxyllm.ai/v1",
    api_key=os.environ["PROXYLLM_API_KEY"],
)
# Embeddings are pennies per million tokens; keep them on your own key.
Settings.embed_model = OpenAIEmbedding(api_key=os.environ["OPENAI_API_KEY"])

docs = SimpleDirectoryReader("corpus").load_data()
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=800),
        TitleExtractor(llm=Settings.llm),
        SummaryExtractor(llm=Settings.llm),
        QuestionsAnsweredExtractor(llm=Settings.llm),
    ]
)
nodes = pipeline.run(documents=docs)
index = VectorStoreIndex(nodes)

Every extractor call and every query-time synthesis call now rides the flat lane. Account connection is OpenAI’s device-code flow, covered in the setup guide.

The 10,000-chunk math

Assume 800-token chunks, three extractors, and a per-call shape of roughly 2,500 input tokens (chunk, neighboring context, instructions) and 200 output tokens. At GPT-5 list prices (June 2026: $1.25 per million input, $10 per million output):

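| Line | Arithmetic | Cost |
| --- | --- | --- |
| Embeddings (10k × 800 tok) | 8M tokens at embedding rates | ~$1 |
| Extractor input | 30,000 calls × 2,500 tok = 75M tok × $1.25/M | $93.75 |
| Extractor output | 30,000 calls × 200 tok = 6M tok × $10/M | $60.00 |
| One full build | Embeddings + extractor input + extractor output | ~$155 |
| Weekly rebuilds (4/mo) | 4 × $153.75 LLM cost per build | ~$615 |
| Query synthesis (2,000/mo) | 8M input + 0.6M output | ~$16 |
| Metered month | Weekly rebuilds + query synthesis | ~$631 |
| Flat setup | ChatGPT Plus $20 + ProxyLLM $129 | $149 |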

A weekly-rebuilt 10,000-chunk index with three extractors runs about $631 a month at GPT-5 list prices and about $149 against a Plus plan. The Plus window absorbs an estimated $700 of API-equivalent work monthly, so this workload fits. Fallback lanes catch any heavy week: requests fall over to a second connected account, then to your own API key, and the request log names the lane for each call. A corpus twice this size belongs on Pro 5x ($229 all-in, with an estimated ≈$3,500 of API-equivalent capacity). All capacity figures are estimates, never guarantees.

Cheaper models change the level, not the structure: the same build on gpt-5-mini costs roughly $31 in LLM calls, and the meter still scales linearly with every chunk, extractor, and rebuild you add.
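
The arithmetic in this section is easy to reproduce. The sketch below uses the per-call shape and the GPT-5 list prices quoted above; the gpt-5-mini prices ($0.25 input, $2.00 output per million tokens) are an assumption chosen to match the "roughly $31" figure, and the function names are illustrative rather than part of any library. Check live prices before relying on the output.

```python
# Dollars per million tokens. The gpt-5 row is quoted in the text;
# the gpt-5-mini row is an assumption consistent with its ~$31 build cost.
PRICES = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "gpt-5-mini": {"input": 0.25, "output": 2.00},  # assumed
}


def llm_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Metered cost in dollars for a given token volume."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def build_cost(model, chunks=10_000, extractors=3, in_tok=2_500, out_tok=200):
    """LLM cost of one full ingestion build (embeddings excluded)."""
    calls = chunks * extractors
    return llm_cost(model, calls * in_tok, calls * out_tok)


def metered_month(model, rebuilds=4, query_in=8_000_000, query_out=600_000):
    """Monthly metered LLM cost: weekly rebuilds plus query-time synthesis."""
    return rebuilds * build_cost(model) + llm_cost(model, query_in, query_out)


FLAT_MONTH = 20 + 129  # ChatGPT Plus + ProxyLLM, as quoted in the text

print(f"gpt-5 build:       ${build_cost('gpt-5'):.2f}")       # $153.75
print(f"gpt-5-mini build:  ${build_cost('gpt-5-mini'):.2f}")  # $30.75
print(f"gpt-5 metered/mo:  ${metered_month('gpt-5'):.2f}")    # $631.00
print(f"flat setup/mo:     ${FLAT_MONTH}")                    # $149
```

Doubling chunks or extractors in build_cost doubles the metered figure, which is the linear scaling described above.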

Ingestion is the natural flat-lane workload

Pipelines do not need streaming; an extractor cannot act on half a summary. The Codex lane returns complete responses, which is exactly the shape IngestionPipeline consumes, while user-facing query surfaces that stream should stay on an API-key lane. Because ingestion runs unattended, give the pipeline its own key with a budget cap, so a runaway re-index loop has a ceiling. If your pipeline grows into agentic territory (retrieval plus tool loops), the cost model generalizes with the agent cost formula.
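
The budget cap described above lives on the key itself. As a complement, a client-side ceiling can stop a runaway loop before it sends anything. The sketch below is a different, simpler technique than a key-level cap: CallBudget, BudgetExceeded, and run_build are hypothetical names, not part of LlamaIndex or ProxyLLM, and run_build stands in for a real pipeline run by charging one call per extractor per chunk.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run would cross its call ceiling."""


class CallBudget:
    """Hypothetical client-side guard that counts LLM calls for one job."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self, n: int = 1) -> None:
        # Refuse before the calls are made, so the ceiling is never crossed.
        if self.calls + n > self.max_calls:
            raise BudgetExceeded(
                f"{self.calls + n} calls would exceed the cap of {self.max_calls}"
            )
        self.calls += n


# One full build is about 30,000 extractor calls; allow a 10% margin.
budget = CallBudget(max_calls=33_000)


def run_build(chunks: int, extractors: int) -> None:
    """Stand-in for a pipeline run: one call per extractor per chunk."""
    for _ in range(chunks):
        budget.charge(extractors)


run_build(10_000, 3)      # the scheduled build fits inside the cap
try:
    run_build(10_000, 3)  # an accidental second build trips the cap
except BudgetExceeded as exc:
    print("stopped:", exc)
print(budget.calls)       # 33000
```

In a real job the charge call would sit wherever the pipeline hands a batch of nodes to the extractors; where exactly that hook belongs depends on how the pipeline is wired.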

The honest caveats are the standard ones: the flat lane serves what Codex serves (OpenAI models), embeddings and fine-tunes stay on your own key, and programmatic Codex use is documented OpenAI functionality with OpenAI keeping the final call over its accounts.

The same trick elsewhere in your stack

If LlamaIndex handles your retrieval and LangChain handles your orchestration, the identical one-argument change covers both: LangChain on a flat-rate OpenAI lane. The condensed LlamaIndex steps live at the LlamaIndex integration.

If your ingestion pipeline already shows up on an OpenAI invoice, the calculator tells you in thirty seconds which plan tier the same work fits inside.

Frequently asked questions

How do I use a custom OpenAI endpoint in LlamaIndex?

Pass api_base when constructing the LLM: OpenAI(model='gpt-5-mini', api_base='https://api.proxyllm.ai/v1', api_key=...) from llama_index.llms.openai, then assign it to Settings.llm. Every extractor, summary index, and query engine that uses the default LLM routes through that endpoint.

How much does it cost to index 10,000 chunks with LlamaIndex?

Embeddings alone are cheap, roughly a dollar for 8M tokens at current embedding rates. The LLM steps are the bill: three metadata extractors over 10,000 chunks is about 30,000 model calls, roughly $154 per full build at GPT-5 list prices, and rebuilding weekly turns that into about $615 a month.

Do embeddings go through ProxyLLM's flat lane?

No. Embedding models are not part of what the Codex lane serves, so OpenAIEmbedding should stay on your own OpenAI API key at the default base URL. That split costs you almost nothing, since embeddings price at pennies per million tokens while the LLM calls carry the real ingestion bill.

Why is my LlamaIndex ingestion bill so high?

Because ingestion features call the LLM per chunk. Title, summary, and question extractors each add one call per chunk, document-summary indexes add tree-summarize passes, and every re-index repeats all of it. A corpus that rebuilds on each refresh bills its entire ingestion pipeline again each time.
