What you'll take away
The GenAI pilot delivered outstanding results. The demo impressed the board. The pilot was approved for production rollout. Then the engineering team built the production token cost model and discovered that serving the application to the company's full user base would cost more than their entire current cloud infrastructure bill.
LLM API pricing is seductively cheap at demo scale and shockingly expensive at production scale. A system prompt that's three thousand tokens long, invoked ten thousand times per day, consumes thirty million input tokens per day, or roughly 900 million per month, before a single user has typed anything. At frontier model pricing, that's a substantial monthly bill just for sending context the model will largely ignore.
Understanding the Token Cost Stack
LLM API cost breaks down into four components, each of which can be individually optimised: system prompt tokens (sent on every request), context/RAG tokens (retrieved documents injected into the prompt), conversation history tokens (prior turns carried forward for multi-turn conversations), and output tokens (the model's response). Input and output tokens are priced differently, with output typically 3–5x more expensive per token at most frontier model providers.
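As a rough illustration of the four-part breakdown, the sketch below prices a single request by component. The per-million-token prices, the `RequestTokens` type, and the `request_cost_usd` function are all hypothetical names and figures chosen for the example, not any provider's actual rates or API. The user's own message is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


@dataclass
class RequestTokens:
    """Token counts for one request, split by the four cost components."""
    system_prompt: int
    rag_context: int
    history: int
    output: int


def request_cost_usd(
    tokens: RequestTokens,
    input_price: float = INPUT_PRICE_PER_MTOK,
    output_price: float = OUTPUT_PRICE_PER_MTOK,
) -> dict[str, float]:
    """Return the cost of one request, broken down by component."""
    per_input_token = input_price / 1_000_000
    per_output_token = output_price / 1_000_000
    breakdown = {
        "system_prompt": tokens.system_prompt * per_input_token,
        "rag_context": tokens.rag_context * per_input_token,
        "history": tokens.history * per_input_token,
        "output": tokens.output * per_output_token,
    }
    # Sum the four components before adding the total key itself.
    breakdown["total"] = sum(breakdown.values())
    return breakdown


example = request_cost_usd(RequestTokens(3_000, 3_000, 1_000, 500))
```

Keeping the breakdown per component, rather than a single total, makes it easier to see which part of the stack is worth optimising first.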
The Hidden Cost of Long System Prompts
System prompts are sent with every API call. A well-crafted system prompt for an enterprise application might run to two to four thousand tokens of instructions, persona definition, formatting requirements, and safety guidelines. At ten thousand daily active users making three requests each, that system prompt alone is 60–120 million tokens per day — a major cost line that scales linearly with usage and produces zero output.
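The arithmetic behind that range can be checked with a few lines. This is a minimal sketch; the function names are made up for the example, and the price of 3.00 USD per million input tokens is an assumed placeholder rather than a quoted rate.

```python
def system_prompt_tokens_per_day(
    prompt_tokens: int, daily_users: int, requests_per_user: int
) -> int:
    """Tokens spent per day on resending the system prompt alone."""
    return prompt_tokens * daily_users * requests_per_user


def monthly_cost_usd(
    tokens_per_day: int, price_per_mtok: float, days: int = 30
) -> float:
    """Monthly cost of a daily token volume at a given price per million tokens."""
    return tokens_per_day * days / 1_000_000 * price_per_mtok


# Figures from the text: 10,000 daily users, 3 requests each, 2,000 to 4,000 tokens.
low = system_prompt_tokens_per_day(2_000, 10_000, 3)   # 60,000,000
high = system_prompt_tokens_per_day(4_000, 10_000, 3)  # 120,000,000

# Assumed price of 3.00 USD per million input tokens (hypothetical).
low_monthly = monthly_cost_usd(low, 3.00)
high_monthly = monthly_cost_usd(high, 3.00)
```

Because the result is a plain product of three factors, halving the prompt length halves this cost line at any usage level.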
RAG Context Cost
RAG systems retrieve document chunks and inject them into the prompt as context. Each retrieved chunk is tokens. A system that retrieves ten chunks of three hundred tokens each adds three thousand tokens to every request. Combined with a long system prompt and conversation history, context tokens can represent 70–80% of total token consumption on a RAG-based application.
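To see where a given application sits, it helps to compute each component's share of a request. In the sketch below only the ten 300-token chunks come from the text; every other token count is a hypothetical value, and `token_shares` is an illustrative helper, not part of any library.

```python
def token_shares(components: dict[str, int]) -> dict[str, float]:
    """Return each component's fraction of the total tokens in one request."""
    total = sum(components.values())
    if total == 0:
        raise ValueError("request has no tokens")
    return {name: count / total for name, count in components.items()}


# Hypothetical request profile; only the chunk count and size come from the text.
request = {
    "system_prompt": 3_000,
    "rag_chunks": 10 * 300,
    "history": 1_000,
    "user_message": 100,
    "output": 900,
}

shares = token_shares(request)

# Everything the application adds around the user's message and the answer.
context_share = (
    shares["system_prompt"] + shares["rag_chunks"] + shares["history"]
)
```

Running the same calculation on logged production requests, rather than on assumed numbers, shows whether retrieval, the system prompt, or history dominates in practice.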
Conversation History Accumulation
Multi-turn conversational applications carry forward the history of prior turns in each request, so the model has context for the current message. In a long conversation, this history grows with every turn until something trims it or the context window fills. A fifty-turn conversation might carry thousands of tokens of prior exchange in every request, most of which are irrelevant to the current query.
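A small model of this growth, assuming every turn adds the same number of tokens and nothing is ever trimmed, shows why long conversations get expensive. Both function names and the figure of 100 tokens per turn are assumptions made for the example.

```python
def history_tokens_in_request(turn: int, tokens_per_turn: int) -> int:
    """History carried by request number `turn` (1-indexed) when nothing is trimmed."""
    return (turn - 1) * tokens_per_turn


def cumulative_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total history tokens billed across the whole conversation."""
    return sum(
        history_tokens_in_request(k, tokens_per_turn)
        for k in range(1, turns + 1)
    )


# Assumed: 50 turns, 100 tokens per turn (user message plus reply).
last_request = history_tokens_in_request(50, 100)   # 4,900
whole_conversation = cumulative_history_tokens(50, 100)  # 122,500
```

The per-request history grows linearly with the turn number, so the total billed over the conversation grows roughly with the square of its length.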
The rule of thumb: a production GenAI application with no token optimisation typically costs 5–20x more per user-query than a well-optimised one. The optimisation work is mostly prompt engineering and architecture changes — not model changes — and it should happen before production rollout, not after the first invoice.
Token Optimisation Strategies That Matter
- System prompt compression: Audit every sentence in your system prompt for whether it meaningfully changes model behaviour when removed. Long prompts with redundant instructions are common in systems that were iteratively prompted without periodic pruning
- Selective RAG retrieval: Retrieve fewer, higher-quality chunks rather than many lower-quality ones. A well-tuned retrieval that returns three highly relevant chunks often outperforms one that returns ten mixed-relevance chunks, at less than a third of the context cost
- Conversation summarisation: Summarise earlier conversation turns rather than carrying the full transcript. A hundred-token summary of the first twenty turns costs less than two thousand tokens of full history
- Model tiering: Not every request needs a frontier model. Classify request complexity and route simpler queries to smaller, cheaper models that perform adequately for the task
- Response caching: For common, repetitive queries, cache the model response and serve it without an API call on repeated requests
- Prompt caching: Many providers offer prompt caching for repeated system prompts. Cached prefix tokens are typically billed at a fraction of the normal input rate, although pricing and cache lifetime vary by provider, so long system prompts become substantially cheaper at scale
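Three of the strategies above (model tiering, conversation summarisation, and response caching) can be sketched in a few dozen lines. Everything here is illustrative: the model names are placeholders, the keyword heuristic in `pick_model` stands in for a real complexity classifier, and `call_model` and `summarise` are caller-supplied functions rather than any provider's API.

```python
import hashlib
from typing import Callable

# Hypothetical model identifiers; substitute the tiers your provider offers.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"


def pick_model(query: str, max_simple_words: int = 30) -> str:
    """Naive tiering: short queries without reasoning cues go to the small model."""
    cues = ("why", "explain", "compare", "analyse", "analyze", "step by step")
    text = query.lower()
    is_short = len(text.split()) <= max_simple_words
    if is_short and not any(cue in text for cue in cues):
        return SMALL_MODEL
    return FRONTIER_MODEL


def compact_history(
    turns: list[str], keep_last: int, summarise: Callable[[list[str]], str]
) -> list[str]:
    """Replace all but the last `keep_last` turns with one summary entry."""
    if len(turns) <= keep_last:
        return list(turns)
    split = len(turns) - keep_last
    return [summarise(turns[:split])] + turns[split:]


class ResponseCache:
    """Serve repeated queries from memory instead of calling the model again."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, query: str) -> str:
        # Normalise case and whitespace so trivial variations share an entry.
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()

    def ask(self, query: str) -> str:
        model = pick_model(query)
        key = self._key(model, query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._call_model(model, query)
        self._store[key] = answer
        return answer
```

A cache like this is only safe for queries whose answers do not depend on the individual user or on fast-changing data, and a production version would also need expiry and a size limit.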
Building Cost Visibility Before You Need It
Production GenAI applications need token cost monitoring from day one. Track cost per user session, cost per query type, and cost trends over time. Set budget alerts before you hit cost thresholds. Build the ability to switch between model tiers in the architecture before you're under cost pressure, so the optimisation can be made deliberately rather than in crisis.
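A minimal in-process tracker, assuming token counts are available from each API response, might look like the sketch below. The `CostTracker` class, its field names, the prices, and the 80% alert threshold are all hypothetical choices for illustration; a real deployment would more likely feed these numbers into an existing metrics system.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Accumulate spend per session and per query type, with a budget alert."""
    input_price_per_mtok: float
    output_price_per_mtok: float
    daily_budget_usd: float
    alert_fraction: float = 0.8  # alert once 80% of the budget is spent
    by_session: dict[str, float] = field(
        default_factory=lambda: defaultdict(float)
    )
    by_query_type: dict[str, float] = field(
        default_factory=lambda: defaultdict(float)
    )
    total_usd: float = 0.0

    def record(
        self, session_id: str, query_type: str,
        input_tokens: int, output_tokens: int,
    ) -> float:
        """Record one request and return its cost in USD."""
        cost = (
            input_tokens * self.input_price_per_mtok
            + output_tokens * self.output_price_per_mtok
        ) / 1_000_000
        self.by_session[session_id] += cost
        self.by_query_type[query_type] += cost
        self.total_usd += cost
        return cost

    def budget_alert(self) -> bool:
        """True once spend reaches the alert fraction of the daily budget."""
        return self.total_usd >= self.alert_fraction * self.daily_budget_usd


# Assumed prices and budget, for illustration only.
tracker = CostTracker(3.00, 15.00, daily_budget_usd=1.00)
tracker.record("session-1", "search", input_tokens=100_000, output_tokens=10_000)
```

Recording the query type alongside the session is what later makes it possible to decide which request classes can move to a cheaper model tier.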
GYSP's AI & ML Development practice includes token economics review in every GenAI architecture engagement. The projects that don't blow their budgets at scale are the ones that modelled production cost before writing the first production line of code.
“A three-thousand-token system prompt that nobody reviewed for redundancy is a tax on every single user request for the entire life of the application. Prompt economics matter as much as prompt quality.”
— Rahul, AI/ML Delivery Head — GYSP.tech
