The Token Tax: Preventing Your GenAI Pilot from Bankrupting the Budget

Rahul

AI/ML Delivery Head, GYSP.tech

15 June 2025 · 9 min read

What you'll take away

Understanding the Token Cost Stack
Token Optimisation Strategies That Matter
Validated Outcomes
Building Cost Visibility Before You Need It

The GenAI pilot delivered outstanding results. The demo impressed the board. The pilot was approved for production rollout. Then the engineering team built the production token cost model and discovered that serving the application to the company's full user base would cost more than their entire current cloud infrastructure bill.

LLM API pricing is seductively cheap at demo scale and shockingly expensive at production scale. A three-thousand-token system prompt invoked ten thousand times per day consumes 30 million input tokens a day, or roughly 900 million a month, before a single user has typed anything. At frontier model pricing, that's a substantial monthly bill just for sending context the model will largely ignore.

Understanding the Token Cost Stack

LLM API cost breaks down into four components, each of which can be individually optimised: system prompt tokens (sent on every request), context/RAG tokens (retrieved documents injected into the prompt), conversation history tokens (prior turns carried forward for multi-turn conversations), and output tokens (the model's response). Input and output tokens are priced differently, with output typically 3–5x more expensive per token at most frontier model providers.
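
To make the four components concrete, here is a minimal sketch of a per-request cost breakdown. The prices and token counts are illustrative placeholders, not any provider's actual rates, and the `RequestTokens` class is a hypothetical helper invented for this example.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens (assumptions, not real rates).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00  # 5x input, the top of the typical range


@dataclass
class RequestTokens:
    """Token counts for one request, split by cost component."""
    system_prompt: int
    rag_context: int
    history: int
    output: int

    @property
    def input_total(self) -> int:
        # Everything except the response is billed at the input rate.
        return self.system_prompt + self.rag_context + self.history

    def cost_usd(self) -> float:
        input_cost = self.input_total * INPUT_PRICE_PER_MTOK / 1_000_000
        output_cost = self.output * OUTPUT_PRICE_PER_MTOK / 1_000_000
        return input_cost + output_cost


# Assumed example request: the numbers are for illustration only.
request = RequestTokens(system_prompt=3000, rag_context=3000, history=1500, output=500)
print(f"input tokens per request: {request.input_total}")
print(f"cost per request: ${request.cost_usd():.4f}")
```

Splitting the counts this way shows which component dominates before any optimisation work starts.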

The Hidden Cost of Long System Prompts

System prompts are sent with every API call. A well-crafted system prompt for an enterprise application might run to two to four thousand tokens of instructions, persona definition, formatting requirements, and safety guidelines. At ten thousand daily active users making three requests each, that system prompt alone is 60–120 million tokens per day — a major cost line that scales linearly with usage and produces zero output.
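
The arithmetic behind that range is easy to reproduce. The sketch below uses the figures from the paragraph above; the price per million tokens is a placeholder assumption, and the function name is hypothetical.

```python
def daily_system_prompt_tokens(prompt_tokens: int, daily_users: int, requests_per_user: int) -> int:
    """Tokens spent per day on resending the system prompt alone."""
    return prompt_tokens * daily_users * requests_per_user


low = daily_system_prompt_tokens(2_000, 10_000, 3)
high = daily_system_prompt_tokens(4_000, 10_000, 3)

# Placeholder input price in USD per million tokens (an assumption).
ASSUMED_INPUT_PRICE_PER_MTOK = 3.00
monthly_low_usd = low * 30 * ASSUMED_INPUT_PRICE_PER_MTOK / 1_000_000
monthly_high_usd = high * 30 * ASSUMED_INPUT_PRICE_PER_MTOK / 1_000_000

print(f"system prompt tokens per day: {low:,} to {high:,}")
print(f"monthly cost at the assumed price: ${monthly_low_usd:,.0f} to ${monthly_high_usd:,.0f}")
```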

RAG Context Cost

RAG systems retrieve document chunks and inject them into the prompt as context, and every retrieved chunk adds tokens. A system that retrieves ten chunks of three hundred tokens each adds three thousand tokens to every request. Combined with a long system prompt and conversation history, these context tokens can represent 70–80% of total token consumption in a RAG-based application.
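
Retrieval depth is a direct multiplier on input tokens, which a short calculation makes visible. This sketch compares the ten-chunk example above with a hypothetical three-chunk configuration; the daily request volume is an assumption.

```python
def rag_context_tokens(chunks: int, tokens_per_chunk: int) -> int:
    """Tokens injected into each request by retrieved chunks."""
    return chunks * tokens_per_chunk


wide = rag_context_tokens(chunks=10, tokens_per_chunk=300)
narrow = rag_context_tokens(chunks=3, tokens_per_chunk=300)  # hypothetical tighter retrieval

ASSUMED_REQUESTS_PER_DAY = 30_000  # assumption for illustration
daily_saving = (wide - narrow) * ASSUMED_REQUESTS_PER_DAY

print(f"context tokens per request: {wide} vs {narrow}")
print(f"input tokens saved per day: {daily_saving:,}")
```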

Conversation History Accumulation

Multi-turn conversational applications carry forward the history of prior turns in each request, so the model has context for the current message. In a long conversation, this history keeps growing until it hits the context window: a fifty-turn conversation might carry thousands of tokens of prior exchange in every request, most of which are irrelevant to the current query. Because every request resends all earlier turns, the total tokens billed over a conversation grow roughly with the square of its length.
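
A small simulation shows how quickly resent history adds up, and what a bounded window plus a running summary would change. All figures here (tokens per turn, window size, summary length) are assumptions chosen for illustration.

```python
def cumulative_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total history tokens sent when every request carries all prior turns."""
    # Request n (1-indexed) carries n - 1 prior turns.
    return sum((n - 1) * tokens_per_turn for n in range(1, turns + 1))


def cumulative_windowed(turns: int, tokens_per_turn: int, window: int, summary_tokens: int) -> int:
    """Total history tokens when only the last `window` turns plus a summary are sent."""
    total = 0
    for n in range(1, turns + 1):
        prior = n - 1
        if prior <= window:
            total += prior * tokens_per_turn
        else:
            # Older turns are replaced by a fixed-size summary.
            total += window * tokens_per_turn + summary_tokens
    return total


full = cumulative_full_history(turns=50, tokens_per_turn=100)
bounded = cumulative_windowed(turns=50, tokens_per_turn=100, window=5, summary_tokens=100)
print(f"full history: {full:,} tokens; windowed with summary: {bounded:,} tokens")
```

The sketch ignores the tokens spent producing the summary itself, which a real cost model would need to include.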

The rule of thumb: a production GenAI application with no token optimisation typically costs 5–20x more per user-query than a well-optimised one. The optimisation work is mostly prompt engineering and architecture changes — not model changes — and it should happen before production rollout, not after the first invoice.

Token Optimisation Strategies That Matter

System prompt compression: Audit every sentence in your system prompt for whether it meaningfully changes model behaviour when removed. Long prompts with redundant instructions are common in systems that were iteratively prompted without periodic pruning.
Selective RAG retrieval: Retrieve fewer, higher-quality chunks rather than many lower-quality ones. A well-tuned retrieval that returns three highly relevant chunks outperforms one that returns ten mixed-relevance chunks, at under a third of the context cost.
Conversation summarisation: Summarise earlier conversation turns rather than carrying the full transcript. A hundred-token summary of the first twenty turns costs far less than two thousand tokens of full history, and the saving repeats on every subsequent request.
Model tiering: Not every request needs a frontier model. Classify request complexity and route simpler queries to smaller, cheaper models that perform adequately for the task.
Response caching: For common, repetitive queries, cache the model response and serve it without an API call on repeated requests.
Prompt caching: Many providers offer prompt caching for repeated prompt prefixes such as system prompts. Cached prefix tokens are typically billed at a steep discount to the normal input rate rather than for free, which makes long system prompts substantially cheaper at scale.
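
Two of these strategies, model tiering and response caching, can be combined in one thin routing layer. The following is a rough sketch only: the model names, the keyword heuristic, and the `call_model` callable are all hypothetical stand-ins, and a production router would more likely use a trained classifier and a cache with expiry.

```python
import hashlib

# Hypothetical tier names; substitute real model identifiers.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"


def choose_model(query: str) -> str:
    """Crude complexity heuristic: long queries or reasoning cues go to the frontier tier."""
    reasoning_cues = ("why", "compare", "explain", "analyse", "analyze")
    lowered = query.lower()
    if len(lowered.split()) > 40 or any(cue in lowered for cue in reasoning_cues):
        return FRONTIER_MODEL
    return SMALL_MODEL


class CachedRouter:
    """Routes queries to a model tier and caches responses for repeated queries."""

    def __init__(self, call_model):
        self._call_model = call_model  # callable: (model, query) -> response text
        self._cache = {}
        self.api_calls = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalise case and whitespace so trivial variants share a cache entry.
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self._cache:
            return self._cache[key]  # served without an API call
        self.api_calls += 1
        response = self._call_model(choose_model(query), query)
        self._cache[key] = response
        return response


def fake_call_model(model: str, query: str) -> str:
    """Stand-in for a real API client, used only to exercise the router."""
    return f"[{model}] answer to: {query}"


router = CachedRouter(fake_call_model)
first = router.answer("What are your opening hours?")
second = router.answer("what are your   opening hours?")  # cache hit
third = router.answer("Explain why my invoice differs from the quote")
print(router.api_calls, first, third, sep="\n")
```

Exact-match caching only helps with genuinely repeated queries; anything user-specific should be kept out of the cache key's scope or not cached at all.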

Validated Outcomes

Intercom, the customer communications platform, published a detailed engineering post on the cost journey of their AI features after scaling to millions of interactions. The primary finding: without token cost modelling at the architecture stage, AI feature costs scaled super-linearly with usage because prompt designs that were efficient at low volume became expensive at scale when context windows grew with conversation length. After implementing model tiering — routing simple intent classification to smaller models and reserving frontier models for complex reasoning tasks — Intercom reduced their per-conversation AI cost by approximately 40% without measurable degradation in user-rated response quality.

GYSP builds token economics models before writing production code for every GenAI engagement. The model projects cost at 10x, 100x, and 1000x current usage, identifies which prompt and context design decisions have the most impact on cost at scale, and determines the model tiering strategy appropriate for the use case. Clients who complete this analysis before build consistently avoid the budget crisis at scale that characterises GenAI projects where cost was treated as a launch-time optimisation rather than an architectural constraint.
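
A projection of this kind can start as a few lines of arithmetic. The sketch below is not GYSP's model; it is a simplified illustration that assumes a fixed token profile per request, placeholder prices, and a hypothetical baseline of 1,000 requests per day.

```python
def monthly_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    days: int = 30,
) -> float:
    """Monthly spend for a fixed per-request token profile."""
    per_request = (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return requests_per_day * days * per_request


BASELINE_REQUESTS_PER_DAY = 1_000  # assumed pilot volume

projections = {}
for multiplier in (1, 10, 100, 1000):
    projections[multiplier] = monthly_cost_usd(
        requests_per_day=BASELINE_REQUESTS_PER_DAY * multiplier,
        input_tokens=7_500,          # assumed: system prompt + RAG context + history
        output_tokens=500,           # assumed response length
        input_price_per_mtok=3.00,   # placeholder price
        output_price_per_mtok=15.00, # placeholder price
    )

for multiplier, cost in projections.items():
    print(f"{multiplier:>5}x usage: ${cost:,.0f} per month")
```

Rerunning the projection with a trimmed prompt or a cheaper tier for part of the traffic shows which design decision moves the total most.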

Building Cost Visibility Before You Need It

Production GenAI applications need token cost monitoring from day one. Track cost per user session, cost per query type, and cost trends over time. Set budget alerts before you hit cost thresholds. Build the ability to switch between model tiers in the architecture before you're under cost pressure, so the optimisation can be made deliberately rather than in crisis.
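
As a starting point, cost tracking can be as simple as an in-process accumulator that is fed the cost of each call. This is a minimal sketch with hypothetical names; a real deployment would persist these figures and push them to whatever metrics and alerting system is already in use.

```python
from collections import defaultdict


class CostTracker:
    """Accumulates spend per session and per query type, with a budget alert."""

    def __init__(self, monthly_budget_usd: float, alert_fraction: float = 0.8):
        self.monthly_budget_usd = monthly_budget_usd
        self.alert_fraction = alert_fraction  # alert before the budget is exhausted
        self.by_session = defaultdict(float)
        self.by_query_type = defaultdict(float)
        self.total_usd = 0.0

    def record(self, session_id: str, query_type: str, cost_usd: float) -> None:
        self.by_session[session_id] += cost_usd
        self.by_query_type[query_type] += cost_usd
        self.total_usd += cost_usd

    def should_alert(self) -> bool:
        return self.total_usd >= self.alert_fraction * self.monthly_budget_usd


tracker = CostTracker(monthly_budget_usd=100.0)
tracker.record("session-a", "summarise", 50.0)
tracker.record("session-b", "classify", 20.0)
alert_before = tracker.should_alert()  # 70 of 100 spent, below the 80% threshold
tracker.record("session-a", "summarise", 15.0)
alert_after = tracker.should_alert()   # 85 of 100 spent, threshold crossed
print(alert_before, alert_after, dict(tracker.by_query_type))
```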

GYSP's AI & ML Development practice includes token economics review in every GenAI architecture engagement. The projects that don't blow their budgets at scale are the ones that modelled production cost before writing the first production line of code.

“A three-thousand-token system prompt that nobody reviewed for redundancy is a tax on every single user request for the entire life of the application. Prompt economics matter as much as prompt quality.”
— Rahul, AI/ML Delivery Head — GYSP.tech