What you'll actually pay for inference.
Token volume, model choice, latency requirements, and wrapper margin (Harvey, Glean, Hebbia) all change the bill. Fourteen probing questions. Output: side-by-side $/mo across OpenAI, Claude, Gemini, and self-hosted, plus the wrapper margin if you're going through one. The full memo includes a 12-month cost projection, model-mix recommendation, and the questions to negotiate with on your next renewal.
How this is calculated
Per-token published pricing across the four major providers is the input table. We adjust for: reasoning depth (basic chat vs Claude Opus / GPT-5 / Gemini Pro tier), context window (long-context surcharges where they apply), multi-modal (vision / audio token weights), streaming overhead, tool-calling round-trips, and wrapper margin if you're going through Harvey / Glean / Hebbia / Cresta / Sierra.
Self-hosted estimates assume Llama-class open weights on AWS Inferentia 2 / Trainium with batched inference — typical effective $/MTok ranges from $0.30 (small models, high throughput) to $4 (70B+, low throughput).
What this does NOT estimate
- Fine-tuning costs (training compute amortised separately)
- Embedding storage + retrieval (vector-DB infra)
- Custom moderation / safety-classifier overhead
- Engineering time to build + maintain prompt pipelines
For a procurement-ready model-mix recommendation, see the email memo or book a call.
FAQ
How accurate is this estimate?
±25% on the headline figure. Real bills vary based on prompt patterns, batch sizes, and provider rate-limit step functions. Treat as a planning anchor.
Is the wrapper margin really 30-60%?
For Harvey AI we have public reference data showing 50%+ margin on AmLaw 200 deployments. Glean and Hebbia run similar. The memo includes the source data + how to verify against your own contract.
When does self-host pay back?
Above ~$80K/yr OpenAI/Claude bills, Inferentia 2 + Llama 3.1 70B starts beating direct API on $/MTok. Below that, direct API wins on operational simplicity.
What's in the memo?
12-month cost projection (3 scenarios), model-mix recommendation, prompt-pattern optimisations that cut spend 20-40%, vendor-negotiation questions for renewal.