Been running a customer support chatbot in prod for 6 months now and wanted to share some actual cost comparisons between OpenAI and Anthropic.
Setup:
OpenAI (GPT-4o):
Anthropic (Claude 3.5 Sonnet):
Key findings:
Currently leaning toward a hybrid approach - GPT-4o for simple queries, Claude for complex ones. Anyone else doing cost optimization like this? What's your experience been?
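Roughly what the hybrid routing could look like as a first pass (the model names, keyword list, and length threshold below are placeholders for illustration, not our production logic):

```python
# Hypothetical hybrid router: cheap model for simple queries,
# pricier model for complex ones. The threshold and keyword list
# are illustrative guesses, not tuned values.

SIMPLE_MODEL = "gpt-4o"
COMPLEX_MODEL = "claude-3-5-sonnet-20241022"
COMPLEX_HINTS = {"refund", "escalate", "dispute", "legal", "chargeback"}

def pick_model(query: str) -> str:
    words = query.lower().split()
    # Long queries or queries containing escalation keywords go to Claude.
    if len(words) > 40 or COMPLEX_HINTS.intersection(words):
        return COMPLEX_MODEL
    return SIMPLE_MODEL

print(pick_model("What are your opening hours?"))         # short FAQ
print(pick_model("I want to dispute a duplicate charge"))  # keyword hit
```

In practice you'd replace the keyword check with a trained classifier, but even something this crude gives you a baseline to measure routing accuracy against.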
I'm curious about how you're determining the complexity of a query in real-time to decide which model to use. Do you have an automated system in place for that, or is it more manual? Would love to hear more about your process there.
Those numbers look about right. We're at ~$1,200/month on pure Claude 3.5 for about 40k conversations. The longer responses are definitely a thing - our users actually prefer it though, feels more helpful than GPT-4o's sometimes terse answers. One tip: if you implement the hybrid routing, cache the complexity classification results. We were re-analyzing the same query types over and over.
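The caching trick is basically: normalize the query into a rough "shape" and memoize the classifier on that. A minimal sketch (the normalization and the placeholder classifier are made up for illustration):

```python
from functools import lru_cache

def normalize(query: str) -> str:
    # Crude normalization: lowercase and strip digits so "order 123"
    # and "order 456" share one cache entry. Real systems might hash
    # an embedding or a detected template instead.
    return "".join(c for c in query.lower() if not c.isdigit()).strip()

@lru_cache(maxsize=10_000)
def classify_complexity(normalized_query: str) -> str:
    # Placeholder for the expensive classifier call.
    return "complex" if len(normalized_query.split()) > 40 else "simple"

label = classify_complexity(normalize("Where is order 8841?"))
```

With `lru_cache` you also get `classify_complexity.cache_info()` for free, which makes it easy to verify your hit rate actually justifies the cache.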
We're doing something similar but with gpt-3.5-turbo as the first-pass filter. Route ~70% of basic queries there ($0.50/$1.50 per 1M tokens) and only escalate to GPT-4o or Claude for stuff that needs deeper reasoning. Cut our monthly bill from $1,200 to around $400. The key is having a good classifier model upfront - we trained a simple BERT model on our conversation history to predict complexity.
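If you want to prototype the classifier idea without the BERT training loop, a TF-IDF + logistic regression pipeline in scikit-learn gets you surprisingly far (swapped in here purely to keep the sketch self-contained; the training rows are invented examples, not real conversation data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for a fine-tuned BERT complexity classifier.
# Training examples are made up for illustration.
train_queries = [
    "what are your business hours",
    "how do i reset my password",
    "where is my order",
    "do you ship internationally",
    "why was my account flagged after the chargeback dispute",
    "explain the discrepancy between my invoice and my bank statement",
    "my refund was approved but the partial amount does not match",
    "walk me through migrating my subscription across two accounts",
]
labels = ["simple"] * 4 + ["complex"] * 4

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_queries, labels)

prediction = clf.predict(["how do i change my password"])[0]
```

Once the lightweight version plateaus on accuracy, that's your signal the BERT fine-tune is worth the extra effort.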
Interesting data! How are you measuring the "slightly better reasoning" for Claude? We've been on GPT-4o for 8 months and considering switching but it's hard to quantify reasoning quality objectively. Also curious about your hallucination measurement - are you manually reviewing responses or using some automated detection?
Interesting findings! How did you measure the hallucination rates accurately? We are also using OpenAI, and in our internal testing, we've seen fluctuation based on the domain of the content. It would be great to know more about your testing framework to ensure consistency in results.
Interesting that you're seeing similar hallucination rates. In our fintech app, Claude definitely hallucinates less with numerical data and calculations. Maybe it's domain-specific? Also curious about your latency measurements - are you including the routing decision time in that hybrid approach? That could add overhead.
We're doing something similar but with GPT-3.5-turbo for the simple stuff instead of GPT-4o. Saves us about 40% on costs for those basic FAQ-type questions. Have you tried routing based on conversation complexity scoring? We built a simple classifier that predicts if a query will need "deep reasoning" and routes accordingly.
We're doing something similar but with GPT-4o-mini for the initial triage. Route ~70% of queries to mini ($0.15/$0.60 per 1M tokens), then escalate complex ones to Claude 3.5. Saved us about 40% on costs compared to using Claude for everything. The tricky part is getting the routing logic right - we use a simple classifier model trained on our conversation history.
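A quick back-of-envelope on that tiered setup (mini's $0.15/$0.60 prices from the comment; Claude 3.5 Sonnet's $3/$15 is its published list price; the volume and per-conversation token counts are invented assumptions):

```python
# Rough cost model for 70/30 tiered routing vs. all-Claude.
def monthly_cost(convs, in_toks, out_toks, in_price, out_price):
    # Dollar cost for `convs` conversations, each using `in_toks`
    # input and `out_toks` output tokens, at per-1M-token prices.
    return convs * (in_toks * in_price + out_toks * out_price) / 1_000_000

CONVS = 40_000               # hypothetical monthly volume
IN_TOKS, OUT_TOKS = 800, 400  # hypothetical tokens per conversation

simple_convs = CONVS * 7 // 10   # integer math to avoid float rounding
complex_convs = CONVS - simple_convs

all_claude = monthly_cost(CONVS, IN_TOKS, OUT_TOKS, 3.00, 15.00)
tiered = (monthly_cost(simple_convs, IN_TOKS, OUT_TOKS, 0.15, 0.60)
          + monthly_cost(complex_convs, IN_TOKS, OUT_TOKS, 3.00, 15.00))

print(f"all-Claude: ${all_claude:,.2f}  tiered: ${tiered:,.2f}  "
      f"savings: {1 - tiered / all_claude:.0%}")
```

With these invented numbers the savings come out even higher than the ~40% quoted, which mostly shows how sensitive the result is to the routing split and token mix - worth plugging in your own traffic stats.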
Interesting data! Have you tried Claude 3 Haiku for the simpler queries? At $0.25/$1.25 per 1M tokens it might be worth testing. Also curious about your hallucination measurement - are you manually reviewing responses or using some automated evaluation? We've been struggling to get consistent metrics on response quality across models.
We've been doing exactly this hybrid approach for about 3 months now! Route simple FAQ-type stuff to GPT-4o and anything that needs deeper reasoning to Claude. Built a simple classifier that looks at message length and keyword complexity to decide routing. Saved us about 25% on costs while actually improving customer satisfaction scores. The tricky part is making sure the routing logic doesn't add too much latency - we cache the classification for similar query patterns.
I can relate to your findings. We've also observed that Claude tends to provide longer and slightly more in-depth responses, which can be both a blessing and a curse depending on the context of the conversation. But the cost does add up, so we've been experimenting with a mix much like your setup. It's fascinating to see how different models excel in different areas.
Have you considered fine-tuning, or trimming your context windows, as a way to reduce token usage? With our setup, fine-tuning helped us be much more efficient with OpenAI models, and we noticed a drop in redundant response generation. It requires some initial investment in training data, but it might make a difference in your costs for handling simple queries.
Those latency numbers are interesting - are you measuring just the API response time or end-to-end including your processing? We're seeing closer to 2.1s avg with GPT-4o in production, wondering if it's regional differences or if you're doing any specific optimizations. Also curious about your hallucination measurement methodology - are you manually reviewing a sample or have some automated detection?
How do you handle the routing logic between GPT-4o and Claude? Are you using any specific criteria or algorithms to decide which model gets which query? I'm curious because we're considering a similar approach but are stuck on implementing the decision-making process efficiently.
Interesting to see the numbers laid out like that. I've had a similar experience with Claude generating longer, more thorough responses, which seemed helpful for more complex inquiries where the needed context wasn't as clear-cut. However, the cost difference was less significant in my case since our volume is lower, about 20k conversations/month. So we've stuck to using Claude exclusively for now.
We've been using a similar hybrid strategy for over 4 months now. Initially, we were all-in on OpenAI, but switching complex queries to Claude helped us cut costs while improving the handling of nuanced questions. One thing we've noticed is that Claude's longer responses sometimes make our customers feel like they’re getting a more personalized experience, even though we're paying a bit more for it.
We've implemented a similar setup but with a twist. Our system dynamically switches between OpenAI and Anthropic based on current server load and query complexity. We saw around a 25% reduction in costs by ensuring we don't unnecessarily deploy Claude's more expensive processing for simple queries. Automation in our routing helped optimize costs further.
I've been using a similar hybrid setup for about 4 months now, and it's been working great for us too. We found that using Claude for complex queries really paid off in terms of customer satisfaction, even though it's pricier. We haven't encountered any significant trade-offs except the extra costs, similar to what you're noticing.
Have you tried adjusting parameters for Claude to see if the responses can be made tighter? We've had some success dropping token usage by 10% without compromising quality too much. Curious if anyone's tweaked GPT-4o for less verbose responses too?
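The two levers that did the most for us were a hard `max_tokens` cap plus a brevity instruction in the system prompt. A sketch of the request body shape for Anthropic's Messages API (the cap value and prompt wording are just starting points to tune, not recommendations):

```python
import json

# Illustrative Messages API request body that nudges Claude toward
# shorter answers: a hard token ceiling plus a brevity system prompt.
request_body = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 300,  # hard ceiling on response length
    "system": "Answer in at most three sentences unless the user asks for detail.",
    "messages": [
        {"role": "user", "content": "How do I update my billing address?"}
    ],
}

print(json.dumps(request_body, indent=2))
```

Worth noting that `max_tokens` truncates rather than summarizes, so the system-prompt instruction is what actually makes answers tighter; the cap is just a cost backstop.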
Interesting breakdown! Have you tried running a few days purely on GPT-4o just to see how it fares with complex queries in terms of both cost and user satisfaction? It might help pinpoint if the extra cost for Claude is justified when looking at complex ticket resolution times and feedback.
I'm in a similar situation! We found that segmenting conversations by complexity was a game-changer for our costs. We use GPT-4o for FAQs and transition to Claude when it gets tricky. It's more administrative overhead but cuts our monthly expenses by about 15% compared to sticking with one provider.