
Generality in Generative AI is an Illusion

Last updated on 07/08/2025

This post originally appeared on SEP’s Blog.

Large language models are often described as general-purpose problem solvers. That’s fair; given a prompt, they can write SQL queries, summarize academic papers, troubleshoot YAML files, produce a mediocre chili recipe, and draft HR emails, all in the same breath. That breadth of capability is a big driver of the current enthusiasm.

Here’s the thing, though: that generality isn’t actually useful in most software applications.

When was the last time your application needed to summarize a research paper, write a poem, and make a weird mashup of the art stylings of Jim Davis and H.R. Giger?

Most of the software we build isn’t general. It’s targeted. It supports a specific workflow, a set of user goals, or a bounded domain of behavior. The LLMs we embed into those products are used in furtherance of that narrow focus. They’re not being asked to be general-purpose reasoners; they’re being asked to support very particular features.

That distinction matters. If you’re choosing an LLM for a product, you shouldn’t be asking “Which model performs best on everything?” You should be asking “Which model performs best on the specific tasks I’m using an LLM for in my product?” The same is true when testing those models. General evaluations might tell you what a model can do in theory, but narrow, task-specific evaluation gives you more insight into how an LLM might perform in your system.

Narrowing Your View of LLMs in Your Product

A good starting point is your own product backlog. What user stories or acceptance criteria rely on LLM output? Are you summarizing help tickets? Extracting key details from a legal document? Offering chat-based guidance inside a form?

Each of these cases implies a different shape of task, a different kind of input, and a different tolerance for error. You don’t need a model that can pass the bar exam if all it’s doing is reformatting user input into structured tags.

Likewise, your system architecture already has clues. Look at your data flow diagrams. Where does the LLM fit? Is it

  • Upstream, producing content a human will review?
  • Midstream, transforming data before another component consumes it?
  • Downstream, directly generating something the user sees?

Once you’ve mapped where and how your LLMs operate, you can start grouping those uses into categories. Most applications break down into a small number of task types:

  • Question answering, usually within a domain or based on internal documentation
  • Conversational interactions, like guided help or natural language search
  • Structured output generation, such as tagging, classification, or rewriting
  • Coding, including converting a plain language question to SQL
  • Information extraction, pulling entities or values from text or documents

These aren’t hard categories, but they help you shift from “We’re using an LLM” to “We’re using an LLM to do this specific kind of thing,” which is the first step toward evaluating and testing with intent.
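To make that concrete, here is a minimal sketch in Python of what such an inventory might look like. The feature names, task buckets, and fields are illustrative assumptions, not a prescribed schema; the point is simply to write down which feature makes which kind of LLM call.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TaskType(Enum):
    QUESTION_ANSWERING = auto()
    CONVERSATION = auto()
    STRUCTURED_OUTPUT = auto()
    CODING = auto()
    EXTRACTION = auto()


@dataclass
class LlmUse:
    feature: str          # which product feature makes the call
    task_type: TaskType   # which bucket it falls into
    human_reviewed: bool  # does a person see the output before it's used?


# Hypothetical inventory for an imaginary support product.
llm_uses = [
    LlmUse("Summarize help tickets", TaskType.STRUCTURED_OUTPUT, human_reviewed=True),
    LlmUse("In-form chat guidance", TaskType.CONVERSATION, human_reviewed=False),
    LlmUse("Pull key fields from contracts", TaskType.EXTRACTION, human_reviewed=True),
]
```

Even a list this small makes it obvious which tasks carry the most risk and which ones deserve their own evaluation.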

General Benchmarks vs. Specific Evaluation

Most model evaluations you’ll see in public, e.g. in blog posts, marketing copy, and research announcements, are framed around broad benchmarks like:

  • MMLU
  • HellaSwag
  • GSM8K
  • HumanEval

These benchmarks are useful for comparing overall model capability, but they’re not designed to tell you how a model will perform in your application. It’s the difference between asking “How good is this model?” and asking “How good is this model at the one thing I need it to do consistently well?”

If you’re building a conversational agent for onboarding new users to your fintech product, a model’s performance on arcane high-school-level logic puzzles isn’t especially informative. What you care about is whether it can explain your product clearly, interpret user intent, and respond in a tone that matches your brand. There’s no leaderboard for that, but there could be internal evals that simulate and score those tasks.
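Here is a rough sketch of what such an internal eval could look like. The prompts, expected phrases, and the call_model callable are all hypothetical stand-ins for your own product and client; the structure is what matters, not the specifics.

```python
from typing import Callable

# A minimal internal-eval sketch. The prompts, expected phrases, and the
# call_model callable are hypothetical stand-ins for your own product and client.
ONBOARDING_CASES = [
    {
        "prompt": "How do I link my bank account?",
        "must_mention": ["settings", "linked accounts"],  # product-specific facts
        "must_not_mention": ["crypto"],                   # off-topic or off-brand content
    },
    {
        "prompt": "What fees do you charge?",
        "must_mention": ["no monthly fee"],
        "must_not_mention": [],
    },
]


def run_eval(call_model: Callable[[str], str]) -> float:
    """Return the fraction of onboarding cases the model handles acceptably."""
    passed = 0
    for case in ONBOARDING_CASES:
        answer = call_model(case["prompt"]).lower()
        ok = all(term in answer for term in case["must_mention"]) and not any(
            term in answer for term in case["must_not_mention"]
        )
        passed += ok
    return passed / len(ONBOARDING_CASES)
```

Even something this crude, run against a few candidate models, tells you more about your onboarding flow than any leaderboard will.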

To make more meaningful choices, treat public benchmarks as a starting point, not the final answer. Use them to:

  • Eliminate clearly unfit models (e.g. ones that fail at basic language understanding or hallucinate constantly)
  • Understand model strengths and tradeoffs (e.g. coding vs. reasoning vs. summarization)
  • Identify when a model’s capability might align with your needs, but always validate those assumptions with your own evaluations

Staying on top of new model releases can help, but don’t get caught in the trap of chasing leaderboard champions. Instead, ask whether a model’s strengths map to the buckets you identified earlier. If not, then it doesn’t matter how good it is in general; it’s the wrong tool for your job.

What Could Go Wrong

It’s tempting to reach for the most capable model you can afford and assume it’ll handle whatever you throw at it. After all, it’s general-purpose, right? But treating LLMs like magical Swiss Army knives can lead to some avoidable problems.

You Pick the Wrong Model for Deployment

If you’re selecting based on general benchmark performance, you might end up with a model that’s overpowered (and overpriced) for your needs, or worse, one that performs poorly at your actual task despite its high ranking. A model that aces code generation benchmarks might be weaker at following nuanced business rules or summarizing customer feedback in the tone your users expect.

You Miss the Opportunity to Use Multiple Models

Not all parts of your system need the same capabilities. If you treat the LLM as a monolith, you’re less likely to segment tasks and match them to fit-for-purpose models. Maybe one cheap model handles internal tagging, another stronger one drafts long-form content, and a third (very cautious one) supports legal review. Narrow evaluation opens the door to composability. General evaluation closes it.
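Once tasks are segmented, the routing itself can be close to trivial. A minimal sketch follows, with made-up model names and a hypothetical complete() client standing in for whatever SDK you actually use:

```python
from typing import Callable

# A minimal routing sketch: each task type maps to a fit-for-purpose model.
# The model names and the complete() client passed in are hypothetical.
MODEL_FOR_TASK = {
    "tagging": "small-cheap-model",            # high volume, low stakes
    "long_form_draft": "large-capable-model",  # quality matters, a human edits it anyway
    "legal_review": "cautious-tuned-model",    # conservative settings, narrow prompts
}


def run_task(task_type: str, prompt: str, complete: Callable[[str, str], str]) -> str:
    """Dispatch the prompt to whichever model this task type was evaluated against."""
    return complete(MODEL_FOR_TASK[task_type], prompt)
```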

You Overpay for Misaligned Capabilities

A generalist model that’s tuned to do a little bit of everything will likely cost more, both in token price and latency. If only 10% of your use cases benefit from that extra power, you’re still paying the premium on 100% of your inference. That might be fine during prototyping, but at scale, those costs compound fast.

You Build on an Illusion of Competence

This might be the most subtle risk: assuming the model is “smart enough” to handle a task without validating how it actually behaves in context. That leads to brittle, untestable implementations that only surface problems in production, usually when a user points them out.
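One antidote is to test LLM-backed behavior the way you’d test any other feature. Here is a minimal pytest-style sketch, assuming a hypothetical extract_tags() wrapper around your LLM call (stubbed below only so the example stands alone):

```python
# A pytest-style regression test for an LLM-backed feature. extract_tags() is a
# hypothetical wrapper around your real LLM call; the stub below exists only so
# the sketch stands alone. In your codebase you'd import the real function.

def extract_tags(text: str) -> list[str]:
    # Stand-in: replace with the function that actually prompts your model.
    return ["billing", "refund"]


def test_tags_refund_request():
    tags = extract_tags("I was charged twice for my subscription, please refund one.")
    assert "billing" in tags  # the behavior the product actually depends on
    assert len(tags) <= 5     # guard against runaway, unusable output
```

Tests like this won’t catch everything, but they surface the obvious regressions before a user does.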

Conclusion

LLMs really are general-purpose tools, but your product isn’t general-purpose software. That disconnect is where a lot of the trouble starts. If you evaluate models based on what they can do instead of what you need, you risk choosing the wrong tool, misjudging its performance, or inflating costs without improving outcomes.

When you treat your LLM usage like any other part of your system, by scoping it to a specific purpose, evaluating it in that context, and testing it like a feature instead of a miracle, you get better results. More predictable behavior. Lower risk. Less surprise in production.

The real power of LLMs isn’t in doing everything. It’s in doing the right thing, well.
