The Evaluation Loop: How We Measure and Improve AI Tools
You shipped tools. Agents use them. But are they working well? Without evaluation, you're guessing. Here's the four-step loop we use to measure tool effectiveness and systematically improve agent performance.
The Problem We Found
After building our initial tool system, we had a nagging question: how do we know if these tools are actually working well?
Agent outputs looked reasonable. The pipeline completed successfully. Users were generating code. But “looks reasonable” is not a metric. We had no way to distinguish between a tool that agents used correctly 95% of the time and one they struggled with 40% of the time — because both produced functional output. The difference was in the cost: tokens wasted on retries, unnecessary tool calls, context pollution from oversized responses.
We discovered this gap the hard way. We redesigned a tool’s description — a single sentence change — and agent task completion jumped 12%. We added a format parameter to another tool and token consumption per task dropped 30%. These were not minor optimizations. They were fundamental improvements hiding behind “it works.”
The lesson was clear: without systematic evaluation, you are optimizing by intuition. And intuition is unreliable when the system you are optimizing is non-deterministic. With the MCP ecosystem now hosting thousands of tool servers, the need for systematic evaluation has never been greater.
The Four-Step Evaluation Loop
Step 1: Prototype
Build a quick prototype and test it locally. For HubAI, this means:
- Implement the tool with `BaseMCPTool` and Zod schema validation
- Run it through a handful of representative agent tasks manually
- Collect initial feedback — does the agent select the tool? Does it use the right parameters? Is the response useful?
The prototype does not need to be perfect. It needs to be testable. A working tool with a mediocre description is more valuable than a perfectly described tool that does not exist yet.
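A prototype at this stage can be very small. Here is a minimal sketch of the shape; the interface and hand-rolled validator below are illustrative stand-ins, not the real `BaseMCPTool` class or the Zod API:

```typescript
// Hypothetical stand-ins for BaseMCPTool and a Zod schema; the real
// classes are not reproduced here.
interface ToolResult {
  ok: boolean;
  content: string;
}

interface PrototypeTool {
  name: string;
  description: string; // the sentence agents read when selecting tools
  validate(params: unknown): string | null; // error message, or null if valid
  execute(params: Record<string, unknown>): ToolResult;
}

// A throwaway prototype: imperfect, but testable against real agent tasks.
const getProjectContext: PrototypeTool = {
  name: "getProjectContext",
  description: "Returns project metadata. Call this before editing entities.",
  validate(params) {
    if (typeof params !== "object" || params === null) {
      return "params must be an object";
    }
    return null;
  },
  execute() {
    return { ok: true, content: JSON.stringify({ project: "demo", entities: [] }) };
  },
};
```

The point of the sketch is the testability bar: a name, a description agents will read, validation that produces a usable error message, and an execute path you can run by hand.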
Step 2: Evaluate
Run a programmatic evaluation — one agentic loop per task, each with a single task prompt and the tool set under test.
Our evaluation suite has 20 representative tasks spanning the full complexity range:
- Single entity with basic fields (simple)
- Multiple entities with one-to-many relationships (medium)
- Complex project with embedded documents, virtual fields, and cross-entity validation (hard)
Each task is paired with a verifiable outcome — not “does the output look good,” but “does the generated schema contain exactly these fields with exactly these types.” Exact match where possible, LLM-as-judge for subjective quality dimensions.
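For the schema case, an exact-match check can be a few lines. A sketch, where plain name-to-type maps stand in for the real schema representation:

```typescript
// A verifiable outcome: the generated schema must contain exactly the
// expected fields with exactly the expected types. Field specs here are
// simplified name-to-type maps, not the real schema format.
type FieldSpec = Record<string, string>;

function schemaMatches(expected: FieldSpec, generated: FieldSpec): boolean {
  const expKeys = Object.keys(expected).sort();
  const genKeys = Object.keys(generated).sort();
  if (expKeys.length !== genKeys.length) return false;
  return expKeys.every((k, i) => genKeys[i] === k && generated[k] === expected[k]);
}
```

Exact checks like this cover the structural tasks; subjective quality dimensions still go to the LLM judge.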
We track five metrics per evaluation run:
| Metric | What It Tells You |
|---|---|
| Accuracy | Did the agent produce the correct output? |
| Tool-call count | How many tool calls did the agent need? (Fewer = better tool design) |
| Token consumption | How many tokens were consumed? (Lower = more efficient tools) |
| Tool errors | How many tool calls failed? (More = unclear descriptions or schemas) |
| Runtime | How long did the task take? (Slower = unnecessary steps) |
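The five metrics can be aggregated per run with a small summarizer. A sketch with illustrative record shapes; the real suite's types are assumptions:

```typescript
// Illustrative per-task record; the real suite's shape is an assumption.
interface TaskMetrics {
  accurate: boolean;   // did the output match the verifiable outcome?
  toolCalls: number;   // total tool calls the agent made
  tokens: number;      // tokens consumed for the task
  toolErrors: number;  // tool calls that failed
  runtimeMs: number;   // wall-clock duration
}

function summarize(runs: TaskMetrics[]) {
  const n = runs.length;
  const totalCalls = runs.reduce((s, r) => s + r.toolCalls, 0);
  return {
    accuracy: runs.filter(r => r.accurate).length / n,
    avgToolCalls: totalCalls / n,
    avgTokens: runs.reduce((s, r) => s + r.tokens, 0) / n,
    toolErrorRate: runs.reduce((s, r) => s + r.toolErrors, 0) / totalCalls,
    avgRuntimeMs: runs.reduce((s, r) => s + r.runtimeMs, 0) / n,
  };
}
```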
Step 3: Analyze
This is where the real insights live. Raw metrics tell you what is happening. Transcripts tell you why.
We read every evaluation transcript — not just the final output, but the full sequence of tool calls, agent reasoning, and intermediate results. The patterns we look for:
- Redundant tool calls — If the agent calls the same tool twice with slightly different parameters, the tool probably needs pagination, filtering, or a different response format.
- Invalid parameter errors — If agents frequently pass wrong parameter types or values, the description is ambiguous or the parameter names are unclear.
- Tool selection confusion — If the agent hesitates between two tools (visible in reasoning traces), the tools overlap or their descriptions are too similar.
- Context overflow after tool calls — If subsequent agent performance degrades after a specific tool call, the tool is returning too much data.
The critical skill is reading between the lines. Agents do not always say what they mean. An agent that says “Let me try a different approach” after a tool call is telling you that the tool’s response was confusing. An agent that calls the same tool three times with increasing specificity is telling you the tool needs better filtering.
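Some of these patterns can be flagged mechanically before reading transcripts by hand. A sketch that computes a redundant-call rate across transcripts; the transcript shape is a simplified assumption (real transcripts also carry reasoning text and results):

```typescript
// Simplified transcript entry; real transcripts include reasoning and results.
interface ToolCall {
  tool: string;
  params: Record<string, unknown>;
}

// Fraction of transcripts in which the given tool was called more than once.
function redundantCallRate(transcripts: ToolCall[][], tool: string): number {
  if (transcripts.length === 0) return 0;
  const redundant = transcripts.filter(
    t => t.filter(c => c.tool === tool).length > 1
  ).length;
  return redundant / transcripts.length;
}
```

A screen like this tells you which transcripts to read first; the why still comes from reading them.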
Step 4: Collaborate
The final step closes the loop. We take the evaluation transcripts — including the agent reasoning, tool calls, and results — and feed them into an agent tasked with improving the tools.
This agent-as-collaborator approach works because the AI can spot patterns across dozens of transcripts that a human would miss:
- “Agents consistently struggle with the `scope` parameter — consider renaming it to `detailLevel` and adding examples in the description”
- “The `getEntityContext` tool returns relationship data that is only used 15% of the time — consider making it opt-in with an `includeRelationships` flag”
- “Error messages for the `validateSchema` tool do not include the failing field name — agents waste 3-4 tool calls locating the error”
We maintain a held-out test set to prevent overfitting. Tool improvements are tested against tasks the improvement agent has never seen, ensuring that fixes generalize rather than memorize.
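A held-out set only works if the split is reproducible and the improvement agent never touches it. A minimal sketch of a deterministic split using a seeded shuffle; the function name and task shape are illustrative:

```typescript
// Deterministic dev/held-out split via a seeded Fisher-Yates shuffle,
// so the held-out tasks are stable across runs. Names are illustrative.
function splitTasks<T>(
  tasks: T[],
  heldOutCount: number,
  seed = 42
): { dev: T[]; heldOut: T[] } {
  let s = seed >>> 0;
  const rand = () => {
    s = (s * 1664525 + 1013904223) >>> 0; // linear congruential generator
    return s / 2 ** 32;
  };
  const shuffled = [...tasks];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return {
    heldOut: shuffled.slice(0, heldOutCount),
    dev: shuffled.slice(heldOutCount),
  };
}
```

Only the `dev` tasks (and their transcripts) are ever shown to the improvement agent; `heldOut` is used solely to verify that a fix generalizes.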
What This Looks Like in Practice
Here is a concrete example of the loop in action:
Problem detected: The `getProjectContext` tool had a 23% redundant call rate — agents frequently called it twice per task.
Transcript analysis: Agents called the tool once for project metadata, then again for project settings. The tool returned both, but the response was large (~1,500 tokens) and agents could not find the settings within the metadata block.
Fix: Added a `responseFormat` parameter with options `concise` (metadata only, ~200 tokens) and `detailed` (full context, ~1,500 tokens). Updated the description to recommend starting with `concise`.
Result: Redundant call rate dropped from 23% to 3%. Token consumption per task dropped 18%.
One parameter change. One description update. Measurable improvement across every metric.
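The fix could be sketched as follows; the types, field names, and return values are illustrative, not the actual HubAI implementation:

```typescript
// Illustrative types; not the actual HubAI tool code.
type ResponseFormat = "concise" | "detailed";

interface ProjectContext {
  metadata: { name: string; entityCount: number };
  settings?: Record<string, unknown>; // present only in detailed responses
}

// concise: metadata only (~200 tokens); detailed: full context (~1,500 tokens)
function getProjectContext(format: ResponseFormat = "concise"): ProjectContext {
  const metadata = { name: "demo", entityCount: 3 };
  if (format === "concise") return { metadata };
  return { metadata, settings: { locale: "en", strictMode: true } };
}
```

Defaulting to `concise` matters as much as the parameter itself: agents that never need settings never pay the 1,500-token cost.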
The Key Insight
Evaluation is not a one-time quality gate. It is a continuous loop that drives systematic improvement. Every round of evaluation reveals tool design issues that are invisible without data.
The most impactful improvements are usually the smallest: a renamed parameter, a restructured response, a more specific error message. But you cannot find them without measuring, and you cannot measure without a structured evaluation process.
Tools are contracts between deterministic systems and non-deterministic agents. Evaluation is how you ensure those contracts are working. Without it, you are optimizing by intuition — and intuition is unreliable when the system is non-deterministic.