The Evaluation Loop: How We Measure and Improve AI Tools
You shipped tools. Agents use them. But are they working well? Without evaluation, you're guessing. Here's the four-step loop we use to measure tool effectiveness and systematically improve agent performance.
The Problem We Found
After building our initial tool system, we had a nagging question: how do we know if these tools are actually working well?
Agent outputs looked reasonable. The pipeline completed successfully. Users were generating code. But “looks reasonable” is not a metric. We had no way to distinguish between a tool that agents used correctly 95% of the time and one they struggled with 40% of the time — because both produced functional output. The difference was in the cost: tokens wasted on retries, unnecessary tool calls, context pollution from oversized responses.
We discovered this gap the hard way. We redesigned a tool’s description — a single sentence change — and agent task completion jumped 12%. We added a format parameter to another tool and token consumption per task dropped 30%. These were not minor optimizations. They were fundamental improvements hiding behind “it works.”
The lesson was clear: without systematic evaluation, you are optimizing by intuition. And intuition is unreliable when the system you are optimizing is non-deterministic. With the MCP ecosystem now hosting thousands of tool servers, the need for systematic evaluation has never been greater.
The Four-Step Evaluation Loop
Step 1: Prototype
Build a quick prototype and test it locally. For HubAI, this means:
- Implement the tool with `BaseMCPTool` and Zod schema validation
- Run it through a handful of representative agent tasks manually
- Collect initial feedback — does the agent select the tool? Does it use the right parameters? Is the response useful?
The prototype does not need to be perfect. It needs to be testable. A working tool with a mediocre description is more valuable than a perfectly described tool that does not exist yet.
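A prototype at this stage can be very small. Here is a minimal sketch of the shape; the interface and hand-rolled validator below are illustrative stand-ins, not the real `BaseMCPTool` class or the Zod API:

```typescript
// Hypothetical stand-ins for BaseMCPTool and a Zod schema; the real
// classes are not reproduced here.
interface ToolResult {
  ok: boolean;
  content: string;
}

interface PrototypeTool {
  name: string;
  description: string; // the sentence agents read when selecting tools
  validate(params: unknown): string | null; // error message, or null if valid
  execute(params: Record<string, unknown>): ToolResult;
}

// A throwaway prototype: imperfect, but testable against real agent tasks.
const getProjectContext: PrototypeTool = {
  name: "getProjectContext",
  description: "Returns project metadata. Call this before editing entities.",
  validate(params) {
    if (typeof params !== "object" || params === null) {
      return "params must be an object";
    }
    return null;
  },
  execute() {
    return { ok: true, content: JSON.stringify({ project: "demo", entities: [] }) };
  },
};
```

The point of the sketch is the testability bar: a name, a description agents will read, validation that produces a usable error message, and an execute path you can run by hand.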
Step 2: Evaluate
Run a programmatic evaluation — one agentic loop per task, each with a single task prompt and the tool set under test.
Our evaluation suite has 20 representative tasks spanning the full complexity range:
- Single entity with basic fields (simple)
- Multiple entities with one-to-many relationships (medium)
- Complex project with embedded documents, virtual fields, and cross-entity validation (hard)
Each task is paired with a verifiable outcome — not “does the output look good,” but “does the generated schema contain exactly these fields with exactly these types.” Exact match where possible, LLM-as-judge for subjective quality dimensions.
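For the schema case, an exact-match check can be a few lines. A sketch, where plain name-to-type maps stand in for the real schema representation:

```typescript
// A verifiable outcome: the generated schema must contain exactly the
// expected fields with exactly the expected types. Field specs here are
// simplified name-to-type maps, not the real schema format.
type FieldSpec = Record<string, string>;

function schemaMatches(expected: FieldSpec, generated: FieldSpec): boolean {
  const expKeys = Object.keys(expected).sort();
  const genKeys = Object.keys(generated).sort();
  if (expKeys.length !== genKeys.length) return false;
  return expKeys.every((k, i) => genKeys[i] === k && generated[k] === expected[k]);
}
```

Exact checks like this cover the structural tasks; subjective quality dimensions still go to the LLM judge.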
We track five metrics per evaluation run:
| Metric | What It Tells You |
|---|---|
| Accuracy | Did the agent produce the correct output? |
| Tool-call count | How many tool calls did the agent need? (Fewer = better tool design) |
| Token consumption | How many tokens were consumed? (Lower = more efficient tools) |
| Tool errors | How many tool calls failed? (More = unclear descriptions or schemas) |
| Runtime | How long did the task take? (Slower = unnecessary steps) |
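The five metrics can be aggregated per run with a small summarizer. A sketch with illustrative record shapes; the real suite's types are assumptions:

```typescript
// Illustrative per-task record; the real suite's shape is an assumption.
interface TaskMetrics {
  accurate: boolean;   // did the output match the verifiable outcome?
  toolCalls: number;   // total tool calls the agent made
  tokens: number;      // tokens consumed for the task
  toolErrors: number;  // tool calls that failed
  runtimeMs: number;   // wall-clock duration
}

function summarize(runs: TaskMetrics[]) {
  const n = runs.length;
  const totalCalls = runs.reduce((s, r) => s + r.toolCalls, 0);
  return {
    accuracy: runs.filter(r => r.accurate).length / n,
    avgToolCalls: totalCalls / n,
    avgTokens: runs.reduce((s, r) => s + r.tokens, 0) / n,
    toolErrorRate: runs.reduce((s, r) => s + r.toolErrors, 0) / totalCalls,
    avgRuntimeMs: runs.reduce((s, r) => s + r.runtimeMs, 0) / n,
  };
}
```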
Step 3: Analyze
This is where the real insights live. Raw metrics tell you what is happening. Transcripts tell you why.
We read every evaluation transcript — not just the final output, but the full sequence of tool calls, agent reasoning, and intermediate results. The patterns we look for:
- Redundant tool calls — If the agent calls the same tool twice with slightly different parameters, the tool probably needs pagination, filtering, or a different response format.
- Invalid parameter errors — If agents frequently pass wrong parameter types or values, the description is ambiguous or the parameter names are unclear.
- Tool selection confusion — If the agent hesitates between two tools (visible in reasoning traces), the tools overlap or their descriptions are too similar.
- Context overflow after tool calls — If subsequent agent performance degrades after a specific tool call, the tool is returning too much data.
The critical skill is reading between the lines. Agents do not always say what they mean. An agent that says “Let me try a different approach” after a tool call is telling you that the tool’s response was confusing. An agent that calls the same tool three times with increasing specificity is telling you the tool needs better filtering.
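Some of these patterns can be flagged mechanically before reading transcripts by hand. A sketch that computes a redundant-call rate across transcripts; the transcript shape is a simplified assumption (real transcripts also carry reasoning text and results):

```typescript
// Simplified transcript entry; real transcripts include reasoning and results.
interface ToolCall {
  tool: string;
  params: Record<string, unknown>;
}

// Fraction of transcripts in which the given tool was called more than once.
function redundantCallRate(transcripts: ToolCall[][], tool: string): number {
  if (transcripts.length === 0) return 0;
  const redundant = transcripts.filter(
    t => t.filter(c => c.tool === tool).length > 1
  ).length;
  return redundant / transcripts.length;
}
```

A screen like this tells you which transcripts to read first; the why still comes from reading them.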
Step 4: Collaborate
The final step closes the loop. We take the evaluation transcripts — including the agent reasoning, tool calls, and results — and feed them into an agent tasked with improving the tools.
This agent-as-collaborator approach works because the AI can spot patterns across dozens of transcripts that a human would miss:
- “Agents consistently struggle with the `scope` parameter — consider renaming it to `detailLevel` and adding examples in the description”
- “The `getEntityContext` tool returns relationship data that is only used 15% of the time — consider making it opt-in with an `includeRelationships` flag”
- “Error messages for the `validateSchema` tool do not include the failing field name — agents waste 3-4 tool calls locating the error”
We maintain a held-out test set to prevent overfitting. Tool improvements are tested against tasks the improvement agent has never seen, ensuring that fixes generalize rather than memorize.
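A held-out set only works if the split is reproducible and the improvement agent never touches it. A minimal sketch of a deterministic split using a seeded shuffle; the function name and task shape are illustrative:

```typescript
// Deterministic dev/held-out split via a seeded Fisher-Yates shuffle,
// so the held-out tasks are stable across runs. Names are illustrative.
function splitTasks<T>(
  tasks: T[],
  heldOutCount: number,
  seed = 42
): { dev: T[]; heldOut: T[] } {
  let s = seed >>> 0;
  const rand = () => {
    s = (s * 1664525 + 1013904223) >>> 0; // linear congruential generator
    return s / 2 ** 32;
  };
  const shuffled = [...tasks];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return {
    heldOut: shuffled.slice(0, heldOutCount),
    dev: shuffled.slice(heldOutCount),
  };
}
```

Only the `dev` tasks (and their transcripts) are ever shown to the improvement agent; `heldOut` is used solely to verify that a fix generalizes.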
What This Looks Like in Practice
Here is a concrete example of the loop in action:
Problem detected: The `getProjectContext` tool had a 23% redundant call rate — agents frequently called it twice per task.
Transcript analysis: Agents called the tool once for project metadata, then again for project settings. The tool returned both, but the response was large (~1,500 tokens) and agents could not find the settings within the metadata block.
Fix: Added a `responseFormat` parameter with options `concise` (metadata only, ~200 tokens) and `detailed` (full context, ~1,500 tokens). Updated the description to recommend starting with `concise`.
Result: Redundant call rate dropped from 23% to 3%. Token consumption per task dropped 18%.
One parameter change. One description update. Measurable improvement across every metric.
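The fix could be sketched as follows; the types, field names, and return values are illustrative, not the actual HubAI implementation:

```typescript
// Illustrative types; not the actual HubAI tool code.
type ResponseFormat = "concise" | "detailed";

interface ProjectContext {
  metadata: { name: string; entityCount: number };
  settings?: Record<string, unknown>; // present only in detailed responses
}

// concise: metadata only (~200 tokens); detailed: full context (~1,500 tokens)
function getProjectContext(format: ResponseFormat = "concise"): ProjectContext {
  const metadata = { name: "demo", entityCount: 3 };
  if (format === "concise") return { metadata };
  return { metadata, settings: { locale: "en", strictMode: true } };
}
```

Defaulting to `concise` matters as much as the parameter itself: agents that never need settings never pay the 1,500-token cost.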
The Key Insight
Evaluation is not a one-time quality gate. It is a continuous loop that drives systematic improvement. Every round of evaluation reveals tool design issues that are invisible without data.
The most impactful improvements are usually the smallest: a renamed parameter, a restructured response, a more specific error message. But you cannot find them without measuring, and you cannot measure without a structured evaluation process.
Tools are contracts between deterministic systems and non-deterministic agents. Evaluation is how you ensure those contracts are working. Without it, you are optimizing by intuition — and intuition is unreliable when the system is non-deterministic.