feat(campaign): llmJudge single-call bridge + prune dead exports by drewstone · Pull Request #280 · tangle-network/agent-eval

drewstone · 2026-06-23T08:05:19Z

What

Builds the llmJudge(name, prompt, opts?) helper that the JudgeConfig doc-comment (src/campaign/types.ts) promised but that never existed, and prunes a set of confirmed 0-consumer dead/duplicate exports.

A) `llmJudge` — the single-LLM-call judge bridge

llmJudge(name, prompt, opts?) returns a canonical JudgeConfig whose score():

- makes one call against prompt through an injected ChatClient (the substrate's transport-agnostic LLM seam: router / sandbox / cli-bridge / mock),
- parses the model's per-dimension scores, validating that every declared dimension is present and numeric,
- normalizes by scale (unit [0,1] default, or ten [0,10]→[0,1]) and clamps to [0,1],
- composites via weightedComposite (the same sum-normalized reducer ensembleJudge uses),
- returns the canonical JudgeScore { dimensions, composite ∈ [0,1], notes }.
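
The normalize, clamp, and composite steps above can be sketched roughly as follows. This is a minimal illustration, not the shipped code: `normalize` and `weightedCompositeSketch` are hypothetical names, and the choices to default a missing weight to 1 and to return 0 when the weights sum to zero are assumptions rather than confirmed behavior of `weightedComposite`.

```typescript
// Hypothetical names throughout; the real weightedComposite signature may differ.
type Scale = "unit" | "ten";

/** Map a raw score onto [0,1] for the given scale, then clamp to [0,1]. */
function normalize(raw: number, scale: Scale = "unit"): number {
  const scaled = scale === "ten" ? raw / 10 : raw;
  return Math.min(1, Math.max(0, scaled));
}

/** Sum-normalized weighted mean. Assumption: a missing weight counts as 1. */
function weightedCompositeSketch(
  dimensions: Record<string, number>,
  weights: Record<string, number> = {},
): number {
  let weighted = 0;
  let total = 0;
  for (const name of Object.keys(dimensions)) {
    const w = weights[name] ?? 1;
    weighted += w * dimensions[name];
    total += w;
  }
  // Assumption: an all-zero weight set yields 0 instead of dividing by zero.
  return total > 0 ? weighted / total : 0;
}

// A "ten"-scale reply where one raw score overshoots the scale and is clamped.
const dimensions = {
  accuracy: normalize(8, "ten"), // 0.8
  clarity: normalize(12, "ten"), // clamped to 1
};
// Roughly (3 * 0.8 + 1 * 1) / 4, up to floating-point rounding.
const composite = weightedCompositeSketch(dimensions, { accuracy: 3, clarity: 1 });
```

Dividing by the weight total keeps the composite in [0,1] whenever every dimension is already in [0,1], which is why the clamp is applied per dimension before compositing.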

Fail-loud: an unparseable response or a missing/non-numeric dimension throws JudgeParseError, so the campaign engine records a failed cell instead of a silent zero. Exported from the root barrel and /campaign. New test (tests/llm-judge.test.ts, 12 cases) covers the well-formed canonical JudgeScore, scale normalization, weights, fenced-JSON parsing, single-call behavior, and the fail-loud paths.
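
To illustrate the fail-loud contract, here is a hedged sketch of a parser that accepts plain or fenced JSON and throws on any missing or non-numeric dimension. `JudgeParseErrorSketch` and `parseDimensionScores` are hypothetical stand-ins; the real `JudgeParseError` and its parsing rules may differ.

```typescript
// Hypothetical stand-in for the package's JudgeParseError.
class JudgeParseErrorSketch extends Error {
  constructor(message: string) {
    super(message);
    this.name = "JudgeParseErrorSketch";
  }
}

// Three backticks, built at runtime so this example nests cleanly in markdown.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

/** Parse a model reply into per-dimension scores, throwing instead of defaulting. */
function parseDimensionScores(
  reply: string,
  declared: string[],
): Record<string, number> {
  // Prefer the contents of a fenced block if the model wrapped its JSON in one.
  const fenced = reply.match(FENCED_JSON);
  const body = (fenced?.[1] ?? reply).trim();

  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    throw new JudgeParseErrorSketch("response is not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new JudgeParseErrorSketch("response is not a JSON object");
  }

  const record = parsed as Record<string, unknown>;
  const scores: Record<string, number> = {};
  for (const name of declared) {
    const value = record[name];
    // No silent zero: a missing or non-numeric dimension fails the whole call.
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new JudgeParseErrorSketch(`dimension "${name}" is missing or non-numeric`);
    }
    scores[name] = value;
  }
  return scores;
}
```

Throwing here, rather than substituting 0, is what lets a caller distinguish "the judge scored this badly" from "the judge's output was unusable".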

This is the landing pad the 6 deprecated judges migrate to next (separate pass).

B) Pruned confirmed 0-consumer dead exports

Each re-verified with a fleet-wide grep (excluding node_modules + agent-eval's own checkouts):

- adversarialScenarioSearch removed: fuzzAgent is the live twin. AdversarialMutation is kept (the fuzz harness consumes it).
- proposeAutomatedPullRequest removed: superseded by campaign openAutoPr. Its input validation (path-traversal / duplicate-path / branch==base / empty-title guards) is folded into each AutoPrClient.proposeChange so the safety invariant stays on the live path. httpGithubClient / ghCliClient are kept because they have live fleet consumers (tax-agent, gtm-agent, creative-agent production loops).
- buildByAxis / summariseRows un-exported from the /matrix barrel: internal aggregation helpers behind runAgentMatrix, with no external consumers.
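
The four guards folded into proposeChange (path-traversal, duplicate-path, branch==base, empty-title) might look something like the sketch below. The `ChangeProposal` shape and `validateProposal` name are hypothetical, chosen only for illustration; the real AutoPrClient input type and error types are not shown in this PR description.

```typescript
// Hypothetical input shape; the real AutoPrClient.proposeChange input may differ.
interface ChangeProposal {
  title: string;
  branch: string;
  base: string;
  files: { path: string; content: string }[];
}

/** Reject unsafe proposals before any client talks to the forge. */
function validateProposal(p: ChangeProposal): void {
  if (p.title.trim() === "") {
    throw new Error("title must not be empty");
  }
  if (p.branch === p.base) {
    throw new Error("branch must differ from base");
  }
  const seen = new Set<string>();
  for (const file of p.files) {
    // Split on both separators so "..\\x" is caught as well as "../x".
    const segments = file.path.split(/[\\/]/);
    if (file.path.startsWith("/") || segments.indexOf("..") !== -1) {
      throw new Error(`path escapes the repository: ${file.path}`);
    }
    if (seen.has(file.path)) {
      throw new Error(`duplicate path: ${file.path}`);
    }
    seen.add(file.path);
  }
}
```

Calling one shared validator at the top of every client's proposeChange is one way to keep the invariant on the live path regardless of which transport is in use.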

C) Investigated the 0-consumer yellow set — kept all 7

Each is a deliberate published-API affordance and/or internally consumed:

- runAgentMatrix: live external consumers (tax-agent, blueprint-agent import it from /matrix).
- mcnemar: the exact paired-binary significance test; it pairs with the live mcnemarPower/mcnemarRequiredN and has its own test.
- exportRewardModel: tier-3 RL reward-model export product surface (tested, referenced in docs).
- summarizeWorkflowTrace: heavily used internally and part of the /workflow public surface.
- fromClaudeCodeSession / fromOpenCodeSession: members of the documented harness-intake adapter family, plus an internal belief-state consumer.
- parseAgentTrace: documented /contract/intake adapter affordance.
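
For readers unfamiliar with the exact paired-binary test that mcnemar provides, the textbook form is a two-sided binomial test on the discordant pairs. The sketch below shows that textbook computation only; `mcnemarExactSketch` is a hypothetical name and its signature is an assumption, not the package's mcnemar API.

```typescript
/**
 * Two-sided exact McNemar p-value.
 * b = pairs where only variant A passed, c = pairs where only variant B passed.
 * Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
 */
function mcnemarExactSketch(b: number, c: number): number {
  const n = b + c;
  if (n === 0) return 1; // no discordant pairs, no evidence either way
  const k = Math.min(b, c);
  // Accumulate the lower tail P(X <= k) term by term, starting from P(X = 0).
  let term = Math.pow(0.5, n);
  let tail = term;
  for (let i = 1; i <= k; i++) {
    term *= (n - i + 1) / i; // C(n, i) = C(n, i - 1) * (n - i + 1) / i
    tail += term;
  }
  // Double for the two-sided test and cap at 1.
  return Math.min(1, 2 * tail);
}

// 1 vs 9 discordant pairs: 2 * (1 + 10) / 1024, about 0.0215.
const p = mcnemarExactSketch(1, 9);
```

This direct summation is adequate for the small discordant counts typical of eval comparisons; for very large n the leading `0.5 ** n` term underflows and a log-space formulation would be needed.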

Kept

The 6 deprecated judges (createCustomJudge/createDomainExpertJudge/codeExecutionJudge/coherenceJudge/adversarialJudge/defaultJudges) and the legacy JudgeFn type — their consumer migration onto llmJudge is the next pass.

Verification

- `pnpm run typecheck`: clean
- `pnpm run lint`: clean (no errors)
- `pnpm test`: 248 files, 2542 passed, 2 skipped, 0 failed
- `pnpm run build`: clean; verified llmJudge is a function exported from the built dist/index.js + dist/campaign/index.js, and score() returns a canonical JudgeScore from the built artifact

Version

0.98.0 → 0.99.0 (additive helper + removals = minor). npm + PyPI trio bumped together (package.json, clients/python/pyproject.toml, __init__.py).

Build the `llmJudge(name, prompt, opts?)` helper the `JudgeConfig` doc-comment (src/campaign/types.ts) promised but never shipped. It returns a canonical `JudgeConfig` whose `score()` makes ONE injected `ChatClient` call against `prompt`, parses the model's per-dimension scores, normalizes by scale, composites via `weightedComposite` (the same reducer `ensembleJudge` uses), and returns the canonical `JudgeScore` ({ dimensions, composite in [0,1], notes }). Fail-loud: an unparseable response or a missing/non-numeric dimension throws `JudgeParseError` so the engine records a failed cell, never a silent zero. Exported from the root barrel and `/campaign`.

Prune confirmed 0-consumer dead/duplicate exports:
- Remove `adversarialScenarioSearch` (fuzzAgent is the live twin); keep `AdversarialMutation`, still consumed by the fuzz harness.
- Remove `proposeAutomatedPullRequest` (superseded by campaign `openAutoPr`); fold its input validation into each `AutoPrClient` so the path-traversal / duplicate-path / branch==base guards stay on the live `proposeChange` path. `httpGithubClient`/`ghCliClient` keep their fleet consumers.
- Un-export `buildByAxis`/`summariseRows` from the `/matrix` barrel (internal aggregation helpers behind `runAgentMatrix`).

Kept the deprecated judges (createCustomJudge/createDomainExpertJudge/codeExecutionJudge/coherenceJudge/adversarialJudge/defaultJudges) and the legacy JudgeFn type; their migration onto llmJudge is a separate pass.

Version trio 0.98.0 -> 0.99.0 (additive helper + removals).

tangletools

✅ Auto-approved PR — `98c9d027`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T08:05:26Z

tangletools approved these changes Jun 23, 2026

drewstone merged commit 2312feb into main Jun 23, 2026
1 check passed

drewstone merged 1 commit into main from chore/llm-judge-bridge-cleanup
