feat(campaign): llmJudge single-call bridge + prune dead exports #280
Merged
Build the `llmJudge(name, prompt, opts?)` helper the `JudgeConfig`
doc-comment (src/campaign/types.ts) promised but never shipped. It
returns a canonical `JudgeConfig` whose `score()` makes ONE injected
`ChatClient` call against `prompt`, parses the model's per-dimension
scores, normalizes by scale, composites via `weightedComposite` (the
same reducer `ensembleJudge` uses), and returns the canonical
`JudgeScore` ({ dimensions, composite in [0,1], notes }). Fail-loud:
an unparseable response or a missing/non-numeric dimension throws
`JudgeParseError` so the engine records a failed cell, never a silent
zero. Exported from the root barrel and `/campaign`.
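The fail-loud parsing step described above could look roughly like the sketch below. It is an illustration under assumptions, not the shipped code: the `JudgeParseError` class here is a local stand-in, and `extractJson` / `parseDimensions` are hypothetical helper names. It assumes the model replies with a JSON object of per-dimension numbers, optionally wrapped in a code fence.

```typescript
// Local stand-in for the PR's JudgeParseError; the real class may differ.
class JudgeParseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "JudgeParseError";
    Object.setPrototypeOf(this, JudgeParseError.prototype);
  }
}

// Three backticks, built by concatenation so this example stays fence-safe.
const FENCE = "`" + "`" + "`";

// Hypothetical helper: pull the JSON payload out of a reply, tolerating a code fence.
function extractJson(reply: string): unknown {
  const fenced = reply.match(new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE));
  const candidate = (fenced ? fenced[1] : reply).trim();
  try {
    return JSON.parse(candidate);
  } catch (err) {
    throw new JudgeParseError("unparseable judge response: " + reply.slice(0, 80));
  }
}

// Hypothetical helper: require every named dimension to be a finite number.
// Anything else throws, so the caller records a failed cell rather than a zero.
function parseDimensions(reply: string, names: string[]): Record<string, number> {
  const parsed = extractJson(reply);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new JudgeParseError("judge response is not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const out: Record<string, number> = {};
  for (const name of names) {
    const value = record[name];
    if (typeof value !== "number" || !isFinite(value)) {
      throw new JudgeParseError('dimension "' + name + '" is missing or non-numeric');
    }
    out[name] = value;
  }
  return out;
}
```

Throwing on a missing dimension, rather than defaulting it to 0, is what keeps a malformed reply distinguishable from a genuinely low score.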
Prune confirmed 0-consumer dead/duplicate exports:
- Remove `adversarialScenarioSearch` (fuzzAgent is the live twin);
keep `AdversarialMutation`, still consumed by the fuzz harness.
- Remove `proposeAutomatedPullRequest` (superseded by campaign
`openAutoPr`); fold its input validation into each `AutoPrClient`
so the path-traversal / duplicate-path / branch==base guards stay
on the live `proposeChange` path. `httpGithubClient`/`ghCliClient`
keep their fleet consumers.
- Un-export `buildByAxis`/`summariseRows` from the `/matrix` barrel
(internal aggregation helpers behind `runAgentMatrix`).
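For reviewers, the guards folded into each `AutoPrClient` can be pictured as a single pre-flight check like the one below. This is a sketch under assumptions: the `ProposedChange` shape and the `assertSafeChange` name are hypothetical, and the real `proposeChange` input and error types may differ. It covers the path-traversal, duplicate-path, and branch==base guards, plus the empty-title guard the PR description also lists.

```typescript
// Hypothetical input shape for illustration; the real proposeChange input may differ.
interface ProposedChange {
  title: string;
  branch: string;
  base: string;
  files: { path: string; content: string }[];
}

// Hypothetical pre-flight check: throw before any network call if the input is unsafe.
function assertSafeChange(change: ProposedChange): void {
  if (change.title.trim() === "") {
    throw new Error("empty title");
  }
  if (change.branch === change.base) {
    throw new Error("branch must differ from base");
  }
  const seen: string[] = [];
  for (const file of change.files) {
    // Reject absolute paths and any ".." segment (either slash style).
    const segments = file.path.split(/[\\/]/);
    if (file.path.charAt(0) === "/" || segments.indexOf("..") !== -1) {
      throw new Error("path traversal: " + file.path);
    }
    if (seen.indexOf(file.path) !== -1) {
      throw new Error("duplicate path: " + file.path);
    }
    seen.push(file.path);
  }
}
```

Running the check inside every client, rather than in one removed wrapper, means no caller can reach the live path without it.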
Kept the deprecated judges (createCustomJudge/createDomainExpertJudge/
codeExecutionJudge/coherenceJudge/adversarialJudge/defaultJudges) and
the legacy JudgeFn type — their migration onto llmJudge is a separate
pass.
Version trio 0.98.0 -> 0.99.0 (additive helper + removals).
tangletools
approved these changes
Jun 23, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — 98c9d027
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T08:05:26Z
What
Builds the `llmJudge(name, prompt, opts?)` helper that the `JudgeConfig` doc-comment (src/campaign/types.ts) promised but that never existed, and prunes a set of confirmed 0-consumer dead/duplicate exports.

A) `llmJudge`: the single-LLM-call judge bridge
`llmJudge(name, prompt, opts?)` returns a canonical `JudgeConfig` whose `score()`:
- sends `prompt` through an injected `ChatClient` (the substrate's transport-agnostic LLM seam: router / sandbox / cli-bridge / mock) in exactly one call,
- parses the model's per-dimension scores,
- normalizes by `scale` (`unit` [0,1] by default, or `ten` [0,10] → [0,1]) and clamps to [0,1],
- composites via `weightedComposite` (the same sum-normalized reducer `ensembleJudge` uses),
- returns the canonical `JudgeScore` `{ dimensions, composite ∈ [0,1], notes }`.

Fail-loud: an unparseable response or a missing/non-numeric dimension throws `JudgeParseError`, so the campaign engine records a failed cell instead of a silent zero. Exported from the root barrel and `/campaign`. A new test (tests/llm-judge.test.ts, 12 cases) covers the well-formed canonical `JudgeScore`, scale normalization, weights, fenced-JSON parsing, single-call behavior, and the fail-loud paths.

This is the landing pad the 6 deprecated judges migrate to next (separate pass).
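The normalize-then-composite arithmetic above can be sketched as follows. The names `Scale`, `normalizeScore`, and `compositeSketch` are hypothetical and are not the package's exports. The default weight of 1 for an unweighted dimension is an assumption of this sketch; the real `weightedComposite` may treat that case differently.

```typescript
// Sketch only: hypothetical names, not the package's actual exports.
type Scale = "unit" | "ten";

// Map a raw model score onto [0,1]: divide by 10 on the `ten` scale, then clamp.
function normalizeScore(raw: number, scale: Scale = "unit"): number {
  const unit = scale === "ten" ? raw / 10 : raw;
  return Math.min(1, Math.max(0, unit));
}

// Sum-normalized weighted mean of already-normalized dimensions.
// Assumption: a dimension with no explicit weight gets weight 1.
function compositeSketch(
  dimensions: Record<string, number>,
  weights: Record<string, number> = {},
): number {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const name of Object.keys(dimensions)) {
    const weight = Object.prototype.hasOwnProperty.call(weights, name) ? weights[name] : 1;
    weightedSum += weight * dimensions[name];
    weightTotal += weight;
  }
  if (weightTotal <= 0) {
    throw new Error("weights must sum to a positive number");
  }
  return weightedSum / weightTotal;
}

// Usage: two dimensions scored on the ten-point scale, accuracy weighted 3x.
const rawScores = { accuracy: 10, tone: 0 };
const normalized = {
  accuracy: normalizeScore(rawScores.accuracy, "ten"),
  tone: normalizeScore(rawScores.tone, "ten"),
};
// (3 * 1 + 1 * 0) / (3 + 1) = 0.75
const compositeScore = compositeSketch(normalized, { accuracy: 3, tone: 1 });
console.log(normalized, compositeScore);
```

Dividing by the weight total keeps the composite in [0,1] whenever every dimension is already clamped to [0,1], regardless of how the weights are scaled.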
B) Pruned confirmed 0-consumer dead exports
Each re-verified with a fleet-wide grep (excluding `node_modules` + agent-eval's own checkouts):
- `adversarialScenarioSearch` removed: `fuzzAgent` is the live twin. `AdversarialMutation` is kept (the fuzz harness consumes it).
- `proposeAutomatedPullRequest` removed: superseded by campaign `openAutoPr`. Its input validation (path-traversal / duplicate-path / branch==base / empty-title guards) is folded into each `AutoPrClient.proposeChange` so the safety invariant stays on the live path. `httpGithubClient`/`ghCliClient` are kept; they have live fleet consumers (tax-agent, gtm-agent, creative-agent production loops).
- `buildByAxis`/`summariseRows` un-exported from the `/matrix` barrel: internal aggregation helpers behind `runAgentMatrix`, with no external consumers.

C) Investigated the 0-consumer yellow set: kept all 7
Each is a deliberate published-API affordance and/or internally consumed:
- `runAgentMatrix`: live external consumers (tax-agent, blueprint-agent import it from `/matrix`).
- `mcnemar`: the exact paired-binary significance test, paired with the live `mcnemarPower`/`mcnemarRequiredN`; has its own test.
- `exportRewardModel`: tier-3 RL reward-model export product surface (tested, referenced in docs).
- `summarizeWorkflowTrace`: heavily used internally, plus the `/workflow` public surface.
- `fromClaudeCodeSession`/`fromOpenCodeSession`: members of the documented harness-intake adapter family, plus an internal belief-state consumer.
- `parseAgentTrace`: documented `/contract/intake` adapter affordance.

Kept
The 6 deprecated judges (`createCustomJudge`/`createDomainExpertJudge`/`codeExecutionJudge`/`coherenceJudge`/`adversarialJudge`/`defaultJudges`) and the legacy `JudgeFn` type: their consumer migration onto `llmJudge` is the next pass.

Verification
- `pnpm run typecheck` clean
- `pnpm run lint` clean (no errors)
- `pnpm test`: 248 files, 2542 passed, 2 skipped, 0 failed
- `pnpm run build` clean; verified `llmJudge` is a function exported from the built `dist/index.js` + `dist/campaign/index.js`, and `score()` returns a canonical `JudgeScore` from the built artifact

Version
`0.98.0` → `0.99.0` (additive helper + removals = minor). npm + PyPI trio bumped together (`package.json`, `clients/python/pyproject.toml`, `__init__.py`).