feat(campaign): llmJudge single-call bridge + prune dead exports #280

Merged
drewstone merged 1 commit into main from chore/llm-judge-bridge-cleanup on Jun 23, 2026

@drewstone

What

Builds the llmJudge(name, prompt, opts?) helper that the JudgeConfig doc-comment (src/campaign/types.ts) promised but that never existed, and prunes a set of confirmed 0-consumer dead/duplicate exports.

A) llmJudge — the single-LLM-call judge bridge

llmJudge(name, prompt, opts?) returns a canonical JudgeConfig whose score():

  1. makes one call against prompt through an injected ChatClient (the substrate's transport-agnostic LLM seam — router / sandbox / cli-bridge / mock),
  2. parses the model's per-dimension scores, validating every declared dimension is present and numeric,
  3. normalizes by scale (unit [0,1] default, or ten [0,10] → [0,1]) and clamps to [0,1],
  4. composites via weightedComposite (the same sum-normalized reducer ensembleJudge uses),
  5. returns the canonical JudgeScore { dimensions, composite ∈ [0,1], notes }.
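
Steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the shipped implementation: `Scale`, `JudgeScoreSketch`, `normalizeScore`, `weightedCompositeSketch`, and `toJudgeScore` are hypothetical names, and defaulting a missing weight to 1 is an assumption about how the sum-normalized reducer behaves.

```typescript
// Hypothetical shapes for illustration; the real types live in src/campaign/types.ts.
type Scale = "unit" | "ten";

interface JudgeScoreSketch {
  dimensions: Record<string, number>;
  composite: number;
  notes: string;
}

// Step 3: map a raw score onto [0,1] by scale, then clamp.
function normalizeScore(raw: number, scale: Scale): number {
  const unit = scale === "ten" ? raw / 10 : raw;
  return Math.min(1, Math.max(0, unit));
}

// Step 4: sum-normalized weighted mean. Defaulting a missing weight to 1
// is an assumption of this sketch, not confirmed behavior of weightedComposite.
function weightedCompositeSketch(
  dimensions: Record<string, number>,
  weights: Record<string, number> = {},
): number {
  let total = 0;
  let weightSum = 0;
  for (const [name, value] of Object.entries(dimensions)) {
    const weight = weights[name] ?? 1;
    total += weight * value;
    weightSum += weight;
  }
  return weightSum > 0 ? total / weightSum : 0;
}

// Step 5: assemble the canonical result shape.
function toJudgeScore(
  raw: Record<string, number>,
  scale: Scale,
  weights: Record<string, number> = {},
  notes = "",
): JudgeScoreSketch {
  const dimensions: Record<string, number> = {};
  for (const [name, value] of Object.entries(raw)) {
    dimensions[name] = normalizeScore(value, scale);
  }
  return { dimensions, composite: weightedCompositeSketch(dimensions, weights), notes };
}
```

With `scale: "ten"`, a raw 8 becomes 0.8 and an out-of-range raw 12 clamps to 1, so the composite can never leave [0,1] as long as the weights are non-negative.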

Fail-loud: an unparseable response or a missing/non-numeric dimension throws JudgeParseError, so the campaign engine records a failed cell instead of a silent zero. Exported from the root barrel and /campaign. New test (tests/llm-judge.test.ts, 12 cases) proves a well-formed canonical JudgeScore, scale normalization, weights, fenced-JSON parsing, single-call behavior, and the fail-loud paths.
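
The fail-loud parse path might look roughly like the sketch below. Only the name `JudgeParseError` comes from this PR; its constructor shape, the helpers `extractJson` and `parseDimensions`, and the exact error messages are assumptions made for illustration.

```typescript
// Hypothetical error class; only the name is taken from the PR.
class JudgeParseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "JudgeParseError";
  }
}

// Built at runtime so this example can itself sit inside a fenced block.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Return the body of the first fenced block if present, else the whole text.
function extractJson(text: string): string {
  const fenced = text.match(FENCED_JSON);
  return (fenced?.[1] ?? text).trim();
}

// Validate that every declared dimension is present and numeric.
// Any failure throws instead of falling back to a zero score.
function parseDimensions(text: string, declared: string[]): Record<string, number> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(extractJson(text));
  } catch {
    throw new JudgeParseError("response is not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new JudgeParseError("response is not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const out: Record<string, number> = {};
  for (const name of declared) {
    const value = record[name];
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new JudgeParseError(`dimension "${name}" is missing or non-numeric`);
    }
    out[name] = value;
  }
  return out;
}
```

Throwing here is what lets the caller distinguish "the model scored this 0" from "the model's answer could not be read".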

This is the landing pad the 6 deprecated judges migrate to next (separate pass).

B) Pruned confirmed 0-consumer dead exports

Each re-verified with a fleet-wide grep (excluding node_modules + agent-eval's own checkouts):

  • adversarialScenarioSearch removed — fuzzAgent is the live twin. AdversarialMutation is kept (the fuzz harness consumes it).
  • proposeAutomatedPullRequest removed — superseded by campaign openAutoPr. Its input validation (path-traversal / duplicate-path / branch==base / empty-title guards) is folded into each AutoPrClient.proposeChange so the safety invariant stays on the live path. httpGithubClient / ghCliClient are kept — they have live fleet consumers (tax-agent, gtm-agent, creative-agent production loops).
  • buildByAxis / summariseRows un-exported from the /matrix barrel — internal aggregation helpers behind runAgentMatrix, with no external consumers.
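
For reference, the four guards folded into `proposeChange` (second bullet above) could be sketched like this. The input shape `ProposedChangeSketch`, the function name `validateProposedChange`, and the error messages are hypothetical; the PR names the guards but not the fields or the exact checks.

```typescript
// Hypothetical input shape; the PR lists the guards, not the field names.
interface ProposedChangeSketch {
  title: string;
  branch: string;
  base: string;
  files: { path: string; content: string }[];
}

// Throws on the first violated invariant; returns normally when the change is safe.
function validateProposedChange(change: ProposedChangeSketch): void {
  if (change.title.trim() === "") {
    throw new Error("title must not be empty");
  }
  if (change.branch === change.base) {
    throw new Error("branch must differ from base");
  }
  const seen = new Set<string>();
  for (const { path } of change.files) {
    const segments = path.split(/[\\/]/);
    // Reject absolute paths, drive-letter paths, and any ".." segment.
    const isAbsolute = segments[0] === "" || /^[A-Za-z]:/.test(path);
    if (isAbsolute || segments.includes("..")) {
      throw new Error(`unsafe path: ${path}`);
    }
    if (seen.has(path)) {
      throw new Error(`duplicate path: ${path}`);
    }
    seen.add(path);
  }
}
```

Running the same validator at the top of every client's `proposeChange` keeps the invariant on the live path regardless of which transport is used.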

C) Investigated the 0-consumer yellow set — kept all 7

Each is a deliberate published-API affordance and/or internally consumed:

  • runAgentMatrix — live external consumers (tax-agent, blueprint-agent import it from /matrix).
  • mcnemar — the exact paired-binary significance test, paired with the live mcnemarPower/mcnemarRequiredN, has its own test.
  • exportRewardModel — tier-3 RL reward-model export product surface (tested, referenced in docs).
  • summarizeWorkflowTrace — heavily used internally + the /workflow public surface.
  • fromClaudeCodeSession / fromOpenCodeSession — members of the documented harness-intake adapter family + internal belief-state consumer.
  • parseAgentTrace — documented /contract/intake adapter affordance.
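
As background on why `mcnemar` is worth keeping: the exact McNemar test compares two paired binary outcomes using only the discordant pairs. The sketch below shows the standard two-sided exact (binomial) formulation; `mcnemarExactSketch` is a hypothetical name, and the repo's `mcnemar` may differ in signature, return shape, or variant.

```typescript
// Two-sided exact McNemar p-value from the discordant counts:
//   b = pairs where only A succeeded, c = pairs where only B succeeded.
// Under the null, min(b, c) follows Binomial(b + c, 0.5).
function mcnemarExactSketch(b: number, c: number): number {
  const n = b + c;
  if (n === 0) return 1; // no discordant pairs, no evidence of a difference
  const k = Math.min(b, c);
  // Accumulate P(X <= k) by building C(n, i) * 0.5^n term by term.
  let term = Math.pow(0.5, n); // i = 0
  let tail = term;
  for (let i = 1; i <= k; i++) {
    term *= (n - i + 1) / i;
    tail += term;
  }
  // Double the one-sided tail and cap at 1.
  return Math.min(1, 2 * tail);
}
```

This direct summation underflows for very large discordant counts (beyond roughly a thousand pairs), where a log-space or normal approximation would be needed instead.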

Kept

The 6 deprecated judges (createCustomJudge/createDomainExpertJudge/codeExecutionJudge/coherenceJudge/adversarialJudge/defaultJudges) and the legacy JudgeFn type — their consumer migration onto llmJudge is the next pass.

Verification

  • pnpm run typecheck clean
  • pnpm run lint clean (no errors)
  • pnpm test — 248 files, 2542 passed, 2 skipped, 0 failed
  • pnpm run build clean; verified llmJudge is a function exported from the built dist/index.js + dist/campaign/index.js, and score() returns a canonical JudgeScore from the built artifact

Version

0.98.0 → 0.99.0 (additive helper + removals = minor). npm + PyPI trio bumped together (package.json, clients/python/pyproject.toml, __init__.py).

Build the `llmJudge(name, prompt, opts?)` helper the `JudgeConfig`
doc-comment (src/campaign/types.ts) promised but never shipped. It
returns a canonical `JudgeConfig` whose `score()` makes ONE injected
`ChatClient` call against `prompt`, parses the model's per-dimension
scores, normalizes by scale, composites via `weightedComposite` (the
same reducer `ensembleJudge` uses), and returns the canonical
`JudgeScore` ({ dimensions, composite in [0,1], notes }). Fail-loud:
an unparseable response or a missing/non-numeric dimension throws
`JudgeParseError` so the engine records a failed cell, never a silent
zero. Exported from the root barrel and `/campaign`.

Prune confirmed 0-consumer dead/duplicate exports:
- Remove `adversarialScenarioSearch` (fuzzAgent is the live twin);
  keep `AdversarialMutation`, still consumed by the fuzz harness.
- Remove `proposeAutomatedPullRequest` (superseded by campaign
  `openAutoPr`); fold its input validation into each `AutoPrClient`
  so the path-traversal / duplicate-path / branch==base guards stay
  on the live `proposeChange` path. `httpGithubClient`/`ghCliClient`
  keep their fleet consumers.
- Un-export `buildByAxis`/`summariseRows` from the `/matrix` barrel
  (internal aggregation helpers behind `runAgentMatrix`).

Kept the deprecated judges (createCustomJudge/createDomainExpertJudge/
codeExecutionJudge/coherenceJudge/adversarialJudge/defaultJudges) and
the legacy JudgeFn type — their migration onto llmJudge is a separate
pass.

Version trio 0.98.0 -> 0.99.0 (additive helper + removals).

@tangletools tangletools left a comment

✅ Auto-approved PR — 98c9d027

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T08:05:26Z

@drewstone merged commit 2312feb into main on Jun 23, 2026
1 check passed