Multithreading for KeY and now it's deterministic too#3842
Conversation
The idea of a worker pool is to have a fix/max amount of threads, and to dispatch work on them.
|
no that is just the starting idea. Depending on the amounts of threads you configure to use, you have that nr of threads and it is capped at the number of processor cores of your computer. Then there is a worker pool and a lifo queue from which a free worker is assigned the next goal. I'll clarify the description thanks.
not yet. The first milestone was to get to a point where we have for some problems a win and KeY is (at least as far as one can test) race free. The problem is, I did not have any race conditions even under high pressure. But the problem is that other OSes with other schedulers might discover some still existing problems, so if someone has time and can just run it once in a while to get feedback, would be appreciated. For trying out, I recommend the preview integration with the single core improvemnts. It solves some congestions here already. |
|
Could you provide the command to be executed? I could try to get a 64-core machine if we would have proofs having such number of goals. |
3141113 to
9d8d91c
Compare
|
Status: proof-search determinism (goal-order independence) The last commits (
Measured effect (variance benchmark, 10 reps):
Note for RAP: like any change to rule-selection order this changes the shape of found proofs (some grow slightly, some shrink — e.g. symmArray −12 %). |
|
Update after
|
|
Determinism: the last run-to-run flip is fixed ( Root cause of the final coin flip: the Run-to-run node-count spread, 10 repetitions (
Sequential agreement: 12 of 13 variance-set examples are proved node-identically by the sequential and the 1-worker parallel prover (SimplifiedLinkedList.remove Δ4 nodes, down from Δ≈4500 before the series). 14 consecutive 8-worker runs of SimplifiedLinkedList.remove produce the identical proof. Comparison with current
So main's sequential prover is already reproducible on this example; the PR carries that reproducibility over to the multi-core prover — the surface this PR adds — and closes several determinism defects that are latent on main as well (polarity-shared index-cache entries, incomplete cost tie-break, label-sensitive duplicate veto). Known footnotes: proof sizes shift against main (rule-selection order changed; SLL 58987 vs 55300 on a fresh JVM), and on this branch the first proof of a JVM can differ — deterministically — from later ones through a static cache in the cost-reuse layer; the queued port of the |
|
Correction to the numbers above: it is 13 of 13 — the reported 4-node difference on SimplifiedLinkedList.remove was an artifact of the variance benchmark's warm-up run (a parallel warm-up leaves state in a static cost-reuse cache that deterministically shifts later proofs in the same JVM; the sequential and 1-worker configurations sat at different positions in that sequence). Measured cleanly — one sequential and one 1-worker proof, full tree walk — the two proofs are identical at every node. The benchmark hardening (per-configuration JVM isolation) and the removal of the static caches (port of the |
|
Follow-ups landed: the |
|
The last MT wobble is gone — fresh names are now branch-state-derived (#3851, multi-core part). The per-worker name allocator ( Result: symmArray, the one remaining run-to-run wobble (±0.08 %), is now bit-stable over 15 consecutive 8-worker runs and closes at the sequential node count. Every example in the variance set is now fully deterministic at every worker count. Ships with a naming regression harness ( Saved-proof compatibility unaffected: the touched channels replay through the NameRecorder; the added-rule format is unchanged. Suites green incl. proveRules 203/203. |
Compute NodeInfo's active statement lazily, cache the loop-invariant instantiation on the rule app instead of a static field, and make the per-call program transformers stateless, so the proving path carries no shared mutable state before the parallel prover is introduced.
ConcurrentLruCache (exact-LRU, single-lock, drop-in for synchronizedMap over an LRUCache) for eviction-sensitive caches, StripedLruCache for pure caches, and WeakValueInterner for identity-preserving interning under concurrency, with concurrency tests.
Add the parallelProverEnabled / parallelProverThreadCount settings and a Profile.supportsParallelAutomode() capability: the standard Java profile opts in, the well-definedness, information-flow and symbolic-execution profiles keep the single-core fallback.
Add EssentialProofListener and Proof.suspendNonEssentialListeners(), which detaches every proof-tree and rule-app listener not tagged essential for the duration of a run and re-attaches them afterwards, so nothing unrelated to proving fires on a worker thread.
ParallelProver runs automatic proof search on a worker pool, one open goal per worker, behind the AutoProvers selection seam. A single GoalScheduler monitor hands out goals depth-first; Goal.apply is split into a concurrent compute phase and a commit phase serialized on one lock; fresh names come from a partitioned per-proof allocator (atomic Counter), namespace flushes are deferred, and the merge rule is disabled during parallel runs. The taclet-index cache key is immutable so siblings sharing the cache set cannot race on it. Includes the scheduler unit test and the proof-equivalence gate.
Harden the shared state the workers touch during search: route the matching and cost-path caches through the thread-safe cache utilities, intern parametric operators and elementary updates identity-preservingly, give the per-call instantiation caches and counters safe publication or per-worker copies, and guard the specification repository's proving-time maps. Behaviour-preserving.
Add a settings pane and a clickable status-line indicator to switch between the single-core and multi-core prover, a --threads CLI option, and gating of the single-core-only features: proof caching, slicing and the merge rule are disabled (with explanatory tooltips, updated live) while the multi-core prover is active, and restored when switching back.
4677232 to
6ab478c
Compare
Replace the hardcoded ApplyStrategy in the strategy-running proof macros and the script auto command with the AutoProvers seam, so macro and script automation honour the selected prover mode. The toolbar's Full Automation action (DefaultAutoMacro) and the proof-script auto command are included -- both are plain automode and must follow the mode setting. TryCloseMacro deliberately stays on the single-threaded prover: it closes one goal at a time under a tight per-goal step budget, where several workers exploring a single goal's subtree apply rules in a less step-efficient order that can exhaust the budget.
Add the parallel-prover speedup/variance benchmarks and the stress tests, with CI wiring: a generous MtStressTest step cap so benign parallel divergence cannot fail it, a corrected MtSpeedupBenchmark sequential baseline (it was secretly parallel), and MtVarianceBenchmark.
Pool the matching cursor per thread instead of behind a global lock, stop the multi-core workers before restoring proof listeners on cancel, and make the SpecificationRepository spec maps thread-safe for the parallel prover.
Pre-create the default execution context once and stop re-registering it, and fail fast on Java type registration during a parallel proof run instead of racing on lazy registration.
Make LexPathOrdering's caches, ModalityCache, the quantifier-heuristic single-slot caches and the block/loop-contract instantiation memo thread-safe/thread-local, split the shared term-index cache by position polarity and report taclet apps cache-independently on formula modification, so cache state cannot change proofs across worker schedules.
Branch labels are a display concern -- nothing in the automatic proof search reads them. The sole apparent reader, QueryExpandCost's isStepCaseBranch (which classified step-case subtrees by substring- matching the label text), was dead code and is removed, so no strategy code depends on label text. Yet labels were materialised eagerly during rule application: NodeInfo.setBranchLabel ran the #sv schema-variable substitution + ProofSaver.printAnything term pretty-printing, and the contract/observer rules pre-formatted terms into the string. That work sat on the proving path (needlessly lengthening it under the parallel prover) purely to build strings only the GUI/save ever read. Defer it: NodeInfo stores the raw label (or a Supplier, for labels that stringify terms) and resolves it lazily and memoized on the first getBranchLabel() -- i.e. on display/save. The three "Null reference (term = null)" sites in UseOperationContractRule / ObserverToUpdateRule capture the term reference and defer the toString via the new Supplier overload. Verified: taclet #sv, constant and lazy-Supplier labels all resolve to the same content, the Supplier is evaluated once (memoized), and the sampled proofs still close.
Keep the synthetic __Default__ type out of type enumerations so it is not lazily registered during matching under the parallel prover, with a regression test.
Break equal-cost ties by content -- structural position first, then rule, then instantiations and assumes instantiations, comparing container positions before the rule-app identity shortcut, so the rule-app queue is a consistent total order. Position-first mirrors the historical generation-order surfacing of equal-cost candidates and keeps proof sizes unchanged (rule-first ordering doubled the TimSort.binarySort case study by splitting too early). Derive trigger metavariable names from the quantified formula so they do not depend on creation order, and restore the global proof settings in the MT stress tests so a leaked setting cannot skew measurements. Adds a sequential-vs-one-worker determinism regression test.
Fingerprint the duplicate-application veto by shallow, label-free structure, so history-dependent origin labels cannot let a duplicate escape the veto nondeterministically.
Derive fresh names and temporary program-element names/labels from the goal-local branch state on every worker count, add a regression harness, pin saved-proof replay compatibility for added-rule names, and code the term-level proof-identity acceptance criterion (gated).
Give each RunAllProofs test fork its own statistics file, stop the proof search when an essential proof listener fails, default to the single-threaded prover with CI coverage for the parallel one, remove the non-thread-safe LRUCache class, and keep the MT RunAllProofs smoke tests out of the unit-test job.
Derive \obtain's fresh name from the current namespace, move the node serial counter onto Proof, number class-axiom taclets by content (new reusable ContentOrderNumbering), and deprecate Services.getCounter, so names and ids come from the goal-local proof state rather than a proof-global counter (#3851).
Always write RunAllProofs statistics per process (dropping the sysprop switch), and run the parallel-prover benchmarks and probes as named Gradle tasks with -P options instead of a pile of key.mt.* system properties.
Intended Change
This pull request adds a multi-core (parallel) prover to KeY: automatic proof search runs on a worker pool. An open goal is either not yet assigned to a worker or it has exactly one worker, proving independent sibling goals concurrently. There are at most as many workers as configured and the number of workers is capped at the number of processor cores. It is a port and modernisation of the 2018 HacKeYthon multithreading prototype, rebuilt on current
main.The legacy single-threaded prover stays the default and the safe fallback; the multi-core prover is opt-in per the configuration below.
Proof search is deterministic: the multi-core prover constructs the identical proof, term for term, on every run and at every worker count (see Deterministic proof search below).
Architecture (conceptual)
The guiding principle is isolation, not locking.
Goal.applyis split into a concurrent compute phase (rule selection + matching, the expensive part) and a commit phase (proof-tree mutation) serialized under a single lock. A single-monitorGoalSchedulerhands goals to workers depth-first. Non-essential listeners are detached for the run, fresh names come from a per-worker partition of the name allocator, shared matching/cost caches use thread-safe cache utilities, and rules not yet concurrency-safe (the merge rule) disable themselves during a multi-worker run. Only the standard Java profile runs in parallel.Full conceptual write-up: Multi-Core Proving (key-docs).
Deterministic proof search
A headline property of this PR: proof search is reproducible — also on the multi-core prover. A series of commits (
c3f619862d…b939a7ff58) eliminated every identified source of goal-scheduling-dependent rule selection: a polarity-sharing term-index cache, cache-dependent candidate reporting, an incomplete equal-cost tie-break, and a term-label-sensitive duplicate-application veto — several of them latent onmainas well.Run-to-run node-count spread over 10 repetitions (
MtVarianceBenchmark):Sequential and 1-worker proofs are term-identical — the same rule at the same position producing the same sequent, fresh names included — at every node, verified by a full tree walk over the example set (
ParallelProverDeterminismTest). Proofs are also worker-count-independent: e.g. SimplifiedLinkedList.remove constructs the identical 58987-node proof at 1, 4 and 8 workers, and its run-to-run node-count spread at 8 workers is zero over repeated runs. Two independent sources of scheduling dependence were removed to reach this: the cost-reuse hardening series (merged here) retired the last JVM-global caches that made proofs depend on what ran before them in the same process, and fresh names were moved off global counters (below).Fresh names are derived from the goal-local branch state, not from global counters (addresses #3851). Previously a per-worker allocator gave skolems worker-dependent tags and several
#-indexed program-variable and label names were drawn from proof-global counters advanced as a side effect of unrelated mints — so two otherwise identical proofs (across workers, or across a reload / prune-and-redo) could disagree on names. Every fresh name on the proving path now comes from a smallest-free-index search against the goal-local namespaces: a pure function of the branch, identical no matter which worker (or how many) runs the goal. This was the last input the equal-cost tie-break read that depended on scheduling; with it gone, symmArray — the one proof with a residual run-to-run wobble — is bit-stable. Saved-proof compatibility is preserved: recorded name proposals still win over regeneration, and the added-rule name format (which replay resolves by regeneration) is unchanged. A dedicated naming harness pins the invariants (branch-locality, prune-reset, added-rule locality,addProgramVariablesoundness, replay compatibility). This closes the naming brittleness that motivated #3851: Closes #3851. (Three global name counters remain, all off the automode proving path and none affecting determinism: theGenCntcounter used only by the\obtainproof-script command, the init-timeclassAxiomtaclet-name counter, and the Symbolic-Execution / information-flow term-label id counters. They can be migrated with the same pattern as a follow-up if desired.)Problems encountered, fixed and verified
Goal, outside the proof-level suspension) fired on worker threads and touched Swing off the event-dispatch thread. The suspension now covers them too, and the views refresh once from the final proof state after a run.Single-core-only features
Proof caching, proof slicing and the merge rule are switched off while the multi-core prover is active (GUI greys them out with a tooltip; the status line shows the mode) and restored when switching back. Loading a proof script and running macros use the multi-core prover when active and the profile supports it, with one deliberate exception:
TryClose(and theFull Autopilot that ends in it) closes goals one at a time under a tight per-goal step budget and is pinned to the single-core prover, a single goal offers no parallelism, and several workers exploring its subtree apply rules less step-efficiently and can exhaust the budget before the goal closes. Either way a run returns only once all workers stop, so pruning sees a quiescent proof. These restrictions will be lifted incrementally with the feature owners once the subset is confirmed safe. Because the default is single-core, these features work exactly as before.Configuration
SC/MT N×status-line indicator.
--threads N.-Dkey.prover.parallel[.threads]overrides.Benchmarks
Measured on a 16-core machine, median of three runs per proof (run-to-run wall-clock noise ≈2 %), each proof loaded in a clean settings state. seq (s) is the single-threaded wall-clock time; the 2× / 4× / 8× columns give the speedup at that worker count over it. The multi-core prover does not change the single-threaded path, and the deterministic proof search (above) is per-node roughly cost-neutral, a bounded, proof-dependent constant factor (0–8 %) that does not grow with proof size (verified to ~130 k nodes).
Real-world proofs — automode (speedup by worker count).
Wide, splitting proofs scale furthest —
ArrayList_concatenatereaches 3.86× at 8 workers,add3.67×,median3.30×. Moderately wide proofs plateau around 4 workers (~2–2.4×). Narrow proofs gain little and can regress at 8× (see the shape table). (quicksort/sortis omitted — it needs its proof script to close, so it is not apure-automode entry.)
Diagrams. Rendered speedup bar charts are in the dev doc's Measured speedup section.
Plan
Type of pull request
Ensuring quality
key.core+ touched-module unit tests pass.Separable,
main-relevant fixes (kept as standalone commits for cherry-picking, useful even single-core):ClassCastException(present onmain) that can prevent KeY from launching when loading certain settings files (Cherry-Picked on Fixes for minor regressions on main #3844)nullas the string"null", so reloading (Cherry-Picked on Fixes for minor regressions on main #3844) it failed to resolve a profile and showed an error instead of opening the fileAdditional information and contact(s)
This PR provides a clear opt-in benefit (parallel speedup on splitting proofs)
with a safe fallback to the unchanged single-core prover.
This PR has been done with AI tooling support.
The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.