Skip to content

Commit 740bfec

Browse files
rijnbclaude
andcommitted
docs: design spec for mapcodelib speed optimization
Two-phase plan (A: local hot-path cleanups; B: precomputed companion tables) targeting 20-50% wall-time reduction on `time ./unittest` at -O3, preserving bit-exact output, strict portable C99/C11, and no runtime/heap growth. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent ea028ff commit 740bfec

1 file changed

Lines changed: 194 additions & 0 deletions

File tree

Lines changed: 194 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,194 @@
1+
# mapcodelib speed optimization — design spec
2+
3+
- **Date:** 2026-05-28
4+
- **Branch:** `feat/optimize` (off `fix/bugfix`)
5+
- **Author:** Rijn Buve (with Claude)
6+
7+
## 1. Scope & goals
8+
9+
Achieve a substantial speedup on the `mapcoder.c` encode and decode hot paths, measured by `time ./unittest` in `test/` (real wall-clock against the existing unit test suite, which already loops over millions of points).
10+
11+
**Target:** 20–50% wall-time reduction at `-O3`.
12+
13+
### Hard guardrails
14+
15+
- **Bit-exact output.** All existing unit tests must pass unchanged. No test files modified.
16+
- **Strict portable C99/C11.** No `__builtin_*`, no SIMD intrinsics, no compiler-specific attributes.
17+
- **No runtime / heap growth.** Per-call stack and heap usage must not increase. Public ABI (struct sizes in `mapcoder.h`) unchanged.
18+
- **Small static-table growth OK.** Precomputed companion tables in the order of tens of KB are acceptable (negligible against existing data tables in `internal_data.h`).
19+
20+
### Out of scope
21+
22+
- Public API changes.
23+
- Refactoring unrelated to performance.
24+
- New features.
25+
- Build-system changes (CMake, scripts) beyond what's strictly needed to compile.
26+
- Approach C (loop restructuring, DFA decode parser) — deferred to a future branch if A+B underdeliver.
27+
28+
## 2. Measurement methodology
29+
30+
1. Build with `-O3` exactly as `test/run_normal.sh` does:
31+
```
32+
cd mapcodelib && gcc -O3 -c mapcoder.c
33+
cd ../test && gcc -O3 unittest.c -lm -lpthread -o unittest ../mapcodelib/mapcoder.o
34+
```
35+
2. Run `time ./unittest` three times back-to-back; record the best `user` time.
36+
3. Repeat after each commit that touches `mapcoder.c`.
37+
38+
### Baseline
39+
40+
Capture the baseline on the **first commit** of `feat/optimize` (before any code change) so we have a clean reference number.
41+
42+
### Per-commit reporting
43+
44+
Each optimization commit's message includes:
45+
46+
- `time` output before vs. after (best of 3).
47+
- Cumulative speedup vs. the baseline.
48+
- One-line description of the change.
49+
50+
### Test integrity
51+
52+
Every commit must end with `unittest` printing `Unit tests passed` and zero errors. Any commit that regresses tests is rolled back before continuing.
53+
54+
### Noise caveat
55+
56+
The unit test does correctness work (asserts, sprintf, comparisons) in addition to encode/decode, so a 30% speedup of encode/decode core will appear as a smaller wall-time delta (perhaps 10–20%). That is expected; "significant" is judged on the total wall-clock number since that is the measurement method chosen.
57+
58+
## 3. Approach A — Local hot-path cleanups
59+
60+
Discrete, low-risk edits in `mapcoder.c`. Each is independently revertable and individually testable.
61+
62+
### A1. Cache `flags` per iteration in encode/decode loops
63+
64+
In `encoderEngine` (`mapcoder.c:1460`) and `decoderEngine` (`mapcoder.c:2651`), the loop body invokes `IS_NAMELESS(i)`, `IS_RESTRICTED(i)`, `IS_SPECIAL_SHAPE(i)`, `REC_TYPE(i)`, `coDex(i)`, `SMART_DIV(i)` on the same `i`. Each macro re-reads `TERRITORY_BOUNDARIES[i].flags`.
65+
66+
Replace with a single `const int flags = TERRITORY_BOUNDARIES[i].flags;` at the top of the loop body, plus local extracted variables where used more than once.
67+
68+
Same pattern in `firstNamelessRecord` (`mapcoder.c:822`) and `countNamelessRecords` (`mapcoder.c:835`).
69+
70+
### A2. Reorder checks in `fitsInsideBoundaries`
71+
72+
`mapcoder.c:602` checks lat-min, lat-max, lon-range. For mapcode territory boundaries (most are narrow longitudinally relative to the full earth lon range), test longitude band first to maximise early-out. Same logic, same correctness, fewer comparisons on average.
73+
74+
### A3. Length-tracked result assembly in `encoderEngine`
75+
76+
Around `mapcoder.c:1520` the per-result code uses `strcpy` + `strcat` patterns. `strcat` re-scans the destination from the start each time. Replace with `memcpy` using known lengths.
77+
78+
### A4. Speed up `encodeBase31`
79+
80+
`mapcoder.c:1083` emits a base-31 representation by repeated `%31` / `/31`. The compiler turns `/31` into a magic-multiply, but the loop is small. A tight unroll (or const lookup for the common small `nrchars` cases) can help.
81+
82+
### A5. Early-exit in `repackIfAllDigits` / `unpackIfAllDigits`
83+
84+
`mapcoder.c:892`, `:933`. If the input clearly fails the "all digits" precondition (e.g. first non-digit position), bail before any string mutation. The existing code already does some of this; tighten the check.
85+
86+
### Risk per change
87+
88+
Very low; each one is mechanically equivalent. Run full unit tests after each commit.
89+
90+
### Expected gain from A
91+
92+
~10–20% wall-time.
93+
94+
## 4. Approach B — Precomputed companion tables
95+
96+
Core insight: `encoderEngine`/`decoderEngine` loop over a range of records and re-derive metadata (`codex`, `recType`, `isNameless`, `isSpecialShape`, `smartDiv`) from `flags` on every iteration. Approach A caches `flags` per-iteration, but across many iterations we still recompute the same masks/shifts on the same record data. Precompute that metadata into a compact companion table once at library init, and have the hot loops read directly from it.
97+
98+
### B1. Companion arrays sized to `TERRITORY_BOUNDARIES`
99+
100+
Add to `mapcoder.c` at file scope (not exported):
101+
102+
```c
103+
// One byte per record. Initialized once via initCompanionTables().
104+
static unsigned char RECORD_CODEX[MAPCODE_BOUNDARY_MAX + 1]; // flags & 31
105+
static unsigned char RECORD_REC_TYPE[MAPCODE_BOUNDARY_MAX + 1]; // (flags >> 7) & 3
106+
static unsigned char RECORD_KIND[MAPCODE_BOUNDARY_MAX + 1]; // bit0=nameless, bit1=restricted,
107+
// bit2=specialShape, bit3=hasHeaderLetter
108+
static unsigned char RECORD_HEADER_LETTER[MAPCODE_BOUNDARY_MAX + 1]; // ENCODE_CHARS[(flags >> 11) & 31]
109+
static unsigned short RECORD_SMART_DIV[MAPCODE_BOUNDARY_MAX + 1]; // flags >> 16
110+
```
111+
112+
Per-record overhead: 6 bytes. With ~12,000 records, total ≈ 72 KB of static `.bss`/`.data`. Well within "small additions OK".
113+
114+
### B2. Per-territory derived metadata
115+
116+
```c
117+
static int TERRITORY_FIRST_NAMELESS[_TERRITORY_MAX + 1]; // -1 if none
118+
static int TERRITORY_NAMELESS_COUNT[_TERRITORY_MAX + 1]; // 0 if none
119+
```
120+
121+
Eliminates linear scans in `firstNamelessRecord` (`mapcoder.c:822`) and `countNamelessRecords` (`mapcoder.c:835`), turning O(records-per-territory) into O(1). With ~241 territories, ~2 KB extra.
122+
123+
### B3. One-time initialization
124+
125+
A `static int companion_initialized = 0;` flag and an `initCompanionTables()` function called at the top of every public entry point that uses the data (`encodeLatLonToMapcodes_internal`, `decoderEngine`, etc.). The init computes everything from `TERRITORY_BOUNDARIES` and the flag-extraction macros — guaranteeing the precomputed values are *definitionally* the same as the macros.
126+
127+
**Thread safety.** The public API is not specified as thread-safe for first call; the existing code uses pthread only in tests. Init is idempotent (writes deterministic values), so even under a first-call race the worst case is duplicate work — never wrong values. This assumption is documented inline.
128+
129+
### B4. Replace macro call sites in hot loops only
130+
131+
`IS_NAMELESS(m)` and the other macros stay defined (used in cold paths, debug asserts, and the init function itself). In `encoderEngine`/`decoderEngine` inner loops, switch to direct table reads:
132+
133+
```c
134+
const unsigned char kind = RECORD_KIND[i];
135+
const int codex = RECORD_CODEX[i];
136+
// ...
137+
```
138+
139+
This decouples hot-loop performance from the bit-pack layout of `flags`.
140+
141+
### B5. Debug-build sanity check
142+
143+
In `#ifdef DEBUG` builds, `initCompanionTables` also asserts that the precomputed values match the macro-derived values for every record. Free correctness guarantee during development; zero cost in `-O3` release.
144+
145+
### Risk
146+
147+
Medium-low. Companion tables are derived directly from the same macros they replace, so by construction they hold identical values. The debug assertions catch any drift if someone later modifies the flag layout.
148+
149+
### Interaction with A1
150+
151+
A1 caches `flags` locally in the very same hot loops that B4 later switches to direct companion-table reads. After B4 lands, the local `flags` cache in those specific loop bodies becomes unused and must be removed as part of that commit. A1 remains useful in any cold-path call site that does not switch to companion tables.
152+
153+
### Expected combined gain from A+B
154+
155+
~20–35% wall-time.
156+
157+
## 5. Workflow
158+
159+
One commit per checkpoint, linear progression:
160+
161+
| # | Commit | Purpose |
162+
|---|--------|---------|
163+
| 1 | `chore: branch baseline` | Pin the baseline. Run `time ./unittest` 3× and record best `user` time in commit message. |
164+
| 2 | `perf: A1 — cache flags per loop iteration` | Smallest, safest first. Re-run timing. |
165+
| 3 | `perf: A2 — reorder fitsInsideBoundaries checks` | |
166+
| 4 | `perf: A3 — length-tracked result assembly` | |
167+
| 5 | `perf: A4 — tighten encodeBase31` | |
168+
| 6 | `perf: A5 — early-exit in repack/unpack` | |
169+
| 7 | `perf: B1+B2+B3 — companion tables init` | Tables defined and populated, but not yet read in hot path. Verifies init correctness in isolation. |
170+
| 8 | `perf: B4 — hot loops read from companion tables` | The main payoff commit. |
171+
| 9 | `perf: B5 — debug-build sanity check` | Optional polish. |
172+
173+
Each commit ends with:
174+
175+
- Full unit tests green (`Unit tests passed`, 0 errors).
176+
- `time ./unittest` (best of 3) noted in the commit body, plus cumulative speedup vs. commit #1.
177+
178+
## 6. Risks & mitigations
179+
180+
| Risk | Mitigation |
181+
|------|------------|
182+
| A change breaks bit-exact behaviour | Run full test suite after every commit; revert if red. |
183+
| Companion table values drift from `flags` semantics | B5 debug-build assertion verifies equivalence at init. |
184+
| Thread-safety regression on first call | `initCompanionTables` is idempotent (deterministic writes). Documented. Safe under existing usage patterns. |
185+
| Gains too small to justify | Stop after A and reassess; A alone should be ≥10% and is the lowest-risk subset. |
186+
| Test wall-time too noisy to read signal | Use best-of-3 `user` time (not `real`), build with `-O3`, run on a quiet machine. |
187+
188+
## 7. Success criteria
189+
190+
- **Mandatory:** all unit tests pass after every commit.
191+
- **Mandatory:** binary size growth ≤ 100 KB (companion tables are ~74 KB; allow margin).
192+
- **Target:** ≥20% wall-time reduction on `time ./unittest` at `-O3` after commit #8.
193+
- **Stretch:** ≥30% wall-time reduction.
194+
- **Stop condition:** if Approach A alone delivers ≥30%, B is optional. If A+B underdeliver vs. 20%, re-evaluate before any further restructuring.

0 commit comments

Comments
 (0)