potion-base-8M tokenizer.json: post_processor stores teacher ids (101/102 = '×'/'ß' in compacted vocab), polluting every embedding #335

### Summary

The published `minishlab/potion-base-8M` artifact has an internally inconsistent `tokenizer.json`: the vocabulary was compacted during distillation (`[CLS]` → id 2, `[SEP]` → id 3, vocab size 29,528), but the `TemplateProcessing` post-processor still stores the **teacher's** original special-token ids:

```json
"special_tokens": {
  "[CLS]": {"id": "[CLS]", "ids": [101], "tokens": ["[CLS]"]},
  "[SEP]": {"id": "[SEP]", "ids": [102], "tokens": ["[SEP]"]}
}
```

In the compacted vocab, id **101 is the token `×`** and id **102 is `ß`**. Since `tokenizers` inserts the stored ids verbatim at encode time, every embedding computed through this tokenizer mean-pools the rows of `×` and `ß` in place of the intended `[CLS]`/`[SEP]` rows.

### Reproduction

```python
import json, urllib.request

url = "https://huggingface.co/minishlab/potion-base-8M/resolve/main/tokenizer.json"
with urllib.request.urlopen(url) as resp:  # close the HTTP response once parsed
    tk = json.load(resp)
vocab = tk["model"]["vocab"]
pp = tk["post_processor"]["special_tokens"]

print("vocab        [CLS]:", vocab["[CLS]"], "  [SEP]:", vocab["[SEP]"])   # 2, 3
print("post_processor ids [CLS]:", pp["[CLS]"]["ids"], " [SEP]:", pp["[SEP]"]["ids"])  # [101], [102]
inv = {v: k for k, v in vocab.items()}
print("token at id 101:", inv[101], "  token at id 102:", inv[102])       # ×, ß
```

And the encode-time consequence:

```python
from tokenizers import Tokenizer
t = Tokenizer.from_pretrained("minishlab/potion-base-8M")
print(t.encode("hello world").ids)   # ids include 101 and 102, i.e. the rows for '×' and 'ß'
```

### Measured impact

Mean-pooling with the template's stored ids (rows 101/102) versus the by-name rows (2/3) yields two vectors with **cosine ≈ 0.843** for a simple input like `"Hello world"`. The pooled specials are therefore not a rounding error; they materially shift every embedding. (Found while building infrastructure that consumes model2vec artifacts and cross-checks tokenizer internal consistency; happy to share more details.)
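
For readers who want to see the shape of that comparison, here is a minimal sketch using only the standard library. It is not the measurement behind the 0.843 figure: the embedding rows are random stand-ins rather than the model's weights, the content-token ids `150`/`151` are placeholders, and `mean_pool` and `cosine` are hypothetical helper names. It only illustrates how swapping the two special-token rows changes a mean-pooled vector.

```python
import math
import random


def mean_pool(rows, ids):
    """Average the embedding rows selected by `ids`."""
    dim = len(rows[0])
    return [sum(rows[i][d] for i in ids) / len(ids) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: random rows standing in for the embedding matrix (NOT the real weights).
rng = random.Random(0)
rows = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]

content_ids = [150, 151]                 # placeholder ids for the text's own tokens
as_stored = [101] + content_ids + [102]  # ids the template inserts (teacher ids)
by_name = [2] + content_ids + [3]        # ids [CLS]/[SEP] have in the compacted vocab

similarity = cosine(mean_pool(rows, as_stored), mean_pool(rows, by_name))
print(f"cosine between the two poolings: {similarity:.3f}")
```

With the real weights, `rows` would be the model's embedding matrix and `content_ids` the ids produced for the input text; any token weighting or normalization the model applies is left out of this sketch.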

### Notes / suggested fix

- Fresh distillations with model2vec **0.8.2** (e.g. from `BAAI/bge-base-en-v1.5`) produced consistent special-token ids in our testing, so the exporter may already be fixed in current versions. If so, this issue is primarily about the **published potion-base-8M artifact** and possibly other older compacted-vocab uploads (`M2V_base_output` may be worth checking).
- Suggested remediation: regenerate/rewrite `post_processor.special_tokens[*].ids` to the compacted vocab's ids when exporting (or strip the post-processor if specials aren't intended to be pooled), and consider re-uploading corrected `tokenizer.json` files for affected published models. A consistency assertion (`post_processor` ids == `vocab[name]`) in the exporter or in skeletoken's validators would prevent recurrence.
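
The consistency assertion and the id rewrite suggested above could look roughly like the sketch below. It assumes the layout seen in this artifact: a dict-style `model.vocab` and a top-level `TemplateProcessing` post-processor with a `special_tokens` map. The function names `find_mismatched_specials` and `realign_specials` are hypothetical and are not part of model2vec or skeletoken, and the toy `[PAD]`/`[UNK]` ids are made up for illustration.

```python
import copy


def find_mismatched_specials(tk):
    """Return {name: (stored_ids, expected_ids)} for template entries that disagree with the vocab."""
    vocab = tk["model"]["vocab"]
    specials = tk["post_processor"]["special_tokens"]
    mismatches = {}
    for name, entry in specials.items():
        expected = [vocab[token] for token in entry["tokens"]]
        if entry["ids"] != expected:
            mismatches[name] = (entry["ids"], expected)
    return mismatches


def realign_specials(tk):
    """Return a copy of `tk` whose template ids are rewritten from the vocab."""
    fixed = copy.deepcopy(tk)
    vocab = fixed["model"]["vocab"]
    for entry in fixed["post_processor"]["special_tokens"].values():
        entry["ids"] = [vocab[token] for token in entry["tokens"]]
    return fixed


# Toy tokenizer.json fragment mirroring the mismatch reported in this issue.
toy = {
    "model": {"vocab": {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}},
    "post_processor": {
        "type": "TemplateProcessing",
        "special_tokens": {
            "[CLS]": {"id": "[CLS]", "ids": [101], "tokens": ["[CLS]"]},
            "[SEP]": {"id": "[SEP]", "ids": [102], "tokens": ["[SEP]"]},
        },
    },
}

print(find_mismatched_specials(toy))                    # both specials reported
print(find_mismatched_specials(realign_specials(toy)))  # empty after realignment
```

Run against the dict loaded in the reproduction script, `find_mismatched_specials(tk)` would be expected to report both `[CLS]` and `[SEP]`. An exporter could raise if the result is non-empty; a post-processor nested in another type would need extra handling that this sketch does not attempt.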

Thanks for model2vec, skeletoken, and the potion models — this ecosystem is excellent to build on.

