Summary
The published minishlab/potion-base-8M artifact has an internally inconsistent tokenizer.json: the vocabulary was compacted during distillation ([CLS] → id 2, [SEP] → id 3, vocab size 29,528), but the TemplateProcessing post-processor still stores the teacher's original special-token ids:
"special_tokens": {
"[CLS]": {"id": "[CLS]", "ids": [101], "tokens": ["[CLS]"]},
"[SEP]": {"id": "[SEP]", "ids": [102], "tokens": ["[SEP]"]}
}
In the compacted vocab, id 101 is the token × and id 102 is ß. Since tokenizers inserts the stored ids verbatim at encode time, every encoding of potion-base-8M mean-pools the embedding rows of × and ß where [CLS]/[SEP] rows were intended.
Reproduction
import json, urllib.request
url = "https://huggingface.co/minishlab/potion-base-8M/resolve/main/tokenizer.json"
tk = json.load(urllib.request.urlopen(url))
vocab = tk["model"]["vocab"]
pp = tk["post_processor"]["special_tokens"]
print("vocab [CLS]:", vocab["[CLS]"], " [SEP]:", vocab["[SEP]"]) # 2, 3
print("post_processor ids [CLS]:", pp["[CLS]"]["ids"], " [SEP]:", pp["[SEP]"]["ids"]) # [101], [102]
inv = {v: k for k, v in vocab.items()}
print("token at id 101:", inv[101], " token at id 102:", inv[102]) # ×, ß
And the encode-time consequence:
from tokenizers import Tokenizer
t = Tokenizer.from_pretrained("minishlab/potion-base-8M")
print(t.encode("hello world").ids) # ids include 101 and 102 — the rows for '×' and 'ß'
Measured impact
Mean-pooling "Hello world" with the template's stored ids (rows 101/102) instead of the by-name rows (2/3) gives a cosine similarity of ≈ 0.843 between the two pooled vectors, even for such a simple input. The mis-mapped specials are therefore not a rounding error; they materially shift every embedding. (Found while building infrastructure that consumes model2vec artifacts and cross-checks tokenizer internal consistency; happy to share more details.)
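For anyone who wants to repeat this kind of comparison, here is a minimal sketch of the measurement, assuming a plain unweighted mean over token rows. The helper names (`mean_pool`, `cosine`, `compare_special_ids`), the random embedding matrix, and the content ids 200 and 201 are all hypothetical stand-ins, so this will not reproduce the 0.843 figure; substituting the model's real embedding matrix and the real content ids would be needed for that.

```python
import math
import random


def mean_pool(rows, ids):
    """Average the embedding rows selected by ids (unweighted mean)."""
    dim = len(rows[0])
    return [sum(rows[i][d] for i in ids) / len(ids) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_special_ids(rows, content_ids, stored=(101, 102), by_name=(2, 3)):
    """Cosine between pooling with the stored special ids and the by-name ids."""
    with_stored = mean_pool(rows, [stored[0], *content_ids, stored[1]])
    with_named = mean_pool(rows, [by_name[0], *content_ids, by_name[1]])
    return cosine(with_stored, with_named)


# Hypothetical toy matrix: 300 rows of dimension 8, NOT the real model weights.
rng = random.Random(0)
toy_rows = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(300)]

# 200 and 201 stand in for the content-token ids of a two-word input.
similarity = compare_special_ids(toy_rows, content_ids=[200, 201])
print(f"cosine(stored-id pooling, by-name pooling) = {similarity:.3f}")
```

With short inputs the two special rows make up a large share of the pooled tokens, which is presumably why the effect is so visible on a two-word sentence.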
Notes / suggested fix
- Fresh distillations with model2vec 0.8.2 (e.g. from BAAI/bge-base-en-v1.5) produced consistent special-token ids in our testing, so the exporter may already be fixed in current versions. If so, this is primarily about the published potion-base-8M artifact and possibly other older compacted-vocab uploads (M2V_base_output, for example, may be worth checking).
- Suggested remediation: regenerate/rewrite post_processor.special_tokens[*].ids to the compacted vocab's ids when exporting (or strip the post-processor if specials aren't intended to be pooled), and consider re-uploading corrected tokenizer.json files for affected published models. A consistency assertion (post_processor ids == vocab[name]) in the exporter or in skeletoken's validators would prevent recurrence.
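The consistency check and the id rewrite suggested above could look roughly like the sketch below. It assumes the same layout the reproduction script reads (a top-level TemplateProcessing post-processor with a `special_tokens` map and a `model.vocab` dict); the function names are hypothetical and are not part of model2vec or skeletoken, and the `toy` dict is an illustrative stand-in for tokenizer.json rather than the real file.

```python
import copy


def find_special_token_mismatches(tk: dict) -> dict:
    """Return {name: (stored_ids, vocab_ids)} for each inconsistent special token."""
    vocab = tk["model"]["vocab"]
    specials = tk["post_processor"]["special_tokens"]
    mismatches = {}
    for name, entry in specials.items():
        # Look each token up by name; None marks a token missing from the vocab.
        expected = [vocab.get(tok) for tok in entry["tokens"]]
        if entry["ids"] != expected:
            mismatches[name] = (entry["ids"], expected)
    return mismatches


def rewrite_special_token_ids(tk: dict) -> dict:
    """Return a copy whose post-processor ids are taken from the vocab by name."""
    fixed = copy.deepcopy(tk)
    vocab = fixed["model"]["vocab"]
    for entry in fixed["post_processor"]["special_tokens"].values():
        entry["ids"] = [vocab[tok] for tok in entry["tokens"]]
    return fixed


# Illustrative stand-in for tokenizer.json with the same kind of mismatch.
toy = {
    "model": {"vocab": {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}},
    "post_processor": {
        "special_tokens": {
            "[CLS]": {"id": "[CLS]", "ids": [101], "tokens": ["[CLS]"]},
            "[SEP]": {"id": "[SEP]", "ids": [102], "tokens": ["[SEP]"]},
        }
    },
}

print(find_special_token_mismatches(toy))
print(find_special_token_mismatches(rewrite_special_token_ids(toy)))
```

Running the check as an assertion at export time (an empty result means consistent) would catch this class of bug before upload. Tokenizers that wrap their post-processor in a sequence would need the lookup adapted.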
Thanks for model2vec, skeletoken, and the potion models — this ecosystem is excellent to build on.