potion-base-8M tokenizer.json: post_processor stores teacher ids (101/102 = '×'/'ß' in compacted vocab), polluting every embedding #335

@urcades

Summary

The published minishlab/potion-base-8M artifact has an internally inconsistent tokenizer.json: the vocabulary was compacted during distillation ([CLS] → id 2, [SEP] → id 3, vocab size 29,528), but the TemplateProcessing post-processor still stores the teacher's original special-token ids:

"special_tokens": {
  "[CLS]": {"id": "[CLS]", "ids": [101], "tokens": ["[CLS]"]},
  "[SEP]": {"id": "[SEP]", "ids": [102], "tokens": ["[SEP]"]}
}

In the compacted vocab, id 101 is the token × and id 102 is ß. Since tokenizers inserts the stored ids verbatim at encode time, every encoding of potion-base-8M mean-pools the embedding rows of × and ß where [CLS]/[SEP] rows were intended.

Reproduction

import json, urllib.request

url = "https://huggingface.co/minishlab/potion-base-8M/resolve/main/tokenizer.json"
tk = json.load(urllib.request.urlopen(url))
vocab = tk["model"]["vocab"]
pp = tk["post_processor"]["special_tokens"]

print("vocab        [CLS]:", vocab["[CLS]"], "  [SEP]:", vocab["[SEP]"])   # 2, 3
print("post_processor ids [CLS]:", pp["[CLS]"]["ids"], " [SEP]:", pp["[SEP]"]["ids"])  # [101], [102]
inv = {v: k for k, v in vocab.items()}
print("token at id 101:", inv[101], "  token at id 102:", inv[102])       # ×, ß

And the encode-time consequence:

from tokenizers import Tokenizer
t = Tokenizer.from_pretrained("minishlab/potion-base-8M")
print(t.encode("hello world").ids)   # ids include 101 and 102 — the rows for '×' and 'ß'

Measured impact

For a simple input like "Hello world", the vector mean-pooled with the template's stored ids (rows 101/102) and the vector pooled with the by-name rows (2/3) have a cosine similarity of ≈ 0.843. The pooled specials are therefore not a rounding error: they materially shift every embedding. (Found while building infrastructure that consumes model2vec artifacts and cross-checks tokenizer internal consistency; happy to share more details.)
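For anyone who wants to repeat this comparison on their own copy of the embedding matrix, here is a minimal sketch of the measurement. It is not the script that produced the 0.843 figure: the helper names (`mean_pool`, `cosine`, `special_id_divergence`) are hypothetical, the `rows` table is a toy stand-in for the real embedding matrix, and it assumes plain unweighted mean pooling over all token rows, specials included.

```python
import math

def mean_pool(rows, ids):
    """Average the embedding rows selected by `ids` (unweighted mean)."""
    dim = len(rows[ids[0]])
    return [sum(rows[i][d] for i in ids) / len(ids) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def special_id_divergence(rows, content_ids, template_ids, vocab_ids):
    """Cosine between pooling with the template's special ids and with the
    vocab's special ids, keeping the content tokens identical."""
    as_published = mean_pool(rows, [template_ids[0], *content_ids, template_ids[1]])
    as_intended = mean_pool(rows, [vocab_ids[0], *content_ids, vocab_ids[1]])
    return cosine(as_published, as_intended)

# Toy 3-dimensional rows keyed by token id; values are illustrative only.
rows = {
    2: [1.0, 0.0, 0.0],    # stands in for the [CLS] row
    3: [0.0, 1.0, 0.0],    # stands in for the [SEP] row
    101: [0.0, 0.0, 1.0],  # stands in for the '×' row
    102: [0.0, 1.0, 1.0],  # stands in for the 'ß' row
    7: [1.0, 1.0, 0.0],    # hypothetical content token
    8: [0.0, 1.0, 1.0],    # hypothetical content token
}

similarity = special_id_divergence(rows, [7, 8], (101, 102), (2, 3))
print(f"cosine(published, intended) = {similarity:.3f}")
```

The number printed for the toy table carries no meaning for potion-base-8M. To measure the real effect, `rows` would need to be replaced by the model's actual embedding matrix and `content_ids` by the ids the tokenizer produces for the input.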

Notes / suggested fix

  • Fresh distillations with model2vec 0.8.2 (e.g. from BAAI/bge-base-en-v1.5) produced consistent special-token ids in our testing, so the exporter may already be fixed in current versions. If so, this is primarily about the published potion-base-8M artifact and possibly other older compacted-vocab uploads (M2V_base_output may be worth checking).
  • Suggested remediation: when exporting, rewrite post_processor.special_tokens[*].ids to match the compacted vocab's ids (or strip the post-processor if specials aren't intended to be pooled), and consider re-uploading corrected tokenizer.json files for affected published models. A consistency assertion (post_processor ids == vocab[name]) in the exporter or in skeletoken's validators would prevent recurrence.
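The check and the rewrite suggested above could look roughly like the sketch below. It operates on an already-parsed tokenizer.json dict and assumes the layout quoted in this issue, with a top-level TemplateProcessing post-processor and vocab under `model.vocab`; nested or sequenced post-processors are not handled. The function names and the `toy` dict are hypothetical and are not part of model2vec or skeletoken.

```python
import copy

def special_token_mismatches(tokenizer_json):
    """Return {name: (template_ids, vocab_ids)} for every special token whose
    post-processor ids disagree with the vocab lookup of its tokens."""
    vocab = tokenizer_json["model"]["vocab"]
    post = tokenizer_json.get("post_processor") or {}
    mismatches = {}
    for name, entry in post.get("special_tokens", {}).items():
        expected = [vocab.get(tok) for tok in entry["tokens"]]
        if entry["ids"] != expected:
            mismatches[name] = (entry["ids"], expected)
    return mismatches

def rewrite_special_token_ids(tokenizer_json):
    """Return a deep copy whose post-processor ids are taken from the vocab.
    Raises KeyError if a special token is missing from the vocab."""
    fixed = copy.deepcopy(tokenizer_json)
    vocab = fixed["model"]["vocab"]
    for entry in fixed["post_processor"]["special_tokens"].values():
        entry["ids"] = [vocab[tok] for tok in entry["tokens"]]
    return fixed

# Toy stand-in mirroring the structure quoted in this issue; not the real file.
toy = {
    "model": {"vocab": {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}},
    "post_processor": {
        "type": "TemplateProcessing",
        "special_tokens": {
            "[CLS]": {"id": "[CLS]", "ids": [101], "tokens": ["[CLS]"]},
            "[SEP]": {"id": "[SEP]", "ids": [102], "tokens": ["[SEP]"]},
        },
    },
}

print(special_token_mismatches(toy))
# {'[CLS]': ([101], [2]), '[SEP]': ([102], [3])}
print(special_token_mismatches(rewrite_special_token_ids(toy)))
# {}
```

Returning a copy rather than mutating in place keeps the original dict available for diffing before a corrected file is written out. An exporter-side assertion could simply require that `special_token_mismatches(...)` is empty.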

Thanks for model2vec, skeletoken, and the potion models — this ecosystem is excellent to build on.
