Text tokenizer module
The text tokenizer module turns strings into token ID sequences that can be fed into models.
This layer only handles tokenization, encoding, and reverse decoding. It does not run inference. A common workflow is to encode text into MLMultiArray or onnxruntime.tensor, then pass the result into the matching CoreML or ONNX Runtime request.
The current goal of this module is “good enough for on-device inference”:
- WordPiece has the highest compatibility and works well for models such as BERT and CN-CLIP
- BPE and SentencePiece use lightweight compatibility-oriented implementations suitable for most on-device encoding cases
- It does not try to bring tokenizer training, sampling, or the entire upstream ecosystem into runtime
This module is available in versions released after 20260319
Create a tokenizer
coreml.new_text_tokenizer(opts)
tokenizer, err = coreml.new_text_tokenizer({
type = tokenizer_type,
vocab_path = vocab_path,
merges_path = merges_path,
model_path = sentencepiece_model_path,
pattern = regex_pattern,
context_length = context_length,
do_lower_case = do_lower_case,
vocab_limit = vocab_limit,
clean_text = clean_text,
add_bos = add_bos,
add_eos = add_eos,
bos_token = bos_token,
eos_token = eos_token,
pad_token = pad_token,
unk_token = unk_token,
})
type values
- wordpiece / bert / cn_clip
- bpe / gpt2_bpe / clip_bpe
- sentencepiece / spm
- regex / pattern
- byte / bytes
- whitespace / space
- character / char
The default is wordpiece.
If the first argument to coreml.new_text_tokenizer(...) is not a table, the call falls back to the WordPiece shortcut path, which is equivalent to usages such as coreml.new_wordpiece_tokenizer(vocab_path).
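Both construction forms can be sketched as follows; the vocab path here is a placeholder for illustration, not a file shipped with the module:

```lua
-- Full table form
local tokenizer, err = coreml.new_text_tokenizer({
    type = "wordpiece",
    vocab_path = XXT_HOME_PATH.."/models/demo/vocab.txt",
})

-- Shortcut form: a non-table first argument is treated as a WordPiece vocab path
local tokenizer2, err2 = coreml.new_text_tokenizer(XXT_HOME_PATH.."/models/demo/vocab.txt")
```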
Common parameters
- vocab_path
  String. Path to the vocabulary file.
- merges_path
  String. Path to the merges file used by BPE tokenizers.
- model_path
  String. Path to a .model file used by the lightweight SentencePiece-compatible mode.
- pattern
  String. Regular expression used by the regex tokenizer.
- context_length
  Integer, optional. Output token sequence length.
- do_lower_case
  Boolean, optional. Whether to lowercase the input first.
- vocab_limit
  Integer, optional. Mostly useful for WordPiece, to cap the effective vocabulary size.
- clean_text
  Boolean, optional. Whether to run basic text cleanup before tokenization.
- add_bos / add_eos
  Boolean, optional. Whether to prepend / append begin-of-sequence and end-of-sequence tokens.
- bos_token / eos_token / pad_token / unk_token
  String, optional. Token text used for the corresponding special token.
Not every tokenizer type supports every field above. The current implementation behaves as follows:
- wordpiece
  Accepts either a vocab_path string or an options table; requires vocab_path; defaults to context_length = 52, do_lower_case = true, and vocab_limit = 21128; the vocab must contain [PAD], [UNK], [CLS], and [SEP]
- bpe
  Table only; requires vocab_path and merges_path; defaults to context_length = 77, do_lower_case = false, clean_text = false, add_bos = false, and add_eos = false
- sentencepiece
  Table only; requires at least one of vocab_path or model_path; defaults to context_length = 77, do_lower_case = false, clean_text = true, bos_token = "<s>", eos_token = "</s>", pad_token = "<pad>", and unk_token = "<unk>"
- regex
  Table only; requires vocab_path and pattern; defaults to context_length = 77, do_lower_case = false, and clean_text = true
- whitespace / character / byte
  Table only; require vocab_path; default to context_length = 77, do_lower_case = false, and clean_text = true
Selection guide:
- If the model uses vocab.txt + WordPiece, choose wordpiece
- If the model uses vocab.json + merges.txt, choose bpe
- If the model uses .vocab or .model, choose sentencepiece
- Use regex, whitespace, character, or byte only for simple rule-based cases
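Following the guide above, a GPT-2-style model shipped as vocab.json + merges.txt would use the bpe type. The file paths below are placeholders:

```lua
local tokenizer, err = coreml.new_text_tokenizer({
    type = "bpe",
    vocab_path = XXT_HOME_PATH.."/models/demo/vocab.json",
    merges_path = XXT_HOME_PATH.."/models/demo/merges.txt",
    context_length = 77,  -- matches the BPE default
})
```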
Returns
- tokenizer
  Text tokenizer object, or nil on failure.
- err
  String. nil on success; error message on failure.
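Since the constructor reports failure through the second return value rather than raising, a typical guard looks like this (vocab path is a placeholder):

```lua
local tokenizer, err = coreml.new_text_tokenizer({
    type = "wordpiece",
    vocab_path = XXT_HOME_PATH.."/models/demo/vocab.txt",
})
if not tokenizer then
    error("failed to create tokenizer: "..tostring(err))
end
```

Wrapping the call in `assert(...)` achieves the same effect more compactly, as the example at the end of this document does.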
Notes
- Tokenizer objects are intended to be created once and reused
- Creation mainly decides three things: tokenization algorithm, vocabulary source, and fixed output length
- If model behavior looks wrong, first verify the tokenizer type, vocabulary file, special-token configuration, and context_length
- WordPiece is currently the most complete and reliable option
- BPE and SentencePiece aim for practical on-device compatibility rather than reproducing every upstream implementation detail bit-for-bit
Shortcut constructors
The following helpers are thin wrappers around coreml.new_text_tokenizer(...). Use them when the tokenizer type is already known.
coreml.new_wordpiece_tokenizer(opts)
- Equivalent to coreml.new_text_tokenizer({ type = "wordpiece", ... })
- Suitable for BERT, CN-CLIP, and other WordPiece-style text models
- Also supports passing the vocab path string directly: coreml.new_wordpiece_tokenizer(vocab_path)
coreml.new_bpe_tokenizer(opts)
- Equivalent to coreml.new_text_tokenizer({ type = "bpe", ... })
- Also covers GPT-2- and CLIP-style vocab.json + merges.txt models
coreml.new_sentencepiece_tokenizer(opts)
- Equivalent to coreml.new_text_tokenizer({ type = "sentencepiece", ... })
- Uses the lightweight compatibility-oriented implementation
- Supports both .vocab and .model
coreml.new_regex_tokenizer(opts)
- Equivalent to coreml.new_text_tokenizer({ type = "regex", ... })
- Suitable for rule-based tokenization and simple vocabulary lookup
coreml.new_byte_tokenizer(opts)
- Equivalent to coreml.new_text_tokenizer({ type = "byte", ... })
- Converts text into UTF-8 bytes first, then looks up tokens by byte value
coreml.new_whitespace_tokenizer(opts)
- Equivalent to coreml.new_text_tokenizer({ type = "whitespace", ... })
- Suitable for whitespace tokenization with direct vocabulary lookup
coreml.new_character_tokenizer(opts)
- Equivalent to coreml.new_text_tokenizer({ type = "character", ... })
- Suitable for character-level models
Type check
coreml.is_text_tokenizer(value)
is_tokenizer = coreml.is_text_tokenizer(value)
Checks whether a value is a coreml_text_tokenizer_object.
Object methods
:encode(text[, opts])
result, err = tokenizer:encode(text)
or
result, err = tokenizer:encode(text, {
output = "table" or "MLMultiArray" or "ort_tensor",
data_type = data_type,
pair_text = paired_text,
max_length = max_length,
padding = padding_strategy,
truncation = truncation_strategy,
return_attention_mask = return_attention_mask,
return_token_type_ids = return_token_type_ids,
return_special_tokens_mask = return_special_tokens_mask,
})
Encodes a single text string into a token sequence.
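A minimal usage sketch, assuming a previously constructed tokenizer; the option values are illustrative:

```lua
-- Encode with explicit length control; "table" output returns a plain
-- Lua array of token IDs, convenient for inspection
local ids, err = tokenizer:encode("hello world", {
    output = "table",
    max_length = 16,
    padding = true,     -- maps to "max_length"
    truncation = true,  -- maps to "longest_first"
})
```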
:encode_batch(texts[, opts])
result, err = tokenizer:encode_batch(texts)
or
result, err = tokenizer:encode_batch(texts, {
output = "table" or "MLMultiArray" or "ort_tensor",
data_type = data_type,
pair_text = paired_text_or_array,
max_length = max_length,
padding = padding_strategy,
truncation = truncation_strategy,
return_attention_mask = return_attention_mask,
return_token_type_ids = return_token_type_ids,
return_special_tokens_mask = return_special_tokens_mask,
})
Encodes multiple texts at once and returns token sequences in batch form.
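A sketch of batch encoding with a structured return, assuming an existing tokenizer:

```lua
-- Encoding two samples at once; requesting the attention mask switches
-- the result to the structured form described under "Structured returns"
local result, err = tokenizer:encode_batch({ "hello", "world" }, {
    output = "table",
    return_attention_mask = true,
})
```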
:decode(ids)
text, err = tokenizer:decode(ids)
Reverse-decodes a single token ID sequence back into text.
- ids may be a Lua array, an MLMultiArray, or tensor-like userdata such as an ORT tensor
- ids may also be any tensor-like userdata that implements both shape() and to_table()
- Passing batch data is an error; use decode_batch() for that case
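A simple round trip through encode and decode, assuming an existing tokenizer:

```lua
-- Encode to a plain ID array, then decode the same IDs back to text
local ids = assert(tokenizer:encode("hello", { output = "table" }))
local text = assert(tokenizer:decode(ids))
```

Depending on the tokenizer type and special-token settings, the decoded text may not match the input byte-for-byte.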
:decode_batch(batch_ids)
texts, err = tokenizer:decode_batch(batch_ids)
Reverse-decodes a batch of token ID sequences into a Lua string array.
- batch_ids may be a nested Lua array, an MLMultiArray, or tensor-like userdata such as an ORT tensor
- batch_ids may also be any tensor-like userdata that implements both shape() and to_table()
:vocab_size()
size = tokenizer:vocab_size()
Returns the available vocabulary size.
:context_length()
length = tokenizer:context_length()
Returns the tokenizer's fixed output length.
Encoding and return-value notes
- output defaults to "MLMultiArray", which is convenient when the result goes straight into a CoreML model
- output = "table" is useful for debugging, inspecting token IDs, or supporting older scripts
- output = "ort_tensor" is useful when the result should go directly into an ONNX Runtime text model
- output = "ort_tensor" requires require("onnxruntime") first, because the ORT bridge is injected at that time
- output = "ort_tensor" depends on ONNX Runtime and therefore requires iOS 13+
- For compatibility with older scripts, the legacy field multi_array_output is still read; new code should prefer output
data_type rules
- With output = "MLMultiArray", data_type can be "int32", "float32", "float16", or "double"; "float64" is accepted as an alias for "double"
- With output = "MLMultiArray", the default data_type is "int32"
- With output = "ort_tensor", data_type can be "float16", "float32", "uint8", "int8", "int32", "int64", "double", or "bool"
- With output = "ort_tensor", the default data_type is "int64"
padding / truncation
- padding can be a boolean or a string; true maps to "max_length", false maps to "do_not_pad"
- truncation can be a boolean or a string; true maps to "longest_first", false maps to "do_not_truncate"
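Given these mappings, the two option tables below should behave identically:

```lua
-- Boolean shorthand
tokenizer:encode(text, { padding = true, truncation = true })

-- Equivalent explicit string form
tokenizer:encode(text, { padding = "max_length", truncation = "longest_first" })
```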
pair_text
- encode() accepts a single pair_text
- encode_batch() accepts either one shared pair_text or an array of pair texts matching the batch size
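A sentence-pair sketch; requesting token_type_ids makes the segment boundary visible in the structured result:

```lua
local result = assert(tokenizer:encode("first sentence", {
    pair_text = "second sentence",
    return_token_type_ids = true,  -- 0 for the first segment, 1 for the second
    output = "table",
}))
```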
Structured returns
When any of the following options is true, encode() / encode_batch() return a structured result table instead of a bare token sequence:
- return_attention_mask
- return_token_type_ids
- return_special_tokens_mask
Common fields in the structured result:
- input_ids
- length
- attention_mask
- token_type_ids
- special_tokens_mask
For batch encoding:
- With output = "table", each sample keeps its own natural length
- With output = "MLMultiArray" or "ort_tensor", the batch is padded into a rectangular form before returning
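Putting the structured fields together; the comments describe the fields listed above, assuming an existing tokenizer:

```lua
local result = assert(tokenizer:encode("hello", {
    output = "table",
    return_attention_mask = true,
}))
-- result.input_ids       : token ID array
-- result.attention_mask  : marks real tokens vs. padding
-- result.length          : number of meaningful tokens
```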
Example
local tokenizer = assert(coreml.new_text_tokenizer({
type = "wordpiece",
vocab_path = XXT_HOME_PATH.."/models/demo/vocab.txt",
context_length = 52,
}))
local ids = assert(tokenizer:encode("stars", {
output = "MLMultiArray",
data_type = "int32",
}))
local structured = assert(tokenizer:encode("stars", {
output = "table",
return_attention_mask = true,
return_token_type_ids = true,
}))
local ort = require("onnxruntime")
local input_ids = assert(tokenizer:encode("stars", {
output = "ort_tensor",
data_type = "int64",
}))
local batch = assert(tokenizer:encode_batch({
"stars",
"moon",
}, {
output = "ort_tensor",
}))
print(tokenizer:decode(structured.input_ids))
print(tokenizer:vocab_size())
print(tokenizer:context_length())