Text tokenizer module

The text tokenizer module turns strings into token ID sequences that can be fed into models.
This layer handles only tokenization, encoding, and decoding; it does not run inference. A common workflow is to encode text into an MLMultiArray or an onnxruntime tensor, then pass the result into the matching CoreML or ONNX Runtime request.

The current goal of this module is “good enough for on-device inference”:

  • WordPiece has the highest compatibility and works well for models such as BERT and CN-CLIP
  • BPE and SentencePiece use lightweight compatibility-oriented implementations suitable for most on-device encoding cases
  • It does not try to bring tokenizer training, sampling, or the entire upstream ecosystem into runtime

This module is available in versions released after 20260319.

Create a tokenizer

coreml.new_text_tokenizer(opts)

tokenizer, err = coreml.new_text_tokenizer({
  type = tokenizer_type,
  vocab_path = vocab_path,
  merges_path = merges_path,
  model_path = sentencepiece_model_path,
  pattern = regex_pattern,
  context_length = context_length,
  do_lower_case = do_lower_case,
  vocab_limit = vocab_limit,
  clean_text = clean_text,
  add_bos = add_bos,
  add_eos = add_eos,
  bos_token = bos_token,
  eos_token = eos_token,
  pad_token = pad_token,
  unk_token = unk_token,
})

type values

  • wordpiece / bert / cn_clip
  • bpe / gpt2_bpe / clip_bpe
  • sentencepiece / spm
  • regex / pattern
  • byte / bytes
  • whitespace / space
  • character / char

The default is wordpiece.

If the first argument to coreml.new_text_tokenizer(...) is not a table, the call falls back to the WordPiece shortcut path, which is equivalent to usages such as coreml.new_wordpiece_tokenizer(vocab_path).
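For example, the two forms below create equivalent WordPiece tokenizers. The vocab path here is a placeholder, not a shipped asset:

```lua
-- Placeholder vocab path; substitute your model's vocab.txt.
local vocab_path = XXT_HOME_PATH .. "/models/demo/vocab.txt"

-- Table form:
local t1 = assert(coreml.new_text_tokenizer({
  type = "wordpiece",
  vocab_path = vocab_path,
}))

-- Shortcut string form, equivalent to the table form above:
local t2 = assert(coreml.new_wordpiece_tokenizer(vocab_path))
```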

Common parameters

  • vocab_path
    String. Path to the vocabulary file.

  • merges_path
    String. Path to the merges file used by BPE tokenizers.

  • model_path
    String. Path to a .model file used by the lightweight SentencePiece-compatible mode.

  • pattern
    String. Regular expression used by the regex tokenizer.

  • context_length
    Integer, optional. Output token sequence length.

  • do_lower_case
    Boolean, optional. Whether to lowercase the input first.

  • vocab_limit
    Integer, optional. Mostly useful for WordPiece, to cap the effective vocabulary size.

  • clean_text
    Boolean, optional. Whether to run basic text cleanup before tokenization.

  • add_bos / add_eos
    Boolean, optional. Whether to prepend / append the begin-of-sequence and end-of-sequence tokens.

  • bos_token / eos_token / pad_token / unk_token
    String, optional. Token text used for the corresponding special token.

Not every tokenizer type supports every field above. The current implementation behaves as follows:

  • wordpiece: accepts either a vocab_path string or an options table; requires vocab_path; defaults to context_length = 52, do_lower_case = true, and vocab_limit = 21128; the vocab must contain [PAD], [UNK], [CLS], and [SEP]
  • bpe: table only; requires vocab_path and merges_path; defaults to context_length = 77, do_lower_case = false, clean_text = false, add_bos = false, and add_eos = false
  • sentencepiece: table only; requires at least one of vocab_path or model_path; defaults to context_length = 77, do_lower_case = false, clean_text = true, bos_token = "<s>", eos_token = "</s>", pad_token = "<pad>", and unk_token = "<unk>"
  • regex: table only; requires vocab_path and pattern; defaults to context_length = 77, do_lower_case = false, and clean_text = true
  • whitespace / character / byte: table only; require vocab_path; default to context_length = 77, do_lower_case = false, and clean_text = true

Selection guide:

  • If the model uses vocab.txt + WordPiece, choose wordpiece
  • If the model uses vocab.json + merges.txt, choose bpe
  • If the model uses .vocab or .model, choose sentencepiece
  • Use regex, whitespace, character, or byte only for simple rule-based cases
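As an illustration of the guide above, a CLIP-style BPE model that ships vocab.json + merges.txt would be wired up as in this sketch (the file paths are placeholders):

```lua
-- Placeholder paths; point these at your model's actual files.
local tokenizer, err = coreml.new_text_tokenizer({
  type = "bpe",
  vocab_path = XXT_HOME_PATH .. "/models/clip/vocab.json",
  merges_path = XXT_HOME_PATH .. "/models/clip/merges.txt",
  context_length = 77,
})
if not tokenizer then
  error(err)
end
```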

Returns

  • tokenizer
    Text tokenizer object, or nil on failure.

  • err
    String. nil on success; error message on failure.

Notes

  • Tokenizer objects are intended to be created once and reused
  • Creation mainly decides three things: tokenization algorithm, vocabulary source, and fixed output length
  • If model behavior looks wrong, first verify the tokenizer type, vocabulary file, special-token configuration, and context_length
  • WordPiece is currently the most complete and reliable option
  • BPE and SentencePiece aim for practical on-device compatibility rather than reproducing every upstream implementation detail bit-for-bit

Shortcut constructors

The following helpers are thin wrappers around coreml.new_text_tokenizer(...). Use them when the tokenizer type is already known.

coreml.new_wordpiece_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "wordpiece", ... })
  • Suitable for BERT, CN-CLIP, and other WordPiece-style text models
  • Also supports passing the vocab path string directly: coreml.new_wordpiece_tokenizer(vocab_path)

coreml.new_bpe_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "bpe", ... })
  • Also covers GPT-2- and CLIP-style vocab.json + merges.txt models

coreml.new_sentencepiece_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "sentencepiece", ... })
  • Uses the lightweight compatibility-oriented implementation
  • Supports both .vocab and .model

coreml.new_regex_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "regex", ... })
  • Suitable for rule-based tokenization and simple vocabulary lookup

coreml.new_byte_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "byte", ... })
  • Converts text into UTF-8 bytes first, then looks up tokens by byte value

coreml.new_whitespace_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "whitespace", ... })
  • Suitable for whitespace tokenization with direct vocabulary lookup

coreml.new_character_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "character", ... })
  • Suitable for character-level models

Type check

coreml.is_text_tokenizer(value)

is_tokenizer = coreml.is_text_tokenizer(value)

Checks whether a value is a coreml_text_tokenizer_object.
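A minimal sketch of using the type check to validate an argument before calling tokenizer methods:

```lua
-- Guard helper: fail fast when the caller passes something
-- other than a coreml_text_tokenizer_object.
local function encode_with(tokenizer, text)
  if not coreml.is_text_tokenizer(tokenizer) then
    error("expected a coreml_text_tokenizer_object")
  end
  return tokenizer:encode(text)
end
```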

Object methods

:encode(text[, opts])

result, err = tokenizer:encode(text)

or

result, err = tokenizer:encode(text, {
  output = "table" or "MLMultiArray" or "ort_tensor",
  data_type = data_type,
  pair_text = paired_text,
  max_length = max_length,
  padding = padding_strategy,
  truncation = truncation_strategy,
  return_attention_mask = return_attention_mask,
  return_token_type_ids = return_token_type_ids,
  return_special_tokens_mask = return_special_tokens_mask,
})

Encodes a single text string into a token sequence.

:encode_batch(texts[, opts])

result, err = tokenizer:encode_batch(texts)

or

result, err = tokenizer:encode_batch(texts, {
  output = "table" or "MLMultiArray" or "ort_tensor",
  data_type = data_type,
  pair_text = paired_text_or_array,
  max_length = max_length,
  padding = padding_strategy,
  truncation = truncation_strategy,
  return_attention_mask = return_attention_mask,
  return_token_type_ids = return_token_type_ids,
  return_special_tokens_mask = return_special_tokens_mask,
})

Encodes multiple texts at once and returns token sequences in batch form.

:decode(ids)

text, err = tokenizer:decode(ids)

Decodes a single token ID sequence back into text.

  • ids may be a Lua array, an MLMultiArray, or tensor-like userdata such as an ORT tensor
  • ids may also be any tensor-like userdata that implements both shape() and to_table()
  • Passing batch data is an error; use decode_batch() for that case
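A round-trip sketch, assuming a tokenizer created as shown earlier; note that batch data must go through decode_batch():

```lua
-- Single sequence: encode then decode back to text.
local ids = assert(tokenizer:encode("hello world", { output = "table" }))
local text = assert(tokenizer:decode(ids))

-- Batch data must use decode_batch(); passing it to decode() is an error.
local batch_ids = assert(tokenizer:encode_batch({ "hello", "world" }, { output = "table" }))
local texts = assert(tokenizer:decode_batch(batch_ids))
```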

:decode_batch(batch_ids)

texts, err = tokenizer:decode_batch(batch_ids)

Decodes a batch of token ID sequences into a Lua string array.

  • batch_ids may be a nested Lua array, an MLMultiArray, or tensor-like userdata such as an ORT tensor
  • batch_ids may also be any tensor-like userdata that implements both shape() and to_table()

:vocab_size()

size = tokenizer:vocab_size()

Returns the available vocabulary size.

:context_length()

length = tokenizer:context_length()

Returns the tokenizer's fixed output length.

Encoding and return-value notes

  • output defaults to "MLMultiArray", which is convenient when the result goes straight into a CoreML model
  • output = "table" is useful for debugging, inspecting token IDs, or supporting older scripts
  • output = "ort_tensor" is useful when the result should go directly into an ONNX Runtime text model
  • output = "ort_tensor" requires require("onnxruntime") first, because the ORT bridge is injected at that time
  • output = "ort_tensor" depends on ONNX Runtime and therefore requires iOS 13+
  • For compatibility with older scripts, the legacy field multi_array_output is still read; new code should prefer output

data_type rules

  • With output = "MLMultiArray", data_type can be "int32", "float32", "float16", or "double"; "float64" is accepted as an alias for "double"
  • With output = "MLMultiArray", the default data_type is "int32"
  • With output = "ort_tensor", data_type can be "float16", "float32", "uint8", "int8", "int32", "int64", "double", or "bool"
  • With output = "ort_tensor", the default data_type is "int64"

padding / truncation

  • padding can be a boolean or a string
    • true maps to "max_length"
    • false maps to "do_not_pad"
  • truncation can be a boolean or a string
    • true maps to "longest_first"
    • false maps to "do_not_truncate"
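The boolean shorthands map onto the string strategies, so the two calls in this sketch should behave identically:

```lua
-- Boolean shorthand:
local a = assert(tokenizer:encode("some long input text", {
  padding = true,      -- same as "max_length"
  truncation = true,   -- same as "longest_first"
  max_length = 32,
}))

-- Explicit string strategies:
local b = assert(tokenizer:encode("some long input text", {
  padding = "max_length",
  truncation = "longest_first",
  max_length = 32,
}))
```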

pair_text

  • encode() accepts a single pair_text
  • encode_batch() accepts either one shared pair_text or an array of pair texts matching the batch size
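A sketch of sentence-pair encoding, e.g. for BERT-style two-segment inputs (the texts are placeholders):

```lua
-- Single pair: token_type_ids distinguish the two segments.
local pair = assert(tokenizer:encode("how old are you?", {
  pair_text = "I am six.",
  return_token_type_ids = true,
}))

-- Batch with one pair text per sample (array must match the batch size).
local batch_pairs = assert(tokenizer:encode_batch({ "first question", "second question" }, {
  pair_text = { "first answer", "second answer" },
}))
```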

Structured returns

When any of the following options is true, encode() / encode_batch() return a structured result table instead of a bare token sequence:

  • return_attention_mask
  • return_token_type_ids
  • return_special_tokens_mask

Common fields in the structured result:

  • input_ids
  • length
  • attention_mask
  • token_type_ids
  • special_tokens_mask

For batch encoding:

  • With output = "table", each sample keeps its own natural length
  • With output = "MLMultiArray" or "ort_tensor", the batch is padded into a rectangular form before returning
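A structured batch sketch, assuming the structured fields are per-sample Lua arrays when output = "table" (with a tensor output the same batch would instead come back padded to a rectangle):

```lua
local result = assert(tokenizer:encode_batch({ "short", "a somewhat longer sample" }, {
  output = "table",
  return_attention_mask = true,
}))

-- With output = "table", each sample keeps its natural length,
-- so the two rows below may differ in length.
for i, ids in ipairs(result.input_ids) do
  print(i, #ids, #result.attention_mask[i])
end
```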

Example

local tokenizer = assert(coreml.new_text_tokenizer({
  type = "wordpiece",
  vocab_path = XXT_HOME_PATH.."/models/demo/vocab.txt",
  context_length = 52,
}))

local ids = assert(tokenizer:encode("stars", {
  output = "MLMultiArray",
  data_type = "int32",
}))

local structured = assert(tokenizer:encode("stars", {
  output = "table",
  return_attention_mask = true,
  return_token_type_ids = true,
}))

-- Load ONNX Runtime first so that the "ort_tensor" bridge is injected.
local ort = require("onnxruntime")

local input_ids = assert(tokenizer:encode("stars", {
  output = "ort_tensor",
  data_type = "int64",
}))

local batch = assert(tokenizer:encode_batch({
  "stars",
  "moon",
}, {
  output = "ort_tensor",
}))

print(tokenizer:decode(structured.input_ids))
print(tokenizer:vocab_size())
print(tokenizer:context_length())