Text tokenizer module

The text tokenizer module turns strings into token ID sequences that can be fed into models.
This layer handles only tokenization, encoding, and decoding; it does not run inference. A common workflow is to encode text into an MLMultiArray or an onnxruntime tensor, then pass the result into the matching CoreML or ONNX Runtime request.

The current goal of this module is “good enough for on-device inference”:

  • WordPiece has the highest compatibility and works well for models such as BERT and CN-CLIP
  • BPE and SentencePiece use lightweight compatibility-oriented implementations suitable for most on-device encoding cases
  • It does not try to bring tokenizer training, sampling, or the entire upstream ecosystem into runtime

This module is available in versions released after 20260319.

Create a tokenizer

coreml.new_text_tokenizer(opts)

tokenizer, err = coreml.new_text_tokenizer({
  type = tokenizer_type,
  vocab_path = vocab_path,
  merges_path = merges_path,
  model_path = sentencepiece_model_path,
  pattern = regex_pattern,
  context_length = context_length,
  do_lower_case = do_lower_case,
  vocab_limit = vocab_limit,
  clean_text = clean_text,
  add_bos = add_bos,
  add_eos = add_eos,
  bos_token = bos_token,
  eos_token = eos_token,
  pad_token = pad_token,
  unk_token = unk_token,
})

type values

  • wordpiece / bert / cn_clip
  • bpe / gpt2_bpe / clip_bpe
  • sentencepiece / spm
  • regex / pattern
  • byte / bytes
  • whitespace / space
  • character / char

The default is wordpiece.

If the first argument to coreml.new_text_tokenizer(...) is not a table, the call falls back to the WordPiece shortcut path, which is equivalent to usages such as coreml.new_wordpiece_tokenizer(vocab_path).
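For example, the two forms below create equivalent WordPiece tokenizers. The vocab path here is a placeholder, not a shipped asset:

```lua
-- Placeholder vocab path; substitute your model's vocab.txt.
local vocab_path = XXT_HOME_PATH .. "/models/demo/vocab.txt"

-- Table form:
local t1 = assert(coreml.new_text_tokenizer({
  type = "wordpiece",
  vocab_path = vocab_path,
}))

-- Shortcut string form, equivalent to the table form above:
local t2 = assert(coreml.new_wordpiece_tokenizer(vocab_path))
```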

Common parameters

  • vocab_path
    String. Path to the vocabulary file.

  • merges_path
    String. Path to the merges file used by BPE tokenizers.

  • model_path
    String. Path to a .model file used by the lightweight SentencePiece-compatible mode.

  • pattern
    String. Regular expression used by the regex tokenizer.

  • context_length
    Integer, optional. Output token sequence length.

  • do_lower_case
    Boolean, optional. Whether to lowercase the input first.

  • vocab_limit
    Integer, optional. Mostly useful for WordPiece, to cap the effective vocabulary size.

  • clean_text
    Boolean, optional. Whether to run basic text cleanup before tokenization.

  • add_bos / add_eos
    Boolean, optional. Whether to prepend / append the begin-of-sequence and end-of-sequence tokens.

  • bos_token / eos_token / pad_token / unk_token
    String, optional. Token text used for the corresponding special token.

Not every tokenizer type supports every field above. The current implementation behaves as follows:

  • wordpiece: accepts either a vocab_path string or an options table; requires vocab_path; defaults to context_length = 52, do_lower_case = true, and vocab_limit = 21128; the vocab must contain [PAD], [UNK], [CLS], and [SEP]
  • bpe: table only; requires vocab_path and merges_path; defaults to context_length = 77, do_lower_case = false, clean_text = false, add_bos = false, and add_eos = false
  • sentencepiece: table only; requires at least one of vocab_path or model_path; defaults to context_length = 77, do_lower_case = false, clean_text = true, bos_token = "<s>", eos_token = "</s>", pad_token = "<pad>", and unk_token = "<unk>"
  • regex: table only; requires vocab_path and pattern; defaults to context_length = 77, do_lower_case = false, and clean_text = true
  • whitespace / character / byte: table only; require vocab_path; default to context_length = 77, do_lower_case = false, and clean_text = true

Selection guide:

  • If the model uses vocab.txt + WordPiece, choose wordpiece
  • If the model uses vocab.json + merges.txt, choose bpe
  • If the model uses .vocab or .model, choose sentencepiece
  • Use regex, whitespace, character, or byte only for simple rule-based cases
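As an illustration of the guide above, a CLIP-style BPE model that ships vocab.json + merges.txt would be wired up as in this sketch (the file paths are placeholders):

```lua
-- Placeholder paths; point these at your model's actual files.
local tokenizer, err = coreml.new_text_tokenizer({
  type = "bpe",
  vocab_path = XXT_HOME_PATH .. "/models/clip/vocab.json",
  merges_path = XXT_HOME_PATH .. "/models/clip/merges.txt",
  context_length = 77,
})
if not tokenizer then
  error(err)
end
```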

Returns

  • tokenizer
    Text tokenizer object, or nil on failure.

  • err
    String. nil on success; error message on failure.

Notes

  • Tokenizer objects are intended to be created once and reused
  • Creation mainly decides three things: tokenization algorithm, vocabulary source, and fixed output length
  • If model behavior looks wrong, first verify the tokenizer type, vocabulary file, special-token configuration, and context_length
  • WordPiece is currently the most complete and reliable option
  • BPE and SentencePiece aim for practical on-device compatibility rather than reproducing every upstream implementation detail bit-for-bit

Shortcut constructors

The following helpers are thin wrappers around coreml.new_text_tokenizer(...). Use them when the tokenizer type is already known.

coreml.new_wordpiece_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "wordpiece", ... })
  • Suitable for BERT, CN-CLIP, and other WordPiece-style text models
  • Also supports passing the vocab path string directly: coreml.new_wordpiece_tokenizer(vocab_path)

coreml.new_bpe_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "bpe", ... })
  • Also covers GPT-2- and CLIP-style vocab.json + merges.txt models

coreml.new_sentencepiece_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "sentencepiece", ... })
  • Uses the lightweight compatibility-oriented implementation
  • Supports both .vocab and .model

coreml.new_regex_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "regex", ... })
  • Suitable for rule-based tokenization and simple vocabulary lookup

coreml.new_byte_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "byte", ... })
  • Converts text into UTF-8 bytes first, then looks up tokens by byte value

coreml.new_whitespace_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "whitespace", ... })
  • Suitable for whitespace tokenization with direct vocabulary lookup

coreml.new_character_tokenizer(opts)

  • Equivalent to coreml.new_text_tokenizer({ type = "character", ... })
  • Suitable for character-level models

Type check

coreml.is_text_tokenizer(value)

is_tokenizer = coreml.is_text_tokenizer(value)

Checks whether a value is a coreml_text_tokenizer_object.
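A minimal sketch of using the type check to validate an argument before calling tokenizer methods:

```lua
-- Guard helper: fail fast when the caller passes something
-- other than a coreml_text_tokenizer_object.
local function encode_with(tokenizer, text)
  if not coreml.is_text_tokenizer(tokenizer) then
    error("expected a coreml_text_tokenizer_object")
  end
  return tokenizer:encode(text)
end
```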

Object methods

:encode(text[, opts])

result, err = tokenizer:encode(text)

or

result, err = tokenizer:encode(text, {
  output = "table" or "MLMultiArray" or "ort_tensor",
  data_type = data_type,
  pair_text = paired_text,
  max_length = max_length,
  padding = padding_strategy,
  truncation = truncation_strategy,
  return_attention_mask = return_attention_mask,
  return_token_type_ids = return_token_type_ids,
  return_special_tokens_mask = return_special_tokens_mask,
})

Encodes a single text string into a token sequence.

:encode_batch(texts[, opts])

result, err = tokenizer:encode_batch(texts)

or

result, err = tokenizer:encode_batch(texts, {
  output = "table" or "MLMultiArray" or "ort_tensor",
  data_type = data_type,
  pair_text = paired_text_or_array,
  max_length = max_length,
  padding = padding_strategy,
  truncation = truncation_strategy,
  return_attention_mask = return_attention_mask,
  return_token_type_ids = return_token_type_ids,
  return_special_tokens_mask = return_special_tokens_mask,
})

Encodes multiple texts at once and returns token sequences in batch form.

:decode(ids)

text, err = tokenizer:decode(ids)

Decodes a single token ID sequence back into text.

  • ids may be a Lua array, an MLMultiArray, or tensor-like userdata such as an ORT tensor
  • ids may also be any tensor-like userdata that implements both shape() and to_table()
  • Passing batch data is an error; use decode_batch() for that case
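A round-trip sketch, assuming a tokenizer created as shown earlier; note that batch data must go through decode_batch():

```lua
-- Single sequence: encode then decode back to text.
local ids = assert(tokenizer:encode("hello world", { output = "table" }))
local text = assert(tokenizer:decode(ids))

-- Batch data must use decode_batch(); passing it to decode() is an error.
local batch_ids = assert(tokenizer:encode_batch({ "hello", "world" }, { output = "table" }))
local texts = assert(tokenizer:decode_batch(batch_ids))
```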

:decode_batch(batch_ids)

texts, err = tokenizer:decode_batch(batch_ids)

Decodes a batch of token ID sequences into a Lua string array.

  • batch_ids may be a nested Lua array, an MLMultiArray, or tensor-like userdata such as an ORT tensor
  • batch_ids may also be any tensor-like userdata that implements both shape() and to_table()

:vocab_size()

size = tokenizer:vocab_size()

Returns the available vocabulary size.

:context_length()

length = tokenizer:context_length()

Returns the tokenizer's fixed output length.

Encoding and return-value notes

  • output defaults to "MLMultiArray", which is convenient when the result goes straight into a CoreML model
  • output = "table" is useful for debugging, inspecting token IDs, or supporting older scripts
  • output = "ort_tensor" is useful when the result should go directly into an ONNX Runtime text model
  • output = "ort_tensor" requires require("onnxruntime") first, because the ORT bridge is injected at that time
  • output = "ort_tensor" depends on ONNX Runtime and therefore requires iOS 13+
  • For compatibility with older scripts, the legacy field multi_array_output is still read; new code should prefer output

data_type rules

  • With output = "MLMultiArray", data_type can be "int32", "float32", "float16", or "double"; "float64" is accepted as an alias for "double"
  • With output = "MLMultiArray", the default data_type is "int32"
  • With output = "ort_tensor", data_type can be "float16", "float32", "uint8", "int8", "int32", "int64", "double", or "bool"
  • With output = "ort_tensor", the default data_type is "int64"

padding / truncation

  • padding can be a boolean or a string
    • true maps to "max_length"
    • false maps to "do_not_pad"
  • truncation can be a boolean or a string
    • true maps to "longest_first"
    • false maps to "do_not_truncate"
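The boolean shorthands map onto the string strategies, so the two calls in this sketch should behave identically:

```lua
-- Boolean shorthand:
local a = assert(tokenizer:encode("some long input text", {
  padding = true,      -- same as "max_length"
  truncation = true,   -- same as "longest_first"
  max_length = 32,
}))

-- Explicit string strategies:
local b = assert(tokenizer:encode("some long input text", {
  padding = "max_length",
  truncation = "longest_first",
  max_length = 32,
}))
```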

pair_text

  • encode() accepts a single pair_text
  • encode_batch() accepts either one shared pair_text or an array of pair texts matching the batch size
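A sketch of sentence-pair encoding, e.g. for BERT-style two-segment inputs (the texts are placeholders):

```lua
-- Single pair: token_type_ids distinguish the two segments.
local pair = assert(tokenizer:encode("how old are you?", {
  pair_text = "I am six.",
  return_token_type_ids = true,
}))

-- Batch with one pair text per sample (array must match the batch size).
local batch_pairs = assert(tokenizer:encode_batch({ "first question", "second question" }, {
  pair_text = { "first answer", "second answer" },
}))
```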

Structured returns

When any of the following options is true, encode() / encode_batch() return a structured result table instead of a bare token sequence:

  • return_attention_mask
  • return_token_type_ids
  • return_special_tokens_mask

Common fields in the structured result:

  • input_ids
  • length
  • attention_mask
  • token_type_ids
  • special_tokens_mask

For batch encoding:

  • With output = "table", each sample keeps its own natural length
  • With output = "MLMultiArray" or "ort_tensor", the batch is padded into a rectangular form before returning
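A structured batch sketch, assuming the structured fields are per-sample Lua arrays when output = "table" (with a tensor output the same batch would instead come back padded to a rectangle):

```lua
local result = assert(tokenizer:encode_batch({ "short", "a somewhat longer sample" }, {
  output = "table",
  return_attention_mask = true,
}))

-- With output = "table", each sample keeps its natural length,
-- so the two rows below may differ in length.
for i, ids in ipairs(result.input_ids) do
  print(i, #ids, #result.attention_mask[i])
end
```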

Example

local tokenizer = assert(coreml.new_text_tokenizer({
  type = "wordpiece",
  vocab_path = XXT_HOME_PATH.."/models/demo/vocab.txt",
  context_length = 52,
}))

local ids = assert(tokenizer:encode("stars", {
  output = "MLMultiArray",
  data_type = "int32",
}))

local structured = assert(tokenizer:encode("stars", {
  output = "table",
  return_attention_mask = true,
  return_token_type_ids = true,
}))

-- Load ONNX Runtime first so that the "ort_tensor" bridge is injected.
local ort = require("onnxruntime")

local input_ids = assert(tokenizer:encode("stars", {
  output = "ort_tensor",
  data_type = "int64",
}))

local batch = assert(tokenizer:encode_batch({
  "stars",
  "moon",
}, {
  output = "ort_tensor",
}))

print(tokenizer:decode(structured.input_ids))
print(tokenizer:vocab_size())
print(tokenizer:context_length())