# dart_sentencepiece_tokenizer
A lightweight, pure Dart implementation of the SentencePiece tokenizer. Supports both the BPE (Gemma) and Unigram (Llama) algorithms.
## Features
- Pure Dart - Zero dependencies, works everywhere (Flutter, Server, CLI, Web)
- Memory Efficient - Typed arrays (`Int32List`, `Uint8List`) for 50-70% memory reduction
- BPE & Unigram - Supports both algorithms used by Gemma and Llama models
- Optimized BPE - O(1) merge operations with linked list and merge caching
- Full API - Encoding, decoding, padding, truncation, offset mapping
- Batch Processing - Sequential and parallel (Isolate-based) batch encoding
- Input Validation - Protects against OOM with configurable size limits
- Well Tested - 158 tests with 100% pass rate
## Installation
```yaml
dependencies:
  dart_sentencepiece_tokenizer: ^1.1.0
```
## Quick Start
```dart
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

void main() {
  // Load tokenizer with the Llama config (BOS only)
  final tokenizer = SentencePieceTokenizer.fromModelFileSync(
    'tokenizer.model',
    config: SentencePieceConfig.llama,
  );

  // Encode text
  final encoding = tokenizer.encode('Hello, world!');
  print(encoding.tokens); // [<s>, ▁Hello, ,, ▁world, !]
  print(encoding.ids);    // [1, 15043, 29892, 3186, 29991]

  // Decode back to text
  final text = tokenizer.decode(encoding.ids, skipSpecialTokens: true);
  print(text); // Hello, world!
}
```
## Usage
### Single Text Encoding
```dart
final encoding = tokenizer.encode('Hello world');
print(encoding.tokens);        // Token strings
print(encoding.ids);           // Token IDs (Int32List)
print(encoding.attentionMask); // Attention mask (Uint8List)
print(encoding.typeIds);       // Type IDs (Uint8List)
print(encoding.offsets);       // Character offsets [(start, end), ...]
print(encoding.wordIds);       // Word indices
print(encoding.sequenceIds);   // Sequence indices (0, 1, or null)

// Without special tokens
final raw = tokenizer.encode('Hello', addSpecialTokens: false);
```
Note: Input text exceeding 500,000 characters throws `ArgumentError` to prevent OOM.
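If an input can exceed that limit, one workaround is to split it before encoding. A minimal sketch, assuming the 500,000-character limit above; the `encodeLong` helper and fixed-size chunking are illustrative, not part of the library:

```dart
const maxChars = 500000;

// Illustrative helper: encode arbitrarily long text by splitting it into
// chunks below the documented input limit. Chunk boundaries are naive
// (fixed character counts), so tokens near a boundary may differ from
// encoding the full text at once.
List<Encoding> encodeLong(SentencePieceTokenizer tokenizer, String text) {
  final chunks = <String>[];
  for (var i = 0; i < text.length; i += maxChars) {
    final end = (i + maxChars < text.length) ? i + maxChars : text.length;
    chunks.add(text.substring(i, end));
  }
  return tokenizer.encodeBatch(chunks, addSpecialTokens: false);
}
```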
### Sentence Pair Encoding
```dart
// For QA, NLI, and sentence-similarity tasks
final encoding = tokenizer.encodePair(
  'What is machine learning?',
  'Machine learning is a subset of AI.',
);
print(encoding.typeIds);     // [0,0,0,0,0,0, 1,1,1,1,1,1,1]
print(encoding.sequenceIds); // [null,0,0,0,0,null, 1,1,1,1,1,1,null]

// With truncation
final truncated = tokenizer.encodePair(
  longQuestion,
  longAnswer,
  maxLength: 512,
  strategy: TruncationStrategy.longestFirst,
);
```
### Batch Encoding
```dart
// Sequential batch
final encodings = tokenizer.encodeBatch(['Hello', 'World', 'Test']);

// Parallel batch (uses Isolates for batches >= 8)
final parallelEncodings = await tokenizer.encodeBatchParallel(texts);

// Pair batch
final pairs = [('Q1', 'A1'), ('Q2', 'A2')];
final pairEncodings = tokenizer.encodePairBatch(pairs, maxLength: 256);
```
### Padding
```dart
// Fluent API
final tokenizer = SentencePieceTokenizer.fromModelFileSync('model.model')
  ..enablePadding(length: 512, direction: SpPaddingDirection.right);

// Or pad to the longest in the batch
tokenizer.enablePadding(); // Auto-pads to longest

// Manual padding
final padded = encoding.withPadding(
  targetLength: 128,
  padTokenId: tokenizer.vocab.padId,
  padOnRight: true,
);

// Pad to a multiple of N
final paddedToMultiple = encoding.withPaddingToMultipleOf(
  multiple: 8,
  padTokenId: tokenizer.vocab.padId,
);
```
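Combined with batch encoding, auto-padding yields uniform-length encodings ready for fixed-shape model input. A minimal sketch, assuming `enablePadding()` applies to `encodeBatch` as described above; the output is illustrative:

```dart
final tokenizer = SentencePieceTokenizer.fromModelFileSync('model.model')
  ..enablePadding(); // pad each batch to its longest member

final encodings = tokenizer.encodeBatch(['Hi', 'A longer sentence here']);

// All encodings in the batch now share one length; padded positions
// carry attentionMask == 0 so downstream models can ignore them.
print(encodings.map((e) => e.length).toSet()); // e.g. {8}
```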
### Truncation
```dart
// Fluent API
final tokenizer = SentencePieceTokenizer.fromModelFileSync('model.model')
  ..enableTruncation(maxLength: 512, direction: SpTruncationDirection.right);

// Manual truncation
final truncated = encoding.withTruncation(maxLength: 64);

// Truncation strategies for pairs
final (truncA, truncB) = Encoding.truncatePair(
  encodingA: encodingA,
  encodingB: encodingB,
  maxLength: 128,
  strategy: TruncationStrategy.longestFirst,
);
```
Truncation Strategies:
- `longestFirst` - Remove from the longest sequence iteratively
- `onlyFirst` - Truncate the first sequence only
- `onlySecond` - Truncate the second sequence only
- `doNotTruncate` - No truncation
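A worked example of how the strategies differ, using illustrative lengths (a 10-token `encodingA` and a 4-token `encodingB`):

```dart
// longestFirst removes tokens from the longer sequence first, so the
// pair converges toward a balanced split (here roughly 4 + 4 = 8).
// onlyFirst would instead cut encodingA to 4 tokens and leave
// encodingB untouched.
final (a, b) = Encoding.truncatePair(
  encodingA: encodingA, // 10 tokens (illustrative)
  encodingB: encodingB, // 4 tokens (illustrative)
  maxLength: 8,
  strategy: TruncationStrategy.longestFirst,
);
print(a.length + b.length); // <= 8
```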
### Offset Mapping
```dart
final encoding = tokenizer.encode('Hello world');

// Character position -> token index
final tokenIdx = encoding.charToToken(6); // 'w' -> token index

// Token index -> character span
final (start, end) = encoding.tokenToChars(1)!; // token -> (0, 5)

// Word index -> token span
final (startToken, endToken) = encoding.wordToTokens(0)!;

// Token -> word index
final wordIdx = encoding.tokenToWord(1);

// Token -> sequence index (0, 1, or null for special tokens)
final seqIdx = encoding.tokenToSequence(1);
```
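Offsets make it straightforward to map tokens back to the exact source text, e.g. for span highlighting. A minimal sketch using only the calls above:

```dart
final text = 'Hello world';
final encoding = tokenizer.encode(text);

// Print each token next to the source substring it covers.
for (var i = 0; i < encoding.length; i++) {
  final span = encoding.tokenToChars(i);
  if (span == null) continue; // special tokens have no source span
  final (start, end) = span;
  print('${encoding.tokens[i]} -> "${text.substring(start, end)}"');
}
```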
### Vocabulary Access
```dart
print(tokenizer.vocabSize);   // 32000
print(tokenizer.vocab.unkId); // 0
print(tokenizer.vocab.bosId); // 1
print(tokenizer.vocab.eosId); // 2
print(tokenizer.vocab.padId); // -1 (if not defined)

// Token <-> ID conversion
tokenizer.convertTokensToIds(['▁hello', '▁world']); // [15043, 3186]
tokenizer.convertIdsToTokens([15043, 3186]);        // ['▁hello', '▁world']

// Check whether a token exists
tokenizer.vocab.contains('▁hello'); // true

// Get the vocabulary map
final vocabMap = tokenizer.vocab.vocabularyMap; // Map<String, int>
```
### Decoding
```dart
// Decode, keeping special tokens
final withSpecial = tokenizer.decode(encoding.ids, skipSpecialTokens: false);

// Decode without special tokens (default: true)
final text = tokenizer.decode(encoding.ids);

// Batch decode
final texts = tokenizer.decodeBatch(idsBatch);
```
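As a sanity check, encoding and then decoding with special tokens skipped should recover the input, as the Quick Start shows; a sketch (whitespace handling can vary with the model file, so compare after trimming):

```dart
final input = 'Hello, world!';
final roundTrip = tokenizer.decode(tokenizer.encode(input).ids);
assert(roundTrip.trim() == input.trim()); // holds for typical inputs
```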
## ONNX Runtime Integration
Use with ONNX Runtime for on-device ML inference:
```dart
import 'dart:typed_data';

import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

final tokenizer = SentencePieceTokenizer.fromModelFileSync(
  'model.model',
  config: SentencePieceConfig.llama,
)..enableTruncation(maxLength: 512);

final encoding = tokenizer.encode('What is machine learning?');

// Encoding.ids is already Int32List; convert to Int64List for ONNX
final inputIds = Int64List.fromList(encoding.ids);
final attentionMask = Int64List.fromList(encoding.attentionMask);

// Pass to an ONNX session
// final outputs = await session.run({
//   'input_ids': inputIds,
//   'attention_mask': attentionMask,
// });
```
## Configuration
```dart
// Gemma: adds BOS and EOS tokens
final gemmaTokenizer = SentencePieceTokenizer.fromModelFileSync(
  'gemma.model',
  config: SentencePieceConfig.gemma,
);

// Llama: adds BOS token only
final llamaTokenizer = SentencePieceTokenizer.fromModelFileSync(
  'llama.model',
  config: SentencePieceConfig.llama,
);

// Custom configuration
final customTokenizer = SentencePieceTokenizer.fromModelFileSync(
  'model.model',
  config: const SentencePieceConfig(
    addBosToken: true,
    addEosToken: false,
  ),
);
```
| Config | BOS Token | EOS Token | Use Case |
|---|---|---|---|
| `SentencePieceConfig()` | No | No | Raw tokenization |
| `SentencePieceConfig.gemma` | Yes | Yes | Gemma models |
| `SentencePieceConfig.llama` | Yes | No | Llama models |
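In practice the configs differ only in which boundary tokens are prepended or appended. A sketch with illustrative token strings (the actual BOS/EOS strings depend on the model file):

```dart
print(llamaTokenizer.encode('Hi').tokens); // e.g. [<s>, ▁Hi]
print(gemmaTokenizer.encode('Hi').tokens); // e.g. [<bos>, ▁Hi, <eos>]
```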
## API Reference
### SentencePieceTokenizer
| Method | Description |
|---|---|
| `fromModelFile(path, config?)` | Load from a `.model` file (async) |
| `fromModelFileSync(path, config?)` | Load from a `.model` file (sync) |
| `fromBytes(bytes, config?)` | Load from byte data |
| `encode(text, addSpecialTokens?)` | Encode a single text |
| `encodePair(textA, textB, ...)` | Encode a text pair |
| `encodeBatch(texts, addSpecialTokens?)` | Encode multiple texts |
| `encodeBatchParallel(texts, ...)` | Parallel batch encoding |
| `encodePairBatch(pairs, ...)` | Batch-encode text pairs |
| `decode(ids, skipSpecialTokens?)` | Decode IDs to text |
| `decodeBatch(idsBatch, ...)` | Batch decode |
| `enablePadding()` / `noPadding()` | Configure padding |
| `enableTruncation()` / `noTruncation()` | Configure truncation |
| `convertTokensToIds(tokens)` | Convert tokens to IDs |
| `convertIdsToTokens(ids)` | Convert IDs to tokens |
| `numSpecialTokensToAdd(isPair?)` | Get the special-token count |
### Encoding
| Property | Type | Description |
|---|---|---|
| `tokens` | `List<String>` | Token strings |
| `ids` | `Int32List` | Token IDs |
| `attentionMask` | `Uint8List` | Attention mask (1 = attend, 0 = ignore) |
| `typeIds` | `Uint8List` | Token type IDs (0 = first, 1 = second) |
| `specialTokensMask` | `Uint8List` | Special-token mask |
| `offsets` | `List<(int, int)>` | Character offsets |
| `wordIds` | `List<int?>` | Word indices |
| `sequenceIds` | `List<int?>` | Sequence indices |
| `length` | `int` | Number of tokens |
## Performance
| Metric | Value |
|---|---|
| Throughput | ~500K+ tokens/sec |
| Model loading | ~50ms (32K vocab) |
| Memory (vocab) | ~3MB |
| Lookup complexity | O(k) per token |
| BPE merge | O(1) per merge |
| Max input length | 500,000 chars |
### Memory Efficiency
Uses typed arrays for 50-70% memory reduction:
| Field | Type | Bytes/token |
|---|---|---|
| `ids` | `Int32List` | 4 |
| `typeIds` | `Uint8List` | 1 |
| `attentionMask` | `Uint8List` | 1 |
| `specialTokensMask` | `Uint8List` | 1 |
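As a worked example, a 512-token encoding stores 512 × (4 + 1 + 1 + 1) = 3,584 bytes across these four fields, whereas a plain `List<int>` representation costs roughly 8 bytes per element per field on a 64-bit VM, before any boxing overhead (exact figures vary by platform).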
## Model File
Download SentencePiece models from HuggingFace model repositories.

Format: binary protobuf (`.model` files produced by the SentencePiece C++ library).
## Testing
```bash
# Run all tests (158 tests)
dart test

# Run a specific test file
dart test test/sentencepiece_test.dart

# Run benchmarks
dart run benchmark/performance_benchmark.dart
```
## HuggingFace Compatibility Verification
```bash
# Run the HuggingFace compatibility benchmark
dart run benchmark/hf_compatibility_benchmark.dart

# Regenerate expected benchmark values (requires Python + sentencepiece)
pip install sentencepiece
python scripts/generate_hf_benchmark_data.py --model tokenizer.model
```
## License
MIT License