dart_sentencepiece_tokenizer

A lightweight, pure Dart implementation of the SentencePiece tokenizer. Supports both the BPE (Gemma) and Unigram (Llama) algorithms.

Features

  • Pure Dart - Zero dependencies, works everywhere (Flutter, Server, CLI, Web)
  • Memory Efficient - Typed arrays (Int32List, Uint8List) for 50-70% memory reduction
  • BPE & Unigram - Supports both algorithms used by Gemma and Llama models
  • Optimized BPE - O(1) merge operations with a linked list and merge caching (sketched below)
  • Full API - Encoding, decoding, padding, truncation, offset mapping
  • Batch Processing - Sequential and parallel (Isolate-based) batch encoding
  • Input Validation - Protects against OOM with configurable size limits
  • Well Tested - 158 tests with 100% pass rate
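
A minimal sketch of the linked-list idea behind the O(1) merge claim. This is illustrative only, not the package's internals; bpeMergeSketch and its helpers are hypothetical names, and a production implementation would also keep candidate pairs in a priority structure and cache merge lookups:

// Doubly linked list node holding one symbol.
class _Node {
  String piece;
  _Node? prev, next;
  _Node(this.piece);
}

List<String> bpeMergeSketch(List<String> symbols, Map<String, int> ranks) {
  if (symbols.isEmpty) return const [];
  const noRank = 1 << 30;

  // Build a doubly linked list so each merge is an O(1) splice.
  final head = _Node(symbols.first);
  var tail = head;
  for (final s in symbols.skip(1)) {
    final n = _Node(s)..prev = tail;
    tail.next = n;
    tail = n;
  }

  while (true) {
    // Find the best-ranked adjacent pair (lower rank = merge first).
    // A real implementation keeps candidates in a heap instead of rescanning.
    _Node? best;
    var bestRank = noRank;
    for (_Node? n = head; n != null && n.next != null; n = n.next) {
      final rank = ranks[n.piece + n.next!.piece] ?? noRank;
      if (rank < bestRank) {
        bestRank = rank;
        best = n;
      }
    }
    if (best == null) break; // no mergeable pair left

    // O(1) splice: fold the right node into the left one.
    best.piece += best.next!.piece;
    best.next = best.next!.next;
    best.next?.prev = best;
  }

  final out = <String>[];
  for (_Node? n = head; n != null; n = n.next) {
    out.add(n.piece);
  }
  return out;
}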

Installation

dependencies:
  dart_sentencepiece_tokenizer: ^1.1.0

Quick Start

import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

void main() {
  // Load tokenizer with Llama config (BOS only)
  final tokenizer = SentencePieceTokenizer.fromModelFileSync(
    'tokenizer.model',
    config: SentencePieceConfig.llama,
  );

  // Encode text
  final encoding = tokenizer.encode('Hello, world!');
  print(encoding.tokens); // [<s>, ▁Hello, ,, ▁world, !]
  print(encoding.ids);    // [1, 15043, 29892, 3186, 29991]

  // Decode back to text
  final text = tokenizer.decode(encoding.ids, skipSpecialTokens: true);
  print(text); // Hello, world!
}

Usage

Single Text Encoding

final encoding = tokenizer.encode('Hello world');

print(encoding.tokens);           // Token strings
print(encoding.ids);              // Token IDs (Int32List)
print(encoding.attentionMask);    // Attention mask (Uint8List)
print(encoding.typeIds);          // Type IDs (Uint8List)
print(encoding.offsets);          // Character offsets [(start, end), ...]
print(encoding.wordIds);          // Word indices
print(encoding.sequenceIds);      // Sequence indices (0, 1, or null)

// Without special tokens
final raw = tokenizer.encode('Hello', addSpecialTokens: false);

Note: input text longer than 500,000 characters throws an ArgumentError to prevent OOM.
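
For longer documents, one simple approach is to split the input into chunks below the cap and encode each chunk. This is an illustrative sketch (encodeChunks is a hypothetical helper, not part of the package); a real pipeline would prefer sentence or paragraph boundaries so words are not cut in half:

const maxChars = 500000;

// Encode an arbitrarily long text as a sequence of sub-cap chunks.
Iterable<Encoding> encodeChunks(SentencePieceTokenizer tokenizer, String text) sync* {
  for (var i = 0; i < text.length; i += maxChars) {
    final end = i + maxChars < text.length ? i + maxChars : text.length;
    yield tokenizer.encode(text.substring(i, end));
  }
}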

Sentence Pair Encoding

// For QA, NLI, sentence similarity tasks
final encoding = tokenizer.encodePair(
  'What is machine learning?',
  'Machine learning is a subset of AI.',
);

print(encoding.typeIds);     // [0,0,0,0,0,0, 1,1,1,1,1,1,1]
print(encoding.sequenceIds); // [null,0,0,0,0,null, 1,1,1,1,1,1,null]

// With truncation
final encoding = tokenizer.encodePair(
  longQuestion,
  longAnswer,
  maxLength: 512,
  strategy: TruncationStrategy.longestFirst,
);

Batch Encoding

// Sequential batch
final encodings = tokenizer.encodeBatch(['Hello', 'World', 'Test']);

// Parallel batch (uses Isolates for batches >= 8)
final encodings = await tokenizer.encodeBatchParallel(texts);

// Pair batch
final pairs = [('Q1', 'A1'), ('Q2', 'A2')];
final encodings = tokenizer.encodePairBatch(pairs, maxLength: 256);

Padding

// Fluent API
final tokenizer = SentencePieceTokenizer.fromModelFileSync('model.model')
  ..enablePadding(length: 512, direction: SpPaddingDirection.right);

// Or pad to longest in batch
tokenizer.enablePadding(); // Auto-pads to longest

// Manual padding
final padded = encoding.withPadding(
  targetLength: 128,
  padTokenId: tokenizer.vocab.padId,
  padOnRight: true,
);

// Pad to multiple of N
final padded = encoding.withPaddingToMultipleOf(
  multiple: 8,
  padTokenId: tokenizer.vocab.padId,
);
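
Padding also extends the attention mask with zeros, so downstream models can ignore the pad positions. A quick illustration (exact lengths and mask contents depend on the tokenizer):

final short = tokenizer.encode('Hi', addSpecialTokens: false);
final maskedPad = short.withPadding(
  targetLength: 6,
  padTokenId: tokenizer.vocab.padId, // -1 if the model defines no pad token
  padOnRight: true,
);
print(maskedPad.attentionMask); // e.g. [1, 1, 0, 0, 0, 0]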

Truncation

// Fluent API
final tokenizer = SentencePieceTokenizer.fromModelFileSync('model.model')
  ..enableTruncation(maxLength: 512, direction: SpTruncationDirection.right);

// Manual truncation
final truncated = encoding.withTruncation(maxLength: 64);

// Truncation strategies for pairs
final (truncA, truncB) = Encoding.truncatePair(
  encodingA: encodingA,
  encodingB: encodingB,
  maxLength: 128,
  strategy: TruncationStrategy.longestFirst,
);

Truncation Strategies:

  • longestFirst - Remove from the longest sequence iteratively (sketched below)
  • onlyFirst - Truncate first sequence only
  • onlySecond - Truncate second sequence only
  • doNotTruncate - No truncation
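
Illustrative logic of longestFirst (a sketch, not the library's internal code): drop one token at a time from whichever sequence is currently longer until the pair fits.

(List<int>, List<int>) longestFirstSketch(
    List<int> a, List<int> b, int maxLength) {
  final xa = List<int>.of(a);
  final xb = List<int>.of(b);
  while (xa.length + xb.length > maxLength) {
    // Always trim the currently longer sequence (ties trim the first).
    if (xa.length >= xb.length) {
      xa.removeLast();
    } else {
      xb.removeLast();
    }
  }
  return (xa, xb);
}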

Offset Mapping

final encoding = tokenizer.encode('Hello world');

// Character position -> Token index
final tokenIdx = encoding.charToToken(6); // 'w' -> token index

// Token index -> Character span
final (start, end) = encoding.tokenToChars(1)!; // token -> (0, 5)

// Word index -> Token span
final (startToken, endToken) = encoding.wordToTokens(0)!;

// Token -> Word index
final wordIdx = encoding.tokenToWord(1);

// Token -> Sequence index (0, 1, or null for special tokens)
final seqIdx = encoding.tokenToSequence(1);
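
Together these make span extraction straightforward, e.g. recovering the exact source text behind a token:

const text = 'Hello world';
final enc = tokenizer.encode(text);
final span = enc.tokenToChars(1); // null for special tokens
if (span != null) {
  final (start, end) = span;
  print(text.substring(start, end)); // Hello
}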

Vocabulary Access

print(tokenizer.vocabSize);     // 32000
print(tokenizer.vocab.unkId);   // 0
print(tokenizer.vocab.bosId);   // 1
print(tokenizer.vocab.eosId);   // 2
print(tokenizer.vocab.padId);   // -1 (if not defined)

// Token <-> ID conversion
tokenizer.convertTokensToIds(['▁hello', '▁world']); // [15043, 3186]
tokenizer.convertIdsToTokens([15043, 3186]);         // ['▁hello', '▁world']

// Check if token exists
tokenizer.vocab.contains('▁hello'); // true

// Get vocabulary map
final vocabMap = tokenizer.vocab.vocabularyMap; // Map<String, int>

Decoding

// Decode with special tokens
final text = tokenizer.decode(encoding.ids, skipSpecialTokens: false);

// Decode without special tokens (default: true)
final text = tokenizer.decode(encoding.ids);

// Batch decode
final texts = tokenizer.decodeBatch(idsBatch);

ONNX Runtime Integration

Use with ONNX Runtime for on-device ML inference:

import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';
import 'dart:typed_data';

final tokenizer = SentencePieceTokenizer.fromModelFileSync('model.model',
    config: SentencePieceConfig.llama)
  ..enableTruncation(maxLength: 512);

final encoding = tokenizer.encode('What is machine learning?');

// Encoding.ids is already Int32List, convert to Int64List for ONNX
final inputIds = Int64List.fromList(encoding.ids);
final attentionMask = Int64List.fromList(encoding.attentionMask);

// Pass to ONNX session
// final outputs = await session.run({
//   'input_ids': inputIds,
//   'attention_mask': attentionMask,
// });
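
For batched inference you typically need a rectangular [batch, seqLen] tensor. A minimal sketch, assuming fixed-length padding is enabled so every encoding comes back at 512 tokens (how you wrap the flat buffer depends on your ONNX runtime binding):

tokenizer.enablePadding(length: 512);
final batch = tokenizer.encodeBatch(['first question', 'second question']);

// Flatten ids into one Int64List backing a [batch, 512] tensor.
final flatIds = Int64List(batch.length * 512);
for (var i = 0; i < batch.length; i++) {
  flatIds.setAll(i * 512, batch[i].ids);
}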

Configuration

// Gemma: adds BOS and EOS tokens
final gemmaTokenizer = SentencePieceTokenizer.fromModelFileSync(
  'gemma.model',
  config: SentencePieceConfig.gemma,
);

// Llama: adds BOS token only
final llamaTokenizer = SentencePieceTokenizer.fromModelFileSync(
  'llama.model',
  config: SentencePieceConfig.llama,
);

// Custom configuration
final customTokenizer = SentencePieceTokenizer.fromModelFileSync(
  'model.model',
  config: const SentencePieceConfig(
    addBosToken: true,
    addEosToken: false,
  ),
);

Config                     BOS Token  EOS Token  Use Case
SentencePieceConfig()      No         No         Raw tokenization
SentencePieceConfig.gemma  Yes        Yes        Gemma models
SentencePieceConfig.llama  Yes        No         Llama models
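
A quick way to verify what a config adds (IDs shown for a Llama-style vocabulary where BOS = 1):

final enc = llamaTokenizer.encode('Hi');
print(enc.ids.first);                          // 1 (BOS prepended)
print(llamaTokenizer.numSpecialTokensToAdd()); // 1 (BOS only)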

API Reference

SentencePieceTokenizer

Method                                 Description
fromModelFile(path, config?)           Load from .model file (async)
fromModelFileSync(path, config?)       Load from .model file (sync)
fromBytes(bytes, config?)              Load from byte data
encode(text, addSpecialTokens?)        Encode single text
encodePair(textA, textB, ...)          Encode a text pair
encodeBatch(texts, addSpecialTokens?)  Encode multiple texts
encodeBatchParallel(texts, ...)        Parallel batch encoding
encodePairBatch(pairs, ...)            Batch encode text pairs
decode(ids, skipSpecialTokens?)        Decode IDs to text
decodeBatch(idsBatch, ...)             Batch decode
enablePadding() / noPadding()          Configure padding
enableTruncation() / noTruncation()    Configure truncation
convertTokensToIds(tokens)             Convert tokens to IDs
convertIdsToTokens(ids)                Convert IDs to tokens
numSpecialTokensToAdd(isPair?)         Get special token count

Encoding

Property           Type              Description
tokens             List<String>      Token strings
ids                Int32List         Token IDs
attentionMask      Uint8List         Attention mask (1=attend, 0=ignore)
typeIds            Uint8List         Token type IDs (0=first, 1=second)
specialTokensMask  Uint8List         Special token mask
offsets            List<(int, int)>  Character offsets
wordIds            List<int?>        Word indices
sequenceIds        List<int?>        Sequence indices
length             int               Number of tokens

Performance

Metric             Value
Throughput         ~500K+ tokens/sec
Model loading      ~50ms (32K vocab)
Memory (vocab)     ~3MB
Lookup complexity  O(k) per token
BPE merge          O(1) per merge
Max input length   500,000 chars

Memory Efficiency

Uses typed arrays for 50-70% memory reduction:

Field              Type       Bytes/token
ids                Int32List  4
typeIds            Uint8List  1
attentionMask      Uint8List  1
specialTokensMask  Uint8List  1
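
As a rough illustration, the four numeric fields above cost 4 + 1 + 1 + 1 = 7 bytes per token, versus 32 bytes per token if all four were stored as Int64List.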

Model File

Download SentencePiece models from Hugging Face, e.g. the tokenizer.model shipped with Llama- or Gemma-family repositories.

Format: binary protobuf (.model files produced by the SentencePiece C++ library).

Testing

# Run all tests (158 tests)
dart test

# Run specific test file
dart test test/sentencepiece_test.dart

# Run benchmarks
dart run benchmark/performance_benchmark.dart

HuggingFace Compatibility Verification

# Run HuggingFace compatibility benchmark
dart run benchmark/hf_compatibility_benchmark.dart

# Regenerate benchmark expected values (requires Python + sentencepiece)
pip install sentencepiece
python scripts/generate_hf_benchmark_data.py --model tokenizer.model

License

MIT License
