Transformer-Toolkit
Minimal, modular transformer library for training your own LLM
Overview
Transformer-Toolkit is a production-ready, modular transformer library built from scratch for training and experimenting with modern LLM architectures. Build compact, customizable models with swappable components — attention mechanisms, positional encodings, feed-forward networks, and normalization strategies — all configured via a single TransformerConfig object.
Why Transformer-Toolkit?
- Fully modular: Pick your attention type, FFN variant, positional encoding, and normalization independently
- Production features: Mixed precision training, gradient checkpointing, KV caching, HuggingFace Hub integration
- SFT support: Full supervised fine-tuning pipeline with multi-turn conversation handling and loss masking
- Fast tokenizers: Rust-backed BPE tokenizer trained 100x faster than pure Python
- Efficient dataloading: Memmap-based dataset loading supports GB-scale datasets with zero RAM overhead
- Clear inference API: Temperature, top-k, top-p, repetition penalty controls out-of-the-box
pip install transformer-toolkit
Core Modules
Transformer Model
Location: transformer_toolkit/model.py
Main Classes: Transformer, TransformerConfig, TransformerBlock
The core transformer model built from composable modules. A single TransformerConfig object controls every architectural decision.
TransformerConfig
Complete model configuration via a dataclass. All attributes are optional with sensible defaults.
from transformer_toolkit.model import TransformerConfig
cfg = TransformerConfig(
# ── core ──────────────────────────────────────────────────────────
vocab_size = 32000, # tokenizer vocabulary size
dim = 512, # model embedding dimension
n_layers = 8, # number of transformer blocks
n_heads = 8, # number of attention heads
max_seq = 2048, # maximum sequence length
# ── attention ─────────────────────────────────────────────────────
attn = "gqa", # "mha" | "gqa" | "mqa" | "flash" | "mla"
n_kv_heads = 4, # gqa only — n_heads must be divisible by n_kv_heads
latent_dim = 64, # mla only — latent compression dimension
# ── feed-forward ──────────────────────────────────────────────────
ffn = "swiglu", # "ffn" | "relu_ffn" | "glu" | "reglu" | "geglu"
# | "swiglu" | "moe" | "moe_ec" | "moe_shared"
hidden_dim = 2048, # FFN inner dimension (default: dim × 4)
n_experts = 8, # moe / moe_ec / moe_shared — total experts
top_k = 2, # moe / moe_shared — experts activated per token
moe_aux_weight = 0.01, # moe / moe_shared — load-balancing loss coefficient
moe_capacity = 1.0, # moe_ec — capacity factor
moe_n_shared = 2, # moe_shared — always-active experts
moe_n_routed = 6, # moe_shared — sparse routed experts
# ── normalization ─────────────────────────────────────────────────
norm = "rmsnorm", # "rmsnorm" | "layernorm"
eps = 1e-6,
# ── positional encoding ───────────────────────────────────────────
pos_enc = "rope", # "rope" | "sinusoidal" | "learned" | "alibi" | "none"
# ── regularisation ────────────────────────────────────────────────
dropout = 0.0, # 0.0 recommended for SFT and inference
# ── output ────────────────────────────────────────────────────────
tie_weights = True, # share embedding and output projection weights
# ── inference ─────────────────────────────────────────────────────
use_kv_cache = False, # enable KV cache during generation (inference only)
)
Key attributes:
| Attribute | Type | Default | Purpose |
|---|---|---|---|
| vocab_size | int | 32000 | Tokenizer vocabulary size |
| dim | int | 512 | Model hidden dimension |
| n_layers | int | 8 | Number of transformer blocks |
| n_heads | int | 8 | Number of attention heads |
| max_seq | int | 2048 | Maximum sequence length |
| attn | str | "gqa" | Attention mechanism variant |
| ffn | str | "swiglu" | Feed-forward network variant |
| norm | str | "rmsnorm" | Normalization layer type |
| pos_enc | str | "rope" | Positional encoding type |
| tie_weights | bool | True | Share embedding and output weights |
Transformer
Main model class. Initialize with a config and send to device.
import torch
from transformer_toolkit.model import Transformer, TransformerConfig
cfg = TransformerConfig(vocab_size=8000, dim=384, n_layers=6)
model = Transformer(cfg).to("cuda")
# Get parameter count (human readable)
print(model.n_params()) # "12.45M"
# Forward pass — returns (logits, aux_loss)
logits, aux_loss = model(tokens) # tokens: [B, T] → logits: [B, T, vocab_size]
# Generation with temperature, top-k, top-p
output = model.generate(
tokens = prompt_ids, # [B, T]
max_new = 200,
temperature = 0.8,
top_k = 40,
)
Methods:
- forward(tokens) → (logits, aux_loss): Main forward pass. Returns logits and auxiliary MoE loss (0.0 if not MoE).
- generate(tokens, max_new, temperature, top_k, top_p) → tokens: Auto-regressive generation with sampling.
- n_params() → str: Human-readable parameter count.
- debug_gradients() → None: Print gradient statistics per parameter (call after loss.backward()).
- debug_weights() → None: Print weight statistics per parameter.
- state_dict_for_save() → dict: For weight-tied models, strips redundant weights before saving.
- load_state_dict_with_tie(state_dict) → None: For weight-tied models, restores weights correctly after loading.
Weight Tying
Weight tying shares the embedding matrix with the output projection, reducing parameters. However, it requires careful initialization.
When tied, nn.Embedding initializes with N(0, 1) — values around ±5. Without scaling, this produces logits of ±400 instead of ±3, crashing training.
Recommended: Disable tying for training from scratch:
cfg = TransformerConfig(tie_weights=False)
If enabling tying, scale the embedding at init:
model = Transformer(cfg).to("cuda")
if cfg.tie_weights:
with torch.no_grad():
model.embed.weight.mul_(0.02) # scale into ±3 range
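A minimal save/load sketch using the tying-aware helpers listed under Methods above (assuming a plain torch checkpoint; adapt paths to your project):
import torch
# Save: strip the redundant tied projection weight before writing the checkpoint
torch.save(model.state_dict_for_save(), "checkpoints/tied_model.pt")
# Load: restore the tie after reading the stripped state dict
state = torch.load("checkpoints/tied_model.pt", map_location="cuda")
model.load_state_dict_with_tie(state)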
Debug Mode
Enable debug=True to inspect model structure and forward pass:
model = Transformer(cfg, debug=True).to("cuda")
# Prints model summary at construction:
# 🏗️ Model summary
# params 16.35M
# dim 384
# n_layers 6
# entropy check → should be > 90% of log(vocab_size) at init
Turn off after inspecting (runs on every forward pass):
model.debug = False
Attention Mechanisms
Location: transformer_toolkit/attention.py
Five attention variants, all swappable via TransformerConfig.attn. Pick the right one for your use case:
| Value | Class | Key property | Used in |
|---|---|---|---|
"mha" |
MultiHeadAttention |
Full KV cache per head | Original Transformer, BERT, GPT-2 |
"gqa" |
GroupedQueryAttention |
Grouped KV heads (4x-8x faster) | LLaMA 3, Mistral |
"mqa" |
MultiQueryAttention |
Single KV head (fastest) | Falcon, early Gemini |
"flash" |
FlashAttention |
Fused CUDA kernels | All (PyTorch ≥ 2.0) |
"mla" |
MLAttention |
Latent compression | DeepSeek-V2/V3 |
Multi-Head Attention (MHA)
Classic attention from the original Transformer. Each head has separate K/V caches.
cfg = TransformerConfig(
dim = 512,
n_heads = 8,
attn = "mha", # full cache: [B, n_heads, T, head_dim]
)
Good for: Small models where memory is not a constraint.
Grouped Query Attention (GQA)
Multiple query heads share a smaller number of key-value heads, reducing KV cache size by a factor of n_heads / n_kv_heads.
cfg = TransformerConfig(
dim = 512,
n_heads = 8,
attn = "gqa",
    n_kv_heads  = 2,       # 2 KV heads, each shared by a group of 4 query heads
)
# Constraint: n_heads % n_kv_heads == 0
Good for: Production models. LLaMA-3 uses n_heads=128, n_kv_heads=8 for massive KV cache reduction.
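A quick back-of-the-envelope check of the cache saving (illustrative arithmetic only, not library code):
# KV cache size per token scales with the number of KV heads
n_heads, n_kv_heads = 8, 2
print(n_heads / n_kv_heads)   # 4.0: KV cache 4x smaller than full MHA
print(128 / 8)                # 16.0: the LLaMA-3 configuration noted above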
Multi-Query Attention (MQA)
All heads share a single K/V head. Fastest for inference, less expressive.
cfg = TransformerConfig(
dim = 512,
n_heads = 8,
attn = "mqa",
n_kv_heads = 1, # all 8 heads share 1 KV head
)
Flash Attention
Uses PyTorch's fused scaled_dot_product_attention (requires PyTorch ≥ 2.0 + CUDA). Faster and more memory efficient.
cfg = TransformerConfig(
dim = 512,
n_heads = 8,
attn = "flash", # uses fused kernel — no extra KV head config
)
Multi-Latent Attention (MLA)
DeepSeek's variant. Compresses Q/K/V to a lower-dimensional latent space via latent_dim.
cfg = TransformerConfig(
dim = 512,
n_heads = 8,
attn = "mla",
latent_dim = 128, # compression dimension — typically dim/4
)
RoPE (Rotary Position Encoding) is applied inside each attention module to Q and K after head-splitting. The rotation tables are computed once per forward pass and shared across all layers.
ALiBi bias is computed once per forward pass and passed as an additive mask to every attention block.
Causal masking is applied automatically. No manual mask needed for standard language model training.
Feed-Forward Networks
Location: transformer_toolkit/feed_forward.py
Nine FFN variants, each trading off between simplicity and expressiveness.
| Class | Activation | Formula | Use case |
|---|---|---|---|
| FFN | GELU | dense → GELU → dense | Original Transformer |
| ReLUFFN | ReLU | dense → ReLU → dense | Classic (older) |
| GLU | Sigmoid | (dense ⊙ sigmoid(dense)) | Simple gating |
| ReGLU | ReLU gating | (dense ⊙ ReLU(dense)) | ReLU-gated variant |
| GeGLU | GELU gating | (dense ⊙ GELU(dense)) | GELU-gated variant |
| SwiGLU | Swish gating | (dense ⊙ Swish(dense)) | LLaMA, Mistral, Qwen (recommended) |
| MoE | Sparse routing | k-of-n experts (load balanced) | Sparse scaling |
| MoE_EC | Expert choice | Token-to-expert assignment | Balanced capacity |
| MoE_Shared | Hybrid | Always-active + sparse experts | Best of both |
Standard FFNs
# Original Transformer
cfg = TransformerConfig(ffn="ffn", hidden_dim=2048)
# ReLU variant (older)
cfg = TransformerConfig(ffn="relu_ffn", hidden_dim=2048)
Gated FFNs
Gated FFNs learn to selectively activate subsets of parameters. Generally outperform standard FFN.
# SwiGLU — most popular (LLaMA, Mistral, Qwen)
cfg = TransformerConfig(ffn="swiglu", hidden_dim=2048)
# Other gates
cfg = TransformerConfig(ffn="geglu", hidden_dim=2048) # GELU gate
cfg = TransformerConfig(ffn="reglu", hidden_dim=2048) # ReLU gate
cfg = TransformerConfig(ffn="glu", hidden_dim=2048) # Sigmoid gate
Mixture of Experts (MoE)
Conditionally activates only top_k out of n_experts experts per token. Dramatically scales parameter count without scaling compute.
Standard MoE — Each token independently chooses its top-k experts. Requires load-balancing loss to prevent expert collapse.
cfg = TransformerConfig(
ffn = "moe",
n_experts = 8,
top_k = 2,
moe_aux_weight = 0.01, # Mixtral uses 0.02
)
logits, aux_loss = model(tokens)
ce_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss = ce_loss + aux_loss # Always add aux_loss for MoE
The Trainer handles aux_loss automatically — no changes needed.
Expert Choice MoE (moe_ec) — Experts choose which tokens they process, not vice versa. Better load balancing and lower variance.
cfg = TransformerConfig(
ffn = "moe_ec",
n_experts = 8,
moe_capacity = 1.25, # capacity factor
)
Shared Expert MoE (moe_shared) — Hybrid approach. Some experts are always active, some are sparse.
cfg = TransformerConfig(
ffn = "moe_shared",
n_experts = 8, # total experts
    moe_n_shared = 2,       # always active
    moe_n_routed = 6,       # sparse routed
top_k = 2, # routed tokens choose top-k
)
Normalization Layers
Location: transformer_toolkit/normalization.py
Three normalization options, each with tradeoffs:
| Class | Subtraction | Bias | Scaling | Speed | Used in |
|---|---|---|---|---|---|
| LayerNorm | Yes (μ) | Yes | Yes | Slower | BERT, GPT-2 |
| RMSNorm | No | No | Yes | Faster | LLaMA, Mistral, Qwen |
| DeepNorm | Both per block | Both | Both | Slower | 1000+ layer transformers |
LayerNorm
Classic normalization. Subtracts mean and divides by standard deviation.
from transformer_toolkit.normalization import LayerNorm
norm = LayerNorm(dim=512, eps=1e-5)
x_norm = norm(x)
Formula: $\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
Good for: Training stability, better for smaller models.
RMSNorm
Root Mean Square normalization. No mean subtraction, no bias — faster and cleaner.
from transformer_toolkit.normalization import RMSNorm
norm = RMSNorm(dim=512, eps=1e-6)
x_norm = norm(x)
Formula: $\text{RMSNorm}(x) = \gamma \cdot \frac{x}{\sqrt{\text{RMS}(x)^2 + \epsilon}}$
Good for: Modern LLMs (LLaMA, Mistral, Qwen). Slightly lower memory, same training stability.
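A minimal reference implementation of the formula above (for illustration; use the library's RMSNorm class in practice):
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learnable scale, no bias

    def forward(self, x):
        # x / sqrt(mean(x^2) + eps), scaled by gamma; no mean subtraction
        return self.gamma * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)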
DeepNorm
Specialized for very deep transformers (1000+ layers). Scales residuals before normalization.
from transformer_toolkit.normalization import DeepNorm
norm = DeepNorm(dim=512, alpha=2.0)
x_norm = norm(x, residual)
Good for: Ultra-deep models. Not needed for typical 6-32 layer models.
Positional Encodings
Location: transformer_toolkit/positional_encodings.py
Five positional encoding strategies. RoPE is applied inside attention; others are applied to the residual stream before the first block.
| Type | Location | Learnable | Best for |
|---|---|---|---|
| RoPE | Inside attention (q, k) | No | Modern models (LLaMA, Mistral, Qwen) |
| SinusoidalPE | Residual stream | No | Original Transformer |
| LearnedPE | Residual stream | Yes | BERT, GPT-2 |
| ALiBi | Attention scores | No | Length generalization |
| none | Not applied | — | Ablation studies |
RoPE (Rotary Position Encoding)
Applies rotation matrices to query and key vectors inside attention. No learnable parameters.
cfg = TransformerConfig(pos_enc="rope") # default
Advantages: Enables length extrapolation (infer beyond training length), used in LLaMA, Mistral, Qwen.
Details: Applied after head-splitting, so each attention head gets independent rotations.
Sinusoidal Positional Encoding
Fixed sine/cosine patterns added to embeddings before the transformer blocks. From the original Transformer paper.
cfg = TransformerConfig(pos_enc="sinusoidal", max_seq=2048)
Formula: $PE_{(pos, 2i)} = \sin(\text{pos} / 10000^{2i/d})$, $PE_{(pos, 2i+1)} = \cos(\text{pos} / 10000^{2i/d})$
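A short sketch that builds the table from this formula (illustrative; the library's SinusoidalPE class handles this internally; assumes dim is even):
import torch

def sinusoidal_table(max_seq, dim):
    pos  = torch.arange(max_seq, dtype=torch.float32).unsqueeze(1)   # [max_seq, 1]
    even = torch.arange(0, dim, 2, dtype=torch.float32)              # 2i = 0, 2, 4, ...
    div  = torch.pow(10000.0, even / dim)                            # 10000^(2i/d)
    pe   = torch.zeros(max_seq, dim)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe   # added to the embeddings [B, T, dim] before the first block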
Learned Positional Encoding
Trainable embedding table for positions. Used in BERT and GPT-2.
cfg = TransformerConfig(pos_enc="learned", max_seq=2048)
ALiBi (Attention with Linear Biases)
Adds linear biases to attention scores based on relative position. No learnable parameters, supports arbitrary lengths.
cfg = TransformerConfig(pos_enc="alibi")
Advantages: Enables length generalization without training on longer sequences.
No Positional Encoding
Ablation option — model receives no position information.
cfg = TransformerConfig(pos_enc="none") # for ablation studies
Transformer Block
Location: transformer_toolkit/block.py
A single transformer block. Pre-norm architecture: norm → attention → residual → norm → ffn → residual.
Key features:
- Gradient checkpointing — trade compute for memory (~20% slower, 60% less VRAM)
- KV caching — for fast inference
- Auxiliary MoE loss — if using MoE FFN
- Flexible component swapping — inject any attention, FFN, norm
from transformer_toolkit.block import TransformerBlock
block = TransformerBlock(
dim = 512,
n_heads = 8,
hidden = 2048,
norm = None, # None = default to LayerNorm
attn = None, # None = default to MultiHeadAttention
ffn = None, # None = default to FFN
dropout = 0.1,
use_checkpoint = False, # enable for large models
)
# Forward returns (output, aux_loss, present_kv)
x, aux_loss, present_kv = block(x, past_kv=None)
Gradient checkpointing (for memory efficiency):
block = TransformerBlock(
..., use_checkpoint=True
)
# Recomputes activations during backward — saves ~60% VRAM, ~20% slower
Data & Tokenization
Tokenizers
Location: transformer_toolkit/c_tokenizers.py
Three tokenizer classes with a unified interface. Each implements encode(), decode(), train(), save(), load().
from transformer_toolkit.c_tokenizers import (
ByteLevelTokenizer,
RustBPETokenizer,
HFTokenizer,
)
ByteLevelTokenizer
Zero dependencies. Every byte (0-255) is a token. Works on any text or encoding immediately.
tok = ByteLevelTokenizer()
ids = tok.encode("Hello") # [72, 101, 108, 108, 111]
txt = tok.decode(ids) # "Hello"
print(tok.vocab_size) # 256
Pros: Universal, zero setup.
Cons: Inefficient — long sequences need many tokens.
RustBPETokenizer
Byte-Pair Encoding backed by HuggingFace's Rust tokenizers library. ~100x faster than pure Python BPE.
Installation:
pip install tokenizers
Usage:
from transformer_toolkit.c_tokenizers import RustBPETokenizer
tok = RustBPETokenizer()
# Train once
tok.train(
texts=open("data.txt", encoding="utf-8").readlines(),
vocab_size=8000
)
tok.save("tokenizer.json")
# On subsequent runs — just load
tok = RustBPETokenizer()
tok.load("tokenizer.json")
ids = tok.encode("Hello world")
txt = tok.decode(ids)
print(tok.vocab_size) # 8000
Special tokens: All SFT-related special tokens (chat format tokens, BOS, EOS, PAD) are registered automatically at train time.
tok.train(texts=lines, vocab_size=32000)
# Special tokens registered automatically at fixed IDs:
# ID 0: [UNK], ID 1: [PAD], ID 2: [BOS], ID 3: [EOS]
# ID 7: <|im_start|>, ID 8: <|im_end|> (ChatML)
# ID 9: <|start_header_id|>, ID 10: <|end_header_id|> (LLaMA3)
# ... and more (see README "Chat Templates" section)
Call tok.validate_template() before SFT to ensure all special tokens are properly registered:
from transformer_toolkit.chat_template import ChatTemplate
template = ChatTemplate("llama3")
tok.validate_template(template)
# Raises if special tokens are fragmented (not single vocab entries)
HFTokenizer
Thin wrapper around any HuggingFace pretrained tokenizer. Access thousands of pretrained tokenizers.
Installation:
pip install transformers
Usage:
from transformer_toolkit.c_tokenizers import HFTokenizer
tok = HFTokenizer("gpt2")
ids = tok.encode("Hello world")
txt = tok.decode(ids)
print(tok.vocab_size) # 50257
# Load any HuggingFace tokenizer
tok = HFTokenizer("meta-llama/Llama-2-7b-hf")
tok = HFTokenizer("mistralai/Mistral-7B-Instruct-v0.1")
Dataloader
Location: transformer_toolkit/dataloader.py
Efficient data pipeline for training. Supports multiple sources and loading strategies, with memmap for minimal memory overhead on large datasets.
DataConfig
from transformer_toolkit.dataloader import DataConfig
cfg = DataConfig(
seq_len = 512, # sequence length fed to model
batch_size = 32, # samples per batch
split = 0.9, # fraction for training (rest for validation)
stride = None, # None = non-overlapping; int = overlapping windows
shuffle = True, # shuffle training data
num_workers = 4, # parallel dataloading workers
    pin_memory  = True,   # pin host memory for faster CPU→GPU transfer
debug = False, # print sample preview
debug_n = 3, # number of debug samples
)
Key attributes:
| Attribute | Default | Purpose |
|---|---|---|
| seq_len | 512 | Sequence length for model |
| batch_size | 32 | Samples per batch |
| split | 0.9 | Train/val split ratio |
| stride | None | Window stride (None = seq_len, no overlap) |
| shuffle | True | Shuffle training batches |
| num_workers | 4 | Parallel loading workers |
| debug | False | Print decoded sample preview |
stride parameter:
- stride=None (default): non-overlapping windows, fewer clean samples
- stride=<int>: overlapping windows, many more samples but faster overfitting on small data
Example: 1.86M tokens with seq_len=128:
- stride=None (i.e. 128): ~14,600 samples
- stride=1: ~1.86M samples (rapid overfitting)
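The counts follow from the usual sliding-window formula (a quick sanity check, assuming windows of seq_len tokens stepped by stride):
def n_windows(n_tokens, seq_len, stride):
    return (n_tokens - seq_len) // stride + 1

print(n_windows(1_860_000, 128, 128))   # 14531, close to the ~14,600 quoted above
print(n_windows(1_860_000, 128, 1))     # 1859873, roughly 1.86M overlapping samples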
Loading from a Binary File
One-time tokenization:
from transformer_toolkit.dataloader import save_binary, from_binary
# Tokenize once, save binary
tokens = tok.encode(open("data.txt", encoding="utf-8").read())
save_binary(tokens, "data.bin")
# Load anytime
cfg = DataConfig(seq_len=128, batch_size=32)
train_dl, val_dl = from_binary("data.bin", cfg, tokenizer=tok)
With automatic NPY split (recommended for reuse):
train_dl, val_dl = from_binary(
"data.bin", cfg,
train_path="train.npy", # saves splits here
val_path="val.npy",
tokenizer=tok,
)
On subsequent runs, load the pre-split .npy files directly (no re-tokenization needed).
Memmap — Load Pre-split NPY Files
After first run, skip tokenization. The .npy files stay on disk — only accessed pages load into RAM. Scales to 100GB+ datasets.
from transformer_toolkit.dataloader import from_npy_split
cfg = DataConfig(seq_len=512, batch_size=32)
train_dl, val_dl = from_npy_split(
"train.npy", "val.npy", cfg, tokenizer=tok
)
Loading from Text Files
Multiple text files:
from transformer_toolkit.dataloader import from_files
train_dl, val_dl = from_files(
paths=["data1.txt", "data2.txt", "data3.txt"],
tokenizer=tok,
cfg=cfg,
train_path="train.npy", # optional — saves splits for future reuse
val_path="val.npy",
bos_id=tok.bos_id, # optional — wrap documents with BOS/EOS
eos_id=tok.eos_id,
)
Loading from HuggingFace
Streaming (no disk required, works with infinite datasets):
from transformer_toolkit.dataloader import from_hf
cfg_stream = DataConfig(seq_len=512, batch_size=16, streaming=True)
train_dl, val_dl = from_hf(
dataset_name="roneneldan/TinyStories",
tokenizer=tok,
cfg=cfg_stream,
)
In-memory (download fully, split, optionally save as .npy):
train_dl, val_dl = from_hf(
dataset_name="roneneldan/TinyStories",
tokenizer=tok,
cfg=cfg,
text_col="text", # column containing text
bos_id=tok.bos_id,
eos_id=tok.eos_id,
train_path="train.npy", # save splits for future memmap loads
val_path="val.npy",
)
Dataloader Debug Mode
Preview decoded samples before training:
cfg = DataConfig(seq_len=128, batch_size=32, debug=True, debug_n=2)
train_dl, val_dl = from_binary("data.bin", cfg, tokenizer=tok)
Output:
🔍 Debug samples (train)
seq_len=128 stride=128 batch_size=32
sample 1
x ids : [23, 451, 12, 8, 1203 ...] ... +121
y ids : [451, 12, 8, 1203, 44 ...] ... +121
x text: 'ROMEO:\nBut soft, what light through yonder window...'
y text: '\nBut soft, what light through yonder window breaks'
✓ x/y alignment correct (y = x shifted by 1)
Training
Pretraining
Location: transformer_toolkit/trainer.py
Full training loop for pretraining on raw text. Handles optimizer, learning rate schedule, gradient clipping, mixed precision, HuggingFace Hub integration, and graceful interruption.
TrainConfig
from transformer_toolkit.trainer import TrainConfig
cfg = TrainConfig(
# ── steps ─────────────────────────────────────────────────────────
max_steps = 10000, # total optimizer steps
eval_every = 500, # validation frequency
save_every = 1000, # checkpoint frequency
log_every = 50, # print loss every N steps
interruptible = True, # Ctrl+C → clean checkpoint
# ── optimiser ─────────────────────────────────────────────────────
lr = 3e-4, # peak learning rate after warmup
min_lr = 3e-5, # floor LR at end of cosine decay
weight_decay = 0.1, # L2 penalty on 2D weights only
beta1 = 0.9, # AdamW β₁
beta2 = 0.95, # AdamW β₂
grad_clip = 1.0, # max gradient norm
# ── lr schedule ───────────────────────────────────────────────────
warmup_steps = 200, # linear ramp from 0 to peak_lr
# ── efficiency ────────────────────────────────────────────────────
grad_accum_steps = 4, # effective batch = batch_size × grad_accum
mixed_precision = True, # automatic bf16/fp16 on CUDA
grad_checkpoint = False, # recompute activations (~20% slower, 60% less VRAM)
# ── checkpoints ───────────────────────────────────────────────────
ckpt_dir = "checkpoints",
save_best = True, # save best.pt when val loss improves
save_step_ckpts = True, # save step_N.pt every save_every steps
# ── huggingface hub ───────────────────────────────────────────────
hf_repo = None, # "username/model-name"
hf_private = True,
hf_push_best = True, # push whenever val loss improves
hf_push_every_n = False, # push every save_every steps
hf_push_end = True, # push at end of training
hf_push_on_pause = True, # push on Ctrl+C
)
Training Loop
from transformer_toolkit.trainer import Trainer
trainer = Trainer(
model = model,
train_dl = train_dl,
val_dl = val_dl,
vocab_size = tok.vocab_size,
cfg = cfg,
tokenizer = tok, # optional — used for Hub uploads
)
# Start training
trainer.train()
# Resume from checkpoint
trainer.train(resume_from="checkpoints/step_2000.pt")
Example training output:
⚡ Transformer Toolkit Trainer
steps=3000 lr=0.0003 warmup=200 accum=4
mixed_precision=True grad_clip=1.0
step 100/3000 ████████░░░░░░░░░░░░░░░░ loss 3.1423 lr 1.5e-04 eta 4m
step 200/3000 ████████████░░░░░░░░░░░░ loss 2.8901 lr 3.0e-04 eta 3m
● eval step 300 val_loss 2.7130 ppl 15.07 ▼0.1823 ★ best
Expected loss curve (healthy run):
| Step | Target val loss | Notes |
|---|---|---|
| init | ~log(vocab_size) | ~8.99 for vocab=8000 |
| 100 | 5-7 | Learning patterns |
| 300 | 3-5 | Confirm training works |
| 1000 | 2-3.5 | Good progress |
| 3000 | 1.5-2.5 | Typical small model |
If val loss > 8.0 at step 300 → initialization issue (check weight tying).
If val loss < 1.0 before step 1000 → overfitting on small dataset.
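The init-loss target is simply the entropy of a uniform distribution over the vocabulary, which is easy to check:
import math
print(math.log(8000))    # 8.987: expected val loss at step 0 for vocab_size=8000
print(math.log(32000))   # 10.373: expected val loss at step 0 for vocab_size=32000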
Supervised Fine-Tuning (SFT)
Location: transformer_toolkit/sft_trainer.py, transformer_toolkit/sft_dataloader.py
Full SFT pipeline — teach a pretrained model to follow instructions in a specific conversation format. Handles multi-turn conversations, loss masking, special tokens validation.
How SFT Works
During pretraining, the model learns language from raw text. SFT teaches it to follow a specific conversation format with proper roles, tokens, and stop conditions.
Key idea: Loss masking. Only the assistant's response contributes to loss:
<|start_header_id|>user<|end_header_id|> → loss=0 (context)
What is Python?<|eot_id|> → loss=0
<|start_header_id|>assistant<|end_header_id|> → loss=0 (header)
Python is a programming language.<|eot_id|> → loss=1 (response)
[EOS] → loss=1 (model learns to stop)
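A common way to implement this is to set every context token's label to an ignore index before the cross-entropy call. A sketch of the general technique (the SFT pipeline handles this for you; logits, targets, and loss_mask are assumed tensors from your batch):
import torch
import torch.nn.functional as F

# targets: [B, T] token ids; loss_mask: [B, T] with 1 on assistant-response tokens
masked_targets = targets.masked_fill(loss_mask == 0, -100)   # -100 is ignored by F.cross_entropy
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),   # [B*T, vocab_size]
    masked_targets.view(-1),            # [B*T]
    ignore_index=-100,
)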
Chat Templates
Location: transformer_toolkit/chat_template.py
Defines how conversations are formatted. Five presets are available; pick one and use it consistently.
Available Presets
| Preset | Format | Special tokens | Modern |
|---|---|---|---|
| llama3 | <\|start_header_id\|>role<\|end_header_id\|>\n\ncontent<\|eot_id\|> | IDs 9-11 | ✓ Recommended |
| chatml | <\|im_start\|>role\ncontent<\|im_end\|> | IDs 7-8 | ✓ Popular |
| gemma | <start_of_turn>role\ncontent<end_of_turn> | IDs 12-13 | ✓ Supported |
| alpaca | ### Instruction:\ncontent\n\n### Response:\ncontent | None | Older |
| raw | User: content\nAssistant: content | None | Fallback |
Using a Chat Template
from transformer_toolkit.chat_template import ChatTemplate
template = ChatTemplate("llama3")
# Format messages for display/logging
msgs = [
{"role": "user", "content": "What is Python?"},
{"role": "assistant", "content": "A programming language."},
]
text, loss_mask_ranges = template.format_messages(msgs)
print(text)
Custom Template
template = ChatTemplate(
preset = "chatml",
assistant_header = "<|im_start|>assistant\n", # loss=0
assistant_closer = "<|im_end|>\n", # loss=1
)
Inference & Utilities
Inference Engine
Location: transformer_toolkit/inference.py
High-level API for generation. Handles sampling parameters, streaming output, device selection.
InferenceConfig
from transformer_toolkit.inference import InferenceConfig
cfg = InferenceConfig(
max_new_tokens = 200, # max tokens to generate
temperature = 0.8, # higher = random, lower = focused
top_k = 50, # keep top-k tokens
top_p = 0.9, # nucleus sampling
repetition_penalty = 1.1, # penalize repeated tokens
stream = True, # print tokens as they generate
device = "cuda", # or "cpu"
)
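For intuition, this is roughly how those knobs transform the next-token logits (a generic illustration, not the Inference engine's internal code):
import torch

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = logits / temperature                          # <1 sharpens, >1 flattens the distribution
    if top_k is not None:                                  # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:                                  # nucleus sampling: drop the low-probability tail
        sorted_p, idx = torch.sort(probs, descending=True)
        drop = torch.cumsum(sorted_p, dim=-1) > top_p
        drop[..., 0] = False                               # always keep the most likely token
        probs = probs.scatter(-1, idx, sorted_p.masked_fill(drop, 0.0))
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)         # sampled token id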
Using Inference
from transformer_toolkit.inference import Inference
inference = Inference(model, tok, cfg)
# Single generation
output = inference.generate(prompt="Once upon a time")
# Streaming output
inference.stream = True
output = inference.generate(prompt="Hello world")
HuggingFace Hub Integration
Location: transformer_toolkit/hf_hub.py
Push and pull models to/from HuggingFace Hub. Automatic during training if configured.
Login
from transformer_toolkit.hf_hub import login
login(token="hf_your_token_here")
Push to Hub
from transformer_toolkit.hf_hub import push_to_hub
push_to_hub(
repo_id = "username/my-model",
model = model,
cfg = cfg_model,
tokenizer = tok,
metrics = {"val_loss": 1.83, "perplexity": 6.23},
step = 3000,
private = True,
)
Pull from Hub
from transformer_toolkit.hf_hub import pull_from_hub
pull_from_hub("username/my-model", save_dir="checkpoints")
# Downloads: model.pt, tokenizer.json, config.json, metrics.json
Color Utilities
Location: transformer_toolkit/colors.py
Internal ANSI color codes for formatted console output. Used throughout the library for training logs, debug output, error messages.
from transformer_toolkit.colors import C
print(f"{C.BOLD}{C.GREEN}Success!{C.RESET}")
print(f"{C.YELLOW}⚠ Warning{C.RESET}")
print(f"{C.RED}✗ Error{C.RESET}")
Colors available: RED, GREEN, YELLOW, BLUE, CYAN, MAGENTA, WHITE
Styles available: BOLD, DIM, RESET
Quick Start
Examples
Small Model — Shakespeare
Complete example training a small transformer on Shakespeare text (< 5 minutes on 4GB GPU).
import torch, os
from transformer_toolkit.model import Transformer, TransformerConfig
from transformer_toolkit.c_tokenizers import RustBPETokenizer
from transformer_toolkit.dataloader import DataConfig, from_binary, from_npy_split, save_binary
from transformer_toolkit.trainer import Trainer, TrainConfig
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# tokenizer — train once, reuse
tok = RustBPETokenizer()
if os.path.exists("tokenizer.json"):
tok.load("tokenizer.json")
else:
tok.train(open("shakespeare.txt", encoding="utf-8").readlines(), vocab_size=8000)
tok.save("tokenizer.json")
# data — tokenize once, reuse memmap splits
cfg_data = DataConfig(seq_len=128, batch_size=32, split=0.9, stride=None)
if os.path.exists("train.npy") and os.path.exists("val.npy"):
train_dl, val_dl = from_npy_split("train.npy", "val.npy", cfg_data, tokenizer=tok)
else:
if not os.path.exists("data.bin"):
save_binary(tok.encode(open("shakespeare.txt", encoding="utf-8").read()), "data.bin")
train_dl, val_dl = from_binary("data.bin", cfg_data,
train_path="train.npy", val_path="val.npy",
tokenizer=tok)
# model
model = Transformer(TransformerConfig(
vocab_size = tok.vocab_size,
dim = 384,
n_layers = 6,
n_heads = 6,
n_kv_heads = 3,
attn = "gqa",
ffn = "swiglu",
hidden_dim = 1536,
norm = "rmsnorm",
pos_enc = "rope",
dropout = 0.1,
tie_weights = False,
)).to(DEVICE)
print(f"params: {model.n_params()}") # ~15M
# train
trainer = Trainer(model, train_dl, val_dl, tok.vocab_size, TrainConfig(
max_steps = 3000,
warmup_steps = 200,
eval_every = 300,
lr = 3e-4,
grad_accum_steps = 4,
mixed_precision = True,
save_best = True,
save_step_ckpts = True,
))
trainer.train()
Large Dataset — HuggingFace Streaming
Stream a dataset without downloading it fully:
from transformer_toolkit.dataloader import DataConfig, from_hf, from_npy_split
from transformer_toolkit.c_tokenizers import HFTokenizer
tok = HFTokenizer("HuggingFaceTB/SmolLM-135M")
cfg = DataConfig(seq_len=512, batch_size=16, stride=None, num_workers=4)
# first run — streams, tokenizes, caches as memmap splits
train_dl, val_dl = from_hf(
dataset_name = "roneneldan/TinyStories",
tokenizer = tok,
cfg = cfg,
bos_id = tok._tok.bos_token_id,
eos_id = tok._tok.eos_token_id,
train_path = "train.npy",
val_path = "val.npy",
)
# future runs — zero download, memmap loads directly
train_dl, val_dl = from_npy_split("train.npy", "val.npy", cfg, tokenizer=tok)
MoE Model — Sparse Experts
Train a mixture-of-experts model for parameter efficiency:
model = Transformer(TransformerConfig(
vocab_size = tok.vocab_size,
dim = 512,
n_layers = 8,
n_heads = 8,
attn = "flash",
ffn = "moe",
n_experts = 8,
top_k = 2,
moe_aux_weight = 0.01,
pos_enc = "rope",
dropout = 0.1,
tie_weights = False,
)).to("cuda")
# The Trainer adds aux_loss to loss automatically
trainer = Trainer(model, train_dl, val_dl, tok.vocab_size, TrainConfig(
max_steps = 5000,
lr = 3e-4,
))
trainer.train()
SFT on Pretrained Model
Fine-tune a pretrained model on instruction-following data:
import torch
from transformer_toolkit import Transformer, TransformerConfig
from transformer_toolkit import SFTTrainer, TrainConfig
from transformer_toolkit import from_sft_json, SFTDataConfig
from transformer_toolkit.c_tokenizers import RustBPETokenizer
DEVICE = "cuda"
# Load pretrained tokenizer and model
tok = RustBPETokenizer()
tok.load("tokenizer.json")
model = Transformer(TransformerConfig(
vocab_size=tok.vocab_size,
dim=512, n_layers=8, n_heads=8, n_kv_heads=2,
attn="gqa", ffn="swiglu", hidden_dim=2048,
norm="rmsnorm", pos_enc="rope", max_seq=512,
)).to(DEVICE)
# Load pretrained checkpoint
ckpt = torch.load("pretraining_checkpoints/best.pt", map_location=DEVICE)
model.load_state_dict(ckpt["model"])
# Prepare SFT data
cfg_sft = SFTDataConfig(
tokenizer=tok, seq_len=512, batch_size=8, split=0.95,
template="llama3", truncation_strategy="turn", debug=True
)
train_dl, val_dl = from_sft_json("instructions.jsonl", tok, cfg_sft)
# Fine-tune
trainer = SFTTrainer(
model=model, train_dl=train_dl, val_dl=val_dl,
vocab_size=tok.vocab_size,
cfg=TrainConfig(
max_steps=2000, lr=1e-4, warmup_steps=100,
eval_every=100, save_every=200,
save_best=True, ckpt_dir="sft_checkpoints",
),
tokenizer=tok,
)
trainer.train()
API Reference
Model API
Module: transformer_toolkit.model
| Class/Function | Purpose |
|---|---|
| TransformerConfig | Dataclass controlling all architecture decisions |
| Transformer | Main model class — forward pass and generation |
| TransformerBlock | Single transformer block with gradient checkpointing |
Key Methods:
# Transformer
model.forward(tokens: Tensor) → (logits, aux_loss)
model.generate(tokens, max_new, temperature, top_k, top_p) → Tensor
model.n_params() → str
model.debug_gradients() → None
model.debug_weights() → None
model.state_dict_for_save() → dict # for weight-tied models
model.load_state_dict_with_tie(state_dict) → None # for weight-tied models
Attention API
Module: transformer_toolkit.attention
from transformer_toolkit.attention import (
MultiHeadAttention,
GroupedQueryAttention,
MultiQueryAttention,
FlashAttention,
MLAttention,
)
# All follow the same interface
attn = MultiHeadAttention(dim, n_heads, pos_enc=...)
out = attn(x)
Feed-Forward API
Module: transformer_toolkit.feed_forward
from transformer_toolkit.feed_forward import (
FFN, ReLUFFN, GLU, ReGLU, GeGLU, SwiGLU,
MoE, ExpertChoiceMoE, SharedExpertMoE,
)
# All follow same interface
ffn = SwiGLU(dim, hidden_dim)
output, aux_loss = ffn(x)
Normalization API
Module: transformer_toolkit.normalization
from transformer_toolkit.normalization import LayerNorm, RMSNorm, DeepNorm
norm = RMSNorm(dim, eps=1e-6)
x_normalized = norm(x)
Positional Encoding API
Module: transformer_toolkit.positional_encodings
from transformer_toolkit.positional_encodings import (
SinusoidalPE, LearnedPE, RoPE, ALiBi
)
pe = RoPE(dim, max_seq=2048)
q_rotated, k_rotated = pe.rotate(q, k)
Tokenizer API
Module: transformer_toolkit.c_tokenizers
from transformer_toolkit.c_tokenizers import (
ByteLevelTokenizer,
RustBPETokenizer,
HFTokenizer,
)
# All follow BaseTokenizer interface
tok.train(texts, vocab_size)
ids = tok.encode(text)
text = tok.decode(ids)
tok.save(path)
tok.load(path)
vocab_size = tok.vocab_size
Dataloader API
Module: transformer_toolkit.dataloader
from transformer_toolkit.dataloader import (
DataConfig,
from_binary,
from_npy_split,
from_files,
from_hf,
save_binary,
)
cfg = DataConfig(seq_len=512, batch_size=32)
train_dl, val_dl = from_binary("data.bin", cfg, tokenizer=tok)
Trainer API
Module: transformer_toolkit.trainer
from transformer_toolkit.trainer import Trainer, TrainConfig
cfg = TrainConfig(max_steps=10000, lr=3e-4)
trainer = Trainer(model, train_dl, val_dl, vocab_size, cfg, tokenizer=tok)
trainer.train()
trainer.train(resume_from="checkpoints/step_5000.pt")
SFT API
Module: transformer_toolkit.sft_trainer, transformer_toolkit.sft_dataloader
from transformer_toolkit import (
SFTTrainer,
SFTDataConfig,
from_sft_strings,
from_sft_json,
from_sft_files,
from_sft_hf,
ChatTemplate,
TrainConfig,
)
cfg = SFTDataConfig(
tokenizer=tok, seq_len=512, batch_size=8,
template="llama3", truncation_strategy="turn"
)
train_dl, val_dl = from_sft_json("data.jsonl", tok, cfg)
trainer = SFTTrainer(model, train_dl, val_dl, vocab_size, TrainConfig(max_steps=2000, lr=1e-4), tokenizer=tok)
trainer.train()
Chat Template API
Module: transformer_toolkit.chat_template
from transformer_toolkit.chat_template import ChatTemplate
template = ChatTemplate("llama3") # or "chatml", "gemma", "alpaca", "raw"
text, loss_ranges = template.format_messages(messages)
template.validate_template() # check special tokens
Inference API
Module: transformer_toolkit.inference
from transformer_toolkit.inference import Inference, InferenceConfig
cfg = InferenceConfig(
max_new_tokens=200, temperature=0.8, top_k=50, top_p=0.9
)
engine = Inference(model, tokenizer, cfg)
output = engine.generate(prompt="Hello")
HuggingFace Hub API
Module: transformer_toolkit.hf_hub
from transformer_toolkit.hf_hub import login, push_to_hub, pull_from_hub
login(token="...")
push_to_hub(repo_id="username/model", model=model, cfg=cfg, tokenizer=tok)
pull_from_hub("username/model", save_dir="checkpoints")
Architecture Reference
Complete transformer architecture overview:
Input tokens [B, T]
│
▼
Embedding + DropOut
│
▼ (SinusoidalPE or LearnedPE added here, if selected)
│
▼ × n_layers
┌─────────────────────────────────────────────────┐
│ RMSNorm / LayerNorm │
│ ├─ Attention (MHA/GQA/MQA/Flash/MLA) │
│ │ ├─ RoPE applied to q, k (if selected) │
│ │ ├─ ALiBi bias added to scores (if sel.) │
│ │ └─ Causal mask applied automatically │
│ └─ + Residual connection │
│ │
│ RMSNorm / LayerNorm │
│ ├─ FFN / SwiGLU / MoE │
│ └─ + Residual connection │
└─────────────────────────────────────────────────┘
│
▼
Final RMSNorm / LayerNorm
│
▼
Output Linear Head [B, T, vocab_size]
│
▼
Logits + Aux Loss (MoE only)
Data Flow:
- Tokenize input text → token IDs: [B, T]
- Embed tokens → embeddings: [B, T, dim]
- Add positional encoding (if sinusoidal/learned)
- Pass through n_layers transformer blocks (each applies attention + FFN)
- Apply final norm
- Linear projection to vocabulary → logits: [B, T, vocab_size]
- Cross-entropy loss computed only on response tokens (SFT) or all tokens (pretraining)
Requirements & Installation
Core Requirements
| Package | Version | Purpose | Required? |
|---|---|---|---|
| torch | ≥ 2.0 | PyTorch (GPU recommended) | ✓ Yes |
| numpy | any | Memmap dataloading | ✓ Yes |
| pydantic | any | Config validation | ✓ Yes |
Optional Dependencies
| Package | Version | Purpose | Command |
|---|---|---|---|
| tokenizers | any | RustBPETokenizer | pip install tokenizers |
| transformers | any | HFTokenizer, HF datasets | pip install transformers |
| datasets | any | HF dataset streaming | pip install datasets |
| huggingface_hub | any | Hub push/pull | pip install huggingface_hub |
| hf-transfer | any | Faster Hub uploads | pip install hf-transfer |
Quick Install
Full installation (all features):
pip install transformer-toolkit tokenizers transformers datasets huggingface_hub hf-transfer
Minimal installation (PyTorch + core):
pip install torch numpy pydantic
pip install transformer-toolkit
GPU Setup
NVIDIA CUDA (recommended):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
AMD ROCm:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
Troubleshooting
Model Training Issues
| Issue | Cause | Solution |
|---|---|---|
| Loss stuck at log(vocab_size) | Weight tying + initialization | Disable tying or scale embedding by 0.02 |
| Loss NaN after a few steps | Learning rate too high | Reduce lr (try 1e-4 for SFT); see the sketch below |
| OOM (out of memory) | Batch too large | Reduce batch_size or enable grad_checkpoint |
| Training slow | Missing optimizations | Enable mixed_precision=True, use flash attention |
| Validation loss plateaus | Underfitting | Increase model size or training steps |
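When loss misbehaves, the debug helpers documented in the Model section give a quick first look (a sketch; model, tokens, and targets come from your own training setup):
import torch.nn.functional as F

logits, aux_loss = model(tokens)
loss = F.cross_entropy(logits.view(-1, tok.vocab_size), targets.view(-1)) + aux_loss
loss.backward()
model.debug_gradients()   # per-parameter gradient stats: look for NaN/inf or exploding norms
model.debug_weights()     # per-parameter weight stats: catch bad initialization early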
Data Loading Issues
| Issue | Cause | Solution |
|---|---|---|
| Memory spike on first load | Full dataset in RAM | Use stride=None, enable memmap saving |
| Slow data loading | Python GIL contention | Increase num_workers |
| Tokenizer error in SFT | Missing special tokens | Retrain tokenizer before SFT |
| Very low mask ratio | seq_len too large | Lower seq_len to match data |
Device Issues
| Issue | Cause | Solution |
|---|---|---|
| CUDA out of memory | Model/batch too large | Use mixed precision, gradient checkpointing, smaller batch |
| Wrong device placement | Tensor on CPU, model on GPU | Ensure .to(device) before forward pass |
| Slow on GPU | Using CPU tensors | Use .cuda() on inputs before model calls |
Citation
If you use Transformer-Toolkit in your research, please cite:
@software{transformer_toolkit_2024,
title={Transformer-Toolkit: A Production Modular Transformer Library},
author={Barbade, Govind},
year={2024},
url={https://github.com/Barbade22/transformer-toolkit}
}
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.