calm — Lilush API


Overview

CALM (Catastrophically Abridged Language Models) module. Provides tokenization, template-driven sequence building, model loading and creation, training, and dataset operations for CALM language models.

Submodules

Module          Description
calm.pipeline   CALM pipeline utilities.
calm.registry   CALM model registry.
calm.template   CALM template DSL parser and sequence builder.

Functions

Name                     Signature
tokenize                 tokenize(text) -> tokens, err
detokenize               detokenize(tokens) -> text, err
token_text               token_text(id) -> text
token_info               token_info(id) -> info
vocab_size               vocab_size() -> sizes
format_git               format_git(git) -> formatted
build_sequence           build_sequence(model_or_spec, opts) -> tokens, err
parse_field_input        parse_field_input(field_names, text) -> fields
build_raw_sequence       build_raw_sequence(text) -> tokens, err
build_template_sequence  build_template_sequence(text) -> tokens, err
load_model               load_model(path) -> model, err
new_model                new_model(opts) -> model, err
trainer                  trainer(model, opts) -> trainer, err
load_dataset             load_dataset(path) -> dataset, err
pack_floats              pack_floats(values) -> data
unpack_floats            unpack_floats(data) -> values

tokenize(text) -> tokens, err

Tokenize text using normal mode (all tiers)

detokenize(tokens) -> text, err

Detokenize a table of token IDs back to text

token_text(id) -> text

Get surface text of a token by ID

token_info(id) -> info

Get full token info by ID

vocab_size() -> sizes

Get vocabulary size breakdown

format_git(git) -> formatted

Format a structured git status table into a compact string

Converts a git status table {branch, ahead, behind, dirty, untracked} into a compact string like "main+3*?". Passes strings through unchanged. Returns nil if the input is neither a table nor a string.
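
The conversion described above can be approximated in plain Lua. This is only an illustrative sketch, not the calm implementation. The function name format_git_sketch is hypothetical. The "+N" marker follows the "main+3*?" example; the "-N" marker for commits behind is an assumption, as is treating dirty and untracked as booleans that map to "*" and "?".

```lua
-- Illustrative sketch only; not the calm implementation.
local function format_git_sketch(git)
  if type(git) == "string" then return git end  -- strings pass through unchanged
  if type(git) ~= "table" then return nil end   -- neither table nor string
  local out = tostring(git.branch or "")
  if (git.ahead or 0) > 0 then out = out .. "+" .. git.ahead end
  -- ASSUMPTION: "-N" for commits behind; the source does not show this marker.
  if (git.behind or 0) > 0 then out = out .. "-" .. git.behind end
  -- ASSUMPTION: dirty and untracked are booleans.
  if git.dirty then out = out .. "*" end
  if git.untracked then out = out .. "?" end
  return out
end

print(format_git_sketch({ branch = "main", ahead = 3, behind = 0,
                          dirty = true, untracked = true }))
```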

build_sequence(model_or_spec, opts) -> tokens, err

Build a token sequence using a model's template (or explicit template spec)

Builds a token sequence from a template and context opts.

The first argument can be either a model, in which case the model's template is used, or an explicit template spec.

The opts table contains field values referenced by the template. For the shell domain, the git field is preprocessed: if it's a table with {branch, ahead, behind, dirty, untracked}, it's formatted to a string.

If opts.eos is true, EOS is appended (for training).

parse_field_input(field_names, text) -> fields

Parse inline field:value patterns from text using known field names

Scans text for patterns like field_name:value where field_name is one of the known names. Each field's value runs until the next field anchor or end of string. One trailing space is stripped from each value.

Returns a table of {field_name = value} pairs, or nil if no patterns were found.

Example: parse_field_input({"headword","pos"}, "pos:n. headword:anything you want") → {pos="n.", headword="anything you want"}
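
The scanning rule can be sketched in plain Lua as follows. This is a minimal approximation, not the calm implementation. The function name parse_fields_sketch is hypothetical. Requiring an anchor to start the string or follow whitespace is an assumption; the source does not say how word boundaries are handled.

```lua
-- Illustrative sketch only; not the calm implementation.
local function parse_fields_sketch(field_names, text)
  -- Collect every "name:" anchor with its start and value-start positions.
  local anchors = {}
  for _, name in ipairs(field_names) do
    local needle = name .. ":"
    local init = 1
    while true do
      local s, e = text:find(needle, init, true)  -- plain find, no patterns
      if not s then break end
      -- ASSUMPTION: an anchor must start the string or follow whitespace.
      if s == 1 or text:sub(s - 1, s - 1):match("%s") then
        anchors[#anchors + 1] = { s = s, v = e + 1, name = name }
      end
      init = e + 1
    end
  end
  if #anchors == 0 then return nil end
  table.sort(anchors, function(a, b) return a.s < b.s end)

  local fields = {}
  for i, a in ipairs(anchors) do
    -- A value runs until the next anchor or the end of the string.
    local stop = anchors[i + 1] and (anchors[i + 1].s - 1) or #text
    local value = text:sub(a.v, stop)
    -- Strip exactly one trailing space.
    if value:sub(-1) == " " then value = value:sub(1, -2) end
    fields[a.name] = value
  end
  return fields
end

local f = parse_fields_sketch({ "headword", "pos" },
                              "pos:n. headword:anything you want")
print(f.pos, f.headword)
```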

build_raw_sequence(text) -> tokens, err

Build a raw token sequence from plain text (no context frames)

Builds a minimal sequence: <BOS> [byte tokens] <EOS>. No context frames, no CMD token. Use with cmd_pos = 0 for full-sequence loss (training on plain text, code, etc.).
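
The <BOS> [byte tokens] <EOS> layout can be sketched in plain Lua. This is an illustration under stated assumptions, not the calm implementation. The function name raw_sequence_sketch is hypothetical. Byte token IDs are assumed to equal the byte values, and BOS is taken as 257, both inferred from the build_template_sequence example. The EOS ID of 258 is purely hypothetical, since the source does not state it.

```lua
-- Illustrative sketch only; not the calm implementation.
local BOS = 257  -- inferred from the build_template_sequence example
local EOS = 258  -- HYPOTHETICAL id; the source does not state the EOS id

local function raw_sequence_sketch(text, bos_id, eos_id)
  local tokens = { bos_id }
  -- ASSUMPTION: each byte maps to a token whose id is the byte value.
  for i = 1, #text do
    tokens[#tokens + 1] = text:byte(i)
  end
  tokens[#tokens + 1] = eos_id
  return tokens
end

print(table.concat(raw_sequence_sketch("hi", BOS, EOS), ","))
```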

build_template_sequence(text) -> tokens, err

Build a token sequence from a template string with inline special tokens

Parses <NAME> patterns in the input and replaces them with the corresponding special token IDs. Text segments between patterns are byte-tokenized. No automatic BOS is prepended -- the caller controls the full sequence via patterns.

Example input: <BOS>Some text<WORD>Word<POS>n.<END><ATN>

Example output: {257, 83, 111, ..., 270, 87, ..., 271, 110, 46, 269, 259}
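
The <NAME> substitution can be sketched in plain Lua. This is a minimal approximation, not the calm implementation. The function name template_sequence_sketch is hypothetical, and the SPECIAL table holds only the five IDs that appear in the example above. Restricting names to uppercase letters and underscores, and returning an error for an unknown name, are both assumptions.

```lua
-- Illustrative sketch only; not the calm implementation.
-- Ids are taken from the example above; the real vocabulary has more entries.
local SPECIAL = { BOS = 257, ATN = 259, END = 269, WORD = 270, POS = 271 }

local function template_sequence_sketch(text)
  local tokens, pos = {}, 1
  while pos <= #text do
    -- ASSUMPTION: special token names are uppercase letters and underscores.
    local s, e, name = text:find("<([%u_]+)>", pos)
    local id = name and SPECIAL[name]
    if s and not id then
      -- ASSUMPTION: unknown names are reported as an error.
      return nil, "unknown special token: " .. name
    end
    -- Byte-tokenize the text segment before the pattern (or to the end).
    local stop = s and (s - 1) or #text
    for i = pos, stop do
      tokens[#tokens + 1] = text:byte(i)
    end
    if not s then break end
    tokens[#tokens + 1] = id
    pos = e + 1
  end
  return tokens
end

local toks = template_sequence_sketch("<BOS>Some text<WORD>Word<POS>n.<END><ATN>")
print(table.concat(toks, ", "))
```

No BOS is added by the sketch itself; the sequence starts with 257 only because the input begins with <BOS>.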

load_model(path) -> model, err

Load a trained model from a weight file

new_model(opts) -> model, err

Create a new model with random weights

trainer(model, opts) -> trainer, err

Create a trainer for a model

load_dataset(path) -> dataset, err

Load a CTDS dataset from file

pack_floats(values) -> data

Pack a table of floats into a binary string (little-endian fp32)

unpack_floats(data) -> values

Unpack a binary string (little-endian fp32) into a table of floats
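
The little-endian fp32 layout used by pack_floats and unpack_floats can be illustrated with a round trip in plain Lua. This sketch is not the calm implementation. The names pack_floats_sketch and unpack_floats_sketch are hypothetical. It assumes a Lua 5.3+ interpreter with string.pack and a 4-byte C float; LuaJIT lacks string.pack, so an FFI float buffer would be the usual substitute there.

```lua
-- Illustrative sketch only; not the calm implementation.
-- ASSUMPTION: Lua 5.3+ (string.pack / string.unpack) with a 4-byte C float.
local function pack_floats_sketch(values)
  local parts = {}
  for i, v in ipairs(values) do
    parts[i] = string.pack("<f", v)  -- "<" = little-endian, "f" = C float
  end
  return table.concat(parts)
end

local function unpack_floats_sketch(data)
  local values, pos = {}, 1
  while pos + 3 <= #data do
    local v
    v, pos = string.unpack("<f", data, pos)  -- returns value and next position
    values[#values + 1] = v
  end
  return values
end

local data = pack_floats_sketch({ 1.0, -2.5, 0.15625 })
print(#data)  -- 4 bytes per value
print(table.concat(unpack_floats_sketch(data), " "))
```

Values that are not exactly representable in fp32, such as 0.1, come back rounded to the nearest single-precision value.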