CALM (Catastrophically Abridged Language Models) module. Provides tokenization, template-driven sequence building, model loading and creation, training, and dataset operations for CALM language models.
| Module | Description |
|---|---|
| calm.pipeline | CALM pipeline utilities. |
| calm.registry | CALM model registry. |
| calm.template | CALM template DSL parser and sequence builder. |

| Name | Signature |
|---|---|
| tokenize | tokenize(text) -> tokens, err |
| detokenize | detokenize(tokens) -> text, err |
| token_text | token_text(id) -> text |
| token_info | token_info(id) -> info |
| vocab_size | vocab_size() -> sizes |
| format_git | format_git(git) -> formatted |
| build_sequence | build_sequence(model_or_spec, opts) -> tokens, err |
| parse_field_input | parse_field_input(field_names, text) -> fields |
| build_raw_sequence | build_raw_sequence(text) -> tokens, err |
| build_template_sequence | build_template_sequence(text) -> tokens, err |
| load_model | load_model(path) -> model, err |
| new_model | new_model(opts) -> model, err |
| trainer | trainer(model, opts) -> trainer, err |
| load_dataset | load_dataset(path) -> dataset, err |
| pack_floats | pack_floats(values) -> data |
| unpack_floats | unpack_floats(data) -> values |

tokenize(text) -> tokens, err
Tokenize text using normal mode (all tiers)

detokenize(tokens) -> text, err
Detokenize a table of token IDs back to text

token_text(id) -> text
Get surface text of a token by ID

token_info(id) -> info
Get full token info by ID

vocab_size() -> sizes
Get vocabulary size breakdown

format_git(git) -> formatted
Format a structured git status table into a compact string
Converts a git status table {branch, ahead, behind, dirty, untracked}
into a compact string like "main+3*?". Passes strings through unchanged.
Returns nil if the input is neither a table nor a string.
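The conversion described above can be modelled in a few lines. The following is a minimal Python sketch, not the module's implementation. It assumes that `+N` encodes `ahead`, `*` marks `dirty`, and `?` marks `untracked`, which is one reading of the "main+3*?" example. The `-N` marker for `behind` is purely an assumption, since the source shows no example of it.

```python
def format_git(git):
    """Model of the described behaviour: table -> compact string."""
    if isinstance(git, str):
        return git              # strings pass through unchanged
    if not isinstance(git, dict):
        return None             # neither a table nor a string
    out = str(git.get("branch", ""))
    if git.get("ahead"):
        out += f"+{git['ahead']}"    # inferred from the "main+3*?" example
    if git.get("behind"):
        out += f"-{git['behind']}"   # assumption: not shown in the source
    if git.get("dirty"):
        out += "*"
    if git.get("untracked"):
        out += "?"
    return out


print(format_git({"branch": "main", "ahead": 3, "dirty": True, "untracked": True}))
```

Missing or zero-valued fields contribute nothing to the string in this sketch, so a clean branch formats as just its name.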

build_sequence(model_or_spec, opts) -> tokens, err
Build a token sequence using a model's template (or explicit template spec)
Builds a token sequence from a template and context opts.
The first argument can be:
- A model userdata (reads the template from model:info().template)
- A template spec string (parsed directly)
- nil (falls back to BOS + raw input)
The opts table contains field values referenced by the template. For the shell domain, the git field is preprocessed: if it's a table with {branch, ahead, behind, dirty, untracked}, it's formatted to a string.
If opts.eos is true, EOS is appended (for training).

parse_field_input(field_names, text) -> fields
Parse inline field:value patterns from text using known field names
Scans text for patterns like field_name:value where field_name is one
of the known names. Each field's value runs until the next field anchor
or end of string. One trailing space is stripped from each value.
Returns a table of field = value pairs, or nil if no patterns were found.
Example: parse_field_input({"headword","pos"}, "pos:n. headword:anything you want") → {pos="n.", headword="anything you want"}
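To make the scanning rule concrete, here is a minimal Python sketch of the described behaviour. It is not the module's implementation. It assumes that a field anchor only counts at the start of the text or after whitespace, which the source does not specify, and it returns a dict where the documented function returns a table.

```python
import re


def parse_field_input(field_names, text):
    """Model of the described scan: known_name:value until the next anchor."""
    if not field_names:
        return None
    # Assumption: an anchor must sit at the start of the text or after whitespace.
    anchor = re.compile(r"(?<!\S)(" + "|".join(map(re.escape, field_names)) + r"):")
    matches = list(anchor.finditer(text))
    if not matches:
        return None  # no patterns found
    fields = {}
    for i, m in enumerate(matches):
        # The value runs until the next anchor or the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[m.end():end]
        if value.endswith(" "):
            value = value[:-1]  # strip exactly one trailing space
        fields[m.group(1)] = value
    return fields


print(parse_field_input(["headword", "pos"], "pos:n. headword:anything you want"))
```

Because only known names act as anchors, a colon inside a value (for example a time such as 10:30) does not split it.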

build_raw_sequence(text) -> tokens, err
Build a raw token sequence from plain text (no context frames)
Builds a minimal sequence: <BOS> [byte tokens] <EOS>.
No context frames, no CMD token. Use with cmd_pos = 0 for
full-sequence loss (training on plain text, code, etc.).
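The layout of a raw sequence can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's code. Byte tokens are taken to equal the raw UTF-8 byte values, and the BOS ID of 257 is taken from the build_template_sequence example in this reference. The EOS ID is a placeholder, because the source does not state it. Error handling (the `err` return) is omitted.

```python
BOS_ID = 257  # matches the <BOS> ID in the build_template_sequence example
EOS_ID = 258  # placeholder: the source does not state the EOS ID


def build_raw_sequence(text, bos_id=BOS_ID, eos_id=EOS_ID):
    """Model of the described layout: <BOS> [byte tokens] <EOS>."""
    # Assumption: each byte token ID equals the raw UTF-8 byte value.
    return [bos_id, *text.encode("utf-8"), eos_id]


print(build_raw_sequence("hi"))
```

The result always has two more tokens than the text has UTF-8 bytes, one for each framing token.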

build_template_sequence(text) -> tokens, err
Build a token sequence from a template string with inline special tokens
Parses <NAME> patterns in the input and replaces them with the
corresponding special token IDs. Text segments between patterns are
byte-tokenized. No automatic BOS is prepended -- the caller controls
the full sequence via patterns.
Example input: <BOS>Some text<WORD>Word<POS>n.<END><ATN>
Example output: {257, 83, 111, ..., 270, 87, ..., 271, 110, 46, 269, 259}
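The substitution can be modelled with a short Python sketch. It is not the module's implementation. The special-token IDs are the ones visible in the example output above, so the table is incomplete. Treating an unknown name as an error is an assumption, as is the exact syntax allowed inside the angle brackets.

```python
import re

# IDs inferred from the example output; the real vocabulary may define more.
SPECIAL_IDS = {"BOS": 257, "ATN": 259, "END": 269, "WORD": 270, "POS": 271}
# Assumption: names are upper-case letters, digits, and underscores.
PATTERN = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def build_template_sequence(text):
    """Return (tokens, None) on success or (None, error message) on failure."""
    tokens = []
    cursor = 0
    for match in PATTERN.finditer(text):
        # Text between patterns is byte-tokenized (one token per UTF-8 byte).
        tokens.extend(text[cursor:match.start()].encode("utf-8"))
        name = match.group(1)
        if name not in SPECIAL_IDS:
            # Assumption: unknown names are reported rather than kept as text.
            return None, f"unknown special token <{name}>"
        tokens.append(SPECIAL_IDS[name])
        cursor = match.end()
    tokens.extend(text[cursor:].encode("utf-8"))
    return tokens, None


tokens, err = build_template_sequence("<BOS>Some text<WORD>Word<POS>n.<END><ATN>")
print(tokens, err)
```

No BOS is added implicitly, so an input without any patterns yields only byte tokens.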

load_model(path) -> model, err
Load a trained model from a weight file

new_model(opts) -> model, err
Create a new model with random weights

trainer(model, opts) -> trainer, err
Create a trainer for a model

load_dataset(path) -> dataset, err
Load a CTDS dataset from file

pack_floats(values) -> data
Pack a table of floats into a binary string (little-endian fp32)

unpack_floats(data) -> values
Unpack a binary string (little-endian fp32) into a table of floats
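The byte layout used by pack_floats and unpack_floats (little-endian fp32, four bytes per value) can be reproduced with Python's standard `struct` module. This is a minimal sketch of the format only, not the module's code. Rejecting input whose length is not a multiple of four is an assumption about error handling.

```python
import struct


def pack_floats(values):
    """Pack a sequence of floats as consecutive little-endian fp32 values."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_floats(data):
    """Unpack little-endian fp32 bytes back into a list of floats."""
    if len(data) % 4 != 0:
        # Assumption: a partial trailing float is treated as an error.
        raise ValueError("data length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(data) // 4}f", data))


data = pack_floats([1.0, -2.5, 0.25])
print(len(data), unpack_floats(data))
```

Values that fp32 cannot represent exactly, such as 0.1, come back rounded to the nearest fp32 value, so a round trip is only exact for values like the ones used here.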