std.utf — Lilush API

Overview

UTF-8 string utilities for character counting, display width calculation, substring extraction, and validation. Handles ANSI escape sequences and Kitty text sizing protocol.

Functions

Name               Signature
valid_b1           valid_b1(b1) -> ok
byte_count         byte_count(b1) -> count
valid_seq          valid_seq(seq) -> ok
len                len(str) -> count
display_len        display_len(str) -> width
display_clip       display_clip(str, max_width) -> clipped
cell_len           cell_len(str) -> width
cell_height        cell_height(str) -> height
set_ts_width_mode  set_ts_width_mode(mode) -> ok, err
get_ts_width_mode  get_ts_width_mode() -> mode
find_all_spaces    find_all_spaces(str) -> positions
sub                sub(str, i, j) -> substring, err
char               char(...) -> str, err
valid              valid(str) -> ok, pos

valid_b1(b1) -> ok

Check if a byte is a valid starting byte for a multi-byte UTF-8 sequence

A valid leading byte falls in the range 0xC2–0xF4. Bytes 0xC0 and 0xC1 are excluded because they would produce overlong encodings. Bytes 0xF5 and above are outside the Unicode range. ASCII bytes (< 0x80) and continuation bytes (0x80–0xBF) also return false.
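The range check above can be illustrated with a minimal pure-Lua sketch. This is not Lilush's source: the helper name is_leading_byte is hypothetical, and it assumes b1 is a numeric byte value such as the result of string.byte.

```lua
-- Hypothetical stand-in for valid_b1, following the ranges in the text.
-- Assumes b1 is a number in 0..255 (for example from string.byte).
local function is_leading_byte(b1)
  return b1 >= 0xC2 and b1 <= 0xF4
end

print(is_leading_byte(0xC3)) -- lead byte of a 2-byte sequence
print(is_leading_byte(0xC0)) -- would start an overlong encoding
print(is_leading_byte(0x41)) -- ASCII "A"
print(is_leading_byte(0x80)) -- continuation byte
```

Only the first call falls inside 0xC2–0xF4; the other three bytes are rejected for the reasons listed in the description.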

byte_count(b1) -> count

Return the number of continuation bytes for a multi-byte UTF-8 character

Given the leading byte of a multi-byte UTF-8 sequence, returns the number of continuation bytes that follow it: 1 for a 2-byte sequence (0xC0–0xDF), 2 for a 3-byte sequence (0xE0–0xEF), or 3 for a 4-byte sequence (0xF0–0xF7). These ranges are wider than the 0xC2–0xF4 range that valid_b1 accepts, and the result is undefined for ASCII or continuation bytes, so check the byte with valid_b1 first.
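As a rough illustration of the mapping from leading byte to continuation count, here is a pure-Lua sketch. The name continuation_count is hypothetical and the code is not taken from Lilush; it assumes a numeric lead byte and, like the description, makes no promise for other input.

```lua
-- Hypothetical stand-in for byte_count, following the ranges in the text.
-- Assumes b1 is a numeric lead byte (0xC0-0xF7); other input is not handled.
local function continuation_count(b1)
  if b1 >= 0xF0 then return 3 end -- lead byte of a 4-byte sequence
  if b1 >= 0xE0 then return 2 end -- lead byte of a 3-byte sequence
  return 1                        -- lead byte of a 2-byte sequence
end

-- U+20AC (euro sign) is encoded as the bytes 0xE2 0x82 0xAC.
local euro = "\226\130\172"
local total = 1 + continuation_count(euro:byte(1))
print(total == #euro) -- true: one lead byte plus two continuation bytes
```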

valid_seq(seq) -> ok

Validate a complete UTF-8 multi-byte sequence

Checks that seq is a well-formed 2–4 byte UTF-8 sequence. Verifies the leading byte with valid_b1 and checks that all continuation bytes are in range (0x80–0xBF). Also enforces the Unicode-mandated exclusions.

Does not accept single-byte (ASCII) input; use valid for full strings.

len(str) -> count

Return the number of UTF-8 characters in a string, ignoring ANSI escapes

Each multi-byte character counts as one character, and ANSI escape sequences are skipped rather than counted.
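The counting rule can be sketched in plain Lua. This is an illustration under assumptions, not Lilush's implementation: the name char_count is hypothetical, and only CSI-style escapes (ESC [ ... letter) are stripped, whereas the real function may recognise other escape forms.

```lua
-- Hypothetical sketch of character counting that skips ANSI escapes.
local function char_count(str)
  -- Assumption: only CSI sequences such as "\27[1;31m" are removed.
  local plain = str:gsub("\27%[[%d;]*%a", "")
  -- Every byte outside 0x80-0xBF starts a new character, so counting
  -- those bytes gives the character count for well-formed input.
  local _, count = plain:gsub("[^\128-\191]", "")
  return count
end

-- "héllo" wrapped in red-colour escapes; "é" is the two bytes 0xC3 0xA9.
local coloured = "\27[31mh\195\169llo\27[0m"
print(char_count(coloured)) -- 5
print(#coloured)            -- 15 (raw byte length, escapes included)
```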

display_len(str) -> width

Return the display width of a plain string in terminal cells

Calculates the display width of a string in terminal cells (monospace font units). This is the width as it would appear on screen.

display_clip(str, max_width) -> clipped

Clip a string to a maximum display width, preserving ANSI escape sequences

cell_len(str) -> width

Return the display width of a string in terminal cells, accounting for text sizing

Calculates the cell width of a string that may contain Kitty text sizing (OSC 66) sequences. Delegates to std.core.cell_len when available, otherwise falls back to a pure-Lua implementation.

cell_height(str) -> height

Return the cell height of a string, accounting for text sizing

Calculates the cell height of a string that may contain Kitty text sizing (OSC 66) sequences. Delegates to std.core.cell_height when available, otherwise falls back to a pure-Lua implementation.

set_ts_width_mode(mode) -> ok, err

Set the text sizing width calculation mode

Accepts "combined" (scale × explicit width) or "w_only" (explicit width only). Returns nil and an error for any other value.

get_ts_width_mode() -> mode

Get the current text sizing width calculation mode

Returns the current text sizing width calculation mode, which can be either "combined" (scale × explicit width) or "w_only" (explicit width only).
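The setter and getter follow the usual Lua convention of returning nil plus an error message on failure. The sketch below mimics that contract with hypothetical stand-ins (set_mode, get_mode); the starting value and the error text are assumptions made for illustration only, not documented behaviour.

```lua
-- Hypothetical stand-ins for set_ts_width_mode / get_ts_width_mode.
local ts_width_mode = "combined" -- assumed starting value, illustration only
local allowed = { combined = true, w_only = true }

local function set_mode(mode)
  if not allowed[mode] then
    -- Error wording is invented for this sketch.
    return nil, "unknown text sizing width mode: " .. tostring(mode)
  end
  ts_width_mode = mode
  return true
end

local function get_mode()
  return ts_width_mode
end

local ok, err = set_mode("w_only")
print(ok, get_mode()) -- the mode is now "w_only"

ok, err = set_mode("wide")
print(ok, err)        -- nil plus an error message; the mode is unchanged
```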

find_all_spaces(str) -> positions

Find the character positions of all whitespace in a string

Returns an array of character positions (1-indexed) where whitespace characters appear in the string. Only considers printable characters, ignoring ANSI escape sequences.
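A minimal sketch of the position bookkeeping, assuming that "whitespace" means ASCII whitespace and that only CSI-style escapes need to be skipped. The name space_positions is hypothetical, and the real function may differ on both points.

```lua
-- Hypothetical sketch: 1-indexed character positions of whitespace.
local function space_positions(str)
  -- Assumption: only CSI sequences such as "\27[1m" are removed.
  local plain = str:gsub("\27%[[%d;]*%a", "")
  local positions, index = {}, 0
  -- Each match is one non-continuation byte plus any continuation bytes
  -- that follow it, i.e. one UTF-8 character in well-formed input.
  for ch in plain:gmatch("[^\128-\191][\128-\191]*") do
    index = index + 1
    if ch:match("^%s$") then
      positions[#positions + 1] = index
    end
  end
  return positions
end

-- "hé llo w": the spaces are the 3rd and 7th characters, not bytes.
local found = space_positions("h\195\169 llo w")
print(table.concat(found, ", ")) -- 3, 7
```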

sub(str, i, j) -> substring, err

Extract a substring by UTF-8 character indices, ignoring ANSI escapes

Extracts a substring from str using UTF-8 character indices (not bytes). Negative indices count from the end. Returns nil and an error message if the string or index is missing.
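Character-based indexing can be sketched as follows. The helper utf8_sub is hypothetical and handles plain strings only (no ANSI escapes); the default of -1 for a missing j mirrors string.sub and is an assumption, as is the wording of the error messages.

```lua
-- Hypothetical sketch of substring extraction by character index.
local function utf8_sub(str, i, j)
  if type(str) ~= "string" then return nil, "missing string" end
  if type(i) ~= "number" then return nil, "missing start index" end
  -- Split into characters: a non-continuation byte plus its continuations.
  local chars = {}
  for ch in str:gmatch("[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  local n = #chars
  j = j or -1                       -- assumed default, as in string.sub
  if i < 0 then i = n + i + 1 end   -- negative indices count from the end
  if j < 0 then j = n + j + 1 end
  if i < 1 then i = 1 end
  if j > n then j = n end
  if i > j then return "" end
  return table.concat(chars, "", i, j)
end

local word = "h\195\169llo"             -- "héllo", 6 bytes, 5 characters
print(utf8_sub(word, 2, 3) == "\195\169l") -- true: "él"
print(utf8_sub(word, -2, -1))              -- lo
```

Indexing by character avoids cutting a multi-byte sequence in half, which string.sub(word, 2, 2) would do here.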

char(...) -> str, err

Convert one or more Unicode codepoints to a UTF-8 string

Converts each integer argument (a Unicode codepoint) to its UTF-8 byte representation and concatenates the results. Accepts any value in the range 0–0x10FFFF (the full Unicode range). Returns nil and an error message for any out-of-range codepoint.
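The encoding step follows the standard UTF-8 bit layout, which the sketch below spells out with plain arithmetic so that it runs on any Lua version. The names encode_codepoint and utf8_char are hypothetical, and the error message is invented; this is not the ported code itself.

```lua
-- Hypothetical sketch of codepoint-to-UTF-8 conversion.
local floor = math.floor

local function encode_codepoint(cp)
  if type(cp) ~= "number" or cp % 1 ~= 0 or cp < 0 or cp > 0x10FFFF then
    return nil, "codepoint out of range"
  end
  if cp < 0x80 then                 -- 1 byte: 0xxxxxxx
    return string.char(cp)
  elseif cp < 0x800 then            -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then          -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 + floor(cp / 0x1000),
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end                               -- 4 bytes: 11110xxx then three 10xxxxxx
  return string.char(0xF0 + floor(cp / 0x40000),
                     0x80 + floor(cp / 0x1000) % 0x40,
                     0x80 + floor(cp / 0x40) % 0x40,
                     0x80 + cp % 0x40)
end

local function utf8_char(...)
  local parts = {}
  for k = 1, select("#", ...) do
    local bytes, err = encode_codepoint((select(k, ...)))
    if not bytes then return nil, err end
    parts[k] = bytes
  end
  return table.concat(parts)
end

print(utf8_char(0x48, 0xE9) == "H\195\169") -- true: "Hé"
print(utf8_char(0x110000))                  -- nil plus an error message
```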

Port of Lua-5.1-UTF-8 by willox.

valid(str) -> ok, pos

Validate that a string is well-formed UTF-8

Returns true if the entire string is valid UTF-8. On the first invalid byte, returns false and the byte position of the error.
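For illustration, a whole-string validator can be sketched from the well-formed byte ranges in the Unicode Standard. The name utf8_valid is hypothetical and this is not Lilush's code; in particular, reporting the position of the sequence's first byte (rather than the offending continuation byte) is an assumption.

```lua
-- Hypothetical sketch of UTF-8 validation with an error position.
local function utf8_valid(str)
  local i, n = 1, #str
  while i <= n do
    local b1 = str:byte(i)
    local need, lo, hi = 0, 0x80, 0xBF -- allowed range for the 2nd byte
    if b1 < 0x80 then
      need = 0
    elseif b1 >= 0xC2 and b1 <= 0xDF then
      need = 1
    elseif b1 >= 0xE0 and b1 <= 0xEF then
      need = 2
      if b1 == 0xE0 then lo = 0xA0         -- rejects overlong 3-byte forms
      elseif b1 == 0xED then hi = 0x9F end -- rejects surrogates
    elseif b1 >= 0xF0 and b1 <= 0xF4 then
      need = 3
      if b1 == 0xF0 then lo = 0x90         -- rejects overlong 4-byte forms
      elseif b1 == 0xF4 then hi = 0x8F end -- rejects values above U+10FFFF
    else
      return false, i
    end
    for k = 1, need do
      local b = str:byte(i + k)
      if not b or b < lo or b > hi then return false, i end
      lo, hi = 0x80, 0xBF -- later continuation bytes use the full range
    end
    i = i + need + 1
  end
  return true
end

print(utf8_valid("h\195\169llo")) -- true
print(utf8_valid("ab\192\128"))   -- false  3
```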