UTF-8 string utilities for character counting, display width calculation, substring extraction, and validation. Handles ANSI escape sequences and Kitty text sizing protocol.
| Name | Signature |
|---|---|
| valid_b1 | valid_b1(b1) -> ok |
| byte_count | byte_count(b1) -> count |
| valid_seq | valid_seq(seq) -> ok |
| len | len(str) -> count |
| display_len | display_len(str) -> width |
| display_clip | display_clip(str, max_width) -> clipped |
| cell_len | cell_len(str) -> width |
| cell_height | cell_height(str) -> height |
| set_ts_width_mode | set_ts_width_mode(mode) -> ok, err |
| get_ts_width_mode | get_ts_width_mode() -> mode |
| find_all_spaces | find_all_spaces(str) -> positions |
| sub | sub(str, i, j) -> substring, err |
| char | char(...) -> str, err |
| valid | valid(str) -> ok, pos |
valid_b1(b1) -> ok
Check if a byte is a valid starting byte for a multi-byte UTF-8 sequence
A valid leading byte falls in the range 0xC2–0xF4. Bytes 0xC0 and 0xC1 are excluded because they would produce overlong encodings. Bytes 0xF5 and above are excluded because they cannot start a sequence that encodes a code point within the Unicode range (at most U+10FFFF). ASCII bytes (< 0x80) and continuation bytes (0x80–0xBF) also return false.
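The range check described above can be sketched in a few lines of plain Lua. This is an illustration only, not the module's source; `is_valid_lead` is a hypothetical name.

```lua
-- Hypothetical stand-in for valid_b1, written from the description above.
local function is_valid_lead(b1)
  -- 0xC2..0xF4 covers every legal leading byte of a 2-4 byte sequence.
  return type(b1) == "number" and b1 >= 0xC2 and b1 <= 0xF4
end

print(is_valid_lead(0xC3)) -- leading byte of "é" (0xC3 0xA9)
print(is_valid_lead(0x41)) -- ASCII "A"
print(is_valid_lead(0xC0)) -- would start an overlong encoding
```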
byte_count(b1) -> count
Return the number of continuation bytes for a multi-byte UTF-8 character
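Given the leading byte of a multi-byte UTF-8 sequence, returns the number
of continuation bytes that follow it: 1 for a 2-byte sequence
(0xC0–0xDF), 2 for a 3-byte sequence (0xE0–0xEF), or 3 for a 4-byte
sequence (0xF0–0xF7). These ranges are wider than what valid_b1 accepts
(0xC2–0xF4), so a count returned for 0xC0, 0xC1, or 0xF5–0xF7 does not
mean the byte is a valid leading byte. Behaviour is undefined for ASCII
or continuation bytes; call valid_b1 first.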
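As a rough sketch, the mapping from leading byte to continuation count looks like this. `continuation_count` is a hypothetical name, and returning nil for other bytes is a choice made here, since the documented behaviour for such input is undefined.

```lua
-- Hypothetical sketch of the documented ranges; not the module's source.
local function continuation_count(b1)
  if b1 >= 0xC0 and b1 <= 0xDF then return 1 end -- 2-byte sequence
  if b1 >= 0xE0 and b1 <= 0xEF then return 2 end -- 3-byte sequence
  if b1 >= 0xF0 and b1 <= 0xF7 then return 3 end -- 4-byte sequence
  return nil -- assumption: the real function leaves this case undefined
end

-- "€" is encoded as 0xE2 0x82 0xAC, so its leading byte has two followers.
print(continuation_count(string.byte("\226\130\172"))) -- 2
```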
valid_seq(seq) -> ok
Validate a complete UTF-8 multi-byte sequence
Checks that seq is a well-formed 2–4 byte UTF-8 sequence. Verifies the
leading byte with valid_b1 and checks that all continuation bytes are in
range (0x80–0xBF). Also enforces the Unicode-mandated exclusions:

- 0xE0 sequences must have a second byte ≥ 0xA0 (no overlong 3-byte)
- 0xED sequences must have a second byte ≤ 0x9F (no surrogates)
- 0xF0 sequences must have a second byte ≥ 0x90 (no overlong 4-byte)
- 0xF4 sequences must have a second byte ≤ 0x8F (no out-of-range)

Does not accept single-byte (ASCII) input; use valid for full strings.
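The checks listed above can be put together as follows. This is a sketch under two assumptions: that seq is passed as a Lua string holding exactly one encoded character, and that a length mismatch counts as invalid. `is_valid_seq` is a hypothetical name, not the module's function.

```lua
-- Hypothetical sketch of the documented checks; not the module's source.
local function is_valid_seq(seq)
  local b1, b2 = seq:byte(1, 2)
  -- Leading byte must be in 0xC2..0xF4 (same rule as valid_b1).
  if not b1 or b1 < 0xC2 or b1 > 0xF4 then return false end
  -- Number of continuation bytes implied by the leading byte.
  local n = (b1 < 0xE0 and 1) or (b1 < 0xF0 and 2) or 3
  if #seq ~= n + 1 then return false end
  -- Every continuation byte must be in 0x80..0xBF.
  for i = 2, #seq do
    local b = seq:byte(i)
    if b < 0x80 or b > 0xBF then return false end
  end
  -- Second-byte restrictions for the four special leading bytes.
  if b1 == 0xE0 and b2 < 0xA0 then return false end -- overlong 3-byte
  if b1 == 0xED and b2 > 0x9F then return false end -- surrogates
  if b1 == 0xF0 and b2 < 0x90 then return false end -- overlong 4-byte
  if b1 == 0xF4 and b2 > 0x8F then return false end -- above U+10FFFF
  return true
end

print(is_valid_seq("\226\130\172")) -- "€", a well-formed 3-byte sequence
print(is_valid_seq("\237\160\128")) -- encodes a surrogate, so rejected
```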
len(str) -> count
Return the number of UTF-8 characters in a string, ignoring ANSI escapes
Each multi-byte character counts as one character regardless of its byte length, and ANSI escape sequences are not counted.
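One way to picture the counting rule is the sketch below. It assumes the input is valid UTF-8 and that the only escapes present are CSI sequences of the form ESC [ parameters letter; the real function may recognise more escape types. `count_chars` is a hypothetical name.

```lua
-- Hypothetical sketch; not the module's source.
local function count_chars(str)
  -- Assumption: strip only CSI sequences such as "\27[1;31m".
  local plain = str:gsub("\27%[[%d;]*%a", "")
  local n = 0
  for i = 1, #plain do
    local b = plain:byte(i)
    -- Count every byte that is not a continuation byte (0x80..0xBF).
    if b < 0x80 or b > 0xBF then n = n + 1 end
  end
  return n
end

-- "héllo" wrapped in a bold red escape: 6 bytes of text, 5 characters.
print(count_chars("\27[1;31mh\195\169llo\27[0m")) -- 5
```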
display_len(str) -> width
Return the display width of a plain string in terminal cells
Calculates how many terminal cells (monospace font units) the string occupies when rendered on screen.
display_clip(str, max_width) -> clipped
Clip a string to a maximum display width, preserving ANSI escape sequences
cell_len(str) -> width
Return the display width of a string in terminal cells, accounting for text sizing
Calculates the cell width of a string that may contain Kitty text sizing
(OSC 66) sequences. Delegates to std.core.cell_len when available,
otherwise falls back to a pure-Lua implementation.
cell_height(str) -> height
Return the cell height of a string, accounting for text sizing
Calculates the cell height of a string that may contain Kitty text sizing
(OSC 66) sequences. Delegates to std.core.cell_height when available,
otherwise falls back to a pure-Lua implementation.
set_ts_width_mode(mode) -> ok, err
Set the text sizing width calculation mode
Accepts "combined" (scale × explicit width) or "w_only" (explicit
width only). Returns nil and an error for any other value.
get_ts_width_mode() -> mode
Get the current text sizing width calculation mode
Returns either "combined" (scale × explicit width) or "w_only" (explicit width only).
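The setter and getter contract can be illustrated with a small self-contained sketch. The names `set_mode` and `get_mode`, the error text, and the "combined" default are all assumptions made for illustration; the module's actual default is not stated here.

```lua
-- Hypothetical sketch of the documented contract; not the module's source.
local ts_width_mode = "combined" -- assumed default
local valid_modes = { combined = true, w_only = true }

local function set_mode(mode)
  if mode == nil or not valid_modes[mode] then
    -- Follow the nil-plus-message error convention described above.
    return nil, "invalid text sizing width mode: " .. tostring(mode)
  end
  ts_width_mode = mode
  return true
end

local function get_mode()
  return ts_width_mode
end

print(set_mode("w_only"))  -- accepted
print(set_mode("scaled"))  -- rejected: nil plus an error message
print(get_mode())          -- still "w_only" after the rejected call
```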
find_all_spaces(str) -> positions
Find the character positions of all whitespace in a string
Returns an array of 1-indexed character positions at which whitespace appears in str. Positions are counted over printable characters only; ANSI escape sequences are skipped.
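A minimal sketch of this position counting follows. It assumes valid UTF-8 input, treats only ASCII whitespace (space, tab, and the other bytes 0x09–0x0D) as whitespace, and strips only CSI escapes; the real function may differ on all three points. `space_positions` is a hypothetical name.

```lua
-- Hypothetical sketch; not the module's source.
local function space_positions(str)
  -- Assumption: only CSI sequences such as "\27[1m" need to be skipped.
  local plain = str:gsub("\27%[[%d;]*%a", "")
  local positions, char_idx = {}, 0
  for i = 1, #plain do
    local b = plain:byte(i)
    -- Continuation bytes (0x80..0xBF) do not start a new character.
    if b < 0x80 or b > 0xBF then
      char_idx = char_idx + 1
      if b == 0x20 or (b >= 0x09 and b <= 0x0D) then
        positions[#positions + 1] = char_idx
      end
    end
  end
  return positions
end

-- "hé b c" with a bold escape in the middle: spaces are characters 3 and 5.
print(table.concat(space_positions("h\195\169 \27[1mb c\27[0m"), ",")) -- 3,5
```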
sub(str, i, j) -> substring, err
Extract a substring by UTF-8 character indices, ignoring ANSI escapes
Extracts a substring from str using UTF-8 character indices (not bytes).
Negative indices count from the end. Returns nil and an error message if
the string or index is missing.
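Character-based indexing with negative offsets can be sketched as below. The sketch leaves out ANSI escape handling entirely, assumes valid UTF-8 input, and assumes that j defaults to the last character when omitted; none of this is confirmed by the module. `utf8_sub` is a hypothetical name.

```lua
-- Hypothetical sketch without ANSI handling; not the module's source.
local function utf8_sub(str, i, j)
  if type(str) ~= "string" or type(i) ~= "number" then
    return nil, "string and start index are required"
  end
  -- Record the byte offset at which each character starts.
  local starts = {}
  for pos = 1, #str do
    local b = str:byte(pos)
    if b < 0x80 or b > 0xBF then starts[#starts + 1] = pos end
  end
  local n = #starts
  j = j or -1 -- assumption: a missing end index means "to the end"
  -- Negative indices count back from the last character.
  if i < 0 then i = n + i + 1 end
  if j < 0 then j = n + j + 1 end
  if i < 1 then i = 1 end
  if j > n then j = n end
  if i > j then return "" end
  -- The slice ends just before the next character's first byte.
  local last_byte = (j < n) and (starts[j + 1] - 1) or #str
  return str:sub(starts[i], last_byte)
end

local s = "h\195\169llo \226\130\172" -- "héllo €", 7 characters, 10 bytes
print(utf8_sub(s, 1, 5))  -- héllo
print(utf8_sub(s, -1))    -- €
```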
char(...) -> str, err
Convert one or more Unicode codepoints to a UTF-8 string
Converts each integer argument (a Unicode codepoint) to its UTF-8 byte representation and concatenates the results. Accepts any value in the range 0–0x10FFFF (the full Unicode range). Returns nil and an error message for any out-of-range codepoint.
Port of Lua-5.1-UTF-8 by willox.
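For readers unfamiliar with the encoding step, here is a sketch of codepoint-to-UTF-8 conversion using only arithmetic, so that it runs on Lua 5.1 and later. It is not the ported code; `encode` and `utf8_char` are hypothetical names. Like the documented function, it accepts every value in 0–0x10FFFF and does not single out surrogates.

```lua
-- Hypothetical sketch; not the module's source.
local function encode(cp)
  if type(cp) ~= "number" or cp < 0 or cp > 0x10FFFF or cp % 1 ~= 0 then
    return nil, "codepoint out of range: " .. tostring(cp)
  end
  if cp < 0x80 then
    return string.char(cp) -- 1 byte: 0xxxxxxx
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(
      0xE0 + math.floor(cp / 0x1000),
      0x80 + math.floor(cp / 0x40) % 0x40,
      0x80 + cp % 0x40)
  end
  -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return string.char(
    0xF0 + math.floor(cp / 0x40000),
    0x80 + math.floor(cp / 0x1000) % 0x40,
    0x80 + math.floor(cp / 0x40) % 0x40,
    0x80 + cp % 0x40)
end

local function utf8_char(...)
  local parts = {}
  for idx = 1, select("#", ...) do
    local s, err = encode((select(idx, ...)))
    if not s then return nil, err end -- stop at the first bad codepoint
    parts[idx] = s
  end
  return table.concat(parts)
end

print(utf8_char(0x48, 0xE9, 0x20AC)) -- Hé€
```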
valid(str) -> ok, pos
Validate that a string is well-formed UTF-8
Returns true if the entire string is valid UTF-8. On the first
invalid byte, returns false and the byte position of the error.
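A whole-string validator with this return shape might look like the sketch below. It is an illustration, not the module's source: `utf8_valid` is a hypothetical name, and reporting the position of the first byte of the offending sequence (rather than the exact byte that failed) is an assumption.

```lua
-- Hypothetical sketch; not the module's source.
local function utf8_valid(str)
  local i, n = 1, #str
  while i <= n do
    local b1 = str:byte(i)
    if b1 < 0x80 then
      i = i + 1 -- ASCII byte, always valid on its own
    else
      if b1 < 0xC2 or b1 > 0xF4 then return false, i end
      local count = (b1 < 0xE0 and 1) or (b1 < 0xF0 and 2) or 3
      -- Allowed range for the second byte; narrowed for four special leads.
      local lo, hi = 0x80, 0xBF
      if b1 == 0xE0 then lo = 0xA0
      elseif b1 == 0xED then hi = 0x9F
      elseif b1 == 0xF0 then lo = 0x90
      elseif b1 == 0xF4 then hi = 0x8F end
      for k = 1, count do
        local b = str:byte(i + k)
        -- A missing byte (truncated input) or out-of-range byte fails.
        if not b or b < lo or b > hi then return false, i end
        lo, hi = 0x80, 0xBF -- later continuation bytes use the full range
      end
      i = i + count + 1
    end
  end
  return true
end

print(utf8_valid("h\195\169llo")) -- true
print(utf8_valid("ab\255cd"))     -- false   3
```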