markdown.from_html — Lilush API

Overview

Extracts markdown content from HTML. It picks the article root from messy real-world HTML, strips noise, converts structural tags to markdown, and decodes entities. The output is intended for a markdown renderer or pager.

Strips noise tags entirely along with their bodies (<script>, <style>, <nav>, <header>, <footer>, <aside>, <noscript>, <iframe>, <form>, <button>, <svg>).
Converts <a href>...</a> to [text](url) so the markdown pager renders it as a clickable link.
Converts <h1>..<h6> to #/##/... headings.
Converts <pre>...</pre> to fenced code blocks; <code> to backticks.
Inserts blank-line separators around paragraphs, divs, and list items.
Decodes the common named HTML entities and numeric character references.
Caps output at 256 KB with a "… [truncated]" suffix.
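
The steps listed above can be sketched roughly as follows. This is a minimal Python illustration of the regex approach, not the Lilush implementation: the name `sketch_extract`, the order of the passes, and the reading of the 256 KB cap as UTF-8 bytes counted before the suffix is appended are all assumptions. Article-root selection is left out.

```python
import html
import re

# Tag list copied from the text above; the real matching rules may differ.
NOISE_TAGS = ("script", "style", "nav", "header", "footer", "aside",
              "noscript", "iframe", "form", "button", "svg")
MAX_BYTES = 256 * 1024            # assumption: the cap is measured in bytes
SUFFIX = "… [truncated]"
FENCE = "`" * 3
FLAGS = re.S | re.I               # tags may span lines and vary in case


def _fence(match):
    # Drop any markup nested inside <pre> and wrap the body in a code fence.
    body = re.sub(r"<[^>]+>", "", match.group(1)).strip("\n")
    return "\n\n" + FENCE + "\n" + body + "\n" + FENCE + "\n\n"


def _heading(match):
    # <h3>Title</h3> becomes "### Title" on its own paragraph.
    level = int(match.group(1))
    return "\n\n" + "#" * level + " " + match.group(2).strip() + "\n\n"


def sketch_extract(raw: str) -> str:
    text = raw
    # 1. Remove noise tags together with everything inside them.
    for tag in NOISE_TAGS:
        text = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}\s*>", "", text, flags=FLAGS)
    # 2. Code: <pre> to fenced blocks, <code> to inline backticks.
    text = re.sub(r"<pre\b[^>]*>(.*?)</pre\s*>", _fence, text, flags=FLAGS)
    text = re.sub(r"<code\b[^>]*>(.*?)</code\s*>", r"`\1`", text, flags=FLAGS)
    # 3. Links: <a href="url">text</a> to [text](url).
    text = re.sub(r"<a\b[^>]*?href\s*=\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a\s*>",
                  r"[\2](\1)", text, flags=FLAGS)
    # 4. Headings: <h1>..<h6> to #..######.
    text = re.sub(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", _heading, text, flags=FLAGS)
    # 5. Blank-line separators around paragraphs, divs, and list items.
    text = re.sub(r"</?(?:p|div|li)\b[^>]*>", "\n\n", text, flags=FLAGS)
    # 6. Drop whatever tags are left, then decode entities.
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)
    # 7. Collapse runs of blank lines and trim the ends.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    # 8. Cap the output size and mark the cut.
    encoded = text.encode("utf-8")
    if len(encoded) > MAX_BYTES:
        text = encoded[:MAX_BYTES].decode("utf-8", errors="ignore") + SUFFIX
    return text


if __name__ == "__main__":
    page = ('<body><nav>Menu</nav><h2>Title</h2>'
            '<p>See <a href="https://example.com/x">the docs</a> &amp; more.</p>'
            '<script>alert(1)</script></body>')
    print(sketch_extract(page))
```

Entities are decoded only after the remaining tags are stripped, so escaped markup such as `&lt;b&gt;` survives as literal text instead of being treated as a tag.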

The implementation is intentionally regex-based: real HTML is rarely well-formed, and a strict parser would reject pages that pragmatic readers should still display. The risk surface is small because the output goes only into a markdown pager, so even pathological HTML can't escape its sandbox.
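
The trade-off can be seen in a small Python sketch. It uses `xml.etree.ElementTree` as a stand-in for "a strict parser" (an assumption; the text does not name one), and `strict_text` and `lenient_text` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# Typical real-world markup: <p> and <b> are never closed.
BROKEN = "<div><p>Unclosed paragraph <b>bold text<p>Next paragraph</div>"


def strict_text(markup: str):
    """Return the text via a strict XML parser, or None if it rejects the input."""
    try:
        return "".join(ET.fromstring(markup).itertext())
    except ET.ParseError:
        return None


def lenient_text(markup: str) -> str:
    """Replace anything that looks like a tag with a space, then tidy whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    return re.sub(r"\s+", " ", without_tags).strip()


if __name__ == "__main__":
    print(strict_text(BROKEN))   # the strict parser gives up on mismatched tags
    print(lenient_text(BROKEN))  # the regex pass still yields readable text
```

The regex pass has no notion of nesting, which is why it keeps working on input with mismatched tags, at the cost of occasionally mangling unusual markup.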

Functions

Name	Signature
`extract`	`extract(html) -> text`

extract(html) -> text

Extract a markdown-friendly plain-text rendering from raw HTML.