markdown.from_html — Lilush API

Overview

HTML → markdown content extraction. Picks the article root from messy real-world HTML, strips noise, converts structural tags to markdown, and decodes entities. The output is intended for a markdown renderer or pager.

The implementation is intentionally regex-based — real HTML is rarely well-formed and a strict parser would reject pages that pragmatic readers should still display. The risk surface is small: the output goes only into a markdown pager, so even pathological HTML can't escape its sandbox.
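The four stages described above (pick a root, strip noise, convert structural tags, decode entities) can be illustrated with a small regex pipeline. The following is a minimal sketch in Python, not Lilush's actual implementation: the function name `extract_sketch`, the root-tag preference order, the list of noise tags, and the set of converted tags are all assumptions chosen for illustration.

```python
import html
import re

FLAGS = re.S | re.I  # let '.' span newlines; HTML tag names are case-insensitive


def extract_sketch(raw: str) -> str:
    """Hypothetical stand-in for extract(html) -> text (illustration only)."""
    # 1. Pick the article root. Assumed preference: <article>, <main>, <body>.
    root = raw
    for tag in ("article", "main", "body"):
        m = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}\s*>", raw, FLAGS)
        if m:
            root = m.group(1)
            break

    # 2. Strip noise: non-content elements (assumed list) and comments.
    root = re.sub(r"<(script|style|nav|footer|aside)\b[^>]*>.*?</\1\s*>",
                  "", root, flags=FLAGS)
    root = re.sub(r"<!--.*?-->", "", root, flags=re.S)

    # 3. Convert structural tags to markdown.
    root = re.sub(
        r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>",
        lambda m: "\n\n" + "#" * int(m.group(1)) + " " + m.group(2).strip() + "\n\n",
        root, flags=FLAGS)
    root = re.sub(r"<a\b[^>]*?href=[\"']([^\"']*)[\"'][^>]*>(.*?)</a\s*>",
                  r"[\2](\1)", root, flags=FLAGS)
    root = re.sub(r"<(strong|b)\b[^>]*>(.*?)</\1\s*>", r"**\2**", root, flags=FLAGS)
    root = re.sub(r"<(em|i)\b[^>]*>(.*?)</\1\s*>", r"*\2*", root, flags=FLAGS)
    root = re.sub(r"<li\b[^>]*>", "\n- ", root, flags=re.I)
    root = re.sub(r"<br\s*/?>", "\n", root, flags=re.I)
    # Opening and closing <p> both become paragraph breaks, so an
    # unclosed <p> still separates paragraphs.
    root = re.sub(r"</?p\b[^>]*>", "\n\n", root, flags=re.I)
    # Drop every tag that was not converted above.
    root = re.sub(r"<[^>]+>", "", root)

    # 4. Decode entities last, so text like &lt;b&gt; is not mistaken for a tag.
    root = html.unescape(root)

    # Normalize whitespace: collapse spaces, trim around newlines,
    # and allow at most one blank line between blocks.
    root = re.sub(r"[ \t]+", " ", root)
    root = re.sub(r" *\n *", "\n", root)
    root = re.sub(r"\n{3,}", "\n\n", root)
    return root.strip()


sample = """<html><head><style>p{color:red}</style></head><body>
<nav><a href="/">Home</a></nav>
<article><h1>Title</h1><p>Fish &amp; <b>chips</b><p>Unclosed paragraph
<script>alert(1)</script></article>
<footer>(c)</footer></body></html>"""

print(extract_sketch(sample))
# # Title
#
# Fish & **chips**
#
# Unclosed paragraph
```

The sample input contains an unclosed `<p>`, which a strict parser could reject. The regex passes tolerate it, which is the trade-off the paragraph above describes.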

Functions

Name     Signature
extract  extract(html) -> text

extract(html) -> text

Extract a markdown-friendly plain text rendering from raw HTML.