HTML → markdown content extraction. Picks the article root from messy real-world HTML, strips noise, converts structural tags to markdown, and decodes entities. The output is intended for a markdown renderer or pager.
- Strips noise tags entirely along with their bodies (`<script>`, `<style>`, `<nav>`, `<header>`, `<footer>`, `<aside>`, `<noscript>`, `<iframe>`, `<form>`, `<button>`, `<svg>`).
- Converts `<a href>...</a>` to `[text](url)` so the markdown pager renders it as a clickable link.
- Converts `<h1>`..`<h6>` to `#`/`##`/... headings.
- Converts `<pre>...</pre>` to fenced code blocks; `<code>` to backticks.
- Inserts blank-line separators around paragraphs, divs, list items.
- Decodes the common HTML named entities and numeric references.
- Caps output at 256 KB with a "… [truncated]" suffix.
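
The steps listed above can be sketched as a chain of regex substitutions. This is a minimal illustration in Python, not the module's actual code: the name `extract_sketch`, the order of the passes, the byte-based reading of "256 KB", and the choice to append the suffix after the cap are all assumptions. Article-root selection is left out.

```python
import html
import re

MAX_BYTES = 256 * 1024               # assumed: the 256 KB cap counts UTF-8 bytes
TRUNCATION_SUFFIX = "… [truncated]"
NOISE_TAGS = ("script", "style", "nav", "header", "footer", "aside",
              "noscript", "iframe", "form", "button", "svg")
FLAGS = re.IGNORECASE | re.DOTALL
FENCE = "`" * 3                      # markdown code fence


def _pre_to_fence(match: re.Match) -> str:
    # Drop any tags nested inside <pre> (e.g. <code>) and fence the body.
    body = re.sub(r"<[^>]+>", "", match.group(1)).strip()
    return f"\n\n{FENCE}\n{body}\n{FENCE}\n\n"


def _heading(match: re.Match) -> str:
    # <h3>Title</h3> -> "### Title"
    level = int(match.group(1))
    return f"\n\n{'#' * level} {match.group(2).strip()}\n\n"


def extract_sketch(raw: str) -> str:
    """Hypothetical regex pipeline mirroring the steps described above."""
    text = raw
    # 1. Remove noise tags together with their bodies.
    for tag in NOISE_TAGS:
        text = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}\s*>", "", text, flags=FLAGS)
    # 2. <pre> -> fenced code block, <code> -> backticks.
    text = re.sub(r"<pre\b[^>]*>(.*?)</pre\s*>", _pre_to_fence, text, flags=FLAGS)
    text = re.sub(r"<code\b[^>]*>(.*?)</code\s*>", r"`\1`", text, flags=FLAGS)
    # 3. <h1>..<h6> -> markdown headings.
    text = re.sub(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", _heading, text, flags=FLAGS)
    # 4. <a href="url">text</a> -> [text](url).
    text = re.sub(
        r"""<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["'][^>]*>(.*?)</a\s*>""",
        r"[\2](\1)", text, flags=FLAGS)
    # 5. Blank-line separators around paragraphs, divs, list items.
    text = re.sub(r"</?(?:p|div|li)\b[^>]*>", "\n\n", text, flags=re.IGNORECASE)
    # 6. Drop every remaining tag, then decode named and numeric entities.
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)
    # 7. Collapse runs of blank lines left behind by the passes above.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    # 8. Cap the output size; errors="ignore" drops a split trailing character.
    encoded = text.encode("utf-8")
    if len(encoded) > MAX_BYTES:
        text = encoded[:MAX_BYTES].decode("utf-8", errors="ignore") + TRUNCATION_SUFFIX
    return text


sample = ('<html><head><style>p{}</style></head><body><nav>Menu</nav>'
          '<h2>Title</h2><p>See <a href="https://example.com">the docs</a> '
          '&amp; <code>x</code>.</p><script>alert(1)</script></body></html>')
print(extract_sketch(sample))
```

Tags are stripped before entities are decoded so that escaped markup such as `&lt;b&gt;` survives as literal text instead of being removed as a tag.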
The implementation is intentionally regex-based — real HTML is rarely well-formed and a strict parser would reject pages that pragmatic readers should still display. The risk surface is small: the output goes only into a markdown pager, so even pathological HTML can't escape its sandbox.
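
To make the trade-off concrete, the sketch below feeds the same malformed fragment to a strict parser and to a regex pass. Python's `xml.etree.ElementTree` stands in for "a strict parser" here purely for illustration; the helper names `strict_text` and `lenient_text` are hypothetical and are not part of this module.

```python
import re
import xml.etree.ElementTree as ET

# Unclosed <p> and <br>, plus an unquoted attribute: common in real pages.
messy = "<div><p>First paragraph<br><p class=lead>Second paragraph</div>"


def strict_text(markup: str) -> str:
    """Strict XML parsing: raises ET.ParseError on malformed input."""
    return "".join(ET.fromstring(markup).itertext())


def lenient_text(markup: str) -> str:
    """Regex pass: replace every tag with a space, then collapse whitespace."""
    without_tags = re.sub(r"<[^>]*>", " ", markup)
    return re.sub(r"\s+", " ", without_tags).strip()


try:
    strict_text(messy)
    strict_ok = True
except ET.ParseError:
    strict_ok = False

print(strict_ok)            # False: the strict parser rejects the fragment
print(lenient_text(messy))  # First paragraph Second paragraph
```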
| Name | Signature |
|---|---|
| extract | extract(html) -> text |
extract(html) -> text
Extract a markdown-friendly plain text rendering from raw HTML.