block_parser.rs splits a source file into logical blocks and computes a
BLAKE3 hash for each block. The hashes are stored in the database so that
unchanged blocks can be skipped on the next tangle run — this is the
incremental-build layer of weaveback-tangle.
Two formats are recognised natively:
- AsciiDoc (.adoc, .asciidoc): a hand-written line scanner that splits on ----/..../++++ delimited blocks, == … section headers, and prose paragraphs.
- Markdown (.md, .markdown): uses pulldown-cmark's offset iterator to identify headings, fenced code blocks, and paragraphs.
Everything else (including .rs, .c, .py, …) is treated as a single
opaque text block so that any change to the file marks it dirty.
See db.adoc for how SourceBlockEntry values are stored, and
noweb.adoc for how write_files_incremental uses them.
SourceBlockEntry
Each parsed block carries its 1-based line range, a type tag ("section",
"code", "para", "text"), and a 32-byte BLAKE3 hash of the block content.
The hash is what makes incremental builds work: two runs on the same unchanged file produce identical hashes, so the database comparison can skip all downstream processing for that block’s chunks.
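As a rough illustration of how the hash drives the skip decision, the sketch below compares two runs block by block. The struct `BlockRecord`, its field names, and the function `dirty_blocks` are hypothetical stand-ins, not the real SourceBlockEntry definition, and the positional comparison is an assumption; the actual database lookup may key blocks differently.

```rust
/// Hypothetical mirror of a parsed block; the field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockRecord {
    pub line_start: u32,    // 1-based, inclusive
    pub line_end: u32,      // 1-based, inclusive
    pub block_type: String, // "section", "code", "para" or "text"
    pub hash: [u8; 32],     // 32-byte digest of the block content
}

/// Return the indices of blocks in `new` whose hash differs from the block
/// at the same position in `old`, or that have no counterpart in `old`.
/// Positional matching is an assumption made for this sketch.
pub fn dirty_blocks(old: &[BlockRecord], new: &[BlockRecord]) -> Vec<usize> {
    new.iter()
        .enumerate()
        .filter(|(i, block)| old.get(*i).map_or(true, |prev| prev.hash != block.hash))
        .map(|(i, _)| i)
        .collect()
}
```

Blocks whose index is absent from the result would be the ones a later stage could skip.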
parse_source_blocks entry point
The public entry point dispatches on extension and maps the raw
(start, end, type, content) tuples into SourceBlockEntry values,
hashing each block’s content with BLAKE3 before storing it.
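The dispatch step can be sketched as follows. This is a minimal illustration rather than the real function: `Strategy`, `strategy_for`, and `opaque_block` are hypothetical names, the extension is assumed to arrive without a leading dot and to be matched exactly, and the "text" tag for the fallback block is an assumption. Hashing is left out so the sketch needs no external crate.

```rust
/// The three parsing strategies described in the text (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    AsciiDoc,
    Markdown,
    Opaque,
}

/// Choose a strategy from a file extension given without the leading dot.
/// Exact, case-sensitive matching is an assumption of this sketch.
pub fn strategy_for(extension: &str) -> Strategy {
    match extension {
        "adoc" | "asciidoc" => Strategy::AsciiDoc,
        "md" | "markdown" => Strategy::Markdown,
        _ => Strategy::Opaque,
    }
}

/// Fallback path: the whole file becomes one block covering every line.
/// The tuple layout is (line_start, line_end, block_type, content).
pub fn opaque_block(source: &str) -> (u32, u32, &'static str, String) {
    // An empty file is still reported as a single block on line 1.
    let last_line = source.lines().count().max(1) as u32;
    (1, last_line, "text", source.to_string())
}
```

Because the fallback block contains the full file content, any edit anywhere in the file changes its hash.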
AsciiDoc scanner
The AsciiDoc scanner is intentionally simple — it does not parse the full
AsciiDoc grammar. It only needs to identify code blocks (for which a single
character change matters) and section boundaries (which mark natural
re-tangling points). The three fence patterns (----, ...., ++++) cover
all common listing and literal block delimiters.
Prose paragraphs are accumulated line-by-line; a blank line or section header
flushes the current paragraph. An unclosed delimiter at end-of-file is emitted
as a "code" block (defensive: the tangle will catch the parse error).
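The scanning rules above can be sketched as a small state machine. This is an illustrative reimplementation, not the code in block_parser.rs: `scan_adoc`, `is_fence`, `is_section`, and `flush_para` are hypothetical names, delimiters are assumed to be exactly four characters (`----`, `....`, `++++`), and a section header is assumed to be two or more `=` followed by a space.

```rust
/// One scanned block: (line_start, line_end, block_type, content), 1-based lines.
pub type RawBlock = (u32, u32, &'static str, String);

/// Assumption: only the exact four-character delimiters count as fences.
fn is_fence(line: &str) -> bool {
    matches!(line.trim_end(), "----" | "...." | "++++")
}

/// Assumption: a section header is two or more '=' followed by a space.
fn is_section(line: &str) -> bool {
    let rest = line.trim_start_matches('=');
    line.len() - rest.len() >= 2 && rest.starts_with(' ')
}

/// Emit the accumulated paragraph, if any, and reset the accumulator.
fn flush_para(blocks: &mut Vec<RawBlock>, para: &mut Vec<&str>, start: u32, end: u32) {
    if !para.is_empty() {
        blocks.push((start, end, "para", para.join("\n")));
        para.clear();
    }
}

pub fn scan_adoc(source: &str) -> Vec<RawBlock> {
    let mut blocks: Vec<RawBlock> = Vec::new();
    let mut para: Vec<&str> = Vec::new();
    let mut para_start = 0u32;
    // Opening delimiter, start line, and lines collected for the open fence.
    let mut fence: Option<(&str, u32, Vec<&str>)> = None;

    for (idx, line) in source.lines().enumerate() {
        let n = idx as u32 + 1;
        if let Some((delim, start, mut lines)) = fence.take() {
            lines.push(line);
            if line.trim_end() == delim {
                // Matching delimiter closes the block; delimiters are included.
                blocks.push((start, n, "code", lines.join("\n")));
            } else {
                fence = Some((delim, start, lines));
            }
            continue;
        }
        if is_fence(line) {
            flush_para(&mut blocks, &mut para, para_start, n - 1);
            fence = Some((line.trim_end(), n, vec![line]));
        } else if is_section(line) {
            flush_para(&mut blocks, &mut para, para_start, n - 1);
            blocks.push((n, n, "section", line.to_string()));
        } else if line.trim().is_empty() {
            flush_para(&mut blocks, &mut para, para_start, n - 1);
        } else {
            if para.is_empty() {
                para_start = n;
            }
            para.push(line);
        }
    }

    let total = source.lines().count() as u32;
    // An unclosed delimiter at end-of-file is still emitted as a "code" block.
    if let Some((_, start, lines)) = fence {
        blocks.push((start, total, "code", lines.join("\n")));
    }
    flush_para(&mut blocks, &mut para, para_start, total);
    blocks
}
```

Keeping the opening delimiter in the fence state means a `....` line inside a `----` block is treated as content rather than as a closing fence.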
Markdown parser
The Markdown parser delegates to pulldown-cmark's offset iterator so that we get byte-accurate event ranges without reimplementing a Markdown parser.
Only top-level blocks (depth == 1) are emitted; inner events of nested
structures are skipped. The byte-offset iterator returns range.start at the
opening tag and range.end at the closing tag, which are mapped to 1-based
line numbers via a pre-built byte→line table.
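The byte-to-line mapping can be sketched with the standard library alone. The version below is one possible layout, not necessarily the table built in block_parser.rs: it stores the byte offset at which each line starts and uses a binary search for lookups. The names `line_starts`, `line_of`, and `line_range` are hypothetical.

```rust
use std::ops::Range;

/// Byte offset at which each line of `source` begins (line 1 starts at 0).
pub fn line_starts(source: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, byte) in source.bytes().enumerate() {
        if byte == b'\n' {
            starts.push(i + 1);
        }
    }
    starts
}

/// 1-based number of the line containing byte `offset`.
pub fn line_of(starts: &[usize], offset: usize) -> u32 {
    // Count how many lines begin at or before `offset`.
    starts.partition_point(|&start| start <= offset) as u32
}

/// Map a half-open byte range to an inclusive 1-based line range.
pub fn line_range(starts: &[usize], range: Range<usize>) -> (u32, u32) {
    let first = line_of(starts, range.start);
    // The range end is exclusive, so look up the last byte actually covered.
    let last_byte = range.end.saturating_sub(1).max(range.start);
    (first, line_of(starts, last_byte))
}
```

Looking up `range.end - 1` instead of `range.end` keeps a block that ends with a newline from being reported as extending onto the following line.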
If pulldown-cmark finds no blocks (e.g. the file is empty or consists only
of inline content) we fall back to a single full-file block, matching the
behaviour of the unknown-extension path.
Tests
Five unit tests exercise the public parse_source_blocks interface:
- adoc_single_code_block: a minimal AsciiDoc file with one ---- block
- adoc_two_code_blocks_have_different_hashes: two different blocks must hash differently
- adoc_unchanged_block_same_hash: the same content must hash identically across two calls (BLAKE3 determinism)
- markdown_heading_and_code: a Markdown file yields both "section" and "code" blocks
- fallback_single_block: an unknown extension yields exactly one block
Generated file
// <[@file weaveback-tangle/src/block_parser.rs]>=
/// Sub-file block parsing for incremental build support.
///
/// Splits a source file into logical blocks (code blocks, section headers,
/// prose paragraphs) and computes a BLAKE3 hash for each block. The hashes
/// are stored in the database so that unchanged blocks can be skipped on the
/// next run.
///
/// A parsed logical block with its line range and content hash.
/// Parse `source` into logical blocks based on its file `extension`.
///
/// Recognised extensions: `adoc`, `asciidoc` (AsciiDoc line scanner);
/// `md`, `markdown` (pulldown-cmark); everything else gets a single block.
// ── AsciiDoc ──────────────────────────────────────────────────────────────────
/// Scan an AsciiDoc document line by line, splitting it into:
/// * `"section"` — a single `== …` header line
/// * `"code"` — the content of a `----` delimited block (inclusive of delimiters)
/// * `"para"` — a run of consecutive non-empty lines that are neither a
/// section header nor a delimiter
///
/// Each tuple is `(line_start, line_end, block_type, content)` (1-based lines).
// ── Markdown ──────────────────────────────────────────────────────────────────
/// Parse Markdown using pulldown-cmark's offset iterator.
///
/// Produces blocks of type:
/// * `"section"` — a heading
/// * `"code"` — a fenced code block
/// * `"para"` — a paragraph or other leaf element
/// Map of byte offset → 1-based line number.
// @@