The weaveback-macro lexer converts a raw &str into a flat Vec<Token>
that the parser then assembles into a tree. It is a hand-written
pushdown scanner: a state stack allows blocks, macro argument lists, and
block comments to nest arbitrarily. The special character (default %,
configurable) acts as the sole escape sigil — everything else is plain text.
Design rationale
Why a state stack?
Weaveback constructs nest: a macro call can contain a named block that
contains another macro call that contains a comment. A finite-state
scanner cannot count nesting depth. A stack of State frames solves
this cleanly: push on open, pop on close. Each frame records the
byte offset of its opening delimiter so that "Unclosed block at offset N"
errors point to the right place.
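The push-on-open, pop-on-close idea can be sketched in a few lines. This is a simplified illustration, not the lexer's actual code: the `Frame` type, the `unclosed_errors` function, the bare `{` / `(` delimiters and the "Unclosed macro args" message are all assumptions made for the example, and the sketch does not check that a closer matches the kind of frame it pops.

```rust
/// Hypothetical frame type: each variant records the byte offset of its opener.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Frame {
    Block(usize),
    Macro(usize),
}

/// Scan `input`, pushing a frame on `{` or `(` and popping on `}` or `)`.
/// Returns one message per frame that is still open at end of input.
fn unclosed_errors(input: &str) -> Vec<String> {
    let mut stack: Vec<Frame> = Vec::new();
    for (offset, byte) in input.bytes().enumerate() {
        match byte {
            b'{' => stack.push(Frame::Block(offset)),
            b'(' => stack.push(Frame::Macro(offset)),
            b'}' | b')' => {
                // Simplification: pop whatever is on top without checking its kind.
                stack.pop();
            }
            _ => {}
        }
    }
    stack
        .iter()
        .map(|frame| match frame {
            Frame::Block(at) => format!("Unclosed block at offset {at}"),
            Frame::Macro(at) => format!("Unclosed macro args at offset {at}"),
        })
        .collect()
}
```

Because every frame remembers where it was opened, the error for `a{b(c)d` can name offset 1 even though the problem is only detected at end of input.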
Why memchr in Block and Comment?
Weaveback documents are mostly plain text. The special character appears
rarely. Scanning byte-by-byte for % wastes cycles on the common case.
memchr uses a SIMD/vectorised search on the platforms where it has vector
support and locates the next % in O(n) time while examining a whole vector
of bytes (typically 16 or 32) per step. Macro arg lists are
short and contain many token types, so a simple loop is used there instead.
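A rough sketch of the skip-to-next-special pattern follows. To stay free of external crates it uses a standard-library stand-in, `find_byte`, which has the same shape as `memchr::memchr(needle, haystack) -> Option<usize>`; both `find_byte` and `text_runs` are hypothetical names for this example.

```rust
/// Stand-in for `memchr::memchr`: index of the first `needle` in `haystack`.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Return the (start, end) byte ranges of the plain-text runs in `input`,
/// i.e. the stretches between occurrences of `special`.
fn text_runs(input: &[u8], special: u8) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        match find_byte(special, &input[pos..]) {
            Some(rel) => {
                if rel > 0 {
                    runs.push((pos, pos + rel));
                }
                // Step over the special char itself; a real lexer would
                // dispatch on the byte that follows it here.
                pos += rel + 1;
            }
            None => {
                // No special char left: the rest of the input is one run.
                runs.push((pos, input.len()));
                break;
            }
        }
    }
    runs
}
```

The loop body runs once per special char rather than once per byte, which is where the saving on mostly-plain-text documents comes from.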
Why SpecialAction?
Both Block and Macro states must handle a % in their input. The full
dispatch logic lives in handle_after_special, shared by both states.
The return value — a SpecialAction enum — tells the caller what to do:
push a new state and return immediately, pop the current state, or
continue the loop.
Why plain text for %identifier?
%identifier not followed by (, {, or } is not a macro call.
Emitting it as Text (no error) allows weaveback documents to contain
printf-style specifiers (%d, %s), shell constructs (%T), version
tags, and similar, without noise. Only the structurally significant
combinations trigger the lexer’s machinery.
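The rule can be illustrated with a small classifier. This is a sketch under assumptions: the `AfterIdent` enum and `classify_after_ident` function are invented for the example, and the identifier character set (ASCII letters, digits, underscore) is a guess at what the lexer accepts.

```rust
/// Hypothetical outcome of looking at what follows `%identifier`.
#[derive(Debug, PartialEq)]
enum AfterIdent {
    MacroCall,  // `%name(`
    BlockOpen,  // `%name{`
    BlockClose, // `%name}`
    PlainText,  // anything else: emitted as text, no error
}

/// Classify the sequence starting at `pct`, the byte offset of the special char.
fn classify_after_ident(input: &[u8], pct: usize) -> AfterIdent {
    // Skip the special char, then the identifier (assumed ASCII alnum or `_`).
    let mut end = pct + 1;
    while end < input.len() && (input[end].is_ascii_alphanumeric() || input[end] == b'_') {
        end += 1;
    }
    match input.get(end).copied() {
        Some(b'(') => AfterIdent::MacroCall,
        Some(b'{') => AfterIdent::BlockOpen,
        Some(b'}') => AfterIdent::BlockClose,
        _ => AfterIdent::PlainText,
    }
}
```

With this rule, `%d items` and a trailing `%s` classify as plain text, while `%def(` is recognised as a macro call head.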
Why precompute comment delimiters?
The comment delimiters %/* and %*/ depend on the runtime special char.
Precomputing them as open_comment: [u8; 3] and close_comment: [u8; 3]
avoids rebuilding the arrays on every call to run_comment_state.
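As a minimal sketch, the precomputation amounts to building two fixed-size arrays once from the configured special char. The `CommentDelims` struct is a hypothetical container for the example; in the lexer the two arrays are fields named open_comment and close_comment.

```rust
/// Hypothetical holder for the two 3-byte comment delimiters.
struct CommentDelims {
    open_comment: [u8; 3],
    close_comment: [u8; 3],
}

impl CommentDelims {
    /// Build the delimiters once for a given special char.
    fn new(special: u8) -> Self {
        CommentDelims {
            open_comment: [special, b'/', b'*'],
            close_comment: [special, b'*', b'/'],
        }
    }
}
```

A comment scanner can then test `bytes[pos..].starts_with(&delims.open_comment)` without constructing anything per call.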
Token vocabulary
| TokenKind | Sequence | Meaning |
|---|---|---|
| Text | any | Literal pass-through content |
| Space | whitespace run | Whitespace inside macro args (preserved) |
| Special | | Escaped special char → literal |
| | %{ or %tag{ | Opens a named or anonymous block |
| | %} or %tag} | Closes the innermost open block |
| | %name( | Macro call head; starts arg-list scanning |
| | %(name) | Variable substitution |
| Ident | ASCII identifier | Identifier inside a macro arg list |
| | , | Arg separator inside a macro arg list |
| | ) | Closes the current macro arg list |
| | = | Named-arg = |
| | %//, %-- or %# | Single-line comment (consumed to end of line) |
| | %/* | Opens a nestable block comment |
| | %*/ | Closes the innermost open block comment |
| EOF | end of input | Sentinel; always the last token |
State machine
Three mutually-recursive states share the same stack. The driver loop
dispatches to the correct handler and pops the stack when the handler
returns false.
File Structure
The single output file assembles all chunks in declaration order.
// <<@file weaveback-macro/src/lexer/mod.rs>>=
// crates/weaveback-macro/src/lexer/mod.rs — generated from lexer.adoc
// <<lexer preamble>>
// <<lexer char classifiers>>
// <<lexer state types>>
// <<lexer struct>>
// <<lexer impl>>
// @
impl Lexer structure
Every method group gets its own sub-chunk. They are assembled here in the order they appear in the impl block.
// <<lexer impl>>=
// @
Module preamble
// <<lexer preamble>>=
use crate;
use memchr;
// @
Character classifiers
Three byte-level predicates used throughout the lexer. They operate on
u8 rather than char because the lexer works on raw bytes — only ASCII
sequences are structurally significant, and non-ASCII bytes are always
part of plain text.
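The chunk does not name the three predicates in this text, so the following is only a guess at their shape: identifier start, identifier continuation, and whitespace. All three function names are assumptions.

```rust
/// Hypothetical: may this byte start an ASCII identifier?
fn is_ident_start(b: u8) -> bool {
    b.is_ascii_alphabetic() || b == b'_'
}

/// Hypothetical: may this byte continue an ASCII identifier?
fn is_ident_continue(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical: is this byte whitespace inside a macro arg list?
fn is_space_byte(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | b'\r')
}
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so none of these predicates can fire in the middle of a non-ASCII character; that is what makes byte-level classification safe here.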
// <<lexer char classifiers>>=
// @
State types
State
Each frame on the stack is one of three variants. Block and Macro
carry the byte offset of their opening delimiter so that EOF-unclosed
errors report the right position. Comment omits an offset because it
self-reports its own "Unclosed comment" error from inside
run_comment_state, using the start of the comment body.
SpecialAction
handle_after_special is called by both run_block_state and
run_macro_state after consuming the special char. The three variants
encode exactly what the caller must do next:
- Push: a new state was just pushed; return true immediately so the driver dispatches to the new state without re-entering the current loop.
- Pop: a closing delimiter was seen; return false so the driver pops the current frame.
- Continue: a token was emitted; continue the current loop.
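The mapping from the three variants to the caller's behaviour can be written down directly. This is a sketch: the enum below mirrors the variant names from the list, but the derive attributes and the `handler_result` helper are assumptions introduced for the example.

```rust
/// The three outcomes described above (derives added for the example).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SpecialAction {
    Push,
    Pop,
    Continue,
}

/// Hypothetical helper: what a state handler should do with an action.
/// `Some(b)` means "return `b` to the driver now"; `None` means "keep looping".
fn handler_result(action: SpecialAction) -> Option<bool> {
    match action {
        SpecialAction::Push => Some(true),  // driver dispatches to the new top state
        SpecialAction::Pop => Some(false),  // driver pops the current frame
        SpecialAction::Continue => None,    // stay in the current state's loop
    }
}
```

Inside a state handler this would typically appear as `if let Some(keep) = handler_result(action) { return keep; }`.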
// <<lexer state types>>=
/// What `handle_after_special` tells the caller to do.
// @
The Lexer struct
bytes and pos are the cursor into the input. src is a 32-bit file
index threaded into every Token for source-map purposes. state_stack
starts with one State::Block(0) — the implicit outermost block — so
that top-level text and constructs are handled identically to nested ones.
open_comment and close_comment are the precomputed 3-byte delimiters
for the configurable special char (see Design rationale).
// <<lexer struct>>=
// @
Public API
new validates that the special char is ASCII (the SIMD memchr path only works
on single bytes), precomputes the comment delimiters, and seeds the stack
with the outermost Block frame.
lex consumes the lexer and returns the completed token stream alongside
any errors. Errors are non-fatal: the stream is always well-formed up to
EOF even when errors occur.
// <<lexer pub api>>=
// @
Input helpers
Six small utilities used throughout the state handlers.
peek_byte and advance form the basic cursor. advance returns the
consumed byte so callers can avoid a second peek.
skip_line_comment uses memchr to jump past the entire line in one
call — useful for all three line-comment styles (%//, %--, %#).
get_identifier_end finds the end of an ASCII identifier without
allocating. It is used both to scan macro names and to extract the tag
from named blocks.
block_tag_at reads the identifier immediately after a % at a given
byte offset. Named-block open and close use this to match tags
(%blk{…%blk}) without storing the tag in the State frame.
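A possible shape for the two identifier helpers is sketched below as free functions over a byte slice. The signatures and the identifier character set are assumptions; the real methods live on the Lexer and read from its own bytes field.

```rust
/// Byte index just past the end of an identifier starting at `start`.
/// Assumes identifiers consist of ASCII letters, digits and `_`.
fn get_identifier_end(bytes: &[u8], start: usize) -> usize {
    let mut end = start;
    while end < bytes.len() && (bytes[end].is_ascii_alphanumeric() || bytes[end] == b'_') {
        end += 1;
    }
    end
}

/// Tag of a `%tag{` or `%tag}` sequence whose special char sits at `pct_start`.
/// Returns "" for the anonymous forms. Borrows from `bytes`; nothing is allocated.
fn block_tag_at(bytes: &[u8], pct_start: usize) -> &str {
    // Clamp so that a special char at the very end of input cannot panic.
    let start = (pct_start + 1).min(bytes.len());
    let end = get_identifier_end(bytes, start);
    std::str::from_utf8(&bytes[start..end]).unwrap_or("")
}
```

Matching a close against its open then reduces to comparing `block_tag_at(bytes, open_offset)` with `block_tag_at(bytes, close_offset)`, which is why the tag never needs to be stored in the frame.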
// <<lexer input helpers>>=
// ── Low-level cursor ──────────────────────────────────────────────────
/// Advance past the rest of the current line (through `\n` or to EOF).
/// Returns the byte index just past the end of an identifier starting at `start`.
/// Extract the identifier tag from a `%tag{` or `%tag}` position.
/// `pct_start` is the byte offset of `%`. Returns `""` for anonymous `%{`/`%}`.
// @
Token and error emission
emit_token suppresses zero-length tokens except for EOF, which is
always required as the stream terminator. This keeps the token stream
clean without forcing callers to check lengths.
error_at appends to errors. Errors never stop lexing — the lexer
always produces a usable stream.
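The zero-length rule is small enough to show in full. The `Kind`, `Tok` and `Sink` types here are hypothetical stand-ins for the real TokenKind, Token and Lexer, reduced to what the rule needs.

```rust
/// Hypothetical, reduced token kind.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Text,
    Eof,
}

/// Hypothetical, reduced token: kind plus byte span.
#[derive(Debug, PartialEq)]
struct Tok {
    kind: Kind,
    pos: usize,
    length: usize,
}

/// Hypothetical stand-in for the part of the lexer that owns the token list.
struct Sink {
    tokens: Vec<Tok>,
}

impl Sink {
    /// Push a token unless it is empty; EOF is exempt because it is always zero-length.
    fn emit_token(&mut self, kind: Kind, pos: usize, length: usize) {
        if length == 0 && kind != Kind::Eof {
            return;
        }
        self.tokens.push(Tok { kind, pos, length });
    }
}
```

Callers can therefore emit "the text since the last token" unconditionally, even when that text is empty.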
// <<lexer emission>>=
// ── Emission ──────────────────────────────────────────────────────────
// @
Main driver
run is the top-level dispatch loop. It terminates when pos reaches
the end of input — not when the stack empties. This ensures that even a
malformed (unclosed) document is fully scanned.
At EOF, any frames above the root block (state_stack[1..]) are
unclosed. The driver first collects their error messages into a Vec while
state_stack is borrowed immutably, then calls error_at (which needs
&mut self) once that borrow has ended; without the intermediate
collect() the two borrows would overlap. Comment
frames are omitted here because run_comment_state reports its own
unclosed error when it hits EOF.
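The collect-then-report pattern might look roughly like this. The `MiniLexer` and `StackState` types, the `report_unclosed` method and the exact message strings are assumptions for the example; only the borrow structure is the point.

```rust
/// Hypothetical frame type mirroring the three states described in this document.
enum StackState {
    Block(usize),
    Macro(usize),
    Comment,
}

/// Hypothetical, reduced lexer: just the stack and the error list.
struct MiniLexer {
    state_stack: Vec<StackState>,
    errors: Vec<(usize, String)>,
}

impl MiniLexer {
    fn error_at(&mut self, pos: usize, msg: &str) {
        self.errors.push((pos, msg.to_string()));
    }

    /// Report every frame above the root that is still open at EOF.
    fn report_unclosed(&mut self) {
        // Immutable borrow of `state_stack` lives only until `collect()` returns.
        let pending: Vec<(usize, &'static str)> = self
            .state_stack
            .iter()
            .skip(1) // the root block is never "unclosed"
            .filter_map(|state| match state {
                StackState::Block(at) => Some((*at, "Unclosed block")),
                StackState::Macro(at) => Some((*at, "Unclosed macro call")),
                StackState::Comment => None, // reported by the comment state itself
            })
            .collect();
        // Now `self` can be borrowed mutably.
        for (pos, msg) in pending {
            self.error_at(pos, msg);
        }
    }
}
```

Calling `self.error_at` directly inside the iterator chain would be rejected by the borrow checker, because the iterator still holds a shared borrow of `self.state_stack`.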
// <<lexer run>>=
// ── Main driver ───────────────────────────────────────────────────────
// @
Block state
Block is the default state and the most common. memchr locates the
next special char a vector of bytes at a time, skipping over all intervening
text. If no special char is found before EOF the remaining bytes are
emitted as a single Text token.
When a special char is found, the text up to it is emitted, then the char
is consumed and control passes to handle_after_special. The return
value determines whether to push (return true immediately so the driver
calls the new state), pop (return false), or stay in the loop.
// <<lexer block state>>=
// ── Block state ───────────────────────────────────────────────────────
// @
Macro arg state
Macro arg lists are scanned token-by-token. ) ends the list (pops the
frame), , and = are single-character tokens, whitespace runs are
collapsed into one Space token, ASCII identifiers become Ident, and
everything else — including non-ASCII text and numeric literals — becomes
Text.
A % inside a macro arg list is valid: it can start a nested macro call,
a block, a comment, a variable substitution, or an escape. Delegation to
handle_after_special handles all of these uniformly.
The trailing check if !matches!(self.state_stack.last(), Some(State::Macro(_)))
handles the case where handle_after_special produced a Pop (e.g. a %}
inside a macro arg): in that case, we are no longer the active state and
must return false without consuming ).
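The token-by-token part of this state can be sketched as a standalone function. It is a simplification under assumptions: the `ArgKind` enum, `lex_args` and `is_arg_ident_start` are invented names, the special char is not handled at all, and the exact boundaries of a Text run are a guess.

```rust
/// Hypothetical token kinds for the inside of a macro arg list.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArgKind {
    Ident,
    Space,
    Comma,
    Equal,
    CloseParen,
    Text,
}

fn is_arg_ident_start(b: u8) -> bool {
    b.is_ascii_alphabetic() || b == b'_'
}

/// Tokenize `input` up to and including the first `)`.
fn lex_args(input: &str) -> Vec<(ArgKind, &str)> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        let start = pos;
        let first = bytes[pos];
        pos += 1;
        let kind = match first {
            b')' => ArgKind::CloseParen,
            b',' => ArgKind::Comma,
            b'=' => ArgKind::Equal,
            b if b.is_ascii_whitespace() => {
                // Collapse the whole whitespace run into one token.
                while pos < bytes.len() && bytes[pos].is_ascii_whitespace() {
                    pos += 1;
                }
                ArgKind::Space
            }
            b if is_arg_ident_start(b) => {
                while pos < bytes.len()
                    && (bytes[pos].is_ascii_alphanumeric() || bytes[pos] == b'_')
                {
                    pos += 1;
                }
                ArgKind::Ident
            }
            _ => {
                // Text: run until the next ASCII byte that starts another token.
                while pos < bytes.len() {
                    let b = bytes[pos];
                    if matches!(b, b')' | b',' | b'=')
                        || b.is_ascii_whitespace()
                        || is_arg_ident_start(b)
                    {
                        break;
                    }
                    pos += 1;
                }
                ArgKind::Text
            }
        };
        // Token boundaries always fall on ASCII bytes, so slicing the &str is safe.
        tokens.push((kind, &input[start..pos]));
        if kind == ArgKind::CloseParen {
            break;
        }
    }
    tokens
}
```

In this sketch a numeric literal such as 42 and a non-ASCII word such as é both come out as Text, matching the description above.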
// <<lexer macro state>>=
// ── Macro arg state ───────────────────────────────────────────────────
// @
Shared special dispatcher
handle_after_special is the heart of the lexer. It is called by both
Block and Macro states after the special char has been consumed, and
decides what to emit and whether to push or pop.
The dispatch table (by the next byte):
| Next byte | Action |
|---|---|
| | |
| | |
| { | Anonymous block open; push |
| } | Block close; pop current frame |
| / (as %//) | Line comment |
| - (as %--) | Line comment |
| # | Line comment |
| special char | Escaped special char |
| identifier start | Named macro |
| anything else | Error + emit |
| EOF | Emit |
handle_var handles the %(name) form: it demands an identifier
immediately after ( and a ) immediately after the identifier. Any
deviation produces an error and emits the malformed sequence as Text.
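The strictness of the %(name) form can be sketched as a small parser. This is an illustration only: `parse_var`, its Result-based signature and the error wording are assumptions, whereas the real handle_var emits tokens and records errors on the lexer.

```rust
/// Parse a `%(name)` sequence whose special char sits at `pct_start`.
/// Returns the variable name, or a message describing the first deviation.
fn parse_var(input: &str, pct_start: usize) -> Result<&str, String> {
    let bytes = input.as_bytes();
    if bytes.get(pct_start + 1) != Some(&b'(') {
        return Err(format!("Expected '(' at offset {}", pct_start + 1));
    }
    let name_start = pct_start + 2;
    // The identifier must start immediately after '(' (no whitespace allowed).
    match bytes.get(name_start) {
        Some(b) if b.is_ascii_alphabetic() || *b == b'_' => {}
        _ => return Err(format!("Expected identifier at offset {name_start}")),
    }
    let mut end = name_start;
    while end < bytes.len() && (bytes[end].is_ascii_alphanumeric() || bytes[end] == b'_') {
        end += 1;
    }
    // The ')' must follow the identifier immediately.
    if bytes.get(end) != Some(&b')') {
        return Err(format!("Expected ')' at offset {end}"));
    }
    Ok(&input[name_start..end])
}
```

Inputs such as `%( name)` or `%(name` fail here; in the lexer the corresponding malformed sequence is emitted as Text alongside the error.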
// <<lexer special handler>>=
// ── Shared special-sequence handler ──────────────────────────────────
//
// Called after the special char has been consumed.
// `pct_start` is the byte offset of the special char itself.
/// Handle a `%(varname)` sequence. `pct_start` is the byte offset of the `%`.
// @
Comment state
Block comments nest: %/* outer %/* inner %*/ outer %*/ is valid. The
comment state therefore pushes itself on %/* and pops on %*/, exactly
like the Block state does for %{/%}.
memchr is used here too — comment bodies can be long. Any % that is
not part of %/* or %*/ is silently skipped (the pos += 1 at the end
of the loop body). This is why a bare % inside a comment is harmless:
it is neither a comment delimiter nor a syntax error.
On EOF, the accumulated text (if any) is emitted and an "Unclosed comment"
error is recorded. The frame is then popped by returning false.
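Nested comment skipping can be sketched with a depth counter. Note the substitutions: the lexer pushes and pops Comment frames on its state stack and searches with memchr, while this example uses a plain counter and a standard-library search; `skip_nested_comment` is a hypothetical name.

```rust
/// Skip a nestable block comment. `start` is the index just past the opening
/// delimiter. Returns the index just past the matching closer, or an error
/// if the input ends first.
fn skip_nested_comment(bytes: &[u8], start: usize, special: u8) -> Result<usize, &'static str> {
    let open = [special, b'/', b'*'];
    let close = [special, b'*', b'/'];
    let mut depth = 1usize;
    let mut pos = start;
    while pos < bytes.len() {
        // Standard-library stand-in for a memchr search.
        let rel = match bytes[pos..].iter().position(|&b| b == special) {
            Some(rel) => rel,
            None => break,
        };
        pos += rel;
        if bytes[pos..].starts_with(&open) {
            depth += 1;
            pos += 3;
        } else if bytes[pos..].starts_with(&close) {
            depth -= 1;
            pos += 3;
            if depth == 0 {
                return Ok(pos);
            }
        } else {
            // A bare special char inside a comment is harmless: step over it.
            pos += 1;
        }
    }
    Err("Unclosed comment")
}
```

For the input `%/* outer %/* inner %*/ 50% %*/ tail`, scanning from index 3 passes the inner comment and the bare `%` and stops just before ` tail`.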
// <<lexer comment state>>=
// ── Comment state ─────────────────────────────────────────────────────
// @
Tests
The test module exercises each token kind, error path, and edge case in isolation. Key cases worth noting:
- test_bare_percent_inside_comment: verifies that a lone % inside a block comment is swallowed as text, not treated as an error or comment delimiter.
- test_printf_format_specifiers: ensures %d, %s, %f, and similar pass through without errors.
- test_nested_comment: exercises the comment push/pop path.
- test_real_world_macro_with_block_and_vars: a realistic %def invocation with a named block containing variable references.
- test_escaped_pubfunc_not_macro: %name( must produce a Special token followed by Text, not a macro call.
// <<@file weaveback-macro/src/lexer/tests.rs>>=
// crates/weaveback-macro/src/lexer/tests.rs
use crateLexer;
use crate;
/// Collect tokens from the lexer (non-EOF tokens only).
/// Helper to assert tokens match an expected sequence of (TokenKind, &str).
/// We compare both `kind` and the `length` of the text (since we can't store real text easily).
//-------------------------------------------------------------------------
// Tests
//-------------------------------------------------------------------------
// @