The AST phase sits between the parser and the evaluator — it is the
normalization (or lowering) step. The parser produces a ParseNode
arena (a flat Vec indexed by integer, built by the push-down state
machine) that is structurally correct but not yet clean: it still contains
comment nodes and unresolved whitespace artefacts. The evaluator needs an
owned, recursive ASTNode tree where those details have been stripped and
named parameters have been resolved. build_ast bridges the two
representations in a single tree walk.
The module also contains the serialization layer used to dump the AST to a line-oriented text format for external consumers (Python evaluator, diagnostic tools).
Design rationale
Why arena → owned tree?
The parser’s arena model is ideal for the push-down parser: nodes are
appended cheaply, back-references are just indices. But the evaluator
benefits from owned children — no lifetime ties to the source string, no
index arithmetic, direct recursive descent. clean_node converts each
ParseNode (by index) into an ASTNode (owned), recursing through
children and dropping comments on the way.
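The arena model can be sketched in a few lines. Names follow the prose above, but the sketch is simplified: the real ParseNode also carries a token and source positions, which are omitted here.

```rust
// Minimal sketch of the parser's arena model.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeKind { Root, Text }

struct ParseNode {
    kind: NodeKind,
    parts: Vec<usize>, // children stored as indices into the arena
}

struct Arena {
    nodes: Vec<ParseNode>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Appending is cheap: push and return the new index. The index
    // doubles as the back-reference the push-down parser hands to
    // the parent node.
    fn push(&mut self, kind: NodeKind) -> usize {
        self.nodes.push(ParseNode { kind, parts: Vec::new() });
        self.nodes.len() - 1
    }

    fn attach(&mut self, parent: usize, child: usize) {
        self.nodes[parent].parts.push(child);
    }
}
```

The owned ASTNode tree replaces the `Vec<usize>` index lists with `Vec<ASTNode>`, which is what frees the evaluator from index arithmetic.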
Why two-pass analyze_param?
A macro argument can be either positional (value) or named
(name = value). Detecting a named parameter requires recognising the
pattern [spaces/comments] Ident [spaces/comments] Equal […] at the
start of the children list — and then deciding where the value begins.
A forward scan through a three-state DFA accumulates the result; a second
loop then converts children starting at the computed start_idx. This
avoids backtracking while making all legal transitions explicit.
ParamState — making the DFA explicit
The first pass of analyze_param is a three-state machine:
| State | Transitions in | Carries |
|---|---|---|
| `Init` | initial state | — |
| `SeenName` | `Ident` seen from `Init` | the name token and its index |
| `SeenEqual` | `Equal` seen from `SeenName` | the position of the first value node after `=` |
The `SeenName` state has two exits: `Equal` advances to `SeenEqual`; any
other non-skip token `break`s the scan and the parameter is classified as
positional. This means `foo bar` is positional (with `foo` as its
first non-skippable token), never named, even though `foo` looks like a
name — only `foo = …` is named.
Any non-skip, non-transition token causes an immediate break. The final
state after the scan — combined with whether first_good_after_equal was
set — fully determines the output case without any boolean flags.
Positions (first_not_skippable, first_good_after_equal) remain
Option<usize> rather than i32 sentinels: no casting, no sentinel bugs,
idiomatic Rust.
process_ast as the single safe entry point
strip_space_before_comments mutates the parser’s arena before build_ast
converts it to a tree. Calling build_ast without the preceding strip
produces correct but polluted output. Parser::process_ast (in
parser/mod.rs) is the one place in the codebase that sequences both calls
in the right order. Parser::build_ast is also exposed for tests that
want to inspect the un-stripped parse tree. Any external caller that
wants clean output must go through process_ast.
Why strip whitespace before comments?
In idiomatic weaveback source, comments appear on their own line:
some text %// strip me
The space before %// is syntactically part of the preceding Text or
Space node. If left in, it pollutes the generated output.
strip_space_before_comments removes preceding Space nodes entirely
and trims trailing spaces off preceding Text nodes. Block comments are
treated identically when they are followed by a newline (i.e. they fill
the rest of the line).
The backward walk scans over all consecutive Space nodes before a
comment, not just the one immediately adjacent. A sequence such as
Text("hello") / Space / Space / %// comment would otherwise leave a
dangling Space node in the output.
is_skippable centralises the three-way Space | LineComment | BlockComment
predicate so both strip_space_before_comments and analyze_param use
the same definition.
Why BFS serialization?
The AST is serialized to a line-per-node text format for external
evaluators. BFS guarantees that a node at output index i always has its
children at a contiguous range next_idx..next_idx+n, where next_idx
is computed by accumulating child counts during the traversal. DFS would
place children non-contiguously, forcing consumers to store and scan back-
references.
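The consumer-side payoff can be shown with a toy calculation: given only each node's child count in BFS order, the child index range of every node falls out of a single accumulator, with no back-references stored. The function name is illustrative, not part of the module's API.

```rust
// Recover each node's child range from per-node child counts,
// assuming the counts are listed in BFS order and the root is node 0.
fn child_ranges(child_counts: &[usize]) -> Vec<std::ops::Range<usize>> {
    let mut next_idx = 1; // node 0 is the root; children start at index 1
    child_counts
        .iter()
        .map(|&n| {
            let r = next_idx..next_idx + n;
            next_idx += n; // accumulate, mirroring the serializer's walk
            r
        })
        .collect()
}
```

For a tree A(B(D), C) the counts are `[2, 1, 0, 0]` and the ranges come out contiguous, which is exactly the guarantee DFS ordering would break.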
File structure
Three files are generated from this document.
// <<@file weaveback-macro/src/ast/mod.rs>>=
// crates/weaveback-macro/src/ast/mod.rs — generated from ast.adoc
// <<ast preamble>>
// <<ast error>>
// <<ast build>>
// <<ast analyze param>>
// <<ast clean node>>
// <<ast strip spaces>>
// @
// <<@file weaveback-macro/src/ast/serialization.rs>>=
// crates/weaveback-macro/src/ast/serialization.rs — generated from ast.adoc
// <<ast serialization preamble>>
// <<ast serialize token>>
// <<ast serialize nodes>>
// <<ast write ast>>
// <<ast read input>>
// <<ast dump>>
// @
// <<@file weaveback-macro/src/ast/tests.rs>>=
// <<ast tests>>
// @
Module preamble
// <<ast preamble>>=
use crate::parser::Parser;
use crate;
use Error;
pub use ;
/// Three-state DFA used by `analyze_param` to classify a parameter node.
/// Returns `true` for node kinds that are transparent in both parameter
/// scanning (`analyze_param`) and whitespace stripping
/// (`strip_space_before_comments`): `Space`, `LineComment`, `BlockComment`.
// @
Error type
ASTError covers the three failure modes: the parser produced nothing
(Parser), an index lookup missed (NodeNotFound), and a generic
catch-all for other processing problems. From<String> lets ? on
String-returning helpers propagate into ASTError::Other without noise.
// <<ast error>>=
// @
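A minimal sketch of the shape described above. The variant payloads and the `Display` text are assumptions (the chunk body is elided here); the `From<String>` impl shows the noise-free `?` propagation the prose mentions.

```rust
use std::fmt;

// Sketch of ASTError; payload types are illustrative assumptions.
#[derive(Debug)]
enum ASTError {
    Parser(String),      // the parser produced nothing
    NodeNotFound(usize), // an arena index lookup missed
    Other(String),       // catch-all for other processing problems
}

impl fmt::Display for ASTError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ASTError::Parser(msg) => write!(f, "parser error: {msg}"),
            ASTError::NodeNotFound(idx) => write!(f, "node not found: {idx}"),
            ASTError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

// `From<String>` lets `?` on String-returning helpers propagate
// into `ASTError::Other` without explicit map_err noise.
impl From<String> for ASTError {
    fn from(msg: String) -> Self {
        ASTError::Other(msg)
    }
}

fn helper() -> Result<(), String> {
    Err("boom".to_string())
}

fn process() -> Result<(), ASTError> {
    helper()?; // String -> ASTError::Other via From
    Ok(())
}
```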
Public entry point: build_ast
The single public function that callers use. It asks the parser for its
root index (which may be absent if the source was empty), then delegates
to clean_node. A None result from clean_node — which can only
happen if the root itself is a comment node, an unlikely but legal state —
is reported as an error rather than silently returning an empty tree.
// <<ast build>>=
/// Main entry point: unwraps the parser's optional root index and delegates to `clean_node`.
// @
Parameter analysis: analyze_param
The most complex function in the module. It handles the three structural cases of a macro parameter:
| Pattern | Result |
|---|---|
| `Ident = value …` | Named param — `name` is set; children are converted starting at the first value node |
| `Ident =` (nothing after) | Named param with blank value — `name` is set; the children list is empty |
| anything else | Positional param — `name` stays `None`; children are converted from the first non-skippable node |
The first pass (lines labelled "First pass") scans children, skipping
`Space`, `LineComment`, and `BlockComment` nodes. It sets five state
variables:

- `param_name` — set to the first `Ident` token seen before any `=`
- `name_index` — its position in the children array
- `seen_equal` — set when `=` is found after the first `Ident`
- `first_good_after_equal` — position of the first non-skippable node after `=`
- `first_not_skippable` — position of the first non-skippable node overall
The `break` inside the `if seen_equal { … }` branch is intentional: once the
pattern `Ident = first-value-item` is fully determined, further scanning
would only add complexity. Everything from `start_idx` onwards is
processed by `clean_node` in the second pass.
// <<ast analyze param>>=
/// Analyse a parameter node: classify as positional or named and collect parts.
// @
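The first pass can be sketched over a simplified token stream. `Tok`, `Classified`, and the folding of `seen_equal` into the `ParamState` value are illustrative assumptions; `name_index` and the second (conversion) pass are omitted to keep the DFA visible.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok { Ident(String), Equal, Space, LineComment, Text(String) }

#[derive(Debug, PartialEq)]
enum ParamState { Init, SeenName, SeenEqual }

#[derive(Debug, PartialEq)]
enum Classified {
    Named { name: String, start_idx: usize }, // value begins at start_idx
    NamedBlank { name: String },              // `name =` with nothing after
    Positional { start_idx: usize },          // first non-skippable token
}

fn is_skippable(t: &Tok) -> bool {
    matches!(t, Tok::Space | Tok::LineComment)
}

fn analyze_param(children: &[Tok]) -> Classified {
    let mut state = ParamState::Init;
    let mut param_name: Option<String> = None;
    let mut first_not_skippable: Option<usize> = None;
    let mut first_good_after_equal: Option<usize> = None;

    for (i, tok) in children.iter().enumerate() {
        if is_skippable(tok) {
            continue;
        }
        if first_not_skippable.is_none() {
            first_not_skippable = Some(i);
        }
        match state {
            ParamState::Init => match tok {
                Tok::Ident(name) => {
                    param_name = Some(name.clone());
                    state = ParamState::SeenName;
                }
                _ => break, // only Ident opens the named-detection path
            },
            ParamState::SeenName => match tok {
                Tok::Equal => state = ParamState::SeenEqual,
                _ => break, // `foo bar` stays positional
            },
            ParamState::SeenEqual => {
                first_good_after_equal = Some(i);
                break; // value start found; second pass takes over
            }
        }
    }

    // Final state plus first_good_after_equal fully determine the case.
    match (state, param_name, first_good_after_equal) {
        (ParamState::SeenEqual, Some(name), Some(idx)) => {
            Classified::Named { name, start_idx: idx }
        }
        (ParamState::SeenEqual, Some(name), None) => Classified::NamedBlank { name },
        _ => Classified::Positional { start_idx: first_not_skippable.unwrap_or(0) },
    }
}
```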
Tree conversion: clean_node
The recursive engine of build_ast. It returns None for comment nodes
(they are stripped from the AST entirely) and delegates Param nodes to
analyze_param. All other nodes have their children recursively cleaned
and are reconstructed as ASTNode values. The name field is always
None here — it is only populated by analyze_param.
A debug_assert! enforces that leaf node kinds (Equal, Ident, Text,
Space) arrive with no children. The assertion fires only in debug builds
and catches parser regressions immediately rather than producing a
silently-wrong AST.
// <<ast clean node>>=
/// Recursively convert a `ParseNode` arena entry to an owned `ASTNode` tree.
///
/// Returns `None` for comment nodes (stripped from the AST entirely) and
/// delegates `Param` nodes to `analyze_param`.
// @
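A standalone sketch of the recursion and the leaf-node guard, using simplified stand-ins for the real types (tokens and `Param` delegation omitted):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeKind { Root, Text, Ident, Equal, Space, LineComment, BlockComment }

struct ParseNode { kind: NodeKind, parts: Vec<usize> }

#[derive(Debug)]
struct ASTNode { kind: NodeKind, name: Option<String>, parts: Vec<ASTNode> }

fn is_leaf(kind: NodeKind) -> bool {
    matches!(kind, NodeKind::Equal | NodeKind::Ident | NodeKind::Text | NodeKind::Space)
}

fn clean_node(arena: &[ParseNode], idx: usize) -> Option<ASTNode> {
    let node = &arena[idx];
    // Comments are stripped from the AST entirely.
    if matches!(node.kind, NodeKind::LineComment | NodeKind::BlockComment) {
        return None;
    }
    // Debug-build guard: leaf kinds must arrive without children.
    debug_assert!(!is_leaf(node.kind) || node.parts.is_empty());
    let parts = node
        .parts
        .iter()
        .filter_map(|&child| clean_node(arena, child))
        .collect();
    // `name` is only ever populated by analyze_param, never here.
    Some(ASTNode { kind: node.kind, name: None, parts })
}
```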
Whitespace stripping
strip_space_before_comments
Walks the children of node_idx and removes whitespace that immediately
precedes comments. Two sub-cases:
- `Space` node before comment — the `Space` node is removed from `parts` by recording its index in `to_remove` and splicing it out after the analysis loop (to avoid mutating while iterating).
- `Text` node before comment — trailing spaces are trimmed in place via `parser.strip_ending_space`, which shortens the token's `length` field without reallocating.
Block comments are treated as line-ending only when the byte immediately
after their `end_pos` is `\n`; inline block comments (`%/* … %*/ more`)
are left alone.
The function then recurses into all surviving children. Note that
children are re-read after the removal loop — the children Vec is
cloned after splicing so removed nodes are not visited.
is_followed_by_newline
Tiny helper: checks whether the byte at node.end_pos in the raw content
buffer is \n. Kept separate to keep strip_space_before_comments
readable.
// <<ast strip spaces>>=
// @
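The splice-after-scan pattern can be sketched over an owned child list. The real code works on arena indices and trims via `parser.strip_ending_space`; the `Node` enum and in-place string trimming here are simplified assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Node { Text(String), Space, LineComment }

fn strip_space_before_comments(children: &mut Vec<Node>) {
    let mut to_remove: Vec<usize> = Vec::new();
    for i in 0..children.len() {
        if !matches!(children[i], Node::LineComment) {
            continue;
        }
        // Walk backwards over ALL consecutive Space nodes, not just one.
        let mut j = i;
        while j > 0 && matches!(children[j - 1], Node::Space) && !to_remove.contains(&(j - 1)) {
            to_remove.push(j - 1);
            j -= 1;
        }
        // A Text node before the run gets its trailing spaces trimmed.
        if j > 0 {
            if let Node::Text(s) = &mut children[j - 1] {
                *s = s.trim_end_matches(|c: char| c == ' ' || c == '\t').to_string();
            }
        }
    }
    // Splice out recorded indices after the loop (avoid mutating mid-scan).
    to_remove.sort_unstable();
    for idx in to_remove.into_iter().rev() {
        children.remove(idx);
    }
}
```

Running the pass twice changes nothing, which is the idempotence property the tests below guard.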
Invariants
After process_ast completes the following hold for every ASTNode in
the returned tree. Tests assert subsets of these; debug_assert! guards
catch violations in debug builds at the point of construction.
| Invariant | Explanation |
|---|---|
| No comment nodes | `clean_node` returns `None` for comment nodes, so `LineComment` and `BlockComment` never enter the tree. |
| `name` only on named params | `clean_node` always leaves `name` as `None`; only `analyze_param` populates it. |
| Leaf nodes have no children | `Equal`, `Ident`, `Text`, and `Space` nodes carry empty children; a `debug_assert!` in `clean_node` enforces this at construction time. |
| BFS output is contiguous | The serialization walk writes nodes in BFS order; each node's children occupy a contiguous range immediately following all previously written sibling subtrees. Consumers can compute child ranges from child counts alone without storing back-references. |
Serialization (serialization.rs)
Preamble
// <<ast serialization preamble>>=
use crate;
use crate;
use std::collections::VecDeque;
use std::fs::File;
use std::io::Write;
use std::path::PathBuf;
// @
Token serialization
Tokens are encoded as a comma-separated tuple: src,kind,pos,length.
src is a file-index integer assigned by the lexer; kind is the
TokenKind discriminant cast to i32 for a stable numeric wire format.
// <<ast serialize token>>=
// @
BFS node serialization
Each node is encoded as a JSON-like array. The token tuple is expanded
inline by serialize_token (which returns a comma-separated string), so
the full wire format is:
[node_kind, token_src, token_kind, token_pos, token_length, end_pos, [child_indices...]]
The format! call has four {} slots — node_kind, serialize_token(…),
end_pos, and parts — but serialize_token itself emits four
comma-separated fields, yielding seven fields total in the output line.
BFS traversal (via VecDeque) ensures that child indices are always a
contiguous range starting at next_idx. The root is always at index 0.
next_idx is advanced by node.parts.len() for each dequeued node,
mirroring the order in which children are enqueued.
// <<ast serialize nodes>>=
// @
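A sketch of the wire format under the description above. Struct names, field types, and the discriminant values in the example are illustrative assumptions; the line layout (four `format!` slots, seven fields on the wire) follows the prose.

```rust
use std::collections::VecDeque;

struct Token { src: i32, kind: i32, pos: usize, length: usize }
struct Node { kind: i32, token: Token, end_pos: usize, parts: Vec<Node> }

// Comma-separated token tuple: src,kind,pos,length
fn serialize_token(t: &Token) -> String {
    format!("{},{},{},{}", t.src, t.kind, t.pos, t.length)
}

// BFS over the tree; each output line is
// [node_kind, src, token_kind, pos, length, end_pos, [children...]]
fn serialize_nodes(root: &Node) -> Vec<String> {
    let mut lines = Vec::new();
    let mut queue = VecDeque::new();
    queue.push_back(root);
    let mut next_idx = 1; // root sits at index 0
    while let Some(node) = queue.pop_front() {
        let children: Vec<String> = (next_idx..next_idx + node.parts.len())
            .map(|i| i.to_string())
            .collect();
        next_idx += node.parts.len();
        // Four {} slots; serialize_token expands to four fields,
        // so seven fields appear on each line.
        lines.push(format!(
            "[{}, {}, {}, [{}]]",
            node.kind,
            serialize_token(&node.token),
            node.end_pos,
            children.join(", ")
        ));
        for child in &node.parts {
            queue.push_back(child);
        }
    }
    lines
}
```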
Writing AST output
`write_ast` is the generic writer; `write_ast_to_file` adds stdout-vs-file
dispatch based on the `-` convention used throughout the codebase.
// <<ast write ast>>=
// @
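The dispatch itself is a two-line decision; this sketch assumes a pre-serialized string and a hypothetical function name (the real `write_ast_to_file` takes the AST):

```rust
use std::fs::File;
use std::io::{self, Write};

// `-` means stdout; anything else is treated as a file path.
fn write_output(path: &str, data: &str) -> io::Result<()> {
    if path == "-" {
        io::stdout().write_all(data.as_bytes())
    } else {
        File::create(path)?.write_all(data.as_bytes())
    }
}
```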
Input reading
`read_input` applies the same `-`-means-stdin convention to the input side.
// <<ast read input>>=
// @
CLI dump entry point
dump_macro_ast is the public API called by the binary. For each input
file it lexes, parses, and serializes the AST, then writes it alongside
the source with a .ast extension (or to stdout for -).
The header line carries a # src:0=<path> annotation so external
consumers can map the src field of each token back to the originating
file.
// <<ast dump>>=
// @
Tests
Key tests and what they guard:
- `test_node_kind_discriminants` — regression guard for the numeric discriminant values. `NodeKind::NotUsed = 0` is intentional: Python's `IntEnum` starts at 1, so reserving 0 keeps the wire format aligned.
- `test_serialize_bfs_child_indices` — verifies BFS ordering. With an old DFS implementation, node B landed at index 4 instead of 2.
- `test_serialize_token_src_field_present` — `token.src` must appear in the output so external evaluators can trace which source file a node came from.
- `test_strip_*` — cover the cases of `strip_space_before_comments`: Space node removed, Text node trimmed, block comment followed by newline (stripped), block comment not followed by newline (kept), multiple consecutive Space nodes all removed before a comment, tab trimming, and spaces before two consecutive line comments both removed.
- DFA edge cases — `test_param_double_equals_value_starts_with_equal` (`Equal` as first value token), `test_param_var_as_first_token_is_positional`, `test_param_block_as_first_token_is_positional` — confirm the invariant that only `Ident` opens the named-detection path.
- `test_pipeline_*` — integration tests through the full lex → parse → strip → AST pipeline.
- `test_strip_is_idempotent` — running `strip_space_before_comments` twice produces the same result as running it once (no double-trim).
- `test_ast_no_comments_invariant` — `lex_parse_content` leaves no `LineComment` or `BlockComment` nodes anywhere in the AST tree.
// <<ast tests>>=
// src/ast/tests.rs
use super::*;
use crate::parser::ParseNode;
use crate::parser::Parser;
use crate;
/// Helper to create a basic token
/// Helper to create a node and add it to parser, returning its index
/// Builder to create sequence of nodes
/// Helper to verify AST node structure
// ── DFA edge cases ─────────────────────────────────────────────────────
// ── NodeKind discriminants — regression guard ──────────────────────────
// ── serialize_ast_nodes — BFS ordering ────────────────────────────────
// ── strip_space_before_comments ────────────────────────────────────────
// ── Full pipeline ──────────────────────────────────────────────────────
// @