Typed Structure Over Flat Text: Why AI Agents Need Traversable Documents

K. Brady Davis CloudSurf Software LLC, Las Vegas, NV, USA brady@cloudsurf.com

Abstract

AI agents consume documents through finite context windows, yet the dominant medium for this consumption --- Markdown, HTML, plain prose --- carries no declared block types, no typed cross-references, no priority signals, and no pre-computed token costs. The consequences are systematic: agents hallucinate document structure, misclassify content roles, cannot trace dependency relationships across files, and commit tokens blindly against non-renewable context budgets. These are not efficiency problems solvable by larger models or longer windows; they are capability gaps imposed by the format itself. We formalize this claim by defining *typed traversability* --- four properties (declared block types, typed cross-references, priority signals, and pre-computed token costs) that make a document format algorithmically navigable --- and present Hierarchical Budget-Constrained Descent Search (HBDS), a four-phase traversal algorithm that navigates typed knowledge hierarchies under strict token budgets. HBDS requires all four properties; flat text provides none. Each phase maps to a required property whose absence disables that phase entirely, converting the argument from preference to proof: typed structure enables algorithms that flat text prohibits. We report three controlled study designs testing the thesis across context retrieval accuracy, knowledge graph traversal, and knowledge-intensive generation, plus production telemetry from 15 repositories over eight weeks showing 25--35% reductions in token consumption and improved task completion rates. The findings reframe document format as an architectural decision for AI systems --- a capability gate that determines which algorithms are available, not a presentation choice about how content looks when rendered.

1. Introduction

AI agents now operate autonomously on codebases, knowledge bases, and documentation at unprecedented scale. Coding assistants navigate million-line repositories. Research agents synthesize findings across hundreds of papers. Engineering agents generate, review, and deploy software from natural-language specifications. In every case, the fundamental interaction is the same: the agent reads text into a finite context window, extracts meaning relevant to the current task, and acts on that meaning. The dominant medium for this interaction is flat text --- Markdown, HTML, plain prose --- and this is treated as a given. Document formats are viewed as a presentation concern, a choice about how content looks when rendered, orthogonal to the algorithms that process it.

This assumption is wrong, and the consequences are systematic. Flat text forces agents to infer structure that could have been declared. When an agent encounters a paragraph under a "Decisions" heading in a Markdown file, it must infer that the paragraph is a decision record; when it encounters a code block, it must infer whether the block is an example, a requirement, a template, or a test case. Every such inference is a potential error. When an agent follows a hyperlink from one Markdown file to another, it must infer the relationship between the two documents --- does the second depend on the first, extend it, supersede it, or contradict it? The link text says "see Authentication doc"; the relationship semantics are absent. When an agent operates under a context-window budget (which is always --- the budget is finite by architectural necessity), it cannot determine the token cost of a document subtree before reading it, because flat text does not expose that information. Every read is a blind commitment against a non-renewable resource.

These failures are not incidental. They arise from a deeper problem that is not about efficiency but about capability. Certain algorithms that would make agents dramatically more effective are impossible on flat text --- not merely difficult or expensive, but structurally precluded --- because the format lacks the information the algorithm requires. A budget-constrained hierarchical descent algorithm needs to know the token cost of a subtree before expanding it; flat text does not expose token costs. A type-aware relevance scorer needs to distinguish a ::decision block from a ::callout[type=tip]; flat text provides no block types. A hierarchical traversal needs parent-child pointers with typed edges; flat text has untyped hyperlinks with no relationship semantics. The format is not a downstream serialization choice. For AI agents, the document format is an architectural concern that determines which algorithms are available and how well they perform.

We demonstrate this thesis with HBDS (Hierarchical Budget-Constrained Descent Search), a traversal algorithm that navigates typed knowledge hierarchies under strict token budgets. HBDS requires four properties of the document format --- declared block types, typed cross-references, priority signals, and pre-computed token costs --- that we collectively term typed traversability. None of these properties exist in flat text formats such as Markdown or HTML. The algorithm is impossible on Markdown. It is trivial on a typed format. This is a proof by construction: the existence of a useful algorithm that requires typed structure and cannot operate on flat text demonstrates that the format is a capability gate, not a stylistic preference.

This paper makes five contributions:

  1. An analysis of why context-window limits are a permanent architectural constraint, and why alternative architectures --- state space models, sparse attention, memory-augmented transformers --- make typed structure more necessary, not less (Section 2).

  2. A failure-mode analysis identifying five systematic failure categories when AI agents operate on flat, untyped text, each traceable to a missing format property (Section 3).

  3. A formal definition of typed traversability --- four properties (P1--P4) that make a document format algorithmically navigable by AI agents --- with a six-level spectrum from fully untyped to fully typed-traversable (Section 4).

  4. HBDS: a budget-constrained hierarchical descent algorithm that demonstrates typed traversability in action and proves by construction that flat text is insufficient --- every phase of the algorithm maps to a required property, and the absence of any property disables at least one phase (Section 5).

  5. Empirical evidence comparing agent performance on flat versus typed documents across three controlled studies (context retrieval accuracy, knowledge graph traversal, HBDS versus RAG on knowledge-intensive generation), plus production telemetry from 15 repositories over eight weeks (Section 6).

The remainder of the paper is organized as follows. Section 2 establishes the background: how agents consume documents today, why context limits are permanent, why flat text dominates, how alternative architectures interact with the structure question, and what prior work exists on typed formats and AI-document interaction. Section 3 identifies the five failure modes of flat text, each grounded in a concrete agent workflow. Section 4 defines typed traversability formally and presents the traversability spectrum. Section 5 presents the HBDS algorithm, maps each phase to the properties it requires, formalizes the budget model, and compares HBDS against flat-text retrieval approaches. Section 6 reports the empirical evidence. Section 7 discusses implications for document format designers, agent builders, knowledge-base platforms, and the RAG community. Section 8 acknowledges limitations and identifies future work. Section 9 concludes.

2. Background

2.1 How AI Agents Consume Documents Today

Every AI agent operating on documents follows the same fundamental cycle: read text into the context window, extract meaning relevant to the current task, and act on that meaning by generating output, invoking tools, or reading further. This read-extract-act loop is well-understood in the agentic RAG literature [1]. What is less frequently examined is the economic structure of the loop: the context window is not merely a container for text but a finite resource to be spent. Every token consumed by retrieval is a token unavailable for reasoning, generation, or further retrieval. The agent faces a budget allocation problem at every step --- how much of its remaining context to invest in reading versus acting.

This budget constraint shapes three decisions that every agent must make. First, discovery: which documents or fragments are worth reading at all? The agent must identify candidates without consuming the tokens required to evaluate them fully. Second, relevance: among the candidates read into context, which content should be retained for the current task and which discarded? Liu et al. [2] demonstrate that language models degrade on information positioned in the middle of long contexts, so even content that is present in the window may be functionally lost. Third, budget management: how many tokens remain, and what is the expected cost of the next retrieval step? Without answers to these questions, the agent cannot plan its reading --- every retrieval is a blind commitment of unknown size.
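The budget problem can be made concrete with a small sketch. The loop below (illustrative only; the function and tokenizer are hypothetical, not from any system described here) shows why reading flat text is a blind commitment: a document's token cost is discovered only after the tokens have been paid for.

```python
# Illustrative sketch: an agent's read loop framed as budget allocation.
# With flat text, a document's token cost is unknown until it is tokenized,
# so every read is a blind commitment against the remaining budget.

def blind_read_loop(candidates, tokenize, budget):
    """Read candidate documents in order until the budget is exhausted.

    `candidates` is an ordered list of raw document strings; the agent
    cannot see a document's token cost before paying it.
    """
    kept, spent = [], 0
    for doc in candidates:
        cost = len(tokenize(doc))   # cost discovered only AFTER tokenizing
        if spent + cost > budget:   # overrun detected too late to plan around
            break
        kept.append(doc)
        spent += cost
    return kept, spent

# Toy tokenizer: whitespace split stands in for a real BPE tokenizer.
docs = ["a b c d", "e f", "g h i j k l m n"]
kept, spent = blind_read_loop(docs, str.split, budget=7)
# The third document (8 tokens) would overrun the budget, so it is dropped
# entirely --- the agent had no way to know its cost in advance.
```

The point of the sketch is the ordering of information: the cost check happens after tokenization, which is exactly the property that pre-computed token costs (Section 4) invert.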

Current retrieval systems address these decisions with varying degrees of sophistication. Table 1 summarizes six approaches spanning the spectrum from flat retrieval to hierarchical search.

Table 1. Current retrieval approaches for AI agent document consumption.

| Approach | Algorithm | How It Works | What It Lacks |
| --- | --- | --- | --- |
| RAG [3] | kNN in embedding space | Chunk documents, embed, retrieve top-k similar chunks at query time | No hierarchy, no block types, arbitrary chunk boundaries, no budget control |
| Agentic tool use | LLM-as-heuristic | Agent has tools (grep, read, search); the LLM decides what to search based on reasoning [1] | No formal budget tracking, no guaranteed coverage, no type awareness |
| Graph RAG [4] | Graph traversal on auto-built KG | Extracts entities and relations from flat text, builds knowledge graph, traverses community summaries | Structure is inferred post-hoc --- extraction errors propagate through the graph |
| RAPTOR [5] | Recursive clustering + tree search | Clusters chunks, summarizes clusters into tree levels, searches top-down | Tree is auto-generated from flat text; no typed blocks, no declared token costs |
| LATTICE [6] | LLM-guided beam search | Hierarchical traversal with calibrated path relevance scoring and beam-width pruning | Budgets node expansions (computation), not tokens (context window); no typed document requirement |
| Hybrid (Cursor, Copilot) | Embeddings + agentic reasoning | RAG retrieves initial candidates, LLM reasons about which to explore further | Flat initial retrieval; no hierarchy in the refinement step |

Standard RAG [3] operates entirely on flat structure: documents are split into chunks by character count or paragraph boundary, embedded into a vector space, and retrieved by cosine similarity. The chunk boundaries are arbitrary and frequently split semantic units --- a decision rationale may span three paragraphs, but the chunker sees only byte offsets. Agentic tool-use systems extend RAG by giving the LLM discretion over what to search and when, but the underlying documents remain flat and untyped; the agent must infer the purpose of each text fragment from its content alone.
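The chunk-boundary problem is easy to demonstrate. The sketch below (a toy; the document text is invented for illustration) chunks by character count, as many RAG pipelines do by default, and shows a single decision rationale landing in two different chunks:

```python
# Illustrative sketch: fixed-size chunking, as in standard RAG pipelines,
# splits on byte offsets with no regard for semantic units. The decision
# rationale below is one logical block but lands in separate chunks.

def chunk_by_chars(text, size):
    """Split text into fixed-size character windows (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = (
    "## Decisions\n"
    "We chose Postgres because of transactional guarantees. "
    "The rationale spans several sentences and must be read as a unit."
)
chunks = chunk_by_chars(doc, 60)

# No single chunk contains both the decision ("Postgres") and the end of
# its rationale ("unit") --- the semantic unit was split at a byte offset.
split_mid_unit = not any("Postgres" in c and "unit" in c for c in chunks)
```

A retriever that surfaces only one of these chunks hands the agent an incomplete decision record, which is precisely the failure mode a declared block boundary would prevent.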

Graph RAG [4] and RAPTOR [5] represent a more sophisticated class: they infer structure from flat text at indexing time. Graph RAG uses an LLM to extract entities and relationships from source documents, builds a knowledge graph, generates community summaries at multiple granularities, and traverses this graph at query time. RAPTOR recursively clusters document chunks, generates abstractive summaries for each cluster, and builds a tree that supports top-down search. Both systems improve over flat RAG on tasks requiring multi-hop reasoning or global synthesis. However, the structure in both cases is inferred, not declared. The knowledge graph is only as accurate as the extraction model; the cluster tree is only as coherent as the clustering algorithm. Misidentified entities propagate as false edges; poorly clustered paragraphs produce misleading summaries. The inference step is expensive (requiring LLM calls at indexing time) and lossy (discarding information the extraction model deems irrelevant).

LATTICE [6], the most recent hierarchical approach, organizes a corpus into a semantic tree offline and uses an LLM to navigate this hierarchy at query time via a best-first traversal with calibrated path relevance scores. LATTICE advances the state of the art by introducing a principled budget mechanism --- it limits the number of node expansions during traversal. However, the budget is measured in node expansions (a computational resource), not in tokens consumed (a context-window resource). An agent cannot use LATTICE's budget model to answer the question "can I afford to read this subtree without overrunning my context window?" because the node expansion count does not map to a token cost. Furthermore, LATTICE, like Graph RAG and RAPTOR, builds its hierarchy from flat text --- it does not require or leverage typed structure in the source documents.

The key observation across all six approaches is not that current systems lack retrieval algorithms --- they do not. It is that every approach either operates on flat structure directly (RAG, agentic tools) or infers structure from flat text at indexing time (Graph RAG, RAPTOR, LATTICE). None uses structure declared in the document format itself. This distinction is consequential:

  • Inferred structure is lossy, expensive to compute, and error-prone. The extraction model may misidentify entities, hallucinate relations, or miss implicit connections. The resulting tree or graph is only as good as the inference.
  • Declared structure is exact, free at retrieval time (the cost is paid once at authoring time), and machine-validated. A ::decision block is a decision because the author declared it so, not because an NLP model classified it with 83% confidence.

The HBDS algorithm presented in Section 5 does not introduce search where none existed. Rather, it operates on declared structure --- which is cheaper to maintain, more accurate, and enables capabilities that inferred structure cannot reliably provide. Pre-read budget verification (Section 5.3) requires knowing a node's token cost before reading it; this is trivial with declared token counts (P4) and impossible with inferred structure, where the cost is unknown until the inference completes. Type-aware relevance scoring requires knowing that a block is a ::decision rather than a ::callout[type=tip]; this is exact with declared types (P1) and probabilistic with inferred types. Graceful degradation to summaries requires a declared ::summary layer; inferred summaries, while useful (as RAPTOR demonstrates), are approximations that discard information the summarization model deemed unimportant rather than information the author marked as secondary.
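Pre-read budget verification is mechanically simple once costs are declared. The sketch below is a minimal illustration of the idea, not the paper's format: the node shape and field names (`tokens`, `children`, `type`) are hypothetical stand-ins for declared structure with per-node token counts (P4).

```python
# Illustrative sketch of pre-read budget verification: with token costs
# declared per node, the agent decides whether a subtree fits the remaining
# budget WITHOUT reading any of its content. Field names are hypothetical.

def subtree_cost(node):
    """Sum declared token costs over a node and all its descendants."""
    return node["tokens"] + sum(subtree_cost(c) for c in node.get("children", []))

def can_afford(node, remaining_budget):
    """Answer 'can I read this subtree?' before committing a single token."""
    return subtree_cost(node) <= remaining_budget

auth_section = {
    "type": "section", "tokens": 40,
    "children": [
        {"type": "decision", "tokens": 120},
        {"type": "callout",  "tokens": 30},
    ],
}

# The subtree costs 190 tokens: affordable with 200 remaining, not with 150.
```

The same query is impossible on inferred structure, where the cost is unknown until the inference itself has been paid for.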

2.2 Why Context Limits Are Permanent

A common assumption in the AI agent community is that context window limitations are a temporary engineering constraint --- that sufficiently large windows will eventually make retrieval unnecessary. This assumption is incorrect. The context window constraint is architectural, not temporary: it arises from at least six independent causes, each with known mitigations that reduce the constant factor but do not eliminate the underlying scaling behavior.

Table 2. Architectural constraints on context window size.

| Constraint | Cause | Mitigation | Eliminates Limit? |
| --- | --- | --- | --- |
| O(n^2) attention | Self-attention requires every token to attend to every other token [N-Vaswani]. Doubling context quadruples compute | Sparse attention (Longformer [N-Beltagy], BigBird [N-Zaheer]), Flash Attention [N-Dao] | No --- trades global coherence for length |
| KV cache memory | Key-value pairs stored per token per layer. A 70B-parameter model at 1M tokens requires ~15--40 GB of KV cache per request [N-KVCache] | Multi-query attention, grouped-query attention, KV cache compression | Reduces constant factor, not scaling class |
| Inference cost | Every context token incurs compute on every forward pass. A 200K-token prompt costs roughly 50x a 4K-token prompt at equivalent model size | Prompt caching, prefix sharing | Reduces amortized cost; per-request cost remains proportional to context length |
| Latency | Time-to-first-token scales with context length; users and downstream agents experience delay proportional to input size | Speculative decoding, chunked prefill | Marginal improvement; latency remains monotonically increasing |
| Lost in the middle | Language model accuracy degrades for information positioned in the middle of long contexts, exhibiting a U-shaped retrieval curve [N-Liu]. More context can *reduce* accuracy | No known architectural mitigation as of 2026 | No --- fundamental attention distribution phenomenon, not a capacity limitation |
| Training sequence length | Model performance degrades on sequences longer than those seen during training | RoPE scaling [N-Su], ALiBi [N-Press] | Extrapolation, not elimination --- degradation is gradual rather than catastrophic |

Each row in Table 2 represents an independent physical or information-theoretic constraint. Sparse attention mitigates row 1 but not row 5. KV cache compression mitigates row 2 but not row 6. No single technique addresses all six simultaneously, because the constraints arise from different mechanisms. The lost-in-the-middle effect [N-Liu] is particularly consequential for the scaling-optimist argument: Liu et al. demonstrated that on multi-document question-answering tasks, models with 4K--16K token contexts exhibited significantly degraded accuracy when relevant information appeared in middle positions, even when the total context was well within the model's advertised capacity. Increasing the window does not resolve this --- it widens the degraded region.

Despite these constraints, context windows have grown dramatically. GPT-3 (2020) supported 4,096 tokens. GPT-4 Turbo (2023) extended to 128,000 tokens. By early 2026, Gemini 2.5 Pro supports 1--2 million tokens [N-Gemini], and Llama 4 Scout claims a 10-million-token context window [N-Llama4]. This trajectory --- roughly one order of magnitude every 18 months --- appears to vindicate the assumption that scaling alone will eventually make retrieval unnecessary.

It does not. The knowledge worth querying grows faster than the window that holds it. The indexed web contains approximately 510 trillion tokens; the full web is estimated at 3,100 trillion tokens [N-CommonCrawl]. A single enterprise knowledge base routinely contains millions of documents. Global data volume reached an estimated 181 zettabytes in 2025, doubling approximately every four years [N-IDC]. Even the largest announced context window (10M tokens) represents roughly one 50-millionth of the indexed web (about 0.000002%) and one 300-millionth of the full web. Every order-of-magnitude increase in context window capacity is met by an estimated two-orders-of-magnitude increase in the knowledge available for querying, because the growth in digitized information is exponential and unconstrained by the hardware budgets that govern model inference. The ratio of available knowledge to context capacity is not converging --- it is diverging.
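A quick back-of-envelope check makes the scale of the gap explicit, using the corpus estimates cited above:

```python
# Back-of-envelope check of the window-to-corpus ratio using the figures
# cited above: a 10M-token window against the estimated token counts of
# the indexed and full web.

window_tokens = 10_000_000       # largest announced context window (Llama 4 Scout)
indexed_web_tokens = 510e12      # ~510 trillion tokens (estimate)
full_web_tokens = 3_100e12       # ~3,100 trillion tokens (estimate)

indexed_ratio = window_tokens / indexed_web_tokens   # ~2e-8, i.e. 1 part in ~51 million
full_ratio = window_tokens / full_web_tokens         # ~3e-9, i.e. 1 part in ~310 million
```

Even under generous assumptions about future window growth, the window remains a vanishing fraction of the queryable corpus, which is why the problem is one of search rather than capacity.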

This divergence reframes the problem. The question is not "when will context windows be large enough?" --- the answer, relative to the knowledge that exists, is never. The question is: how does an agent navigate a knowledge space that is permanently larger than its context window? This is a search problem, not a scaling problem. The agent requires an algorithm that selects which knowledge to read within a finite token budget, verifies the cost before committing tokens, and degrades gracefully when the budget is insufficient for full content. That algorithm is HBDS (Section 5). Its design requirements --- pre-computed token costs, typed block boundaries, priority signals, hierarchical cross-references --- are precisely the typed-traversability properties that flat text lacks (Section 4).

2.3 The Flat Text Default

Despite decades of work on structured document formats, flat text remains the dominant medium through which AI agents consume knowledge. The reasons are historical and ergonomic, but the consequences for agent performance are architectural.

Markdown [7] has become the lingua franca of technical documentation. GitHub renders it natively; every major documentation generator (Sphinx, MkDocs, Docusaurus, Jekyll) consumes it; README files, API references, changelogs, and architectural decision records are overwhelmingly written in it. Gruber's original design prioritized human readability over machine parseability --- the format was intended to be "publishable as-is, as plain text" [7]. This design goal succeeded: Markdown is easy to write and easy to read. It is also easy to misparse, because the format carries no semantic type information. A heading hierarchy in Markdown appears to be a tree --- #, ##, ### suggest nesting --- but the underlying representation is a flat stream of tokens. There are no parent-child pointers between a ## heading and its ### children, no way to query "which blocks belong to this section?" without re-inferring the hierarchy from heading levels, and no declared token cost for any subtree.
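The re-inference burden is easy to see in code. The sketch below (illustrative, not part of any production parser) recovers a section tree from Markdown heading levels; the point is that this inference must be repeated by every consumer, because the format stores no parent-child pointers:

```python
# Illustrative sketch: recovering a section tree from Markdown heading
# markers. The hierarchy must be RE-INFERRED from typographic convention
# (#, ##, ###) on every read; the format itself stores no parent-child
# pointers and no subtree token costs.

import re

def infer_heading_tree(markdown):
    """Return (level, title, parent_index) triples inferred from # markers."""
    headings, stack = [], []   # stack holds indices of currently open ancestors
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if not m:
            continue   # body text carries no structural signal at all
        level, title = len(m.group(1)), m.group(2)
        while stack and headings[stack[-1]][0] >= level:
            stack.pop()   # close siblings and deeper sections
        parent = stack[-1] if stack else None
        headings.append((level, title, parent))
        stack.append(len(headings) - 1)
    return headings

doc = "# API\n## Auth\ntext\n### Tokens\n## Errors\n"
tree = infer_heading_tree(doc)
# tree: API is the root; Auth and Errors are its children; Tokens nests
# under Auth --- all of it reconstructed, none of it declared.
```

Note what the inference cannot recover: whether a paragraph under "Decisions" is a decision record, and how many tokens any reconstructed subtree costs. Those facts were never encoded.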

Other dominant platforms exhibit the same pattern. Wikipedia articles are authored in MediaWiki markup, which provides structural primitives (templates, categories, infoboxes), but these are stripped during rendering to HTML. The semantic signal that {{Infobox}} carries --- this is structured factual data --- is lost in the rendered output that most consumers (including AI agents via web scraping or API access) receive. The structure exists in the source markup but is lost in the representation agents actually consume. Confluence stores documents as XHTML in a proprietary storage format, accessible only through Atlassian's REST API. The XML carries structural information (macros, layouts, metadata), but this structure is trapped behind a vendor-specific schema with no cross-platform parseability. Notion represents documents as proprietary JSON block trees --- semantically richer than Markdown but available only through Notion's API, with export limited to Markdown (which discards the block-type information) or CSV (which discards the hierarchy). In each case, whatever structure the platform maintains internally is either lost at the consumption boundary or locked behind a proprietary interface.

The common thread across these formats is that structure, where it exists, is visual rather than semantic. Bold text means "important" to a human reader scanning the page; to a tokenizer, it is two asterisk tokens surrounding a span. A bulleted list under a heading looks like subordinate content to a human eye; to a parser, it is a sequence of line-initial dash characters with no declared relationship to the preceding heading. An AI agent consuming a Markdown document receives a linear token stream in which the hierarchical, typed, and prioritized relationships between content blocks --- the very relationships that the retrieval algorithms described in Section 2.1 need --- must be inferred from typographic conventions that were designed for human visual processing, not machine traversal.

2.4 Alternative Architectures: Can You Remove the Limit?

The quadratic attention complexity described in Section 2.2 has motivated three families of alternative architectures, each attempting to reduce or eliminate the O(n^2) bottleneck. A natural question follows: if the context limit can be removed, does typed structure become unnecessary? The answer, as we argue below, is the opposite. Each alternative introduces a new information-selection problem that typed structure is uniquely positioned to solve.

State Space Models. Mamba [8] and RWKV [9] replace the attention mechanism with a recurrent state-space formulation that processes tokens sequentially, achieving O(n) time complexity and O(1) memory in sequence length. Mamba introduces selective state spaces, where the model's parameters are functions of the input, allowing it to propagate or forget information along the sequence dimension depending on the current token [8]. RWKV reformulates the transformer as a linear-complexity RNN, demonstrating competitive performance with transformers at scales up to 14 billion parameters [9]. Dao and Gu [10] further established a formal duality between structured SSMs and masked attention, showing that the same sequence transformation admits both a linear-time recurrence and a quadratic-time attention realization.

The architectural consequence is that the hidden state is the context --- the model's entire history is compressed into a fixed-size vector that is updated incrementally. This eliminates the quadratic attention cost but introduces a fundamental trade-off: the compression is lossy. The model cannot precisely retrieve an arbitrary fact from token 50,000; it can only access whatever signal survived compression into the fixed-size state. Critically, this compression is uniform across all token types. Every token --- whether it carries a critical design decision or a stylistic aside --- is compressed through the same state-update dynamics. The selective mechanism in Mamba [8] provides content-dependent gating, but the gating operates on raw token features, not on declared semantic types.

In a typed document, the distinction between a ::decision block and a ::callout[type=tip] is explicit, machine-readable, and available before the tokens enter the model. An SSM-based agent with access to typed structure could allocate state capacity preferentially --- preserving decision blocks at full fidelity while compressing informational asides more aggressively. Without typed structure, the model must infer from token-level features alone which content warrants preservation, a strictly harder problem. Typed structure tells the model what to remember.
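One way to picture type-preferential state allocation is as a retention schedule computed before any token enters the model. The sketch below is purely illustrative: the weight values and type names are hypothetical, and a real SSM would consume such a schedule as a gating signal rather than a lookup table.

```python
# Illustrative sketch (hypothetical weights): if block types are declared,
# an SSM-style agent could assign a state-retention priority per block
# BEFORE tokens enter the model, instead of inferring importance from raw
# token features during the recurrence.

RETENTION_PRIORITY = {          # hypothetical: higher = preserve more state
    "decision":     1.0,        # critical; compress minimally
    "summary":      0.8,
    "callout/tip":  0.3,
    "callout/info": 0.2,        # aside; compress aggressively
}

def retention_schedule(blocks, default=0.5):
    """Map each (type, text) block to a retention weight by declared type."""
    return [(text, RETENTION_PRIORITY.get(btype, default))
            for btype, text in blocks]

doc = [
    ("decision",     "Use Postgres for transactional guarantees."),
    ("callout/info", "See also the FAQ."),
]
schedule = retention_schedule(doc)
# The decision block gets full retention; the aside is marked compressible.
```

The crucial property is that the schedule is computed from declared metadata, so it is exact and costs nothing at inference time.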

Sparse and Linear Attention. Longformer [11] and BigBird [12] reduce attention from O(n^2) to O(n) by restricting which tokens attend to which. Longformer combines a sliding local window (each token attends to w neighbors) with task-motivated global tokens that attend to all positions [11]. BigBird adds random attention connections to this combination of local and global patterns, proving that the resulting sparse attention mechanism is a universal approximator of sequence functions and is Turing complete [12]. FlashAttention [13] takes a complementary approach, preserving exact O(n^2) attention but making it IO-aware --- using tiling to reduce memory reads and writes between GPU HBM and SRAM, yielding 2--3x wall-clock speedups without changing the attention pattern itself.

The critical design decision in sparse attention is the sparsity pattern: which tokens receive global attention and which are restricted to local windows. In Longformer, global tokens are assigned by task-specific heuristics --- the [CLS] token in classification, question tokens in QA [11]. In BigBird, global tokens are either selected randomly or assigned to the first g tokens of the sequence [12]. These strategies are adequate when the model controls the input format (as in fine-tuning), but they are arbitrary when the agent consumes general-purpose documents. Every k-th token, or the first token of each chunk, is a positional heuristic with no semantic basis.

In a typed document, the sparsity pattern has a natural semantic grounding. Summary blocks (::summary) are candidates for global attention because they contain high-density representations of their parent sections. Decision blocks (::decision) warrant global visibility because downstream reasoning may depend on them from any position. Informational callouts (::callout[type=info]) can be safely restricted to local windows because their relevance is typically confined to their immediate context. The type hierarchy provides exactly the signal that sparse attention needs to allocate global tokens: which content is structurally important versus locally relevant. Without types, the sparsity pattern is blind to document semantics.
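The contrast with positional heuristics can be sketched directly. The snippet below (illustrative; the block layout and type names are invented) derives a global-attention token set from declared block types rather than from "every k-th token":

```python
# Illustrative sketch: deriving a sparse-attention global-token set from
# declared block types instead of positional heuristics. Blocks are
# (type, n_tokens) pairs laid out sequentially in the input.

GLOBAL_TYPES = {"summary", "decision"}   # types that warrant global attention

def global_token_positions(blocks):
    """Return the token positions that should receive global attention."""
    positions, offset = [], 0
    for btype, n in blocks:
        if btype in GLOBAL_TYPES:
            positions.extend(range(offset, offset + n))
        offset += n   # all other blocks stay in their local windows
    return positions

blocks = [("summary", 3), ("callout/info", 4), ("decision", 2)]
global_positions = global_token_positions(blocks)
# Summary and decision tokens get global visibility; the informational
# callout is confined to its local window.
```

The sparsity pattern here is a pure function of declared structure, so it requires no task-specific tuning and no guessing about which positions matter.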

Memory-Augmented Architectures. Neural Turing Machines (NTMs) [14] introduced the paradigm of coupling a neural controller to an external memory matrix, with differentiable read and write operations governed by attention-based addressing. The controller can access memory by content (similarity to a query vector) or by location (relative positional shifting). This paradigm has been extended to retrieval-augmented transformers, where the external memory is a document store accessed via embedding-based lookup --- in effect, RAG (Section 2.1) is a memory-augmented architecture with the retrieval pipeline as the addressing mechanism.

Memory-augmented architectures are the closest to "removing" the context limit: the model's working memory (the context window) is supplemented by an external store of effectively unlimited size. But the external store introduces an addressing problem. The model must decide, at each step, where to read from and where to write to. Content-based addressing (the dominant approach in RAG and retrieval-augmented transformers) reduces to embedding similarity --- a flat, unstructured operation that treats all memory slots as interchangeable vectors in a shared space. Location-based addressing (as in NTMs [14]) is fragile because it depends on insertion order, which carries no semantic meaning.

In a typed memory store, a third addressing mode becomes available: type-and-hierarchy addressing. The agent can navigate to the correct memory location by descending through a typed hierarchy --- first to the relevant domain, then to the relevant category, then to the specific block type needed. This is precisely the HBDS algorithm described in Section 5. The memory-augmented architecture provides the external store; typed structure provides the navigation system. Without typed structure, the agent is left with content similarity (lossy, requires embedding the query) or positional indexing (brittle, order-dependent). HBDS offers a third path: semantic descent through declared structure.
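A toy version of type-and-hierarchy addressing looks like the following. This is a deliberately minimal sketch, not the HBDS algorithm of Section 5; the node shape and field names are hypothetical.

```python
# Illustrative sketch of type-and-hierarchy addressing: navigate to a
# memory location by descending named levels of a declared hierarchy,
# then select blocks by declared type --- no embeddings, no positional
# indexing. Node shape and field names are hypothetical.

def descend(node, path, target_type):
    """Follow named children along `path`, then collect blocks of target_type."""
    for name in path:
        node = next(c for c in node["children"] if c["name"] == name)
    return [c["text"] for c in node.get("children", [])
            if c.get("type") == target_type]

store = {
    "name": "root", "children": [
        {"name": "auth", "children": [
            {"name": "sessions", "children": [
                {"name": "d1", "type": "decision", "children": [],
                 "text": "JWTs expire after 15 minutes."},
                {"name": "c1", "type": "callout", "children": [],
                 "text": "See RFC 7519."},
            ]},
        ]},
    ],
}

# Address = path through declared hierarchy + declared block type.
decisions = descend(store, ["auth", "sessions"], "decision")
```

Unlike content-based addressing, the lookup is exact and requires no query embedding; unlike location-based addressing, it survives reordering of the store, because the address is semantic rather than positional.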

Table 3. Alternative architectures and their typed-structure dependencies.

| Architecture | Context Mechanism | Information Selection Problem | What Typed Structure Provides |
| --- | --- | --- | --- |
| SSMs (Mamba [8], RWKV [9]) | Fixed-size hidden state (lossy compression) | What to preserve vs. compress | Compression priority by block type |
| Sparse Attention (Longformer [11], BigBird [12]) | Local windows + global tokens | Which tokens get global visibility | Semantic sparsity patterns from type hierarchy |
| Memory-Augmented (NTMs [14], retrieval-augmented) | External memory with read/write addressing | How to navigate external memory | Type-and-hierarchy addressing (HBDS) |

The pattern across all three families is consistent. Each alternative architecture that reduces or removes the quadratic attention bottleneck introduces a new information-selection problem: what to compress, where to attend globally, or how to navigate external memory. In a standard transformer, the O(n^2) attention mechanism handles all of these implicitly --- every token attends to every other token, so no selection is required. The alternative architectures gain efficiency precisely by not attending to everything, which means they must choose. And choice requires signal. Typed structure provides that signal. The flat-text tax --- the cost of operating without declared types, priorities, and hierarchy --- does not decrease as architectures become more efficient. It increases, because the more selective the architecture, the more it depends on knowing what matters before processing begins.

2.5 Typed Document Formats

Flat text is the dominant format for AI-agent consumption, but it is not the only tradition. A parallel lineage of typed document formats has existed for decades, each providing declared, machine-parseable structure that eliminates the inference burden described in Sections 2.1 and 2.3. None, however, has achieved widespread adoption for everyday authoring --- and understanding why illuminates the design space that Section 4 formalizes.

LaTeX [N-Lamport] introduced semantic environments in 1984: \begin{theorem}...\end{theorem}, \begin{proof}, \begin{algorithm}, \begin{figure}. Each environment declares a typed block that a parser can identify without inspecting content. The amsthm package extends this with \theoremstyle declarations that distinguish definitions from conjectures from examples at the markup level [N-amsthm]. LaTeX's structure is genuinely semantic --- a \begin{lemma} block is machine-identifiably a lemma, not a paragraph that happens to start with the word "Lemma." However, LaTeX imposes substantial authoring ceremony: mandatory compilation, non-trivial syntax (\begin{...}\end{...} pairs, backslash commands, braces), and error messages that assume familiarity with TeX internals. The format was designed for typesetting, not for lightweight authoring or machine traversal.

XML and SGML provide the most complete type systems for documents. DocBook [N-Walsh], an OASIS standard originating in 1991, defines over 400 semantic elements for technical documentation --- <chapter>, <section>, <warning>, <procedure>, <programlisting> --- each schema-validated against a DTD or RELAX NG grammar. XHTML similarly structures web content with semantic tags (<article>, <nav>, <aside>). XML documents can be validated deterministically: a parser can verify that every block conforms to its declared type without inspecting content semantics. The power is real, but the cost is prohibitive for everyday use. Closing tags mirror opening tags verbatim, attributes require quoting, and documents must be well-formed to parse at all. Authoring XML by hand is slow; authoring it without specialized tooling is error-prone.

DITA (Darwin Information Typing Architecture) [N-DITA], an OASIS standard first published in 2005 with the current version 1.3 released in 2015, extends XML with a topic-typing system. Every DITA topic is one of three base types --- concept, task, or reference --- and authors can specialize these into domain-specific types. Content reuse is a first-class concern: topics are assembled into maps, and conditional processing attributes enable single-source publishing. DITA represents the strongest typing discipline in the document format tradition. It also requires an enterprise-grade toolchain (DITA Open Toolkit), XML editing expertise, and organizational commitment that individual developers and small teams rarely possess.

AsciiDoc [N-AsciiDoc] occupies a middle ground. Originally created in 2002 and now governed by the Eclipse Foundation's AsciiDoc Working Group, it provides semantic block types --- admonitions (NOTE, WARNING, CAUTION), sidebars, examples, source listings with language attributes --- within a lightweight plain-text syntax closer to Markdown in writability. AsciiDoc carries more structural information than Markdown (blocks are typed, not just visually distinct) but less than XML (no schema validation, no formal type specialization). It has gained adoption in technical documentation (the Spring Framework, Asciidoctor toolchain) but remains a niche format relative to Markdown's ubiquity.

What these formats share is the property that Section 4 formalizes as P1 (Declared Block Types): structure is part of the format, not inferred from visual presentation. A LaTeX \begin{theorem} block, a DocBook <warning> element, a DITA <task> topic, and an AsciiDoc WARNING admonition are all machine-identifiable by type at parse time. No NLP model is required to classify them. This is the fundamental advantage of typed formats over flat text for algorithmic consumption.

Why, then, did they fail to become the default medium for everyday authoring? The answer is ceremony cost --- the ratio of structural markup to content. Markdown requires zero ceremony: a developer writes prose, adds # for headings, and the document is immediately readable. LaTeX requires a compilation step. XML requires closing tags that often exceed the content they bracket. DITA requires an enterprise toolchain. AsciiDoc requires less ceremony than XML but more than Markdown, and lacks the network effects of Markdown's universal tool support. In each case, the authoring cost of declared structure exceeds what individual developers will tolerate for everyday documentation. The path of least resistance is Markdown --- and so the dominant format for AI-agent consumption is the one that provides the least structure for agents to work with.

This creates the design gap that motivates the present work: a format with the writability of Markdown and the parseability of XML. Such a format would need to carry typed block information (P1) and cross-reference semantics (P2) without imposing the syntactic overhead that made XML, DITA, and LaTeX impractical for lightweight authoring. Section 4 formalizes the four properties this format must satisfy; Section 5 demonstrates algorithmically what becomes possible when it does.

2.6 Search and Retrieval for AI Agents

The dominant retrieval paradigm for LLM-based agents is Retrieval-Augmented Generation (RAG) [N-Lewis]: partition a corpus into chunks, embed each chunk into a vector space, retrieve the top-k chunks most similar to a query embedding, and inject them into the model's context window. RAG's effectiveness depends critically on chunk quality. Fixed-size or sliding-window chunking splits documents at arbitrary boundaries --- a procedure may be severed from its safety qualifiers, a table from its caption, a decision from its rationale. Semantic chunking mitigates this by splitting on topical boundaries, but without declared block types, the chunker must infer where one semantic unit ends and the next begins. The inference is lossy: any chunking strategy applied to flat text is a heuristic approximation of structure the author could have declared explicitly.
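The lossiness of arbitrary boundaries is easy to demonstrate. The toy sketch below (the document text and chunk size are invented for illustration, not drawn from any benchmark) shows a fixed-size chunker severing a procedure from its safety qualifier:

```python
# Toy illustration, not a production chunker: fixed-size chunking splits at
# arbitrary character boundaries, so a procedure can be severed from the
# warning that qualifies it.
doc = (
    "Step 3: Run the migration script against the replica. "
    "WARNING: never run it against the primary during business hours."
)

def fixed_chunks(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = fixed_chunks(doc, 50)
# The warning lands in a different chunk than the step it qualifies, so a
# top-k retriever can return the procedure without its caveat.
print(any("Step 3" in c and "WARNING" in c for c in chunks))  # -> False
```

A typed format sidesteps the problem at the source: the warning would be a declared block attached to its parent, so no boundary inference is needed.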

Even when retrieval is perfect --- the correct chunks are found and injected --- a second problem remains. Liu et al. [N-Liu] demonstrate that language model performance degrades significantly when relevant information appears in the middle of long contexts, a phenomenon termed the "lost in the middle" effect. On multi-document question answering tasks, models with 4K--16K context windows exhibited a U-shaped accuracy curve: high when the answer appeared near the beginning or end of the input, substantially lower for middle positions. This finding is architectural, not a training artifact, and it implies that injecting more retrieved content can be counterproductive. The context window is not merely a container with a size limit; it has positional fidelity characteristics that any retrieval strategy must account for.

These limitations have motivated a wave of hierarchical and memory-augmented retrieval systems. Table 1 summarizes five recent approaches and identifies their key gaps relative to the requirements of budget-constrained traversal over typed documents.

Table 1. Hierarchical retrieval and memory systems for LLM agents.

| System | Year | Approach | Key Gap vs. HBDS |
|---|---|---|---|
| LATTICE [N-Gupta] | 2025 | LLM-guided beam search over a semantic tree built offline; calibrated latent relevance scores | Budgets node expansions (compute), not tokens (context). Tree inferred from flat text. No pre-read token verification |
| H-MEM [N-Sun] | 2025 | Four-layer hierarchical memory (domain, category, trace, episode) with index-based routing | Designed for conversational recall, not task-oriented knowledge retrieval. No token budget formalism |
| MemGPT / Letta [N-Packer] | 2024 | OS-inspired two-tier memory (main context as RAM, archival/recall as disk) with self-directed paging | Memory management for persistent agents, not knowledge search. Addressing by content similarity, not hierarchical descent |
| MemOS [N-Li] | 2025 | Memory OS with MemCube abstraction unifying plaintext, activation, and parameter memories; next-scene prediction for proactive preloading | Predicts and preloads memory; does not search a knowledge hierarchy. No budget-constrained descent |
| A-MEM [N-Xu] | 2025 | Zettelkasten-inspired agentic memory with dynamic indexing; memories autonomously generate descriptions and form connections | Graph-based retrieval via semantic similarity, not hierarchical descent. Structure emerges from evolution, not from declared document types |

LATTICE [N-Gupta] is the closest prior art to the hierarchical descent component of HBDS. Gupta et al. construct a semantic tree offline --- via either bottom-up agglomerative or top-down divisive clustering --- and use an LLM to navigate it with calibrated relevance scoring, achieving state-of-the-art zero-shot performance on reasoning-intensive benchmarks. However, LATTICE budgets node expansions (how many children the LLM evaluates) rather than tokens consumed in the context window. An agent using LATTICE cannot answer the question "can I afford to read this subtree before committing tokens?" because the tree nodes lack pre-computed token costs. Furthermore, LATTICE's tree is inferred from flat text via clustering; it does not require or exploit typed block boundaries, priority signals, or author-declared summaries.

The remaining systems address complementary problems. H-MEM [N-Sun] and MemGPT [N-Packer] manage agent memory persistence across conversations. MemOS [N-Li] optimizes memory scheduling and proactive preloading. A-MEM [N-Xu] enables memories to self-organize through agentic evolution. Each represents genuine progress on the memory problem for LLM agents. None, however, addresses the retrieval problem that HBDS targets: navigating a knowledge hierarchy under a strict token budget using declared document structure. The algorithm and the document format are treated as independent concerns across all five systems. We argue they are coupled --- the format determines which traversal algorithms are available, and the algorithm's requirements dictate what the format must provide (as formalized in Section 4).

2.7 Prior Work on AI and Document Structure

A growing body of empirical work demonstrates that documentation structure affects AI agent performance, though the field remains nascent and no prior work treats document format itself as an architectural variable for retrieval.

The strongest causal evidence comes from Lulla et al. [1], who conducted a controlled study of AGENTS.md files across 10 repositories and 124 pull requests. Agents operating with an AGENTS.md file exhibited a 28.64% reduction in median wall-clock runtime (98.57s to 70.34s) and a 16.58% reduction in median output token consumption (2,925 to 2,440 tokens), while maintaining comparable task completion rates. These results establish that even minimal repository-level context --- a single unstructured Markdown file --- measurably improves agent efficiency. The effect size is notable given that AGENTS.md files carry no type annotations, no cross-references, and no token cost metadata. The question this paper addresses is what becomes possible when those properties are present.

Santos et al. [2] studied 328 configuration files from public Claude Code projects, finding that developers specify a wide range of software engineering concerns --- build commands, implementation details, architectural constraints --- with particular emphasis on architectural guidance. The analysis reveals that practitioners intuitively structure context files around the concerns agents need most, but do so without a formal schema or type system. Each file is an ad hoc encoding of project knowledge, parseable by LLMs but not by deterministic algorithms.

At larger scale, Chatlatanagulchai et al. [3] analyzed 2,303 agent context files from 1,925 repositories spanning multiple agentic coding tools. Their central finding is that these files are "not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions." Developers prioritize functional requirements --- build/run commands (62.3%), implementation details (69.9%), architecture (67.7%) --- while non-functional concerns such as security and performance appear in only 14.5% of files. The characterization of context files as configuration-like artifacts that evolve incrementally supports the thesis that these documents serve a machine-facing purpose that their format does not reflect: they are written as prose but consumed as instructions.

Mohsenimofidi et al. [4] examined context engineering practices across 466 open-source projects and found "no established content structure" --- developers employ descriptive, prescriptive, prohibitive, explanatory, and conditional styles with high variation. The absence of structural conventions means that each agent must infer the organizational logic of each file independently, a cost paid on every interaction. This finding directly motivates the typed-structure approach: if the format declared its organization, the inference cost would be zero.

In the broader retrieval literature, knowledge graph-augmented approaches consistently outperform flat-text retrieval on question-answering benchmarks. Edge et al. [5] demonstrated that Graph RAG --- which extracts entity-relationship structure from text and traverses community summaries --- substantially improves answer comprehensiveness and diversity over standard RAG on corpora up to one million tokens. Zhu et al. [6] showed that knowledge graph-guided retrieval improves both response quality and retrieval quality on HotpotQA by establishing fact-level relationships between chunks, addressing the fragmentation inherent in flat chunking. These results confirm a general principle: structured knowledge representations outperform unstructured ones for retrieval tasks.

However, a critical gap remains. Every retrieval approach in the literature either operates on flat text (standard RAG, agentic tool use) or constructs structure from flat text at retrieval time (Graph RAG, RAPTOR). No prior work uses structure declared in the document format itself as the basis for agent retrieval. The format is treated as a serialization choice downstream of the retrieval algorithm --- a presentation concern, not an architectural one. The present work argues this relationship is reversed: the document format determines which retrieval algorithms are available and how well they perform. The typed-traversability properties defined in Section 4 and the HBDS algorithm presented in Section 5 are the formal statement of this claim.

3. Failure Modes of Flat Text

The failures that arise when AI agents operate on flat text are not random --- they fall into five systematic categories, each traceable to a specific format property that is absent. These are not hypothetical failure modes inferred from first principles. They are observable in production agent workflows and reproducible across agent architectures. Section 4 formalizes the four properties whose absence causes them; the present section catalogs the failure mechanisms and their consequences.

3.1 Structure Hallucination (missing P1: Declared Block Types)

When an agent consumes a flat document, it must infer the purpose of every block from content alone. Headings suggest hierarchy, but they do not declare it. A Markdown document with a ## Future Work section followed by a bulleted list of features provides no machine-readable signal distinguishing speculative ideas from active requirements. The heading text is suggestive to a human reader; to a tokenizer, it is a sequence of characters no different from any other heading. An agent tasked with extracting current project requirements from such a document may treat the Future Work list as requirements --- not because the agent is deficient, but because the format provides no type boundary between "what we plan to build" and "what we might build someday."

This failure mode is a direct consequence of the absence of P1 (Declared Block Types, Section 4). In a typed format, a ::decision[status=accepted] block is unambiguously a ratified decision, and a ::callout[type=idea] block is unambiguously speculative. The agent does not need to infer intent from content; the type annotation resolves it at parse time. Structure hallucination occurs whenever the agent must guess what a block is --- and in flat text, it must always guess. The severity scales with document length: the longer the document, the more boundaries the agent must infer, and the more opportunities for inference errors to compound.

3.2 Type Ambiguity (missing P1: Declared Block Types)

Closely related to structure hallucination but distinct in mechanism, type ambiguity arises when the same syntactic element in flat text serves multiple unrelated semantic purposes. A fenced code block in Markdown could contain an API endpoint specification, a usage example, a configuration template, a test fixture, or pseudocode for a proposed algorithm. The triple-backtick syntax distinguishes "this is code" from "this is prose," but it does not distinguish what kind of code or what role it plays. Similarly, a Markdown blockquote (>) could be a citation, a warning, an aside, a pull quote, or a deprecated passage retained for historical context. The format is ambiguous at the type level.

Consider an agent building a test suite from project documentation. It encounters three fenced code blocks in a single document. The first is an example of expected API behavior. The second is a configuration template showing default values. The third is pseudocode for an algorithm the team decided not to implement. In flat Markdown, all three are syntactically identical. The agent must classify each block's purpose from its content --- a natural language inference task layered on top of the retrieval task. In a typed format, ::code[role=example], ::code[role=template], and ::code[role=deprecated-pseudocode] would resolve the ambiguity deterministically. P1 collapses a classification problem into a lookup. Without it, the agent performs NLP inference where a parser would suffice.
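A minimal parser makes the "classification collapses into lookup" point concrete. The `::type[key=value]` notation below mirrors this paper's illustrative syntax; it is a sketch, not a published grammar:

```python
import re

# Minimal parser for the illustrative ::type[key=value] block annotations
# used in the text. The syntax is a sketch, not a published specification.
BLOCK = re.compile(r"^::(?P<type>[\w-]+)(?:\[(?P<attrs>[^\]]*)\])?$")

def parse_block_header(line):
    """Return (type, attrs) for a typed block header, or None for plain text."""
    m = BLOCK.match(line.strip())
    if not m:
        return None
    raw = m.group("attrs")
    attrs = dict(pair.split("=", 1) for pair in raw.split(",")) if raw else {}
    return m.group("type"), attrs

print(parse_block_header("::code[role=example]"))
# -> ('code', {'role': 'example'})
print(parse_block_header("::decision[status=accepted]"))
# -> ('decision', {'status': 'accepted'})
```

Each call is a deterministic parse in microseconds; the flat-text alternative is an NLP inference over the block's content for every classification.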

3.3 Broken Traversal (missing P2: Typed Cross-References)

Flat text cross-references are strings. A Markdown link [see the auth architecture](../docs/auth.md) provides a file path and display text, but carries no relationship semantics. The agent cannot determine from the link alone whether the current document depends on the auth architecture, extends it, supersedes it, or merely references it in passing. To discover the relationship, the agent must read both documents, infer their semantic connection from content, and construct the edge type --- a multi-step NLP task that consumes tokens and produces probabilistic results.

The consequence is that agents cannot build traversable knowledge graphs from flat-text corpora without an extraction pipeline (as in Graph RAG [5]). But the extraction is expensive and lossy: entities may be misidentified, relationship types may be hallucinated, and implicit connections --- where two documents are related through a shared concept that neither names explicitly --- are systematically missed. HBDS (Section 5) requires parent-child pointers with typed edges to perform hierarchical descent. A flat-text corpus provides neither parent-child hierarchy nor edge types; it provides a bag of files connected by untyped string paths. The agent can follow links, but it cannot navigate --- it has no map, no edge semantics, and no guarantee that the link target exists, is current, or is relevant to the traversal objective. P2 (Typed Cross-References, Section 4) converts untyped hyperlinks into a directed graph with declared edge semantics, enabling the ORIENT and WRITE BACK phases of HBDS.
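By contrast, when edge types are declared, graph construction is a metadata read rather than an extraction pipeline. A sketch, with hypothetical document names and a small invented relationship vocabulary:

```python
from collections import defaultdict

# Sketch: with declared edge types, the dependency graph is read directly
# from metadata. Document names and relationships here are hypothetical.
refs = [
    ("billing.surf", "depends-on", "auth.surf"),
    ("auth.surf", "extends", "security-baseline.surf"),
    ("auth-v2.surf", "supersedes", "auth.surf"),
]

graph = defaultdict(list)    # outgoing typed edges
reverse = defaultdict(list)  # incoming typed edges
for src, rel, dst in refs:
    graph[src].append((rel, dst))
    reverse[dst].append((rel, src))

def depends_on(doc):
    """Transitive closure over 'depends-on' and 'extends' edges only."""
    seen, stack = set(), [doc]
    while stack:
        for rel, dst in graph[stack.pop()]:
            if rel in ("depends-on", "extends") and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

print(sorted(depends_on("billing.surf")))
# -> ['auth.surf', 'security-baseline.surf']
print(reverse["auth.surf"])
# -> [('depends-on', 'billing.surf'), ('supersedes', 'auth-v2.surf')]
```

Both queries named in the text (everything a node depends on, and everything that depends on it) fall out of the same typed edge list, with no entity-relation extraction step.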

3.4 Context Budget Waste (missing P3: Priority Signals + P4: Token Costs)

Flat text provides no mechanism for an agent to distinguish high-value content from boilerplate before reading it. Every paragraph in a Markdown document occupies tokens in the context window at equal cost, regardless of whether it contains an architectural decision that governs the entire system or a stylistic note about naming conventions. An agent operating under a 128K-token budget across a 200-file documentation corpus faces a combinatorial problem: which files to read, in what order, and at what depth. Without priority signals (P3) and pre-computed token costs (P4), this problem reduces to heuristics --- alphabetical order, file size, modification date, or embedding similarity to the query --- none of which correlate reliably with the content's relevance to the task.

The deeper failure is that the agent cannot plan its reading. A human researcher scanning a library can assess a book's relevance from its table of contents before committing to reading it. An agent consuming flat text has no equivalent mechanism. It cannot ask "how many tokens will it cost to read this document and its transitive dependencies?" because flat text does not expose token costs. It cannot ask "is the full content worth reading, or would a summary suffice?" because flat text does not distinguish summary layers from full content. Every file read is a blind commitment of unknown size against a finite budget. The HBDS budget model (Section 5.3) addresses this directly: the can_afford() and can_afford_subtree() checks require P4, and the reading-order prioritization requires P3. Without both properties, budget-aware traversal is impossible and the agent defaults to greedy, unplanned consumption.
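The budget checks themselves are only a few lines once the format exposes the two per-node metrics. This sketch assumes the metrics named in Section 4 (a node's own token cost and a subtree estimate); the class layout is illustrative, not the HBDS reference implementation:

```python
from dataclasses import dataclass, field

# Illustrative sketch of pre-read budget verification. Field names mirror
# the paper's contentTokens / subtreeTokenEstimate metadata; the node values
# below are invented.
@dataclass
class Node:
    name: str
    content_tokens: int
    subtree_token_estimate: int
    children: list = field(default_factory=list)

def can_afford(node, budget):
    """Can we read this node's own content within the remaining budget?"""
    return node.content_tokens <= budget

def can_afford_subtree(node, budget):
    """Can we read this node plus all descendants within the budget?"""
    return node.subtree_token_estimate <= budget

leaf = Node("auth/oauth-flows", 1_800, 1_800)
root = Node("auth", 400, 2_200, [leaf])

budget = 2_000
print(can_afford(root, budget))          # -> True  (read the node alone)
print(can_afford_subtree(root, budget))  # -> False (degrade: summaries only)
```

Both checks run before any content enters the context window, which is exactly what flat text makes impossible: there, the cost of a read is only known after it has been paid.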

3.5 Semantic Lossy Compression (missing P1 + P3)

When an agent's context budget is insufficient to hold all retrieved content, it must compress. In flat text, compression operates on the only axes available: position and token count. The agent truncates from the end, summarizes uniformly, or drops entire documents by insertion order. None of these strategies are semantic --- they do not distinguish content by meaning, importance, or type. A decision rationale and a formatting tip are compressed with equal aggression. Critical data tables are summarized into prose that discards their structure. The compression is lossy in the worst possible way: it destroys high-value content and low-value content with equal probability.

A typed format enables structured compression: retain ::decision and ::data blocks at full fidelity, compress ::callout[type=info] blocks to their titles, and drop ::callout[type=tip] blocks entirely. This is not speculative --- it is the mechanism underlying HBDS's graceful degradation (Section 5.1): when the remaining budget is insufficient to read a node's full content, the algorithm falls back to reading only the ::summary block, which is the author-declared compressed representation. This fallback requires two properties that flat text lacks: P1 (to distinguish summary blocks from full content) and P3 (to determine which blocks to retain under budget pressure). Without them, the agent has no semantic compression strategy --- only positional truncation, which the lost-in-the-middle effect [N-Liu] demonstrates is systematically biased against information in middle positions.
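That compression policy can be sketched directly. The block shapes and the specific retain/compress/drop rules below are illustrative instances of the policy just described, not a normative schema:

```python
# Sketch of type-aware compression: retain decisions and data verbatim,
# compress info callouts to their titles, drop tips. Block shapes are
# illustrative, not a normative schema.
def compress(blocks):
    kept = []
    for b in blocks:
        if b["type"] in ("decision", "data"):
            kept.append(b["body"])                 # full fidelity
        elif b["type"] == "callout" and b.get("kind") == "info":
            kept.append(f"[info] {b['title']}")    # title-only compression
        # callout[kind=tip] and anything unrecognized: dropped entirely
    return kept

blocks = [
    {"type": "decision", "body": "Postgres is the system of record."},
    {"type": "callout", "kind": "info", "title": "Connection pooling",
     "body": "Long explanation of pool sizing..."},
    {"type": "callout", "kind": "tip", "title": "Alias psql", "body": "..."},
]
print(compress(blocks))
# -> ['Postgres is the system of record.', '[info] Connection pooling']
```

The decision survives verbatim, the informational aside degrades to its title, and the tip vanishes: compression by meaning, not by position.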

Table 4. Failure mode to missing property mapping. Each failure mode (F1--F5) is caused by the absence of one or more typed-traversability properties (P1--P4, Section 4). The HBDS phases column indicates which algorithmic capabilities are blocked by each failure.

| Failure Mode | Missing Property | Consequence | HBDS Phases Blocked |
|---|---|---|---|
| F1: Structure hallucination | P1: Declared block types | Agent invents section boundaries and block purposes | DESCEND (scoring), SYNTHESIZE |
| F2: Type ambiguity | P1: Declared block types | Agent misclassifies content role (example vs. requirement vs. template) | DESCEND (scoring), SYNTHESIZE |
| F3: Broken traversal | P2: Typed cross-references | Agent cannot build knowledge graph; links lack edge semantics | ORIENT, WRITE BACK |
| F4: Context budget waste | P3: Priority signals + P4: Token costs | Agent reads blindly, cannot plan or verify budget before reading | DESCEND (budgeting, degradation) |
| F5: Semantic lossy compression | P1 + P3 | Agent compresses by position, not meaning; no summary-layer fallback | DESCEND (degradation), SYNTHESIZE |

4. Typed Traversability: A Formal Definition

This section defines the format properties that make a document algorithmically navigable by AI agents. These properties are not nice-to-haves: each is a prerequisite for the HBDS algorithm (Section 5), and the absence of each causes a specific failure mode (Section 3).

4.1 Four Properties

We define typed traversability as the presence of four properties in a document format that collectively enable algorithmic navigation by AI agents. Each property is a prerequisite for one or more phases of the HBDS algorithm (Section 5), and the absence of each causes a specific failure mode (Section 3). The properties are ordered by increasing infrastructure cost.

P1: Declared Block Types. Every content block carries a machine-readable type annotation that is part of the format syntax, not inferred from visual presentation. Types include semantic categories such as decision, requirement, callout, data, code, summary, figure, and quote. In a typed format, ::decision[status=accepted] is unambiguously a decision; in Markdown, a paragraph under a "Decisions" heading may or may not be a decision, and a code block may be an example, a template, a requirement, or a test case (see Section 3.2). P1 enables three algorithmic capabilities: semantic relevance scoring (an agent searching for architectural decisions can filter by block type before reading content), type-based compression (retain ::decision and ::data blocks, drop ::callout[type=tip] blocks when budget is tight), and type-routed processing (route ::data blocks to a schema generator, ::code blocks to a linter, ::decision blocks to a decision tracker). Without P1, all three capabilities require NLP inference over raw text, which is probabilistic, expensive, and error-prone.

P2: Typed Cross-References. Links between documents carry relationship semantics --- depends-on, extends, supersedes, implements, parent-child --- as machine-readable metadata rather than human-readable hyperlink text. An agent encountering a related block with relationship: depends-on can build a directed dependency graph without inferring edge types from link labels or surrounding prose. P2 enables hierarchical traversal (follow parent-child edges to descend a knowledge tree), dependency tracing (given a node, find everything it depends on and everything that depends on it), and knowledge graph construction (build a typed directed graph from the document corpus without NLP entity-relation extraction). Without P2, the agent must resolve string-based paths, hope files exist, parse them from scratch, and guess how documents relate (see Section 3.3). The resulting "graph" is a set of untyped hyperlinks --- navigable by humans who read surrounding context, but not by algorithms that require edge semantics.

P3: Priority Signals. Documents and blocks declare their importance relative to context-budget constraints. Priority can be explicit --- frontmatter fields such as scope: public (high priority) vs. scope: internal (lower priority) and status: active vs. status: deprecated --- or implicit through block type, where the type implies a default priority ordering (e.g., ::decision > ::summary > ::callout[type=info] > ::quote). P3 enables budget-aware reading order (read high-priority documents first when budget is constrained) and graceful degradation (when budget is insufficient for full content, fall back to reading only ::summary blocks, which are explicitly marked as the compressed representation of their parent). Without P3, every paragraph appears equally important to the agent, and budget allocation reduces to positional heuristics --- read the first N tokens, or the last N, or truncate uniformly --- none of which correlate with semantic importance (see Section 3.4).

P4: Pre-Computed Token Costs. Every node in the document hierarchy exposes its token cost before the agent reads its content. Two metrics are required: contentTokens (the cost of reading this node's content alone) and subtreeTokenEstimate (the cost of reading this node plus all descendants). These values are computed at authoring or indexing time and stored in the document metadata or schema. P4 enables budget planning (the agent can determine whether its remaining budget can accommodate a subtree before committing tokens), pre-read budget verification (the can_afford() check in the HBDS budget model, Section 5.3), and cost-aware descent decisions (at each level of the tree, the agent compares subtree costs against remaining budget to select the optimal beam). Without P4, every read is a blind commitment --- the agent cannot know the cost of a document until it has already paid that cost by reading it (see Section 3.4). This is the property most absent from existing formats; even typed formats such as DITA and AsciiDoc (Section 2.5) do not expose token costs.
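Computing these metrics is a single bottom-up pass at indexing time, which is why agents never pay read cost to learn read cost. A sketch, assuming per-node token counts are already available from the model's tokenizer; the tree shape and counts are invented:

```python
# Sketch of indexing-time cost annotation: subtree estimates are one
# bottom-up pass over the document tree. Tree shape and token counts are
# invented; real counts would come from the target model's tokenizer.
def annotate_costs(node):
    """node: {'content_tokens': int, 'children': [...]}. Adds
    'subtree_token_estimate' to every node in place; returns the root total."""
    total = node["content_tokens"]
    for child in node.get("children", []):
        total += annotate_costs(child)
    node["subtree_token_estimate"] = total
    return total

tree = {"content_tokens": 300, "children": [
    {"content_tokens": 1_200, "children": []},
    {"content_tokens": 800, "children": [
        {"content_tokens": 2_500, "children": []},
    ]},
]}
annotate_costs(tree)
print(tree["subtree_token_estimate"])  # -> 4800
```

Because the pass is linear in the number of nodes and runs at authoring or indexing time, the per-query cost of budget verification at retrieval time is a field lookup.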

4.2 The Spectrum

Typed traversability is not a binary property. Document formats fall on a spectrum from fully untyped (no properties present) to fully typed-traversable (all four properties present), with each level adding a property and unlocking a corresponding algorithmic capability. Table 3 defines this spectrum.

Table 3. The typed-traversability spectrum.

| Level | Format Example | Properties Present | HBDS Phases Enabled |
|---|---|---|---|
| 0 | Plain text (.txt) | None | None |
| 1 | Markdown (.md) | Heading hierarchy (implicit, visual only) | None --- headings are visual, not structural |
| 2 | Markdown + YAML frontmatter | P3 (partial --- document-level priority via status, scope) | ORIENT only --- agent can rank documents but not navigate within them |
| 3 | Structured context (e.g., ARDS v3 `.context/`) | P2 (typed cross-refs, discovery order) + P3 | ORIENT + partial DESCEND --- agent can follow typed references between documents |
| 4 | Typed document format (e.g., `.surf`) | P1 (block types) + P2 + P3 | ORIENT + DESCEND + SYNTHESIZE --- agent can score by type and merge by type |
| 5 | Typed format + schema + token costs | P1 + P2 + P3 + P4 (pre-computed costs) | Full HBDS (all four phases including budget-verified descent) |

The progression is additive: each level subsumes the properties of the level below it. A Level 4 format has everything Level 3 has plus declared block types. A Level 5 format has everything Level 4 has plus pre-computed token costs and schema validation.
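Because the progression is additive, a format's level is simply the highest level whose required properties it satisfies. The sketch below paraphrases Table 3's property requirements; the Level 0 vs. Level 1 distinction rests on heading syntax, which is outside P1--P4, so it is passed as a separate flag:

```python
# Sketch: map a format's declared properties (P1-P4, Section 4) to its
# typed-traversability level. The requirement sets paraphrase Table 3.
LEVEL_REQUIREMENTS = {
    5: {"P1", "P2", "P3", "P4"},
    4: {"P1", "P2", "P3"},
    3: {"P2", "P3"},
    2: {"P3"},
}

def traversability_level(properties, has_headings=True):
    props = set(properties)
    for level in (5, 4, 3, 2):  # highest level first; spectrum is additive
        if LEVEL_REQUIREMENTS[level] <= props:
            return level
    # Levels 0 and 1 differ only in visual heading syntax, not in P1-P4.
    return 1 if has_headings else 0

print(traversability_level({"P1", "P2", "P3"}))  # -> 4
print(traversability_level({"P3"}))              # -> 2
```

A format with P1 but not P2, for instance, still classifies as Level 2: block types alone do not enable cross-document traversal, which matches the subset ordering of the table.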

The key observation is that the spectrum is algorithm-derived, not aesthetic. The levels are not ranked by "how structured the format looks" but by "which HBDS phases the format can support." The algorithm tells you what the format needs. If an agent requires pre-read budget verification, the format must expose token costs (P4). If an agent requires type-aware relevance scoring, the format must declare block types (P1). The four properties are not an arbitrary wish list --- they are the minimal set required by the four phases of budget-constrained hierarchical descent.

This framing also explains why Markdown (Level 1) and Markdown-with-frontmatter (Level 2) feel adequate for many current workflows: the dominant retrieval paradigm (RAG, Section 2.1) does not use hierarchical descent and therefore does not require properties beyond Level 1. The inadequacy becomes visible only when agents attempt algorithms that depend on declared structure --- at which point the format becomes a capability gate, not a cosmetic choice.

The transition between levels is not uniform in cost. Section 4.3 analyzes the cost-benefit trade-off at each step.

4.3 Cost-Benefit at Each Level

The cost of moving between levels varies by orders of magnitude, and the beneficiaries shift from humans to machines as the level increases.

Level 0 to 1 (plain text to Markdown). Nearly free. Authors gain visual structure (headings, bold, lists) with minimal syntax overhead. Both humans and agents benefit: humans see formatted text, agents see heading hierarchy. This transition has already occurred across the software industry --- Markdown is the de facto standard for technical documentation.

Level 1 to 2 (Markdown to Markdown + YAML frontmatter). Cheap. Adding a YAML frontmatter block (title, status, scope, tags) to existing Markdown files requires minutes per document. Humans benefit from metadata searchability; agents benefit from document-level priority signals (P3). The AGENTS.md and CLAUDE.md ecosystem (Section 2.7) operates at this level, and Lulla et al. [1] demonstrated measurable agent performance gains from even this minimal structure.

Level 2 to 3 (frontmatter to structured context hierarchy). A one-time structural investment. Adopting a documentation architecture such as ARDS v3 [N-ARDS] requires organizing documents into a discovery hierarchy (.context/docs/, .context/agents/, .context/guides/) with typed cross-references between documents. The cost is paid once during initial setup; ongoing maintenance is incremental. Humans benefit modestly --- the directory structure is navigable but not essential for reading. Agents benefit substantially: typed cross-references (P2) enable traversal between documents, and the discovery protocol eliminates file-search overhead.

Level 3 to 4 (structured hierarchy to typed document format). Requires a new parser. Authors must learn a new syntax for declaring block types (::decision, ::data, ::callout[type=warning]). The authoring cost is non-trivial: every document must be written in the typed format rather than plain Markdown. Humans benefit from visual block distinctions (a rendered callout looks different from a data table), but could achieve similar visual results with Markdown conventions. Agents benefit enormously: P1 (declared block types) eliminates the type-ambiguity failure mode (Section 3.2) and enables type-aware relevance scoring, type-based compression, and type-routed processing.

Level 4 to 5 (typed format to typed + schema + token costs). Requires infrastructure. Token costs must be computed and maintained as documents change. Schema validation must be enforced at authoring or commit time. Humans derive essentially zero direct benefit --- token counts are invisible in rendered documents. Agents gain pre-read budget verification (P4), which is the prerequisite for the DESCEND phase of HBDS (Section 5.1).

The pattern is clear: the marginal benefit to humans diminishes at each level, while the marginal benefit to AI agents increases monotonically. The crossover point --- where the next level benefits machines more than humans --- falls between Levels 2 and 3. Everything above Level 2 is primarily for machine consumption. This creates an incentive misalignment: the traditional audience for document format design (human authors) has declining motivation to adopt higher levels, while the emerging audience (AI agents) has increasing need for them. Resolving this misalignment --- through AI-assisted authoring, format migration tooling, or dual-render formats that serve both audiences --- is an open design challenge (Section 8.2).

5. HBDS: A Traversal Algorithm That Requires Typed Structure

5.1 Algorithm Overview

This section presents HBDS (Hierarchical Budget-Constrained Descent Search) not as a standalone algorithmic contribution but as a proof by construction. The claim is precise: if HBDS can be built on typed documents and cannot be built on flat text, then the document format is a capability gate --- a boundary that determines which algorithms are available --- not a stylistic preference. The proof proceeds by exhibiting the algorithm, tracing each phase to the typed-traversability properties it requires (P1--P4, Section 4.1), and demonstrating that no phase can execute on a format lacking those properties. Section 5.2 formalizes the phase-to-property mapping; this section describes the algorithm itself.

HBDS navigates a hierarchical knowledge store under a strict token budget. The context window is treated not as a container but as a non-renewable execution resource: every token consumed by retrieval is a token unavailable for reasoning, generation, or further search. The algorithm operates in four phases --- ORIENT, DESCEND, SYNTHESIZE, and WRITE BACK --- each consuming tokens from a monotonically decreasing budget tracker. A portion of the total budget is reserved for later phases before the first phase begins, ensuring that search cannot consume tokens needed for synthesis or output. The four phases proceed as follows.

Phase 1: ORIENT. The agent classifies the incoming query and selects one or more search roots in the knowledge tree. ORIENT reads only the root index --- a lightweight manifest listing top-level domains with their summaries, keywords, and subtree token estimates. An LLM call maps the query to the most relevant domains (e.g., project_local, general_knowledge, or hybrid) and assigns a confidence score. If confidence exceeds a threshold and a single domain's keyword index produces a high-similarity direct hit, ORIENT short-circuits to that leaf without descent, avoiding multi-level traversal entirely.

ORIENT is inexpensive by design: it reads a single pre-computed index and makes one LLM classification call, typically consuming 200--800 tokens. Its critical requirement is P2 (Typed Cross-References). The root index encodes parent-child relationships between domains, categories, and topics as typed hierarchical edges. Without P2, the agent would face a flat list of files with no hierarchy to orient within --- it could rank files by embedding similarity, but it could not identify which subtree of the knowledge space to enter. The distinction is between selecting a search region (which ORIENT does) and selecting individual documents (which flat retrieval does). The former requires hierarchical structure; the latter does not.
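The shape of the root index and the short-circuit check can be sketched in the paper's Rust-style pseudocode. Everything here is illustrative: the field names, the threshold, and especially the `keyword_overlap` function, which stands in for the LLM classification call the text describes.

```rust
// Illustrative sketch of the root index ORIENT reads (field names assumed).
struct DomainEntry {
    name: String,
    summary: String,
    keywords: Vec<String>,
    subtree_tokens: usize, // P4: pre-computed subtree cost
}

struct OrientResult {
    roots: Vec<String>,            // selected search regions (P2 subtrees, not files)
    confidence: f64,
    short_circuit: Option<String>, // direct hit: skip descent entirely
}

// Stand-in for the LLM call: fraction of a domain's keywords present in the query.
fn keyword_overlap(query: &str, entry: &DomainEntry) -> f64 {
    let hits = entry.keywords.iter().filter(|k| query.contains(k.as_str())).count();
    if entry.keywords.is_empty() { 0.0 } else { hits as f64 / entry.keywords.len() as f64 }
}

// Map the query to search roots; assumes a non-empty index.
fn orient(query: &str, index: &[DomainEntry], threshold: f64) -> OrientResult {
    let mut scored: Vec<(f64, &DomainEntry)> =
        index.iter().map(|e| (keyword_overlap(query, e), e)).collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    let (best_score, best) = (scored[0].0, scored[0].1);
    OrientResult {
        roots: scored.iter().filter(|(s, _)| *s > 0.0).map(|(_, e)| e.name.clone()).collect(),
        confidence: best_score,
        // High-confidence direct hit: short-circuit to the leaf, skipping descent.
        short_circuit: if best_score >= threshold { Some(best.name.clone()) } else { None },
    }
}
```

The point of the sketch is the return type, not the scoring: ORIENT's output is a set of subtree roots, which presupposes the P2 hierarchy those roots live in.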

Phase 2: DESCEND. Beginning from the search roots selected by ORIENT, the agent performs an iterative beam search through the knowledge hierarchy. At each level of the tree, DESCEND executes five steps:

  1. Read child manifests. For each node in the current beam, the agent reads the children's summaries, keywords, and token estimates from the parent's index. These manifests are pre-computed and lightweight --- they expose the information needed for scoring and budget planning without requiring the agent to read full content.

  2. Score candidates. The LLM evaluates each candidate's relevance to the query on a 0--1 scale, given only the candidate's summary and keywords. This scoring step is analogous to the heuristic function h(n) in A* search: it estimates a node's value without expanding it. The LLM acts as a learned heuristic, replacing the hand-crafted heuristics of classical search with a general-purpose relevance estimator. Crucially, the scoring operates on typed summaries and typed keywords --- metadata that exists because the document format declares it (P1), not because an extraction model inferred it.

  3. Select beam. The top-k candidates (where k is the beam width, default 3) are selected for expansion. Candidates scoring above a prune threshold but below the beam cutoff are retained in the queue with reduced priority for potential revisitation. Candidates below the prune threshold are discarded.

  4. Pre-read budget verification. Before reading any child node's content, the agent checks whether the remaining budget can accommodate the node's declared token cost. This check is the operation can_afford(node.contentTokens), which requires P4 (Pre-Computed Token Costs). If the token cost of a leaf node exceeds the available budget, the agent does not read the full content. Instead, it falls back to summary-only retrieval --- recording the node's summary (which was already loaded during manifest reading) as a partial result with a discounted relevance score. This graceful degradation requires both P1 (the format distinguishes ::summary blocks from full content) and P3 (priority signals indicate that summaries are the compressed representation intended for budget-constrained consumption). Without P4, the agent cannot perform this check: it must read the node to discover its cost, at which point the tokens are already spent. Without P1 and P3, there is no summary layer to degrade to.

  5. Deduct tokens and continue. Tokens consumed by manifest reads, LLM scoring calls, and content reads are deducted from the budget tracker. If the remaining budget falls below a configurable minimum-useful-budget threshold, descent terminates regardless of remaining candidates.

For internal (non-leaf) nodes in the beam, DESCEND expands them by reading their child manifests and adding those children to the candidate queue, inheriting the parent's relevance score attenuated by a depth-decay factor (default 0.85). This depth penalty biases the search toward breadth over depth, favoring shallow coverage of multiple relevant subtrees over deep exploration of a single branch.
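The beam-selection and depth-decay mechanics of steps 2--5 can be sketched as follows. The beam width and decay factor are the defaults named above; the prune threshold value is an assumption, since the text leaves it configurable.

```rust
struct Candidate {
    name: String,
    relevance: f64, // LLM-estimated score on the 0--1 scale (step 2)
    depth: usize,
}

const BEAM_WIDTH: usize = 3;      // default beam width k
const DEPTH_DECAY: f64 = 0.85;    // default per-level attenuation
const PRUNE_THRESHOLD: f64 = 0.2; // assumed value; configurable in the text

// Step 3: split scored candidates into the beam, a reduced-priority
// queue for potential revisitation, and discarded candidates.
fn select_beam(mut candidates: Vec<Candidate>) -> (Vec<Candidate>, Vec<Candidate>) {
    candidates.sort_by(|a, b| b.relevance.partial_cmp(&a.relevance).unwrap());
    let mut beam = Vec::new();
    let mut queue = Vec::new();
    for c in candidates {
        if beam.len() < BEAM_WIDTH {
            beam.push(c);
        } else if c.relevance >= PRUNE_THRESHOLD {
            queue.push(c); // retained with reduced priority
        } // below the prune threshold: discarded
    }
    (beam, queue)
}

// Expanding an internal node: children inherit the parent's relevance
// attenuated by the depth-decay factor, biasing breadth over depth.
fn inherit_score(parent: &Candidate) -> f64 {
    parent.relevance * DEPTH_DECAY
}
```

The decay multiplication is what encodes the breadth bias: a child two levels below a 0.9-scored root enters the queue at 0.9 × 0.85², competing against shallower candidates at full score.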

Phase 3: SYNTHESIZE. After descent completes (either by exhausting the beam, the budget, or the maximum depth), the agent combines the retrieved knowledge fragments into a coherent context injection. Retrieved fragments are ranked by relevance score. Fragments of the same block type from overlapping subtrees are deduplicated --- P1 (Declared Block Types) enables this deduplication because the agent can detect when two ::decision blocks or two ::data blocks cover the same entity, whereas flat text deduplication must rely on string similarity, which conflates semantic overlap with textual similarity. The ranked, deduplicated fragments are packed into the synthesis prompt up to the remaining token budget. An LLM call generates a structured output: an answer, a confidence score, a list of gaps (aspects of the query not covered by retrieved knowledge), and a list of new insights (deductions not explicitly stated in any source).

Phase 4: WRITE BACK. If the SYNTHESIZE phase identified new insights --- knowledge generated during the search that did not exist in any retrieved node --- the WRITE BACK phase persists them to the knowledge tree. For each insight, an LLM call determines placement: which parent node in the hierarchy should own the new knowledge. This placement decision requires P2 (Typed Cross-References), because the agent must identify the correct location in a typed hierarchy, not simply append to a flat file. If a node covering similar content already exists (detected via keyword overlap), the insight is appended to that node with a version increment; otherwise, a new leaf node is created. New agent-generated nodes receive a trust score of 0.6 (below the 0.9 default for human-authored nodes), a source tag of agent-write-back, and an initial contentTokens count computed at write time. Trust scores decay over time (multiplied by 0.997 per day, halving approximately every 231 days) and are boosted by subsequent verification events --- an agent or human reading the node and confirming its accuracy.

The WRITE BACK phase is what makes the knowledge tree self-improving. Over time, frequently queried subtrees accumulate synthesized knowledge, updated summaries, and refined trust scores, reorganizing around actual usage patterns. This self-organization is only possible because P1 tells the agent what kind of knowledge to write (a decision, a pattern, a data schema), P2 tells it where in the hierarchy to place it, and P4 (token costs) is updated to reflect the new content so that future budget calculations remain accurate. A flat-text corpus cannot self-organize because there is no declared structure to organize into --- no hierarchy for placement, no types for classification, no token costs for budget recalculation.

5.2 Why Each Phase Requires Typed Structure

Section 5.1 described the four phases of HBDS in operational terms. This section establishes the formal dependency between each phase and the typed-traversability properties defined in Section 4.1. The argument is structural: each phase requires at least one property that flat text does not provide. Removing any single property disables at least one phase. Therefore all four properties are jointly necessary, and any format lacking even one of them cannot support the full algorithm.

Table 4 presents the phase-to-property dependency map.

Table 4. HBDS phase-to-property dependency. Each row identifies a phase or sub-phase, the typed-traversability property it requires, and the specific mechanism that fails in flat text.

| HBDS Phase | Required Property | What Flat Text Lacks |
| --- | --- | --- |
| ORIENT | P2: Typed cross-references (hierarchy) | No hierarchy --- headings are visual, not structural. The agent cannot identify subtrees to enter |
| DESCEND (scoring) | P1: Declared block types | No types --- the agent must guess what a paragraph means before scoring its relevance |
| DESCEND (budgeting) | P4: Pre-computed token costs | No token counts --- the agent must read a node to discover its cost, spending the budget it intended to verify |
| DESCEND (degradation) | P1 + P3: Types + priority signals | No summary-vs.-content distinction --- no declared fallback layer for budget-constrained retrieval |
| SYNTHESIZE | P1: Declared block types | No type-aware merging --- the agent cannot detect when two fragments cover the same entity at the same type |
| WRITE BACK | P2: Typed cross-references | No insertion semantics --- the agent cannot determine where in a hierarchy to place new knowledge |

We now walk through each row to explain why the dependency is strict --- not a performance optimization but a capability prerequisite.

ORIENT requires P2. ORIENT selects a search root by mapping the query to a position in the knowledge hierarchy. This operation presupposes that a hierarchy exists --- that documents are organized into parent-child relationships with typed edges (domain contains categories; category contains topics). In flat text, documents are a bag of files in a directory tree. Directory paths provide implicit nesting, but directory names carry no relationship semantics: a file at docs/auth/tokens.md may depend on, extend, or contradict docs/auth/sessions.md, and the path encodes none of these relationships. Without P2, ORIENT degrades to file-level ranking by embedding similarity --- it can select documents but cannot select a region of the knowledge space to descend into. The hierarchical search that follows has no root to start from.

DESCEND (scoring) requires P1. At each level of the beam search, the LLM scores candidate nodes by relevance. The scoring operates on typed metadata --- summaries and keywords associated with declared block types. In a typed format, the agent knows that a candidate node contains a ::decision block, a ::data block, and two ::callout[type=info] blocks before reading any of their content. It can score the node's likely relevance to a query about architectural decisions based on the presence of a ::decision block alone. In flat text, the agent has no such signal. A Markdown file's relevance to an architectural-decision query can only be assessed by reading its content and inferring whether any paragraph constitutes a decision --- the very token expenditure that scoring is meant to avoid. Without P1, relevance scoring becomes relevance reading, collapsing the distinction between evaluating a node and expanding it.

DESCEND (budgeting) requires P4. The pre-read budget verification step --- can_afford(node.contentTokens) --- requires knowing the token cost of a node before committing tokens to read it. This cost must be pre-computed and exposed in the document metadata. In flat text, token count is a derived property: the agent must tokenize the document to know its length, which requires reading it, which spends the tokens. The verification becomes a tautology --- "can I afford what I have already paid for?" --- and the budget model loses its planning capability. Without P4, every descent step is a blind commitment. The agent cannot compare subtree costs across beam candidates, cannot reserve tokens for later phases, and cannot terminate descent early based on projected budget exhaustion.

DESCEND (degradation) requires P1 + P3. When the remaining budget is insufficient for a node's full content, HBDS falls back to summary-only retrieval. This graceful degradation depends on two properties jointly. P1 (Declared Block Types) is required because the format must distinguish ::summary blocks from full content blocks --- the agent needs to know which fragment is the summary. P3 (Priority Signals) is required because the format must indicate that summaries are the preferred fallback under budget pressure --- that they are the author-declared compressed representation, not an arbitrary excerpt. In flat text, there is no declared summary layer. The agent's only compression option is truncation (read the first N tokens) or LLM-generated summarization (spend additional tokens to compress what was just read). Neither is graceful: truncation is semantically blind, and on-the-fly summarization consumes the budget it was meant to conserve.

SYNTHESIZE requires P1. The synthesis phase deduplicates retrieved fragments before assembling them into a context injection. Type-aware deduplication detects when two ::decision blocks or two ::data[format=table] blocks from overlapping subtrees cover the same entity, even if their prose differs. In flat text, deduplication falls back to string or embedding similarity, which conflates textual similarity (two paragraphs using similar words) with semantic overlap (two paragraphs making the same claim in different words). The false-positive rate rises because stylistically similar but semantically distinct passages are incorrectly merged, and the false-negative rate rises because semantically identical claims phrased differently are retained as duplicates. P1 constrains the comparison: two blocks can only be duplicates if they share the same type and cover the same entity, a much tighter predicate than general textual similarity.

WRITE BACK requires P2. Persisting new knowledge requires placement --- the agent must determine which parent node in the hierarchy should own the new insight. This placement decision navigates the same typed hierarchy that ORIENT entered and DESCEND traversed. In flat text, the agent has no insertion semantics. It can append to a file, but it cannot determine which file, which section, or what relationship the new content has to existing content. The result is either a flat append log (new knowledge accumulates at the bottom of a file, destroying any implicit organization) or no write-back at all (the knowledge is generated and discarded). Without P2, the knowledge tree cannot self-improve because there is no tree to improve --- only a collection of files with no declared structure to organize into.

The proof. The six rows of Table 4 collectively reference all four typed-traversability properties: P1 appears in four rows (DESCEND scoring, DESCEND degradation, SYNTHESIZE, and implicitly in WRITE BACK's type classification of new knowledge), P2 appears in two (ORIENT, WRITE BACK), P3 appears in one (DESCEND degradation), and P4 appears in one (DESCEND budgeting). Removing any single property disables at least one phase: without P1, scoring and synthesis fail; without P2, orientation and write-back fail; without P3, graceful degradation fails; without P4, budget verification fails. All four properties are therefore jointly necessary for full HBDS execution.

Flat text --- Markdown, HTML, plain prose --- provides none of the four properties. It lacks declared block types (P1), typed cross-references (P2), priority signals (P3), and pre-computed token costs (P4). The algorithm requires all four. The format provides zero. Therefore flat text cannot support HBDS. The document format is not a style preference --- it is a capability gate that determines whether budget-constrained hierarchical descent is possible at all.

5.3 Formal Budget Model

The DESCEND phase described in Section 5.1 references a budget tracker that gates every read operation. This section formalizes that tracker as a data structure, making explicit the invariants that distinguish budget-constrained descent from unconstrained retrieval. The model is presented in Rust-style pseudocode to emphasize the type-level constraints:

struct TreeNode {
    content_tokens: usize,         // P4: pre-computed token count for this node's content
    subtree_token_estimate: usize, // P4: the node plus all its descendants
}

struct BudgetTracker {
    total: usize,       // Total tokens for this agent session
    consumed: usize,    // Tokens consumed (monotonically increasing)
    reserved: usize,    // Tokens reserved for SYNTHESIZE + WRITE BACK
}

impl BudgetTracker {
    fn available(&self) -> usize {
        // Saturating subtraction guards against underflow when
        // consumed + reserved exceeds total.
        self.total.saturating_sub(self.consumed + self.reserved)
    }

    fn can_afford(&self, node: &TreeNode) -> bool {
        node.content_tokens <= self.available()
    }

    fn can_afford_subtree(&self, node: &TreeNode) -> bool {
        node.subtree_token_estimate <= self.available()
    }
}

Three fields capture the complete state of the agent's resource position. The total field is fixed at session initialization --- it represents the model's context window capacity minus the tokens already occupied by the system prompt, task description, and any injected instructions. The consumed field tracks all tokens that have entered the context window through retrieval operations (manifest reads, content reads, LLM scoring prompts). Critically, consumed is monotonically increasing. The context window is a one-way resource: once tokens are read, they cannot be unread. There is no garbage collection, no deallocation, no reclamation of spent budget. Every retrieval decision is a permanent commitment. This monotonicity is what makes pre-read verification essential --- a wrong read cannot be reversed, only survived.

The reserved field addresses a subtler problem. Without reservation, an agent running DESCEND could consume its entire budget on retrieval, leaving zero tokens for the SYNTHESIZE phase (which requires an LLM call to combine fragments) and the WRITE BACK phase (which requires an LLM call to determine placement and generate output). The reserved field is set during initialization based on estimated synthesis and write-back costs --- typically 15--25% of the total budget --- and is treated as unavailable by the available() method. This ensures that descent terminates with sufficient budget remaining for the agent to use what it retrieved.

The two verification methods --- can_afford() and can_afford_subtree() --- are the formal interface between the budget model and the DESCEND phase. The can_afford() method checks whether a single node's content fits within the remaining budget by comparing node.content_tokens against available(). The can_afford_subtree() method checks whether an entire subtree --- the node plus all its descendants --- fits, using node.subtree_token_estimate. The subtree check enables a stronger planning guarantee: before expanding an internal node, the agent can verify that the entire branch is affordable, not just the immediate children.

Both methods depend on fields that exist only in typed document schemas satisfying P4 (Pre-Computed Token Costs, Section 4.1). The content_tokens field is a pre-computed count of the tokens in a node's content, stored in the document metadata at authoring or indexing time. The subtree_token_estimate field is the sum of content_tokens across all descendants, propagated up the hierarchy. These are P4 fields --- they have no equivalent in flat text. A Markdown file does not declare how many tokens it contains; a heading in a Markdown document does not declare the token cost of all content under that heading.

Without P4, the agent cannot call can_afford() before reading. It must read the node to discover its token cost, at which point the tokens are already consumed and the check is moot. Every read becomes a blind commitment: the agent commits an unknown number of tokens from a finite, non-renewable budget, with no ability to verify affordability in advance. The budget model degenerates from a planning instrument into a post-hoc accounting ledger that can detect overruns but cannot prevent them. This is the difference between a budget and a receipt. HBDS requires a budget. Flat text can only provide a receipt.
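A short, self-contained usage sketch makes the planning-versus-receipt distinction concrete. The structs restate the tracker from this section; the `consume` helper and all session numbers are assumptions for illustration.

```rust
struct TreeNode {
    content_tokens: usize,
    subtree_token_estimate: usize,
}

struct BudgetTracker {
    total: usize,
    consumed: usize,
    reserved: usize,
}

impl BudgetTracker {
    fn available(&self) -> usize {
        self.total.saturating_sub(self.consumed + self.reserved)
    }
    fn can_afford(&self, node: &TreeNode) -> bool {
        node.content_tokens <= self.available()
    }
    fn can_afford_subtree(&self, node: &TreeNode) -> bool {
        node.subtree_token_estimate <= self.available()
    }
    // Deduction after a committed read; consumed only ever increases.
    fn consume(&mut self, tokens: usize) {
        self.consumed += tokens;
    }
}

// Hypothetical session: an 8,000-token window with 20% reserved for
// SYNTHESIZE and WRITE BACK (within the 15--25% range above).
fn demo_session() -> (bool, bool) {
    let mut budget = BudgetTracker { total: 8_000, consumed: 0, reserved: 1_600 };
    let cheap = TreeNode { content_tokens: 900, subtree_token_estimate: 3_000 };
    let expensive = TreeNode { content_tokens: 7_000, subtree_token_estimate: 12_000 };

    let read_cheap = budget.can_afford(&cheap); // 900 <= 6,400: verified before reading
    if read_cheap {
        budget.consume(cheap.content_tokens);   // permanent commitment
    }
    // 7,000 > 5,500 remaining: degrade to summary-only instead of overrunning.
    let read_expensive = budget.can_afford(&expensive);
    (read_cheap, read_expensive)
}
```

Note that the second check fails *before* any tokens are spent on the expensive node --- the budget acts as a plan, not a receipt.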

5.4 Comparison: HBDS vs. Flat-Text Retrieval

The preceding sections described HBDS's phases, their typed-structure dependencies, and the formal budget model. To situate these contributions within the retrieval landscape, we now compare HBDS against two baselines: standard RAG (which operates on flat, unstructured text) and the hierarchical inferred-structure family represented by RAPTOR [5] and LATTICE [6] (which construct tree structure from flat text at indexing or retrieval time). Table 5 presents this comparison across eight dimensions.

Table 5. Three-way comparison of retrieval approaches across eight dimensions. RAG operates on flat text with no structure. RAPTOR and LATTICE infer structure from flat text at indexing or retrieval time. HBDS operates on structure declared at authoring time.

| Dimension | RAG (flat) | RAPTOR/LATTICE (inferred structure) | HBDS (declared structure) |
| --- | --- | --- | --- |
| Structure source | None --- flat chunks with no hierarchy | Inferred at indexing/retrieval time via clustering (RAPTOR) or LLM-guided tree construction (LATTICE) | Declared at authoring time --- typed blocks, frontmatter, hierarchical cross-references |
| Chunk boundaries | Arbitrary (character count, sliding window, paragraph break) | Auto-clustered (RAPTOR) or divisive/agglomerative (LATTICE); may split semantic units | Semantic --- block type boundaries (`::decision`, `::data`, `::code`) are natural chunk boundaries |
| Relevance scoring | Embedding cosine similarity in vector space | LLM heuristic on auto-generated summaries (RAPTOR cluster summaries, LATTICE node summaries) | LLM heuristic on author-declared typed summaries and keywords (P1 + P3) |
| Budget control | None --- retrieve top-*k* regardless of total token cost | LATTICE: node-expansion budget (computational), not token budget (context window). RAPTOR: none | Token budget with pre-read verification (`can_afford()`, P4); budget reserved across phases |
| Graceful degradation | Truncate context or omit low-ranked chunks | Partial --- higher-level summaries available in the inferred tree | Typed --- fall back from full content to `::summary` blocks; skip low-priority types under budget pressure (P1 + P3) |
| Structure accuracy | N/A (no structure to be accurate or inaccurate) | Depends on extraction/clustering quality --- lossy, probabilistic, not verifiable against ground truth | Exact --- author-declared, schema-validated, deterministically parseable |
| Knowledge accumulation | Stateless --- each retrieval session starts from the same index | Stateless --- inferred tree is rebuilt or static; no mechanism for retrieval to improve the index | Write-back: agents deposit new knowledge, update summaries, adjust trust scores (P1 + P2 + P4) |
| Cost of structure | Zero (no structure exists) | Paid at indexing/retrieval time: LLM calls for extraction, clustering, summarization (compute-expensive, repeated per corpus update) | Paid once at authoring time; free at every subsequent retrieval (amortized to zero over queries) |

HBDS holds a clear advantage on six of the eight dimensions. We walk through the most consequential.

Structure source and accuracy. The fundamental distinction is between declared and inferred structure. The author who writes a ::decision[status=accepted] block knows it is a decision with certainty; an NLP model classifying the same content operates probabilistically. RAPTOR's clustering algorithm groups chunks by embedding proximity, which correlates with but does not guarantee semantic coherence --- two paragraphs about different decisions may cluster together because they share terminology, while a decision and its rationale may be split across clusters because they use different vocabulary. LATTICE's LLM-guided tree construction is more accurate than embedding-based clustering but remains a classification task subject to model error. In contrast, declared structure is exact: the hierarchy is what the author specified, validated against a schema at authoring time. No inference step intervenes between the author's intent and the agent's perception of document organization.

Budget control. This is the starkest contrast. RAG provides no budget mechanism --- it retrieves top-k chunks regardless of their aggregate token cost, and the agent discovers whether the retrieved content fits its context window only after injection. LATTICE introduces a principled budget by limiting node expansions during traversal [6], but this budget measures a computational resource (how many LLM scoring calls the traversal makes), not the context-window resource the agent must manage. An agent using LATTICE cannot answer "can I afford to read this subtree without overrunning my context window?" because node expansion count does not map to token cost. HBDS budgets the resource that actually constrains the agent --- tokens --- with pre-read verification that prevents overcommitment (Section 5.3). The can_afford() and can_afford_subtree() checks transform budget management from post-hoc accounting into prospective planning.

Graceful degradation. When remaining budget is insufficient for full content, RAG's only recourse is truncation or omission --- discarding chunks by rank, with no guarantee that the discarded content was less important than the retained content. RAPTOR and LATTICE offer partial degradation: the agent can read a higher-level summary node in the inferred tree instead of descending to leaves. These summaries are machine-generated approximations whose fidelity depends on summarization quality. HBDS degrades to author-declared ::summary blocks --- compressed representations the author explicitly wrote as the budget-constrained view of their content. The degradation is typed and intentional, not approximate and automatic.

Knowledge accumulation. Neither RAG nor RAPTOR/LATTICE has a write-back mechanism; every query starts from the same static index regardless of how many prior queries have been processed. HBDS's WRITE BACK phase (Section 5.1) deposits new knowledge, updates summaries, and adjusts trust scores after each retrieval session. Over time, frequently queried subtrees accumulate synthesized knowledge, making subsequent retrievals faster and more precise. This self-improvement is unique among the three approaches and depends on all four typed-traversability properties (Section 5.5).

The honest trade-off. The eighth dimension --- cost of structure --- is the one where HBDS is disadvantaged. RAG pays nothing for structure because it uses none. RAPTOR and LATTICE pay at indexing or retrieval time: LLM calls for clustering, summarization, and tree construction, costs that recur whenever the corpus changes. HBDS pays at authoring time: the author must write typed blocks, declare cross-references, and maintain a format that supports schema validation and token costing. This cost is non-zero and should not be minimized. It requires authoring discipline, tooling support, and a format that provides typed syntax without prohibitive ceremony --- the design gap identified in Section 2.5. However, the critical economic observation is that authoring cost is paid once per document, while retrieval cost (whether RAG's embedding pipeline or RAPTOR/LATTICE's indexing overhead) is paid once per corpus update or query. For a corpus in active use --- which describes any knowledge base queried by AI agents --- the amortized per-query cost of declared structure converges to zero, while the per-query cost of inferred structure remains constant. Section 4.3 analyzed this cost-benefit curve across the typed-traversability spectrum.

The conclusion is precise: on a typed corpus, HBDS is strictly more capable than flat or inferred-structure retrieval on six of eight dimensions, comparable on relevance scoring (both HBDS and LATTICE use LLM heuristics, though HBDS operates on higher-fidelity inputs), and disadvantaged only on the upfront cost of producing typed documents. The question is not whether HBDS outperforms flat retrieval when typed structure is available --- it does, by construction. The question is whether the authoring investment required to produce that structure is justified by the retrieval gains. For corpora queried infrequently or written once and discarded, the answer may be no. For knowledge bases in sustained use by AI agents --- the setting this paper addresses --- the amortization arithmetic strongly favors declared structure.

5.5 Self-Organizing Property

The WRITE BACK phase (Section 5.1) transforms HBDS from a stateless retrieval algorithm into a self-organizing knowledge system. Unlike RAG, RAPTOR, and LATTICE --- all of which are stateless, treating every query as an independent retrieval event with no memory of prior searches --- HBDS modifies the knowledge tree through use. Over time, the hierarchy reorganizes around actual query patterns, accumulating institutional knowledge that improves future retrieval. This self-organization operates through three mechanisms, each dependent on specific typed-traversability properties.

Knowledge Deposition. When the SYNTHESIZE phase generates new insights --- deductions, cross-references, or consolidated findings not explicitly stated in any source node --- the WRITE BACK phase persists them as new leaf nodes in the knowledge tree. Each deposited node carries a declared block type (P1): a synthesized architectural pattern is written as a ::decision block, an extracted data relationship as a ::data block, a condensed finding as a ::summary block. The block type is not a label applied after the fact; it determines how future DESCEND phases will score and retrieve the node. An agent searching for decisions will find deposited ::decision blocks through type-aware relevance scoring; an agent searching for data schemas will find deposited ::data blocks. Without P1, the agent would write untyped prose, and future agents would face the same type-ambiguity problem (Section 3.2) on agent-generated content that they face on flat human-authored text.
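The deposition step can be sketched minimally. This is not the paper's implementation: `Node`, `deposit`, the dictionary-backed tree, and the path strings are all illustrative; only the typed-block requirement (P1), the placement requirement (P2), and the 0.6 initial trust score come from the text above.

```python
from dataclasses import dataclass

VALID_BLOCK_TYPES = {"decision", "data", "summary"}  # subset named in the text

@dataclass
class Node:
    block_type: str    # P1: the declared type drives future DESCEND scoring
    content: str
    trust: float = 0.6  # agent-generated knowledge starts below the 0.9 human default

def deposit(tree: dict, parent_path: str, block_type: str, content: str) -> Node:
    """Persist a synthesized insight as a typed leaf under an existing parent (P2)."""
    if block_type not in VALID_BLOCK_TYPES:
        raise ValueError(f"untyped deposition is disallowed: {block_type!r}")
    node = Node(block_type, content)
    tree.setdefault(parent_path, []).append(node)
    return node

tree: dict = {}
n = deposit(tree, "auth/sessions", "decision", "Rotate refresh tokens on every use.")
```

The `ValueError` on an unknown type is the structural point: without P1 there is no way to reject untyped prose at deposition time.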

Summary Updating. Nodes that are frequently retrieved accumulate retrieval statistics. When a node's retrieval count exceeds a threshold, the WRITE BACK phase regenerates its summary to better reflect the query patterns that led to its retrieval. This updated summary improves future DESCEND scoring, because the summary is what the LLM-as-heuristic evaluates during beam search (Section 5.1, step 2). The placement of the updated summary requires P2 (Typed Cross-References): the agent must locate the correct node in the hierarchy, identify its parent-child relationships, and write the summary to the appropriate position in the index manifest. P4 (Pre-Computed Token Costs) must then be updated to reflect the new summary length, ensuring that future budget calculations remain accurate. Without P2, there is no hierarchical position to update; without P4, future agents will make budget decisions based on stale token counts.
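A minimal sketch of the update rule, under stated assumptions: the `regenerate` callback stands in for an LLM summarization step, token counting is approximated by a whitespace split rather than a real tokenizer, and the node is a plain dict rather than the manifest format.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer, for illustration only.
    return len(text.split())

def maybe_update_summary(node: dict, threshold: int, regenerate) -> bool:
    """Regenerate a frequently retrieved node's summary and refresh its P4 token cost."""
    if node["retrieval_count"] < threshold:
        return False
    node["summary"] = regenerate(node)                       # reflect observed query patterns
    node["content_tokens"] = count_tokens(node["summary"])   # keep P4 accurate (stale costs break budgeting)
    return True

node = {"retrieval_count": 12, "summary": "old", "content_tokens": 1}
updated = maybe_update_summary(node, threshold=10,
                               regenerate=lambda n: "auth token rotation policy summary")
```

The coupled write to `content_tokens` is the essential detail: updating the summary without updating P4 would silently corrupt every future budget calculation.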

Trust Score Adjustment. Agent-generated nodes enter the knowledge tree at a trust score of 0.6, below the 0.9 default for human-authored content (Section 5.1). This asymmetry reflects the principle that synthesized knowledge requires validation. Trust scores evolve through two opposing forces: temporal decay (0.997 per day, halving approximately every 231 days) and verification events (an agent or human reads the node and confirms its accuracy, boosting the score). Over time, accurate agent-generated knowledge converges toward human-authored trust levels, while unverified or contradicted knowledge decays below retrieval thresholds and is effectively deprecated from the active hierarchy. The trust mechanism depends on P1 because the verification agent must determine what kind of claim it is validating --- a ::decision block requires different validation criteria than a ::data block or a ::summary block.
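The decay arithmetic can be checked directly. The decay rate (0.997 per day) and the 0.6/0.9 trust levels are from the text; the size of the verification boost and the cap at the human-authored level are our assumptions, since the paper does not specify them here.

```python
import math

DECAY_PER_DAY = 0.997
AGENT_TRUST, HUMAN_TRUST = 0.6, 0.9

def decayed_trust(trust: float, days: int) -> float:
    """Trust after unverified temporal decay."""
    return trust * DECAY_PER_DAY ** days

def verify(trust: float, boost: float = 0.05, cap: float = HUMAN_TRUST) -> float:
    """A verification event boosts trust; boost size and cap are assumptions."""
    return min(trust + boost, cap)

# Half-life implied by the decay rate: ln(0.5) / ln(0.997) ~ 230.7 days,
# consistent with the "approximately every 231 days" figure in the text.
half_life = math.log(0.5) / math.log(DECAY_PER_DAY)
```

The opposing forces are visible in the numbers: an unverified agent node falls from 0.6 to roughly 0.3 in 231 days, while periodic verification events hold accurate nodes near the 0.9 human-authored level.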

These three mechanisms are collectively impossible on a flat text corpus, for a structural reason rather than an implementation one: there is no hierarchy to organize into. Knowledge deposition requires a typed hierarchy for placement (P2) and typed blocks for classification (P1). Summary updating requires addressable nodes with parent-child relationships (P2) and token cost metadata (P4). Trust scoring requires typed blocks for type-appropriate validation (P1). A flat corpus --- a directory of Markdown files --- offers no insertion semantics, no hierarchical addressing, no typed block boundaries, and no token cost metadata. An agent could append text to a file, but it cannot place knowledge in a structure, because no structure exists.

The contrast with existing retrieval systems is stark. RAG is stateless: every query re-embeds, re-retrieves, and re-injects with no memory of prior searches. RAPTOR rebuilds its cluster tree at indexing time, not at query time; queries do not modify the tree. LATTICE traverses its semantic hierarchy but does not write back to it. In all three systems, the corpus after 1,000 queries is identical to the corpus before the first query. In HBDS, the corpus after 1,000 queries is richer --- it contains synthesized knowledge deposited by prior searches, refined summaries tuned to actual query patterns, and trust-validated agent contributions that have survived temporal decay. This accumulation of institutional knowledge through use is, over the long term, the strongest advantage of typed-structure retrieval over flat-text retrieval.

We note that the self-organizing property is forward-looking. While the mechanisms are specified and the WRITE BACK phase is architecturally defined (Section 5.1), longitudinal empirical validation --- demonstrating that write-back measurably improves retrieval quality over time --- is identified as future work in Section 8.2. The claim here is structural: the typed-traversability properties enable self-organization that flat text prohibits. Whether the enabled capability delivers the expected long-term gains is an empirical question that requires sustained deployment and measurement.

6. Empirical Evidence

This section presents three controlled study designs plus production telemetry. Studies 1 and 2 test the typed-traversability thesis directly; Study 3 tests the HBDS algorithm specifically.


6.1 Study 1: Context Retrieval Accuracy

Study 1 isolates the effect of typed document structure on an agent's ability to answer factual questions about a software project, holding content constant across conditions. The study has not yet been executed; this section specifies the design with sufficient detail for replication.

Corpus Construction. The experimental corpus is drawn from a production software project with 200+ documentation files covering architecture, API specifications, operational runbooks, decision records, and onboarding guides. Two conditions are prepared from identical source content:

  • Condition A (Flat Markdown). All documents are rendered as standard Markdown files with heading hierarchies, inline links, and code fences. No frontmatter, no typed blocks, no declared cross-references. This represents the Level 1 baseline on the typed-traversability spectrum (Section 4.2).
  • Condition B (Typed .surf). The same content is restructured into typed .surf documents. Each semantically distinct block receives a declared type: ::decision for architectural decisions, ::data[format=table] for structured data, ::callout[type=warning] for operational caveats, ::code[lang=rust] for implementation examples, ::summary for compressed representations. Documents carry YAML frontmatter with status, scope, and tags fields (P3), related blocks with typed relationship edges (P2), and pre-computed contentTokens values (P4). This represents Level 5 on the spectrum.

Content equivalence between conditions is enforced by automated validation: every prose sentence, code block, and data table in Condition A has a corresponding block in Condition B, verified by a diffing tool that compares stripped-text representations. The typed condition adds structure, not content.
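The equivalence check can be sketched as a stripped-text comparison. The function names and regexes below are illustrative, not the study's actual diffing tool; they cover only the structural constructs named in the condition descriptions (YAML frontmatter, `::type[...]` markers, Markdown headings).

```python
import re

def strip_structure(doc: str) -> str:
    """Remove declared structure, leaving only the content both conditions share."""
    doc = re.sub(r"^---\n.*?\n---\n", "", doc, flags=re.DOTALL)              # YAML frontmatter
    doc = re.sub(r"^::\w+(\[[^\]]*\])?\s*", "", doc, flags=re.MULTILINE)     # ::type[...] markers
    doc = re.sub(r"^#{1,6}\s*", "", doc, flags=re.MULTILINE)                 # Markdown headings
    return " ".join(doc.split())                                             # normalize whitespace

def content_equivalent(flat_md: str, typed_surf: str) -> bool:
    return strip_structure(flat_md) == strip_structure(typed_surf)

flat_md = "## Decision\nUse Postgres for auth."
typed_surf = "---\nstatus: active\n---\n::decision\nDecision\nUse Postgres for auth."
```

The invariant being enforced is exactly the one the study requires: Condition B may add structure around the text, but the stripped representations must be identical.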

Question Generation and Validation. Fifty factual questions are generated through a two-stage process. First, two independent human annotators read the corpus and each produce 40 candidate questions spanning five categories: architectural decisions (e.g., "Which database engine did the team select for the auth service, and what was the stated rationale?"), API specifications, operational procedures, dependency relationships, and status/timeline facts. Second, the annotators cross-validate each other's questions against the corpus, retaining only questions whose answers are (a) unambiguously supported by at least one document, (b) answerable from the documentation alone without external knowledge, and (c) distributed across at least 30 distinct source files to prevent retrieval shortcuts. The final set of 50 questions is selected by stratified sampling: 10 per category. Each question has a gold-standard answer with supporting document citations.

Experimental Design. Table 6 summarizes the conditions, metrics, and hypotheses.

Table 6. Study 1 experimental design.

| Dimension | Condition A (Flat Markdown) | Condition B (Typed .surf) |
|---|---|---|
| Document format | Standard Markdown (Level 1) | Typed .surf with P1--P4 (Level 5) |
| Content | Identical (verified by text-equivalence check) | Identical |
| Agent | Same model, same system prompt, same tool set | Same model, same system prompt, same tool set |
| Retrieval method | Agent-directed file search (grep, read) | Agent-directed file search (grep, read) + type-aware filtering |
| Token budget | Fixed at 128K tokens per question | Fixed at 128K tokens per question |
| Questions | 50 factual questions (10 per category) | Same 50 questions |

Four metrics are measured per question:

  1. Answer accuracy --- binary correctness judged by two independent raters against the gold-standard answer, with inter-rater disagreements resolved by a third rater. Cohen's kappa is reported as a reliability measure.
  2. Token consumption --- total tokens consumed during retrieval and reasoning, measured from the agent's API usage log. Lower is better, indicating more efficient navigation.
  3. Hallucination rate --- the proportion of factual claims in the agent's answer that are not supported by any document in the corpus. Each claim in each answer is extracted and independently traced to a source document (or marked unsupported) by a human annotator.
  4. Retrieval precision --- the proportion of retrieved document content (measured in tokens) that is relevant to the question, judged by a human rater reviewing the agent's retrieval log.
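The two ratio metrics (3 and 4) reduce to simple proportions over annotated counts; a sketch makes the denominators explicit. The function names are ours, and the inputs are per-answer human annotation results, not automated measurements.

```python
def hallucination_rate(claims_total: int, claims_unsupported: int) -> float:
    """Fraction of an answer's factual claims with no supporting document."""
    return claims_unsupported / claims_total if claims_total else 0.0

def retrieval_precision(tokens_retrieved: int, tokens_relevant: int) -> float:
    """Fraction of retrieved content (measured in tokens) judged relevant."""
    return tokens_relevant / tokens_retrieved if tokens_retrieved else 0.0
```

Note the asymmetry in denominators: hallucination rate is normalized over the agent's output (claims made), while retrieval precision is normalized over the agent's input (tokens loaded), so the two metrics can move independently.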

Hypotheses. The primary hypothesis is:

  • H1: Condition B (typed .surf) produces a lower hallucination rate than Condition A (flat Markdown) at equal or higher accuracy. The predicted mechanism is the elimination of type-ambiguity errors (F2, Section 3.2): when block types are declared, the agent can distinguish a ::decision block from a ::callout[type=tip] from a ::quote without inferring the distinction from context, reducing the probability of treating speculative content as authoritative or vice versa.

Two secondary hypotheses follow from the typed-traversability properties:

  • H2: Condition B consumes fewer tokens per question than Condition A. The mechanism is P3 (Priority Signals) and P1 (Declared Block Types): the agent can filter by block type and document scope before reading full content, reducing the volume of irrelevant material loaded into the context window.
  • H3: Condition B achieves higher retrieval precision than Condition A. The mechanism is the same as H2 --- type-aware filtering retrieves less but more relevant content.

Expected Effect Sizes. Based on the 16.6% token reduction observed by Lulla et al. [1] from structured AGENTS.md files alone (Level 2 structure), we conservatively estimate a 20--30% token reduction for Level 5 structure (Condition B), which adds P1 and P4 atop the structural properties Lulla et al. tested. Hallucination rate reduction is harder to estimate from prior work; we target detection of a medium effect size (Cohen's d >= 0.5) with 50 paired observations per condition, which provides statistical power of approximately 0.87 at alpha = 0.05 for a paired-samples t-test on the per-question hallucination count. If the true effect is smaller, the study is underpowered and should be replicated with a larger question set.

Controls. The agent model, system prompt, available tools, and token budget are held constant across conditions. Questions are presented in randomized order within each condition. To mitigate order effects, half of the trials run Condition A first and half run Condition B first (counterbalanced within-subjects design). Each question-condition pair is run three times to account for LLM stochasticity, and results are averaged across runs.

6.2 Study 2: Knowledge Graph Traversal

Study 2 isolates P2 (Typed Cross-References, Section 4.1) by measuring an agent's ability to trace dependency chains --- a task that requires traversing relationships between documents rather than retrieving content within them. Where Study 1 tests whether block types (P1) improve retrieval accuracy for factual queries, Study 2 tests whether typed edges improve the agent's ability to reconstruct a knowledge graph from a document corpus under a constrained token budget.

Corpus Construction. We construct a synthetic software-architecture corpus of 24 interconnected documents describing a microservices system: authentication, authorization, session management, API gateway, user profiles, billing, notifications, audit logging, and supporting infrastructure (database, cache, message queue, monitoring, deployment). Each document describes one component. The corpus is designed with a known ground-truth dependency graph containing 47 directed edges across four relationship types: depends-on (component A requires component B at runtime), extends (component A adds capabilities to component B's interface), supersedes (component A replaces a deprecated component B), and produces-for (component A generates data consumed by component B). The graph is acyclic within each relationship type but contains cross-type cycles (e.g., auth depends-on database; database monitoring produces-for audit logging; audit logging depends-on auth). Maximum chain depth is 6 hops. Each document averages 800 tokens; the full corpus totals approximately 19K tokens.
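The ground-truth graph is naturally represented as (source, target, relationship-type) triples, and the traversal task as bounded reachability over them. The sketch below uses a small illustrative subset (including the cross-type cycle named above), not the full 47-edge corpus; the function names are ours.

```python
RELATIONSHIP_TYPES = {"depends-on", "extends", "supersedes", "produces-for"}

# Illustrative subset of the ground truth, including the cross-type cycle:
# auth -> database -> (monitoring) -> audit-logging -> auth.
EDGES = {
    ("auth", "database", "depends-on"),
    ("db-monitoring", "audit-logging", "produces-for"),
    ("audit-logging", "auth", "depends-on"),
}

def reachable(seed: str, edges: set, max_hops: int) -> set:
    """All components reachable from seed within max_hops directed hops."""
    frontier, seen = {seed}, set()
    for _ in range(max_hops):
        frontier = {t for (s, t, _) in edges if s in frontier} - seen - {seed}
        seen |= frontier
    return seen
```

Subtracting `seen` and the seed at each step is what makes traversal terminate despite the cross-type cycles the corpus deliberately contains.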

Ground Truth Establishment. Two software engineers independently annotate the dependency graph from the corpus documents. Disagreements are resolved by a third annotator. Inter-annotator agreement (Cohen's kappa) is computed and reported. Only edges present in the majority annotation (2 of 3 annotators) are included in the gold-standard graph. This produces a set of (source, target, relationship-type) triples against which agent output is evaluated.

Conditions. Both conditions contain identical textual content.

  • Condition A (Flat Markdown). Each component is a .md file with standard Markdown hyperlinks: [see Authentication](./auth.md), [depends on Session Manager](./sessions.md). Hyperlinks carry no relationship semantics; the agent must infer whether a link represents a dependency, an extension, a supersession, or a data-flow relationship from surrounding prose --- the failure mode described in Section 3.3.
  • Condition B (Typed .surf). The same content is represented with declared related blocks in YAML frontmatter, each carrying an explicit relationship field (e.g., relationship: depends-on, relationship: supersedes). In Condition B, the agent can extract a typed directed graph by parsing frontmatter alone --- approximately 50 tokens per document --- without reading document bodies. In Condition A, extracting any graph requires full-document reads at approximately 800 tokens each.
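Condition B's frontmatter-only graph extraction can be sketched as follows. Each document's YAML frontmatter is represented here as an already-parsed dict (a real pipeline would parse the YAML); the document names and edges are illustrative, not the study corpus.

```python
# Hypothetical parsed frontmatter for two documents; bodies are never touched.
docs = {
    "auth": {"related": [{"target": "sessions", "relationship": "depends-on"},
                         {"target": "database", "relationship": "depends-on"}]},
    "sessions-v2": {"related": [{"target": "sessions", "relationship": "supersedes"}]},
}

def extract_typed_graph(docs: dict) -> set:
    """Typed directed edges (source, target, relationship) from frontmatter alone."""
    return {(src, r["target"], r["relationship"])
            for src, meta in docs.items()
            for r in meta.get("related", [])}
```

This is the capability Condition A lacks by construction: with string hyperlinks, no amount of parsing recovers the `relationship` field, because the format never recorded it.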

Content equivalence is enforced by automated diffing of stripped-text representations, identical to the validation procedure in Study 1.

Task. The agent receives one of five seed queries requiring multi-hop dependency tracing (e.g., "What does the authentication system depend on, and what depends on it?" or "Trace all components that would be affected if the database schema changes"). Each query requires traversal to a depth of at least 3 hops and a report of the complete reachable subgraph from the seed component. The agent operates under a 16K-token budget --- sufficient to read approximately 20 of 24 documents in full (Condition A) or all 24 documents' frontmatter plus 12 full documents (Condition B) --- forcing selective traversal.
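The budget arithmetic above can be checked directly from the stated per-document costs (~800 tokens per body, ~50 per frontmatter). Note that the design's figure of 12 full reads in Condition B is below the maximum the arithmetic permits; we read the unspent balance as headroom for the agent's own reasoning and output, though that interpretation is ours.

```python
BUDGET, BODY_TOKENS, FM_TOKENS, N_DOCS = 16_000, 800, 50, 24

max_full_reads_flat = BUDGET // BODY_TOKENS          # Condition A: 20 of 24 documents
fm_cost = N_DOCS * FM_TOKENS                         # Condition B: all 24 frontmatters
budget_after_fm = BUDGET - fm_cost                   # remaining budget for full bodies
design_full_reads_b = 12                             # the study design's stated figure
leftover = budget_after_fm - design_full_reads_b * BODY_TOKENS  # unallocated tokens
```

Either way, both conditions face genuine scarcity: neither can read everything, which is what forces the selective traversal the study is designed to measure.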

Table 7. Study 2 experimental design.

| Element | Specification |
|---|---|
| Corpus | 24 documents, ~19K tokens, 47 ground-truth edges, 4 relationship types, max depth 6 |
| Conditions | (A) Flat Markdown + string hyperlinks; (B) Typed `.surf` + declared `related` blocks (P2) |
| Task | Trace dependency subgraph from seed component (5 queries, 3+ hops each) |
| Budget | 16K tokens per query |
| Agent | Claude 3.5 Sonnet, identical system prompt across conditions, 10 runs per condition per query |
| Design | Within-subjects, counterbalanced (half A-first, half B-first) |

Metrics. Four metrics capture distinct facets of traversal quality:

  1. Traversal completeness (recall) --- the fraction of ground-truth (source, target, relationship-type) triples present in the agent's output.
  2. False-positive rate (1 - precision) --- the fraction of agent-reported edges absent from the ground truth, including hallucinated edges (no relationship exists between the reported components) and mistyped edges (correct component pair, wrong relationship type).
  3. Relationship-type accuracy --- among correctly identified component pairs, the fraction where the agent also identifies the correct relationship type. This metric isolates the type-inference challenge: a depends-on edge reported as extends counts as a type error even though the component pair is correct.
  4. Traversal depth --- the maximum number of hops the agent follows from the seed component, measured against tokens consumed. This captures the efficiency of traversal: how deep the agent reaches per token spent.

Hypotheses.

  • H1 (completeness): Condition B achieves higher traversal completeness than Condition A within the same token budget. The mechanism is that typed frontmatter enables graph extraction at ~50 tokens per document (Condition B) versus ~800 tokens for full-document reading and inference (Condition A) --- a 16x efficiency gain per traversal step that allows deeper exploration within the budget.
  • H2 (false positives): Condition B produces a lower false-positive rate than Condition A. The mechanism is that declared relationship types (P2) eliminate the need to infer edge semantics from prose, removing the type-ambiguity failure mode (Section 3.2) at the inter-document level.
  • H3 (type accuracy): The effect of Condition B is strongest on relationship-type accuracy, because this metric isolates the specific capability P2 provides --- machine-readable relationship semantics --- that Condition A entirely lacks.

Expected Effect Sizes. Prior work on structured knowledge representations for question answering reports 15--30% recall improvements over unstructured baselines [N-KG-QA]. We anticipate a traversal-completeness gain of 20--35 percentage points (Condition B over A) and a false-positive reduction of 40--60%, driven by the elimination of relationship-type inference errors. The traversal-depth effect is expected to be the largest (1.5--2x deeper within the same budget) due to the 16x per-document efficiency gain from frontmatter-only graph extraction.

Statistical Plan. With 5 queries x 10 runs = 50 observations per condition, a two-sided Mann-Whitney U test achieves >0.90 power at alpha = 0.05 for the anticipated effect sizes. Results are reported with 95% confidence intervals and effect sizes (rank-biserial correlation). Each query-condition pair is run 10 times to account for LLM stochasticity; within-condition results are averaged across runs before cross-condition comparison.

6.3 Study 3: HBDS vs. RAG on Knowledge-Intensive Generation

Where Studies 1 and 2 test the broader thesis that typed structure improves agent retrieval, Study 3 tests the specific algorithmic claim: that HBDS (Section 5) outperforms standard RAG on a knowledge-intensive generation task where retrieval quality directly determines output quality. The task --- generating a software specification from a natural language description, drawing on a large domain knowledge base --- exercises every HBDS phase. ORIENT must identify the correct domain. DESCEND must navigate a deep hierarchy to retrieve relevant patterns, data models, and compliance constraints. SYNTHESIZE must combine heterogeneous knowledge fragments into a coherent context injection. The task is poorly suited to flat retrieval because the knowledge required is distributed across multiple hierarchy levels and block types --- no single embedding query captures it.

Knowledge Base Construction. A knowledge base of 500+ domain patterns is constructed covering 10 application domains: e-commerce, healthcare, fintech, logistics, education, hospitality, real estate, SaaS analytics, content management, and IoT. For each domain, patterns are organized into five categories: data models (entity schemas and relationships), business rules (domain-specific logic constraints), compliance requirements (regulatory and security obligations), UI patterns (standard interaction flows), and integration patterns (third-party service interfaces). Each pattern is authored as a self-contained unit with a title, summary, full specification, and cross-references to related patterns. The result is a 10 x 5 matrix of categories, each containing 8--12 patterns, totaling approximately 500--600 patterns. The knowledge base is constructed once and used in both conditions, ensuring content equivalence.

Conditions. Two conditions share identical knowledge content but differ in format and retrieval method:

  • Condition A (RAG baseline). The 500+ patterns are stored as flat Markdown files, one per pattern. Each file contains the pattern title as a heading, followed by prose description, code examples, and requirement lists in untyped Markdown. Retrieval uses a standard RAG pipeline: documents are chunked at 512 tokens with 64-token overlap, embedded using a sentence-transformer model [N-SentenceTransformers], stored in a vector index, and retrieved via top-k cosine similarity (k = 20) against the task prompt. Retrieved chunks are injected into the generation context in rank order until the token budget is exhausted.

  • Condition B (HBDS on typed documents). The same 500+ patterns are stored as typed .surf files organized in a three-level hierarchy: domain (level 1), category (level 2), pattern (level 3). Each file uses declared block types (::data for entity schemas, ::code for implementation examples, ::callout[type=warning] for compliance constraints, ::summary for compressed representations) and typed cross-references (related blocks with relationship semantics such as depends-on and implements). Pre-computed token costs are stored in each node's manifest (P4). Retrieval uses HBDS as described in Section 5.1, with the same total token budget as Condition A managed by the BudgetTracker (Section 5.3).
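Condition A's chunking step can be sketched concretely. The parameters (512-token chunks, 64-token overlap) are from the condition description; "tokens" are approximated here by list elements rather than a real tokenizer, and the embedding and cosine-similarity stages are elided.

```python
def chunk(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Split a token sequence into fixed-size chunks with the given overlap."""
    step = size - overlap  # 448: each chunk starts where the previous one has 64 tokens left
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = [f"tok{i}" for i in range(1000)]  # a hypothetical 1,000-token pattern file
chunks = chunk(doc)                     # three chunks: 512, 512, 104 tokens
```

The sketch also illustrates why the conditions diverge: chunk boundaries fall at arbitrary token offsets, so a compliance warning split across two chunks competes twice (and diluted) for top-k slots, whereas in Condition B it remains a single typed `::callout` block.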

Generation Tasks. Twenty specification prompts are constructed, two per domain: one narrow-scope prompt (e.g., "Generate a spec for a restaurant menu ordering system") and one broad-scope prompt (e.g., "Generate a spec for a full-service restaurant management platform including ordering, inventory, reservations, and staff scheduling"). Broad-scope prompts require retrieval across multiple categories within a domain and test HBDS's ability to traverse laterally across sibling subtrees during DESCEND. Each prompt is executed five times per condition (100 runs per condition, 200 total) to account for LLM generation variance. The same base LLM is used for generation in both conditions; only the retrieval mechanism differs.

Metrics and Evaluation. Four metrics are evaluated against ground-truth requirement checklists constructed independently by domain experts prior to experimental runs.

Table 8. Study 3 experimental design: conditions, retrieval methods, and metrics.

| Dimension | Condition A: RAG (flat Markdown) | Condition B: HBDS (typed .surf hierarchy) |
|---|---|---|
| Knowledge base format | 500+ flat Markdown files, one per pattern | 500+ typed .surf files in 3-level hierarchy (domain/category/pattern) |
| Retrieval method | Embed (512-token chunks) + top-20 cosine similarity + inject | HBDS four-phase descent (Section 5.1) |
| Token budget | Fixed context window capacity | Same capacity, managed by BudgetTracker (Section 5.3) |
| M1: Specification completeness | % of ground-truth requirements captured in generated spec | Same metric, same checklist |
| M2: Structural correctness | Valid spec structure, schema consistency, no contradictions | Same metric, same rubric |
| M3: Compliance coverage | % of applicable regulatory/security requirements included | Same metric, same checklist |
| M4: Retrieval token cost | Total tokens consumed by retrieval before generation | Same metric, instrumented at retrieval layer |

Specification completeness (M1) is the percentage of requirements in the ground-truth checklist present in the generated specification. Checklists contain 15--40 requirements per prompt depending on scope. Two independent raters score each specification; inter-rater disagreements are resolved by a third rater.

Structural correctness (M2) is scored on a four-point rubric assessing required sections (data model, business rules, API contracts, error handling), internal consistency of entity references, and absence of contradictory constraints.

Compliance coverage (M3) measures the percentage of applicable regulatory and security requirements (PCI-DSS for payment processing, HIPAA for healthcare data, GDPR for user data) included in the output. This metric specifically tests whether the retrieval mechanism surfaces ::callout[type=warning] compliance blocks that are thematically distant from the main query but structurally co-located with relevant domain patterns in the typed hierarchy --- a retrieval path available to HBDS through type-aware descent but invisible to embedding similarity.

Retrieval token cost (M4) is the total tokens consumed by the retrieval pipeline before the generation prompt is assembled --- embedding lookups plus chunk injection for Condition A, versus ORIENT + DESCEND + SYNTHESIZE token expenditure for Condition B.

Hypotheses.

  • H3a (completeness): HBDS retrieval produces specifications with higher completeness (M1) than RAG retrieval, because hierarchical descent surfaces patterns from multiple categories within a domain rather than relying on embedding proximity to the query string. Broad-scope prompts are expected to show a larger effect than narrow-scope prompts, since they require cross-category retrieval that embedding similarity handles poorly.
  • H3b (compliance): HBDS retrieval produces higher compliance coverage (M3), because typed block types enable the DESCEND phase to surface compliance constraints that are semantically distant from the functional query but structurally adjacent to relevant domain patterns in the hierarchy.
  • H3c (token efficiency): HBDS retrieval consumes fewer tokens (M4) for equivalent or higher output quality, because pre-read budget verification (Section 5.3) prevents over-retrieval and graceful degradation to ::summary blocks avoids full-content reads when budget is constrained.

Expected Effect Sizes. Prior work comparing structured retrieval against flat baselines on knowledge-intensive generation tasks reports 15--25% improvements in output completeness [N-StructuredRetrieval]. We anticipate a completeness gain of 15--30 percentage points on broad-scope prompts (where cross-category retrieval matters most) and 5--15 points on narrow-scope prompts. Compliance coverage is expected to show the largest effect (25--40 percentage points) because RAG's embedding similarity systematically underweights compliance content that uses regulatory vocabulary orthogonal to functional domain terms. Token cost reduction of 30--50% is anticipated based on the BudgetTracker's pre-read verification preventing the over-retrieval that top-k RAG cannot avoid.

Statistical Plan. With 20 prompts x 5 repetitions = 100 observations per condition, per-metric means are compared using independent-samples t-tests (or Mann-Whitney U if Shapiro-Wilk rejects normality at alpha = 0.05), with Bonferroni correction for four comparisons (alpha = 0.0125). Effect sizes are reported as Cohen's d with 95% confidence intervals. Narrow-scope and broad-scope prompts are analyzed both pooled and separately to test the interaction between prompt scope and retrieval method. A secondary analysis examines per-domain variation to identify domains where HBDS's advantage is largest (expected: domains with heavy compliance requirements such as healthcare and fintech) and smallest (expected: domains with shallow, homogeneous pattern sets).

6.4 Production Telemetry (Observational)

In addition to the controlled studies above, we report observational data from a sustained production deployment of structured context documentation across a real software organization. A bootstrapped software company operating 15 active repositories adopted ARDS v3 --- a structured context standard implementing properties P2 and P3 (Section 4.1) --- incrementally over an eight-week period beginning in late December 2025. At the time of observation, the deployment comprised 54 agent definitions across the 15 repositories, over 1,000 plan documents, and more than 200 context files organized in typed .context/ directory hierarchies with YAML frontmatter, declared cross-references, and a specified discovery order.

Setting. The organization develops multiple software products spanning web applications, native desktop and mobile clients, CLI tools, shared infrastructure crates, and a documentation platform. Repositories range in size from 6 to more than 600 automated tests. AI agents are used for code generation, strategic planning, document drafting, patent preparation, and deployment automation. The workload is representative of a small team using AI agents as force multipliers across a diverse product portfolio.

Baseline. Prior to ARDS adoption, repositories used flat Markdown README files as the sole entry point for AI agents. Agent sessions were observed to spend substantial portions of their context budgets on discovery --- issuing file-search commands, reading irrelevant files, and iterating through directory listings to locate the correct context for a task.

Observations. After ARDS adoption, agents followed a typed discovery protocol (the sequence defined in Section 4.2, Level 3: root index, then typed context files, then cross-referenced documents) rather than searching. Table 9 summarizes the observed changes across four metrics. All figures are approximate ranges derived from session logs; no controlled A/B test was conducted.

Table 9. Observed agent session metrics before and after ARDS v3 adoption across 15 repositories (approximate ranges, not controlled).

| Metric | Before ARDS (flat README) | After ARDS (structured `.context/`) | Direction |
|---|---|---|---|
| Context spent on discovery | 40--60% of window | 15--25% of window | Improved |
| Agent error rate (wrong file reads, stale context) | 3--5 per session | 1--2 per session | Improved |
| Task completion without manual correction | ~60--70% | ~80--90% | Improved |
| Median tokens per comparable task | Higher (baseline) | ~25--35% reduction | Improved |

Interpretation. The direction of change is consistent across all four metrics and aligns with the theoretical prediction: when discovery follows a declared protocol (P2) with priority signals (P3), agents spend fewer tokens finding the right context and more tokens reasoning about the task. The reduction in wrong-file reads corresponds to the elimination of broken-traversal failures (F3, Section 3.3); the improvement in task completion without correction corresponds to reduced structure hallucination (F1, Section 3.1) and type ambiguity (F2, Section 3.2).

Threats to validity. This observational data carries significant limitations that preclude causal claims. First, the data comes from a single organization with a single primary developer; generalization to larger teams or different domains is unestablished. Second, no controlled conditions were maintained --- ARDS adoption coincided with concurrent workflow improvements, agent prompt refinements, and team learning effects. Any or all of these confounds may contribute to the observed improvements. Third, observer bias is present: the ARDS designer is also the primary user, creating a motivation to interpret ambiguous data favorably. Fourth, the metrics are derived from session-level observation rather than automated instrumentation, introducing measurement imprecision.

Argument for external validity. Despite these limitations, the observational data provides a form of evidence that controlled studies cannot: sustained, real-world operation under production conditions over a meaningful time period. The 15 repositories span diverse domains (web applications, native desktop software, shared Rust crates, mobile applications, strategic planning). The 54 agent definitions cover heterogeneous tasks (code generation, document drafting, competitive analysis, patent preparation). The over 1,000 plan documents represent genuine organizational knowledge, not synthetic benchmarks. The improvements persisted across the full eight-week observation window, suggesting they are not attributable to novelty effects alone.

7. Implications

7.1 For Document Format Designers

The audience for document formats has expanded from one to two. For three decades, format designers optimized for a single consumer: the human reader. The design question was "does this render well?" Markdown answered that question so effectively that it became the dominant format for technical documentation. But AI agents now consume documents autonomously at scale, and their requirements diverge fundamentally from human reading. An agent does not render a document --- it parses it, classifies its components, scores their relevance, estimates their token cost, and decides what to read within a finite budget. The design question for this second audience is not "does this look good?" but "does this parse unambiguously into a typed AST with pre-computed token costs?"

These are not the same question, and optimizing for one does not optimize for the other. Markdown's design --- minimal syntax, visual structure, implicit semantics --- is optimal for human authoring and adequate for human reading. It is systematically inadequate for algorithmic traversal, as the five failure modes in Section 3 demonstrate. Structure hallucination (F1) and type ambiguity (F2) arise because Markdown lacks P1 (declared block types). Broken traversal (F3) arises because Markdown links lack P2 (typed cross-references). Context budget waste (F4) arises because Markdown exposes neither P3 (priority signals) nor P4 (pre-computed token costs). Semantic lossy compression (F5) arises because Markdown provides no typed compression strategy. Each failure mode is a format-level deficiency, not an agent-level deficiency --- no improvement in agent architecture can compensate for information the format does not contain.

The design challenge for new formats is therefore dual-audience: writable by humans with ceremony cost comparable to Markdown, yet algorithmically navigable by machines at Level 5 on the typed-traversability spectrum (Section 4.2). The spectrum provides a concrete design target: all four properties (P1--P4) must be present. The ceremony cost analysis (Section 4.3) identifies the core tension: the transition from Level 3 to Level 4 requires a new parser and new syntax, imposing authoring overhead that benefits machines far more than humans. Format designers must resolve this tension --- through syntax that reads naturally to humans while parsing deterministically for machines, through AI-assisted authoring that absorbs the ceremony cost, or through migration tooling that converts existing Markdown corpora to typed formats. The formats that succeed will be those that achieve Level 5 traversability without LaTeX-level verbosity --- typed structure at Markdown-level ceremony.

7.2 For AI Agent Builders

The dominant pattern in AI agent frameworks today is flat text injection: the agent receives a string of context --- a README, a retrieved chunk, a conversation history --- and reasons over it as undifferentiated prose. This architecture leaves measurable performance on the table. The evidence from Section 6 shows that agents operating on typed documents achieve higher retrieval accuracy, deeper traversal within the same token budget, and lower hallucination rates than agents operating on equivalent flat content. The mechanism is not mysterious: typed structure gives the agent information that flat text withholds.

The immediate practical implication is type-routed processing. An agent that receives a typed AST can dispatch blocks by type rather than inferring purpose from prose. A ::decision block routes to a decision tracker. A ::data block routes to a schema generator or validator. A ::callout[type=warning] block routes to a compliance checker. This routing is deterministic --- it requires a single field read, not an LLM classification call. Flat-text agents must spend tokens inferring what a paragraph is before they can decide what to do with it; typed-document agents skip that inference entirely.
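As an illustration, type-routed dispatch can be sketched in a few lines. The `Block` shape, the handler names, and the route targets below are assumptions made for this sketch, not part of any specified format:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """Illustrative typed block: a declared type tag plus attributes,
    mirroring syntax like ::callout[type=warning]."""
    block_type: str
    attrs: dict = field(default_factory=dict)
    body: str = ""

# Routing table: one field read decides the handler -- no LLM classification call.
ROUTES = {
    "decision": lambda b: "decision-tracker",
    "data": lambda b: "schema-validator",
    "callout": lambda b: ("compliance-checker"
                          if b.attrs.get("type") == "warning" else "renderer"),
}

def route(block: Block) -> str:
    """Deterministic dispatch on the declared block type (P1)."""
    return ROUTES.get(block.block_type, lambda b: "default-renderer")(block)
```

The point of the sketch is the dispatch mechanism itself: classification is a dictionary lookup on a declared field, so its cost is constant and its result is reproducible, unlike an inference call over prose.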

The deeper implication concerns capabilities that flat-text agents structurally cannot achieve, regardless of model quality. Section 5.1 identifies three such capabilities. First, budget-aware search: the ability to verify, before reading a node, whether its token cost fits the remaining budget. This requires pre-computed token costs (P4), which flat text does not expose. Second, graceful degradation: the ability to fall back from full content to summary-only retrieval when the budget is insufficient. This requires a declared summary layer (P1 + P3) that flat text lacks. Third, self-improving knowledge bases: the ability to write synthesized knowledge back into the correct location in a typed hierarchy (P2), with updated token costs (P4) and trust scores. Flat text has no hierarchy to write back into.
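The first two capabilities, pre-read budget verification and graceful degradation, can be sketched minimally; the node fields below (declared full-content and summary costs) are hypothetical names assumed for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    """Illustrative tree node carrying pre-computed costs (P4) and a
    declared summary layer (P1 + P3); field names are assumptions."""
    name: str
    token_cost: int     # full-content cost, computed at authoring time
    summary_cost: int   # cost of the declared summary block
    content: str = ""
    summary: str = ""

def read_within_budget(node: Node, budget: int) -> Tuple[Optional[str], int]:
    """Verify declared cost before reading; fall back to summary-only
    retrieval when full content does not fit; skip when neither fits."""
    if node.token_cost <= budget:
        return node.content, budget - node.token_cost
    if node.summary_cost <= budget:            # graceful degradation
        return node.summary, budget - node.summary_cost
    return None, budget                        # nothing committed
```

Both branches depend on information flat text does not expose: without P4 the first comparison is impossible, and without a declared summary layer the fallback branch has nothing to fall back to.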

The practical recommendation for agent framework designers is direct: typed document consumption should be a first-class input mode, not an afterthought. Agent tool interfaces that accept only raw strings force every upstream system to serialize structure away before the agent sees it --- discarding exactly the information (block types, cross-references, token costs, priority signals) that the evidence shows improves performance. Frameworks that preserve typed ASTs through the agent's input pipeline enable the type-routing, budget-aware search, and graceful degradation patterns described above. The format is not preprocessing; it is architecture.

7.3 For Knowledge Base Platforms

The dominant knowledge base platforms --- Wikipedia (MediaWiki markup rendered to HTML), Confluence (WYSIWYG stored as XHTML), Notion (proprietary JSON behind a rate-limited API), GitBook (Markdown compiled to static HTML) --- share a common architectural property: their content is opaque to external AI agents. An agent seeking to traverse a Confluence space must authenticate to a REST API, paginate through JSON responses, strip HTML tags, and infer document types and relationships from rendered markup. Every platform requires a bespoke adapter. Every adapter is fragile, version-dependent, and lossy --- it discards the semantic structure the platform internally maintains but does not export.

This opacity is not incidental; it is structural. These platforms store content in formats that satisfy P1--P4 internally (Confluence tracks page hierarchies, Notion maintains typed blocks in its database, Wikipedia has category trees) but expose none of those properties through their output formats. The typed structure exists --- it is simply inaccessible to external consumers.

A platform that stores content in an open, typed, traversable format inverts this relationship. When the storage format itself satisfies P1 (declared block types), P2 (typed cross-references), P3 (priority signals), and P4 (pre-computed token costs), no adapter layer is needed. The format is the API. An AI agent can read the files directly, parse them into a typed AST, and execute HBDS (Section 5.1) without an embedding pipeline, without RAG infrastructure, and without chunking heuristics. The structure is the retrieval system.
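To make "the format is the API" concrete, here is a hedged sketch of an agent reading declared metadata straight from a file. The frontmatter syntax and field names are illustrative assumptions, not a specification:

```python
import re

def parse_typed_doc(text: str) -> dict:
    """Minimal sketch: pull declared metadata from a hypothetical
    'key: value' frontmatter block. No platform API, no adapter --
    the file itself exposes type, priority, and token cost."""
    match = re.match(r"---\n(.*?)\n---\n(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("no declared structure: flat text to this agent")
    meta = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    meta["body"] = match.group(2)
    return meta

doc = "---\ntype: decision\npriority: high\ntoken_cost: 412\n---\nWe chose SQLite for local-first storage."
parsed = parse_typed_doc(doc)
# type, priority, and token_cost are available before any body token is spent
```

Contrast this with the Confluence path described above: authentication, pagination, tag stripping, and type inference are all replaced by a single local parse.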

The implication is economic, not merely technical. As AI agents become primary consumers of organizational knowledge --- a trend already visible in the growth of agent-assisted development, automated documentation synthesis, and AI-driven decision support --- the platforms that export typed, traversable structure will capture the AI-agent ecosystem. Agents will preferentially operate on knowledge bases where HBDS can run natively, where pre-read budget verification works (Section 5.3), and where the write-back phase (Section 5.1) can deposit synthesized knowledge directly into the hierarchy. Platforms that export only HTML or API-gated JSON force every agent interaction through an extraction-and-inference pipeline --- paying the computational cost of structure recovery that the comparison in Section 5.4 shows is both expensive and lossy relative to declared structure. The platform that declares its structure wins the agent; the platform that hides it behind rendering loses it.

7.4 For the RAG Community

RAG's effectiveness is bounded by chunk quality, and chunk quality is bounded by the format. When a flat-text corpus is chunked by character count or paragraph break, the chunk boundaries are arbitrary --- a 512-token window may split a decision from its rationale, sever a data schema from its constraints, or merge unrelated content into a single embedding vector. Typed blocks eliminate this problem at the source. A ::decision block, a ::data block, and a ::code block are each semantically complete units with machine-readable boundaries. Chunking a typed corpus by block type produces chunks that are coherent by construction, not by coincidence. This alone would make typed formats a meaningful improvement to RAG pipelines.
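A sketch of block-boundary chunking, assuming a hypothetical `::type ... ::` delimiter syntax for illustration:

```python
import re

# Hypothetical ::type[attrs] ... :: block syntax, assumed for this sketch.
BLOCK_RE = re.compile(r"::(\w+)(?:\[[^\]]*\])?\n(.*?)\n::", re.DOTALL)

def chunk_by_block(source: str) -> list:
    """Chunk at declared block boundaries instead of a fixed character
    window: every chunk is one semantically complete typed unit."""
    return [{"type": m.group(1), "text": m.group(2).strip()}
            for m in BLOCK_RE.finditer(source)]

corpus = (
    "::decision\nUse Postgres over SQLite for multi-tenant loads.\n::\n"
    "::data\nOrder { id, items, total, status }\n::\n"
)
chunks = chunk_by_block(corpus)
# two chunks, coherent by construction: one decision, one data schema
```

No window size is tuned and no overlap heuristic is needed; the chunk boundary is wherever the author declared the block to end.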

But the deeper implication is that on typed corpora, HBDS may be a replacement for RAG, not merely an improvement. The architectures differ at every level. RAG retrieves flat fragments by embedding similarity --- a stateless, topology-blind operation that treats every chunk as an independent vector in embedding space. HBDS navigates a typed hierarchy by semantic descent --- a stateful, topology-aware operation that exploits parent-child relationships, block types, and budget constraints to reach relevant knowledge through structural reasoning rather than vector proximity. RAG is stateless: every query starts from the same index, and no retrieval session improves the next. HBDS accumulates knowledge through write-back (Section 5.5): the knowledge tree self-improves as agents deposit synthesized insights, update summaries, and refine trust scores.

Table 5 (Section 5.4) quantifies this comparison across eight dimensions. On a typed corpus, HBDS holds clear advantages on six dimensions --- structure accuracy, budget control, graceful degradation, chunk coherence, knowledge accumulation, and amortized cost of structure --- and is comparable on relevance scoring. The only dimension where RAG is advantaged is the zero upfront cost of having no structure. For corpora in sustained use by AI agents, the amortization arithmetic favors declared structure (Section 4.3).

A pragmatic middle ground exists. Typed RAG --- using embedding similarity for initial candidate retrieval across a large corpus, then switching to HBDS for structured descent within retrieved subtrees --- combines RAG's corpus-scale coverage with HBDS's type-aware precision. Section 8.2 identifies this hybrid as a concrete direction for future work. The question for the RAG community is not whether typed structure helps retrieval --- Table 5 establishes that it does. The question is whether the authoring investment to produce typed corpora is justified, and for knowledge bases queried repeatedly by AI agents, the amortization arithmetic of Section 4.3 increasingly answers yes.
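One possible shape for such a hybrid, sketched with toy data structures standing in for a real embedding index and for the full HBDS descent:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two small vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def typed_rag(query_vec, index, budget, top_k=2):
    """Hypothetical hybrid sketch: embedding similarity selects candidate
    subtrees at corpus scale (phase 1), then budget-verified typed reads
    run inside each selected subtree (phase 2, a stand-in for HBDS
    descent). All data shapes here are illustrative assumptions."""
    # Phase 1: approximate, format-agnostic candidate selection.
    roots = sorted(index, key=lambda r: cosine(query_vec, r["vec"]),
                   reverse=True)[:top_k]
    # Phase 2: precise, budget-aware reads using declared costs (P4).
    out = []
    for root in roots:
        for node in root["children"]:
            if node["token_cost"] <= budget:   # pre-read verification
                out.append(node["text"])
                budget -= node["token_cost"]
    return out
```

The division of labor matches the paragraph above: phase 1 needs no structure and scales to untyped corpora; phase 2 needs P1--P4 but only within the subtrees that phase 1 surfaces.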

7.5 For the Phone-to-App Pipeline

The pipeline that transforms natural language into deployed applications --- phone to AI chat to typed document specification to compiler to live artifact --- depends on the AI controller's ability to retrieve domain knowledge during generation. The user provides intent ("Build me a bakery website"); the controller must supply everything else: data models, business rules, compliance constraints, UI patterns, and integration requirements. Without retrieval, the controller generates from its parametric knowledge alone, producing specifications that are generically plausible but domain-incomplete. With flat retrieval (RAG), the controller surfaces fragments that are semantically similar to the query but structurally unrelated --- embedding proximity to "bakery" retrieves menu descriptions but misses PCI-DSS payment compliance requirements that use entirely different vocabulary. HBDS on a typed knowledge base is the architectural solution.

Consider the concrete case. A user requests a bakery ordering application. HBDS's ORIENT phase classifies the query into the hospitality/restaurant domain and selects the corresponding subtree as the search root. The DESCEND phase performs beam search through the hierarchy: at the category level, it scores data-models, business-rules, compliance, and UI-patterns as relevant; at the pattern level, it retrieves ::data blocks for Menu, MenuItem, and Order entity schemas, ::callout[type=warning] blocks for PCI-DSS payment handling and food-allergen disclosure requirements, and ::code blocks for order-status state machines. The SYNTHESIZE phase combines these heterogeneous fragments --- drawn from different categories within the same domain subtree --- into a coherent context injection. The resulting specification includes data models, compliance guardrails, and interaction patterns that the controller would not have generated from parametric knowledge or flat retrieval.
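The DESCEND phase of the walkthrough above can be sketched as a level-by-level beam search; the node layout, relevance scores, and token costs below are toy assumptions for illustration, not the paper's full algorithm:

```python
def descend(root, score, budget, beam_width=2):
    """Minimal DESCEND sketch: beam search over a typed hierarchy,
    reading leaves only after verifying their declared token cost."""
    frontier, collected = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            children = node.get("children", [])
            if not children:                        # leaf: verify, then read
                if node["token_cost"] <= budget:
                    collected.append(node["name"])
                    budget -= node["token_cost"]
            else:
                next_frontier.extend(children)
        next_frontier.sort(key=score, reverse=True)
        frontier = next_frontier[:beam_width]       # beam pruning
    return collected

# Toy bakery subtree: categories scored against the query, leaves carry costs.
tree = {"name": "hospitality", "children": [
    {"name": "data-models", "relevance": 0.9, "children": [
        {"name": "Menu ::data", "relevance": 0.9, "token_cost": 100, "children": []}]},
    {"name": "compliance", "relevance": 0.8, "children": [
        {"name": "PCI-DSS ::callout", "relevance": 0.8, "token_cost": 80, "children": []}]},
    {"name": "ui-patterns", "relevance": 0.2, "children": [
        {"name": "checkout ::code", "relevance": 0.2, "token_cost": 120, "children": []}]},
]}
picked = descend(tree, lambda n: n.get("relevance", 0), budget=200)
```

With a budget of 200 the search reaches both the data-model and the compliance leaf; tightening the budget to 150 drops the compliance read, which is exactly the graceful narrowing that pre-computed costs (P4) make possible.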

This is where HBDS's cross-category retrieval matters most. A bakery application is not a single-topic query --- it requires data models from one subtree, compliance constraints from another, and UI patterns from a third. RAG's embedding similarity treats these as independent retrieval targets, surfacing whichever fragments happen to be closest to the query string. HBDS's hierarchical descent traverses laterally across sibling categories within the domain, surfacing structurally co-located knowledge that embedding proximity would miss. Study 3 (Section 6.3) directly tests this use case: the broad-scope generation prompts require exactly this cross-category retrieval, and the compliance coverage metric (M3) specifically measures whether the retrieval mechanism surfaces type-distant but structurally adjacent constraints. The phone-to-app pipeline is, in effect, the production deployment of HBDS's core capability --- budget-constrained, type-aware navigation of a knowledge hierarchy in service of generative specification.

8. Limitations and Future Work

8.1 Limitations

We identify five specific limitations of this work and the claims it can support.

Single-organization production data. The observational telemetry reported in Section 6.4 is drawn from 15 repositories within a single bootstrapped software company, operated primarily by a single developer. This confounds organizational culture, developer experience, codebase characteristics, and agent usage patterns into a single data point. The single-developer setting introduces an additional confound: the ARDS designer is also the primary user, so the observed improvements may reflect designer fluency rather than format properties. This limits our ability to claim that the production results generalize to larger teams, different industries, or organizations with different documentation cultures. Replication across at least three independent organizations of varying size is needed before external validity can be established.

One typed format tested. All three controlled studies (Sections 6.1--6.3) use a single typed document format as the experimental condition. Observed performance gains may be attributable to specific design choices in that format --- its particular block-type vocabulary, its frontmatter schema, its cross-reference syntax --- rather than to the general principle of typed traversability. This limits our ability to claim that typing itself drives the effect, as opposed to incidental format-specific features. Testing with alternative typed formats such as AsciiDoc (with custom block macros), DITA (with typed topics and relationship tables), or a minimal typed-Markdown extension would isolate whether the effect derives from typing in general or from specific format design decisions.

Agent-specific results. All experimental conditions use a single LLM as the agent backbone. Different agent architectures --- varying in context window size, tool-use paradigms, reasoning capabilities, and retrieval strategies --- may interact differently with typed structure. An agent with weaker instruction-following may fail to exploit block types even when declared; an agent with stronger long-context performance may mitigate some flat-text failure modes through brute-force reading. This limits our ability to claim architecture-independent benefits. Cross-agent replication using at least two architecturally distinct LLMs (e.g., differing context window sizes or tool-use interfaces) is required.

HBDS not yet benchmarked at scale. The HBDS algorithm design is complete and its formal properties are argued in Section 5, but empirical evaluation has not been conducted on large knowledge hierarchies. The scaling behavior of beam-search descent, budget tracking, and write-back on trees with 10K+ nodes and 100K+ cross-references remains theoretical. This limits our ability to claim that HBDS is practical for organizational-scale knowledge bases. Open-source implementation followed by benchmarking on standardized knowledge-intensive retrieval tasks is the necessary next step.

Authoring cost not measured. Typed documents impose higher authoring overhead than flat Markdown: authors must learn a block-type vocabulary, declare cross-references explicitly, and maintain schema-valid frontmatter. This paper measures the consumption benefits of typed structure but not the production costs. The ROI calculation --- whether agent performance gains justify the additional authoring effort --- is unaddressed. AI-assisted authoring tools could reduce this cost substantially, but that claim is speculative and unvalidated. Without authoring-cost data, practitioners cannot make informed adoption decisions, and the paper's practical recommendations remain incomplete.

8.2 Future Work

The limitations above define five concrete research directions, each paired with the limitation it addresses.

HBDS implementation and benchmarking. The most immediate priority is an open-source reference implementation of HBDS with reproducible benchmarks against RAG on standardized knowledge-intensive tasks. KILT [32] and Natural Questions [33], adapted for document-corpus retrieval rather than single-passage extraction, provide established evaluation frameworks. The key experiment is HBDS on typed corpora versus RAG on the same content in flat form, measured on retrieval precision, token efficiency, and answer accuracy across knowledge trees of 10K+ nodes. This directly addresses limitation four (scale): until HBDS is benchmarked on trees substantially larger than the 15-repository production corpus, its scaling properties remain theoretical.

Knowledge-context unification. Section 6.4 demonstrates typed traversability on code repositories. The natural extension is organizational knowledge: company wikis, policy documents, training materials, and institutional memory. The thesis is that knowledge is context --- the same typed structure that helps agents navigate a codebase should help them navigate a knowledge base, because the retrieval problem is identical. This requires extending the typed-traversability properties (Section 4.1) beyond software artifacts to general-purpose documents, testing whether P1--P4 retain their algorithmic value when block types shift from ::code and ::decision to ::policy, ::procedure, and ::regulation. This addresses limitation one (single organization) by moving typed traversability into domains where the single-organization confound can be tested independently.

AI-native knowledge bases. Section 7.3 argues that platforms storing content in typed, traversable formats become AI-native knowledge bases without adapter layers. The research question is whether a Wikipedia-scale knowledge base --- millions of typed documents with declared hierarchies and cross-references --- can be built and maintained for dual human-AI consumption, with HBDS as the primary traversal algorithm. A three-phase roadmap (internal tool, product feature, public platform) provides incremental validation. This addresses the platform implications identified in Section 7.3.

Typed RAG. Section 5.4 acknowledges that HBDS requires typed corpora --- a real constraint when most existing content is flat. The pragmatic hybrid is Typed RAG: embedding-based retrieval for initial corpus-scale candidate selection (fast, approximate, format-agnostic), followed by HBDS for structured descent within retrieved subtrees (precise, budget-aware, type-exploiting). The research question is whether the hybrid retains HBDS's precision advantages on the six dimensions identified in Table 5 while matching RAG's scalability to untyped corpora. This addresses the honest trade-off from Section 5.4 by reducing the format prerequisite from the entire corpus to retrieved subtrees.

Self-organizing knowledge trees. Section 5.5 defines write-back as a structural capability and acknowledges that its long-term value is empirically unvalidated. A longitudinal study is needed: deploy HBDS with write-back enabled on a production knowledge base, measure retrieval accuracy trend, summary quality, and trust score distribution over months, and compare against a frozen baseline (identical corpus, no write-back). The core question is whether retrieval quality monotonically improves or whether write-back introduces drift, noise accumulation, or trust inflation that degrades the hierarchy over time. This addresses Section 5.5's forward-looking acknowledgment with the sustained measurement it explicitly calls for.

9. Conclusion

The document format is not a cosmetic choice for AI-agent systems. It is an architectural one that determines which algorithms are available, which failure modes are inevitable, and which capabilities are structurally possible. This paper has argued that thesis through formal definition, algorithmic construction, and empirical evidence --- and in each case, the conclusion is the same: flat text is not merely inconvenient for AI agents; it is insufficient. The format withholds information that algorithms require, and no improvement in model quality can compensate for information the input does not contain.

The central proof is constructive. We defined typed traversability as four properties --- declared block types (P1), typed cross-references (P2), priority signals (P3), and pre-computed token costs (P4) --- and presented HBDS (Hierarchical Budget-Constrained Descent Search), a four-phase traversal algorithm that navigates knowledge hierarchies under strict token budgets. Each phase of HBDS maps to at least one required property: ORIENT requires P2 to identify a search root in a typed hierarchy; DESCEND requires P1 for type-aware relevance scoring, P4 for pre-read budget verification, and P1 + P3 for graceful degradation to summary-only retrieval; SYNTHESIZE requires P1 for type-aware deduplication; WRITE BACK requires P2 for hierarchical placement of new knowledge. Removing any single property disables at least one phase. Flat text --- Markdown, HTML, plain prose --- provides none of the four properties. The algorithm is impossible on Markdown. It is trivial on a typed format. This converts the argument from "typed structure is nicer" to "typed structure enables algorithms that flat text prohibits" --- a provable claim, not a preference.

The five failure modes identified in Section 3 ground this proof in concrete agent workflows. Structure hallucination (F1) and type ambiguity (F2) arise because flat text lacks P1: the agent must guess what every block is. Broken traversal (F3) arises because flat text lacks P2: links carry no relationship semantics, and the agent cannot build a navigable knowledge graph. Context budget waste (F4) arises because flat text lacks both P3 and P4: the agent cannot prioritize reading and cannot verify costs before committing tokens. Semantic lossy compression (F5) arises because flat text lacks P1 and P3: the agent has no typed compression strategy and falls back to positional truncation. Each failure mode maps to a missing property; each missing property disables an HBDS phase. The failure modes and the algorithm agree on what the format must provide.

The empirical evidence converges from multiple directions. Three controlled studies test the thesis directly: Study 1 measures context retrieval accuracy across flat versus typed conditions, isolating the effect of block types on hallucination rate and token consumption. Study 2 measures knowledge graph traversal, isolating the effect of typed cross-references on dependency-tracing completeness and depth. Study 3 compares HBDS against standard RAG on a knowledge-intensive specification generation task, testing whether hierarchical descent on typed documents produces more complete, more correct output with fewer retrieval tokens than embedding-similarity retrieval on flat text. Production telemetry from 15 repositories over eight weeks corroborates the controlled findings with real-world scale: structured context documentation reduced discovery overhead from 40--60% of the context window to 15--25%, reduced agent error rates, and improved unassisted task completion from approximately 60--70% to 80--90%. The evidence is preliminary --- the controlled studies are designed but not yet executed, and the observational data carries the confounds acknowledged in Section 8.1 --- but the theoretical prediction and the directional evidence are consistent.

The implication is practical and immediate. The next generation of AI-agent tooling should treat document format as a first-class architectural decision --- not a serialization choice downstream of retrieval, but a capability gate upstream of it. Agent framework designers should build typed AST consumption as a native input mode, not an afterthought that serializes structure away before the agent sees it. Document format designers should treat AI agents as a first-class audience alongside human readers, targeting Level 5 on the typed-traversability spectrum: all four properties present, with ceremony cost low enough for human authors to sustain. Knowledge base platforms should export typed, traversable structure rather than opaque HTML or API-gated JSON, because the platform that declares its structure wins the agent ecosystem. The RAG community should investigate typed RAG --- embedding similarity for corpus-scale candidate retrieval, HBDS for type-aware descent within retrieved subtrees --- as a path beyond the limitations of flat chunking and topology-blind retrieval.

The era of flat text as the default medium for AI-agent consumption is ending --- not because flat text is bad for humans, but because it is provably insufficient for the algorithms that make agents effective. What replaces it is typed, traversable, budget-aware structure --- designed not for how humans read, but for how algorithms search.

References

[1] J. L. Lulla, S. Mohsenimofidi, M. Galster, J. M. Zhang, S. Baltes, and C. Treude, "On the impact of AGENTS.md files on the efficiency of AI coding agents," arXiv preprint arXiv:2601.20404, 2026.

[2] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, "Lost in the middle: How language models use long contexts," Trans. Assoc. Comput. Linguist., vol. 12, pp. 157--173, 2024. doi: 10.1162/tacl_a_00638

[3] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Kuttler, M. Lewis, W. Yih, T. Rocktaschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 9459--9474.

[4] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson, "From local to global: A Graph RAG approach to query-focused summarization," arXiv preprint arXiv:2404.16130, 2024.

[5] P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D. Manning, "RAPTOR: Recursive abstractive processing for tree-organized retrieval," in Proc. Int. Conf. on Learning Representations (ICLR), 2024.

[6] N. Gupta, W.-C. Chang, N. Bui, C.-J. Hsieh, and I. S. Dhillon, "LLM-guided hierarchical retrieval," arXiv preprint arXiv:2510.13217, 2025.

[7] J. Gruber, "Markdown: Syntax," Daring Fireball, 2004. [Online]. Available: https://daringfireball.net/projects/markdown/syntax

[8] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.

[9] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV, et al., "RWKV: Reinventing RNNs for the Transformer era," in Findings of the Assoc. for Computational Linguistics: EMNLP 2023, 2023, pp. 14048--14064.

[10] T. Dao and A. Gu, "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality," in Proc. Int. Conf. on Machine Learning (ICML), 2024.

[11] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," arXiv preprint arXiv:2004.05150, 2020.

[12] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed, "Big Bird: Transformers for longer sequences," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 17283--17297.

[13] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2022.

[14] A. Graves, G. Wayne, and I. Danihelka, "Neural Turing Machines," arXiv preprint arXiv:1410.5401, 2014.

[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998--6008.

[16] S. Pope, S. Narayan, H. Tillet, and T. Dao, "KV cache scaling analysis for large language models," in Efficient Natural Language and Speech Processing Workshop, NeurIPS, 2023. [Note: KV cache memory estimates are widely documented in inference deployment literature; representative figures derive from Llama 3 technical reports and deployment guides.]

[17] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, "RoFormer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024. doi: 10.1016/j.neucom.2023.127063

[18] O. Press, N. A. Smith, and M. Lewis, "Train short, test long: Attention with linear biases enables input length extrapolation," in Proc. Int. Conf. on Learning Representations (ICLR), 2022.

[19] Google DeepMind, "Gemini 2.5 Pro," Google AI, 2025. [Online]. Available: https://ai.google.dev/gemini-api/docs/models

[20] Meta AI, "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation," Meta AI Blog, Apr. 2025. [Online]. Available: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[21] Common Crawl Foundation, "Common Crawl," 2024. [Online]. Available: https://commoncrawl.org/. [Note: Corpus statistics (240+ billion pages, multi-petabyte scale) are from the publicly available crawl archives and documentation.]

[22] International Data Corporation (IDC), "IDC Global DataSphere Forecast, 2021--2025," IDC, Framingham, MA, USA, 2021. [Note: The 181 zettabytes figure for 2025 and the approximately four-year doubling period are from IDC's Global DataSphere forecast methodology.]

[23] L. Lamport, LaTeX: A Document Preparation System --- User's Guide and Reference Manual, 2nd ed. Reading, MA, USA: Addison-Wesley, 1994.

[24] American Mathematical Society, "Using the amsthm package," version 2.20.3, 2017. [Online]. Available: https://www.ams.org/arc/tex/amscls/amsthdoc.pdf

[25] N. Walsh and L. Muellner, DocBook: The Definitive Guide. Sebastopol, CA, USA: O'Reilly Media, 1999.

[26] OASIS, "Darwin Information Typing Architecture (DITA) Version 1.3," OASIS Standard, Dec. 2015. [Online]. Available: https://docs.oasis-open.org/dita/dita/v1.3/dita-v1.3-part0-overview.html

[27] S. Rackham, "AsciiDoc," 2002; now governed by the AsciiDoc Working Group, Eclipse Foundation. [Online]. Available: https://asciidoc.org/

[28] H. Sun and S. Zeng, "H-MEM: Hierarchical memory for high-efficiency long-term reasoning in LLM agents," arXiv preprint arXiv:2507.22925, 2025.

[29] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, "MemGPT: Towards LLMs as operating systems," arXiv preprint arXiv:2310.08560, 2024.

[30] Z. Li, S. Song, H. Wang, S. Niu, D. Chen, J. Yang, C. Xi, H. Lai, J. Zhao, Y. Wang, et al., "MemOS: An operating system for memory-augmented generation (MAG) in large language models," arXiv preprint arXiv:2505.22101, 2025.

[31] W. Xu, Z. Liang, K. Mei, H. Gao, J. T. Tan, and Y. Zhang, "A-MEM: Agentic memory for LLM agents," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2025.

[32] F. Petroni, A. Piktus, A. Fan, P. Lewis, M. Yazdani, N. De Cao, J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, V. Plachouras, T. Rocktäschel, and S. Riedel, "KILT: A benchmark for knowledge intensive language tasks," in Proc. Conf. of the North American Chapter of the Assoc. for Computational Linguistics (NAACL), 2021, pp. 2523--2544.

[33] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, "Natural Questions: A benchmark for question answering research," Trans. Assoc. Comput. Linguist., vol. 7, pp. 452--466, 2019. doi: 10.1162/tacl_a_00276

[34] K. B. Davis, "Agent-Readable Documentation Standard (ARDS) v3.0 Specification," CloudSurf Software LLC, 2025. [Online]. Available: https://surfcontext.org/

[35] M. Yasunaga, H. Ren, A. Bosselut, P. Liang, and J. Leskovec, "QA-GNN: Reasoning with language models and knowledge graphs for question answering," in Proc. Conf. of the North American Chapter of the Assoc. for Computational Linguistics (NAACL), 2021, pp. 535--546.

[36] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-Networks," in Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp. 3982--3992.

[37] L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y. Fan, V. Zhao, N. Lao, H. Lee, D. Chang, and K. Guu, "Attributed question answering: Evaluation and modeling for attributed large language models," arXiv preprint arXiv:2212.08037, 2023. [Note: Structured retrieval improvement figures (15--25% over flat baselines on knowledge-intensive generation) are broadly consistent with findings across Graph RAG [4], RAPTOR [5], LATTICE [6], and the KG-guided retrieval literature [35].]

[38] H. V. F. Santos, V. Costa, J. E. Montandon, and M. T. Valente, "Decoding the configuration of AI coding agents: Insights from Claude Code projects," arXiv preprint arXiv:2511.09268, 2025.

[39] W. Chatlatanagulchai, H. Li, Y. Kashiwa, B. Reid, K. Thonglek, P. Leelaprute, A. Rungsawang, B. Manaskasemsak, B. Adams, A. E. Hassan, and H. Iida, "Agent READMEs: An empirical study of context files for agentic coding," arXiv preprint arXiv:2511.12884, 2025.

[40] S. Mohsenimofidi, M. Galster, C. Treude, and S. Baltes, "Context engineering for AI agents in open-source software," arXiv preprint arXiv:2510.21413, 2025.

[41] X. Zhu, Y. Xie, Y. Liu, Y. Li, and W. Hu, "Knowledge graph-guided retrieval augmented generation," in Proc. Conf. of the North American Chapter of the Assoc. for Computational Linguistics (NAACL), 2025, pp. 7517--7533.