Build a Local Multi-Agentic RAG App in 7 Steps: Transformers.js, Strands, ONNX, Orama

Stop paying vendor bills for every user turn. I built a local multi-agentic RAG app that runs every model — orchestrator, embedder, reranker — on the user’s own GPU, in the browser, with zero backend inference.

TL;DR — A fully working, privacy-first, local multi-agentic RAG app. Built with Transformers.js v4.2.0, Strands Agents TypeScript SDK v1.0, Hugging Face ONNX models, and Orama vector store. No API keys. No egress. No “install Python”. Cold start ~30 s, warm ~11 s, marginal cost per query: zero.



Why build a local multi-agentic RAG app at all

Every RAG app I shipped in the last year sent every user turn through three hosted vendors. An LLM provider for the orchestrator. An embeddings API for retrieval. A reranker for relevance. Each one printed a per-token bill at the end of the month.

Legal kept asking where the uploaded PDFs ended up.

Finance kept asking why the inference line item doubled quarter over quarter.

Latency oncall kept asking why p99 got 400 ms worse when us-east-1 hiccupped.

Most “local AI” posts I read solved exactly one of those problems. They ran a quantised Llama in the browser and called it done. A useful agent does not stop at text generation. It has to embed documents. It has to rerank candidates. It has to carry tool calls, cite sources, and keep enough visible structure that the user understands what it did.

A local multi-agentic RAG app is the honest version of that promise. Every model runs on the user’s own GPU. No API keys. No egress. No “also install Python”. The whole thing boots from a static Next.js page, downloads its weights once to OPFS, and answers questions about uploaded PDFs with inline citations — all on the user’s machine.

This post walks through the architecture I landed on, the places the browser platform surprised me, and enough pattern detail to reproduce the build yourself. I will be specific about versions, file names, and the exact gotchas that cost me hours. You can see the complete code in the reference repository on GitHub.


The 7-piece stack I picked (and why)

The stack for a local multi-agentic RAG has to satisfy three constraints at once: everything must run in a browser tab, every model has to fit in roughly one gigabyte of disk, and the agent glue has to be serious enough to support real sub-agent delegation.

  • Agent framework: Strands Agents TypeScript SDK v1.0. First-class multi-agent primitives. Browser-compatible. Drop an Agent into another agent’s tools: [...] array and it auto-wraps as a tool.
  • Orchestrator SLM: onnx-community/Qwen3.5-0.8B-Text (q4f16). Smallest Qwen3 variant with fused LinearAttention in transformers.js v4.2.0. ~13× faster decode than the WASM fallback on Apple Silicon.
  • Runtime: @huggingface/transformers v4.2.0 + ONNX Runtime Web. Only mature JS runtime for ONNX on WebGPU. Ships the fused kernels.
  • Embeddings: nomic-ai/nomic-embed-text-v1.5, Matryoshka 256-dim. Truncatable vectors with 1.24-MTEB-point loss at one-third the storage.
  • Reranker: Xenova/bge-reranker-base (q4f16). Battle-tested XLM-RoBERTa cross-encoder for short-text query/passage ranking.
  • Vector store: Orama v3.1. Pure JavaScript. Supports pre-filtered vector search via a where clause. ~80 KB runtime.
  • Weight cache: OPFS via a custom CacheInterface. 500 MB+ quota. Streaming writes. Chrome’s Cache API silently caps around 200 MB — I learned that the expensive way.
  • UI: Next.js 16 + shadcn/ui + Tailwind v4. Boring. Productive. Gets out of the way.

Every piece on that list is free, open source, and fits inside a single-page app. None of them require a paid API key to ship. That is the entire proposition of a local multi-agentic RAG stack.


Step 1: architect the multi-agent topology

My first draft of the local multi-agentic RAG app wired up exactly one agent with four leaf tools: datetime, weather, search_document, and summarize. The orchestrator LLM picked which to call. On paper, that design was clean.

In practice, with a 0.8 B orchestrator, the routing got fuzzy.

“What does the file say about X?” sometimes triggered datetime. “Weather in Bengaluru” sometimes got answered from the model’s training data instead of a live tool call. Rewriting the system prompt with sharper rules did not fix it — small models do not carefully parse rule lists, they pattern-match on names and descriptions.

The fix was hierarchical delegation. Strands’ Agent.asTool() — or more compactly, dropping an Agent directly into another agent’s tools: [...] array — lets a specialist sub-agent own its own tools, its own system prompt, and its own descriptions. The orchestrator’s decision then shrinks to “doc-related? yes or no”. Small models get binary decisions right almost every time.

┌──────────────────────────────────────────────────────┐
│ Orchestrator Agent (Qwen3.5-0.8B)                    │
│                                                      │
│  tools: [                                            │
│    datetimeTool,          ← leaf, pure TS function   │
│    getWeatherAgent(),     ← sub-agent                │
│    getRagAgent(),         ← sub-agent                │
│    summarizeTool,         ← leaf, wraps the SLM      │
│  ]                                                   │
└──────────────────────────────────────────────────────┘
        │                              │
        │                              │
┌───────▼───────────┐          ┌───────▼────────────────┐
│ Weather sub-agent │          │ RAG sub-agent          │
│  tools:           │          │  tools:                │
│   geocodeTool     │          │   searchDocumentTool   │
│   currentTool     │          │     └─ embed + Orama   │
│   forecastTool    │          │     └─ bge-reranker    │
└───────────────────┘          └────────────────────────┘

Strands ships two other multi-agent primitives — Swarm (peer handoff) and Graph (deterministic DAG). I considered both. Both turned out to be wrong for a chat UI.

Replacing the top-level Agent with a Swarm would make every user turn a handoff dance between peer agents, which a 0.8 B SLM will not navigate reliably. A Graph would lock me into a fixed pipeline, which fights the “the model decides” spirit of agentic code in the first place.

I am keeping Swarm and Graph in my back pocket for research-assistant fan-out or reviewer/critic chains. For a single-user doc-QA app, the hierarchical sub-agent pattern earns its keep with the least risk.
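
For reference, here is roughly how the hierarchical topology above gets wired. Treat it as a sketch: the helper names (buildOrchestratorPrompt, getWeatherAgent, getRagAgent, datetimeTool, summarizeTool) are placeholders for the diagram’s components, and the exact Agent constructor option names are assumptions from my build rather than a canonical SDK reference.

import { Agent } from "@strands-agents/sdk";
import { TransformersJSModel } from "./transformersjs-model"; // custom provider from Step 2

// Sub-agents dropped into tools: [...] auto-wrap as tools (Agent.asTool()
// under the hood). datetimeTool / summarizeTool stay as plain leaf tools.
export function buildOrchestrator(scopeBlock: string) {
  return new Agent({
    model: new TransformersJSModel({ variant: "qwen3.5-0.8b", maxTokens: 512 }),
    systemPrompt: buildOrchestratorPrompt(scopeBlock), // template shown in Step 6
    tools: [
      datetimeTool,      // leaf, pure TS function
      getWeatherAgent(), // sub-agent: geocode + current + forecast
      getRagAgent(),     // sub-agent: search_document (embed + Orama + reranker)
      summarizeTool,     // leaf, wraps the SLM
    ],
  });
}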


Step 2: bridge Transformers.js to Strands

Strands ships model providers for Anthropic, Bedrock, OpenAI, Google, and Vercel AI. None helped me. I needed the model to live in the user’s browser.

The SDK is explicitly designed for this extension. Subclass Model<T>, implement stream(messages, options): AsyncIterable<ModelStreamEvent>, and return events in the Strands vocabulary.

The core of my custom provider is roughly 200 lines that glue pipeline('text-generation', ...) to Strands’ streaming event types:

import {
  Model,
  ModelMessageStartEvent,
  ModelContentBlockStartEvent,
  ModelContentBlockDeltaEvent,
  ModelContentBlockStopEvent,
  ModelMessageStopEvent,
  type ModelStreamEvent,
} from "@strands-agents/sdk";
import {
  pipeline,
  TextStreamer,
  InterruptableStoppingCriteria,
} from "@huggingface/transformers";

export class TransformersJSModel extends Model<Config> {
  async *stream(messages, options): AsyncIterable<ModelStreamEvent> {
    const pipe = await getSLM(this.config.variant);
    const prompt = pipe.tokenizer.apply_chat_template(
      toOpenAIChat(messages, systemPrompt),
      {
        add_generation_prompt: true,
        tokenize: false,
        tools: toolSpecs,
      },
    );

    const stopper = new InterruptableStoppingCriteria();
    this.activeStoppers.add(stopper);

    const queue: string[] = [];
    const streamer = new TextStreamer(pipe.tokenizer, {
      skip_prompt: true,
      callback_function: (text) => queue.push(text),
    });

    // Fire-and-forget: tokens arrive through the streamer callback above,
    // and the stopping criteria (not this promise) halts generation.
    pipe(prompt, {
      max_new_tokens: this.config.maxTokens,
      temperature: 0,
      streamer,
      stopping_criteria: stopper,
    });

    yield new ModelMessageStartEvent({
      type: "modelMessageStartEvent",
      role: "assistant",
    });
    yield new ModelContentBlockStartEvent({
      type: "modelContentBlockStartEvent",
    });

    let accumulated = "";
    while (/* tokens still arriving */) {
      const chunk = queue.shift();
      if (!chunk) continue; // queue momentarily empty; the loop condition decides when to stop
      accumulated += chunk;
      const toolCall = tryParseToolCall(accumulated, format);
      if (toolCall) {
        // Emit a toolUseStart + toolUseInputDelta instead of a text delta
      } else {
        yield new ModelContentBlockDeltaEvent({
          type: "modelContentBlockDeltaEvent",
          delta: { type: "textDelta", text: chunk },
        });
      }
    }

    yield new ModelContentBlockStopEvent({
      type: "modelContentBlockStopEvent",
    });
    yield new ModelMessageStopEvent({
      type: "modelMessageStopEvent",
      stopReason: "endTurn",
    });
  }

  interrupt() {
    for (const s of this.activeStoppers) s.interrupt();
  }
}

Two non-obvious details cost me real debugging time.

Tool-call format is model-specific. Qwen2.5 and SmolLM2 emit JSON inside <tool_call>{…}</tool_call> blocks. Qwen3.5 emits an XML-parameter form instead: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>. My provider carries both parsers and picks one per model variant. I got this wrong at first and the orchestrator looked like it was answering from memory because tool calls were slipping through as plain text.
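
To make the two formats concrete, here is a trimmed sketch of a dual parser. The regexes and the ToolCall shape are illustrative rather than the repo’s exact code, but they follow the two wire formats described above.

type ToolCall = { name: string; args: Record<string, unknown> };

// Qwen2.5 / SmolLM2 style: JSON wrapped in <tool_call>…</tool_call>
function parseJsonToolCall(text: string): ToolCall | null {
  const m = text.match(/<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/);
  if (!m) return null;
  try {
    const parsed = JSON.parse(m[1]);
    return { name: parsed.name, args: parsed.arguments ?? {} };
  } catch {
    return null; // JSON not complete yet; keep accumulating tokens
  }
}

// Qwen3.5 style: <function=name><parameter=k>v</parameter>…</function>
function parseXmlToolCall(text: string): ToolCall | null {
  const fn = text.match(/<function=([\w.-]+)>([\s\S]*?)<\/function>/);
  if (!fn) return null;
  const args: Record<string, unknown> = {};
  for (const p of fn[2].matchAll(/<parameter=([\w.-]+)>([\s\S]*?)<\/parameter>/g)) {
    args[p[1]] = p[2].trim();
  }
  return { name: fn[1], args };
}

export function tryParseToolCall(text: string, format: "json" | "xml"): ToolCall | null {
  return format === "xml" ? parseXmlToolCall(text) : parseJsonToolCall(text);
}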

You need InterruptableStoppingCriteria for a working Stop button. The user’s AbortSignal alone will not halt an in-flight pipe(…) call; the transformers.js decoder does not check it. Hand-rolling the cancellation path cost me a few extra lines. It saved my users from sitting through 30-second hallucination loops that cannot otherwise be interrupted.


Step 3: build the RAG pipeline end-to-end

The RAG half of a local multi-agentic RAG app is two linear flows. Ingest writes; retrieval reads. Neither one touches a server.

Ingest

File (.pdf/.md/.txt)
  │
  ├─► extract(file) → { kind, text, pdfPages? | rawMarkdown? | rawText? }
  │
  ├─► chunkDoc(extract, docId)
  │      sliding 512-token window, 64-token overlap
  │      emits Citation per chunk
  │        (pdf → page, md → line + heading path, txt → line)
  │
  ├─► embed(chunks, "document")
  │      nomic-embed-text-v1.5 with "search_document: " prefix
  │      Matryoshka-truncate to 256 dims, L2-normalize
  │
  └─► indexChunks(embeddedChunks)
         into Orama with schema:
           {
             id, text, index, docId, fileName,
             citation: JSON, embedding: vector[256]
           }
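
The embed step is the only model call on the ingest path. Here is a minimal sketch of it using the transformers.js feature-extraction pipeline, taking raw strings instead of chunk objects for brevity; the prefix handling and the 256-dim truncation are the parts that matter.

import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline("feature-extraction", "nomic-ai/nomic-embed-text-v1.5");

// Nomic requires a task prefix per input. Matryoshka training means the
// 768-dim output can be truncated to 256 dims and re-normalized (L2)
// with only a small quality loss.
export async function embed(texts: string[], kind: "document" | "query"): Promise<number[][]> {
  const prefix = kind === "document" ? "search_document: " : "search_query: ";
  const output = await extractor(texts.map((t) => prefix + t), {
    pooling: "mean",
    normalize: true,
  });
  return output.tolist().map((row: number[]) => {
    const truncated = row.slice(0, 256);
    const norm = Math.hypot(...truncated) || 1;
    return truncated.map((v) => v / norm);
  });
}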

Retrieval (called by the RAG sub-agent’s search_document tool)

user query
  │
  ├─► embedOne(query, "search_query")
  │
  ├─► vectorSearch(qVec, top_k=20, docIds: enabled)
  │      Orama: mode:'vector', where: { docId: enabledIds }
  │                                     ↑ pre-filter before ANN scoring
  │
  ├─► rerank(query, candidates, top_k=4)
  │      Xenova/bge-reranker-base cross-encoder on (query, passage) pairs
  │
  └─► return passages with { marker, chunk_index, score, source, text }

The shape that matters most for product quality is the marker field. Every passage the tool returns gets a 1-indexed marker. The sub-agent’s system prompt instructs the model to append [N] at the end of every factual claim.

When the model complies, I parse the markers client-side into React <Popover> triggers. When it does not comply — and small models skip citation rules under context pressure — a lightweight lexical matcher I wrote attributes each sentence to the passage with the most content-word overlap and inserts the marker post-hoc. The result looks identical to the user. The source-of-truth binding is still the real retrieval.

Per-document scoping

One Orama index holds chunks from every uploaded document. Each chunk carries its docId. A small in-memory DocRegistry publishes checkbox state to React. The vectorSearch function takes an optional docIds: string[] argument and Orama applies it as a pre-filter — straight from the v3.1 types:

where?: Partial<WhereCondition<Schema>>;
// accepts { docId: ['a', 'b'] }

Pre-filter matters because it happens before ANN scoring. I am not post-filtering top-K and hoping enough candidates from the selected documents survived. If a user has a five-document library with two enabled, retrieval is strictly scoped to those two. No leakage.
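
Put together, the scoped search is a single Orama call. A sketch using the schema from the ingest diagram; the similarity threshold is illustrative, and the await on create is harmless if your Orama version makes it synchronous.

import { create, search } from "@orama/orama";

const db = await create({
  schema: {
    id: "string",
    text: "string",
    index: "number",
    docId: "string",
    fileName: "string",
    citation: "string",       // JSON-serialized Citation
    embedding: "vector[256]",
  },
});

export async function vectorSearch(qVec: number[], docIds: string[], topK = 20) {
  return search(db, {
    mode: "vector",
    vector: { value: qVec, property: "embedding" },
    similarity: 0.3,           // low floor; the bge reranker does the real ranking
    limit: topK,
    where: { docId: docIds },  // pre-filter applied before ANN scoring
  });
}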


Step 4: cache 500 MB of ONNX weights in OPFS

The naive approach is to let transformers.js fetch from Hugging Face on every session and trust HTTP cache. That failed me two ways.

First, Hugging Face’s default Cache-Control is not strong enough to survive a browser restart reliably. Users got re-downloads they did not expect.

Second — and this is the expensive discovery — transformers.js v4 ships an internal CacheInterface that writes to the browser Cache API, which silently fails on entries above roughly 200 MB in Chrome with a cryptic QuotaExceededError. I discovered during the build that the naive path works for tokenizer files and small configs but drops the 460 MB .onnx weights on the floor without warning. The cache claimed to be empty every cold start.

The fix I landed on is a custom CacheInterface backed by OPFS (navigator.storage.getDirectory()). Two reasons.

One: no per-entry size cap. OPFS writes are stream-based. The only ceiling is the origin’s total quota, which is typically 1 GB or more on a healthy disk.

Two: durable across reloads without any header games. Whatever I write stays written until I explicitly delete it.

Skeleton:

class OPFSCache implements CacheInterface {
  async match(request: string): Promise<Response | undefined> {
    const key = await sha256Hex(request);
    const root = await this.#root();
    try {
      const handle = await root.getFileHandle(key);
      const file = await handle.getFile();
      return new Response(file, { headers: await this.#readMeta(key) });
    } catch {
      return undefined; // cache miss: no entry written for this key yet
    }
  }

  async put(request: string, response: Response): Promise<void> {
    if ((await estimateUsageFraction()) >= 0.85) {
      await this.#evictUntilBelow(0.7);  // LRU
    }
    const key = await sha256Hex(request);
    const handle = await (await this.#root()).getFileHandle(key, {
      create: true,
    });
    const writable = await handle.createWritable();
    await response.body!.pipeTo(writable);
    await this.#writeMeta(key, {
      url: request,
      headers: /* … */,
      storedAt: Date.now(),
    });
  }
}

env.useCustomCache = true;
env.customCache = new OPFSCache();
env.useBrowserCache = false;  // don't even try the Cache API

One surprise worth planning for: OPFS does not release blocks immediately on removeEntry. Chrome’s accounting reports hundreds of MB of “file system” usage minutes after I deleted files. For a proof of concept that did not matter. For a shipping product, I would build quota eviction on top of navigator.storage.estimate() rather than assuming removeEntry is synchronous.
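
The estimateUsageFraction() call in put() above is the whole of that plan: ask the browser, not your own bookkeeping. A minimal sketch:

// navigator.storage.estimate() reflects what the browser will actually
// enforce, including blocks OPFS has not yet released after removeEntry().
async function estimateUsageFraction(): Promise<number> {
  const { usage = 0, quota = 1 } = await navigator.storage.estimate();
  return usage / quota;
}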


Step 5: fix cold-start latency and the asyncify.wasm myth

The first question I asked when I watched the network panel during the build was: “why is it downloading ort-wasm-simd-threaded.asyncify.wasm? I asked for WebGPU.”

Short version: I was still on WebGPU. The asyncify.wasm file is onnxruntime-web’s JSEP bridge — a WASM shim that hosts the JavaScript Execution Provider, which is how ONNX Runtime dispatches WebGPU shaders from a WASM-compiled model graph. The WASM in the filename is infrastructure. The actual matrix math runs on the GPU.

The confusion is reasonable because there are three WASM variants in the onnxruntime-web/dist/ folder, and only one of them is a real “fell back to CPU” signal:

  • ort-wasm-simd-threaded.wasm (12.9 MB): pure CPU path. This is the “fell back to WASM” signal.
  • ort-wasm-simd-threaded.asyncify.wasm (23.5 MB): WebGPU path via Asyncify. The file I want.
  • ort-wasm-simd-threaded.jsep.wasm (26 MB): older JSEP path, also WebGPU-capable.

I confirm I am on the GPU with ort.env.webgpu.adapter?.info.vendor from the console, not by reading filenames.
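
For completeness, this is roughly how the pipeline behind getSLM asks for the GPU in the first place. A sketch, assuming the model id from the stack table; device and dtype are the two options that decide whether you land on the asyncify path or the pure-CPU one.

import { pipeline } from "@huggingface/transformers";

// device: "webgpu" selects the WebGPU execution provider (the asyncify bridge);
// leaving it unset can land you on the 12.9 MB pure-CPU WASM path instead.
// dtype matches the q4f16 quantization of the ONNX export.
const pipe = await pipeline("text-generation", "onnx-community/Qwen3.5-0.8B-Text", {
  device: "webgpu",
  dtype: "q4f16",
});

// Then verify from DevTools, as above:
//   ort.env.webgpu.adapter?.info.vendor   // e.g. "apple"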

A clean cold start on an M-series Mac takes about 30 seconds end to end. That breaks down into: model download, WebGPU shader compile, prefill of the first system prompt + user turn. A warm reload with OPFS populated drops that to about 11 seconds. Roughly 90% of the remaining time is shader compile and prefill. No more network.


Step 6: tune routing quality on a 0.8 B orchestrator

Two patterns made the difference between an orchestrator that delegates cleanly and one that hallucinates around tool calls. This is the single most important craft lesson from my build.

Keep the system prompt short and specialist-named. A rules-heavy prompt is how I got a small model to pattern-match on rule phrasing instead of routing intent. The working template I landed on:

You are a privacy-first agent running in the user's browser.
Route the user's question to the best specialist, or answer directly for small talk.

Specialists:
- rag: answers questions about the user's uploaded documents.
- weather: current conditions or short-term forecast for a named place.
- datetime: current wall-clock date/time, optionally in a specific IANA timezone.

Also available:
- summarize: use ONLY for passages the user pasted inline.

Document scope: {scopeBlock}

Rules:
1. When a document is in scope, prefer `rag` for ANY question that could
   plausibly be answered from those documents — even vague phrasings.
2. When calling datetime with a location, pass it as an IANA timezone.
3. Be concise. Forward specialist answers unchanged.

Citation rules live inside the RAG sub-agent’s own prompt, where they apply. The orchestrator just forwards whatever comes back. That prevents double-rendering and keeps the orchestrator prompt sharp.

Push decision detail into tool descriptions. Strands’ router uses the tool name plus description as the primary routing signal. The model reads them immediately before deciding the next token. A tight, specific description outperformed long system-prompt paragraphs by a wide margin in my testing. The rag sub-agent’s description says what it is for, and what it does when no documents are enabled, and nothing else.
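
For concreteness, here is the shape of description that routed well for me, paraphrased rather than copied from the repo:

// Scope plus the empty-library behaviour, in one or two sentences.
// Anything longer got pattern-matched rather than read by the 0.8 B model.
const ragAgentDescription =
  "Answers questions about the user's uploaded documents using retrieval. " +
  "If no documents are currently enabled, says so and asks the user to enable or upload one.";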

I also added a light post-processing layer that catches the specific failure mode where the model emits a valid answer but forgets the [N] markers. When that happens the post-processor scans the answer sentence by sentence, scores each sentence against each retrieved passage on content-word overlap, and inserts the best-matching marker. The user sees the same citation popover they would have seen if the model had cited inline. The grounding is still real because the passages were the ones actually retrieved.
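
A minimal sketch of that post-processor, assuming a simple content-word overlap score; the repo’s stopword list and sentence splitter are more careful than this.

const STOPWORDS = new Set(["the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "with"]);

function contentWords(s: string): Set<string> {
  const words = s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => !STOPWORDS.has(w)));
}

export function addMissingCitations(
  answer: string,
  passages: { marker: number; text: string }[],
): string {
  return answer
    .split(/(?<=[.!?])\s+/)                          // naive sentence split
    .map((sentence) => {
      if (/\[\d+\]/.test(sentence)) return sentence; // model already cited this one
      const words = contentWords(sentence);
      let best: { marker: number; text: string } | undefined;
      let bestScore = 0;
      for (const p of passages) {
        const overlap = [...contentWords(p.text)].filter((w) => words.has(w)).length;
        if (overlap > bestScore) { bestScore = overlap; best = p; }
      }
      return best ? `${sentence} [${best.marker}]` : sentence;
    })
    .join(" ");
}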


Step 7: ship citations, scoping, and cancellation

Three UI affordances separate a demo from a shippable local multi-agentic RAG product.

Citations as first-class DOM elements. Every [N] marker in the assistant’s response becomes a React <Popover> trigger. Hover or click, and the source passage appears with its file name and line or page number. No “here are the sources” footer. No modal. Just the marker inline, exactly where the claim it supports sits.
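
Rendering is a matter of splitting the answer text on the markers. An illustrative component, assuming the default shadcn/ui Popover import path:

import { Popover, PopoverTrigger, PopoverContent } from "@/components/ui/popover";

type Passage = { marker: number; source: string; text: string };

export function AnswerWithCitations({ text, passages }: { text: string; passages: Passage[] }) {
  // Split on [N] markers, keeping the markers as their own array entries.
  return (
    <p>
      {text.split(/(\[\d+\])/g).map((part, i) => {
        const m = part.match(/^\[(\d+)\]$/);
        const passage = m ? passages.find((p) => p.marker === Number(m[1])) : undefined;
        if (!passage) return <span key={i}>{part}</span>;
        return (
          <Popover key={i}>
            <PopoverTrigger className="align-super text-xs">[{passage.marker}]</PopoverTrigger>
            <PopoverContent>
              <div className="font-medium">{passage.source}</div>
              <div className="text-sm">{passage.text}</div>
            </PopoverContent>
          </Popover>
        );
      })}
    </p>
  );
}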

Per-document scoping via checkboxes. The user uploads multiple files. A checkbox per file controls whether that file is in the active retrieval scope. Toggling rebuilds the orchestrator with a fresh scope block in the system prompt so the model knows exactly what documents it can draw from. The Orama where filter enforces the same scope at the retrieval layer so there is no leakage even if the model misroutes.

A Stop button that actually works. During a generation, the Send button swaps for Stop. Pressing it triggers both the AbortSignal path (which stops the Strands agent loop between tool calls) and the InterruptableStoppingCriteria.interrupt() path (which halts the transformers.js decoder mid-generation). Without the second call, the decoder runs to max_new_tokens regardless of what the user clicks.
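
The wiring itself is two calls. The names here are illustrative; interrupt() is the provider method from Step 2.

function onStop(agentAbort: AbortController, model: TransformersJSModel) {
  agentAbort.abort();  // stops the Strands agent loop between tool calls
  model.interrupt();   // halts the transformers.js decoder mid-generation
}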


When a local multi-agentic RAG wins, and when it loses

A local multi-agentic RAG stack is not a universal answer. It is a deliberate trade-off. Here is my honest take after building one.

Wins

  • Privacy-sensitive workflows. Legal. Medical. Internal-corpus QA where the data physically cannot leave the user’s machine. Every compliance review I have sat through gets 10× easier when the answer to “where does the PDF go” is “nowhere”.
  • Cost-sensitive long-tail products. Per-user inference amortises over many turns. After the first session’s weight download, marginal cost per answer is literally zero.
  • Offline-capable PWAs and browser extensions. Users on spotty connectivity, users on airplanes, users in workshops with no WiFi.
  • Demo-ability. No API keys to rotate, no backend to provision, no infra to break. I can hand a colleague a URL and the whole thing works.

Loses

  • Any workflow that needs a frontier model at the orchestrator position. 0.8 B is the ceiling for “runs on a laptop GPU without draining the battery in ten minutes”. Tasks that need Claude-level reasoning stay server-side, full stop.
  • First-visit cold start is rough. A user’s first question takes roughly 30 seconds because they are downloading around 600 MB of models. Design the empty state to tell them clearly what is happening.
  • Very long documents. My app caps at a sliding ingest budget — 10,000 chunks or 10 MB of vectors. A 2,000-page technical manual is out of scope for a client machine with about 1 GB of OPFS quota to play with.
  • Multi-user or multi-tab shared state. Each tab has its own Orama index. Cross-device sync needs a remote SnapshotStorage implementation, which is the exact thing this architecture was trying to avoid.

What to steal

Even if you never build a full local multi-agentic RAG app, a few of the patterns from this build are worth stealing into whatever you are building.

  • Agent.asTool() as the default integration surface for multi-agent. Forget “multi-agent” as a scary topic. It is literally an Agent inside a tools: [...] array. You get its event stream, its cancellation, its conversation history, for free.
  • Matryoshka embeddings. A 768-dim Nomic embedding truncated to 256 dims loses 1.24 MTEB points and 2/3 of your storage overhead. A near-freebie for any RAG app. Most teams still store full-dimensional vectors by default.
  • OPFS is production-ready for ML assets. Chrome, Safari, and Firefox all ship it. For single-user workflows — browser extensions, PWAs, internal tools — OPFS is a more honest home for model weights than the Cache API.
  • Strands’ concurrent tool executor. When a model emits multiple tool calls in one assistant turn, Strands 1.0 runs them in parallel by default. For my combo queries like “weather in Bengaluru and what time is it in Tokyo”, this meaningfully cut p95 latency.
  • Lexical fallback for citation compliance. Do not depend on the model to cite every time. Build a deterministic post-processor that binds uncited sentences to their best-matching retrieved passage. The user never has to know the model forgot.

Closing

There has been a real gap between the “you can run a model in the browser” demos and “you can ship a real agentic RAG app with sub-agents, citations, tool calls, and durable caching”. The gap is mostly not about the models any more. WebGPU-fused Qwen3.5 at 0.8 B parameters is good enough for most routing and doc-QA tasks when you give it a focused sub-agent prompt.

The gap is in the plumbing around the model.

A real agent framework that does not assume a hosted LLM. A cache that can hold half a gigabyte per bucket without silently corrupting. A retrieval stack that scopes by document. A UI that surfaces citations as first-class affordances. A cancellation path that actually stops a running decoder.

Every one of those exists today, for free, as an npm install. What is left is the building. Go build.

