Stoke Space
Fusion Intel

Fusion Intel

Building a Knowledge Engine from First Principles

Connor England

Please do interrupt with questions!

About Me

  • Current: Applied AI Developer & Solutions Architect, Fusion Health
  • Before: Co-founded SmartSave (fintech, USTE processing). Built Pidgeon Health (open-source C# healthcare data platform, used at Fusion and beyond)
  • Early Career: QA Lead, 3 food manufacturing plants (HACCP/SQF)
  • Certs: AWS Solutions Architect, CompTIA Security+, DeepLearning.AI Data Engineering
[Photos: family, aviation, Teddy, and guitar]
Manufacturing QA → Entrepreneurship → Healthcare AI → (hopefully) aerospace.
Overview

What We'll Cover

This talk is about

  • Fusion Intel: multi-domain RAG platform (designed + 80% coded by me)
  • Modular AI development that absorbs improvements fast
  • Lessons learned, applied to Boltline

This talk will

  • Derive requirements from domain constraints
  • Walk through architecture and implementation
  • Share strong opinions, loosely held
  • Get into the nitty gritty!
Fusion Intel
Context

What Is Fusion Intel?

A platform that lets our teams ask questions and get accurate, cited answers from our own documentation, powered by AI.

The problem with general AI

Trained on public data, not ours. Guesses when it doesn't know.

What RAG changes

Retrieval-Augmented Generation: searches our vetted docs first, then answers. Every claim traceable to source.

Intelligent Search
Finds the right information across thousands of pages using both keyword and meaning-based search
AI-Powered Answers
Generates clear, natural-language responses grounded in our documentation, with citations
Multi-Domain
Built once, used by RFP, Legal, Support, Client Success. Each team gets their own curated knowledge base.
Prioritizes correctness over helpfulness. Cite, flag, or refuse. Never guess.
The Problem

Why Build This?

First: RFP Bottleneck

  • RFPs are the lifeblood of business growth
  • 40-80 hrs per RFP, SMEs pulled off product work
  • Clear first use case: massive upside, limited downside

Next: Cross-Domain Intelligence Multiplier

  • Same knowledge bases serve Legal, Compliance, Support, CSM
  • Build once well, then scale from first principles
"We have all [the intel] we need to be winning deals left and right, if we could actually find it in time. Rovo sucks. Half the time, it can't find anything. Half of the remaining half, it lies. I don't want to ground our responses in what a bot told [our BD team] if failing to honor that claim means a few million sunk and our brand ruined. There's no question that this class of technology is powerful. [...] There's got to be a way." M. Jakovcic, CRO, initial planning meeting 12/2025
Framing

Constraints & Opportunities

Domain Constraints

  • Multi-department, one platform (RFP, Legal, Support, CSM)
  • HIPAA compliance: full data control required, no RAG-in-a-box
  • Docs scattered across Confluence + OneDrive, no unified search
  • Cross-domain sharing with clear department boundaries

Technical Constraints

  • RAG quality degrades at scale (46K pages, 131K tickets)
  • Vector search misses exact terms (drug names, product codes)
  • LLMs are sycophantic: will guess instead of refuse
  • AI moves fast: every component must be swappable
Constraint #1

The core value prop is genuinely accurate intelligence. If we don't have provenance, we don't have an app.

Helpfulness vs. honesty. We chose honesty.

The Governing Principle
Principle 1: "Tell the truth, or at least don't lie."
Principle 2: "Log and lead."
  • First Pass: Weaviate hybrid search across authority-tiered classes
  • Second Pass: Confluence deep search via Atlassian MCP. If found, recommend tagging into corpora.
  • Third Pass: Intelligence advisory. What was searched, what's missing, who to ask next.
A confidently wrong answer costs more than "I need more information."
→ New requirement: Citation-backed responses. "Insufficient Evidence" as a first-class output with actionable gap reporting.
Requirements v1

What we know so far

0 / 12

Functional

  • Citation-backed responses
  • "Insufficient Evidence" as a first-class output
  • Multi-pass retrieval with fallback advisory
  • Feedback loop for confident-but-wrong responses

Non-Functional

  • 100% citation presence on answered queries
  • <1% unsupported claims
  • Recurring queries served from verified cache
How It Works

The RAG Flow

Instead of asking a general AI model and hoping it knows the answer, we feed it our vetted information first. Here's how.

1. Tag
A team member tags a Confluence page or Jira ticket with a special label
2. Process
A webhook fires, sending the content through a pipeline that breaks it into chunks and converts it to "AI language" (embeddings)
3. Store
Stored in Weaviate, a specialized database that lets AI agents find exactly the right pieces quickly
4. Retrieve
When someone asks a question, the system searches for the most relevant chunks from our vetted knowledge
5. Answer
The AI generates a response grounded in our docs, with citations. No guessing, no hallucinating.
Without RAG
"I think Fusion might support that..."
No source. Could be wrong.
vs.
With RAG
"Yes. See User Manual ch.12, §4.2"
Cited. Verifiable. Grounded.
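The five steps reduce to a small control loop. A minimal sketch in Python, with `search` and `generate` as stand-ins for the real Weaviate retrieval and LLM calls (all names here are illustrative, not the actual codebase):

```python
def answer(question, search, generate):
    """The RAG loop in miniature: retrieve vetted chunks first,
    then generate a grounded answer from them."""
    chunks = search(question)              # step 4: retrieve from vetted knowledge
    if not chunks:
        return "Insufficient evidence."    # refuse rather than guess
    context = "\n".join(c["text"] for c in chunks)
    return generate(question, context)     # step 5: answer grounded in our docs

# Toy stand-ins to show the control flow:
hit = answer("q", lambda q: [{"text": "fact"}], lambda q, c: "grounded: " + c)
assert hit == "grounded: fact"
assert answer("q", lambda q: [], lambda q, c: "x") == "Insufficient evidence."
```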
Constraint #2

Multiple departments need this. We can't rebuild for each one. One architecture, many consumers, clear boundaries.

Flexibility vs. simplicity. Config-driven over code-driven.

Architecture

Designing for Multi-Domain from Day One

RFP response is the first use case. But Legal, Compliance, Client Success, and Support all need the same capability. The architecture serves all of them without retooling.

governance.yaml → Two-axis taxonomy: Routing (which department) × Content-type (what kind of doc)
                 | rag-rfp     | rag-legal     | rag-csm     | rag-support
manual           | RFP_Manual  | Legal_Manual  | CSM_Manual  | Supp_Manual
release-notes    | RFP_Release | Legal_Release | CSM_Release | Supp_Release
compliance       | RFP_Comply  | Legal_Comply  | CSM_Comply  | Supp_Comply
hl7-interfaces   | RFP_HL7     | Legal_HL7     | CSM_HL7     | Supp_HL7
  • Each intersection maps to a Weaviate class. Adding a new department = a YAML config change. No code changes. RBAC gates who sees what.
  • "Interface" = HL7 interface capabilities (interoperating with other vendor systems such as pharmacy, lab, and EHR).
  • When tags are removed, the system soft-deletes (archives) vectors rather than persisting them. This prevents mistakes from polluting the signal.
→ New requirements: Two-axis tag routing. Config-driven domain extensibility. RBAC-scoped access. Soft-delete on tag removal.
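As a sketch of how such a config might look (the governance.yaml file name comes from the slide; the keys and schema below are illustrative assumptions, not the real file):

```yaml
# Illustrative sketch only: key names and structure are assumptions.
routing:            # axis 1: which department
  - rag-rfp
  - rag-legal
  - rag-csm
  - rag-support
content_types:      # axis 2: what kind of doc
  - manual
  - release-notes
  - compliance
  - hl7-interfaces
classes:            # each intersection maps to a Weaviate class
  rag-rfp:
    manual: RFP_Manual
    release-notes: RFP_Release
    compliance: RFP_Comply
    hl7-interfaces: RFP_HL7
rbac:
  rag-legal: [legal-team]   # e.g. AD groups allowed to query this routing tag
```

Under this shape, adding a department is a new routing entry plus its class names: no code changes.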
Requirements v2

Building the list

2 / 12

Functional

  • Citation-backed responses
  • "Insufficient Evidence" as first-class output
  • Multi-pass retrieval with fallback advisory
  • Two-axis tag routing
  • Config-driven domain extensibility
  • RBAC-scoped access per department
  • Soft-delete on tag removal

Non-Functional

  • 100% citation presence
  • <1% unsupported claims
  • Idempotent ingestion (reprocess = zero duplicates)
Architecture

System Architecture

Presentation Layer
RFP UI
Legal UI
CSM UI
MCP Clients
REST / GraphQL / MCP
Each domain ships out-of-the-box workflows via slash commands, surfaced as recommended prompts. Stored prompt templates for reuse.
Agent Harness
Classify → Route → Retrieve → Rerank → Score → Generate → Cite
[Haiku]   [Sonnet]
Storage
Weaviate classes: RFP_* · Legal_* · CSM_* · General
Hybrid (Vector + BM25) + Cross-encoder
Ingestion
Confluence → Webhook → Service Bus → Tag Route → Snapshot → Chunk → Embed → Upsert
Future Dev: Exploring a centralized "ToolShed" MCP server (inspired by Stripe's Minions architecture) with access to Fusion-specific tools and third-party connections like Weaviate, Confluence, and Jira. One server, any client. See backup slide on harnesses for more.
Constraint #3

Knowledge changes constantly, and we have a lot of it. 46K pages, 131K tickets. Stay current without drowning in noise.

Volume vs. quality. Curate aggressively, ingest idempotently.

Ingestion

The Ingestion Pipeline

1. Webhook Fires

Confluence page created/updated triggers an Atlassian webhook. Event-driven means zero polling overhead.

2. Queue

Service Bus absorbs bursts, retries failures, dead-letters poison events. Decouples ingestion from source.

3. Tag Route

Labels map to taxonomy axes. Determines which Weaviate class(es) receive the content. Removals trigger soft-delete.

4. Snapshot

Canonical version stored in Blob Storage. Vector DB is derived, never authoritative. Always rebuildable.

5. Chunk

Semantic chunking: heading boundaries first, then paragraphs, then sentences. 300–500 tokens with 50-token overlap.

6. Dedup & Embed

Content-hash deduplication ensures idempotency. Embedding model generates vectors for each chunk.

7. Upsert

Vectors land in the correct Weaviate class with full metadata: source ID, heading path, authority tier, timestamps.

Every version gets a canonical snapshot before chunking. The vector DB is derived, never authoritative. You can always rebuild from snapshots.
→ New requirement: Idempotent ingestion. Reprocessing a page must never create duplicates.
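The dedup step can be sketched with deterministic chunk IDs: hashing the content means reprocessing the same page computes the same IDs, so an upsert can never create duplicates. A toy in-memory dict stands in for Weaviate; all names are illustrative:

```python
import hashlib
import uuid

def chunk_id(source_id: str, chunk_text: str) -> str:
    """Deterministic ID: same source + same content -> same ID every time."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_id}:{digest}"))

def upsert_chunks(store: dict, source_id: str, chunks: list) -> int:
    """Upsert into a toy store; returns how many chunks were actually new."""
    new = 0
    for text in chunks:
        cid = chunk_id(source_id, text)
        if cid not in store:
            new += 1
        store[cid] = {"source_id": source_id, "text": text}
    return new

store = {}
upsert_chunks(store, "CONF-123", ["chunk a", "chunk b"])
# Reprocessing the identical page creates zero duplicates:
assert upsert_chunks(store, "CONF-123", ["chunk a", "chunk b"]) == 0
assert len(store) == 2
```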
Data Quality

The Laffer Curve of RAG

More data isn't always better. With default HNSW settings, recall drops from 99% at 10K vectors to 85% at 10M. At Fusion's scale (~600K-1M chunk vectors), naive ingestion puts us in the degradation zone.

[Chart: retrieval recall vs. vector count: 99% at 10K, ~92% at 1M, 85% at 10M vectors. Fusion monolithic (~600K-1M vectors) vs. Fusion per-class (~2K-5K vectors).]
  • "Ingest everything" trap: noise dilutes signal as vector count grows
  • Our defense: domain-isolated classes, 2K-5K vectors per query scope
  • Authority tiers: Gold 3x / Silver 2x / Reference 1x scoring per class
→ New requirement: Domain-isolated vector classes. Each query must search thousands of curated vectors, not millions of noisy ones.

HNSW (Hierarchical Navigable Small World) is the approximate nearest-neighbor algorithm most vector databases use to find similar chunks quickly.
Sarkar, "HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows," Towards Data Science, Jan 2026

Future-Proofing

Swappable Components

Models commoditize. The moat is the harness. Every component behind an abstraction. Upgrades are config changes.

Today → Tomorrow
MiniLM (text) → Gemini Embedding v2 (multimodal)
Weaviate → Any DB via RetrievalBackend protocol
Haiku / Sonnet → Best model tomorrow via model router
Cohere Reranker → Any cross-encoder
Confluence → .htm manuals, PDFs, future sources
Constraint #4

"eMAR" and "electronic medication administration" are the same concept. "Barcode scanning" is an exact term. Search has to handle both.

Precision vs. cost. Hybrid search + smart model routing.

Retrieval

Why Hybrid Search

Vector Search (semantic similarity) + BM25 (exact term matching) → RRF + Rerank (cross-encoder top-5)

Vector search example
Q: "Does Fusion support barcode scanning?" → [0.31, -0.82, 0.15, ...]
Matched: "...medication verification process using scanning technology..." → [0.29, -0.79, 0.18, ...] (similarity: 0.84)

BM25 keyword example
Q: "barcode scanning"
Matched: "The barcode scanning module allows technicians to..." (BM25: 12.4)

RRF + Rerank
Merged: Chunk A (vector rank #3, BM25 rank #1) → RRF score
Cross-encoder reranker re-scores all candidates using the full query-chunk pair.
Top 5 chunks selected by relevance, not just keyword or semantic match alone.

Pure vector misses exact terms (drug names, product codes). Pure keyword misses semantic similarity. Hybrid with RRF merges both ranked lists. Cross-encoder reranking: 15 candidates to 5. +15–20% precision. Adds 200–400ms.

→ New requirement: P95 latency under 12 seconds. Reranking adds 200-400ms, but it's worth the precision gain for our use case.
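RRF itself is only a few lines: each ranked list contributes 1/(k + rank) per chunk, so chunks that appear high in both lists float to the top. A minimal sketch (k = 60 is the conventional constant from the RRF literature, not necessarily what the system uses):

```python
def rrf_merge(vector_ranked, bm25_ranked, k=60):
    """Reciprocal Rank Fusion: merge two ranked lists of chunk IDs."""
    scores = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["C", "A", "B"]   # semantic neighbors
bm25_hits   = ["A", "D"]        # exact-term matches
merged = rrf_merge(vector_hits, bm25_hits)
assert merged[0] == "A"   # ranked in both lists, so it wins
```

In the full pipeline the merged list would then go to the cross-encoder for reranking before the top 5 are selected.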
LLM Strategy

Two-Model Pipeline

Query → Haiku (classify/route) → Retrieval → Sonnet (generate) → Confidence check → Response

Haiku: Classifier & Router ~$0.01/query

Classifies domain, detects compound queries, decomposes into sub-queries. Structured output via tool_use. Independently testable.

Haiku Output
"Does Fusion support HL7 ADT feeds and what's the SLA?"
compound = true, split into 2 sub-queries:
[1] domain: rfp intent: retrieve "HL7 ADT feed support"
[2] domain: rfp intent: retrieve "HL7 interface SLA"
↓ routes to: RFP_Manuals, RFP_HL7

Sonnet: Generator ~$0.15/query

Generates citation-backed responses from reranked chunks. Every claim must trace to source. Confidence scoring is deterministic, happens after generation.

Sonnet Output
"Yes. Fusion EHR supports HL7 ADT (Admit, Discharge, Transfer) feeds via its integration engine. [Integration Guide p.12] Standard SLA for new HL7 interfaces is 10 business days. [SLA Matrix §3.1]"
CONFIDENCE: HIGH
2 gold sources, relevance 0.91
CITATIONS: 2
Integration Guide, SLA Matrix
  • Circuit breaker: embedding down = keyword-only fallback, system stays up
  • Model router: config surface. New models or compliance boundaries = no code changes.
→ New requirements: Query decomposition for compound questions. Circuit breaker for graceful degradation. Monthly cost under $800.
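The classifier's structured output can be pictured as a small typed payload; the real system obtains it from Haiku via tool_use, and the class and field names below are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class SubQuery:
    domain: str   # e.g. "rfp"
    intent: str   # e.g. "retrieve"
    text: str     # the decomposed sub-question

def route(sub_queries, class_map):
    """Map each sub-query's domain to the Weaviate classes it may search."""
    classes = set()
    for sq in sub_queries:
        classes.update(class_map.get(sq.domain, []))
    return classes

# The slide's compound query, split into two sub-queries:
subs = [
    SubQuery("rfp", "retrieve", "HL7 ADT feed support"),
    SubQuery("rfp", "retrieve", "HL7 interface SLA"),
]
targets = route(subs, {"rfp": ["RFP_Manual", "RFP_HL7"]})
assert targets == {"RFP_Manual", "RFP_HL7"}
```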
Requirements v3

The list grows

7 / 12

Functional

  • Citation-backed responses
  • "Insufficient Evidence" as first-class output
  • Multi-pass retrieval with fallback advisory
  • Two-axis tag routing
  • Config-driven domain extensibility
  • RBAC-scoped access
  • Soft-delete on tag removal
  • Query decomposition for compound questions
  • Circuit breaker / graceful degradation
  • Configurable model router

Non-Functional

  • 100% citation presence
  • <1% unsupported claims
  • Idempotent ingestion
  • P95 latency <12s
  • Monthly cost <$800
Constraint #5

The system will be wrong sometimes. The question isn't whether. It's whether it knows when.

Trust vs. coverage. We'd rather refuse than guess.

Confidence

Confidence Scoring

HIGH: multiple Gold sources agree (Gold = 3x weight)

MEDIUM: Silver sources, partial coverage (Silver = 2x weight)

INSUFFICIENT: refuses to answer, logs the gap (Reference = 1x weight)

EXAMPLE
Q: "Does Fusion EHR support electronic prescribing?"
Retrieved: 2 gold-tier sources (user manual ch.12, compliance doc §4.2) + 1 silver-tier (implementation guide §7), relevance 0.89
→ Confidence: HIGH | Weighted score: (2 × 3) + (1 × 2) = 8 > threshold of 5
Gold 3x + Gold 3x + Silver 2x = 8 > threshold (5) → HIGH
  • Rule-based, not learned. Fully explainable, no black-box confidence.
  • Gold sources count 3x in the weighted confidence calculation. Two gold sources matching = weighted score of 6. One silver = 2. The threshold for HIGH is 5+.
  • Compound queries: overall confidence = minimum across sub-queries.
→ New requirements: Source authority tiers (Gold 3x, Silver 2x, Reference 1x). Over 90% accuracy on 50-question validation set.
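The rule-based scoring fits in a few lines. The tier weights and the HIGH threshold of 5 come from the slide; the MEDIUM cutoff shown here is an assumption for illustration only:

```python
TIER_WEIGHT = {"gold": 3, "silver": 2, "reference": 1}
HIGH_THRESHOLD = 5

def confidence(source_tiers):
    """Rule-based confidence: weighted sum of retrieved sources' authority tiers."""
    score = sum(TIER_WEIGHT[t] for t in source_tiers)
    if score >= HIGH_THRESHOLD:
        return "HIGH"
    if score >= 2:               # assumed MEDIUM cutoff, for illustration
        return "MEDIUM"
    return "INSUFFICIENT"

# The slide's example: two gold + one silver -> (2 * 3) + (1 * 2) = 8
assert confidence(["gold", "gold", "silver"]) == "HIGH"
assert confidence(["silver"]) == "MEDIUM"
assert confidence([]) == "INSUFFICIENT"
```

For compound queries, the overall confidence would be the minimum of `confidence(...)` across sub-queries, per the slide.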
Refusal

Teaching the System to Refuse

  • LLMs are sycophantic. Will guess to be "helpful."
  • Claude chosen for its refusal backbone, but model-level refusal isn't enough
  • Harness enforcement: confidence scoring is code, not model judgment. Below threshold = generation blocked.
  • Every refusal logged: what was searched, what was missing
The gap tracker turned out to be the system's most valuable output: a prioritized list of documentation holes.
100 queries → 82 answered with citations · 18 insufficient evidence (logged) → Gap Report: prioritized backlog
→ New requirement: Gap tracking. Every refusal must log what was searched, what was missing, and feed that directly to content stakeholders.
Requirements v4

Almost there

12 / 14

Functional

  • Citation-backed responses
  • "Insufficient Evidence"
  • Multi-pass retrieval + fallback advisory
  • Two-axis tag routing
  • Config-driven extensibility
  • RBAC-scoped access
  • Soft-delete on tag removal
  • Query decomposition
  • Circuit breaker
  • Configurable model router
  • Golden answer cache for recurring queries
  • User feedback loop (thumbs up/down + corrections)
  • Gap tracking: refusals feed content backlog
  • Source authority tiers

Non-Functional

  • 100% citation presence
  • <1% unsupported claims
  • Idempotent ingestion
  • P95 <12s
  • Monthly cost <$800
  • >90% accuracy on 50-question validation set
Constraint #6

RFPs ask the same questions across submissions. Users will catch errors the system can't. Both signals need to feed back into the loop.

Repetition is an optimization opportunity. User corrections are free eval data.

Evaluation

The Eval Harness

  • 50 questions: real RFP questions + adversarial out-of-scope probes
  • Built first, not last. The eval harness existed before the generation pipeline.
  • CI/CD integration: automated triggers after large doc updates, model switches, chunking changes, weekly regression
  • User feedback: thumbs up/down + correction text on every response. Catches "confident but wrong."
SAMPLE EVAL RUN
Q: "Does Fusion support HL7 ADT feeds?"
Expected: Yes, with citation to Integration Guide p.12
Actual: "Yes. Fusion supports HL7 ADT (Admit/Discharge/Transfer) feeds..." [cite: Integration Guide]
Precision@5: 0.80 Faithfulness: 1.00 Citation Acc: 1.00 Confidence: HIGH
Gap tracker feeds directly to content stakeholders. Every "I don't know" becomes a backlog item with search context and SME routing.
  • Precision@5: 94% (47/50 queries returned correct top-5 chunks)
  • Faithfulness: 97% (0 hallucinated claims across the 50-question set)
  • Citation Accuracy: 100% (every citation links to a real Confluence source)
  • P95 Latency: 8.2s (95th percentile response time end-to-end)
→ New requirements: Golden answer cache for recurring queries. User feedback loop to catch confident-but-wrong responses.
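Precision@5, the headline retrieval metric, is simple to compute. A minimal sketch (chunk IDs are made up for illustration):

```python
def precision_at_5(retrieved, relevant):
    """Fraction of the top-5 retrieved chunk IDs that are actually relevant."""
    top5 = retrieved[:5]
    return sum(1 for c in top5 if c in relevant) / len(top5)

# Matches the sample eval run's 0.80: four of the top five are relevant.
retrieved = ["c1", "c2", "c3", "c9", "c4"]
assert precision_at_5(retrieved, relevant={"c1", "c2", "c3", "c4"}) == 0.8
```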
Evaluation

Eval Types Matrix

These eval types apply beyond just RFP. They validate any domain deployment.

Eval Type | What It Measures | Applies To
Retrieval Precision@5 | Right chunks in top 5? | All domains
Answer Faithfulness | Every claim supported by source? | All domains
Citation Accuracy | Citations point to real sources? | All domains
Confidence Calibration | High confidence = actually right? | All domains
Refusal Accuracy | Refuses when it should, accepts when it should? | All domains
Gap Report Quality | Recommendations actionable? | RFP, Support
Cross-Domain Leakage | RBAC prevents bleed? | Multi-domain
Chunking Boundary | Answers split across chunks? | All domains
Optimization

Golden Answer Cache

RFPs ask the same questions across submissions. Verified answers shouldn't go through the full pipeline every time.

New Query → Cache Check (similarity > 0.95?)
HIT → Serve verified answer (no LLM call, no retrieval)
MISS → Full RAG pipeline (retrieve, rerank, generate)

How it works

  • BD team approves a response with a "verify" action
  • Question + verified answer stored with embedding
  • New queries checked against cache before pipeline runs
  • Served with a "Verified" badge so users know it's human-reviewed

Why it matters

  • Zero latency, zero API cost for recurring questions
  • Consistency: same question always gets same answer across RFPs
  • Trust: "verified" badge signals human review
  • Translates directly to Boltline (recurring work plan queries)
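The cache check is an embedding-similarity test against stored verified answers. A minimal sketch with toy 2-D vectors standing in for real embeddings (the 0.95 threshold is from the slide; everything else is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def check_cache(query_vec, cache, threshold=0.95):
    """Return the verified answer if any cached question is close enough."""
    best = max(cache, key=lambda e: cosine(query_vec, e["vec"]), default=None)
    if best and cosine(query_vec, best["vec"]) > threshold:
        return best["answer"]   # HIT: no LLM call, no retrieval
    return None                 # MISS: fall through to the full RAG pipeline

cache = [{"vec": [1.0, 0.0], "answer": "Yes. See Integration Guide p.12 [Verified]"}]
assert check_cache([0.999, 0.01], cache) is not None   # near-duplicate question
assert check_cache([0.0, 1.0], cache) is None          # unrelated question
```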
Infrastructure

Azure Architecture

User (Browser)
Connects through APIM to Azure services
Azure APIM
Rate limiting • caching (20-30% hit) • auth • RBAC
Frontend (Container App)
React SPA, domain-specific workflows, DNS'd internally
Container App: API
Agent harness, query pipeline
Container App: Ingestion
Webhook listener, chunker
Container App: Weaviate
Self-hosted on our cloud, native hybrid search
Service Bus
Dead-letter enabled
Blob Storage
Canonical snapshots
Key Vault + AD
Secrets • RBAC
Anthropic API
Haiku + Sonnet (external)
Cost

What This Costs

Component | Monthly
Container Apps (3 containers, scale-to-zero) | $80 – $150
Weaviate (self-hosted) | $15 – $25
APIM + networking | ~$50
Anthropic API (Haiku + Sonnet) | $100 – $400
Storage + queue | $50 – $130
Total | $295 – $755 /mo
One BD professional's time on manual RFP research costs more per month than the entire platform. Plus SME opportunity cost: subject matter experts pulled off product work to help the revenue team.
Results

What We Gained

Revenue Pipeline

  • Q1 2026 was our biggest Q1 ever, 1.4x the previous record
  • On track to add ~3 more deals per quarter to the pipeline this year
  • Estimated additional $6-10M above initial target
  • More at-bats per year: faster RFP turnaround means we can pursue deals we previously had to pass on

Operational Improvements

  • Multimodal improvements driving better SLA adherence on support
  • NPS for client satisfaction and eNPS for support team both growing steadily
  • Better responses, easier troubleshooting through augmented workforce (internal)
  • Our domain is too complex for an Intercom Fin-style chatbot. Augmenting our own people is moving the needle.

Numbers shared at Fusion Health Town Hall, 3/18/2026
Monthly reporting from the Operations team (Value & Automation division, under which the AI function is housed)

Traceability

Closing the Loop

Every requirement traces to a design decision.

Requirement | Design Decision
Citation-backed responses | Sonnet generates inline citations from retrieved chunks
Insufficient Evidence | Three-tier confidence scoring, rule-based refusal
Multi-domain routing | Two-axis taxonomy, config-driven Weaviate classes
RBAC | Azure AD + tag-scoped class access
Idempotent ingestion | Content-hash dedup + canonical blob snapshots
P95 <12s | Haiku classifier, APIM caching, hybrid search
Cost <$800/mo | Scale-to-zero containers, two-model split
Gap tracking | Refusal logs → content backlog pipeline
>90% accuracy | 50-question eval harness, built first
Recurring query optimization | Golden answer cache: verified responses bypass RAG pipeline
Catching confident-but-wrong | User feedback loop: thumbs up/down + corrections, weekly review
Optimizations

If I Had More Time

Agentic Canvas Workflows

RFP section building + citation implantation in a document canvas (like Claude Artifacts), so BD pros iterate on documents inside the platform.

Golden Answer Cache

Verified answer store for recurring questions. Human-approved responses bypass the full RAG pipeline. Highest-ROI optimization for repetitive query patterns.

Cross-Domain Intelligence Synthesis

Queries spanning multiple department corpora with appropriate access controls. Synthesize across silos while respecting RBAC.

Teams Integration

Extending the MCP server pattern to a Microsoft Teams bot, where teams actually work. Same intelligence, delivered where the conversations happen.

Translation

How This Translates to Boltline

Modular Harness Architecture

FedRAMP, ~9 approved models. Harness performs regardless. Evals catch regressions on swap.

Azure + Anthropic → AWS Bedrock (FedRAMP)

Work Plans, BOMs, and Part Traceability

Work plans + BOMs as RAG corpus. Technicians query specs, get cited answers from actual docs.

RFP knowledge retrieval → Work plan + BOM intelligence

Hybrid Search for Exact Matches

Part numbers, serial numbers, QR codes. Exact-match miss is a showstopper. Hybrid search handles both.

Drug names, product codes → Part #s, serial #s, assembly IDs

Corpus Quality at Scale

Multi-tenant multimodal data. Class-scoped search prevents degradation. Evals catch regressions.

Domain-isolated classes → Per-customer class scoping
Stoke Space

Thank You

Questions? Let's go deep on anything.

Connor England

Backup: Harnesses

A Note on Harnesses

Models are commoditizing fast. The real moat is the harness: deterministic workflows that govern when and how intelligence gets applied. Two projects shaped how I think about this.

Block's Goose

Open-source, LLM-agnostic agent framework. MCP as the sole integration standard. Local-first, no vendor calls. Now in the Linux Foundation alongside MCP itself.

Stripe's Minions

Built on a Goose fork. 1,300+ PRs/week, zero human-written code. Key innovation: "Blueprints" (hybrid deterministic + agentic nodes). ~500 tools available, ~15 curated per task.

Request → Pre-hydrate Context → LLM Tool Call → Execute Tool → Context Revision → Response

Where these influenced Fusion Intel

  • Blueprints pattern (from Minions): our confidence scoring and citation linking are deterministic nodes. Retrieval and generation are the agentic nodes. The LLM never decides whether to refuse; the harness does.
  • Context pre-hydration (from Minions): we retrieve, rerank, and assemble the context payload before the generation model sees anything. The LLM's job is writing, not searching.
  • Tool curation (from Minions): a Legal query only sees Legal_* Weaviate classes + General. We scope tools per domain rather than exposing everything.
  • MCP as universal integration (from Goose): one protocol for all tool access. Same brain, any interface.
Backup: UX

UX Decisions Under the Hood

Backup: Chunking

Chunking Strategy

Chunk boundaries matter more than chunk size. A well-placed split preserves meaning; a bad one destroys it.
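A toy illustration of boundary-first splitting: headings first, then paragraphs, with a size cap. The real chunker works in tokens with 50-token overlap; this character-based sketch only shows the boundary-priority idea, and all names are illustrative:

```python
def chunk(text, max_len=500):
    """Split on heading boundaries first, then paragraphs; cap chunk size.
    Toy stand-in: character lengths instead of tokens, no overlap."""
    chunks = []
    for section in text.split("\n# "):          # heading boundary first
        buf = ""
        for para in section.split("\n\n"):      # then paragraph boundary
            if buf and len(buf) + len(para) > max_len:
                chunks.append(buf.strip())      # close the chunk at a boundary
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(buf.strip())
    return chunks

doc = "# Intro\n\nfirst paragraph\n\nsecond paragraph\n# Details\n\nthird paragraph"
parts = chunk(doc)
assert len(parts) == 2                 # one chunk per heading section here
assert "first paragraph" in parts[0]   # related paragraphs stay together
```

The point of the boundary priority: a chunk never ends mid-sentence or mid-thought unless the size cap forces it, which is what preserves meaning at retrieval time.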
Backup: Data Quality Deep Dive

The RAG Quality Cliff

Two independent studies confirm the same pattern: vector search accuracy degrades meaningfully as corpus size grows, even with good embeddings.

EyeLevel Research, 2024

Tested Pinecone + OpenAI ada-002 embeddings across 1K, 10K, and 100K pages. 310 real answer-bearing pages + filler. 92 test questions, human-evaluated.

  • 1K pages: baseline
  • 10K pages: visible drop
  • 100K pages: –12%

Expected degradation: 10-12% per 100K pages. A "page" is a full document page (PDF, HTML), not a chunk.

Sarkar (TDS), Jan 2026

Focused on HNSW algorithm behavior. With fixed index parameters (M=32, ef_search=128), recall degrades as vector count scales.

  • 10K vectors: 99% recall
  • 1M vectors: ~92%
  • 10M vectors: 85%

A "vector" is a single chunk embedding. 46K Confluence pages at ~10-20 chunks each = 600K-1M vectors.

Why this matters for system design

  • The degradation is a function of noise ratio, not a magic number. The more irrelevant content in your index relative to any given query, the harder it is to surface the right chunks. Both studies confirm this independently.
  • Mitigation strategies compound. Domain-isolated classes (search 2K-5K vectors, not 1M). Authority tier weighting (gold sources float above noise). Hybrid search (BM25 catches exact matches that vector search misses at scale). Cross-encoder reranking (re-scores the top-N for true relevance). Each layer recovers precision that raw vector search loses.
  • The "ingest everything" strategy has a structural ceiling. Teams adopting "ingest all docs of types X, Y, Z" into one index will hit this wall. The question isn't if, it's when. Class decomposition and curation delay the cliff indefinitely.
  • For Boltline: this matters even more in a multi-tenant manufacturing context. Customers with large volumes of data, especially multimodal (work plan docs, CAD references, inspection photos, test logs), need thoughtful class design and search strategies from day one. A platform that lets customer data grow unchecked into a monolithic index will hemorrhage effectiveness at scale. The right approach is guiding customers toward well-structured classes, curated ingestion policies, and domain-scoped search, so that the system gets more useful as data grows instead of less.

EyeLevel, "Do Vector Databases Lose Accuracy at Scale?", 2024   |   Sarkar, "HNSW at Scale," TDS, Jan 2026