Stoke Space
Fusion Intel

Building a Knowledge Engine from First Principles

Connor England

Please do interrupt with questions!

About Me

  • Current: Applied AI Developer & Solutions Architect, Fusion Health
  • Before: Co-founded SmartSave (fintech, USTE processing). Built Pidgeon Health (open-source C# healthcare data platform, used at Fusion and beyond)
  • Early Career: QA Lead, 3 food manufacturing plants (HACCP/SQF)
  • Certs: AWS Solutions Architect, CompTIA Security+, DeepLearning.AI Data Engineering
  • Flight: Lifelong love for aviation and aerospace. Currently working on PPL/IR.
Photos: Family, Aviation, Teddy and Guitar
Manufacturing QA → Entrepreneurship → Healthcare AI → (hopefully) aerospace.
Overview

What We'll Cover

This talk is about

  • Fusion Intel: multi-domain RAG platform (designed + 80% coded by me)
  • Modular AI development that absorbs improvements fast
  • Lessons learned, applied to Boltline

This talk will

  • Derive requirements from domain constraints
  • Walk through architecture and implementation
  • Share strong opinions, loosely held
  • Get into the nitty gritty!
Part I

Overview

The Problem, The Principles, The Constraints

Context

What Is Fusion Intel?

Accurate, cited answers from our own documentation, powered by AI.

The problem with general AI

General AI guesses. Dangerous in our domain.

What RAG changes

Search our docs first. Every claim traceable to source.

  • Intelligent Search: keyword + semantic across thousands of pages
  • AI-Powered Answers: grounded responses with citations
  • Multi-Domain: one platform, curated per department
Correctness over helpfulness. Cite, flag, or refuse.
The Problem

Why Build This?

First: RFP Bottleneck

  • RFPs are the lifeblood of growth
  • 40-80 hrs per RFP
  • Clear first safe use case

Next: Cross-Domain Intelligence Multiplier

  • Same knowledge bases serve all departments
  • Build once, scale from first principles

Build vs. Buy

Rovo (Atlassian's AI) tested and rejected. Couldn't enforce citations, hallucinated frequently.

Team Usability

Technically sound but usable for BD and comms professionals, not just engineers.

Compliance

Teams already using ChatGPT internally with zero guardrails. Hallucinations going into client-facing materials.

Framing

Constraints & Opportunities

Domain Constraints

  • Multi-department, one platform
  • HIPAA: full data control, no RAG-in-a-box
  • Docs scattered across Confluence + OneDrive
  • Cross-domain sharing with clear boundaries

Technical Constraints

  • RAG degrades at scale (46K pages, 131K tickets)
  • Vector search misses exact terms
  • LLMs are sycophantic: guess over refuse
  • AI moves fast: every component swappable
Part II

How It Works

Core Mechanics

Constraint #1

A wrong RFP claim can cost millions in failed contracts. If we can't cite the source, we can't use the answer.

Tradeoff: helpfulness vs. honesty

The Governing Principle
Principle 1: "Tell the truth, or at least don't lie."
Principle 2: "Log and lead."
Query: user asks a question
  • Pass 1: Weaviate Hybrid Search (vector + BM25 + rerank). Weighted score >= 5? (Gold 3x + Silver 2x + Ref 1x)
      YES (HIGH) → Generate cited response
      NO → Pass 2
  • Pass 2: Confluence MCP, CQL search of key spaces. Found it?
      YES → Generate cited response. Tag source if from Pass 2.
      NO → Pass 3
  • Pass 3: Intelligence Advisory. "Insufficient Evidence" with actionable next steps.
      Rich audit trail: what was searched, what's missing
      SME contacts + suggested questions + gap report
A confidently wrong answer costs more than "I need more information."
→ New requirement: Citation-backed responses. ‘Insufficient Evidence’ as a first-class output. Source authority tiers.
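
The three-pass flow above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the production harness: `Chunk`, `Result`, `answer`, and the two search callables are hypothetical names, and the real Pass 1 and Pass 2 would call Weaviate and the Confluence MCP server rather than stub functions. Only the tier weights and the threshold of 5 come from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

TIER_WEIGHTS = {"gold": 3, "silver": 2, "ref": 1}  # Gold 3x, Silver 2x, Ref 1x (from the slide)
HIGH_THRESHOLD = 5                                 # weighted score >= 5 counts as HIGH


@dataclass
class Chunk:
    text: str
    source: str
    tier: str  # "gold", "silver", or "ref"


@dataclass
class Result:
    status: str                      # "ANSWERED" or "INSUFFICIENT_EVIDENCE"
    chunks: List[Chunk] = field(default_factory=list)
    origin: Optional[str] = None     # which pass supplied the evidence
    audit: List[str] = field(default_factory=list)


def weighted_score(chunks: List[Chunk]) -> int:
    """Sum the authority weights of the retrieved chunks."""
    return sum(TIER_WEIGHTS[c.tier] for c in chunks)


def answer(query: str,
           hybrid_search: Callable[[str], List[Chunk]],
           confluence_search: Callable[[str], List[Chunk]]) -> Result:
    audit = []
    # Pass 1: hybrid search, gated on the weighted authority score.
    hits = hybrid_search(query)
    score = weighted_score(hits)
    audit.append(f"pass1: hybrid search, {len(hits)} chunks, score={score}")
    if score >= HIGH_THRESHOLD:
        return Result("ANSWERED", hits, "pass1", audit)
    # Pass 2: fallback search; the answer is tagged with its origin.
    fallback = confluence_search(query)
    audit.append(f"pass2: confluence search, {len(fallback)} results")
    if fallback:
        return Result("ANSWERED", fallback, "pass2", audit)
    # Pass 3: refuse, but keep the audit trail for the advisory.
    audit.append("pass3: advisory, no supporting evidence found")
    return Result("INSUFFICIENT_EVIDENCE", [], None, audit)
```

The refusal branch returns the audit trail instead of raising, so "Insufficient Evidence" is an ordinary result the UI can render.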
Requirements v1

What we know so far

0 / 15

Functional

  • Citation-backed responses
  • "Insufficient Evidence" as a first-class output
  • Multi-pass retrieval with fallback advisory
  • Source authority tiers (Gold 3x, Silver 2x, Ref 1x)

Non-Functional

  • 100% citation presence on answered queries
  • <1% unsupported claims
How It Works

The RAG Flow

1. Tag: Label a Confluence page or Jira ticket
2. Process: Webhook fires, chunks and embeds content
3. Store: Vectors stored in Weaviate
4. Retrieve: Hybrid search finds relevant chunks
5. Answer: Cited response from our docs

Without RAG: Guesses without sources.
vs.
With RAG: Cited, verifiable, grounded.
Architecture

System Architecture

Presentation Layer: RFP UI, Legal UI, CSM UI, MCP Clients (REST / GraphQL / MCP)
Agent Harness: Classify → Route → Retrieve → Rerank → Score → Generate → Cite [Haiku] [Sonnet]
Storage: Weaviate collections (RFP, Legal, Support, CSM, General). Hybrid (Vector + BM25) + Cross-encoder
Ingestion: Confluence → Webhook → Service Bus → Tag Route → Snapshot → Chunk → Embed → Upsert
The harness enforces RBAC at query time. An RFP query gets collection='RFP', content_type_filter=['manual','release_notes']. The model never sees Legal or Support content.
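
A minimal sketch of query-time scoping, assuming a role-to-scope table held by the harness. `Scope`, `ROLE_SCOPES`, and `scoped_search` are hypothetical names; only the RFP row (collection 'RFP', content types 'manual' and 'release_notes') is taken from the slide, and the Legal row is illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Scope:
    collection: str
    content_type_filter: Tuple[str, ...]


# Hypothetical role-to-scope table. The RFP row mirrors the slide;
# the Legal row is an assumption for illustration.
ROLE_SCOPES = {
    "rfp": Scope("RFP", ("manual", "release_notes")),
    "legal": Scope("Legal", ("contract", "soc2", "compliance")),
}


def scoped_search(role: str, query: str, search_fn: Callable):
    """Inject collection and content-type filters before any retrieval runs."""
    scope = ROLE_SCOPES.get(role)
    if scope is None:
        # Unknown roles get no retrieval at all rather than a default scope.
        raise PermissionError(f"no retrieval scope configured for role {role!r}")
    return search_fn(
        query=query,
        collection=scope.collection,
        content_types=list(scope.content_type_filter),
    )
```

Because the filters are applied by the harness rather than requested by the model, the generator only ever receives chunks from the caller's scope.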
Constraint #2

Users search by meaning and by exact terms (release versions, compliance codes, prescription identifiers). The retrieval layer needs to handle both well.

Tradeoff: precision vs. cost

Retrieval

Hybrid Search

Vector Search (semantic similarity) + BM25 (exact term matching) → RRF + Rerank (cross-encoder top-5)
Vector Search
Q: "Does Fusion support barcode scanning?"
[0.31, -0.82, 0.15, ...]
Matched: "...medication verification process using scanning technology..."
[0.29, -0.79, 0.18, ...]
similarity: 0.84
BM25 Keyword
Q: "barcode scanning"
Matched: "The barcode scanning module allows technicians to..."
BM25: 12.4
RRF + Rerank
Merged: Chunk A (vector rank #3, BM25 rank #1) → RRF score
Cross-encoder reranker re-scores all candidates using the full query-chunk pair.
Top 5 chunks selected by relevance, not just keyword or semantic match alone.

Neither search alone is enough. Hybrid + cross-encoder reranking: +15-20% precision.

→ New requirement: Hybrid retrieval (vector + BM25 + reranking). P95 latency under 12 seconds.
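
The merge step can be illustrated with a small reciprocal rank fusion (RRF) function. This is a sketch, assuming the conventional constant k = 60 and hypothetical chunk IDs; the slide does not state which constant or fusion variant the platform uses, and the cross-encoder rerank that follows is not shown.

```python
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk IDs: each list adds 1 / (k + rank) per chunk."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical rankings: chunk "A" is vector rank #3 and BM25 rank #1, as on the slide.
vector_ranking = ["B", "C", "A", "D"]
bm25_ranking = ["A", "B", "E"]
fused = rrf_fuse([vector_ranking, bm25_ranking])
```

Chunks that appear in both lists accumulate two contributions, which is why a chunk ranked well by both searches can outrank one that tops only a single list.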
LLM Strategy

Model Strategy

Query → Haiku (classify/route) → Retrieve → Rerank → Sonnet (generate) → Confidence check → Response

Haiku: Classifier ~$0.01/q

Routes queries to domain, decomposes compound questions. Structured JSON output.

Cohere: Reranker ~$0.003/q

Cross-encoder rescores top 15 → top 5. +15-20% precision. Evaluated in the eval harness.

Sonnet: Generator ~$0.15/q

Generates cited responses from reranked chunks. Every claim traces to source.

→ New requirements: Query decomposition. Circuit breaker. Configurable model router. Monthly cost under $800.
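
As a rough check on the cost target, the per-query figures above can be summed and compared with a monthly budget. This is back-of-envelope arithmetic only: it assumes every query passes through all three stages and ignores infrastructure, caching, and retries. `COST_PER_QUERY` and `queries_within_budget` are illustrative names.

```python
from decimal import Decimal

# Approximate per-query costs quoted on the slide (USD).
COST_PER_QUERY = {
    "haiku_classifier": Decimal("0.01"),
    "cohere_reranker": Decimal("0.003"),
    "sonnet_generator": Decimal("0.15"),
}


def queries_within_budget(monthly_budget_usd: int) -> int:
    """Whole number of full-pipeline queries that fit in a monthly budget."""
    per_query = sum(COST_PER_QUERY.values())
    return int(Decimal(monthly_budget_usd) / per_query)


per_query_total = sum(COST_PER_QUERY.values())  # 0.163 USD per query
```

Under these assumptions the generator accounts for most of the per-query spend, which is why serving recurring questions without an LLM call matters for the budget.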
Constraint #3

Sometimes the evidence just isn't there. That's signal, not a dead end, as long as the system tells the truth instead of guessing.

Tradeoff: honesty vs. sycophancy

Confidence

Confidence Scoring

  • HIGH: Gold sources, high relevance
  • MEDIUM: Partial match, verify
  • INSUFFICIENT: No evidence, system refuses

EXAMPLE
Q: "Does Fusion EHR support electronic prescribing?"
Retrieved: 2 gold-tier sources (user manual ch.12, compliance doc §4.2) + 1 silver-tier (implementation guide §7), relevance 0.89
→ Confidence: HIGH | Weighted score: (2 × 3) + (1 × 2) = 8 > threshold of 5
→ New requirements: Rule-based confidence scoring. Weighted authority tiers drive trust level.
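
The worked example maps directly onto a rule-based scorer. In this sketch the weights and the HIGH threshold come from the slides; the MEDIUM rule (any evidence below the threshold) is an assumption, since the deck does not give its exact cutoff, and `confidence` is a hypothetical function name.

```python
from collections import Counter
from typing import Iterable, Tuple

TIER_WEIGHTS = {"gold": 3, "silver": 2, "ref": 1}  # from the slide
HIGH_THRESHOLD = 5                                 # from the slide


def confidence(source_tiers: Iterable[str]) -> Tuple[int, str]:
    """Return (weighted score, confidence level) for the retrieved sources."""
    counts = Counter(source_tiers)
    score = sum(TIER_WEIGHTS[tier] * n for tier, n in counts.items())
    if score >= HIGH_THRESHOLD:
        level = "HIGH"
    elif score > 0:
        level = "MEDIUM"        # assumption: some evidence, below the HIGH bar
    else:
        level = "INSUFFICIENT"  # no evidence: the harness refuses
    return score, level


# The slide's example: 2 gold-tier sources + 1 silver-tier source.
example = confidence(["gold", "gold", "silver"])
```

Keeping the rule deterministic means the same retrieval result always yields the same trust level, independent of the generator model.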
Refusal

Teaching the System to Refuse

  • LLMs guess to be helpful. Harness blocks generation below threshold.
  • Every refusal logged with search context.
Gap tracker: most valuable output. Prioritized list of documentation holes.
100 queries → 82 answered with citations + 18 insufficient evidence (logged) → Gap Report (prioritized backlog)
→ New requirement: Harness-enforced refusal fidelity. Every refusal logged with search context.
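
One way to turn refusal logs into a prioritized backlog is to count refusals per topic. This is a sketch under assumptions: the `Refusal` record and its `topic` field are hypothetical (for example, a label produced by the classifier), and the real gap report may weigh other signals.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Refusal:
    query: str
    topic: str                 # hypothetical: topic label assigned at classification
    searched: Tuple[str, ...]  # collections searched before refusing


def gap_report(refusals: List[Refusal]) -> List[Tuple[str, int]]:
    """Rank topics by how often the system had to refuse."""
    counts = Counter(r.topic for r in refusals)
    return counts.most_common()


log = [
    Refusal("Does Fusion support FHIR bulk export?", "interoperability", ("RFP",)),
    Refusal("Which FHIR version is supported?", "interoperability", ("RFP", "General")),
    Refusal("What is the uptime SLA?", "sla", ("RFP",)),
]
backlog = gap_report(log)
```

The questions in `log` are invented for illustration; the point is that the topic with the most refusals rises to the top of the documentation backlog.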
Requirements v2

The list grows

9 / 15

Functional

  • Citation-backed responses
  • "Insufficient Evidence"
  • Multi-pass retrieval
  • Source authority tiers
  • Hybrid search (vector + BM25 + rerank)
  • Query decomposition
  • Circuit breaker
  • Confidence scoring (rule-based)
  • Configurable model router

Non-Functional

  • 100% citation presence
  • <1% unsupported claims
  • P95 latency <12s
  • Monthly cost <$800
Part III

Data Pipeline & Quality

Ingestion, Scale, and Optimization

Constraint #4

46K Confluence pages and 131K Jira tickets. The system has to stay current without drowning in noise.

Tradeoff: volume vs. quality

Ingestion

The Ingestion Pipeline

1. Webhook Fires

Confluence page created/updated triggers an Atlassian webhook. Event-driven means zero polling overhead.

2. Queue

Service Bus absorbs bursts, retries failures, dead-letters poison events. Decouples ingestion from source.

3. Tag Route

Labels map to taxonomy axes. Determines which Weaviate class(es) receive the content. Removals trigger soft-delete.

4. Snapshot

Canonical version stored in Blob Storage. Vector DB is derived, never authoritative. Always rebuildable.

5. Chunk

Semantic chunking: heading boundaries first, then paragraphs, then sentences. 300–500 tokens with 50-token overlap.

6. Dedup & Embed

Content-hash deduplication ensures idempotency. Embedding model generates vectors for each chunk.

7. Upsert

Vectors land in the correct Weaviate class with full metadata: source ID, heading path, authority tier, timestamps.

Canonical snapshots: always rebuildable, never re-crawl.
→ New requirement: Idempotent ingestion. Canonical snapshots for rebuildability.
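
Content-hash deduplication can be sketched with a deterministic chunk ID. This is an illustration under assumptions, not the pipeline's code: `chunk_id` and `InMemoryStore` are hypothetical, the whitespace normalization is a design choice of the sketch, and a real deployment would upsert into Weaviate instead of a dictionary.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_id(source_id: str, text: str) -> str:
    """Deterministic ID: same source and same normalized text give the same hash."""
    normalized = " ".join(text.split())  # collapse whitespace-only differences
    payload = f"{source_id}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class InMemoryStore:
    """Stand-in for the vector store, used only to show idempotent upserts."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.embed_calls = 0

    def upsert(self, source_id: str, text: str,
               embed: Callable[[str], List[float]]) -> bool:
        key = chunk_id(source_id, text)
        if key in self.rows:
            return False  # already ingested: skip embedding and writing
        self.embed_calls += 1
        self.rows[key] = {"source_id": source_id, "text": text, "vector": embed(text)}
        return True
```

Replaying the same webhook event then becomes a no-op, which is what makes queue retries safe.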
Data Quality

The Laffer Curve of RAG

Recall drops from 99% at 10K vectors to 85% at 10M. Class isolation keeps us in the safe zone.

[Figure: retrieval recall vs. vector count (chunks indexed). Recall falls from 99% at 10K vectors to ~92% at 1M and 85% at 10M. Fusion (monolithic): ~600K-1M vectors. Fusion (per-class): ~2K-5K vectors.]
  • "Ingest everything" = noise wins
  • Our defense: 2K-5K vectors per query, not 600K+
→ New requirement: Domain-isolated vector classes. Curated search scope per query.

HNSW (Hierarchical Navigable Small World) is the approximate nearest-neighbor algorithm most vector databases use to find similar chunks quickly.
Sarkar, "HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows," Towards Data Science, Jan 2026

Architecture

Document Curation & Class Isolation

One collection per domain. Tags route content in. The harness filters by content_type at query time.

Source: Confluence Page tagged rag-rfp-manual, rag-legal-contract → ingest
Weaviate Vector Store (1 collection per domain)
  • RFP: manual ✓, release_notes, past_rfp
  • Legal: contract ✓, soc2, compliance
  • Support: kb_articles, troubleshoot
  • CSM: playbooks, onboarding
  • General (all teams): product_specs, brand_spine
Query: RFP, Legal, Support, CSM
Dual-tagged pages get ingested into both collections. The harness filters by content_type at query time. No row-level security needed in Weaviate.
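
A sketch of how dual-tagged pages could be routed, assuming labels follow a `rag-<domain>-<content_type>` pattern as the two example labels suggest. The pattern, `COLLECTIONS`, and `route_labels` are assumptions for illustration; the actual label grammar is defined by the platform's governance config.

```python
from typing import List, Tuple

# Domain key in the label -> collection name shown on the slide.
COLLECTIONS = {
    "rfp": "RFP",
    "legal": "Legal",
    "support": "Support",
    "csm": "CSM",
    "general": "General",
}


def route_labels(labels: List[str]) -> List[Tuple[str, str]]:
    """Map page labels to (collection, content_type) pairs; ignore other labels."""
    routes = []
    for label in labels:
        parts = label.split("-", 2)  # e.g. ["rag", "rfp", "release-notes"]
        if len(parts) != 3 or parts[0] != "rag":
            continue
        _, domain, content_type = parts
        if domain in COLLECTIONS:
            routes.append((COLLECTIONS[domain], content_type.replace("-", "_")))
    return routes


routes = route_labels(["rag-rfp-manual", "rag-legal-contract", "team-notes"])
```

A page with two routing labels yields two routes, so it is ingested into both collections while unrelated labels are skipped.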
Requirements v3

Growing stronger

11 / 15

Functional

  • Citation-backed responses
  • "Insufficient Evidence"
  • Multi-pass retrieval
  • Source authority tiers
  • Hybrid search (vector + BM25 + rerank)
  • Query decomposition
  • Circuit breaker
  • Confidence scoring (rule-based)
  • Configurable model router
  • Idempotent ingestion
  • Golden answer cache for recurring queries
  • Domain-isolated vector classes

Non-Functional

  • 100% citation presence
  • <1% unsupported claims
  • P95 latency <12s
  • Monthly cost <$800
  • Recurring queries served from verified cache
Part IV

Proving It Works

Evaluation & Feedback

Constraint #5

RFPs repeat the same questions. Users catch errors the system misses. Both signals should feed back in.

Tradeoff: build speed vs. feedback quality

Optimization

Golden Answer Cache

"Is Fusion EHR HIPAA compliant?" gets asked on every RFP. The answer never changes.

Caching an Answer
User gets RAG response → Light inference: "likely stable answer?" → Modal prompt: "Save for future?" → ✓ Cached (answer + citations). Removable anytime (like ChatGPT memory).
New Query Flow
New query → Cache check: similarity > 0.95?
  • HIT → Serve verified (no LLM, no retrieval)
  • MISS → Full RAG pipeline
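
The cache check can be sketched as a nearest-neighbor lookup over saved query embeddings. This sketch assumes cosine similarity as the measure (the slide only says "similarity > 0.95") and uses a linear scan; `GoldenCache` and its methods are hypothetical names, and the embeddings would come from the platform's embedding model.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class GoldenCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def save(self, embedding: List[float], answer: str) -> None:
        """Store a user-approved answer (with its citations) for reuse."""
        self.entries.append((embedding, answer))

    def remove(self, answer: str) -> None:
        """Cached answers are removable at any time."""
        self.entries = [e for e in self.entries if e[1] != answer]

    def lookup(self, embedding: List[float]) -> Optional[str]:
        """Return the cached answer on a HIT, or None on a MISS."""
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) > self.threshold:
            return best[1]
        return None  # MISS: caller falls through to the full RAG pipeline
```

A `None` result is the signal to run retrieval and generation as usual.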
Evaluation

The Eval Harness

  • 50 real RFP questions + adversarial probes
  • CI/CD triggers on doc updates, model swaps, weekly regression
SAMPLE EVAL RUN
Q: "Does Fusion support HL7 ADT feeds?"
Precision@5: 0.80 | Faithfulness: 1.00 | Citation Acc: 1.00 | Confidence: HIGH
  • 94% Precision@5: 47/50 queries returned correct top-5 chunks
  • 97% Faithfulness: 0 hallucinated claims across 50-question set
  • 100% Citation Acc: every citation links to a real Confluence source
  • 8.2s P95 Latency: 95th percentile response time end-to-end
→ New requirements: Golden answer cache for recurring queries. User feedback loop. Over 90% accuracy target.
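
For reference, the standard per-query definition of Precision@k can be written in a few lines. This assumes the textbook definition (relevant chunks in the top k, divided by k); the eval harness may aggregate differently across the 50-question set. Chunk IDs here are hypothetical.

```python
from typing import List, Set


def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k


# Hypothetical labels: 4 of the 5 retrieved chunks are relevant.
score = precision_at_k(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c4", "c5", "c9"})
```

Under this definition, a per-query score of 0.80, like the sample run's, corresponds to 4 relevant chunks in the top 5.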
Evaluation

Eval Types Matrix

These eval types apply beyond just RFP. They validate any domain deployment.

Eval Type | What It Measures | Applies To
Retrieval Precision@5 | Right chunks in top 5? | All domains
Answer Faithfulness | Every claim supported by source? | All domains
Citation Accuracy | Citations point to real sources? | All domains
Confidence Calibration | High confidence = actually right? | All domains
Refusal Accuracy | Refuses when should, accepts when should? | All domains
Gap Report Quality | Recommendations actionable? | RFP, Support
Cross-Domain Leakage | RBAC prevents bleed? | Multi-domain
Chunking Boundary | Answers split across chunks? | All domains
Requirements v4

Almost there

13 / 15

Functional

  • Citation-backed responses
  • "Insufficient Evidence"
  • Multi-pass retrieval
  • Source authority tiers
  • Hybrid search (vector + BM25 + rerank)
  • Query decomposition
  • Circuit breaker
  • Confidence scoring (rule-based)
  • Configurable model router
  • Idempotent ingestion
  • Golden answer cache for recurring queries
  • Domain-isolated vector classes
  • Feedback loop (thumbs up/down + corrections)
  • Gap tracking (refusals feed backlog)

Non-Functional

  • 100% citation presence
  • <1% unsupported claims
  • P95 latency <12s
  • Monthly cost <$800
  • Recurring queries served from verified cache
  • >90% accuracy on 50-question validation
Part V

How It Scales

Multi-Domain & Future-Proofing

Constraint #6

Multiple departments need grounded intelligence. We can't rebuild for each one.

Tradeoff: flexibility vs. simplicity

Architecture

Designing for Multi-Domain from Day One

governance.yaml → Two-axis taxonomy: Routing (which department) × Content-type (what kind of doc)
Content-type | rag-rfp | rag-legal | rag-csm | rag-support
manual | RFP_Manual | Legal_Manual | CSM_Manual | Supp_Manual
release-notes | RFP_Release | Legal_Release | CSM_Release | Supp_Release
compliance | RFP_Comply | Legal_Comply | CSM_Comply | Supp_Comply
hl7-interfaces | RFP_HL7 | Legal_HL7 | CSM_HL7 | Supp_HL7
  • Two-axis taxonomy: routing x content-type
  • Adding a department = YAML config change
  • RBAC gates who sees what
→ New requirements: Two-axis tag routing. Config-driven domain extensibility. RBAC-scoped access. Soft-delete on tag removal.
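
The two-axis grid can be derived from a small config rather than hand-maintained. In this sketch the `GOVERNANCE` dictionary stands in for governance.yaml (its real schema is not shown in the deck, so the key names are assumptions); the labels and class-name fragments are the ones in the grid above.

```python
from typing import Dict, List, Tuple

# Stand-in for governance.yaml; the key names are assumptions.
GOVERNANCE = {
    "routing": {"rag-rfp": "RFP", "rag-legal": "Legal", "rag-csm": "CSM", "rag-support": "Supp"},
    "content_types": {"manual": "Manual", "release-notes": "Release",
                      "compliance": "Comply", "hl7-interfaces": "HL7"},
}


def class_names(config: dict) -> Dict[Tuple[str, str], str]:
    """Cross the routing axis with the content-type axis to get class names."""
    return {
        (route, ctype): f"{prefix}_{suffix}"
        for route, prefix in config["routing"].items()
        for ctype, suffix in config["content_types"].items()
    }


def classes_for_page(labels: List[str], config: dict) -> List[str]:
    """Every (routing label, content-type label) pair on a page selects one class."""
    names = class_names(config)
    routes = [label for label in labels if label in config["routing"]]
    ctypes = [label for label in labels if label in config["content_types"]]
    return [names[(r, c)] for r in routes for c in ctypes]


targets = classes_for_page(["rag-rfp", "rag-legal", "manual"], GOVERNANCE)
```

Adding a department is then one new entry under `routing`; the extra classes fall out of the cross product.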
Future-Proofing

Swappable Components

Models commoditize. The moat is the harness. Every component behind an abstraction. Upgrades are config changes.

Today → Tomorrow
  • MiniLM (text) → Gemini Embedding v2 (multimodal)
  • Weaviate → Any DB via RetrievalBackend protocol
  • Haiku / Sonnet → Best model tomorrow via model router
  • Cohere Reranker → Any cross-encoder
  • Confluence → .htm manuals, PDFs, future sources
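
A sketch of what a `RetrievalBackend` protocol could look like using Python's structural typing. Only the protocol's name comes from the slide; the `search` signature, the `KeywordBackend` stand-in, and `retrieve` are assumptions made for illustration.

```python
from typing import Dict, List, Protocol, runtime_checkable


@runtime_checkable
class RetrievalBackend(Protocol):
    """Anything with a matching search() method can serve as a backend."""

    def search(self, query: str, collection: str, top_k: int) -> List[dict]:
        ...


class KeywordBackend:
    """Toy in-memory backend: scores documents by query-term occurrences."""

    def __init__(self, docs: Dict[str, List[str]]) -> None:
        self.docs = docs  # {collection name: [document text, ...]}

    def search(self, query: str, collection: str, top_k: int) -> List[dict]:
        terms = query.lower().split()
        scored = []
        for text in self.docs.get(collection, []):
            score = sum(text.lower().count(term) for term in terms)
            if score:
                scored.append({"text": text, "score": score})
        scored.sort(key=lambda hit: -hit["score"])
        return scored[:top_k]


def retrieve(backend: RetrievalBackend, query: str, collection: str, top_k: int = 5) -> List[dict]:
    # The harness depends only on the protocol, never on a concrete database.
    return backend.search(query, collection, top_k)
```

Swapping the vector store then means writing one adapter class that satisfies the protocol, while the harness code stays unchanged.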
Infrastructure

Azure Architecture

  • User (Browser): connects through APIM to Azure services
  • Azure APIM: rate limiting, caching (20-30% hit), auth, RBAC
  • Frontend (Container App): React SPA, domain-specific workflows, DNS'd internally
  • Container App (API): agent harness, query pipeline
  • Container App (Ingestion): webhook listener, chunker
  • Container App (Weaviate): self-hosted on our cloud, native hybrid search
  • Service Bus: dead-letter enabled
  • Blob Storage: canonical snapshots
  • Key Vault + AD: secrets, RBAC
  • Anthropic API: Haiku + Sonnet (external)
Legend: Non-Azure (external) | Azure | User/Client
Cost

What This Costs

Component | Monthly
Container Apps (3 containers, scale-to-zero) | $80 – $150
Weaviate (self-hosted) | $15 – $25
APIM + networking | ~$50
Anthropic API (Haiku + Sonnet) | $100 – $400
Storage + queue | $50 – $130
Total | $295 – $755 /mo
One BD professional's manual research costs more per month than the entire platform.
Results

What We Gained

Revenue Pipeline

  • Q1 2026: biggest by 1.4x
  • +3 deals/quarter potential
  • $6-10M above target
  • More at-bats per year

Operational Improvements

  • SLA adherence improving
  • NPS + eNPS growing
  • Augmented workforce, not chatbot
  • Domain too complex for off-the-shelf

Numbers shared at Fusion Health Town Hall, 3/18/2026
Monthly reporting from the Operations team (Value & Automation division, under which the AI function is housed)

Traceability

Closing the Loop

Requirement | Design Decision
Citation-backed responses | Sonnet generates inline citations from retrieved chunks
Insufficient Evidence | Three-tier confidence scoring, rule-based refusal
Multi-domain routing | Two-axis taxonomy, config-driven Weaviate classes
RBAC | Azure AD + tag-scoped class access
Idempotent ingestion | Content-hash dedup + canonical blob snapshots
P95 <12s | Haiku classifier, APIM caching, hybrid search
Cost <$800/mo | Scale-to-zero containers, two-model split
Gap tracking | Refusal logs → content backlog pipeline
>90% accuracy | 50-question eval harness, built first
Recurring query optimization | Golden answer cache: verified responses bypass RAG pipeline
Catching confident-but-wrong | User feedback loop: thumbs up/down + corrections, weekly review
Optimizations

If I Had More Time

Agentic Canvas Workflows

RFP section building in-platform

Deeper Multimodal

LMS videos, product demos, ticket screenshots as first-class knowledge sources.

Predictive Gap Analysis

Pattern analysis on refusal logs to surface documentation gaps before they're queried.

Teams Integration

MCP bot where work actually happens

Translation

How This Translates to Boltline

Modular Harness Architecture

FedRAMP-ready, evals catch regressions on swap

Azure + Anthropic → AWS Bedrock (FedRAMP)

Work Plans, BOMs, and Part Traceability

Work plans + BOMs as RAG corpus

RFP knowledge retrieval → Work plan + BOM intelligence

Hybrid Search for Exact Matches

Part #s and serial #s need exact match

Drug names, product codes → Part #s, serial #s, assembly IDs

Corpus Quality at Scale

Per-customer class scoping prevents degradation

Domain-isolated classes → Per-customer class scoping
Stoke Space

Thank You

Questions? Let's go deep on anything.

Connor England

Backup: Harnesses

A Note on Harnesses

The moat is the harness, not the model. Two projects shaped this thinking.

Block's Goose

Open-source, LLM-agnostic agent framework. MCP as the sole integration standard. Local-first, no vendor calls. Now in the Linux Foundation alongside MCP itself.

Stripe's Minions

Built on a Goose fork. 1,300+ PRs/week, zero human-written code. Key innovation: "Blueprints" (hybrid deterministic + agentic nodes). ~500 tools available, ~15 curated per task.

Request → Pre-hydrate Context → LLM → Tool Call → Execute Tool → Context Revision → Response

How we applied this

  • Blueprints: confidence + citation = deterministic. Retrieval + generation = agentic.
  • Pre-hydration: context assembled before LLM sees anything.
  • Tool curation: Legal query only sees Legal_* classes. Scoped per domain.
  • MCP: one protocol, any interface. Same brain everywhere.
  • Harness-level RLS: Weaviate has no row-level security. The harness injects collection + content_type filters at query time. Deterministic, fast, configurable per role.
Backup: UX

UX Decisions Under the Hood

Backup: Chunking

Chunking Strategy

Chunk boundaries matter more than chunk size. A well-placed split preserves meaning; a bad one destroys it.
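
A simplified sketch of boundary-aware chunking with overlap. It packs whole paragraphs into chunks and carries the last 50 tokens forward, using whitespace-separated words as a stand-in for tokens. Both simplifications are assumptions: the pipeline described earlier splits on headings first and falls back to sentences, which this sketch does not do, and a single paragraph longer than `max_tokens` is kept whole here.

```python
from typing import List


def chunk_paragraphs(paragraphs: List[str], max_tokens: int = 500, overlap: int = 50) -> List[str]:
    """Pack paragraphs into chunks, never splitting inside a paragraph."""
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in paragraphs:
        words = paragraph.split()  # assumption: one word approximates one token
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting only at paragraph boundaries keeps each chunk's sentences intact, and the overlap gives the next chunk enough context to stand alone.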
Backup: Data Quality Deep Dive

The RAG Quality Cliff

Two independent studies confirm the same pattern: vector search accuracy degrades meaningfully as corpus size grows, even with good embeddings.

EyeLevel Research, 2024

Tested Pinecone + OpenAI ada-002 embeddings across 1K, 10K, and 100K pages. 310 real answer-bearing pages + filler. 92 test questions, human-evaluated.

  • 1K pages: baseline
  • 10K pages: visible drop
  • 100K pages: –12%

Expected degradation: 10-12% per 100K pages. A "page" is a full document page (PDF, HTML), not a chunk.

Sarkar (TDS), Jan 2026

Focused on HNSW algorithm behavior. With fixed index parameters (M=32, ef_search=128), recall degrades as vector count scales.

  • 10K vectors: 99%
  • 1M vectors: ~92%
  • 10M vectors: 85%

A "vector" is a single chunk embedding. 46K Confluence pages at ~10-20 chunks each ≈ 460K-920K vectors, the same order of magnitude as the ~600K-1M monolithic estimate.

Takeaways

  • Noise ratio drives degradation. More irrelevant content = harder to surface the right chunks.
  • Mitigations compound: class isolation + authority tiers + hybrid search + reranking.
  • "Ingest everything" has a ceiling. Class decomposition delays it indefinitely.
  • For Boltline: multi-tenant customers with large multimodal data need class-scoped search from day one.

EyeLevel, "Do Vector Databases Lose Accuracy at Scale?", 2024   |   Sarkar, "HNSW at Scale," TDS, Jan 2026