Fusion Intel
Building a Knowledge Engine from First Principles
Connor England
Please do interrupt with questions!
About Me
Current: Applied AI Developer & Solutions Architect, Fusion Health
Before: Co-founded SmartSave (fintech, USTE processing). Built Pidgeon Health (open-source C# healthcare data platform, used at Fusion and beyond)
Early Career: QA Lead, 3 food manufacturing plants (HACCP/SQF)
Certs: AWS Solutions Architect, CompTIA Security+, DeepLearning.AI Data Engineering
Flight: Lifelong love for aviation and aerospace. Currently working on PPL/IR.
Manufacturing QA → Entrepreneurship → Healthcare AI → (hopefully) aerospace.
Overview
What We'll Cover
This talk is about
Fusion Intel: multi-domain RAG platform (designed + 80% coded by me)
Modular AI development that absorbs improvements fast
Lessons learned, applied to Boltline
This talk will
Derive requirements from domain constraints
Walk through architecture and implementation
Share strong opinions, loosely held
Get into the nitty gritty!
Part I
Overview
The Problem, The Principles, The Constraints
Context
What Is Fusion Intel?
Accurate, cited answers from our own documentation, powered by AI.
The problem with general AI
General AI guesses. Dangerous in our domain.
What RAG changes
Search our docs first. Every claim traceable to source.
Intelligent Search
Keyword + semantic across thousands of pages
AI-Powered Answers
Grounded responses with citations
Multi-Domain
One platform, curated per department
Correctness over helpfulness. Cite, flag, or refuse.
The Problem
Why Build This?
First: RFP Bottleneck
RFPs are the lifeblood of growth
40-80 hrs per RFP
Clear first safe use case
Next: Cross-Domain Intelligence Multiplier
Same knowledge bases serve all departments
Build once, scale from first principles
Build vs. Buy
Rovo (Atlassian's AI) tested and rejected. Couldn't enforce citations, hallucinated frequently.
Team Usability
Technically sound but usable for BD and comms professionals, not just engineers.
Compliance
Teams already using ChatGPT internally with zero guardrails. Hallucinations going into client-facing materials.
Framing
Constraints & Opportunities
Domain Constraints
Multi-department, one platform
HIPAA: full data control, no RAG-in-a-box
Docs scattered across Confluence + OneDrive
Cross-domain sharing with clear boundaries
Technical Constraints
RAG degrades at scale (46K pages, 131K tickets)
Vector search misses exact terms
LLMs are sycophantic: guess over refuse
AI moves fast: every component swappable
Part II
How It Works
Core Mechanics
Constraint #1
A wrong RFP claim can cost millions in failed contracts. If we can't cite the source, we can't use the answer.
Tradeoff: helpfulness vs. honesty
The Governing Principle
Principle 1: "Tell the truth, or at least don't lie."
Principle 2: "Log and lead."
Query: user asks a question
↓
Pass 1: Weaviate Hybrid Search (vector + BM25 + rerank)
Weighted score >= 5? (Gold 3x + Silver 2x + Ref 1x)
YES (HIGH) → Generate
NO → Pass 2
Pass 2: Confluence MCP (CQL search of key spaces)
Found it?
YES → Generate: cited response. Tag source if from Pass 2.
NO → Pass 3
Pass 3: Intelligence Advisory
Rich audit trail: what was searched, what's missing
SME contacts + suggested questions + gap report
"Insufficient Evidence" with actionable next steps
A confidently wrong answer costs more than "I need more information."
→ New requirement: Citation-backed responses. ‘Insufficient Evidence’ as a first-class output. Source authority tiers.
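The three-pass flow above can be sketched roughly as follows. This is a minimal illustration, not the production harness: `Chunk`, `Outcome`, `answer_query`, and the two search callables are hypothetical names, and the real passes call Weaviate and the Confluence MCP rather than stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, List

TIER_WEIGHTS = {"gold": 3, "silver": 2, "ref": 1}  # Gold 3x, Silver 2x, Ref 1x
HIGH_THRESHOLD = 5  # weighted score >= 5 lets Pass 1 answer directly


@dataclass
class Chunk:
    source_id: str
    tier: str  # "gold", "silver", or "ref"
    text: str


@dataclass
class Outcome:
    status: str  # "answer" or "insufficient_evidence"
    chunks: List[Chunk] = field(default_factory=list)
    source_pass: int = 0  # which pass ended the flow
    searched: List[str] = field(default_factory=list)  # audit trail


def weighted_score(chunks: List[Chunk]) -> int:
    """Sum authority weights; unknown tiers contribute nothing."""
    return sum(TIER_WEIGHTS.get(c.tier, 0) for c in chunks)


def answer_query(
    query: str,
    hybrid_search: Callable[[str], List[Chunk]],
    confluence_search: Callable[[str], List[Chunk]],
) -> Outcome:
    # Pass 1: hybrid search over the vector store.
    searched = ["pass1:hybrid"]
    chunks = hybrid_search(query)
    if weighted_score(chunks) >= HIGH_THRESHOLD:
        return Outcome("answer", chunks, 1, searched)

    # Pass 2: fall back to a live Confluence search.
    searched.append("pass2:confluence")
    found = confluence_search(query)
    if found:
        return Outcome("answer", found, 2, searched)

    # Pass 3: refuse, but keep the audit trail for the advisory.
    return Outcome("insufficient_evidence", [], 3, searched)
```

The `searched` list is what would feed the advisory's "what was searched" trail; generation itself is left out of the sketch.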
Requirements v1
What we know so far
Functional
Citation-backed responses
"Insufficient Evidence" as a first-class output
Multi-pass retrieval with fallback advisory
Source authority tiers (Gold 3x, Silver 2x, Ref 1x)
Non-Functional
100% citation presence on answered queries
<1% unsupported claims
How It Works
The RAG Flow
1. Tag
Label a Confluence page or Jira ticket
→
2. Process
Webhook fires, chunks and embeds content
→
3. Store
Vectors stored in Weaviate
→
4. Retrieve
Hybrid search finds relevant chunks
→
5. Answer
Cited response from our docs
Without RAG
Guesses without sources.
vs.
With RAG
Cited, verifiable, grounded.
Architecture
System Architecture
Presentation Layer
RFP UI
Legal UI
CSM UI
MCP Clients
REST / GraphQL / MCP
↓
Agent Harness
Classify
→
Route
→
Retrieve
→
Rerank
→
Score
→
Generate
→
Cite
[Haiku] [Sonnet]
↓
Storage
Weaviate:
RFP
Legal
Support
CSM
General
Hybrid (Vector + BM25) + Cross-encoder
↓
Ingestion
Confluence
→
Webhook
→
Service Bus
→
Tag Route
→
Snapshot
→
Chunk
→
Embed
→
Upsert
The harness enforces RBAC at query time. An RFP query gets collection='RFP', content_type_filter=['manual','release_notes']. The model never sees Legal or Support content.
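A minimal sketch of how a harness could resolve that scope before any retrieval runs. The `QueryScope` type, the `ROLE_SCOPES` table, and `scope_for` are hypothetical; the RFP and Legal values are taken from the slides, and the role keys are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class QueryScope:
    collection: str
    content_type_filter: Tuple[str, ...]


# Hypothetical role table; in practice this would come from config.
ROLE_SCOPES: Dict[str, QueryScope] = {
    "rfp": QueryScope("RFP", ("manual", "release_notes")),
    "legal": QueryScope("Legal", ("contract", "soc2", "compliance")),
}


def scope_for(role: str) -> QueryScope:
    """Return the only collection and content types this role may search."""
    try:
        return ROLE_SCOPES[role]
    except KeyError:
        # Fail closed: an unknown role gets no retrieval scope at all.
        raise PermissionError(f"no retrieval scope for role {role!r}") from None
```

Because the scope is resolved deterministically and outside the model, the LLM only ever receives chunks from the returned collection.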
Constraint #2
Users search by meaning and by exact terms (release versions, compliance codes, prescription identifiers). The retrieval layer needs to handle both well.
Tradeoff: precision vs. cost
Retrieval
Hybrid Search
Vector Search (semantic similarity) + BM25 Keyword → RRF + Rerank (cross-encoder top-5)
Vector Search
Q: "Does Fusion support barcode scanning?"
[0.31, -0.82, 0.15, ...]
Matched: "...medication verification process using scanning technology..."
[0.29, -0.79, 0.18, ...]
similarity: 0.84
BM25 Keyword
Q: "barcode scanning"
Matched: "The barcode scanning module allows technicians to..."
BM25: 12.4
RRF + Rerank
Merged: Chunk A (vector rank #3, BM25 rank #1) → RRF score
Cross-encoder reranker re-scores all candidates using the full query-chunk pair.
Top 5 chunks selected by relevance, not just keyword or semantic match alone.
Neither search alone is enough. Hybrid + cross-encoder reranking: +15-20% precision.
→ New requirement: Hybrid retrieval (vector + BM25 + reranking). P95 latency under 12 seconds.
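For reference, reciprocal rank fusion (RRF) merges ranked lists by summing 1 / (k + rank) per chunk. The sketch below is a generic version, assuming the commonly used constant k = 60; it is not Weaviate's internal implementation, and the chunk IDs are made up to mirror the "vector rank #3, BM25 rank #1" example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk IDs; higher fused score sorts first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Break ties by chunk ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_ranking = ["B", "C", "A"]  # chunk A is vector rank #3
bm25_ranking = ["A", "D"]         # chunk A is BM25 rank #1
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Chunk A ends up first because it appears in both lists; a cross-encoder would then re-score the fused candidates before the top 5 are kept.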
LLM Strategy
Model Strategy
Query
→
Haiku: classify/route
→
Retrieve
→
Rerank
→
Sonnet: generate
→
Confidence check
→
Response
Haiku: Classifier ~$0.01/q
Routes queries to domain, decomposes compound questions. Structured JSON output.
Cohere: Reranker ~$0.003/q
Cross-encoder rescores top 15 → top 5. +15-20% precision. Evaluated in the eval harness.
Sonnet: Generator ~$0.15/q
Generates cited responses from reranked chunks. Every claim traces to source.
→ New requirements: Query decomposition. Circuit breaker. Configurable model router. Monthly cost under $800.
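A quick back-of-the-envelope check on the per-query figures above. The three prices are the approximate ones from the slide; the budget passed in is a hypothetical parameter, and infrastructure costs are ignored.

```python
from decimal import Decimal

# Approximate per-query costs quoted on the slide.
COST_PER_QUERY = {
    "haiku_classifier": Decimal("0.01"),
    "cohere_reranker": Decimal("0.003"),
    "sonnet_generator": Decimal("0.15"),
}


def cost_per_query() -> Decimal:
    """Total model spend for one fully answered query."""
    return sum(COST_PER_QUERY.values(), Decimal("0"))


def max_queries(monthly_budget: Decimal) -> int:
    """How many full queries fit in a given model budget."""
    return int(monthly_budget / cost_per_query())
```

At roughly $0.163 per query, the generator dominates the bill, which is why routing and caching ahead of Sonnet matter for the monthly cap.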
Constraint #3
Sometimes the evidence just isn't there. That's signal, not a dead end, as long as the system tells the truth instead of guessing.
Tradeoff: honesty vs. sycophancy
Confidence
Confidence Scoring
✓
HIGH
Gold sources, high relevance
?
MEDIUM
Partial match, verify
✗
INSUFFICIENT
No evidence, system refuses
EXAMPLE
Q: "Does Fusion EHR support electronic prescribing?"
Retrieved: 2 gold-tier sources (user manual ch.12, compliance doc §4.2) + 1 silver-tier (implementation guide §7), relevance 0.89
→ Confidence: HIGH | Weighted score: (2 × 3) + (1 × 2) = 8 > threshold of 5
Gold 3x + Gold 3x + Silver 2x = 8 > threshold (5) → HIGH
→ New requirements: Rule-based confidence scoring. Weighted authority tiers drive trust level.
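The worked example can be reproduced with a few lines of rule-based scoring. This is a sketch under assumptions: the slides only define the HIGH threshold, so the MEDIUM rule here (some evidence, but below threshold) is a guess, and relevance scores are left out.

```python
from typing import List, Tuple

TIER_WEIGHTS = {"gold": 3, "silver": 2, "ref": 1}  # authority tiers
HIGH_THRESHOLD = 5


def confidence(source_tiers: List[str]) -> Tuple[str, int]:
    """Map the tiers of retrieved sources to a confidence label and score."""
    score = sum(TIER_WEIGHTS[tier] for tier in source_tiers)
    if score >= HIGH_THRESHOLD:
        return "HIGH", score
    if score > 0:
        # Assumption: any evidence below the threshold counts as MEDIUM.
        return "MEDIUM", score
    return "INSUFFICIENT", score


label, score = confidence(["gold", "gold", "silver"])  # ("HIGH", 8)
```

Keeping this as plain rules rather than a model call means the trust level is reproducible and auditable.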
Refusal
Teaching the System to Refuse
LLMs guess to be helpful. Harness blocks generation below threshold.
Every refusal logged with search context.
Gap tracker: most valuable output. Prioritized list of documentation holes.
→
82 answered with citations
→
18 insufficient evidence logged
→
Gap Report: prioritized backlog
→ New requirement: Harness-enforced refusal fidelity. Every refusal logged with search context.
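One way refusal logs could roll up into a prioritized gap report. The `RefusalLog` fields and the idea of a per-refusal `topic` label are assumptions for illustration; the slides only state that refusals are logged with search context.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RefusalLog:
    query: str
    topic: str  # hypothetical label, e.g. assigned by the classifier
    searched: Tuple[str, ...]  # which passes ran before refusing


def gap_report(refusals: List[RefusalLog]) -> List[Tuple[str, int]]:
    """Rank topics by how often the system had to refuse on them."""
    counts = Counter(r.topic for r in refusals)
    # Most-refused topics first; alphabetical within ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The top of that list is the documentation backlog: the topics people keep asking about that the corpus cannot yet support.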
Requirements v2
The list grows
Functional
Citation-backed responses
"Insufficient Evidence"
Multi-pass retrieval
Source authority tiers
Hybrid search (vector + BM25 + rerank)
Query decomposition
Circuit breaker
Confidence scoring (rule-based)
Configurable model router
Non-Functional
100% citation presence
<1% unsupported claims
P95 latency <12s
Monthly cost <$800
Part III
Data Pipeline & Quality
Ingestion, Scale, and Optimization
Constraint #4
46K Confluence pages and 131K Jira tickets. The system has to stay current without drowning in noise.
Tradeoff: volume vs. quality
Ingestion
The Ingestion Pipeline
1 Webhook
Confluence page created/updated triggers an Atlassian webhook. Event-driven means zero polling overhead.
2 Queue
Service Bus absorbs bursts, retries failures, dead-letters poison events. Decouples ingestion from source.
3 Tag Route
Labels map to taxonomy axes. Determines which Weaviate class(es) receive the content. Removals trigger soft-delete.
4 Snapshot
Canonical version stored in Blob Storage. Vector DB is derived, never authoritative. Always rebuildable.
5 Chunk
Semantic chunking: heading boundaries first, then paragraphs, then sentences. 300–500 tokens with 50-token overlap.
6 Embed
Content-hash deduplication ensures idempotency. Embedding model generates vectors for each chunk.
7 Upsert
Vectors land in the correct Weaviate class with full metadata: source ID, heading path, authority tier, timestamps.
Canonical snapshots: always rebuildable, never re-crawl.
→ New requirement: Idempotent ingestion. Canonical snapshots for rebuildability.
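Content-hash deduplication is what makes replayed webhook events harmless. A minimal sketch, assuming SHA-256 over whitespace-normalized text; `ChunkStore` is a hypothetical in-memory stand-in for the embed-and-upsert step, not the real pipeline.

```python
import hashlib
from typing import Dict


def content_hash(text: str) -> str:
    """Stable key for a chunk; whitespace differences do not change it."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ChunkStore:
    """In-memory stand-in for the vector store."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}
        self.embed_calls = 0  # counts how often we would pay for embedding

    def upsert(self, chunk_text: str) -> bool:
        """Embed and store only unseen content. Returns True if stored."""
        key = content_hash(chunk_text)
        if key in self._by_hash:
            return False  # duplicate delivery: nothing to do
        self.embed_calls += 1
        self._by_hash[key] = chunk_text
        return True
```

Processing the same page twice leaves the store unchanged, so Service Bus retries cannot create duplicate vectors.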
Data Quality
The Laffer Curve of RAG
Recall drops from 99% at 10K vectors to 85% at 10M.‡ Class isolation keeps us in the safe zone.
[Chart: Retrieval Recall vs. Vector Count (chunks indexed). X-axis: 10K, 100K, 1M, 10M. Recall falls from 99% at 10K to 85% at 10M, with 92% marked in between. Fusion (monolithic): ~600K-1M vectors. Fusion (per-class): ~2K-5K vectors.]
"Ingest everything" = noise wins
Our defense: 2K-5K vectors per query, not 600K+
→ New requirement: Domain-isolated vector classes. Curated search scope per query.
† HNSW (Hierarchical Navigable Small World) is the approximate nearest-neighbor algorithm most vector databases use to find similar chunks quickly.
‡ Sarkar, "HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows," Towards Data Science, Jan 2026
Architecture
Document Curation & Class Isolation
One collection per domain. Tags route content in. The harness filters by content_type at query time.
Source
Confluence Page
rag-rfp-manual
rag-legal-contract
→
ingest
Weaviate Vector Store (1 collection per domain)
RFP
manual ✓
release_notes
past_rfp
Legal
contract ✓
soc2
compliance
General (all teams)
product_specs
brand_spine
Dual-tagged pages get ingested into both collections. The harness filters by content_type at query time. No row-level security needed in Weaviate.
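A sketch of how dual tags might fan out to collections. The `rag-<domain>-<content_type>` label shape is inferred from the two example labels on the slide; `route_labels` and the parsing rule are assumptions.

```python
from typing import Iterable, List, Tuple

LABEL_PREFIX = "rag-"


def route_labels(labels: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn page labels into (domain, content_type) ingestion routes."""
    routes: List[Tuple[str, str]] = []
    for label in labels:
        if not label.startswith(LABEL_PREFIX):
            continue  # ordinary Confluence label, not a routing tag
        parts = label[len(LABEL_PREFIX):].split("-", 1)
        if len(parts) != 2 or not all(parts):
            continue  # malformed tag: skip rather than guess
        domain, content_type = parts
        routes.append((domain, content_type))
    return routes


routes = route_labels(["rag-rfp-manual", "rag-legal-contract", "team-notes"])
```

A page carrying both tags yields two routes, so it is ingested once per collection and filtered by content type at query time.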
Requirements v3
Growing stronger
Functional
Citation-backed responses
"Insufficient Evidence"
Multi-pass retrieval
Source authority tiers
Hybrid search (vector + BM25 + rerank)
Query decomposition
Circuit breaker
Confidence scoring (rule-based)
Configurable model router
Idempotent ingestion
Golden answer cache for recurring queries
Domain-isolated vector classes
Non-Functional
100% citation presence
<1% unsupported claims
P95 latency <12s
Monthly cost <$800
Recurring queries served from verified cache
Part IV
Proving It Works
Evaluation & Feedback
Constraint #5
RFPs repeat the same questions. Users catch errors the system misses. Both signals should feed back in.
Tradeoff: build speed vs. feedback quality
Optimization
Golden Answer Cache
"Is Fusion EHR HIPAA compliant?" gets asked on every RFP. The answer never changes.
Caching an Answer
User gets RAG response
→
Light inference: "likely stable answer?"
→
Modal prompt: "Save for future?"
→
✓ Cached: answer + citations
Removable anytime (like ChatGPT memory)
New Query Flow
New query
→
Cache check: similarity > 0.95?
HIT →
Serve verified (no LLM, no retrieval)
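The cache check can be sketched as a cosine-similarity lookup over stored query embeddings. This assumes the similarity in question is cosine similarity on embeddings, which the slides do not state; `GoldenAnswerCache` and its methods are hypothetical names, and a real version would store citations alongside the answer.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class GoldenAnswerCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[Sequence[float], str]] = []

    def save(self, embedding: Sequence[float], answer: str) -> None:
        """Store a user-verified answer keyed by its query embedding."""
        self._entries.append((embedding, answer))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the closest verified answer above threshold, else None."""
        best_answer, best_sim = None, self.threshold
        for stored, answer in self._entries:
            sim = cosine(embedding, stored)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer
```

On a miss (`None`), the query falls through to the normal retrieval pipeline.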
Evaluation
The Eval Harness
50 real RFP questions + adversarial probes
CI/CD triggers on doc updates, model swaps, weekly regression
SAMPLE EVAL RUN
Q: "Does Fusion support HL7 ADT feeds?"
Precision@5: 0.80
Faithfulness: 1.00
Citation Acc: 1.00
Confidence: HIGH
47/50 queries returned correct top-5 chunks
0 hallucinated claims across 50-question set
Every citation links to a real Confluence source
95th percentile response time end-to-end
→ New requirements: Golden answer cache for recurring queries. User feedback loop. Over 90% accuracy target.
Evaluation
Eval Types Matrix
These eval types apply beyond just RFP. They validate any domain deployment.
Eval Type | What It Measures | Applies To
Retrieval Precision@5 | Right chunks in top 5? | All domains
Answer Faithfulness | Every claim supported by source? | All domains
Citation Accuracy | Citations point to real sources? | All domains
Confidence Calibration | High confidence = actually right? | All domains
Refusal Accuracy | Refuses when should, accepts when should? | All domains
Gap Report Quality | Recommendations actionable? | RFP, Support
Cross-Domain Leakage | RBAC prevents bleed? | Multi-domain
Chunking Boundary | Answers split across chunks? | All domains
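As a concrete reference for the first row, Precision@k is the fraction of the top-k retrieved chunks that are actually relevant. A minimal sketch with made-up chunk IDs; `precision_at_k` is not taken from the eval harness itself.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Share of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


# Four of the top five chunks are relevant, as in the sample run's 0.80.
p_at_5 = precision_at_k(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c3", "c5"})
```

Dividing by k rather than by the number returned penalizes a retriever that comes back with fewer than k chunks.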
Requirements v4
Almost there
Functional
Citation-backed responses
"Insufficient Evidence"
Multi-pass retrieval
Source authority tiers
Hybrid search (vector + BM25 + rerank)
Query decomposition
Circuit breaker
Confidence scoring (rule-based)
Configurable model router
Idempotent ingestion
Golden answer cache for recurring queries
Domain-isolated vector classes
Feedback loop (thumbs up/down + corrections)
Gap tracking (refusals feed backlog)
Non-Functional
100% citation presence
<1% unsupported claims
P95 latency <12s
Monthly cost <$800
Recurring queries served from verified cache
>90% accuracy on 50-question validation
Part V
How It Scales
Multi-Domain & Future-Proofing
Constraint #6
Multiple departments need grounded intelligence. We can't rebuild for each one.
Tradeoff: flexibility vs. simplicity
Architecture
Designing for Multi-Domain from Day One
governance.yaml → Two-axis taxonomy: Routing (which department) × Content-type (what kind of doc)
RFP_Manual | Legal_Manual | CSM_Manual | Supp_Manual
RFP_Release | Legal_Release | CSM_Release | Supp_Release
RFP_Comply | Legal_Comply | CSM_Comply | Supp_Comply
RFP_HL7 | Legal_HL7 | CSM_HL7 | Supp_HL7
Two-axis taxonomy: routing x content-type
Adding a department = YAML config change
RBAC gates who sees what
→ New requirements: Two-axis tag routing. Config-driven domain extensibility. RBAC-scoped access. Soft-delete on tag removal.
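The grid above is just the cross product of the two axes. A sketch, with the config shown as a Python dict standing in for governance.yaml; the key names `routing` and `content_type` are assumptions about that file's shape.

```python
from itertools import product
from typing import Dict, List

# Stand-in for governance.yaml; key names are hypothetical.
GOVERNANCE: Dict[str, List[str]] = {
    "routing": ["RFP", "Legal", "CSM", "Supp"],
    "content_type": ["Manual", "Release", "Comply", "HL7"],
}


def class_names(config: Dict[str, List[str]]) -> List[str]:
    """Derive one class name per (department, content type) pair."""
    return [
        f"{department}_{content_type}"
        for content_type, department in product(
            config["content_type"], config["routing"]
        )
    ]


names = class_names(GOVERNANCE)  # 16 names, starting with "RFP_Manual"
```

Adding a department means appending one entry to `routing`; the new row of classes falls out of the product with no code change.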
Future-Proofing
Swappable Components
Models commoditize. The moat is the harness. Every component behind an abstraction. Upgrades are config changes.
Today → Tomorrow
MiniLM (text)
→
Gemini Embedding v2 (multimodal)
Weaviate
→
Any DB via RetrievalBackend protocol
Haiku / Sonnet
→
Best model tomorrow via model router
Cohere Reranker
→
Any cross-encoder
Confluence
→
.htm manuals, PDFs, future sources
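A sketch of what "any DB via RetrievalBackend protocol" could look like with Python's `typing.Protocol`. Only the name `RetrievalBackend` comes from the slide; the `search` signature and the toy `InMemoryBackend` are assumptions for illustration.

```python
from typing import Dict, List, Protocol, Sequence


class RetrievalBackend(Protocol):
    """Anything with a matching search() can serve as the vector store."""

    def search(
        self, query: str, collection: str, content_types: Sequence[str], limit: int
    ) -> List[Dict[str, str]]: ...


class InMemoryBackend:
    """Toy backend: keyword match over a list of dicts."""

    def __init__(self, docs: List[Dict[str, str]]) -> None:
        self._docs = docs

    def search(
        self, query: str, collection: str, content_types: Sequence[str], limit: int
    ) -> List[Dict[str, str]]:
        terms = query.lower().split()
        hits = [
            d for d in self._docs
            if d["collection"] == collection
            and d["content_type"] in content_types
            and any(t in d["text"].lower() for t in terms)
        ]
        return hits[:limit]


def retrieve(
    backend: RetrievalBackend,
    query: str,
    collection: str,
    content_types: Sequence[str],
    limit: int = 5,
) -> List[Dict[str, str]]:
    # The harness depends only on the protocol, never on a concrete store.
    return backend.search(query, collection, content_types, limit)
```

Swapping Weaviate for another store would then mean writing one adapter class that satisfies the protocol, with the eval harness catching regressions.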
Infrastructure
Azure Architecture
User (Browser)
Connects through APIM to Azure services
↓
Azure APIM
Rate limiting • caching (20-30% hit) • auth • RBAC
Frontend (Container App)
React SPA, domain-specific workflows, DNS'd internally
↓
Container App: API
Agent harness, query pipeline
Container App: Ingestion
Webhook listener, chunker
Container App: Weaviate
Self-hosted on our cloud, native hybrid search
↓
Service Bus
Dead-letter enabled
Blob Storage
Canonical snapshots
Key Vault + AD
Secrets • RBAC
Anthropic API
Haiku + Sonnet (external)
Non-Azure (external)
Azure
User/Client
Container Apps (scale-to-zero). Self-hosted Weaviate (data sovereignty). Service Bus (durable messaging).
Cost
What This Costs
Component | Monthly
Container Apps (3 containers, scale-to-zero) | $80 – $150
Weaviate (self-hosted) | $15 – $25
APIM + networking | ~$50
Anthropic API (Haiku + Sonnet) | $100 – $400
Storage + queue | $50 – $130
Total | $295 – $755 /mo
One BD professional's manual research costs more per month than the entire platform.
Results
What We Gained
Revenue Pipeline†
Q1 2026: biggest by 1.4x
+3 deals/quarter potential
$6-10M above target
More at-bats per year
Operational Improvements‡
SLA adherence improving
NPS + eNPS growing
Augmented workforce, not chatbot
Domain too complex for off-the-shelf
† Numbers shared at Fusion Health Town Hall, 3/18/2026
‡ Monthly reporting from the Operations team (Value & Automation division, under which the AI function is housed)
Traceability
Closing the Loop
Requirement | Design Decision
Citation-backed responses | Sonnet generates inline citations from retrieved chunks
Insufficient Evidence | Three-tier confidence scoring, rule-based refusal
Multi-domain routing | Two-axis taxonomy, config-driven Weaviate classes
RBAC | Azure AD + tag-scoped class access
Idempotent ingestion | Content-hash dedup + canonical blob snapshots
P95 <12s | Haiku classifier, APIM caching, hybrid search
Cost <$800/mo | Scale-to-zero containers, two-model split
Gap tracking | Refusal logs → content backlog pipeline
>90% accuracy | 50-question eval harness, built first
Recurring query optimization | Golden answer cache: verified responses bypass RAG pipeline
Catching confident-but-wrong | User feedback loop: thumbs up/down + corrections, weekly review
Optimizations
If I Had More Time
Agentic Canvas Workflows
RFP section building in-platform
Deeper Multimodal
LMS videos, product demos, ticket screenshots as first-class knowledge sources.
Predictive Gap Analysis
Pattern analysis on refusal logs to surface documentation gaps before they're queried.
Teams Integration
MCP bot where work actually happens
Translation
How This Translates to Boltline
Modular Harness Architecture
FedRAMP-ready, evals catch regressions on swap
Azure + Anthropic
→
AWS Bedrock (FedRAMP)
Work Plans, BOMs, and Part Traceability
Work plans + BOMs as RAG corpus
RFP knowledge retrieval
→
Work plan + BOM intelligence
Hybrid Search for Exact Matches
Part #s and serial #s need exact match
Drug names, product codes
→
Part #s, serial #s, assembly IDs
Corpus Quality at Scale
Per-customer class scoping prevents degradation
Domain-isolated classes
→
Per-customer class scoping
Thank You
Questions? Let's go deep on anything.
Connor England
Backup: Harnesses
A Note on Harnesses
The moat is the harness, not the model. Two projects shaped this thinking.
Block's Goose
Open-source, LLM-agnostic agent framework. MCP as the sole integration standard. Local-first, no vendor calls. Now in the Linux Foundation alongside MCP itself.
Stripe's Minions
Built on a Goose fork. 1,300+ PRs/week, zero human-written code. Key innovation: "Blueprints" (hybrid deterministic + agentic nodes). ~500 tools available, ~15 curated per task.
Request
→
Pre-hydrate Context
→
LLM Tool Call
→
Execute Tool
→
Context Revision
→
Response
How we applied this
Blueprints: confidence + citation = deterministic. Retrieval + generation = agentic.
Pre-hydration: context assembled before LLM sees anything.
Tool curation: Legal query only sees Legal_* classes. Scoped per domain.
MCP: one protocol, any interface. Same brain everywhere.
Harness-level RLS: Weaviate has no row-level security. The harness injects collection + content_type filters at query time. Deterministic, fast, configurable per role.
Backup: UX
UX Decisions Under the Hood
Streaming responses. Users see generation in real-time, reducing perceived latency by ~60%.
Citations rendered as clickable links to source Confluence pages, with excerpt previews on hover.
Confidence badge shown prominently. Users learn to trust the system because it tells them when it's unsure.
Query history with re-ask capability. Refine without retyping.
Admin view shows the full retrieval trace: what was searched, what was retrieved, what was filtered, what was cited.
Backup: Chunking
Chunking Strategy
Semantic chunking: split on heading boundaries first, then paragraph breaks, then sentence boundaries.
Target: 300–500 tokens per chunk. Enough context to be useful, small enough for precise retrieval.
Overlap: 50-token sliding window between chunks to preserve context at boundaries.
Metadata preserved per chunk: source page ID, heading hierarchy, authority tier, last-modified date, content-type tag.
Chunk boundaries matter more than chunk size. A well-placed split preserves meaning; a bad one destroys it.
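The overlap mechanics can be illustrated with a plain sliding window. This is deliberately simpler than the semantic chunker described above: it ignores heading, paragraph, and sentence boundaries, and treats whitespace-split words as a stand-in for model tokens.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], max_tokens: int = 500, overlap: int = 50
) -> List[List[str]]:
    """Fixed-size windows where each chunk repeats the previous chunk's tail."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    chunks: List[List[str]] = []
    step = max_tokens - overlap
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

In a boundary-aware version, the same window would run inside each heading section, so an overlap never bridges two unrelated topics.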
Backup: Data Quality Deep Dive
The RAG Quality Cliff
Two independent studies confirm the same pattern: vector search accuracy degrades meaningfully as corpus size grows, even with good embeddings.
EyeLevel Research, 2024
Tested Pinecone + OpenAI ada-002 embeddings across 1K, 10K, and 100K pages. 310 real answer-bearing pages + filler. 92 test questions, human-evaluated.
Expected degradation: 10-12% per 100K pages. A "page" is a full document page (PDF, HTML), not a chunk.
Sarkar (TDS), Jan 2026
Focused on HNSW algorithm behavior. With fixed index parameters (M=32, ef_search=128), recall degrades as vector count scales.
A "vector" is a single chunk embedding. 46K Confluence pages at ~10-20 chunks each ≈ 460K-920K vectors, the same order of magnitude as the ~600K-1M estimate used earlier.
Takeaways
Noise ratio drives degradation. More irrelevant content = harder to surface the right chunks.
Mitigations compound: class isolation + authority tiers + hybrid search + reranking.
"Ingest everything" has a ceiling. Class decomposition delays it indefinitely.
For Boltline: multi-tenant customers with large multimodal data need class-scoped search from day one.
EyeLevel, "Do Vector Databases Lose Accuracy at Scale?", 2024
Sarkar, "HNSW at Scale," TDS, Jan 2026