Fusion Intel

Building a Knowledge Engine from First Principles

Connor England

Please do interrupt with questions!

About Me

Current: Applied AI Developer & Solutions Architect, Fusion Health
Before: Co-founded SmartSave (fintech, USTE processing). Built Pidgeon Health (open-source C# healthcare data platform, used at Fusion and beyond)
Early Career: QA Lead, 3 food manufacturing plants (HACCP/SQF)
Certs: AWS Solutions Architect, CompTIA Security+, DeepLearning.AI Data Engineering
Flight: Lifelong love for aviation and aerospace. Currently working on PPL/IR.

Manufacturing QA → Entrepreneurship → Healthcare AI → (hopefully) aerospace.

Overview

What We'll Cover

This talk is about

Fusion Intel: multi-domain RAG platform (designed + 80% coded by me)
Modular AI development that absorbs improvements fast
Lessons learned, applied to Boltline

This talk will

Derive requirements from domain constraints
Walk through architecture and implementation
Share strong opinions, loosely held
Get into the nitty gritty!

Part I

Overview

The Problem, The Principles, The Constraints

Context

What Is Fusion Intel?

Accurate, cited answers from our own documentation, powered by AI.

The problem with general AI

General AI guesses. Dangerous in our domain.

What RAG changes

Search our docs first. Every claim traceable to source.

Intelligent Search

Keyword + semantic across thousands of pages

AI-Powered Answers

Grounded responses with citations

Multi-Domain

One platform, curated per department

Correctness over helpfulness. Cite, flag, or refuse.

The Problem

Why Build This?

First: RFP Bottleneck

RFPs are the bloodflow of growth
40-80 hrs per RFP
Clear first safe use case

Next: Cross-Domain Intelligence Multiplier

Same knowledge bases serve all departments
Build once, scale from first principles

Build vs. Buy

Rovo (Atlassian's AI) tested and rejected. Couldn't enforce citations, hallucinated frequently.

Team Usability

Technically sound but usable for BD and comms professionals, not just engineers.

Compliance

Teams already using ChatGPT internally with zero guardrails. Hallucinations going into client-facing materials.

Framing

Constraints & Opportunities

Domain Constraints

Multi-department, one platform
HIPAA: full data control, no RAG-in-a-box
Docs scattered across Confluence + OneDrive
Cross-domain sharing with clear boundaries

Technical Constraints

RAG degrades at scale (46K pages, 131K tickets)
Vector search misses exact terms
LLMs are sycophantic: guess over refuse
AI moves fast: every component swappable

Part II

How It Works

Core Mechanics

Constraint #1

A wrong RFP claim can cost millions in failed contracts. If we can't cite the source, we can't use the answer.

Tradeoff: helpfulness vs. honesty

The Governing Principle

Principle 1: "Tell the truth, or at least don't lie."

Principle 2: "Log and lead."

A confidently wrong answer costs more than "I need more information."

→ New requirement: Citation-backed responses. ‘Insufficient Evidence’ as a first-class output. Source authority tiers.

Requirements v1

What we know so far

0 / 15

Functional

Citation-backed responses
"Insufficient Evidence" as a first-class output
Multi-pass retrieval with fallback advisory
Source authority tiers (Gold 3x, Silver 2x, Ref 1x)

Non-Functional

100% citation presence on answered queries
<1% unsupported claims

How It Works

The RAG Flow

1. Tag

Label a Confluence page or Jira ticket

→

2. Process

Webhook fires, chunks and embeds content

→

3. Store

Vectors stored in Weaviate

→

4. Retrieve

Hybrid search finds relevant chunks

→

5. Answer

Cited response from our docs

Without RAG

Guesses without sources.

vs.

With RAG

Cited, verifiable, grounded.

Architecture

System Architecture

Presentation Layer

RFP UI

Legal UI

CSM UI

MCP Clients^‡

REST / GraphQL / MCP

↓

Agent Harness

Classify

→

Route

→

Retrieve

→

Rerank

→

Score

→

Generate

→

Cite

[Haiku] [Sonnet]

↓

Storage

Weaviate:

RFP

Legal

Support

CSM

General

Hybrid (Vector + BM25) + Cross-encoder

↓

Ingestion

Confluence

→

Webhook

→

Service Bus

→

Tag Route

→

Snapshot

→

Chunk

→

Embed

→

Upsert

The harness enforces RBAC at query time. An RFP query gets collection='RFP', content_type_filter=['manual','release_notes']. The model never sees Legal or Support content.

Constraint #2

Users search by meaning and by exact terms (release versions, compliance codes, prescription identifiers). The retrieval layer needs to handle both well.

Tradeoff: precision vs. cost

Retrieval

Hybrid Search

Vector Search

Semantic similarity

+

BM25

Exact term matching

→

RRF + Rerank

Cross-encoder top-5

Vector Search

Q: "Does Fusion support barcode scanning?"

[0.31, -0.82, 0.15, ...]

Matched: "...medication verification process using scanning technology..."

[0.29, -0.79, 0.18, ...]

similarity: 0.84

BM25 Keyword

Q: "barcode scanning"

Matched: "The barcode scanning module allows technicians to..."

BM25: 12.4

RRF + Rerank

Merged: Chunk A (vector rank #3, BM25 rank #1) → RRF score

Cross-encoder reranker re-scores all candidates using the full query-chunk pair.

Top 5 chunks selected by relevance, not just keyword or semantic match alone.

Neither search alone is enough. Hybrid + cross-encoder reranking: +15-20% precision.

→ New requirement: Hybrid retrieval (vector + BM25 + reranking). P95 latency under 12 seconds.

LLM Strategy

Model Strategy

Query

→

Haiku
classify/route

→

Retrieve

→

Rerank

→

Sonnet
generate

→

Confidence
check

→

Response

Haiku: Classifier ~$0.01/q

Routes queries to domain, decomposes compound questions. Structured JSON output.

Cohere: Reranker ~$0.003/q

Cross-encoder rescores top 15 → top 5. +15-20% precision. Evaluated in the eval harness.

Sonnet: Generator ~$0.15/q

Generates cited responses from reranked chunks. Every claim traces to source.

→ New requirements: Query decomposition. Circuit breaker. Configurable model router. Monthly cost under $800.

Constraint #3

Sometimes the evidence just isn't there. That's signal, not a dead end, as long as the system tells the truth instead of guessing.

Tradeoff: honesty vs. sycophancy

Confidence

Confidence Scoring

HIGH

Gold sources, high relevance

MEDIUM

Partial match, verify

INSUFFICIENT

No evidence, system refuses

EXAMPLE

Q: "Does Fusion EHR support electronic prescribing?"

Retrieved: 2 gold-tier sources (user manual ch.12, compliance doc §4.2) + 1 silver-tier (implementation guide §7), relevance 0.89

→ Confidence: HIGH | Weighted score: (2 × 3) + (1 × 2) = 8 > threshold of 5

Gold 3x

Silver 2x

= 8 > threshold (5) HIGH

→ New requirements: Rule-based confidence scoring. Weighted authority tiers drive trust level.

Refusal

Teaching the System to Refuse

LLMs guess to be helpful. Harness blocks generation below threshold.
Every refusal logged with search context.

Gap tracker: most valuable output. Prioritized list of documentation holes.

100 queries

→

82 answered
with citations

→

18 insufficient
evidence logged

→

Gap Report
prioritized backlog

→ New requirement: Harness-enforced refusal fidelity. Every refusal logged with search context.

Requirements v2

The list grows

9 / 15

Functional

Citation-backed responses
"Insufficient Evidence"
Multi-pass retrieval
Source authority tiers
Hybrid search (vector + BM25 + rerank)
Query decomposition
Circuit breaker
Confidence scoring (rule-based)
Configurable model router

Non-Functional

100% citation presence
<1% unsupported claims
P95 latency <12s
Monthly cost <$800

Part III

Data Pipeline & Quality

Ingestion, Scale, and Optimization

Constraint #4

46K Confluence pages and 131K Jira tickets. The system has to stay current without drowning in noise.

Tradeoff: volume vs. quality

Ingestion

The Ingestion Pipeline

1Webhook Fires

Confluence page created/updated triggers an Atlassian webhook. Event-driven means zero polling overhead.

2Queue

Service Bus absorbs bursts, retries failures, dead-letters poison events. Decouples ingestion from source.

3Tag Route

Labels map to taxonomy axes. Determines which Weaviate class(es) receive the content. Removals trigger soft-delete.

4Snapshot

Canonical version stored in Blob Storage. Vector DB is derived, never authoritative. Always rebuildable.

5Chunk

Semantic chunking: heading boundaries first, then paragraphs, then sentences. 300–500 tokens with 50-token overlap.

6Dedup & Embed

Content-hash deduplication ensures idempotency. Embedding model generates vectors for each chunk.

7Upsert

Vectors land in the correct Weaviate class with full metadata: source ID, heading path, authority tier, timestamps.

Canonical snapshots: always rebuildable, never re-crawl.

→ New requirement: Idempotent ingestion. Canonical snapshots for rebuildability.

Data Quality

The Laffer Curve of RAG

Recall drops from 99% at 10K vectors to 85% at 10M^‡. Class isolation keeps us in the safe zone.

"Ingest everything" = noise wins
Our defense: 2K-5K vectors per query, not 600K+

→ New requirement: Domain-isolated vector classes. Curated search scope per query.

† HNSW (Hierarchical Navigable Small World) is the approximate nearest-neighbor algorithm most vector databases use to find similar chunks quickly.
‡ Sarkar, "HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows," Towards Data Science, Jan 2026

Architecture

Document Curation & Class Isolation

One collection per domain. Tags route content in. The harness filters by content_type at query time.

Source

Confluence Page

rag-rfp-manual rag-legal-contract

→

ingest

Weaviate Vector Store (1 collection per domain)

RFP

manual ✓

release_notes

past_rfp

Legal

contract ✓

soc2

compliance

Support

kb_articles

troubleshoot

CSM

playbooks

onboarding

General (all teams)

product_specs

brand_spine

↔

query

RFP

Legal

Support

CSM

Dual-tagged pages get ingested into both collections. The harness filters by content_type at query time. No row-level security needed in Weaviate.

Requirements v3

Growing stronger

11 / 15

Functional

Citation-backed responses
"Insufficient Evidence"
Multi-pass retrieval
Source authority tiers
Hybrid search (vector + BM25 + rerank)
Query decomposition
Circuit breaker
Confidence scoring (rule-based)
Configurable model router
Idempotent ingestion
Golden answer cache for recurring queries
Domain-isolated vector classes

Non-Functional

100% citation presence
<1% unsupported claims
P95 latency <12s
Monthly cost <$800
Recurring queries served from verified cache

Part IV

Proving It Works

Evaluation & Feedback

Constraint #5

RFPs repeat the same questions. Users catch errors the system misses. Both signals should feed back in.

Tradeoff: build speed vs. feedback quality

Optimization

Golden Answer Cache

"Is Fusion EHR HIPAA compliant?" gets asked on every RFP. The answer never changes.

Caching an Answer

User gets
RAG response

→

Light inference:
"likely stable answer?"

→

Modal prompt:
"Save for future?"

→

✓ Cached
answer + citations

Removable anytime
(like ChatGPT memory)

New Query Flow

New query

→

Cache check
similarity > 0.95?

HIT →

Serve verified (no LLM, no retrieval)

MISS →

Full RAG pipeline

Evaluation

The Eval Harness

50 real RFP questions + adversarial probes
CI/CD triggers on doc updates, model swaps, weekly regression

SAMPLE EVAL RUN

Q: "Does Fusion support HL7 ADT feeds?"

Precision@5: 0.80 Faithfulness: 1.00 Citation Acc: 1.00 Confidence: HIGH

94%

Precision@5

97%

Faithfulness

100%

Citation Acc

8.2s

P95 Latency

47/50 queries returned
correct top-5 chunks

0 hallucinated claims
across 50-question set

Every citation links to
a real Confluence source

95th percentile response
time end-to-end

→ New requirements: Golden answer cache for recurring queries. User feedback loop. Over 90% accuracy target.

Evaluation

Eval Types Matrix

These eval types apply beyond just RFP. They validate any domain deployment.

Eval Type	What It Measures	Applies To
Retrieval Precision@5	Right chunks in top 5?	All domains
Answer Faithfulness	Every claim supported by source?	All domains
Citation Accuracy	Citations point to real sources?	All domains
Confidence Calibration	High confidence = actually right?	All domains
Refusal Accuracy	Refuses when should, accepts when should?	All domains
Gap Report Quality	Recommendations actionable?	RFP, Support
Cross-Domain Leakage	RBAC prevents bleed?	Multi-domain
Chunking Boundary	Answers split across chunks?	All domains

Requirements v4

Almost there

13 / 15

Functional

Citation-backed responses
"Insufficient Evidence"
Multi-pass retrieval
Source authority tiers
Hybrid search (vector + BM25 + rerank)
Query decomposition
Circuit breaker
Confidence scoring (rule-based)
Configurable model router
Idempotent ingestion
Golden answer cache for recurring queries
Domain-isolated vector classes
Feedback loop (thumbs up/down + corrections)
Gap tracking (refusals feed backlog)

Non-Functional

100% citation presence
<1% unsupported claims
P95 latency <12s
Monthly cost <$800
Recurring queries served from verified cache
>90% accuracy on 50-question validation

Part V

How It Scales

Multi-Domain & Future-Proofing

Constraint #6

Multiple departments need grounded intelligence. We can't rebuild for each one.

Tradeoff: flexibility vs. simplicity

Architecture

Designing for Multi-Domain from Day One

governance.yaml → Two-axis taxonomy: Routing (which department) × Content-type (what kind of doc)

rag-rfp

rag-legal

rag-csm

rag-support

manual

RFP_Manual

Legal_Manual

CSM_Manual

Supp_Manual

release-notes

RFP_Release

Legal_Release

CSM_Release

Supp_Release

compliance

RFP_Comply

Legal_Comply

CSM_Comply

Supp_Comply

hl7-interfaces

RFP_HL7

Legal_HL7

CSM_HL7

Supp_HL7

Two-axis taxonomy: routing x content-type
Adding a department = YAML config change
RBAC gates who sees what

→ New requirements: Two-axis tag routing. Config-driven domain extensibility. RBAC-scoped access. Soft-delete on tag removal.

Future-Proofing

Swappable Components

Models commoditize. The moat is the harness. Every component behind an abstraction. Upgrades are config changes.

Today → Tomorrow

MiniLM (text)

→

Gemini Embedding v2 (multimodal)

Weaviate

→

Any DB via RetrievalBackend protocol

Haiku / Sonnet

→

Best model tomorrow via model router

Cohere Reranker

→

Any cross-encoder

Confluence

→

.htm manuals, PDFs, future sources

Infrastructure

Azure Architecture

User (Browser)

Connects through APIM to Azure services

↓

Azure APIM

Rate limiting • caching (20-30% hit) • auth • RBAC

Frontend (Container App)

React SPA, domain-specific workflows, DNS'd internally

↓

Container App: API

Agent harness, query pipeline

Container App: Ingestion

Webhook listener, chunker

Container App: Weaviate

Self-hosted on our cloud, native hybrid search

↓

Service Bus

Dead-letter enabled

Blob Storage

Canonical snapshots

Key Vault + AD

Secrets • RBAC

Anthropic API

Haiku + Sonnet (external)

Non-Azure (external) Azure User/Client

Container Apps (scale-to-zero). Self-hosted Weaviate (data sovereignty). Service Bus (durable messaging).

Cost

What This Costs

Component	Monthly
Container Apps (3 containers, scale-to-zero)	$80 – $150
Weaviate (self-hosted)	$15 – $25
APIM + networking	~$50
Anthropic API (Haiku + Sonnet)	$100 – $400
Storage + queue	$50 – $130
Total	$295 – $755 /mo

One BD professional's manual research costs more per month than the entire platform.

Results

What We Gained

Revenue Pipeline^†

Q1 2026: biggest by 1.4x
+3 deals/quarter potential
$6-10M above target
More at-bats per year

Operational Improvements^‡

SLA adherence improving
NPS + eNPS growing
Augmented workforce, not chatbot
Domain too complex for off-the-shelf

† Numbers shared at Fusion Health Town Hall, 3/18/2026
‡ Monthly reporting from the Operations team (Value & Automation division, under which the AI function is housed)

Traceability

Closing the Loop

Requirement	Design Decision
Citation-backed responses	Sonnet generates inline citations from retrieved chunks
Insufficient Evidence	Three-tier confidence scoring, rule-based refusal
Multi-domain routing	Two-axis taxonomy, config-driven Weaviate classes
RBAC	Azure AD + tag-scoped class access
Idempotent ingestion	Content-hash dedup + canonical blob snapshots
P95 <12s	Haiku classifier, APIM caching, hybrid search
Cost <$800/mo	Scale-to-zero containers, two-model split
Gap tracking	Refusal logs → content backlog pipeline
>90% accuracy	50-question eval harness, built first
Recurring query optimization	Golden answer cache: verified responses bypass RAG pipeline
Catching confident-but-wrong	User feedback loop: thumbs up/down + corrections, weekly review

Optimizations

If I Had More Time

Agentic Canvas Workflows

RFP section building in-platform

Deeper Multimodal

LMS videos, product demos, ticket screenshots as first-class knowledge sources.

Predictive Gap Analysis

Pattern analysis on refusal logs to surface documentation gaps before they're queried.

Teams Integration

MCP bot where work actually happens

Translation

How This Translates to Boltline

Modular Harness Architecture

FedRAMP-ready, evals catch regressions on swap

Azure + Anthropic → AWS Bedrock (FedRAMP)

Work Plans, BOMs, and Part Traceability

Work plans + BOMs as RAG corpus

RFP knowledge retrieval → Work plan + BOM intelligence

Hybrid Search for Exact Matches

Part #s and serial #s need exact match

Drug names, product codes → Part #s, serial #s, assembly IDs

Corpus Quality at Scale

Per-customer class scoping prevents degradation

Domain-isolated classes → Per-customer class scoping

Thank You

Questions? Let's go deep on anything.

Connor England

Backup: Harnesses

A Note on Harnesses

The moat is the harness, not the model. Two projects shaped this thinking.

Block's Goose

Open-source, LLM-agnostic agent framework. MCP as the sole integration standard. Local-first, no vendor calls. Now in the Linux Foundation alongside MCP itself.

Stripe's Minions

Built on a Goose fork. 1,300+ PRs/week, zero human-written code. Key innovation: "Blueprints" (hybrid deterministic + agentic nodes). ~500 tools available, ~15 curated per task.

Request

→

Pre-hydrate
Context

→

LLM
Tool Call

→

Execute
Tool

→

Context
Revision

→

Response

How we applied this

Blueprints: confidence + citation = deterministic. Retrieval + generation = agentic.
Pre-hydration: context assembled before LLM sees anything.
Tool curation: Legal query only sees Legal_* classes. Scoped per domain.
MCP: one protocol, any interface. Same brain everywhere.
Harness-level RLS: Weaviate has no row-level security. The harness injects collection + content_type filters at query time. Deterministic, fast, configurable per role.

Backup: UX

UX Decisions Under the Hood

Streaming responses. Users see generation in real-time, reducing perceived latency by ~60%.
Citations rendered as clickable links to source Confluence pages, with excerpt previews on hover.
Confidence badge shown prominently. Users learn to trust the system because it tells them when it's unsure.
Query history with re-ask capability. Refine without retyping.
Admin view shows the full retrieval trace: what was searched, what was retrieved, what was filtered, what was cited.

Backup: Chunking

Chunking Strategy

Semantic chunking: split on heading boundaries first, then paragraph breaks, then sentence boundaries.
Target: 300–500 tokens per chunk. Enough context to be useful, small enough for precise retrieval.
Overlap: 50-token sliding window between chunks to preserve context at boundaries.
Metadata preserved per chunk: source page ID, heading hierarchy, authority tier, last-modified date, content-type tag.

Chunk boundaries matter more than chunk size. A well-placed split preserves meaning; a bad one destroys it.

Backup: Data Quality Deep Dive

The RAG Quality Cliff

Two independent studies confirm the same pattern: vector search accuracy degrades meaningfully as corpus size grows, even with good embeddings.

EyeLevel Research, 2024

Tested Pinecone + OpenAI ada-002 embeddings across 1K, 10K, and 100K pages. 310 real answer-bearing pages + filler. 92 test questions, human-evaluated.

1K pages

baseline

10K pages

visible drop

100K pages

–12%

Expected degradation: 10-12% per 100K pages. A "page" is a full document page (PDF, HTML), not a chunk.

Sarkar (TDS), Jan 2026

Focused on HNSW algorithm behavior. With fixed index parameters (M=32, ef_search=128), recall degrades as vector count scales.

10K vectors

99%

1M vectors

~92%

10M vectors

85%

A "vector" is a single chunk embedding. 46K Confluence pages at ~10-20 chunks each = 600K-1M vectors.

Takeaways

Noise ratio drives degradation. More irrelevant content = harder to surface the right chunks.
Mitigations compound: class isolation + authority tiers + hybrid search + reranking.
"Ingest everything" has a ceiling. Class decomposition delays it indefinitely.
For Boltline: multi-tenant customers with large multimodal data need class-scoped search from day one.

EyeLevel, "Do Vector Databases Lose Accuracy at Scale?", 2024 | Sarkar, "HNSW at Scale," TDS, Jan 2026

Fusion Intel

About Me

What We'll Cover

This talk is about

This talk will

What Is Fusion Intel?

The problem with general AI

What RAG changes

Why Build This?

First: RFP Bottleneck

Next: Cross-Domain Intelligence Multiplier

Build vs. Buy

Team Usability

Compliance

Constraints & Opportunities

Domain Constraints

Technical Constraints

What we know so far

Functional

Non-Functional

The RAG Flow

System Architecture

Hybrid Search

Model Strategy

Haiku: Classifier ~$0.01/q

Cohere: Reranker ~$0.003/q

Sonnet: Generator ~$0.15/q

Confidence Scoring

HIGH

MEDIUM

INSUFFICIENT

Teaching the System to Refuse

The list grows

Functional

Non-Functional

The Ingestion Pipeline

The Laffer Curve of RAG

Document Curation & Class Isolation

Growing stronger

Functional

Non-Functional

Golden Answer Cache

The Eval Harness

Eval Types Matrix

Almost there

Functional

Non-Functional

Designing for Multi-Domain from Day One

Swappable Components

Azure Architecture

What This Costs

What We Gained

Revenue Pipeline†

Operational Improvements‡

Closing the Loop

If I Had More Time

Agentic Canvas Workflows

Deeper Multimodal

Predictive Gap Analysis

Teams Integration

How This Translates to Boltline

Modular Harness Architecture

Work Plans, BOMs, and Part Traceability

Hybrid Search for Exact Matches

Corpus Quality at Scale

Thank You

A Note on Harnesses

Block's Goose

Stripe's Minions

How we applied this

UX Decisions Under the Hood

Chunking Strategy

The RAG Quality Cliff

EyeLevel Research, 2024

Sarkar (TDS), Jan 2026

Takeaways

Revenue Pipeline^†

Operational Improvements^‡