Modular AI development that absorbs improvements fast
Lessons learned, applied to Boltline
This talk will
Derive requirements from domain constraints
Walk through architecture and implementation
Share strong opinions, loosely held
Get into the nitty-gritty!
Context
What Is Fusion Intel?
A platform that lets our teams ask questions and get accurate, cited answers from our own documentation, powered by AI.
The problem with general AI
Trained on public data, not ours. Guesses when it doesn't know.
What RAG changes
Retrieval-Augmented Generation: searches our vetted docs first, then answers. Every claim traceable to source.
Intelligent Search
Finds the right information across thousands of pages using both keyword and meaning-based search
AI-Powered Answers
Generates clear, natural-language responses grounded in our documentation, with citations
Multi-Domain
Built once, used by RFP, Legal, Support, Client Success. Each team gets their own curated knowledge base.
Prioritizes correctness over helpfulness. Cite, flag, or refuse. Never guess.
The Problem
Why Build This?
First: RFP Bottleneck
RFPs are the lifeblood of business growth
40-80 hrs per RFP, SMEs pulled off product work
Clear first use case: massive upside, limited downside
Next: Cross-Domain Intelligence Multiplier
Same knowledge bases serve Legal, Compliance, Support, CSM
Build once well, then scale from first principles
"We have all [the intel] we need to be winning deals left and right, if we could actually find it in time. Rovo sucks. Half the time, it can't find anything. Half of the remaining half, it lies. I don't want to ground our responses in what a bot told [our BD team] if failing to honor that claim means a few million sunk and our brand ruined. There's no question that this class of technology is powerful. [...] There's got to be a way."
M. Jakovcic, CRO, initial planning meeting 12/2025
Framing
Constraints & Opportunities
Domain Constraints
Multi-department, one platform (RFP, Legal, Support, CSM)
HIPAA compliance: full data control required, no RAG-in-a-box
Docs scattered across Confluence + OneDrive, no unified search
Cross-domain sharing with clear department boundaries
Technical Constraints
RAG quality degrades at scale (46K pages, 131K tickets)
LLMs are sycophantic: will guess instead of refuse
AI moves fast: every component must be swappable
Constraint #1
The core value prop is genuinely accurate intelligence. If we don't have provenance, we don't have an app.
Helpfulness vs. honesty. We chose honesty.
The Governing Principle
Principle 1: "Tell the truth, or at least don't lie."
Principle 2: "Log and lead."
First Pass: Weaviate hybrid search across authority-tiered classes
Second Pass: Confluence deep search via Atlassian MCP. If found, recommend tagging into corpora.
Third Pass: Intelligence advisory. What was searched, what's missing, who to ask next.
A confidently wrong answer costs more than "I need more information."
→ New requirement: Citation-backed responses. "Insufficient Evidence" as a first-class output with actionable gap reporting.
Requirements v1
What we know so far
0 / 12
Functional
Citation-backed responses
"Insufficient Evidence" as a first-class output
Multi-pass retrieval with fallback advisory
Feedback loop for confident-but-wrong responses
Non-Functional
100% citation presence on answered queries
<1% unsupported claims
Recurring queries served from verified cache
How It Works
The RAG Flow
Instead of asking a general AI model and hoping it knows the answer, we feed it our vetted information first. Here's how.
1. Tag
A team member tags a Confluence page or Jira ticket with a special label
→
2. Process
A webhook fires, sending the content through a pipeline that breaks it into chunks and converts it to "AI language" (embeddings)
→
3. Store
Stored in Weaviate, a specialized database that lets AI agents find exactly the right pieces quickly
→
4. Retrieve
When someone asks a question, the system searches for the most relevant chunks from our vetted knowledge
→
5. Answer
The AI generates a response grounded in our docs, with citations. No guessing, no hallucinating.
Without RAG
"I think Fusion might support that..."
No source. Could be wrong.
vs.
With RAG
"Yes. See User Manual ch.12, §4.2"
Cited. Verifiable. Grounded.
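The five-step flow above can be sketched in a few lines. This is a toy illustration, not the production pipeline: the keyword-overlap scorer stands in for hybrid vector + BM25 search, and the names (`Chunk`, `retrieve`, `answer`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # e.g. "User Manual ch.12 §4.2"
    score: float

def retrieve(query: str, index: list[Chunk], k: int = 3) -> list[Chunk]:
    """Toy keyword-overlap scorer standing in for hybrid search."""
    terms = set(query.lower().split())
    scored = [
        Chunk(c.text, c.source, len(terms & set(c.text.lower().split())))
        for c in index
    ]
    return sorted((c for c in scored if c.score > 0),
                  key=lambda c: c.score, reverse=True)[:k]

def answer(query: str, index: list[Chunk]) -> str:
    chunks = retrieve(query, index)
    if not chunks:
        return "INSUFFICIENT EVIDENCE"  # refuse rather than guess
    # A real system hands `chunks` to the LLM; here we just cite them.
    citations = ", ".join(c.source for c in chunks)
    return f"Grounded answer based on: {citations}"

index = [
    Chunk("Fusion supports HL7 ADT feeds", "Integration Guide p.12", 0.0),
    Chunk("Standard SLA is 10 business days", "SLA Matrix §3.1", 0.0),
]
print(answer("Does Fusion support HL7 ADT feeds?", index))
```

The key structural point is the early return: with no vetted chunks, the model is never called, so there is nothing to hallucinate from.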
Constraint #2
Multiple departments need this. We can't rebuild for each one. One architecture, many consumers, clear boundaries.
Flexibility vs. simplicity. Config-driven over code-driven.
Architecture
Designing for Multi-Domain from Day One
RFP response is the first use case. But Legal, Compliance, Client Success, and Support all need the same capability. The architecture serves all of them without retooling.
Each intersection maps to a Weaviate class. Adding a new department = a YAML config change. No code changes. RBAC gates who sees what.
"Interface" = HL7 interface capabilities (speaking with other vendor systems like pharmacy, lab, EHR).
When tags are removed, the system soft-deletes (archives) the vectors rather than leaving them searchable. This prevents mistakes from polluting signal.
→ New requirements: Two-axis tag routing. Config-driven domain extensibility. RBAC-scoped access. Soft-delete on tag removal.
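A config entry for a new department might look like the fragment below. This is a hypothetical schema, every field name is illustrative; the point is that routing, RBAC scope, and deletion behavior are data, not code.

```yaml
# Hypothetical domain config -- field names are illustrative, not the real schema
domains:
  - name: legal
    weaviate_classes: [Legal_Gold, Legal_Silver, Legal_Reference]
    tags:                               # two-axis routing: department x content type
      department: legal
      content_types: [policy, contract, interface]
    rbac_group: "fusion-intel-legal"    # Azure AD group gating access
    on_tag_removed: soft_delete         # archive vectors, never hard-delete
```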
Requirements v2
Building the list
2 / 12
Functional
Citation-backed responses
"Insufficient Evidence" as first-class output
Multi-pass retrieval with fallback advisory
Two-axis tag routing
Config-driven domain extensibility
RBAC-scoped access per department
Soft-delete on tag removal
Non-Functional
100% citation presence
<1% unsupported claims
Idempotent ingestion (reprocess = zero duplicates)
Architecture
System Architecture
Presentation Layer
RFP UI
Legal UI
CSM UI
MCP Clients‡
REST / GraphQL / MCP
Each domain ships OOTB workflows via /slash commands, surfaced as recommended prompts. Stored prompt templates for reuse.
↓
Agent Harness
Classify
→
Route
→
Retrieve
→
Rerank
→
Score
→
Generate
→
Cite
[Haiku] [Sonnet]
↓
Storage
Weaviate:
RFP_*
Legal_*
CSM_*
General
Hybrid (Vector + BM25) + Cross-encoder
↓
Ingestion
Confluence
→
Webhook
→
Service Bus
→
Tag Route
→
Snapshot
→
Chunk
→
Embed
→
Upsert
‡Future Dev: Exploring a centralized "ToolShed" MCP server (inspired by Stripe's Minions architecture) with access to Fusion-specific tools and third-party connections like Weaviate, Confluence, and Jira. One server, any client. See backup slide on harnesses for more.
Constraint #3
Knowledge changes constantly, and we have a lot of it. 46K pages, 131K tickets. Stay current without drowning in noise.
Volume vs. quality. Curate aggressively, ingest idempotently.
Ingestion
The Ingestion Pipeline
1. Webhook Fires
Confluence page created/updated triggers an Atlassian webhook. Event-driven means zero polling overhead.
2. Queue
Service Bus absorbs bursts, retries failures, dead-letters poison events. Decouples ingestion from source.
3. Tag Route
Labels map to taxonomy axes. Determines which Weaviate class(es) receive the content. Removals trigger soft-delete.
4. Snapshot
Canonical version stored in Blob Storage. Vector DB is derived, never authoritative. Always rebuildable.
5. Chunk
Semantic chunking: heading boundaries first, then paragraphs, then sentences. 300–500 tokens with 50-token overlap.
6. Dedup & Embed
Content-hash deduplication ensures idempotency. Embedding model generates vectors for each chunk.
7. Upsert
Vectors land in the correct Weaviate class with full metadata: source ID, heading path, authority tier, timestamps.
Every version gets a canonical snapshot before chunking. The vector DB is derived, never authoritative. You can always rebuild from snapshots.
→ New requirement: Idempotent ingestion. Reprocessing a page must never create duplicates.
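One way to get idempotency, sketched under assumptions: derive each chunk's vector ID deterministically from its content, so re-ingesting the same page upserts in place. The `chunk_key` helper and the dict standing in for Weaviate are both hypothetical.

```python
import hashlib
import uuid

def chunk_key(source_id: str, heading_path: str, text: str) -> str:
    """Deterministic ID: identical content always maps to the same vector ID,
    so reprocessing a page upserts in place instead of duplicating."""
    digest = hashlib.sha256(
        f"{source_id}|{heading_path}|{text}".encode("utf-8")
    ).hexdigest()
    # Weaviate object IDs are UUIDs; derive one deterministically (UUIDv5).
    return str(uuid.uuid5(uuid.NAMESPACE_URL, digest))

store: dict[str, str] = {}  # stands in for the vector DB

def upsert(source_id: str, heading_path: str, text: str) -> None:
    store[chunk_key(source_id, heading_path, text)] = text

# Reprocessing the same page twice leaves exactly one copy.
upsert("CONF-123", "Integration > HL7", "Fusion supports ADT feeds.")
upsert("CONF-123", "Integration > HL7", "Fusion supports ADT feeds.")
print(len(store))  # → 1
```

When content actually changes, the new text hashes to a new ID; the snapshot-then-rebuild step is what retires the stale vectors.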
Data Quality
The Laffer Curve of RAG
More data isn't always better. With default HNSW† settings, recall drops from 99% at 10K vectors to 85% at 10M‡. At Fusion's scale (~600K-1M chunk vectors), naive ingestion puts us in the degradation zone.
"Ingest everything" trap: noise dilutes signal as vector count grows
Our defense: domain-isolated classes, 2K-5K vectors per query scope
Authority tiers: Gold 3x / Silver 2x / Reference 1x scoring per class
→ New requirement: Domain-isolated vector classes. Each query must search thousands of curated vectors, not millions of noisy ones.
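The authority-tier weighting reduces to a simple re-scoring pass. A minimal sketch, assuming hits arrive with a normalized relevance score; the `weighted_rank` name and the hit shape are illustrative, not the real API.

```python
TIER_WEIGHT = {"gold": 3.0, "silver": 2.0, "reference": 1.0}

def weighted_rank(hits: list[dict], k: int = 5) -> list[dict]:
    """Re-score retrieval hits by authority tier so gold sources
    float above noise. Raw `score` is assumed to be in [0, 1]."""
    for h in hits:
        h["weighted"] = h["score"] * TIER_WEIGHT[h["tier"]]
    return sorted(hits, key=lambda h: h["weighted"], reverse=True)[:k]

hits = [
    {"doc": "old blog post",     "tier": "reference", "score": 0.90},
    {"doc": "Integration Guide", "tier": "gold",      "score": 0.75},
]
ranked = weighted_rank(hits)
print(ranked[0]["doc"])  # gold at 0.75 × 3 = 2.25 beats reference at 0.90
```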
Generates citation-backed responses from reranked chunks. Every claim must trace to source. Confidence scoring is deterministic, happens after generation.
Sonnet Output
"Yes. Fusion EHR supports HL7 ADT (Admit, Discharge, Transfer) feeds via its integration engine. [Integration Guide p.12] Standard SLA for new HL7 interfaces is 10 business days. [SLA Matrix §3.1]"
CONFIDENCE: HIGH (2 gold sources, relevance 0.91)
CITATIONS: 2 (Integration Guide, SLA Matrix)
Circuit breaker: embedding down = keyword-only fallback, system stays up
Model router: config surface. New models or compliance boundaries = no code changes.
→ New requirements: Query decomposition for compound questions. Circuit breaker for graceful degradation. Monthly cost under $800.
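The circuit breaker pattern is standard; here is a minimal sketch of the "embedding down = keyword-only fallback" behavior. Class and function names are hypothetical, and the real implementation would add jitter, metrics, and half-open probing.

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors, trip open and route
    queries to the fallback until the cooldown window ends."""
    def __init__(self, max_failures: int = 3, cooldown: float = 30.0):
        self.max_failures, self.cooldown = max_failures, cooldown
        self.failures, self.opened_at = 0, 0.0

    def call(self, primary, fallback, *args):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback(*args)   # open: skip the embedding service
            self.failures = 0            # cooldown elapsed: try primary again
        try:
            result = primary(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback(*args)

def embedding_search(q):   # stand-in for the vector path
    raise ConnectionError("embedding service down")

def keyword_search(q):     # BM25-only fallback keeps the system up
    return f"keyword results for {q!r}"

breaker = CircuitBreaker()
print(breaker.call(embedding_search, keyword_search, "HL7 ADT"))
```

The user sees degraded but real results instead of an error page, which is the whole point of "graceful degradation."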
Requirements v3
The list grows
7 / 12
Functional
Citation-backed responses
"Insufficient Evidence" as first-class output
Multi-pass retrieval with fallback advisory
Two-axis tag routing
Config-driven domain extensibility
RBAC-scoped access
Soft-delete on tag removal
Query decomposition for compound questions
Circuit breaker / graceful degradation
Configurable model router
Non-Functional
100% citation presence
<1% unsupported claims
Idempotent ingestion
P95 latency <12s
Monthly cost <$800
Constraint #5
The system will be wrong sometimes. The question isn't whether. It's whether it knows when.
Trust vs. coverage. We'd rather refuse than guess.
Confidence
Confidence Scoring
HIGH
Multiple Gold sources agree
Gold = 3x weight
MEDIUM
Silver sources, partial coverage
Silver = 2x weight
INSUFFICIENT
Refuses to answer, logs the gap
Reference = 1x weight
EXAMPLE
Q: "Does Fusion EHR support electronic prescribing?"
→ Confidence: HIGH | Weighted score: (2 × 3) + (1 × 2) = 8 > threshold of 5
Gold 3x
Gold 3x
Silver 2x
= 8 > threshold (5) → HIGH
Rule-based, not learned. Fully explainable, no black-box confidence.
Gold sources count 3x in the weighted confidence calculation. Two gold sources matching = weighted score of 6. One silver = 2. The threshold for HIGH is 5+.
Compound queries: overall confidence = minimum across sub-queries.
→ New requirements: Source authority tiers (Gold 3x, Silver 2x, Reference 1x). Over 90% accuracy on 50-question validation set.
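Because the scoring is rule-based, the whole tier logic fits in a few lines. A sketch under assumptions: the HIGH threshold of 5 comes from the slide, but the MEDIUM cutoff of 2 and the function names are illustrative.

```python
WEIGHT = {"gold": 3, "silver": 2, "reference": 1}
HIGH_THRESHOLD = 5  # from the slide; MEDIUM cutoff below is an assumption

def confidence(sources: list[str]) -> str:
    """Deterministic, rule-based scoring run by the harness after
    generation -- the LLM never judges its own confidence."""
    score = sum(WEIGHT[s] for s in sources)
    if score >= HIGH_THRESHOLD:
        return "HIGH"
    if score >= 2:
        return "MEDIUM"
    return "INSUFFICIENT"

def compound_confidence(per_subquery: list[list[str]]) -> str:
    """Compound queries take the minimum confidence across sub-queries."""
    order = {"INSUFFICIENT": 0, "MEDIUM": 1, "HIGH": 2}
    return min((confidence(s) for s in per_subquery), key=order.get)

print(confidence(["gold", "gold", "silver"]))               # 3+3+2 = 8 ≥ 5 → HIGH
print(compound_confidence([["gold", "gold"], ["silver"]]))  # min(HIGH, MEDIUM) → MEDIUM
```

Everything here is inspectable: any score can be replayed by hand, which is exactly the "no black-box confidence" property.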
Refusal
Teaching the System to Refuse
LLMs are sycophantic. Will guess to be "helpful."
Claude chosen for refusal backbone, but model-level isn't enough
Harness enforcement: confidence scoring is code, not model judgment. Below threshold = generation blocked.
Every refusal logged: what was searched, what was missing
The gap tracker turned out to be the system's most valuable output: a prioritized list of documentation holes.
100 queries
→
82 answered with citations
→
18 insufficient evidence logged
→
Gap Report prioritized backlog
→ New requirement: Gap tracking. Every refusal must log what was searched, what was missing, and feed that directly to content stakeholders.
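The gap tracker can be as simple as a structured refusal log plus a frequency count. A minimal sketch; the log shape and function names are hypothetical, and the example queries are invented for illustration.

```python
from collections import Counter

gap_log: list[dict] = []

def log_refusal(query: str, searched_classes: list[str], missing_topic: str):
    """Every refusal records what was searched and what was missing."""
    gap_log.append({"query": query,
                    "searched": searched_classes,
                    "missing": missing_topic})

def gap_report(top_n: int = 3) -> list[tuple[str, int]]:
    """Prioritized documentation backlog: most-requested gaps first."""
    return Counter(g["missing"] for g in gap_log).most_common(top_n)

log_refusal("FHIR bulk export?", ["RFP_Gold", "General"], "FHIR support")
log_refusal("Does Fusion do FHIR R4?", ["RFP_Gold"], "FHIR support")
log_refusal("SOC 2 Type II report?", ["Legal_Gold"], "SOC 2 audit docs")
print(gap_report())  # → [('FHIR support', 2), ('SOC 2 audit docs', 1)]
```

Sorting gaps by demand is what turns refusals into a prioritized content backlog rather than a pile of error logs.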
Requirements v4
Almost there
12 / 14
Functional
Citation-backed responses
"Insufficient Evidence"
Multi-pass retrieval + fallback advisory
Two-axis tag routing
Config-driven extensibility
RBAC-scoped access
Soft-delete on tag removal
Query decomposition
Circuit breaker
Configurable model router
Golden answer cache for recurring queries
User feedback loop (thumbs up/down + corrections)
Gap tracking: refusals feed content backlog
Source authority tiers
Non-Functional
100% citation presence
<1% unsupported claims
Idempotent ingestion
P95 <12s
Monthly cost <$800
>90% accuracy on 50-question validation set
Constraint #6
RFPs ask the same questions across submissions. Users will catch errors the system can't. Both signals need to feed back into the loop.
Repetition is an optimization opportunity. User corrections are free eval data.
Evaluation
The Eval Harness
50 questions: real RFP questions + adversarial out-of-scope probes
Built first, not last. The eval harness existed before the generation pipeline.
CI/CD integration: automated triggers after large doc updates, model switches, chunking changes, weekly regression
User feedback: thumbs up/down + correction text on every response. Catches "confident but wrong."
SAMPLE EVAL RUN
Q: "Does Fusion support HL7 ADT feeds?"
Expected: Yes, with citation to Integration Guide p.12
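The eval runs above reduce to a loop over cases with two pass conditions: expected citation present, or refusal on out-of-scope probes. A minimal sketch with a fake pipeline; all names here are illustrative.

```python
def run_eval(cases, ask):
    """Tiny regression harness: each case checks that the answer cites
    the expected source, or that out-of-scope probes are refused."""
    passed = 0
    for case in cases:
        answer = ask(case["q"])
        if case.get("expect_refusal"):
            ok = answer == "INSUFFICIENT EVIDENCE"
        else:
            ok = case["expected_citation"] in answer
        passed += ok
    return passed / len(cases)

cases = [
    {"q": "Does Fusion support HL7 ADT feeds?",
     "expected_citation": "Integration Guide p.12"},
    {"q": "What is Fusion's stock price?",  # adversarial out-of-scope probe
     "expect_refusal": True},
]

def fake_ask(q):  # stand-in for the real pipeline
    if "HL7" in q:
        return "Yes. [Integration Guide p.12]"
    return "INSUFFICIENT EVIDENCE"

print(run_eval(cases, fake_ask))  # → 1.0
```

Wiring `run_eval` into CI is what makes "built first, not last" pay off: chunking or model changes fail the build before they reach users.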
Container Apps over AKS: simpler, scale-to-zero saves ~40%. Service Bus over Redis: durable, dead-letter. Self-hosted Weaviate over Pinecone: data sovereignty, no vendor lock-in.
Cost
What This Costs
Component
Monthly
Container Apps (3 containers, scale-to-zero)
$80 – $150
Weaviate (self-hosted)
$15 – $25
APIM + networking
~$50
Anthropic API (Haiku + Sonnet)
$100 – $400
Storage + queue
$50 – $130
Total
$295 – $755 /mo
One BD professional's time on manual RFP research costs more per month than the entire platform. Plus SME opportunity cost: subject matter experts pulled off product work to help the revenue team.
Results
What We Gained
Revenue Pipeline†
Q1 2026 was our biggest Q1 by a factor of 1.4x
On track to add ~3 more deals per quarter to the pipeline for the year
Estimated additional $6-10M above initial target
More at-bats per year: faster RFP turnaround means we can pursue deals we previously had to pass on
Operational Improvements‡
Multimodal improvements driving better SLA adherence on support
NPS for client satisfaction and eNPS for support team both growing steadily
Better responses, easier troubleshooting through augmented workforce (internal)
Our domain is too complex for an Intercom/Fin-style chatbot. Augmenting our own people is moving the needle.
† Numbers shared at Fusion Health Town Hall, 3/18/2026 ‡ Monthly reporting from the Operations team (Value & Automation division, under which the AI function is housed)
Traceability
Closing the Loop
Every requirement traces to a design decision.
Requirement
Design Decision
Citation-backed responses
Sonnet generates inline citations from retrieved chunks
Insufficient Evidence
Three-tier confidence scoring, rule-based refusal
Multi-domain routing
Two-axis taxonomy, config-driven Weaviate classes
RBAC
Azure AD + tag-scoped class access
Idempotent ingestion
Content-hash dedup + canonical blob snapshots
P95 <12s
Haiku classifier, APIM caching, hybrid search
Cost <$800/mo
Scale-to-zero containers, two-model split
Gap tracking
Refusal logs → content backlog pipeline
>90% accuracy
50-question eval harness, built first
Recurring query optimization
Golden answer cache: verified responses bypass RAG pipeline
Catching confident-but-wrong
User feedback loop: thumbs up/down + corrections, weekly review
Optimizations
If I Had More Time
Agentic Canvas Workflows
RFP section building + citation implantation in a document canvas (like Claude Artifacts), so BD pros iterate on documents inside the platform.
Golden Answer Cache
Verified answer store for recurring questions. Human-approved responses bypass the full RAG pipeline. Highest-ROI optimization for repetitive query patterns.
Cross-Domain Intelligence Synthesis
Queries spanning multiple department corpora with appropriate access controls. Synthesize across silos while respecting RBAC.
Teams Integration
Extending the MCP server pattern to a Microsoft Teams bot, where teams actually work. Same intelligence, delivered where the conversations happen.
Domain-isolated classes → Per-customer class scoping
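The golden answer cache above is a small amount of code. A sketch under assumptions: this version matches on normalized exact phrasing (a real cache might add semantic similarity), and all names are hypothetical.

```python
import hashlib

golden: dict[str, str] = {}   # verified, human-approved answers only

def normalize(q: str) -> str:
    return " ".join(q.lower().split()).rstrip("?")

def cache_key(q: str) -> str:
    return hashlib.sha256(normalize(q).encode()).hexdigest()

def approve(q: str, answer: str) -> None:
    golden[cache_key(q)] = answer   # only humans write to the cache

def ask(q: str, rag_pipeline) -> str:
    hit = golden.get(cache_key(q))
    return hit if hit is not None else rag_pipeline(q)

approve("Does Fusion support HL7 ADT feeds?",
        "Yes. [Integration Guide p.12]")
# Recurring phrasing variants hit the cache; the full RAG pipeline is skipped.
print(ask("does fusion support HL7 ADT feeds", lambda q: "RAG answer"))
```

Cache hits cost nothing in model tokens and return instantly, which is why this ranks as the highest-ROI optimization for repetitive RFP questions.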
Thank You
Questions? Let's go deep on anything.
Connor England
Backup: Harnesses
A Note on Harnesses
Models are commoditizing fast. The real moat is the harness: deterministic workflows that govern when and how intelligence gets applied. Two projects shaped how I think about this.
Block's Goose
Open-source, LLM-agnostic agent framework. MCP as the sole integration standard. Local-first, no vendor calls. Now in the Linux Foundation alongside MCP itself.
Stripe's Minions
Built on a Goose fork. 1,300+ PRs/week, zero human-written code. Key innovation: "Blueprints" (hybrid deterministic + agentic nodes). ~500 tools available, ~15 curated per task.
Request
→
Pre-hydrate Context
→
LLM Tool Call
→
Execute Tool
→
Context Revision
→
Response
Where these influenced Fusion Intel
Blueprints pattern (from Minions): our confidence scoring and citation linking are deterministic nodes. Retrieval and generation are the agentic nodes. The LLM never decides whether to refuse; the harness does.
Context pre-hydration (from Minions): we retrieve, rerank, and assemble the context payload before the generation model sees anything. The LLM's job is writing, not searching.
Tool curation (from Minions): a Legal query only sees Legal_* Weaviate classes + General. We scope tools per domain rather than exposing everything.
MCP as universal integration (from Goose): one protocol for all tool access. Same brain, any interface.
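The Blueprints pattern described above can be sketched as a pipeline where deterministic and agentic nodes alternate. Everything here is illustrative (the `harness` function, hit shapes, and fakes are invented); the structural point is that refusal and citation checks live in code, outside the model.

```python
def harness(query, retrieve, generate):
    """Blueprint-style sketch: retrieve/generate are the agentic nodes;
    scoring, the refusal gate, and the citation check are plain code.
    The LLM never decides whether to refuse -- the harness does."""
    chunks = retrieve(query)                           # agentic node
    score = sum(c["weight"] for c in chunks)           # deterministic node
    if score < 5:                                      # refusal gate in code
        return {"status": "INSUFFICIENT EVIDENCE"}
    draft = generate(query, chunks)                    # agentic node
    if not all(c["source"] in draft for c in chunks):  # citation check
        return {"status": "REJECTED_UNCITED"}
    return {"status": "ANSWERED", "answer": draft}

def fake_retrieve(q):
    return [{"weight": 3, "source": "Integration Guide p.12"},
            {"weight": 2, "source": "SLA Matrix §3.1"}]

def fake_generate(q, chunks):
    return "Yes. [Integration Guide p.12] SLA: 10 days. [SLA Matrix §3.1]"

print(harness("HL7 ADT feeds?", fake_retrieve, fake_generate)["status"])
```

Swapping the model means swapping only the agentic nodes; the governance logic is untouched, which is what makes every component replaceable.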
Backup: UX
UX Decisions Under the Hood
Streaming responses. Users see generation in real-time, reducing perceived latency by ~60%.
Citations rendered as clickable links to source Confluence pages, with excerpt previews on hover.
Confidence badge shown prominently. Users learn to trust the system because it tells them when it's unsure.
Query history with re-ask capability. Refine without retyping.
Admin view shows the full retrieval trace: what was searched, what was retrieved, what was filtered, what was cited.
Backup: Chunking
Chunking Strategy
Semantic chunking: split on heading boundaries first, then paragraph breaks, then sentence boundaries.
Target: 300–500 tokens per chunk. Enough context to be useful, small enough for precise retrieval.
Overlap: 50-token sliding window between chunks to preserve context at boundaries.
Chunk boundaries matter more than chunk size. A well-placed split preserves meaning; a bad one destroys it.
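The split hierarchy above can be sketched as follows. This is a simplified illustration: it approximates tokens with whitespace words and handles only markdown-style headings; the real chunker would use a proper tokenizer and Confluence's structure.

```python
import re

def chunk(text: str, target: int = 400, overlap: int = 50) -> list[str]:
    """Split on heading boundaries first, then paragraph breaks,
    with a sliding-window overlap between emitted chunks."""
    # 1) split before markdown-style headings, 2) then on blank lines
    sections = re.split(r"\n(?=#+ )", text)
    units = [p for s in sections for p in s.split("\n\n") if p.strip()]
    chunks, current = [], []
    for unit in units:
        words = unit.split()
        if current and len(current) + len(words) > target:
            chunks.append(" ".join(current))
            current = current[-overlap:]   # 50-word overlap preserves context
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "# HL7\n\n" + ("word " * 300) + "\n\n" + ("data " * 300)
parts = chunk(doc)
print(len(parts))  # → 2
```

Because units are whole paragraphs, a split never lands mid-sentence, which is the "well-placed split preserves meaning" property in practice.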
Backup: Data Quality Deep Dive
The RAG Quality Cliff
Two independent studies confirm the same pattern: vector search accuracy degrades meaningfully as corpus size grows, even with good embeddings.
EyeLevel Research, 2024
Tested Pinecone + OpenAI ada-002 embeddings across 1K, 10K, and 100K pages. 310 real answer-bearing pages + filler. 92 test questions, human-evaluated.
1K pages
baseline
10K pages
visible drop
100K pages
–12%
Expected degradation: 10-12% per 100K pages. A "page" is a full document page (PDF, HTML), not a chunk.
Sarkar (TDS), Jan 2026
Focused on HNSW algorithm behavior. With fixed index parameters (M=32, ef_search=128), recall degrades as vector count scales.
10K vectors
99%
1M vectors
~92%
10M vectors
85%
A "vector" is a single chunk embedding. 46K Confluence pages at ~10-20 chunks each = 600K-1M vectors.
Why this matters for system design
The degradation is a function of noise ratio, not a magic number. The more irrelevant content in your index relative to any given query, the harder it is to surface the right chunks. Both studies confirm this independently.
Mitigation strategies compound. Domain-isolated classes (search 2K-5K vectors, not 1M). Authority tier weighting (gold sources float above noise). Hybrid search (BM25 catches exact matches that vector search misses at scale). Cross-encoder reranking (re-scores the top-N for true relevance). Each layer recovers precision that raw vector search loses.
The "ingest everything" strategy has a structural ceiling. Teams adopting "ingest all docs of types X, Y, Z" into one index will hit this wall. The question isn't if, it's when. Class decomposition and curation delay the cliff indefinitely.
For Boltline: this matters even more in a multi-tenant manufacturing context. Customers with large volumes of data, especially multimodal (work plan docs, CAD references, inspection photos, test logs), need thoughtful class design and search strategies from day one. A platform that lets customer data grow unchecked into a monolithic index will hemorrhage effectiveness at scale. The right approach is guiding customers toward well-structured classes, curated ingestion policies, and domain-scoped search, so that the system gets more useful as data grows instead of less.