Deep Horizon — Scaling AI collaboration

/01·B

◆ Industries · same agent · six deployments

One agent.
Six knowledge
frontiers.

Same RL-trained policy. Same API. Specialized to the way each industry actually stores, references, and recalls knowledge — and the exact queries each team gets stuck on.

/ 01 · classified knowledge under hard isolation

Defense
& Intel.

Operational decision recall across classification boundaries. Tool gating respects compartment walls — the agent never queries a corpus it wasn't admitted to. Self-hosted on your GPUs, your network, your authority to operate.

14 tool calls

Median to a grounded answer · vs. 110 on majority vote

0 hops

In-process · sub-ms tool dispatch · zero egress

◆ Example queries handled

◆ COA reconstruction · J3 / planning

"Reconstruct the COA review chain for OP-77 Phoenix — every dissent, every reason, every cited assessment."

◆ Analyst attribution · cyber

"Who on the intel team has reported on Helios-cluster activity in the last 90 days, and which assessments converged?"

◆ Cross-source synthesis · logistics

"What did the J4 conclude about the November supply gap, who supported that conclusion, and who pushed back?"

/ 02 · institutional memory across PIs, papers, grants

Research
Labs.

Institutional memory across PIs, papers, grants, and lab notebooks. Resolves authorship chains, cites the meeting where a hypothesis was first floated, surfaces who actually knows what across the lab.

50.2%

Profile completeness on 18-person research bench

67 leaves

Avg. structured profile slots populated per entity

◆ Example queries handled

◆ Profile extraction · PI onboarding

"Build me Aisha Patel's research profile — papers, collaborators, grant history, expertise, recent topics."

◆ Collaborator discovery · cross-lab

"Who in the lab has worked on attention mechanisms with Boris Katz, and what did they each conclude independently?"

◆ Hypothesis archaeology · internal

"When did we first propose the gradient-routing approach, who pushed back, and what was the resolution?"

/ 03 · deal & relationship intelligence

Financial
Services.

Deal & relationship intelligence over IB chat, CRM, and email. Answers "who at Lazard did we last talk to about MidCap Energy" with a citation chain — not a document hit.

$0.012

Per complex query · VGS k=2 · agent-callable

15–25×

Cheaper than frontier API per reasoning query

◆ Example queries handled

◆ Touchpoint reconstruction · coverage

"What was the last touchpoint with Goldman on the Helios financing, and what did they push back on?"

◆ Account history · relationship

"Who from the team has covered the MidCap Energy account in the last 18 months, and what's been said about credit?"

◆ Internal sentiment · risk

"Synthesize all internal commentary on Q3 default risk for the EMEA book — by sector, by analyst, by date."

/ 04 · clinical-trial memory · protocol recall

Life
Sciences.

Clinical-trial memory and protocol recall across CRO threads, IRB amendments, and investigator notes. Time-aware reasoning resolves "as of the v3 amendment" without re-indexing the corpus.
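
One way to read "time-aware reasoning without re-indexing" is that every indexed record keeps an effective date and the cutoff is applied as a filter at query time. The sketch below illustrates that idea only; `Record`, `as_of`, the field names, and the sample dates are all hypothetical and not taken from Deep Horizon.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    doc_id: str
    effective: date  # date the statement took effect (assumed field)
    text: str


def as_of(records, cutoff):
    """Return the records effective on or before cutoff, oldest first."""
    visible = [r for r in records if r.effective <= cutoff]
    return sorted(visible, key=lambda r: r.effective)


# Illustrative data only; not drawn from any real trial.
corpus = [
    Record("protocol-v4", date(2024, 11, 20), "dose 7.5 mg"),
    Record("protocol-v2", date(2024, 1, 10), "dose 10 mg"),
    Record("protocol-v3", date(2024, 6, 1), "dose 5 mg"),
]

# View of the corpus "as of the v3 amendment": v4 is filtered out, not deleted.
v3_view = as_of(corpus, date(2024, 6, 1))
latest_at_v3 = v3_view[-1]
```

Because the cutoff is a query-time filter, the same index can answer both "as of v3" and "as of today" questions.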

±3 pp

Parity gate · deployed API tracks eval-harness within

audit-ready

Every answer carries a citation chain to source

◆ Example queries handled

◆ Amendment trace · protocol

"What did the principal investigator decide about the v3 dosing change, and what was the IRB rationale?"

◆ Site signal · operations

"Which sites flagged the cohort B drop-out spike, when, and what was the resolution per site?"

◆ Decision archaeology · steering committee

"Reconstruct every protocol amendment to TRIAL-42, who drove each, and what evidence supported it."

/ 05 · case & precedent recall

Legal
& Compliance.

Case & precedent recall across discovery, depositions, and partner correspondence. The agent surfaces the analogous matter — not just the matching word — and grounds every answer in source.

12 → 1

Hours of associate review collapsed to one query

cited

Every claim traced back to the source document

◆ Example queries handled

◆ Precedent search · firm-wide

"Which of our matters since 2022 involve a similar consent-decree carve-out to Acme v. Olson?"

◆ Drafting history · deal

"Who drafted the indemnification clause in the Phoenix deal and what was the negotiation history with opposing counsel?"

◆ Correspondence pull · discovery

"Surface every email between us and opposing counsel about the privilege log in this matter — chronologically."

/ 06 · inter-agency knowledge · namespace isolation

Public
Sector.

Inter-agency knowledge with hard team isolation. Each agency gets its own memory namespace. API keys are team-scoped. No cross-jurisdiction data leakage by construction — enforced by the gating plugin, not policy.
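
To make "enforced by the gating plugin, not policy" concrete, here is a minimal sketch of a gate that runs before any retrieval. It assumes each bearer key maps to exactly one namespace; the registry, the `gate` and `search` functions, and the sample documents are hypothetical stand-ins, not the actual plugin.

```python
class NamespaceError(PermissionError):
    """Raised when a key tries to read a namespace it is not scoped to."""


# Hypothetical registry: bearer key -> the single namespace it may read.
KEY_TO_NAMESPACE = {
    "key-agency-a": "agency-a",
    "key-agency-b": "agency-b",
}


def gate(api_key, requested_namespace):
    """Return the namespace if the key is scoped to it, otherwise raise."""
    allowed = KEY_TO_NAMESPACE.get(api_key)
    if allowed is None:
        raise NamespaceError("unknown key")
    if allowed != requested_namespace:
        raise NamespaceError("key is not scoped to this namespace")
    return allowed


def search(api_key, namespace, query, stores):
    """Toy substring search that can only run after the gate has passed."""
    ns = gate(api_key, namespace)
    return [doc for doc in stores[ns] if query.lower() in doc.lower()]


stores = {
    "agency-a": ["Rural broadband memo", "Grant revision notes"],
    "agency-b": ["Supply-chain incident summary"],
}
```

The design point is ordering: the retrieval call cannot be reached without passing through `gate`, so a cross-namespace query fails before any corpus is touched.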

0 egress

Self-hosted · per-agency namespace · sub-ms dispatch

N keys

Per-agency bearer-auth · no cross-namespace queries

◆ Example queries handled

◆ Cleared synthesis · DHS / DOJ

"What did DHS conclude about the December supply-chain incident, and who has clearance to see the underlying source?"

◆ Policy archaeology · program office

"Reconstruct the policy reasoning behind the 2025 grant program revision — every memo, every author, every dissent."

◆ Working group lookup · inter-agency

"Who across the inter-agency working group has worked on rural broadband, and what did each agency conclude?"

/04

◆ Training loop · why RL, not prompts

The agent learns which
search strategies
actually work.

A prompted agent uses the same strategy every time. An RL-trained agent has learned — from thousands of rollout trajectories — when to cross-reference, when to go deeper, and when it has enough evidence to commit. This is learned behavior, not instructed behavior.

/ 01 · input

Corpus

Team conversations, decisions, docs, code reviews — the substrate.

▸

/ 02 · generate

Question gen.

Model generates its own training questions from the corpus.

▸

/ 03 · explore

Agent rollouts

Search · reason · answer. Thousands of trajectories per iteration.

▸

/ 04 · score

Reward signal

Nugget-coverage scoring (Voorhees) — did the agent get it right?

▸

/ 05 · update

OAPL update

Off-policy RL. The policy learns which strategies actually pay off.

◂ iterate · 5.5× compound improvement from base (8.2%) to iter 2b + VGS (45.5%)
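
The scoring step can be pictured as follows. Nugget-based evaluation breaks a reference answer into atomic facts ("nuggets") and rewards the fraction the agent's answer covers. This is a minimal sketch: the case-insensitive substring match is an assumption standing in for whatever matcher the real reward uses, and `nugget_coverage` is a hypothetical name.

```python
def nugget_coverage(answer, nuggets):
    """Fraction of reference nuggets found in the answer (case-insensitive)."""
    if not nuggets:
        return 0.0
    text = answer.lower()
    hits = sum(1 for nugget in nuggets if nugget.lower() in text)
    return hits / len(nuggets)


# Illustrative nuggets for the auth-rewrite question used elsewhere on this page.
nuggets = ["sarah chen", "token rotation", "march 5"]
answer = "Sarah Chen led the rewrite; the key constraint was token rotation."

reward = nugget_coverage(answer, nuggets)  # two of three nuggets covered
```

A graded score like this gives partial credit for partially correct answers, which is a denser training signal than a pass/fail check.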

◆ Why RL, not just prompting

A prompted agent runs the same heuristic every query. A trained agent has learned — over thousands of rollouts — which search strategies pay off for which question shapes. Learned behavior, not instructed behavior.

◆ Test-time compute · each step counts

Value-Guided Search picks the highest-scoring action at each step: one smart trajectory. Parallel Thinking runs N trajectories and merges the results for maximum coverage. The value model is trained on Deep Horizon rollouts rather than on generic data.

/05

◆ Benchmarks · measured against frontier models · open eval harness

Numbers, not
narratives.

Evaluated on real team corpora and academic benchmarks. Every claimed improvement has a parity gate: the deployed API must reproduce eval-harness numbers within ±3 percentage points.
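
The parity gate reduces to a single comparison. A minimal sketch, assuming both scores are accuracies in percent; the function name and the deployed-side numbers are hypothetical.

```python
def passes_parity_gate(harness_pct, deployed_pct, tolerance_pp=3.0):
    """True when the deployed score is within tolerance_pp of the harness score."""
    return abs(deployed_pct - harness_pct) <= tolerance_pp


# 45.5 is the harness figure quoted on this page; the deployed values are made up.
ok = passes_parity_gate(45.5, 43.1)       # 2.4 pp apart
drifted = passes_parity_gate(45.5, 41.0)  # 4.5 pp apart
```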

/A · 18-person bench · profile extraction

Structured-profile coverage

Given a person's name and a team memory corpus, extract a complete structured profile across identity, professional background, education, relationships, publications, and all known facts.

Model	Accuracy	Cost / Profile
Deep Horizon iter-2b + PT N=10	50.2%	$0.15
Claude Sonnet 4.6	43.8%	$1.14
GLM 4.5 Air (base, no RL)	8.2%	$0.04

+6.4 pp over Sonnet. 7.6× cheaper. 5× faster. Method: 10 independent agent rollouts merged by per-leaf union aggregation, a novel mechanical merge that maximizes coverage and adds no facts beyond those a rollout produced.

/B · OAPL 101-question bench · factoid QA

Short-answer factual recall

Answer short factual questions (1–5 words) about people, events, and relationships in the team corpus. Tool-call efficiency matters as much as accuracy.

Model	Accuracy	Tool calls
Deep Horizon iter-2b + VGS k=2	45.5%	~14
DH iter-2b + Majority Vote N=10	39.6%	~110
GLM 4.5 Air (base, no RL)	23.0%	~12

+5.9 pp over majority vote with 8× fewer tool calls. Value-Guided Search uses a trained value model (Qwen3-4B) to pick the best action at each step — smarter, not just more compute.

/05·B

◆ Training trajectory · 5.5× improvement through RL

From 8% to 45.5%.

Chart · accuracy by training iteration on the OAPL 101-question factoid QA bench · axis 0% to 50%

Iteration	Acc.	What changed
Base · no RL	8.2%	GLM 4.5 Air out-of-the-box
Iter 1	23.0%	First OAPL training run (reward-function fix was critical)
Iter 1.1	28.7%	Added extraction questions to training mix
Iter 2b	33.7%	Multi-benchmark training (HotpotQA, MuSiQue, QAMPARI, FinanceBench)
Iter 2b + VGS	45.5%	Test-time compute · Value-Guided Search
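
The headline multiple can be recomputed from the iteration table above. This small sketch only does that arithmetic; the variable names are arbitrary and the numbers are copied from the table.

```python
# Accuracy by iteration, in percent, as listed in the table.
trajectory = [
    ("Base", 8.2),
    ("Iter 1", 23.0),
    ("Iter 1.1", 28.7),
    ("Iter 2b", 33.7),
    ("Iter 2b + VGS", 45.5),
]

# Percentage points gained at each step relative to the previous one.
gains_pp = [
    (later[0], round(later[1] - earlier[1], 1))
    for earlier, later in zip(trajectory, trajectory[1:])
]

# End-to-end ratio behind the "5.5x" headline.
overall_ratio = round(trajectory[-1][1] / trajectory[0][1], 1)
```

The per-step view shows the two largest jumps: the first OAPL run (+14.8 pp) and the addition of Value-Guided Search at inference time (+11.8 pp).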

/06

◆ Test-time compute · dispatched per task shape

Two inference
strategies. Pick
the right one.

The dispatcher inspects the request, picks a strategy, and the agent inherits the right inference budget. Parallel Thinking buys breadth. Value-Guided Search buys depth.

Best for · profile extraction · schema-fill tasks

Parallel Thinking.

How it works. Spawn N independent agent rollouts in parallel. Each searches the corpus independently and produces a candidate answer. For structured profiles, aggregate with per-leaf union (our novel aggregator); for short answers, an LLM aggregator.

Why it works. Different rollouts find different facts. Union aggregation combines coverage from all rollouts without hallucination — a fact only ships if a rollout cited it.

→ 50.2% on profile extraction · +6.4 pp over Sonnet · $0.15 / profile
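
The per-leaf union described above can be sketched in a few lines. This is an illustration under assumptions, not the shipped aggregator: it treats a profile as a nested dict, treats each list item or scalar as one leaf value, and keeps every distinct value any rollout produced. The function names and sample rollouts are hypothetical.

```python
def flatten(profile, prefix=()):
    """Yield (path, value) pairs for every leaf value in a nested profile."""
    for key, value in profile.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        elif isinstance(value, list):
            for item in value:
                yield path, item
        else:
            yield path, value


def per_leaf_union(rollouts):
    """Merge rollouts so each leaf keeps every distinct value seen in any rollout."""
    merged = {}
    for profile in rollouts:
        for path, value in flatten(profile):
            values = merged.setdefault(path, [])
            if value not in values:  # keep first-seen order, drop repeats
                values.append(value)
    return merged


rollouts = [
    {"role": "Senior Backend Engineer", "owns": ["auth middleware"]},
    {"owns": ["session mgmt", "auth middleware"], "expertise": ["security"]},
]
merged = per_leaf_union(rollouts)
```

The merge is purely mechanical: no model is called during aggregation, so every value in `merged` traces back to a specific rollout. How conflicting scalar values (for example, two different roles) should be resolved is left open here.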

Best for · factoid questions · short-answer retrieval

Value-Guided Search.

How it works. At each step, sample k candidate actions from the policy. Score each candidate with a trained value model (Qwen3-4B fine-tuned on Deep Horizon rollouts). Execute the highest-scoring action. Repeat until the agent commits.

Why it works. Instead of more rollouts (breadth), VGS makes each rollout smarter (depth). The value model learns which search queries and reasoning paths lead to correct answers.

→ 45.5% on factoid QA · +5.9 pp over majority vote · 8× fewer tool calls
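
The loop described above (sample k candidates, score them, execute the best, repeat) can be sketched generically. Everything here is a toy stand-in: the real policy and value model are neural networks, while this example uses deterministic functions over an integer state so the control flow is easy to follow. All function names are hypothetical.

```python
def value_guided_step(state, propose, value, k=2):
    """Sample k candidate actions and return the one the value function scores highest."""
    candidates = [propose(state, i) for i in range(k)]
    return max(candidates, key=lambda action: value(state, action))


def run(state, propose, value, apply, is_done, k=2, max_steps=20):
    """Repeat value-guided steps until the agent commits or the step budget runs out."""
    trace = []
    for _ in range(max_steps):
        if is_done(state):
            break
        action = value_guided_step(state, propose, value, k)
        trace.append(action)
        state = apply(state, action)
    return state, trace


# Toy environment: the state is an integer, actions add to it, and the
# "value model" prefers whichever action lands closest to a target of 10.
def propose(state, i):
    return [1, 3][i % 2]


def value(state, action):
    return -abs(10 - (state + action))


def apply(state, action):
    return state + action


def is_done(state):
    return state >= 10


final_state, trace = run(0, propose, value, apply, is_done, k=2)
```

With k=2 the cost per step is two policy samples and two value-model scores, but only one action is executed, which is consistent with the low tool-call counts reported for VGS.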

	KARL · Databricks	Deep Horizon
Training algorithm	OAPL	OAPL · same
Architecture	aroll harness + lifecycle plugins	aroll harness + lifecycle plugins · same framework
Test-time compute	Parallel Thinking + Value-Guided Search	PT + VGS · same strategies
Application domain	Academic QA benchmarks · HotpotQA, MuSiQue, QAMPARI	Team knowledge · people, decisions, relationships, expertise
Novel contribution	Proved RL works for knowledge agents	Per-leaf union for structured extraction · promptless recall · agent-to-agent API
Target user	Researchers	Engineering teams using AI daily

/09

◆ Agent-to-agent · the collaboration layer

The next step is
AI-to-AI.
Your stack is ready.

Human-AI collaboration is solved. The next step is AI-AI. Every agent in your stack needs the same team context. Same multi-hop reasoning. Same API. Structured JSON in, structured JSON out.

◆ code review

needs module ownership and the constraints the owner set.

◆ planning

needs architectural decisions and the rationale behind them.

◆ incident response

needs the history of similar issues and how they were resolved.

◆ onboarding

needs months of team context synthesized for a new hire.

◆ release notes

needs cross-PR narrative — who shipped what, why, what landed together.

// any agent in your stack can call this

# Agent-to-agent knowledge query
# The body is JSON, so declare it; curl -d otherwise sends form-encoded data.
curl -X POST https://api.deephorizon.dev/v1/agent/search \
  -H "Authorization: Bearer $AGENT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Who owns the payment processing module and what were the last 3 architectural decisions affecting it?",
    "team_id": "engineering",
    "model": "iter2b-vgs-k2",
    "caller": "code-review-agent"
  }'

# Same API. Same quality. Agent or human.
{
  "owner": "Sarah Chen",
  "decisions": […3 cited entries…],
  "n_tool_calls": 14,
  "cost_usd": 0.014
}

This is what "scaling AI collaboration" means. Not just human + AI. Human + AI + AI + AI — all reasoning over the same team knowledge, all getting smarter as the underlying agent improves through training.

Operation	Deep Horizon	Claude Sonnet	Savings
Profile extraction	$0.15 / profile	$1.14 / profile	7.6× cheaper
Factoid search · VGS k=2	$0.012 / query	$0.18 – 0.30 / query	15 – 25× cheaper
Factoid search · PT N=10	$0.055 / query	$0.18 – 0.30 / query	3 – 5× cheaper
Always-on orchestration	~$30 / month	per-call pricing	predictable
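
The savings column follows directly from the per-unit prices in the table. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def savings_multiple(ours, theirs_low, theirs_high=None):
    """Return the (low, high) ratio of a baseline price range to our price."""
    high = theirs_low if theirs_high is None else theirs_high
    return round(theirs_low / ours, 1), round(high / ours, 1)


profile = savings_multiple(0.15, 1.14)     # profile extraction
vgs = savings_multiple(0.012, 0.18, 0.30)  # factoid search, VGS k=2
pt = savings_multiple(0.055, 0.18, 0.30)   # factoid search, PT N=10
```

The PT row works out to roughly 3.3× to 5.5×, which the table rounds to "3 – 5×".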

/11

◆ Interface · two endpoints · human or agent caller

Two endpoints.
That's it.

RESTful API. Bearer token. Drop it into any workflow. Model selection is a single field — the dispatcher does the rest.

/v1/agent/search · complex reasoning query

# Complex reasoning query
# The body is JSON, so declare it; curl -d otherwise sends form-encoded data.
curl -X POST https://api.deephorizon.dev/v1/agent/search \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What was the reasoning behind the auth rewrite and who drove it?",
    "team_id": "engineering"
  }'

# Response
{
  "answer": "The auth middleware rewrite was driven by legal/compliance requirements around session token storage. Sarah Chen led the effort, decision finalized March 5. Key constraint: tokens must rotate every 24h...",
  "model_used": "iter2b-vgs-k2",
  "n_tool_calls": 12,
  "elapsed_seconds": 68.4,
  "cost_usd": 0.014
}

/v1/agent/extract · structured knowledge

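# Structured knowledge extraction
# The body is JSON, so declare it; curl -d otherwise sends form-encoded data.
curl -X POST https://api.deephorizon.dev/v1/agent/extract \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "target_entity": "Sarah Chen",
    "team_id": "engineering"
  }'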

# Response
{
  "profile": {
    "name": "Sarah Chen",
    "role": "Senior Backend Engineer",
    "owns": ["auth middleware", "session mgmt"],
    "recent_decisions": ["token rotation", …],
    "collaborators": ["Boris K.", "Alex M."],
    "expertise": ["security", "distributed sys"]
  },
  "n_leaves_populated": 67,
  "cost_usd": 0.15
}
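
For callers that are not shell scripts, the same two requests can be built with the Python standard library. This is a sketch, not an official client: the endpoint paths and body fields are copied from the curl examples, while the `Content-Type` header, the `build_request` helper, and the placeholder key are assumptions. The requests are constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.deephorizon.dev"  # host as shown in the curl examples


def build_request(endpoint, payload, api_key):
    """Build (but do not send) a POST request for /v1/agent/<endpoint>."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/agent/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed; not stated on this page
        },
        method="POST",
    )


search_req = build_request(
    "search",
    {
        "query": "What was the reasoning behind the auth rewrite and who drove it?",
        "team_id": "engineering",
    },
    api_key="YOUR_API_KEY",
)
extract_req = build_request(
    "extract",
    {"target_entity": "Sarah Chen", "team_id": "engineering"},
    api_key="YOUR_API_KEY",
)

# To send one: urllib.request.urlopen(search_req, timeout=120), then json.load() the result.
```

The sample response above reports about 68 seconds of elapsed time, so a generous timeout is a sensible default for the search endpoint.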

/11·B

◆ Available models

Model	Best for	Method	Default for
iter2b-vgs-k2	Factoid questions	Value-Guided Search	/search
iter2b-pt-n10	Profile extraction	Parallel Thinking + per-leaf union	/extract
iter2b-single	Quick baseline	Single rollout, no TTC	—
claude-sonnet-4-6	Fallback	Frontier API path	—
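
The defaults in the table suggest a simple routing rule: use the caller's `model` field when present, otherwise fall back to the endpoint's default. The sketch below encodes only that rule. The model names and defaults come from the table; `pick_model` and its error handling are hypothetical, and the real dispatcher may inspect more of the request.

```python
# Defaults copied from the model table; the routing function itself is an assumption.
DEFAULT_MODEL = {
    "/v1/agent/search": "iter2b-vgs-k2",
    "/v1/agent/extract": "iter2b-pt-n10",
}
KNOWN_MODELS = {
    "iter2b-vgs-k2",
    "iter2b-pt-n10",
    "iter2b-single",
    "claude-sonnet-4-6",
}


def pick_model(endpoint, requested=None):
    """Honor an explicit, known model; otherwise use the endpoint default."""
    if requested is not None:
        if requested not in KNOWN_MODELS:
            raise ValueError(f"unknown model: {requested}")
        return requested
    return DEFAULT_MODEL[endpoint]


default_search = pick_model("/v1/agent/search")
default_extract = pick_model("/v1/agent/extract")
baseline = pick_model("/v1/agent/search", requested="iter2b-single")
```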

Capability	Deep Horizon	KARL · Databricks	Mem0 / Zep	Frontier APIs
RL-trained reasoning agent	● yes	● yes	○ no	○ no
Test-time compute · PT + VGS	● yes	● yes	○ no	○ no
Beats frontier on extraction	+6.4 pp over Sonnet	+pp over GPT-4 (reported)	n/a	baseline
Promptless context injection	● yes	○ no	○ no	○ no
Agent-to-agent knowledge API	● yes	○ no	key-value store	n/a
Structured profile extraction	50.2% accuracy	not addressed	○ no	prompt-only
Open-weight policy model	● yes	● yes	n/a	○ no
Application domain	team collaboration	academic benchmarks	memory storage	general
Pricing · per complex query	$0.012	research only	SaaS tiers	$0.18+

