WIM - Advocate Labs

1 The Commercial Insurance Market

Roughly 30 million commercial insurance policies are in force in the United States, carrying half a trillion dollars in annual premium. Nearly every policy securing debt is reviewed against a lender’s, broker’s, or GSE’s written requirements at origination and at every renewal. A single review decomposes into hundreds of discrete checks; a Freddie Mac program produces over 1,000. Multiply through and the market executes billions of expert insurance tasks a year, in many cases entirely by hand. A missed check is an uninsured loss with a legal owner: an errors-and-omissions event.

The review itself is a verification problem with three unstructured inputs: a written requirement set stating what must be true, the asset’s risk profile determining which requirements activate, and the policy as issued, in bespoke language from one of 1,500+ carriers, stating what is actually true. A complete review answers one question hundreds of times over: does the coverage in force satisfy every requirement this case activates? Every unsatisfied requirement becomes a flagged gap the borrower must cure. This is the problem.

2 The Largest Legal Niche

Commercial insurance is a legal contract problem. Frontier models face the same challenges here as in every legal vertical: deep contextual reasoning, exact factual accuracy, and judgment calls that are ultimately a human’s to make. Policy checking has one property the rest of legal work lacks: much of the reasoning can be offloaded to rule-based logic. When what is being protected is properly mapped to what protections are required, the problem becomes partially deterministic, unlocking accuracy gains and significantly reducing cost. Reading and interpreting documents then takes place inside a frame of decisions the rules have already made.

Run 000 measures what that framing is worth. Blind, the best frontier model finds 37% of the real coverage gaps in a case. Paired with WIM, the best finds 63%. The deterministic rule engine finds 74% at twenty-six cents per review, before a human touches the file; a licensed reviewer finds 96% at $67.50. Twelve of the nineteen adjudicated gaps were found by zero models in the blind condition, because the requirements that flag them exist only in the compiled rule set, not in the published guide.

System	Macro accuracy	Real gaps found	Cost / review
Claude Sonnet 4.6 (Anthropic · WIM provided)	62.1%	63%	$0.45
GPT-5.5 (OpenAI · WIM provided)	55.4%	58%	$0.78
Claude Opus 4.8 (Anthropic · WIM provided)	46.5%	11%	$0.99
Gemini 3.1 Pro (Google · WIM provided)	46.0%	47%	$0.40
GPT-5.4 Mini (OpenAI · WIM provided)	45.5%	0%	$0.09
Claude Fable 5 (Anthropic · WIM provided, text condition)	44.3%	32%	$3.63
Grok 4.20 (xAI · WIM provided)	41.0%	26%	$0.12
GPT-5.4 (OpenAI · WIM provided)	39.7%	37%	$0.32
Claude Haiku 4.5 (Anthropic · WIM provided)	36.1%	0%	$0.14

Table 1. Run 000R WIM-condition results (same systems as the solid dots in Figure 1), ranked by class-balanced macro accuracy across verdict statuses. "Real gaps found" matches Figure 1's x-axis for the same systems. Cost is actual billed per complete review (two-review average). Pilot data; n grows with each added case.
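The caption does not spell out how "class-balanced macro accuracy" is computed. A common reading is the unweighted mean of per-class accuracy, so that a rare verdict status counts as much as a common one. The sketch below assumes that definition; the function name and the status labels ("met", "gap") are hypothetical and not taken from the benchmark.

```python
from collections import defaultdict

def macro_accuracy(gold, predicted):
    """Unweighted mean of per-class accuracy (assumed definition)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no labels to score")
    total = defaultdict(int)    # checks per gold status
    correct = defaultdict(int)  # correctly predicted checks per gold status
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    # Each status contributes equally, regardless of how many checks carry it.
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical verdict statuses, for illustration only.
gold = ["met", "met", "met", "met", "gap", "gap"]
pred = ["met", "met", "met", "met", "met", "gap"]
score = macro_accuracy(gold, pred)  # (4/4 + 1/2) / 2 = 0.75
```

Under this definition a system that labels nearly everything "met" can post a high plain accuracy while its macro accuracy is pulled down by the statuses it misses.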

3 How the Benchmark Works

The benchmark unit is one complete review of a real case. Every contestant receives the same inputs: the case’s insurance and collateral documents, and Chapter 31 of the Freddie Mac Multifamily Guide, the published source of the requirements being checked. Contestants are scored on the share of the case’s real coverage gaps they find, against a gold record built from document-cited adjudications and expert rulings. Cost is captured per review from actual billed inference, never price sheets.
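The scoring described above can be sketched as a recall computation over the gold gaps, plus an average of billed cost. This is a minimal illustration, not the benchmark's scoring code: the function names and gap identifiers are hypothetical, and the choice to ignore flags that are not in the gold record is an assumption, since the text specifies only the share of real gaps found.

```python
def gaps_found_share(gold_gaps, flagged):
    """Share of gold-record gaps that the contestant flagged."""
    gold = set(gold_gaps)
    if not gold:
        raise ValueError("gold record has no gaps to score against")
    # Flags outside the gold record neither help nor hurt this metric (assumption).
    return len(gold & set(flagged)) / len(gold)

def cost_per_review(billed_amounts):
    """Average actual billed inference cost across completed reviews."""
    if not billed_amounts:
        raise ValueError("no reviews billed")
    return sum(billed_amounts) / len(billed_amounts)

# Hypothetical gap identifiers and billed amounts, for illustration only.
gold = {"flood_limit", "eq_sublimit", "building_limit", "ordinance_law"}
flagged = {"flood_limit", "eq_sublimit", "terrorism"}
share = gaps_found_share(gold, flagged)  # 2 of 4 gold gaps found
avg_cost = cost_per_review([0.40, 0.50])
```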

The review runs under three conditions, the rungs visible in Figure 1. Blind: the model must construct the review itself from the guide and the documents, deciding which requirements the case activates and evaluating each one. WIM condition: the model is additionally handed the WIM-generated check set for the case, an enumeration that does not exist outside our system. Advocate: WIM deterministically decides which rules the case activates, the model only reads and extracts, and the rule engine computes every verdict. The distance between the rungs is the measured value of each layer.

The blind condition exposes the part nobody benchmarks: knowing which rules are in play. Handed 160 candidate checks and asked which apply to the case, frontier models agreed with the engine’s deterministic applicability at most a third of the time. They issued confident verdicts on builder’s-risk rules for cases with no construction, and on blanket-policy rules for certificate cases. Most of a review’s failure surface is upstream of reading the policy.
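The text does not say how agreement with the engine was computed. One simple reading is per-check agreement: for each candidate check, does the model's applies / does-not-apply call match the engine's? The sketch below assumes that reading; the function name and check identifiers are hypothetical.

```python
def applicability_agreement(engine, model):
    """Fraction of candidate checks where model and engine make the same call."""
    if engine.keys() != model.keys():
        raise ValueError("both must cover the same candidate checks")
    if not engine:
        raise ValueError("no candidate checks")
    agree = sum(1 for check in engine if engine[check] == model[check])
    return agree / len(engine)

# Hypothetical case: no construction under way, certificate-only evidence.
engine = {"flood_coverage": True, "builders_risk": False, "blanket_limit": False}
# A model that treats every candidate check as applicable.
model = {"flood_coverage": True, "builders_risk": True, "blanket_limit": True}
rate = applicability_agreement(engine, model)  # agrees on 1 of 3 checks
```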

4 WIM and the Rule Engine

The World Insurance Model is five years of R&D making the review computable: a taxonomy of commercial insurance across 100+ asset types, and a rule engine that expresses every requirement as a deterministic, machine-evaluable condition, written in a hybrid of our own DSL and plain English. WIM ingests the three inputs defined in Section 1 (the requirement set, the risk profile, the policy as issued) and emits the checks the case activates. A flood determination showing Zone AE activates the flood requirement and emits a task for a model or insurance expert to complete, grounded to the documents that triggered it. The rule set is fully customizable, and the deterministic layer carries zero hallucination risk.

[Figure 2 diagram]
Unstructured in:
Requirement set: agency guide · loan covenants
Risk profile: flood zone · construction · asset type
Policy documents: declarations · schedules · forms
WIM: taxonomy · rule engine
Deterministic tasks out:
Confirm flood coverage in force at required limit ← flood determination shows Zone AE
Check earthquake sublimit against requirement ← endorsement TP T1 82 attached
Resolve per-location building limit ← schedule reads “Include in Blanket 1”

Figure 2. WIM compiles a case's unstructured inputs (the requirement set, the asset's risk profile, and the policy as issued) into the deterministic checks that case requires.
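The activation step can be illustrated with a small sketch. It is written in plain Python, not WIM's DSL, and everything in it is hypothetical: the `Rule` class, the `RULES` list, the profile keys, and in particular the test that treats any flood zone starting with "A" or "V" as triggering the flood requirement. It shows only the shape of the idea: activation is a pure function of the risk profile, so the same profile always yields the same tasks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    rule_id: str
    task: str                          # task emitted when the rule activates
    activates: Callable[[dict], bool]  # deterministic test on the risk profile

# Hypothetical rules; the activation conditions are assumptions for illustration.
RULES = [
    Rule(
        "flood",
        "Confirm flood coverage in force at required limit",
        lambda p: p.get("flood_zone", "").upper().startswith(("A", "V")),
    ),
    Rule(
        "builders_risk",
        "Confirm builder's risk coverage in force",
        lambda p: bool(p.get("under_construction", False)),
    ),
]

def activated_tasks(profile: dict) -> list[str]:
    """Return the tasks for every rule this risk profile activates."""
    return [rule.task for rule in RULES if rule.activates(profile)]

profile = {"flood_zone": "AE", "under_construction": False}
tasks = activated_tasks(profile)
# tasks == ["Confirm flood coverage in force at required limit"]
```

Because no model is involved in deciding which tasks exist, a builder's-risk task cannot appear for a case with no construction.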

5 Composition of the Advocate Harness

The Advocate Harness outperforms raw frontier inference on both accuracy and cost through two assets: WIM, and gold-label evals. The harness is WIM at inference time: a division of cognitive labor in which the model does only what software cannot. It reads, interprets, and extracts, and hands everything deterministic to the rule engine through tool calls and prompting. Typed schemas make malformed answers impossible. Failed validations bounce back with the objection attached. Requirement math costs no tokens and admits no hallucination.
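A minimal sketch of the validate-and-bounce pattern described above, using only the Python standard library. The schema (`LimitExtraction`), its field names, and the helper functions are hypothetical; the harness's actual schemas and tool-call interface are not described in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LimitExtraction:
    coverage: str
    limit_usd: int
    source_page: int

FIELDS = {"coverage": str, "limit_usd": int, "source_page": int}

def validate(raw: dict):
    """Return (extraction, None) on success or (None, objection) on failure."""
    for name, typ in FIELDS.items():
        if name not in raw:
            return None, f"missing field '{name}'"
        value = raw[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, typ) or isinstance(value, bool):
            return None, (
                f"field '{name}' must be {typ.__name__}, "
                f"got {type(value).__name__}"
            )
    if raw["limit_usd"] < 0:
        return None, "limit_usd must be non-negative"
    return LimitExtraction(**{k: raw[k] for k in FIELDS}), None

def meets_requirement(extraction: LimitExtraction, required_limit_usd: int) -> bool:
    """Requirement math runs in code, not in the model."""
    return extraction.limit_usd >= required_limit_usd

# A malformed model answer is rejected, and the objection is sent back.
_, objection = validate({"coverage": "flood", "limit_usd": "$1,000,000", "source_page": 3})
# objection == "field 'limit_usd' must be int, got str"

# A well-formed answer passes and is handed to the deterministic comparison.
extraction, _ = validate({"coverage": "flood", "limit_usd": 1_000_000, "source_page": 3})
satisfied = meets_requirement(extraction, 500_000)  # True
```

In this arrangement the model's only job is to produce the three extracted fields; whether the limit satisfies the requirement is never left to generation.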

Gold labels iteratively improve the harness. A team of 50+ insurance experts works inside WIM’s own environment, producing verified gold labels for live tasks as a byproduct of doing the work. Every label is a graded episode: the same checks the benchmark scores, with the expert’s answer attached