APEXE3 AI Benchmarks — Capital Markets & ESG

Headline numbers

81.0%

APEXE3 agent + GLM-5.1 on Daloopa FinRetrieval (500 Qs · LLM-judge). Beats GPT-5.2 + Reasoning (70.8%), Gemini 3 Pro + Reasoning (69.2%), and Claude Opus 4.5 (19.8%).

83%

APEXE3 Coder Harness + GLM-5.1 on ESGbench (13 issuer PDFs, 159 QA pairs)

62.1%

GLM-5.2 on SWE-bench Pro — beats GPT-5.5 (58.6%) by +3.5pp (Z.ai, verified 2026-06-13)

76%

APEXE3 Coder Harness + Mistral Small 4 on ESGbench — EU-sovereign open weights beating public frontier zero-shots

+18 pp

Biggest APEXE3 Coder Harness uplift — on Mistral Small 4 (58% → 76% on ESGbench)

Capital-markets retrieval — Daloopa FinRetrieval, 500 questions

APEXE3 internal evaluation · April 2026 · LLM-judge scoring

Result

Accuracy on Daloopa's 500-question benchmark · LLM-judge scoring · arXiv 2603.04403

Configuration	Accuracy
APEXE3 agent + GLM-5.1	81.0%
GPT-5.2 + WebSearch + Reasoning	70.8%
Gemini 3 Pro + WebSearch + Reasoning	69.2%
Claude Opus 4.5 + WebSearch + Reasoning	19.8%

An open-weights reasoning model is paired with the APEXE3 agent, which contributes domain-specific reasoning, targeted query construction, and careful handling of heterogeneous primary-source formats. The lift comes from how the agent reasons and searches, not from raw model scale.
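To put the reported accuracies in context, the sketch below converts each percentage back into an implied count of correct answers out of 500 and attaches a 95% Wilson score interval. This is an illustration only: it assumes each question is an independent pass/fail trial and ignores any noise from the LLM judge, neither of which is stated in the source. The function names are hypothetical.

```python
import math

N_QUESTIONS = 500  # benchmark size stated in the source

# Accuracies (%) as reported in the result table.
REPORTED = {
    "APEXE3 agent + GLM-5.1": 81.0,
    "GPT-5.2 + WebSearch + Reasoning": 70.8,
    "Gemini 3 Pro + WebSearch + Reasoning": 69.2,
    "Claude Opus 4.5 + WebSearch + Reasoning": 19.8,
}


def implied_correct(accuracy_pct: float, total: int = N_QUESTIONS) -> int:
    """Number of correct answers implied by a reported percentage."""
    return round(accuracy_pct / 100 * total)


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    for name, acc in REPORTED.items():
        k = implied_correct(acc)
        lo, hi = wilson_interval(k, N_QUESTIONS)
        print(f"{name}: {k}/{N_QUESTIONS} correct, 95% CI {lo:.1%} to {hi:.1%}")
```

With 500 questions the interval is a few percentage points wide on each side, which is worth keeping in mind when comparing configurations that sit close together.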

Benchmark composition

500 Qs · 41 countries · 475 distinct issuers · fiscal periods 2015–2026

Top issuer HQ	%	Question category	%
United States	42	Balance sheet	20
Japan	10	Cash flow	19
UK · Australia	5 ea.	Operational KPIs	18
Brazil	4	Income statement	16
India · Canada	3 ea.	Guidance / outlook	15
+34 others	28	Segments / geography	13

How the agent answers a question

"What was Prestige Estates Projects Ltd's consolidated net profit for the period for calendar Q1 2025, in INR millions?" — Daloopa FinRetrieval, Q21

Translate the period. The agent recognises Prestige as an Indian issuer with a March fiscal year-end and reformulates "calendar Q1 2025" as fiscal Q4 FY25 before searching.
Identify the primary source. It issues queries with the fiscal framing; the company's own Q4 FY25 filing surfaces alongside news and aggregators.
Reject a conflicting secondary source. A news article reports ₹25 cr standalone profit; the agent recognises that this is not the consolidated figure the question asks for and continues to the primary filing.
Read a scanned 19-page PDF. It combines an OCR tool (an open-weights small OCR model) with document reading and in-document search to locate the consolidated net profit line for the quarter ending 31 March 2025.

Agent answer: ₹431 million — value and unit match Daloopa ground truth.
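The period-translation step in the walkthrough above can be sketched as a small function. This is not the agent's actual implementation, which the source does not show. It is a minimal illustration that assumes fiscal quarters align with calendar quarters and that a fiscal year is labelled by the calendar year in which it ends. Both function names are hypothetical.

```python
def calendar_to_fiscal_quarter(
    cal_year: int, cal_quarter: int, fy_end_month: int
) -> tuple[int, int]:
    """Map a calendar quarter to (fiscal_year, fiscal_quarter).

    Assumes the fiscal year ends in March, June, September or December,
    so that fiscal quarters line up with calendar quarters.
    """
    if cal_quarter not in (1, 2, 3, 4):
        raise ValueError("cal_quarter must be 1..4")
    if fy_end_month not in (3, 6, 9, 12):
        raise ValueError("fy_end_month must be 3, 6, 9 or 12")

    quarter_end_month = cal_quarter * 3
    # Months elapsed in the fiscal year at the end of this quarter (1..12).
    months_into_fy = (quarter_end_month - fy_end_month - 1) % 12 + 1
    fiscal_quarter = (months_into_fy + 2) // 3
    # Quarters ending after the fiscal year-end belong to the next fiscal year.
    fiscal_year = cal_year if quarter_end_month <= fy_end_month else cal_year + 1
    return fiscal_year, fiscal_quarter


def fiscal_label(fiscal_year: int, fiscal_quarter: int) -> str:
    """Format as e.g. 'Q4 FY25'."""
    return f"Q{fiscal_quarter} FY{fiscal_year % 100:02d}"


if __name__ == "__main__":
    # Prestige Estates: March fiscal year-end, question asks for calendar Q1 2025.
    print(fiscal_label(*calendar_to_fiscal_quarter(2025, 1, fy_end_month=3)))
```

For the Prestige question this yields the same fiscal framing the agent used, Q4 FY25, while the following calendar quarter would map to Q1 FY26.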

Our ESGbench runs — open-weight models on 13 EU/global issuers

Issuer	Report
Apple	2024 ESG
Microsoft	2024 ESG
ExxonMobil	2024 Climate
JPMorgan	2024 TCFD
Coca-Cola	2024 Env
Nestlé	2024 ESG/CSV
Samsung Electronics	2024
Samsung SDS	2024
Unilever	2024 Annual
Tata Steel	2024 BRSR
Reliance	2025 BRSR
Safaricom	2024
MTN Nigeria	2024

Open-weight models on ESGbench — APEXE3 Coder Harness vs. raw model

APEXE3 Coder Harness lift — pp gained per model

Δ accuracy · raw model → inside APEXE3 Coder Harness · 9–18pp range

ESG benchmark landscape

ESGbench · ESG-Bench (AAAI 2026) · ESGenius · ESGReveal · Nature GHG

ESG-Bench (AAAI 2026): CoT crushes direct fine-tuning

overall accuracy % · paper arXiv 2603.13154 · 300-item test split

Figures are taken from the paper and cover three base models. The 4-step CoT setup beats both the no-fine-tuning baseline and supervised fine-tuning on every base model. Mistral-7B with CoT-4 reaches 90.0% overall accuracy and an F1 of 73.5.

ESGenius: RAG lift beats parameter count

zero-shot → RAG accuracy · paper arXiv 2506.01646 · 1,136 MCQs

QwQ-32B jumps from 39.00% to 76.14% with RAG, a 95% relative lift (+37.14 pp absolute). Retrieval architecture matters more than raw scale: the pattern that makes a data-layer startup investable.
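Because this page quotes both percentage-point gains and relative lifts, the short sketch below makes the difference explicit. It uses two pairs of figures that appear on this page: the QwQ-32B zero-shot and RAG accuracies, and the Mistral Small 4 raw and in-harness ESGbench scores. The function name is illustrative.

```python
def lift(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute_pp = after_pct - before_pct
    relative_pct = absolute_pp / before_pct * 100
    return absolute_pp, relative_pct


if __name__ == "__main__":
    cases = {
        "QwQ-32B, zero-shot to RAG (ESGenius)": (39.00, 76.14),
        "Mistral Small 4, raw to Coder Harness (ESGbench)": (58.0, 76.0),
    }
    for name, (before, after) in cases.items():
        pp, rel = lift(before, after)
        print(f"{name}: +{pp:.2f} pp absolute, +{rel:.0f}% relative")
```

The same absolute gain reads as a much larger relative lift when the starting accuracy is low, so the two measures should not be compared directly across models.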

ESGenius zero-shot leaderboard — top models

accuracy on 1,136 MCQs · paper figures

Even frontier reasoning models cap near 72% without retrieval. The gap to RAG-augmented mid-size models shows where the product opportunity is.

ESG-Bench label distribution (270 annotated instances)

what LLMs actually output on ESG long-context QA

The paper reports 46.7% correct, 34.8% incomplete, 15.6% hallucinated, and 3.0% answer-not-found. Incomplete plus hallucinated outputs make up 50.4% of responses that need work, which is the wedge for a vertical LLM.
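As a quick consistency check on the label shares, the sketch below converts each percentage into an implied instance count out of 270 and recomputes the combined incomplete plus hallucinated share. The counts are derived here by rounding and are not figures quoted from the paper.

```python
TOTAL_INSTANCES = 270  # annotated instances, as stated in the section heading

# Label shares (%) as reported above.
SHARES = {
    "correct": 46.7,
    "incomplete": 34.8,
    "hallucinated": 15.6,
    "answer_not_found": 3.0,
}

# Implied counts: derived by rounding, not taken from the paper.
implied_counts = {
    label: round(pct / 100 * TOTAL_INSTANCES) for label, pct in SHARES.items()
}

needs_work_pct = SHARES["incomplete"] + SHARES["hallucinated"]

if __name__ == "__main__":
    for label, count in implied_counts.items():
        print(f"{label}: ~{count} of {TOTAL_INSTANCES}")
    print(f"implied total: {sum(implied_counts.values())}")
    print(f"incomplete + hallucinated: {needs_work_pct:.1f}%")
```

The rounded shares add up to slightly more than 100%, yet the implied counts sum back to 270, which suggests the published percentages are simply rounded to one decimal place.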

Cost per million tokens

USD · public API pricing

Input price per Mtok

USD · public API pricing (OpenRouter / provider direct, April 2026)

Sources: Mistral, Alibaba, Z.ai, OpenAI, Anthropic, Google list prices via OpenRouter + pricepertoken.com. GLM-5.1 listed at $0.95/M input ($1.40 direct from Z.ai).
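To show how a per-million-token price translates into a run cost, the sketch below applies the two GLM-5.1 input prices quoted above to a token budget. The budget (500 questions at 20,000 input tokens each) is a purely hypothetical assumption chosen for illustration. The source does not report token usage, and output-token pricing is not modelled.

```python
def input_cost_usd(input_tokens: int, usd_per_mtok: float) -> float:
    """Input-side cost in USD for a given token count and price per million tokens."""
    return input_tokens / 1_000_000 * usd_per_mtok


# Input prices for GLM-5.1 quoted in the sources line (USD per million tokens).
GLM_5_1_INPUT_PRICES = {
    "listed": 0.95,
    "Z.ai direct": 1.40,
}

# Hypothetical budget, NOT from the source: 500 questions x 20,000 input tokens.
HYPOTHETICAL_INPUT_TOKENS = 500 * 20_000

if __name__ == "__main__":
    for channel, price in GLM_5_1_INPUT_PRICES.items():
        cost = input_cost_usd(HYPOTHETICAL_INPUT_TOKENS, price)
        print(f"{channel}: ${cost:.2f} for {HYPOTHETICAL_INPUT_TOKENS:,} input tokens")
```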

Benchmark evaluation sets

sortable view · finance + ESG

Finance: 6 · ESG: 6 · Peer-reviewed: 4 · Fresh (< 90 days): 3

Benchmark	Domain	Scale	Status	Year	Why cite
Daloopa FinRetrieval	Finance · retrieval	500 Qs · 14 configs	Fresh	2026	APEXE3 agent + GLM-5.1: 81.0%, ahead of GPT-5.2 (70.8%) and Claude Opus 4.5 (19.8%)
FinBen	Finance · holistic	42 datasets · 24 tasks	Peer-reviewed	2024+	FINOS / Linux Foundation, EU-credible
FinanceBench	Finance · QA	150 Qs · 10-Ks		2023	Small but widely quoted
FinBench	Finance · reasoning	—	Peer-reviewed	2026	Reasoning-heavy complement to FinBen
Open FinLLM Leaderboard	Finance · live	living		2026	FINOS-hosted real-time FinBen runs
FinGPT	Finance · OSS	model family		2024+	OSS stack has overtaken BloombergGPT
ESGbench	ESG · pipeline	configurable	Fresh	2026	Fork & run on your own corpus tonight
ESG-Bench	ESG · hallucination	human-annotated QA	AAAI 2026	2026	Hallucination labels — regulator gold
ESGenius	ESG · MCQ	1,136 MCQs · 231 docs	EMNLP 2025	2025	50 models 0.5B–671B tested
ESGReveal	ESG · extraction	—	Elsevier	2024	Peer-reviewed, EU-friendly
GHG Emission Extraction	ESG · Scope 1/2/3	benchmark dataset	Nature	2025	Nature-published — top credibility
ESG Report Completeness	ESG · quality	—		2025	Topic + quality classification
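The table above is described as a sortable view. A minimal sketch of how such a screener could be represented and queried follows. The record type and helper names are hypothetical. Only the name, domain, status and year columns are carried over, and the year is parsed from its first four characters so that entries such as "2024+" still sort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    domain: str  # e.g. "Finance · retrieval"
    status: str  # empty string where the table leaves the cell blank
    year: str    # kept as text because some entries read "2024+"


ROWS = [
    Benchmark("Daloopa FinRetrieval", "Finance · retrieval", "Fresh", "2026"),
    Benchmark("FinBen", "Finance · holistic", "Peer-reviewed", "2024+"),
    Benchmark("FinanceBench", "Finance · QA", "", "2023"),
    Benchmark("FinBench", "Finance · reasoning", "Peer-reviewed", "2026"),
    Benchmark("Open FinLLM Leaderboard", "Finance · live", "", "2026"),
    Benchmark("FinGPT", "Finance · OSS", "", "2024+"),
    Benchmark("ESGbench", "ESG · pipeline", "Fresh", "2026"),
    Benchmark("ESG-Bench", "ESG · hallucination", "AAAI 2026", "2026"),
    Benchmark("ESGenius", "ESG · MCQ", "EMNLP 2025", "2025"),
    Benchmark("ESGReveal", "ESG · extraction", "Elsevier", "2024"),
    Benchmark("GHG Emission Extraction", "ESG · Scope 1/2/3", "Nature", "2025"),
    Benchmark("ESG Report Completeness", "ESG · quality", "", "2025"),
]


def by_area(rows: list[Benchmark], area: str) -> list[Benchmark]:
    """Keep rows whose domain starts with the given area ("Finance" or "ESG")."""
    return [r for r in rows if r.domain.split(" · ")[0] == area]


def newest_first(rows: list[Benchmark]) -> list[Benchmark]:
    """Sort by the leading four-digit year, newest first (stable for ties)."""
    return sorted(rows, key=lambda r: int(r.year[:4]), reverse=True)


if __name__ == "__main__":
    print(len(by_area(ROWS, "Finance")), "finance benchmarks")
    print(len(by_area(ROWS, "ESG")), "ESG benchmarks")
    for row in newest_first(by_area(ROWS, "ESG")):
        print(row.year, row.name)
```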

Frontier and open-weight model benchmarks with focus on capital-markets & ESG AI evaluations.
