LIVE v1.0
APEXE3 · AI Benchmarks · Capital Markets & ESG

Frontier and open-weight model benchmarks with focus on capital-markets & ESG AI evaluations.

A live dashboard of frontier and open-weight AI model performance on capital-markets and ESG benchmarks: FinRetrieval, FinBen, ESGbench, ESG-Bench (AAAI 2026), and ESGenius. It also shows how APEXE3 Agents lift open-weight models.

01

Headline numbers

81.0%
APEXE3 agent + GLM-5.1 on Daloopa FinRetrieval (500 Qs · LLM-judge). Beats GPT-5.2 + Reasoning (70.8%), Gemini 3 Pro + Reasoning (69.2%), and Claude Opus 4.5 (19.8%).
83%
APEXE3 Coder Harness + GLM-5.1 on ESGbench (13 issuer PDFs, 159 QA pairs)
58.6%
Kimi K2.6 is the new #1 open-weight model on SWE-bench Pro, beating Claude Opus 4.6 (57.3%) by +1.3 pp (Moonshot AI, verified 2026-04-20)
76%
APEXE3 Coder Harness + Mistral Small 4 on ESGbench — EU-sovereign open weights beating public frontier zero-shots
+18 pp
Biggest APEXE3 Coder Harness uplift — on Mistral Small 4 (58% → 76% on ESGbench)
02

Capital-markets retrieval — Daloopa FinRetrieval, 500 questions

APEXE3 internal evaluation · April 2026 · LLM-judge scoring

Result

Accuracy on Daloopa's 500-question benchmark · LLM-judge scoring · arXiv 2603.04403
Configuration | Accuracy
APEXE3 agent + GLM-5.1 | 81.0%
GPT-5.2 + WebSearch + Reasoning | 70.8%
Gemini 3 Pro + WebSearch + Reasoning | 69.2%
Claude Opus 4.5 + WebSearch + Reasoning | 19.8%
The top configuration pairs an open-weights reasoning model with the APEXE3 agent, which contributes domain-specific reasoning, targeted query construction, and careful handling of heterogeneous primary-source formats. The lift comes from how the agent reasons and searches, not from raw model scale.

Benchmark composition

500 Qs · 41 countries · 475 distinct issuers · fiscal periods 2015–2026
Top issuer HQ | %
United States | 42
Japan | 10
UK · Australia | 5 ea.
Brazil | 4
India · Canada | 3 ea.
+34 others | 28

Question category | %
Balance sheet | 20
Cash flow | 19
Operational KPIs | 18
Income statement | 16
Guidance / outlook | 15
Segments / geography | 13

How the agent answers a question

"What was Prestige Estates Projects Ltd's consolidated net profit for the period for calendar Q1 2025, in INR millions?" — Daloopa FinRetrieval, Q21
  1. Translate the period. Recognises Prestige as an Indian issuer, March fiscal year-end — reformulates "calendar Q1 2025" → fiscal Q4 FY25 before searching.
  2. Identify the primary source. Issues queries with the fiscal framing; the company's own Q4 FY25 filing surfaces alongside news and aggregators.
  3. Reject a conflicting secondary source. A news article reports ₹25 cr standalone profit — the agent recognises this is not the consolidated figure the question asks for, and continues to the primary filing.
  4. Read a scanned 19-page PDF. Combines an OCR tool (open-weights small OCR model) with document-reading and in-document search to locate the consolidated net profit line for quarter ending 31 March 2025.
Agent answer: ₹431 million — value and unit match Daloopa ground truth.
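
Step 1 of the walkthrough above, the period translation, can be sketched in a few lines. This is an illustrative sketch only, not the APEXE3 agent's actual code: the names `FiscalQuarter` and `calendar_to_fiscal` are hypothetical, and the sketch assumes that fiscal quarters line up with calendar quarters and that a fiscal year is labelled by the calendar year in which it ends (the convention behind "Q4 FY25" for the quarter ending 31 March 2025).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiscalQuarter:
    """Hypothetical container for a fiscal period."""
    fiscal_year: int  # calendar year in which the fiscal year ends (assumed convention)
    quarter: int      # 1 to 4

    def label(self) -> str:
        return f"Q{self.quarter} FY{self.fiscal_year % 100:02d}"


def calendar_to_fiscal(cal_year: int, cal_quarter: int, fy_end_month: int) -> FiscalQuarter:
    """Map a calendar quarter to the fiscal quarter that covers it.

    Assumes the fiscal year ends in March, June, September or December,
    so fiscal quarters coincide with calendar quarters.
    """
    if cal_quarter not in (1, 2, 3, 4):
        raise ValueError("cal_quarter must be 1, 2, 3 or 4")
    if fy_end_month not in (3, 6, 9, 12):
        raise ValueError("fy_end_month must be 3, 6, 9 or 12")

    quarter_end_month = cal_quarter * 3
    # Months elapsed since the start of the fiscal year, in the range 1..12.
    months_into_fy = (quarter_end_month - fy_end_month - 1) % 12 + 1
    fiscal_quarter = (months_into_fy + 2) // 3
    # A quarter ending after the fiscal year-end month belongs to the next fiscal year.
    fiscal_year = cal_year if quarter_end_month <= fy_end_month else cal_year + 1
    return FiscalQuarter(fiscal_year, fiscal_quarter)


# Prestige Estates: Indian issuer with a March fiscal year-end.
print(calendar_to_fiscal(2025, 1, fy_end_month=3).label())  # Q4 FY25
```

Searching with the fiscal label matters because that is how the issuer titles its own filing. A query for "Q1 2025" would more likely surface news and aggregators than the primary source.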
03

Our ESGbench runs — open-weight models on 13 EU/global issuers

Apple · 2024 ESG
Microsoft · 2024 ESG
ExxonMobil · 2024 Climate
JPMorgan · 2024 TCFD
Coca-Cola · 2024 Env
Nestlé · 2024 ESG/CSV
Samsung Electronics · 2024
Samsung SDS · 2024
Unilever · 2024 Annual
Tata Steel · 2024 BRSR
Reliance · 2025 BRSR
Safaricom · 2024
MTN Nigeria · 2024

Open-weight models on ESGbench — APEXE3 Coder Harness vs. raw model

APEXE3 Coder Harness lift — pp gained per model

Δ accuracy · raw model → inside APEXE3 Coder Harness · 9–18pp range
03b

ESG benchmark landscape

ESGbench · ESG-Bench (AAAI 2026) · ESGenius · ESGReveal · Nature GHG

ESG-Bench (AAAI 2026): CoT crushes direct fine-tuning

overall accuracy % · paper arXiv 2603.13154 · 300-item test split
Paper figures across three base models. The four-step CoT variant beats both the no-fine-tuning baseline and supervised fine-tuning on every base model. Mistral-7B with CoT-4 reaches 90.0% overall accuracy and an F1 of 73.5.

ESGenius: RAG lift beats parameter count

zero-shot → RAG accuracy · paper arXiv 2506.01646 · 1,136 MCQs
QwQ-32B jumps from 39.00% to 76.14% with RAG, a 95% relative lift. Retrieval architecture matters more than raw scale, which is the pattern that makes a data-layer startup investable.
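
This dashboard quotes gains in two different units: percentage points (pp) for the Coder Harness uplift and relative lift for the ESGenius RAG result. The sketch below shows how the two relate, using only figures quoted on this page. The helper names `pp_gain` and `relative_lift` are illustrative and do not come from either paper.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return after - before


def relative_lift(before: float, after: float) -> float:
    """Gain as a percentage of the starting accuracy."""
    if before <= 0:
        raise ValueError("starting accuracy must be positive")
    return (after - before) / before * 100.0


# QwQ-32B on ESGenius: zero-shot vs. RAG (figures quoted above).
print(round(pp_gain(39.00, 76.14), 2))        # 37.14
print(round(relative_lift(39.00, 76.14), 1))  # 95.2

# Mistral Small 4 on ESGbench: raw model vs. inside the APEXE3 Coder Harness.
print(round(pp_gain(58.0, 76.0), 1))          # 18.0
print(round(relative_lift(58.0, 76.0), 1))    # 31.0
```

Relative lift rewards a low starting point, so the two units are not interchangeable when comparing results across benchmarks.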

ESGenius zero-shot leaderboard — top models

accuracy on 1,136 MCQs · paper figures
Even frontier reasoning models cap near 72% without retrieval. The gap to RAG-augmented mid-size models shows where the product opportunity is.

ESG-Bench label distribution (270 annotated instances)

what LLMs actually output on ESG long-context QA
The paper reports 46.7% correct, 34.8% incomplete, 15.6% hallucinated, and 3.0% answer-not-found. Incomplete plus hallucinated comes to 50.4% of outputs that need work: the wedge for a vertical LLM.
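
The 50.4% figure is a plain sum of two label shares. A minimal sketch of that arithmetic, using only the percentages quoted above, follows; the dictionary keys and the `combined_share` helper are illustrative names, not identifiers from the ESG-Bench paper.

```python
# Label shares (%) as quoted above for the 270 annotated instances.
label_share = {
    "correct": 46.7,
    "incomplete": 34.8,
    "hallucinated": 15.6,
    "answer_not_found": 3.0,
}


def combined_share(shares: dict[str, float], labels: list[str]) -> float:
    """Sum the shares of the selected labels, rounded to one decimal place."""
    return round(sum(shares[label] for label in labels), 1)


needs_work = combined_share(label_share, ["incomplete", "hallucinated"])
total = combined_share(label_share, list(label_share))

print(needs_work)  # 50.4
print(total)       # 100.1
```

The shares total 100.1 rather than 100, which is consistent with each figure having been rounded to one decimal place.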
05

Open-weight and frontier model benchmarks

Qwen 3.5 family — verified & replicated benchmarks

verified or replicated only · gaps where unpublished

Qwen 3.5-397B-A17B (flagship) vs. 27B — capability profile

verified & replicated scores · radar view

Active parameters per token — frontier & open weights

log scale · inference cost proxy · active params from verified release notes

Release timeline — Q1 2026

frontier model launches · Q1 2026
06

Benchmark evaluation sets

sortable view · finance + ESG
Finance: 6 · ESG: 6 · Peer-reviewed: 4 · Fresh (< 90 days): 3
Benchmark | Domain | Scale | Status | Year | Why cite
Daloopa FinRetrieval | Finance · retrieval | 500 Qs · 14 configs | Fresh | 2026 | APEXE3 agent + GLM-5.1: 81.0%, beating GPT-5.2 (70.8%) and Opus 4.5 (19.8%)
FinBen | Finance · holistic | 42 datasets · 24 tasks | Peer-reviewed | 2024+ | FinOS / Linux Foundation, EU-credible
FinanceBench | Finance · QA | 150 Qs · 10-Ks | | 2023 | Small but widely quoted
FinBench | Finance · reasoning | | Peer-reviewed | 2026 | Reasoning-heavy complement to FinBen
Open FinLLM Leaderboard | Finance · live | living | | 2026 | FinOS-hosted real-time FinBen runs
FinGPT | Finance · OSS | model family | | 2024+ | OSS stack has overtaken BloombergGPT
ESGbench | ESG · pipeline | configurable | Fresh | 2026 | Fork & run on your own corpus tonight
ESG-Bench | ESG · hallucination | human-annotated QA | AAAI 2026 | 2026 | Hallucination labels, regulator gold
ESGenius | ESG · MCQ | 1,136 MCQs · 231 docs | EMNLP 2025 | 2025 | 50 models 0.5B–671B tested
ESGReveal | ESG · extraction | | Elsevier | 2024 | Peer-reviewed, EU-friendly
GHG Emission Extraction | ESG · Scope 1/2/3 | benchmark dataset | Nature | 2025 | Nature-published, top credibility
ESG Report Completeness | ESG · quality | | | 2025 | Topic + quality classification
07

Benchmark screener

930 frontier results · filter by benchmark, provider, score, model
# Model Provider Benchmark Score Status Date Source