Accuracy Benchmark
The accuracy benchmark suite evaluates LLM providers’ ability to analyze structured data in TeaLeaf format. It sends analysis prompts with TeaLeaf-formatted business data to multiple providers and scores the responses.
Overview
The workflow:
- Takes JSON data from various business domains
- Converts it to TeaLeaf format using tealeaf-core
- Sends analysis prompts to multiple LLM providers
- Evaluates and compares the responses using a scoring framework
Supported Providers
| Provider | Environment Variable | Model |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | Claude Sonnet 4.5 (Extended Thinking) |
| OpenAI | OPENAI_API_KEY | GPT-5.2 |
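A provider can only be used when its API key is set in the environment. A minimal sketch of how that availability check might look (illustrative, not the crate's actual logic):

use std::env;

/// Illustrative check: which providers from the table above have keys set?
fn available_providers() -> Vec<&'static str> {
    [("anthropic", "ANTHROPIC_API_KEY"), ("openai", "OPENAI_API_KEY")]
        .into_iter()
        .filter(|(_, var)| env::var(var).is_ok())
        .map(|(name, _)| name)
        .collect()
}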
Installation
Pre-built Binaries
Download the latest release from GitHub Releases:
| Platform | Architecture | File |
|---|---|---|
| Windows | x64 | accuracy-benchmark-windows-x64.zip |
| Windows | ARM64 | accuracy-benchmark-windows-arm64.zip |
| macOS | Intel | accuracy-benchmark-macos-x64.tar.gz |
| macOS | Apple Silicon | accuracy-benchmark-macos-arm64.tar.gz |
| Linux | x64 | accuracy-benchmark-linux-x64.tar.gz |
| Linux | ARM64 | accuracy-benchmark-linux-arm64.tar.gz |
| Linux | x64 (static) | accuracy-benchmark-linux-musl-x64.tar.gz |
Build from Source
cargo build -p accuracy-benchmark --release
# Or run directly
cargo run -p accuracy-benchmark -- --help
Usage
# Run with all available providers
cargo run -p accuracy-benchmark -- run
# Run with specific providers
cargo run -p accuracy-benchmark -- run --providers anthropic,openai
# Run specific categories only
cargo run -p accuracy-benchmark -- run --categories finance,retail
# Compare TeaLeaf vs JSON format performance
cargo run -p accuracy-benchmark -- run --compare-formats
# Verbose output
cargo run -p accuracy-benchmark -- -v run
# List available tasks
cargo run -p accuracy-benchmark -- list-tasks
# Generate configuration template
cargo run -p accuracy-benchmark -- init-config -o my-config.json
Benchmark Tasks
The suite includes 12 tasks across 10 business domains:
| ID | Domain | Complexity | Output Type |
|---|---|---|---|
| FIN-001 | Finance | Simple | Calculation |
| FIN-002 | Finance | Moderate | Calculation |
| RET-001 | Retail | Simple | Summary |
| RET-002 | Retail | Complex | Recommendation |
| HLT-001 | Healthcare | Simple | Summary |
| TEC-001 | Technology | Moderate | Analysis |
| MKT-001 | Marketing | Moderate | Calculation |
| LOG-001 | Logistics | Moderate | Analysis |
| HR-001 | HR | Moderate | Analysis |
| MFG-001 | Manufacturing | Moderate | Calculation |
| RE-001 | Real Estate | Complex | Recommendation |
| LEG-001 | Legal | Complex | Analysis |
Data Sources
Each task specifies input data in one of two ways:
Inline JSON:
BenchmarkTask::new("FIN-001", "finance", "Analyze this data:\n\n{tl_data}")
    .with_json_data(serde_json::json!({
        "revenue": 1000000,
        "expenses": 750000
    }))
JSON file reference:
BenchmarkTask::new("LOG-001", "logistics", "Analyze this data:\n\n{tl_data}")
    .with_json_file("tasks/logistics/data/shipments.json")
The {tl_data} placeholder in the prompt template is replaced with the TeaLeaf-formatted data before the prompt is sent to the LLM.
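A minimal sketch of that substitution step, assuming the TeaLeaf serialization is already available as a string (the helper is illustrative, not part of the crate's API):

/// Splice TeaLeaf-serialized data into a prompt template (illustrative helper).
fn render_prompt(template: &str, tl_data: &str) -> String {
    template.replace("{tl_data}", tl_data)
}

fn main() {
    let template = "Analyze this data:\n\n{tl_data}";
    let tl_data = "revenue: 1000000\nexpenses: 750000"; // TeaLeaf-style text, for illustration
    println!("{}", render_prompt(template, tl_data));
}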
Analysis Framework
Accuracy Metrics
Responses are evaluated across five dimensions:
| Metric | Weight | Description |
|---|---|---|
| Completeness | 25% | Were all expected elements addressed? |
| Relevance | 25% | How relevant is the response to the task? |
| Coherence | 20% | Is the response well-structured? |
| Factual Accuracy | 20% | Do values match validation patterns? |
| Actionability | 10% | For recommendations – are they actionable? |
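The weights imply a single composite score per response. A sketch of that combination, assuming each dimension is scored from 0.0 to 1.0 (struct and field names are illustrative):

/// Illustrative per-dimension scores, each in 0.0..=1.0.
struct AccuracyScores {
    completeness: f64,
    relevance: f64,
    coherence: f64,
    factual_accuracy: f64,
    actionability: f64,
}

/// Weighted composite using the weights from the table above.
fn composite(s: &AccuracyScores) -> f64 {
    0.25 * s.completeness
        + 0.25 * s.relevance
        + 0.20 * s.coherence
        + 0.20 * s.factual_accuracy
        + 0.10 * s.actionability
}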
Element Detection
Each task defines expected elements that should appear in the response:
// Keyword presence check
.expect("metric", "Total revenue calculation", true)

// Regex pattern validation
.expect_with_pattern("metric", "Percentage value", true, r"\d+\.?\d*%")
- Without pattern: checks for keyword presence from the description
- With pattern: validates using regex (e.g., \$[\d,]+ for dollar amounts)
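A sketch of how a single expected element could be checked against a response using the regex crate (illustrative, not the suite's actual implementation):

use regex::Regex;

/// True when the element is present: the keyword appears in the response,
/// or, if a pattern is given, the regex matches somewhere in the response.
fn element_found(response: &str, keyword: &str, pattern: Option<&str>) -> bool {
    match pattern {
        Some(p) => Regex::new(p).map(|re| re.is_match(response)).unwrap_or(false),
        None => response.to_lowercase().contains(&keyword.to_lowercase()),
    }
}

// Example: element_found(response, "Percentage value", Some(r"\d+\.?\d*%")) matches "12.5%".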
Scoring Rubrics
Different rubrics apply based on output type:
| Output Type | Key Criteria |
|---|---|
| Calculation | Numeric content (5+ numbers), structured output |
| Analysis | Depth, structure, evidence with data |
| Recommendation | Actionable language, prioritization, justification |
| Summary | Completeness, conciseness, organization |
Coherence Checks
- Structure markers: ##, ###, **, -, numbered lists
- Paragraph breaks (3+ paragraphs preferred)
- Reasonable length (100-2000 words)
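A rough sketch of those heuristics (thresholds taken from the list above; how they feed into the coherence score is not shown):

/// Count coherence signals: structure markers, paragraph breaks, and length.
fn coherence_signals(response: &str) -> (usize, usize, bool) {
    let markers = ["## ", "### ", "**", "- "];
    let marker_hits = markers.into_iter().filter(|m| response.contains(*m)).count();
    let paragraphs = response.split("\n\n").filter(|p| !p.trim().is_empty()).count();
    let words = response.split_whitespace().count();
    let length_ok = (100..=2000).contains(&words);
    (marker_hits, paragraphs, length_ok)
}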
Actionability Keywords
For recommendation tasks, these keywords are detected:
- recommend, should, suggest, consider, advise
- action, implement, improve, optimize, prioritize
- next step, immediate, critical, important
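Detection can be as simple as a case-insensitive substring scan; a minimal sketch using the keyword list above (the actual scoring is not shown):

/// Fraction of actionability keywords present in the response.
fn actionability_hits(response: &str) -> f64 {
    const KEYWORDS: &[&str] = &[
        "recommend", "should", "suggest", "consider", "advise",
        "action", "implement", "improve", "optimize", "prioritize",
        "next step", "immediate", "critical", "important",
    ];
    let lower = response.to_lowercase();
    let hits = KEYWORDS.iter().copied().filter(|k| lower.contains(*k)).count();
    hits as f64 / KEYWORDS.len() as f64
}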
Format Comparison Results
Run with --compare-formats to compare TeaLeaf vs JSON input efficiency.
Sample run from February 5, 2026 with Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) and GPT-5.2 (gpt-5.2-2025-12-11):
| Provider | TeaLeaf Score | JSON Score | Accuracy Diff | TeaLeaf Input Tokens | JSON Input Tokens | Input Savings |
|---|---|---|---|---|---|---|
| anthropic | 0.988 | 0.978 | +0.010 | 5,793 | 8,275 | -30.0% |
| openai | 0.901 | 0.899 | +0.002 | 4,868 | 7,089 | -31.3% |
Input tokens = data sent to the model. Output tokens vary by model verbosity.
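The savings column is the relative reduction in input tokens; for the anthropic row, (8,275 - 5,793) / 8,275 ≈ 30.0%, and for openai, (7,089 - 4,868) / 7,089 ≈ 31.3%.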
Key findings:
| Provider | Accuracy | Data Token Efficiency |
|---|---|---|
| anthropic | Comparable (+1.0%) | TeaLeaf uses ~36% fewer data tokens |
| openai | Comparable (+0.2%) | TeaLeaf uses ~36% fewer data tokens |
TeaLeaf data payloads use ~36% fewer tokens than equivalent JSON (median across 12 tasks, validated with tiktoken). Total input savings are ~30% because shared instruction text dilutes the data-only difference. Savings increase with larger and more structured datasets.
Sample Results: Reference benchmark results are available in accuracy-benchmark/results/sample/ in the repository.
Output Files
Results are saved in two formats:
TeaLeaf Format (analysis.tl)
# Accuracy Benchmark Results
# Generated: 2026-02-05 15:29:42 UTC
run_metadata: {
run_id: "20260205-152419",
started_at: 2026-02-05T15:24:19Z,
completed_at: 2026-02-05T15:29:42Z,
total_tasks: 12,
providers: [anthropic, openai]
}
responses: @table api_response [
(FIN-001, openai, "gpt-5.2-2025-12-11", 315, 490, 6742, 2026-02-05T15:24:38Z, success),
(FIN-001, anthropic, "claude-sonnet-4-5-20250929", 396, 1083, 12309, 2026-02-05T15:24:31Z, success),
...
]
analysis_results: @table analysis_result [
(FIN-001, openai, 0.667, 0.625, 0.943, 0.000),
(FIN-001, anthropic, 1.000, 1.000, 1.000, 1.000),
...
]
comparisons: @table comparison_result [
(FIN-001, [anthropic, openai], anthropic, 0.389),
(RET-001, [anthropic, openai], anthropic, 0.047),
...
]
summary: {
total_tasks: 12,
wins: { anthropic: 11, openai: 1 },
avg_scores: { anthropic: 0.988, openai: 0.901 },
by_category: { ... },
by_complexity: { ... }
}
JSON Summary (summary.json)
{
"run_id": "20260205-152419",
"total_tasks": 12,
"provider_rankings": [
{ "provider": "anthropic", "wins": 11, "avg_score": 0.988 },
{ "provider": "openai", "wins": 1, "avg_score": 0.901 }
],
"category_breakdown": {
"retail": { "leader": "anthropic", "margin": 0.111 },
"finance": { "leader": "anthropic", "margin": 0.197 },
...
},
"detailed_results_file": "analysis.tl"
}
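A sketch of consuming summary.json downstream with serde and serde_json (the struct covers only the fields shown above; the file path is illustrative):

use serde::Deserialize;

/// Illustrative subset of the summary.json fields shown above.
#[derive(Deserialize)]
struct Summary {
    run_id: String,
    total_tasks: u32,
    provider_rankings: Vec<Ranking>,
}

#[derive(Deserialize)]
struct Ranking {
    provider: String,
    wins: u32,
    avg_score: f64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string("results/runs/20260205-152419/summary.json")?; // illustrative path
    let summary: Summary = serde_json::from_str(&text)?;
    println!("run {} ({} tasks)", summary.run_id, summary.total_tasks);
    for r in &summary.provider_rankings {
        println!("{}: {} wins, avg {:.3}", r.provider, r.wins, r.avg_score);
    }
    Ok(())
}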
Adding Custom Tasks
From JSON Data
BenchmarkTask::new(
    "CUSTOM-001",
    "custom_category",
    "Analyze this data:\n\n{tl_data}\n\nProvide summary and recommendations."
)
.with_json_file("tasks/custom/data/my_data.json")
.with_complexity(Complexity::Moderate)
.with_output_type(OutputType::Analysis)
.expect("summary", "Data overview", true)
.expect_with_pattern("metric", "Total value", true, r"\d+")
From TeaLeaf File
cargo run -p accuracy-benchmark -- run --tasks path/to/tasks.tl
Extending Providers
Implement the LLMProvider trait:
#[async_trait]
impl LLMProvider for NewProviderClient {
    fn name(&self) -> &str { "newprovider" }

    async fn complete(&self, request: CompletionRequest) -> ProviderResult<CompletionResponse> {
        // Implementation
    }
}
Then register the new client in src/providers/mod.rs via create_all_providers() and create_providers().
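A rough sketch of what that registration could look like, assuming the factory returns boxed trait objects (the return type, constructor, and env var name are illustrative and may differ from the real src/providers/mod.rs):

// Sketch only: register the new client alongside the existing ones.
fn create_all_providers() -> Vec<Box<dyn LLMProvider>> {
    let mut providers: Vec<Box<dyn LLMProvider>> = Vec::new();
    // ...existing anthropic/openai registration...
    if std::env::var("NEWPROVIDER_API_KEY").is_ok() {
        providers.push(Box::new(NewProviderClient::default())); // constructor is illustrative
    }
    providers
}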
Directory Structure
accuracy-benchmark/
├── src/
│ ├── main.rs # CLI interface (clap)
│ ├── lib.rs # Library exports
│ ├── config.rs # Configuration management
│ ├── providers/ # LLM provider clients
│ │ ├── traits.rs # LLMProvider trait
│ │ ├── anthropic.rs # Claude implementation
│ │ └── openai.rs # GPT implementation
│ ├── tasks/ # Task definitions
│ │ ├── mod.rs # BenchmarkTask, DataSource
│ │ ├── categories.rs # Domain, Complexity, OutputType
│ │ └── loader.rs # TeaLeaf file loader
│ ├── runner/ # Execution engine
│ │ ├── executor.rs # Parallel task execution
│ │ └── rate_limiter.rs
│ ├── analysis/ # Response analysis
│ │ ├── metrics.rs # AccuracyMetrics
│ │ ├── scoring.rs # ScoringRubric
│ │ └── comparator.rs # Cross-provider comparison
│ └── reporting/ # Output generation
│ └── tl_writer.rs # TeaLeaf format output
├── tasks/ # Sample data by domain
│ ├── finance/data/
│ ├── healthcare/data/
│ ├── retail/data/
│ └── ...
├── results/runs/ # Archived run results
└── Cargo.toml