Performance Benchmarks

Purpose

This document defines the reproducible benchmark workflow for FluxGraph.

Goals:

Provide repeatable measurement artifacts for benchmark quality gates.
Separate measured evidence from narrative claims.
Keep benchmark runs deterministic enough for regression comparison.

Scope

Current benchmark executables:

benchmark_signal_store
benchmark_namespace
benchmark_tick
json_loader_bench (optional, when FLUXGRAPH_JSON_ENABLED=ON)
yaml_loader_bench (optional, when FLUXGRAPH_YAML_ENABLED=ON)

Reproducible Runner

Use the benchmark wrapper scripts.

Linux/macOS:

bash ./scripts/bench.sh --preset dev-release

Windows PowerShell:

.\scripts\bench.ps1 -Preset dev-windows-release -Config Release

Optional loader benchmarks:

bash ./scripts/bench.sh --preset dev-release --include-optional

Strict status enforcement (for gated runs):

bash ./scripts/bench.sh --preset dev-release --fail-on-status

The wrappers call scripts/run_benchmarks.py, which:

Configures/builds benchmark targets (unless --no-build is set).
Runs each benchmark executable.
Captures stdout/stderr logs per target.
Emits a machine-readable manifest with environment, git metadata, and parsed benchmark metrics.
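
The run-and-capture step can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/run_benchmarks.py: the function name run_target, the record field names, and the assumption that status lines begin with the token PASS or FAIL are all hypothetical.

```python
import subprocess
import time
from pathlib import Path
from typing import Dict, List


def run_target(name: str, command: List[str], out_dir: Path) -> Dict[str, object]:
    """Run one benchmark command and write <name>.stdout.log / <name>.stderr.log."""
    out_dir.mkdir(parents=True, exist_ok=True)

    started = time.perf_counter()
    proc = subprocess.run(command, capture_output=True, text=True)
    duration_s = time.perf_counter() - started

    # One stdout log and one stderr log per target, as in the artifact contract.
    (out_dir / f"{name}.stdout.log").write_text(proc.stdout, encoding="utf-8")
    (out_dir / f"{name}.stderr.log").write_text(proc.stderr, encoding="utf-8")

    # Assumption: status lines start with the literal token PASS or FAIL.
    status_lines = [
        line.strip()
        for line in proc.stdout.splitlines()
        if line.strip().startswith(("PASS", "FAIL"))
    ]
    return {
        "name": name,
        "command": command,
        "exit_code": proc.returncode,
        "duration_s": duration_s,
        "status_lines": status_lines,
    }
```

A list of such records, one per executable, is the kind of data the manifest would then serialize.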

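By default, status failures (a benchmark that reports FAIL) are reported but do not fail the run; execution failures (a benchmark that does not run to completion) always fail it. Pass --fail-on-status for gate-enforced benchmark checks so that status failures fail the run as well.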
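
Read as an exit-code decision, the rule above might look like the sketch below. It assumes that an execution failure means a non-zero exit code and that a status failure means at least one status line starting with FAIL; both are assumptions for illustration, and the record fields are hypothetical.

```python
from typing import Dict, List


def run_exit_code(results: List[Dict[str, object]], fail_on_status: bool = False) -> int:
    """Return the process exit code for a whole benchmark run."""
    # Assumption: an execution failure is any non-zero exit code.
    execution_failed = any(r["exit_code"] != 0 for r in results)
    # Assumption: a status failure is any status line that starts with FAIL.
    status_failed = any(
        line.startswith("FAIL") for r in results for line in r["status_lines"]
    )

    if execution_failed:
        return 1  # always fatal, regardless of flags
    if status_failed and fail_on_status:
        return 1  # fatal only in gated runs
    return 0
```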

Wrappers then run scripts/evaluate_benchmarks.py with a policy profile:

local: informational only, to allow for workstation variability.
ci-hosted: warning-oriented checks for shared CI runners.
ci-dedicated: strict gate profile for stable hardware evidence.
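
The difference between the profiles can be pictured as the severity attached to the same regression finding. The sketch below is an assumption about how such a mapping could work, not the logic of scripts/evaluate_benchmarks.py; the severity names, the check_metric function, and the percentage threshold parameter are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical mapping: same finding, different severity per profile.
SEVERITY_BY_PROFILE = {
    "local": "info",
    "ci-hosted": "warning",
    "ci-dedicated": "error",
}


def check_metric(
    profile: str,
    key: str,
    baseline: float,
    measured: float,
    max_regression_pct: float,
) -> Optional[Dict[str, object]]:
    """Return a finding if `measured` exceeds `baseline` by more than the allowed percentage."""
    limit = baseline * (1.0 + max_regression_pct / 100.0)
    if measured <= limit:
        return None  # within tolerance, nothing to report
    return {
        "key": key,
        "severity": SEVERITY_BY_PROFILE[profile],
        "baseline": baseline,
        "measured": measured,
        "limit": limit,
    }
```

With this shape, only findings of severity "error" would need to fail a gated run, which matches reserving strict gating for ci-dedicated.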

Artifact Contract

Artifacts are stored under:

artifacts/benchmarks/<timestamp>_<preset>/

Required files:

benchmark_results.json
benchmark_evaluation.json
<target>.stdout.log
<target>.stderr.log
configure.stdout.log, configure.stderr.log (when build enabled)
build.stdout.log, build.stderr.log (when build enabled)
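
A run directory can be checked against this list with a few lines of Python. The helper below is a hypothetical sketch (it is not part of the repository); the file names come from the list above, and the caller supplies the target names.

```python
from pathlib import Path
from typing import List


def missing_artifacts(run_dir: Path, targets: List[str], build_enabled: bool = True) -> List[str]:
    """Return the required artifact file names that are absent from `run_dir`."""
    required = ["benchmark_results.json", "benchmark_evaluation.json"]
    for target in targets:
        required += [f"{target}.stdout.log", f"{target}.stderr.log"]
    if build_enabled:
        # Configure/build logs are only expected when the build step ran.
        required += [
            "configure.stdout.log",
            "configure.stderr.log",
            "build.stdout.log",
            "build.stderr.log",
        ]
    return [name for name in required if not (Path(run_dir) / name).is_file()]
```

An empty return value means the directory satisfies the contract for the given targets.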

benchmark_results.json contains:

Timestamp (UTC)
Preset/config/build directory
Platform, hostname, Python version
Git commit hash and dirty-worktree flag
Per-benchmark executable path, command, exit code, duration, parsed PASS/FAIL status lines
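
For orientation, a manifest with these contents could be assembled as in the sketch below. The JSON field names are illustrative assumptions; the actual schema is whatever scripts/run_benchmarks.py emits. Git metadata is passed in rather than queried so that the example stays self-contained.

```python
import platform
import socket
from datetime import datetime, timezone
from typing import Dict, List


def build_manifest(
    preset: str,
    config: str,
    build_dir: str,
    git_commit: str,
    git_dirty: bool,
    benchmarks: List[Dict[str, object]],
) -> Dict[str, object]:
    """Assemble a manifest dict; all key names here are hypothetical."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "preset": preset,
        "config": config,
        "build_dir": build_dir,
        "platform": platform.platform(),
        "hostname": socket.gethostname(),
        "python_version": platform.python_version(),
        "git": {"commit": git_commit, "dirty": git_dirty},
        # One entry per executable: path, command, exit code, duration, status lines.
        "benchmarks": benchmarks,
    }
```

The result is plain data, so json.dumps(manifest, indent=2) is enough to write it to disk.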

The tick benchmark output also reports the heap allocations measured during the timed loop (Allocations, Alloc/tick), so zero-allocation evidence is captured per scenario.

benchmark_evaluation.json contains:

Selected policy profile
Issue summary (errors/warnings)
Per-check findings with metric keys and threshold context
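
As a rough picture of that structure, the sketch below folds a list of findings into a profile, a summary, and the findings themselves. The key names ("profile", "summary", "findings", "severity") are assumptions chosen for illustration and may not match the real file.

```python
from typing import Dict, List


def build_evaluation(profile: str, findings: List[Dict[str, object]]) -> Dict[str, object]:
    """Summarize per-check findings; key names are hypothetical."""
    errors = sum(1 for f in findings if f["severity"] == "error")
    warnings = sum(1 for f in findings if f["severity"] == "warning")
    return {
        "profile": profile,
        "summary": {"errors": errors, "warnings": warnings},
        "findings": findings,
    }
```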

Scenario-versioned keys are used for regression checks, e.g.:

scenario.tick.simple.v1.avg_tick_us
scenario.tick.complex.v1.avg_tick_us
scenario.tick.simple.v1.alloc_per_tick
scenario.tick.complex.v1.alloc_per_tick
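
Putting the version in the key means a baseline recorded for v1 of a scenario is never compared against a run of a changed v2 scenario. A minimal sketch of parsing these keys and looking up a baseline, assuming the layout scenario.<benchmark>.<scenario>.v<N>.<metric> inferred from the examples above (the helper names are hypothetical):

```python
import re
from typing import Dict, Optional

# Layout inferred from the example keys; treat it as an assumption.
KEY_PATTERN = re.compile(
    r"^scenario\.(?P<benchmark>[a-z0-9_]+)\.(?P<scenario>[a-z0-9_]+)"
    r"\.v(?P<version>\d+)\.(?P<metric>[a-z0-9_]+)$"
)


def parse_metric_key(key: str) -> Dict[str, object]:
    """Split a scenario-versioned key into its parts."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a scenario-versioned key: {key!r}")
    parts: Dict[str, object] = dict(match.groupdict())
    parts["version"] = int(match.group("version"))
    return parts


def baseline_for(key: str, baseline: Dict[str, float]) -> Optional[float]:
    """Return the baseline value for exactly this key, or None if there is none.

    A v2 key never falls back to a v1 baseline, so a changed scenario
    starts without a comparison instead of producing a misleading one.
    """
    parse_metric_key(key)  # validate the key shape first
    return baseline.get(key)
```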

Baseline promotion command:

python scripts/promote_benchmark_baseline.py \
  --results artifacts/benchmarks/<run>/benchmark_results.json \
  --policy benchmarks/policy/bench_policy.json \
  --profile ci-hosted \
  --output benchmarks/policy/baselines/ci-hosted.windows-2022.json

Evidence Rules

For any published performance claim, attach:

Commit hash used for the run.
Exact benchmark command.
Full artifact directory or archived equivalent.
Hardware and OS details (captured in the manifest and release notes).
Comparison baseline (previous artifact manifest).

Claims without linked artifacts are treated as unsupported.
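
The checklist lends itself to a mechanical check before a claim is published. The sketch below assumes a claim is described by a small dict; the field names are hypothetical stand-ins for the five items above.

```python
from typing import Dict, List

# Hypothetical field names, one per evidence item in the list above.
REQUIRED_EVIDENCE = (
    "commit",
    "command",
    "artifact_dir",
    "hardware_os",
    "baseline_manifest",
)


def missing_evidence(claim: Dict[str, str]) -> List[str]:
    """Return the evidence fields that are absent or empty for a claim."""
    return [field for field in REQUIRED_EVIDENCE if not claim.get(field)]
```

A claim is supported only when this returns an empty list.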

CI Guidance

Benchmarks are intentionally separated from the default CI correctness lanes.

Recommended:

Run the benchmark evidence workflow on demand (workflow_dispatch) or in a scheduled lane.
Store artifacts as CI build artifacts.
Apply the ci-hosted profile on hosted runners and reserve strict gating for dedicated runners.

Next Steps

Calibrate thresholds against several hosted-runner samples to reduce false positives while preserving sensitivity.
Provision and commit a true ci-dedicated baseline from stable hardware.
Add trend reporting (time-series comparison across benchmark-evidence workflow runs).