# Performance Benchmarks

## Purpose
This document defines the reproducible benchmark workflow for FluxGraph.
Goals:
- Provide repeatable measurement artifacts for benchmark quality gates.
- Separate measured evidence from narrative claims.
- Keep benchmark runs deterministic enough for regression comparison.
## Scope
Current benchmark executables:
- `benchmark_signal_store`
- `benchmark_namespace`
- `benchmark_tick`
- `json_loader_bench` (optional, when `FLUXGRAPH_JSON_ENABLED=ON`)
- `yaml_loader_bench` (optional, when `FLUXGRAPH_YAML_ENABLED=ON`)
## Reproducible Runner
Use the benchmark wrapper scripts.
Linux/macOS:

```bash
bash ./scripts/bench.sh --preset dev-release
```

Windows PowerShell:

```powershell
.\scripts\bench.ps1 -Preset dev-windows-release -Config Release
```

Optional loader benchmarks:

```bash
bash ./scripts/bench.sh --preset dev-release --include-optional
```

Strict status enforcement (for gated runs):

```bash
bash ./scripts/bench.sh --preset dev-release --fail-on-status
```
The wrappers call `scripts/run_benchmarks.py`, which:

- Configures and builds the benchmark targets (unless `--no-build` is set).
- Runs each benchmark executable.
- Captures stdout/stderr logs per target.
- Emits a machine-readable manifest with environment, git metadata, and parsed benchmark metrics.

By default, status failures are reported but do not fail the run; execution failures always fail. Use `--fail-on-status` when running gate-enforced benchmark checks.
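The status-reporting behavior above can be sketched as follows. The `[PASS]`/`[FAIL]` line prefix is an assumption for illustration; the authoritative format is whatever the benchmark executables emit and `scripts/run_benchmarks.py` parses.

```python
import re

# Hypothetical status-line format; the real contract is owned by
# run_benchmarks.py and the benchmark executables.
STATUS_RE = re.compile(r"^\[(PASS|FAIL)\]\s+(\S+)")

def parse_status_lines(stdout: str) -> list[dict]:
    """Collect per-check PASS/FAIL lines from a benchmark's stdout."""
    results = []
    for line in stdout.splitlines():
        m = STATUS_RE.match(line.strip())
        if m:
            results.append({"status": m.group(1), "check": m.group(2)})
    return results

def has_status_failure(stdout: str) -> bool:
    """Mirror of the --fail-on-status decision: any FAIL line fails the run."""
    return any(r["status"] == "FAIL" for r in parse_status_lines(stdout))
```

Without `--fail-on-status`, a FAIL line would only be reported in the manifest; with it, the run's exit code reflects `has_status_failure`.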
The wrappers then run `scripts/evaluate_benchmarks.py` with a policy profile:

- `local`: informational; tolerant of workstation variability.
- `ci-hosted`: warning-oriented checks for shared CI runners.
- `ci-dedicated`: strict gate profile for stable-hardware evidence.
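As a sketch of how the three profiles might be encoded in `benchmarks/policy/bench_policy.json`, the field names below are illustrative assumptions, not the actual schema:

```json
{
  "profiles": {
    "local":        { "fail_on": [],                        "warn_on": ["regression", "alloc"] },
    "ci-hosted":    { "fail_on": ["alloc"],                 "warn_on": ["regression"] },
    "ci-dedicated": { "fail_on": ["alloc", "regression"],   "warn_on": [] }
  },
  "thresholds": {
    "scenario.tick.simple.v1.avg_tick_us":    { "max_regression_pct": 10 },
    "scenario.tick.simple.v1.alloc_per_tick": { "max": 0 }
  }
}
```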
## Artifact Contract

Artifacts are stored under:

```
artifacts/benchmarks/<timestamp>_<preset>/
```
Required files:
- `benchmark_results.json`
- `benchmark_evaluation.json`
- `<target>.stdout.log`
- `<target>.stderr.log`
- `configure.stdout.log`, `configure.stderr.log` (when build enabled)
- `build.stdout.log`, `build.stderr.log` (when build enabled)
`benchmark_results.json` contains:
- Timestamp (UTC)
- Preset/config/build directory
- Platform, hostname, Python version
- Git commit hash and dirty-worktree flag
- Per-benchmark executable path, command, exit code, duration, parsed PASS/FAIL status lines
The tick benchmark additionally records heap allocations measured during the timed loop (`Allocations`, `Alloc/tick`), so zero-allocation evidence is captured per scenario.
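Extracting those allocation counters from a captured stdout log can be sketched as below; the exact line layout of `benchmark_tick` output is an assumption here, only the `Allocations`/`Alloc/tick` metric names come from this document.

```python
def parse_tick_metrics(stdout: str) -> dict[str, float]:
    """Pull the allocation counters out of a tick benchmark's stdout log.

    Assumes one metric per line in the form 'Label: value'; the real
    output format is defined by the benchmark_tick executable.
    """
    labels = {"Allocations:": "allocations", "Alloc/tick:": "alloc_per_tick"}
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        stripped = line.strip()
        for label, key in labels.items():
            if stripped.startswith(label):
                metrics[key] = float(stripped.split(":", 1)[1])
    return metrics
```

A zero-allocation scenario then shows `alloc_per_tick == 0.0` in the parsed manifest metrics.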
`benchmark_evaluation.json` contains:
- Selected policy profile
- Issue summary (errors/warnings)
- Per-check findings with metric keys and threshold context
Scenario-versioned keys are used for regression checks, e.g.:
- `scenario.tick.simple.v1.avg_tick_us`
- `scenario.tick.complex.v1.avg_tick_us`
- `scenario.tick.simple.v1.alloc_per_tick`
- `scenario.tick.complex.v1.alloc_per_tick`
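A regression check over these scenario-versioned keys can be sketched as follows. The 10% threshold, the flat `{key: value}` baseline shape, and the missing-key handling are assumptions; the authoritative logic lives in `scripts/evaluate_benchmarks.py`.

```python
def check_regression(current: dict, baseline: dict,
                     key: str, max_regression_pct: float = 10.0) -> tuple[bool, str]:
    """Compare one scenario-versioned metric against a promoted baseline.

    Returns (ok, message). A key absent from the baseline is treated as
    informational rather than a failure, since freshly v-bumped scenario
    keys start without history.
    """
    if key not in baseline:
        return True, f"{key}: no baseline, skipping"
    cur, base = current[key], baseline[key]
    if base == 0:
        # Zero-allocation style metrics: any nonzero value is a regression.
        return cur == 0, f"{key}: {cur} vs baseline 0"
    delta_pct = (cur - base) / base * 100.0
    ok = delta_pct <= max_regression_pct
    return ok, f"{key}: {cur} vs {base} ({delta_pct:+.1f}%)"
```

Versioning the scenario in the key means a changed workload gets a fresh history (`...v2...`) instead of tripping the old threshold.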
Baseline promotion command:

```bash
python scripts/promote_benchmark_baseline.py \
  --results artifacts/benchmarks/<run>/benchmark_results.json \
  --policy benchmarks/policy/bench_policy.json \
  --profile ci-hosted \
  --output benchmarks/policy/baselines/ci-hosted.windows-2022.json
```
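The promoted baseline file presumably carries the scenario-versioned metrics plus provenance. The shape below is an illustrative assumption, not the schema owned by `promote_benchmark_baseline.py`:

```json
{
  "profile": "ci-hosted",
  "source_commit": "<commit hash of the promoted run>",
  "metrics": {
    "scenario.tick.simple.v1.avg_tick_us": 10.4,
    "scenario.tick.simple.v1.alloc_per_tick": 0
  }
}
```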
## Evidence Rules
For any published performance claim, attach:
- Commit hash used for the run.
- Exact benchmark command.
- Full artifact directory or archived equivalent.
- Hardware and OS details (captured in manifest + release notes).
- Comparison baseline (previous artifact manifest).
Claims without linked artifacts are treated as unsupported.
## CI Guidance
Benchmarks are intentionally separated from the default CI correctness lanes.
Recommended:
- Run the benchmark-evidence workflow on demand (`workflow_dispatch`) or on a scheduled lane.
- Store artifacts as CI build artifacts.
- Apply the `ci-hosted` profile on hosted runners and reserve strict gating for dedicated runners.
## Next Steps
- Calibrate thresholds using several hosted-runner samples (reduce false positives while preserving sensitivity).
- Provision and commit a true `ci-dedicated` baseline from stable hardware.
- Add trend reporting (time-series comparison across benchmark-evidence workflow runs).