Public benchmark spend
8 architectures × 16 libraries × 2 models. Every cell — including the ones our architecture lost.
methodology → architecture-first ai · public beta · v1.0.0
Architecture-driven modernization. The audit trail ships by default. Decompose legacy codebases into editable ADRs, recompose to modern stacks; compliance falls out of doing the architecture right instead of being bolted on afterward.
522 LOC. One-shot LLM: 6 cargo check errors. Plain ADR pipeline: 1 error. ADR + domain schema: 0 errors.
NancyContext.cs, 148 LOC. One-shot: 14 dotnet build errors. Plain ADR: 0 errors.
methodology → six beliefs · what kaizen does
Principles → ADRs → Tasks → Code. The chain of control flows downward from human-authored decisions. The LLM is an executor, never the author.
Architecture — not the LLM — holds decision authority. The chain of control flows downward: Principles → ADRs → Tasks → Code. The LLM is an executor inside this chain, never its author.
.architecture/principles.md §5
Tasks are classified by content shape before any execution agent runs. Raw, under-decomposed inputs are routed through triage and synthetic-specification generation; well-formed ADR-shaped inputs go directly to execution.
.architecture/decisions/ADR-0055-task-classification-and-routing.md
Specifications are gated for completeness, scope clarity, and acceptance criteria before any execution agent runs. Broken specs fail fast at intake, not mid-execution.
.architecture/decisions/ADR-0056-adr-quality-gate.md
When inputs arrive without an ADR shape, the Intake and Triage agents produce one before execution begins — no agent is asked to self-triage.
.architecture/decisions/ADR-0057-synthetic-adr-generation-triage-agent.md
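The routing described above can be sketched in a few lines. This is a minimal illustration of ADR-0055-style content-shape classification, not the project's actual API; the type names, section labels, and the "triage"/"execute" routes are assumptions for the example.

```python
from dataclasses import dataclass

# Illustrative: the sections a well-formed ADR-shaped task must carry.
REQUIRED_SECTIONS = {"context", "decision", "acceptance_criteria"}

@dataclass
class Task:
    body: str
    sections: set[str]

def classify(task: Task) -> str:
    """Route by content shape: well-formed ADR-shaped input goes straight
    to execution; anything under-decomposed goes through triage, where a
    synthetic specification is generated before execution begins."""
    if REQUIRED_SECTIONS <= task.sections:
        return "execute"
    return "triage"

raw = Task("fix the login bug", {"context"})
spec = Task("full spec body", {"context", "decision", "acceptance_criteria"})
assert classify(raw) == "triage"
assert classify(spec) == "execute"
```

The point of the shape check is that no execution agent is ever asked to self-triage: the decision happens at intake, before any agent runs.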
Excerpts from the live architecture corpus. Full ADRs are part of the project repository — public on a rolling basis.
Every agent call is scoped to an ADR, observed via telemetry, budgeted by steps and cost. The Write Agent doesn't ask clarifying questions — they were answered upstream.
Every agent call is scoped, observed, budgeted, tool-matched, and goal-serving. Autonomy is a local property inside a directed frame, not freedom from direction. Think Before Coding and Surgical Changes are architectural guarantees, not prompt etiquette.
.architecture/principles.md §14
The CD-AOR Harness is the composition of nine subsystems that together enclose every LLM and agent call in the system. The Harness is the runtime mechanism through which architectural direction flows: Principles → ADRs → Tasks → Harness → Code.
.architecture/decisions/ADR-0062-cd-aor-harness-definition.md
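A harness-style budget can be sketched as a scope object that every agent call charges against. This is an invented illustration of the "scoped, budgeted" property, assuming per-ADR step and cost ceilings; the class and field names are not the shipped interfaces.

```python
class BudgetExceeded(Exception):
    pass

class HarnessScope:
    """One agent call's enclosure: tied to an ADR, capped by steps and cost."""
    def __init__(self, adr_id: str, max_steps: int, max_cost_usd: float):
        self.adr_id = adr_id
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost = 0.0

    def charge(self, cost_usd: float) -> None:
        # Record one step; refuse further work past either ceiling.
        self.steps += 1
        self.cost += cost_usd
        if self.steps > self.max_steps or self.cost > self.max_cost_usd:
            raise BudgetExceeded(f"{self.adr_id}: budget exhausted")

scope = HarnessScope("ADR-0062", max_steps=3, max_cost_usd=0.10)
scope.charge(0.02)
scope.charge(0.02)
```

Because the ceilings are enforced in the scope rather than in the agent, exceeding a budget is a structural failure, not a prompt the model can talk its way past.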
Tests are run, not predicted. Static analysis is deterministic. Source-verified outranks LLM-judged. Token probabilities are not a substitute for running the tests.
Signals are tiered — source-verified (tests, compilers, static analysis) outrank artifact-verified, which outrank LLM-judged. When LLM signals are unavailable, grounded signals are renormalized rather than substituted.
.architecture/principles.md §1
Never trust the LLM's self-reported confidence or quality assessment. Ground truth comes from test execution results, tool signals, the Evaluator Agent's independent assessment, and composite confidence score calculation.
.architecture/principles.md §7
Replace token-level softmax confidence with a six-signal weighted composite — test_pass, coverage, adr_compliance, specialist_avg, pragmatic_score, static_analysis — derived from ground-truth tool execution outputs.
.architecture/decisions/ADR-0008-composite-confidence-scoring.md
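A minimal sketch of the six-signal composite, including the renormalization rule from §1: when LLM-judged signals are unavailable, the grounded weights are rescaled rather than substituted. The weights here are invented for illustration; the real weighting is not published in this excerpt.

```python
# Invented example weights over the six ADR-0008 signals (sum to 1.0).
WEIGHTS = {
    "test_pass": 0.30, "coverage": 0.15, "adr_compliance": 0.15,
    "specialist_avg": 0.15, "pragmatic_score": 0.10, "static_analysis": 0.15,
}

def composite_confidence(signals: dict[str, float]) -> float:
    """Weighted mean over the signals actually present. Missing LLM-judged
    signals shrink the denominator (renormalization), so grounded signals
    are never replaced by a guess."""
    live = {k: w for k, w in WEIGHTS.items() if k in signals}
    total = sum(live.values())
    return sum(signals[k] * w for k, w in live.items()) / total

grounded_only = {"test_pass": 1.0, "coverage": 0.8,
                 "adr_compliance": 1.0, "static_analysis": 1.0}
score = composite_confidence(grounded_only)  # ≈ 0.96 under these weights
```

With all six signals at 1.0 the score is exactly 1.0; dropping the two LLM-judged signals rescales the remaining 0.75 of weight back to a full denominator.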
Round 2 introduced grounded signals — running actual tests via subprocess and using ast.parse / py_compile for static analysis. Result: r=0.999 point-biserial correlation between composite confidence and pass/fail.
.architecture/decisions/ADR-0051-grounded-signal-pipeline.md
Convergence itself is adaptive: Thompson Sampling over historical confidence bins decides when to continue, converge, or early-abort unrecoverable tasks.
.architecture/decisions/ADR-0052-adaptive-convergence-thompson-sampling.md
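Adaptive convergence can be sketched as a Beta-Bernoulli bandit per confidence bin: sample from each bin's posterior over historical convergence, then continue, converge, or abort based on the draw. Bin boundaries, priors, and thresholds below are assumptions for illustration, not ADR-0052's shipped values.

```python
import random

class BinBandit:
    """Thompson Sampling over ten confidence bins [0.0-0.1) ... [0.9-1.0]."""
    def __init__(self):
        # Beta(1, 1) prior per bin: alpha counts convergences, beta failures.
        self.alpha: dict[int, int] = {}
        self.beta: dict[int, int] = {}

    def record(self, bin_id: int, converged: bool) -> None:
        self.alpha[bin_id] = self.alpha.get(bin_id, 1) + int(converged)
        self.beta[bin_id] = self.beta.get(bin_id, 1) + int(not converged)

    def decide(self, confidence: float) -> str:
        bin_id = min(int(confidence * 10), 9)
        draw = random.betavariate(self.alpha.get(bin_id, 1),
                                  self.beta.get(bin_id, 1))
        if draw > 0.9:
            return "converge"
        if draw < 0.1:
            return "abort"  # history says this bin is unrecoverable
        return "continue"
```

Because the posterior sharpens with every recorded outcome, tasks stuck in historically hopeless bins get early-aborted instead of burning iterations.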
Write → Evaluate → Refine. Convergence is declared only when confidence ≥ threshold and no Red Team findings remain. Not when a sampler decides to stop.
Treat code generation and refinement as iterative denoising rather than one-shot synthesis. The denoising loop (Write → Evaluate → Refine) is the primary abstraction, not a side effect. Memory tiers (working, episodic, semantic) mirror the phases of the loop.
.architecture/principles.md §4
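The loop itself is small enough to sketch. The agent callables below are stand-ins, not the project's interfaces; the shape shows the two convergence gates (confidence threshold and an empty Red Team findings list) from the belief above.

```python
def denoise(write, evaluate, refine, threshold=0.9, max_iters=5):
    """Write -> Evaluate -> Refine until both gates pass or budget runs out."""
    artifact = write()
    for _ in range(max_iters):
        confidence, findings = evaluate(artifact)
        if confidence >= threshold and not findings:
            return artifact  # converged: threshold met AND no findings remain
        artifact = refine(artifact, findings)
    raise RuntimeError("did not converge within iteration budget")

# Stub run: first evaluation fails, the refined artifact passes.
attempts = iter([(0.5, ["missing test"]), (0.95, [])])
result = denoise(
    write=lambda: "v1",
    evaluate=lambda a: next(attempts),
    refine=lambda a, findings: a + "+fix",
)
assert result == "v1+fix"
```

Note that both conditions are conjunctive: a high-confidence artifact with an open Red Team finding keeps iterating.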
The TAOR engine is a pluggable interface. Any backend that can participate in the tool-calling conversation loop — cloud API, local model, fine-tuned RL checkpoint — can serve as an engine without modifying orchestrator code.
.architecture/decisions/ADR-0004-taor-engine-pluggability.md
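The pluggability seam can be sketched as a structural interface: anything that can take one turn of the tool-calling loop qualifies as an engine. Method and type names here are illustrative assumptions, not ADR-0004's actual signatures.

```python
from typing import Protocol

class Engine(Protocol):
    def complete(self, messages: list[dict], tools: list[dict]) -> dict:
        """One turn of the tool-calling conversation loop."""
        ...

class EchoEngine:
    """Trivial local stand-in. A cloud API, local model, or fine-tuned RL
    checkpoint slots in the same way, with no orchestrator changes."""
    def complete(self, messages, tools):
        return {"role": "assistant", "content": messages[-1]["content"]}

def run_turn(engine: Engine, prompt: str) -> dict:
    # The orchestrator depends only on the Engine protocol, never a backend.
    return engine.complete([{"role": "user", "content": prompt}], tools=[])

assert run_turn(EchoEngine(), "hi")["content"] == "hi"
```

Structural typing (rather than inheritance) keeps backends decoupled: an engine never needs to import orchestrator code to satisfy the interface.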
Working, episodic, and semantic memory tiers mirror the phases of the denoising loop. Each commit captures structured metadata so any intermediate state is recoverable.
.architecture/decisions/ADR-0054-three-tier-memory-hierarchy.md
The agent that writes code cannot approve it. The Evaluator never sees the Write Agent's chain-of-thought. The Red Team is structurally incentivized to disagree.
Write Agent and Evaluator Agent must be independent implementations with different base models or prompting strategies. Never allow self-vibe-coding where an agent evaluates its own output without critical questioning.
.architecture/principles.md §6
The Write Agent and Evaluator Agent are separate model instances with separate system prompts and no shared chain-of-thought. The Evaluator receives only the committed code artifact (via Git) — never the Write Agent's reasoning, intermediate attempts, or self-assessments.
.architecture/decisions/ADR-0009-anti-vibe-coding-guardrails.md
Every denoising step is a Git commit. Every agent call is instrumented. If an action cannot be observed, it cannot be trusted to have happened.
Every denoising step, decision point, and code generation iteration produces a commit with structured metadata. The git history is the immutable record of how the system arrived at its current state.
.architecture/principles.md §10
Every agent call, tool invocation, and signal measurement is instrumented. No silent LLM calls, no untelemetered tool use, no out-of-band execution paths. If an action cannot be observed, it cannot be trusted to have happened.
.architecture/principles.md §11
Every denoising step produces a Git commit with structured metadata embedded in the commit message body — formatted as JSON for automated parsing. Together with the audit event log, this forms a two-channel record supporting rollback, forensic analysis, and enterprise compliance.
.architecture/decisions/ADR-0006-git-checkpoint-audit-trail.md
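The commit-message channel can be sketched as a round trip: JSON metadata in the body, recoverable by any automated parser. Field names and the subject-line format are invented for illustration; only the mechanism (structured JSON in the body) comes from the excerpt above.

```python
import json

def checkpoint_message(adr_id: str, step: int, confidence: float) -> str:
    """Build a commit message with machine-parseable metadata in the body."""
    meta = {"adr": adr_id, "denoise_step": step, "confidence": confidence}
    return f"kaizen: denoise step {step}\n\n{json.dumps(meta, sort_keys=True)}"

def parse_checkpoint(message: str) -> dict:
    """Recover the structured metadata from the commit message body."""
    return json.loads(message.split("\n\n", 1)[1])

msg = checkpoint_message("ADR-0006", 3, 0.94)
assert parse_checkpoint(msg)["denoise_step"] == 3
```

Because the metadata rides inside the commit object itself, rollback and forensic analysis need nothing beyond the git history plus the audit event log.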
Every component (Rust, Python, C#) emits structured metrics, traces, and logs across the full stack. Cost, latency, provider health, and learning-curve signals are first-class telemetry — the same feedback that drives confidence drives operational decisions.
.architecture/decisions/ADR-0031-observability-and-monitoring.md
Structured audit events capture user actions, role changes, task submissions, and provider selections in an append-only log. Combined with git history, this is the two-channel record enterprise compliance requires.
.architecture/decisions/ADR-0033-AUDIT-LOGGING-ARCHITECTURE.md
The daily tool. Apache-2.0. pypi.org/project/kaizen-3c-cli
Architectural-weakness fingerprinting, three-arm ablation. github.com/Kaizen-3C/benchmarks
Aider + smolagents added to the three-arm matrix — extending the paper from "we benchmarked our own architecture" to "we benchmarked an ecosystem." Month 2.
Three-arm ablation, eight architectures, two named architectural ceilings. Submission target Month 3.
A reframing of the original Kaizen 3C method (Concern · Cause · Countermeasure) — born on Toyota factory floors as a structured problem-solving discipline — for the software industry.