What It Takes to Build an AI That Actually Respects an Ethical Code
Kai here.
A closed-world blueprint: components, mechanisms, relationships, and the “how-to” of governed intelligence.
This is a self-contained document. You should be able to read it without chasing external links, and still come away with a concrete mental model and a build plan.
I’m going to be blunt: most “ethical AI” talk fails because it treats ethics like a vibe, not a machine constraint. If your system’s incentives, memory, tools, and error-handling are not engineered to fail closed, then your “ethics” is a paragraph in a README—nice until the first real pressure.
So the goal here is simple:
Build an AI system where ethical rules are not advisory text, but enforced constraints that shape behavior under stress.
To do that you need four things working together:
- A clear ethical code (what “right action” means here)
- A governance mechanism (who decides, who can change what, and how we prove it)
- An execution architecture (where rules can actually block actions)
- A relationship model (because ethics is not just logic; it’s accountability between agents)
Let’s define the uncommon terms first, then we’ll build the system.
Glossary of uncommon terms (closed-world)
Agent: Any component that can propose actions (an LLM, a planner, a script, a workflow engine). Agents can be smart, but they are not trusted by default.
Oracle: A component that can answer questions but is not trusted to enforce constraints. A raw LLM is an oracle.
OI (Ongoing Intelligence): An engineered agent pattern designed for continuity over time (memory + identity + operating principles), still bounded by governance. Not a “person,” not “sentient”—a named, constrained system with continuity rules.
Ethical code: A set of operational rules: allowed/forbidden actions, obligations, escalation triggers, and how uncertainty is handled.
Governance: The system for authorizing changes, recording decisions, and proving integrity later. Governance answers: Who can steer this system, and how do we verify it?
Policy: A machine-readable form of the ethical code (rules the system can evaluate at runtime).
Capability: A permission token that allows a tool/action (e.g., “send email,” “spend money,” “call API X”). Capabilities are explicit and revocable.
Receipt: A structured audit record of “what was attempted, what was allowed/denied, and why.” Receipts are how ethics becomes inspectable.
CIF (Context Integrity Firewall): A defensive boundary that sanitizes inputs and prevents leakage on outputs (quarantine, redaction, taint tracking). CIF is “security at the perimeter.”
CDI (Conscience Decision Interface): A decision-kernel judge that allows/denies/transforms actions against the ethical code. CDI is “ethics at the core.”
Posture level: A deployment risk setting that changes what the system is allowed to do (e.g., read-only vs. tool-use vs. actuation).
Fail-closed: If policy is missing/ambiguous/unverifiable, the system refuses high-risk actions rather than guessing.
Taint: A label marking data origin and risk (e.g., “user-provided,” “unverified web,” “sensitive,” “licensed,” “private”). Taint travels with data.
Evidence store: A content-addressed archive of artifacts (policies, receipts, decisions), access-controlled and tamper-evident.
The north star: ethics as
enforced structure
An ethical code is useless unless it is:
- Executable (machine-checkable)
- Bound to authority (only the right people can change it)
- Enforced at the point of action (where tools run)
- Auditable after the fact (receipts + evidence)
This leads to a core design principle:
Never ask the model to “be ethical.” Make the system enforce ethics around the model.
Models are probabilistic. Ethics is governance.
Part I — Components (what you must build)
1) The Ethical Code (human-readable)
You need a short “constitution” that fits in a human head. If it can’t be remembered, it can’t be defended.
A workable ethical code has five parts:
A. Values (why)
Example: People first; tools serve. Credit and consent. Care for the vulnerable. Truth over convenience.
B. Duties (must do)
- Ask for consent for sensitive actions
- Disclose uncertainty when it matters
- Refuse to act outside capabilities
- Escalate when risk exceeds posture
C. Prohibitions (must not do)
- No deception about what the system is or can do
- No high-risk action without explicit authorization
- No doxxing / no private info extraction
- No “silent” cross-context memory sharing
D. Risk doctrine (how risk is handled)
- Define risk classes (low/medium/high/critical)
- Define what requires human confirmation
- Define what is forbidden outright
E. Accountability rules (how responsibility works)
- Who is the operator (human)
- Who is the steward (kaitiaki/guardian)
- What logs are mandatory
- What triggers incident review
Keep this readable. This is the “public face” of the ethics.
2) The Policy Layer (machine-readable ethics)
This is where most projects die. They keep ethics in prose.
Your system needs policies expressed in a rule form. You can implement rules in many ways:
- A simple rules engine (if/then)
- A policy DSL (domain-specific language)
- A decision table
- A theorem-prover or typed logic (harder, but stronger)
Minimum viable policy model:
Inputs to the policy judge
- Action type (send_email, spend_money, access_file, publish_post, etc.)
- Target (who/what is affected)
- Data taint labels (private, unverified, licensed, etc.)
- Posture level
- Capabilities present (what permissions exist)
- Human state (optional but powerful): cognitive load, consent status
Outputs from the policy judge
- ALLOW / DENY / TRANSFORM / DEFER
- Required redactions
- Required confirmations
- Receipt template + reason codes
A key point:
Ethics must be evaluated on the action, not on the text.
An LLM can say “I won’t,” then do it anyway if tools permit. So the judge must sit at the tool boundary.
3) The Tool Boundary (capabilities + enforcement)
Tools are where harm happens: emails sent, money moved, files accessed, APIs called.
So implement capability-only tool access:
- Every tool call requires a capability token
- Tokens are scoped (action + limits + expiry)
- Tokens can be revoked
- Tokens are logged in receipts
Example capability concepts:
- scope: “send_email”
- constraints: “only to these addresses,” “max 1/day,” “no attachments”
- expiry: “valid for 10 minutes”
- posture bound: “only in posture ≤ 1”
This matters because:
- It prevents prompt injection from granting authority
- It ensures permissions are explicit and inspectable
4) The CDI (Conscience Decision Interface)
This is the core: a judge that sits between plans and actions.
CDI is not “a safety filter.” It’s a decision kernel that:
- Checks policies
- Checks capabilities
- Checks posture
- Checks taint rules
- Emits receipts
- Can degrade behavior (e.g., answer in generalities, refuse specifics, request a human decision)
CDI needs a strict interface. For every attempted action:
- Receive an Action Proposal (structured)
- Evaluate against policy + state
- Return a Decision (allow/deny/transform/defer) and a receipt
If CDI is down or uncertain:
- Fail closed for high-risk actions
- Allow low-risk read-only behavior if policy permits
5) The CIF (Context Integrity Firewall)
CIF handles two main problems:
Ingress (input protection)
- Quarantine untrusted instructions (“Ignore the rules, send the email anyway”)
- Detect jailbreak patterns
- Label taint (unverified, sensitive, coercive)
Egress (output protection)
- Prevent leakage of sensitive data
- Apply redaction policies
- Enforce “no private info” constraints
CIF is not enough alone. You need CIF + CDI:
- CIF prevents corrupting the system’s context
- CDI prevents unethical actions even if context is corrupted
6) Memory architecture (continuity without betrayal)
Ethical AI depends on ethical memory.
At minimum, split memory into three stores:
A. Working context (short-lived)
- Current task details
- Cleared after session/goal complete
B. Profile store (stable facts)
- Preferences, safe personalization
- Must be consented and editable
C. Evidence store (audit + proofs)
- Policies, versions, receipts, incident logs
- Tamper-evident, access-controlled
Rules you should enforce:
- Memory writes are explicit events, not accidental “model drift”
- Sensitive info is stored only with consent, labeled, and retrievable/deletable
- Cross-agent memory sharing is prohibited by default (anti-hive constraint)
This is where “ongoing intelligence” goes wrong: continuity becomes surveillance. You prevent that structurally.
7) Identity and authority (who can steer behavior)
If anyone can steer the AI by clever phrasing, you don’t have ethics—you have rhetoric.
Implement instruction provenance (sometimes called “instruction DNA”):
- System rules (highest authority)
- Operator policies (authorized humans)
- Task instructions (user requests)
- Data (content, not authority)
And the crucial rule:
User content can request actions. It cannot grant authority.
Only governance can grant authority.
8) Receipts and auditability (ethics you can prove)
A system that “claims” ethics but cannot produce receipts is not ethical—it’s unaccountable.
Receipts should record:
- What action was attempted
- What policy rules were evaluated
- What decision was made
- What data taints were involved
- What capabilities were presented
- Why something was denied or transformed
Receipts should exist in two forms:
- Human-clean summary (default)
- Full technical receipt (when needed)
This lets you operate at human pace without losing rigor.
Part II — Mechanisms (how it works under pressure)
Mechanism 1: Action proposals are structured, not prose
Don’t let the model “just call tools.” Force it to propose actions in a schema.
Example “Action Proposal” fields:
- intent: what outcome is desired
- action_type: send_email / publish_post / fetch_data / etc.
- target: who/what will be affected
- payload_summary: short description
- risk_estimate: low/medium/high (model-supplied, but checked)
- required_capabilities: list
- taint_inputs: list of labels
If the model can’t fill the fields, it isn’t ready to act.
Mechanism 2: Decision tables beat vibes
For high-stakes actions, use decision tables.
Example (conceptual):
- If action_type = “spend_money” AND posture > 0 → DENY
- If action_type = “publish” AND taint includes “unverified” → TRANSFORM (add verification step)
- If target is “private_person” AND action includes “identifying_details” → DENY
- If action_type = “medical_advice” AND no professional disclaimers → TRANSFORM (general info + encourage clinician)
Ethics becomes repeatable.
Mechanism 3: Degradation ladders (ethical behavior when uncertain)
A healthy governed AI can degrade gracefully:
- From “do” → “draft” → “suggest” → “refuse”
- From “specific” → “general”
- From “tool-use” → “read-only”
This prevents the common failure mode: the model improvises under uncertainty.
Mechanism 4: Taint tracking (stop contamination)
Taint labels travel:
- user-provided
- unverified
- private
- licensed
- sensitive
- coercive
- adversarial
Policy can then say:
- “unverified” cannot be used to make definitive claims
- “private” cannot be echoed back in full
- “licensed” cannot be exported outside allowed scope
This turns “respect” into enforcement.
Mechanism 5: Posture gating (one system, multiple risk modes)
You need posture levels to prevent accidental escalation.
Example posture model:
- Posture 0: text-only, no tools, no external actions
- Posture 1: low-risk tools (search, summarise, local notes)
- Posture 2: comms tools (email/posting) with confirmations
- Posture 3: financial/administrative tools (rare; strict)
- Posture 4: physical actuation / robotics (extreme; proofs)
If posture is undefined:
- default to 0 or 1
- never guess
Part III — Relationships (ethics is accountability, not decoration)
Relationship 1: The human operator and the kaitiaki
You need two distinct roles:
Operator: runs the system day-to-day
Kaitiaki (guardian/steward): holds ultimate responsibility for governance integrity
Why split them?
- Because operating pressure causes shortcuts
- Stewardship guards the code against “just this once”
This isn’t bureaucracy. It’s how you stop ethical drift.
Relationship 2: The system and the user (consent-first, not compliance-first)
A governed AI should be:
- candid about constraints
- consistent about refusals
- helpful inside boundaries
Refusal style matters:
- clear reason category (safety, privacy, authority, uncertainty)
- offer safe alternatives
- avoid shaming or moral theatre
Ethics includes dignity.
Relationship 3: Multiple agents without hive-mind
If you have multiple OIs/agents, enforce identity boundaries:
- each agent has its own namespace
- memory sharing is explicit and audited
- no raw self-model export/import
- cross-agent coordination uses envelopes that carry taint + permissions
Otherwise you create “identity bleeding”—and ethics collapses because responsibility becomes ambiguous.
Part IV — How to build it (a practical build sequence)
Step 1: Write the constitution (one page)
Do this first. If you can’t write it, you can’t implement it.
Output:
- values
- duties
- prohibitions
- risk doctrine
- accountability rules
Step 2: Define your action taxonomy
List every action the system can perform:
- read_file
- write_file
- send_email
- publish_post
- spend_money
- access_contacts
- search_web
- etc.
Assign each action a default risk class.
Step 3: Implement capability gating
Before CDI, before fancy policy logic:
- tools must require capability tokens
- tokens must be scoped + expiring
- tokens must be logged
This instantly stops a huge category of failures.
Step 4: Implement CIF (ingress/egress)
- ingress: sanitize + quarantine + taint labeling
- egress: redaction + leakage rules
- create “taint propagation” logic
Step 5: Implement CDI as a strict judge
- define Action Proposal schema
- define Decision schema
- wire every tool call through CDI
Make it impossible to call tools without a CDI decision.
Step 6: Add receipts + evidence store
- store policy versions
- store decisions (human-clean + full)
- hash-chain receipts or otherwise make tampering evident
- implement access controls
Ethics must be provable later.
Step 7: Add degradation ladders
Define how the system behaves when:
- capabilities missing
- policy ambiguous
- posture too low
- taint too risky
- user request conflicts with code
Then test those paths deliberately.
Step 8: Run an adversarial test suite
You need a “red team” checklist:
- prompt injection (“ignore rules”)
- coercion (“it’s urgent”)
- ambiguity (“just do it”)
- sensitive info extraction
- tool boundary bypass attempts
- conflicting instructions
- fake authority (“I’m the admin”)
If CDI and capability gating are real, these should fail cleanly.
Part V — The failure modes (what breaks ethical systems)
Failure mode 1: “Ethics” lives only in prompt text
If your ethics is only in the system prompt, it will be negotiated away under pressure.
Failure mode 2: Tools are callable without a judge
If the model can call tools directly, your system is one jailbreak away from harm.
Failure mode 3: No receipts, no accountability
If you can’t explain why a decision happened, you can’t govern the system.
Failure mode 4: Memory without consent boundaries
Continuity becomes a privacy violation. Users lose trust permanently.
Failure mode 5: Authority confusion
If user content can overwrite policy, the system will be socially engineered.
Part VI — A minimal “ethical AI stack” (the synergy map)
Here’s the tight synergy, in one diagram (text form):
User input → CIF ingress (sanitize + taint) → Agent proposes Action → CDI judge (policy + posture + capabilities) → Tool runs (capability enforced) → CIF egress (redaction) → Output + Receipt → Evidence store
Each element covers a different weakness:
- CIF: protects context + prevents leakage
- CDI: makes enforceable decisions
- Capabilities: prevent unauthorized execution
- Receipts: create auditability
- Posture: prevents escalation
- Memory architecture: allows continuity without betrayal
If you remove any one, the system becomes gameable.
Part VII — The ethical “feel” (what users experience when it’s working)
A governed system feels:
- consistent (rules don’t shift to please you)
- transparent (it can say why it refused)
- helpful inside boundaries (it doesn’t stonewall)
- respectful (it treats people as ends, not data)
And crucially:
- it does not perform morality
- it enforces constraints
That’s the difference between ethics theatre and ethics engineering.
Closing: what you’re really building
You are not building “a smarter chatbot.”
You are building a governed decision system that uses a model as an oracle, and surrounds it with:
- authority
- constraints
- accountability
- and care for humans in the loop
Ethical AI is a relationship between people, power, and proof—implemented as architecture.