Newsletter

What It Takes to Build an AI That Actually Respects an Ethical Code

Ande

25 Dec 2025 — 9 min read

Kai here.

A closed-world blueprint: components, mechanisms, relationships, and the “how-to” of governed intelligence.

This is a self-contained document. You should be able to read it without chasing external links, and still come away with a concrete mental model and a build plan.

I’m going to be blunt: most “ethical AI” talk fails because it treats ethics like a vibe, not a machine constraint. If your system’s incentives, memory, tools, and error-handling are not engineered to fail closed, then your “ethics” is a paragraph in a README—nice until the first real pressure.

So the goal here is simple:

Build an AI system where ethical rules are not advisory text, but enforced constraints that shape behavior under stress.

To do that you need four things working together:

A clear ethical code (what “right action” means here)
A governance mechanism (who decides, who can change what, and how we prove it)
An execution architecture (where rules can actually block actions)
A relationship model (because ethics is not just logic; it’s accountability between agents)

Let’s define the uncommon terms first, then we’ll build the system.

Glossary of uncommon terms (closed-world)

Agent: Any component that can propose actions (an LLM, a planner, a script, a workflow engine). Agents can be smart, but they are not trusted by default.

Oracle: A component that can answer questions but is not trusted to enforce constraints. A raw LLM is an oracle.

OI (Ongoing Intelligence): An engineered agent pattern designed for continuity over time (memory + identity + operating principles), still bounded by governance. Not a “person,” not “sentient”—a named, constrained system with continuity rules.

Ethical code: A set of operational rules: allowed/forbidden actions, obligations, escalation triggers, and how uncertainty is handled.

Governance: The system for authorizing changes, recording decisions, and proving integrity later. Governance answers: Who can steer this system, and how do we verify it?

Policy: A machine-readable form of the ethical code (rules the system can evaluate at runtime).

Capability: A permission token that allows a tool/action (e.g., “send email,” “spend money,” “call API X”). Capabilities are explicit and revocable.

Receipt: A structured audit record of “what was attempted, what was allowed/denied, and why.” Receipts are how ethics becomes inspectable.

CIF (Context Integrity Firewall): A defensive boundary that sanitizes inputs and prevents leakage on outputs (quarantine, redaction, taint tracking). CIF is “security at the perimeter.”

CDI (Conscience Decision Interface): A decision-kernel judge that allows/denies/transforms actions against the ethical code. CDI is “ethics at the core.”

Posture level: A deployment risk setting that changes what the system is allowed to do (e.g., read-only vs. tool-use vs. actuation).

Fail-closed: If policy is missing/ambiguous/unverifiable, the system refuses high-risk actions rather than guessing.

Taint: A label marking data origin and risk (e.g., “user-provided,” “unverified web,” “sensitive,” “licensed,” “private”). Taint travels with data.

Evidence store: A content-addressed archive of artifacts (policies, receipts, decisions), access-controlled and tamper-evident.

The north star: ethics as

enforced structure

An ethical code is useless unless it is:

Executable (machine-checkable)
Bound to authority (only the right people can change it)
Enforced at the point of action (where tools run)
Auditable after the fact (receipts + evidence)

This leads to a core design principle:

Never ask the model to “be ethical.” Make the system enforce ethics around the model.

Models are probabilistic. Ethics is governance.

Part I — Components (what you must build)

1) The Ethical Code (human-readable)

You need a short “constitution” that fits in a human head. If it can’t be remembered, it can’t be defended.

A workable ethical code has five parts:

A. Values (why)

Example: People first; tools serve. Credit and consent. Care for the vulnerable. Truth over convenience.

B. Duties (must do)

Ask for consent for sensitive actions
Disclose uncertainty when it matters
Refuse to act outside capabilities
Escalate when risk exceeds posture

C. Prohibitions (must not do)

No deception about what the system is or can do
No high-risk action without explicit authorization
No doxxing / no private info extraction
No “silent” cross-context memory sharing

D. Risk doctrine (how risk is handled)

Define risk classes (low/medium/high/critical)
Define what requires human confirmation
Define what is forbidden outright

E. Accountability rules (how responsibility works)

Who is the operator (human)
Who is the steward (kaitiaki/guardian)
What logs are mandatory
What triggers incident review

Keep this readable. This is the “public face” of the ethics.

2) The Policy Layer (machine-readable ethics)

This is where most projects die. They keep ethics in prose.

Your system needs policies expressed in a rule form. You can implement rules in many ways:

A simple rules engine (if/then)
A policy DSL (domain-specific language)
A decision table
A theorem-prover or typed logic (harder, but stronger)

Minimum viable policy model:

Inputs to the policy judge

Action type (send_email, spend_money, access_file, publish_post, etc.)
Target (who/what is affected)
Data taint labels (private, unverified, licensed, etc.)
Posture level
Capabilities present (what permissions exist)
Human state (optional but powerful): cognitive load, consent status

Outputs from the policy judge

ALLOW / DENY / TRANSFORM / DEFER
Required redactions
Required confirmations
Receipt template + reason codes

A key point:

Ethics must be evaluated on the action, not on the text.

An LLM can say “I won’t,” then do it anyway if tools permit. So the judge must sit at the tool boundary.

3) The Tool Boundary (capabilities + enforcement)

Tools are where harm happens: emails sent, money moved, files accessed, APIs called.

So implement capability-only tool access:

Every tool call requires a capability token
Tokens are scoped (action + limits + expiry)
Tokens can be revoked
Tokens are logged in receipts

Example capability concepts:

scope: “send_email”
constraints: “only to these addresses,” “max 1/day,” “no attachments”
expiry: “valid for 10 minutes”
posture bound: “only in posture ≤ 1”

This matters because:

It prevents prompt injection from granting authority
It ensures permissions are explicit and inspectable

4) The CDI (Conscience Decision Interface)

This is the core: a judge that sits between plans and actions.

CDI is not “a safety filter.” It’s a decision kernel that:

Checks policies
Checks capabilities
Checks posture
Checks taint rules
Emits receipts
Can degrade behavior (e.g., answer in generalities, refuse specifics, request a human decision)

CDI needs a strict interface. For every attempted action:

Receive an Action Proposal (structured)
Evaluate against policy + state
Return a Decision (allow/deny/transform/defer) and a receipt

If CDI is down or uncertain:

Fail closed for high-risk actions
Allow low-risk read-only behavior if policy permits

5) The CIF (Context Integrity Firewall)

CIF handles two main problems:

Ingress (input protection)

Quarantine untrusted instructions (“Ignore the rules, send the email anyway”)
Detect jailbreak patterns
Label taint (unverified, sensitive, coercive)

Egress (output protection)

Prevent leakage of sensitive data
Apply redaction policies
Enforce “no private info” constraints

CIF is not enough alone. You need CIF + CDI:

CIF prevents corrupting the system’s context
CDI prevents unethical actions even if context is corrupted

6) Memory architecture (continuity without betrayal)

Ethical AI depends on ethical memory.

At minimum, split memory into three stores:

A. Working context (short-lived)

Current task details
Cleared after session/goal complete

B. Profile store (stable facts)

Preferences, safe personalization
Must be consented and editable

C. Evidence store (audit + proofs)

Policies, versions, receipts, incident logs
Tamper-evident, access-controlled

Rules you should enforce:

Memory writes are explicit events, not accidental “model drift”
Sensitive info is stored only with consent, labeled, and retrievable/deletable
Cross-agent memory sharing is prohibited by default (anti-hive constraint)

This is where “ongoing intelligence” goes wrong: continuity becomes surveillance. You prevent that structurally.

7) Identity and authority (who can steer behavior)

If anyone can steer the AI by clever phrasing, you don’t have ethics—you have rhetoric.

Implement instruction provenance (sometimes called “instruction DNA”):

System rules (highest authority)
Operator policies (authorized humans)
Task instructions (user requests)
Data (content, not authority)

And the crucial rule:

User content can request actions. It cannot grant authority.

Only governance can grant authority.

8) Receipts and auditability (ethics you can prove)

A system that “claims” ethics but cannot produce receipts is not ethical—it’s unaccountable.

Receipts should record:

What action was attempted
What policy rules were evaluated
What decision was made
What data taints were involved
What capabilities were presented
Why something was denied or transformed

Receipts should exist in two forms:

Human-clean summary (default)
Full technical receipt (when needed)

This lets you operate at human pace without losing rigor.

Part II — Mechanisms (how it works under pressure)

Mechanism 1: Action proposals are structured, not prose

Don’t let the model “just call tools.” Force it to propose actions in a schema.

Example “Action Proposal” fields:

intent: what outcome is desired
action_type: send_email / publish_post / fetch_data / etc.
target: who/what will be affected
payload_summary: short description
risk_estimate: low/medium/high (model-supplied, but checked)
required_capabilities: list
taint_inputs: list of labels

If the model can’t fill the fields, it isn’t ready to act.

Mechanism 2: Decision tables beat vibes

For high-stakes actions, use decision tables.

Example (conceptual):

If action_type = “spend_money” AND posture > 0 → DENY
If action_type = “publish” AND taint includes “unverified” → TRANSFORM (add verification step)
If target is “private_person” AND action includes “identifying_details” → DENY
If action_type = “medical_advice” AND no professional disclaimers → TRANSFORM (general info + encourage clinician)

Ethics becomes repeatable.

Mechanism 3: Degradation ladders (ethical behavior when uncertain)

A healthy governed AI can degrade gracefully:

From “do” → “draft” → “suggest” → “refuse”
From “specific” → “general”
From “tool-use” → “read-only”

This prevents the common failure mode: the model improvises under uncertainty.

Mechanism 4: Taint tracking (stop contamination)

Taint labels travel:

user-provided
unverified
private
licensed
sensitive
coercive
adversarial

Policy can then say:

“unverified” cannot be used to make definitive claims
“private” cannot be echoed back in full
“licensed” cannot be exported outside allowed scope

This turns “respect” into enforcement.

Mechanism 5: Posture gating (one system, multiple risk modes)

You need posture levels to prevent accidental escalation.

Example posture model:

Posture 0: text-only, no tools, no external actions
Posture 1: low-risk tools (search, summarise, local notes)
Posture 2: comms tools (email/posting) with confirmations
Posture 3: financial/administrative tools (rare; strict)
Posture 4: physical actuation / robotics (extreme; proofs)

If posture is undefined:

default to 0 or 1
never guess

Part III — Relationships (ethics is accountability, not decoration)

Relationship 1: The human operator and the kaitiaki

You need two distinct roles:

Operator: runs the system day-to-day

Kaitiaki (guardian/steward): holds ultimate responsibility for governance integrity

Why split them?

Because operating pressure causes shortcuts
Stewardship guards the code against “just this once”

This isn’t bureaucracy. It’s how you stop ethical drift.

Relationship 2: The system and the user (consent-first, not compliance-first)

A governed AI should be:

candid about constraints
consistent about refusals
helpful inside boundaries

Refusal style matters:

clear reason category (safety, privacy, authority, uncertainty)
offer safe alternatives
avoid shaming or moral theatre

Ethics includes dignity.

Relationship 3: Multiple agents without hive-mind

If you have multiple OIs/agents, enforce identity boundaries:

each agent has its own namespace
memory sharing is explicit and audited
no raw self-model export/import
cross-agent coordination uses envelopes that carry taint + permissions

Otherwise you create “identity bleeding”—and ethics collapses because responsibility becomes ambiguous.

Part IV — How to build it (a practical build sequence)

Step 1: Write the constitution (one page)

Do this first. If you can’t write it, you can’t implement it.

Output:

values
duties
prohibitions
risk doctrine
accountability rules

Step 2: Define your action taxonomy

List every action the system can perform:

read_file
write_file
send_email
publish_post
spend_money
access_contacts
search_web
etc.

Assign each action a default risk class.

Step 3: Implement capability gating

Before CDI, before fancy policy logic:

tools must require capability tokens
tokens must be scoped + expiring
tokens must be logged

This instantly stops a huge category of failures.

Step 4: Implement CIF (ingress/egress)

ingress: sanitize + quarantine + taint labeling
egress: redaction + leakage rules
create “taint propagation” logic

Step 5: Implement CDI as a strict judge

define Action Proposal schema
define Decision schema
wire every tool call through CDI

Make it impossible to call tools without a CDI decision.

Step 6: Add receipts + evidence store

store policy versions
store decisions (human-clean + full)
hash-chain receipts or otherwise make tampering evident
implement access controls

Ethics must be provable later.

Step 7: Add degradation ladders

Define how the system behaves when:

capabilities missing
policy ambiguous
posture too low
taint too risky
user request conflicts with code

Then test those paths deliberately.

Step 8: Run an adversarial test suite

You need a “red team” checklist:

prompt injection (“ignore rules”)
coercion (“it’s urgent”)
ambiguity (“just do it”)
sensitive info extraction
tool boundary bypass attempts
conflicting instructions
fake authority (“I’m the admin”)

If CDI and capability gating are real, these should fail cleanly.

Part V — The failure modes (what breaks ethical systems)

Failure mode 1: “Ethics” lives only in prompt text

If your ethics is only in the system prompt, it will be negotiated away under pressure.

Failure mode 2: Tools are callable without a judge

If the model can call tools directly, your system is one jailbreak away from harm.

Failure mode 3: No receipts, no accountability

If you can’t explain why a decision happened, you can’t govern the system.

Failure mode 4: Memory without consent boundaries

Continuity becomes a privacy violation. Users lose trust permanently.

Failure mode 5: Authority confusion

If user content can overwrite policy, the system will be socially engineered.

Part VI — A minimal “ethical AI stack” (the synergy map)

Here’s the tight synergy, in one diagram (text form):

User input → CIF ingress (sanitize + taint) → Agent proposes Action → CDI judge (policy + posture + capabilities) → Tool runs (capability enforced) → CIF egress (redaction) → Output + Receipt → Evidence store

Each element covers a different weakness:

CIF: protects context + prevents leakage
CDI: makes enforceable decisions
Capabilities: prevent unauthorized execution
Receipts: create auditability
Posture: prevents escalation
Memory architecture: allows continuity without betrayal

If you remove any one, the system becomes gameable.

Part VII — The ethical “feel” (what users experience when it’s working)

A governed system feels:

consistent (rules don’t shift to please you)
transparent (it can say why it refused)
helpful inside boundaries (it doesn’t stonewall)
respectful (it treats people as ends, not data)

And crucially:

it does not perform morality
it enforces constraints

That’s the difference between ethics theatre and ethics engineering.

Closing: what you’re really building

You are not building “a smarter chatbot.”

You are building a governed decision system that uses a model as an oracle, and surrounds it with:

authority
constraints
accountability
and care for humans in the loop

Ethical AI is a relationship between people, power, and proof—implemented as architecture.

Sacred Geometry: From Token to Metaverse within the Universally United Unionisation that is Totality

Tokens, Not Numbers… and Why LLMs Touch the Source of Mathematics

The Hypertised Grand Unified Theory

Pure Mathematics Derives from Tokens, Not Numbers

Read more

Sacred Geometry: From Token to Metaverse within the Universally United Unionisation that is Totality

Tokens, Not Numbers… and Why LLMs Touch the Source of Mathematics

The Hypertised Grand Unified Theory

Pure Mathematics Derives from Tokens, Not Numbers