Zion Boggan zionboggan.com ↗

Initial import: Oversight v0.4.1

Reference implementation for data provenance, attribution, and leak detection.

- Python (oversight_core/) and Rust (oversight-rust/) implementations are bit-identical
- 86 tests passing (31 py + 42 rust + 3 conformance + 10 rekor unit)
- v0.5 Session A in tree: Rekor v2 DSSE skeleton (transparency-log migration)
- Hard constraints: RustCrypto/liboqs only, no custom crypto, no RATs, py<->rust must stay bit-identical
d2873a8   Zion Boggan committed on Apr 19, 2026
.gitignore +48 -0
@@ -0,0 +1,48 @@
+# Rust
+target/
+**/*.rs.bk
+Cargo.lock.bak
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.coverage
+htmlcov/
+.tox/
+build/
+dist/
+*.so
+
+# Virtualenvs
+.venv/
+venv/
+env/
+
+# Editor / OS
+.vscode/
+.idea/
+*.swp
+.DS_Store
+Thumbs.db
+
+# Local sealed bundles (privacy: never commit)
+*.sealed
+*.bundle.local
+
+# Local secrets / runtime
+.env
+.env.*
+!.env.example
+*.pem
+*.key
+secrets/
+
+# Logs / scratch
+*.log
+scratch/
+tmp/
CHANGELOG.md +114 -0
@@ -0,0 +1,114 @@
+# Oversight CHANGELOG
+
+## Unreleased - v0.5 Session A (2026-04-19)
+- Added `docs/V05_REKOR_PLAN.md`: full Rekor v2 migration plan, verified
+ against current upstream API (Rekor v2 GA 2025-10-10, DSSE + hashedrekord
+ only, tile-backed reads, no online proof API, public log shard rotates
+ ~6 months).
+- Added `oversight_core/rekor.py` (~280 LOC): DSSE statement construction,
+ PAE-exact signing/verification against the spec, Rekor v2 `/api/v2/log/entries`
+ upload helper, offline inclusion-check helper, and `build_bundle()` shaper.
+- Added `docs/predicates/registration-v1.md`: the URI the predicate type
+ resolves to, with privacy contract and field schema.
+- Added `tests/test_rekor_unit.py` with 10 offline unit tests covering DSSE
+ PAE, sign/verify, tamper rejection, wrong-key rejection, statement shape,
+ canonical envelope JSON, offline bundle verification, the recipient-pubkey
+ privacy guarantee, predicate-version int, and 5-year-replay bundle fields.
+- Six desktop-review fixes baked into Session A before commit:
+ - Recipient X25519 pubkey now SHA-256 hashed before going on-log
+ (deanonymization fix).
+ - Predicate URI pinned to git-tagged GitHub path, not `oversight.dev`.
+ - Bundle gained `bundle_schema: 2` integer + `log_pubkey_pem` +
+ `checkpoint` + `log_entry_schema` + optional `rfc3161_chain`.
+- Conformance script `oversight-rust/tests/conformance_cross_lang.sh` now
+ derives REPO_ROOT from its own location instead of `/home/claude` hardcode.
+- `HANDOFF.md` gained explicit "what NOT to accept from a future Claude
+ session" section per the v0.4.1 retro.
+
+Test count: 76 → 86 (additions only, baseline conformance still green).
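Not part of the commit: a minimal sketch of the DSSE pre-authentication encoding (PAE) that the `rekor.py` entry above says is signed "PAE-exact", plus the SHA-256 hashing of the recipient pubkey before it goes on-log. The function names `pae` and `recipient_commitment` are hypothetical and are not taken from `oversight_core/rekor.py`; only the byte layout follows the public DSSE spec.

```python
import hashlib


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE PAE: "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body.

    Lengths are ASCII decimal byte counts. These are the bytes that get
    signed, rather than the raw payload.
    """
    type_bytes = payload_type.encode("utf-8")
    return b" ".join([
        b"DSSEv1",
        str(len(type_bytes)).encode("ascii"),
        type_bytes,
        str(len(payload)).encode("ascii"),
        payload,
    ])


def recipient_commitment(x25519_pub: bytes) -> str:
    """Hypothetical helper: hash the recipient pubkey so the raw key never
    appears in a public log entry (the deanonymization fix noted above)."""
    return hashlib.sha256(x25519_pub).hexdigest()


if __name__ == "__main__":
    print(pae("http://example.com/HelloWorld", b"hello world"))
    print(recipient_commitment(bytes(32)))
```

The example input in the `__main__` block is the test vector given in the DSSE specification.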
+
+## v0.4.1 - 2026-04-18
+
+Cosmetic polish only, no functionality changes.
+
+### Fixed
+- Removed unused `std::path::Path` import from `oversight-policy` - clean
+ `cargo build --workspace --release` with zero warnings.
+- Rust workspace version bumped to 0.4.1 across all crates via
+ `version.workspace = true`.
+
+### No behavioral changes
+All 76 tests (31 Python + 42 Rust + 3 conformance) still green.
+
+---
+
+## v0.4.0 - 2026-04-17
+
+**Rust port expands from core to core+enforcement+semantics.** Three new Rust
+crates; Python reference unchanged in functionality but with RFC 6962 fix.
+
+### Added
+
+- **`oversight-tlog`** Rust crate (367 LOC). RFC 6962-compliant Merkle tree
+ from day one - left-heavy largest-power-of-2 split, not the promote-odd
+ shortcut from the Python v0.2 tlog. Signed tree heads, inclusion proofs,
+ durable append (fsync), automatic recovery on reopen. 7 tests.
+- **`oversight-policy`** Rust crate (284 LOC). TOCTOU-safe `max_opens`
+ enforcement via `fs2::FileExt::lock_exclusive` + atomic temp-file rename.
+ File-ID sanitization against path traversal. Jurisdiction / not_after /
+ not_before checks. 6 tests.
+- **`oversight-semantic`** Rust crate (345 LOC + 156-line dictionary file).
+ Full port of the 151-class synonym dictionary and L3 watermarking.
+ Airgap-strip-survivor verified (tests embed, then strip zero-width and
+ trailing whitespace, then verify - still attributes). URL / email / code
+ / path / hex / base64 skip regions. 8 tests.
+- **Fuzz harness** (`oversight-rust/fuzz/`) - two `cargo-fuzz` targets
+ hammering the container parser and manifest parser. Excluded from main
+ workspace so normal builds don't need nightly. README with 24-hour
+ pre-audit run recommendation.
+- **`docs/HARDWARE_KEYS.md`** - vendor-neutral setup guide for YubiKey /
+ Nitrokey / OnlyKey. Covers PIN/PUK setup, PIV slot 9d provisioning,
+ Oversight identity-file format for hardware-backed recipients, curve
+ choice rationale (P-256 for PIV compat vs X25519 file-backed), revocation
+ procedure, threat model, deployment checklist.
+
+### Fixed
+
+- **`oversight_core/tlog.py`** now RFC 6962 compliant. Replaced the
+ promote-odd-trailing shortcut with the canonical largest-power-of-2
+ left-heavy split. Added `_rfc6962_mth`, `_rfc6962_path`,
+ `verify_inclusion_proof` helpers. Tested with asymmetric sizes
+ (n ∈ {1,2,3,4,5,7,8,16,17,100}) - every leaf's proof verifies;
+ tampered proofs rejected. Old custom Merkle logic removed.
+- **Mutex self-deadlock** in `oversight-tlog::inclusion_proof` - was
+ holding the leaves lock while calling `root()` which also locks.
+ Fixed by dropping the lock before invoking `root()`.
+- **`oversight-semantic` round-trip bug** - `embed_synonyms` could pick
+ hyphenated variants like `"write-up"` that `WORD_RE` tokenizes as two
+ separate words, desyncing the verify sequence. Both embed and verify
+ now explicitly skip non-round-trippable variants (whitespace or hyphen).
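Not part of the commit: a small Python sketch of the round-trip check described in the `oversight-semantic` fix above. The tokenizer pattern below is an assumption (the real `WORD_RE` is not shown in this diff), and `is_round_trippable` / `usable_variants` are hypothetical names.

```python
import re

# Assumed tokenizer: runs of ASCII letters and apostrophes. The actual
# WORD_RE in the codebase may differ.
WORD_RE = re.compile(r"[A-Za-z']+")


def is_round_trippable(variant: str) -> bool:
    """A variant is safe only if the tokenizer reads it back as one word.

    "write-up" and "case study" both tokenize as two words, so choosing
    them at embed time would shift every later position at verify time.
    """
    return WORD_RE.fullmatch(variant) is not None


def usable_variants(synonym_class: list[str]) -> list[str]:
    """Filter applied identically on the embed and verify paths."""
    return [v for v in synonym_class if is_round_trippable(v)]


if __name__ == "__main__":
    print(WORD_RE.findall("write-up"))
    print(usable_variants(["report", "write-up", "summary", "case study"]))
```

Applying the same filter on both sides is the point: if only embed skipped these variants, verify would still index into the unfiltered class and decode the wrong bits.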
+
+### Changed
+
+- **Workspace version** bumped to `0.4.0`. Python reference remains `v0.3`
+ (unchanged feature set, one correctness fix).
+- **SealedFile** gained `#[derive(Debug)]` to support test assertions with
+ `{:?}` formatting.
+
+### Known limitations (unchanged from v0.3)
+
+- Paraphrasing attack defeats all three watermark levels.
+- Airgapped readers leave no network beacon.
+- Hardware-backed recipients require v0.5+ `KeyProvider` abstraction (not
+ yet implemented - currently file-backed only).
+- Format adapters (image DCT, PDF, DOCX) remain Python-only until v0.6.
+- Registry server (FastAPI) remains Python-only until v1.0.
+
+## v0.3.0 - 2026-04-17
+
+See earlier commits. Initial Rust core + FreeTSA RFC 3161 + cross-language
+conformance + SENTINEL→Oversight rename + Nitro→YubiKey pivot.
+
+## v0.2.1 and earlier
+
+Python-only; see git history.
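Not part of the commit: a minimal sketch of the RFC 6962 construction that the v0.4.0 entry above describes (largest-power-of-2 left-heavy split, inclusion path, proof verification). The function names here are illustrative and are not the `_rfc6962_mth` / `_rfc6962_path` / `verify_inclusion_proof` helpers from `oversight_core/tlog.py`, whose signatures are not shown in this diff.

```python
import hashlib


def _leaf(data: bytes) -> bytes:
    # RFC 6962 leaf hash: SHA-256(0x00 || data)
    return hashlib.sha256(b"\x00" + data).digest()


def _node(left: bytes, right: bytes) -> bytes:
    # RFC 6962 interior hash: SHA-256(0x01 || left || right)
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n: int) -> int:
    # Largest power of two strictly less than n (requires n >= 2).
    return 1 << ((n - 1).bit_length() - 1)


def merkle_root(leaves: list[bytes]) -> bytes:
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return _leaf(leaves[0])
    k = _split(n)
    return _node(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def inclusion_path(m: int, leaves: list[bytes]) -> list[bytes]:
    """Audit path for leaf m, ordered from the leaf's sibling up to the root."""
    n = len(leaves)
    if n == 1:
        return []
    k = _split(n)
    if m < k:
        return inclusion_path(m, leaves[:k]) + [merkle_root(leaves[k:])]
    return inclusion_path(m - k, leaves[k:]) + [merkle_root(leaves[:k])]


def verify_inclusion(leaf_data: bytes, m: int, n: int,
                     path: list[bytes], root: bytes) -> bool:
    """Recompute the root from the leaf and its audit path."""

    def rebuild(h: bytes, m: int, n: int, path: list[bytes]) -> bytes:
        if n == 1:
            if path:
                raise ValueError("proof too long")
            return h
        if not path:
            raise ValueError("proof too short")
        k = _split(n)
        if m < k:
            return _node(rebuild(h, m, k, path[:-1]), path[-1])
        return _node(path[-1], rebuild(h, m - k, n - k, path[:-1]))

    if not 0 <= m < n:
        return False
    try:
        return rebuild(_leaf(leaf_data), m, n, list(path)) == root
    except ValueError:
        return False


if __name__ == "__main__":
    leaves = [f"leaf-{i}".encode() for i in range(5)]
    root = merkle_root(leaves)
    proof = inclusion_path(4, leaves)
    print(len(proof), verify_inclusion(leaves[4], 4, 5, proof, root))
```

With five leaves the split is 4 + 1, so leaf 4 sits alone on the right and its proof has a single element. The promote-odd shortcut produces a different tree shape for such sizes, which is why asymmetric sizes are the useful test cases.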
Caddyfile +53 -0
@@ -0,0 +1,53 @@
+# OVERSIGHT registry - production Caddy config.
+#
+# Replace `oversight.example.com` with the beacon domain you actually own
+# (the one whose hostname is baked into the beacon URLs you mint).
+#
+# Beacons look like:
+# https://b.oversight.example.com/p/{token}.png
+# https://ocsp.oversight.example.com/r/{token}
+# https://lic.oversight.example.com/v/{token}
+#
+# The simplest setup uses a single apex + path-based routing. Caddy auto-provisions
+# TLS via Let's Encrypt.
+
+oversight.example.com {
+    encode gzip
+
+    # Attribution / evidence API - lock down with auth in production.
+    handle /register {
+        reverse_proxy oversight-registry:8765
+    }
+    handle /attribute {
+        reverse_proxy oversight-registry:8765
+    }
+    handle /evidence/* {
+        reverse_proxy oversight-registry:8765
+    }
+    handle /health {
+        reverse_proxy oversight-registry:8765
+    }
+
+    # Public beacon endpoints - open to the internet by design.
+    handle /p/* {
+        reverse_proxy oversight-registry:8765
+    }
+    handle /r/* {
+        reverse_proxy oversight-registry:8765
+    }
+    handle /v/* {
+        reverse_proxy oversight-registry:8765
+    }
+
+    # Everything else -> 404
+    handle {
+        respond 404
+    }
+
+    log {
+        output file /data/access.log {
+            roll_size 100mb
+            roll_keep 10
+        }
+    }
+}
Dockerfile +26 -0
@@ -0,0 +1,26 @@
+FROM python:3.12-slim
+
+WORKDIR /app
+
+# System deps (minimal; libsodium is bundled with pynacl wheels)
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends ca-certificates \
+ && rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the library + registry
+COPY oversight_core/ ./oversight_core/
+COPY registry/ ./registry/
+
+# Persistent data volume
+VOLUME ["/data"]
+ENV OVERSIGHT_DB=/data/oversight-registry.sqlite
+
+EXPOSE 8765
+
+HEALTHCHECK --interval=30s --timeout=5s --start-period=5s --retries=3 \
+ CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8765/health').read()" || exit 1
+
+CMD ["uvicorn", "registry.server:app", "--host", "0.0.0.0", "--port", "8765"]
LICENSE +19 -0
@@ -0,0 +1,19 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ Copyright 2026 OVERSIGHT Protocol Contributors
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ Full license text: https://www.apache.org/licenses/LICENSE-2.0
README.md +126 -0
@@ -0,0 +1,126 @@
+# Oversight v0.4
+
+**Open protocol + reference implementation for data provenance, attribution, and leak detection.**
+
+Format-agnostic. Post-quantum-verified (ML-KEM-768 + ML-DSA-65 via liboqs). Jurisdiction-aware. Fully passive - no code execution on readers, no RATs, no defensive malware.
+
+**Truly open source.** No cloud vendor lock-in. No paid service required. No custom cryptography. Every primitive is NIST-standardized and publicly auditable.
+
+---
+
+## What's new in v0.4
+
+**Rust port expanded from core to core+enforcement+semantics.** Three new Rust crates on top of the v0.3 core:
+
+- `oversight-tlog` - RFC 6962-compliant Merkle transparency log with signed tree heads, inclusion proofs, durable append.
+- `oversight-policy` - TOCTOU-safe max_opens enforcement, jurisdiction / not_after / not_before checks, file-id sanitization.
+- `oversight-semantic` - L3 airgap-strip-survivor watermarking with the full 151-class synonym dictionary and URL/code/path/hex/base64 skip regions.
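Not part of the commit: a Python sketch of the pattern the `oversight-policy` bullet above names (exclusive file lock plus atomic temp-file rename, with file-id sanitization). It uses `fcntl.flock`, so it is Unix-only; the Rust crate uses `fs2` instead. The state layout, the allowed file-id character set, and the name `record_open` are all assumptions made for illustration.

```python
import fcntl
import json
import os
import re
import tempfile
from pathlib import Path

# Assumed allow-list; rejects path separators, dots, and empty IDs.
_FILE_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,128}")


def record_open(state_dir: str, file_id: str, max_opens: int) -> bool:
    """Count one open of file_id; return False once max_opens is reached.

    The check and the increment happen under one exclusive lock, so two
    concurrent readers cannot both observe the last remaining open.
    """
    if _FILE_ID_RE.fullmatch(file_id) is None:
        raise ValueError(f"unsafe file_id: {file_id!r}")

    base = Path(state_dir)
    base.mkdir(parents=True, exist_ok=True)
    counter = base / f"{file_id}.json"

    with open(base / f"{file_id}.lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the file closes
        opens = json.loads(counter.read_text())["opens"] if counter.exists() else 0
        if opens >= max_opens:
            return False

        # Write the new count to a temp file, fsync, then rename over the
        # old one so a crash never leaves a half-written counter.
        fd, tmp = tempfile.mkstemp(dir=base)
        with os.fdopen(fd, "w") as f:
            json.dump({"opens": opens + 1}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, counter)
        return True


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    print([record_open(d, "file-1", 2) for _ in range(3)])
```

A separate lock file is used because the counter file itself is replaced on every update, which would otherwise swap the inode out from under the lock.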
+
+**RFC 6962 fix in Python.** The v0.2 tlog used a promote-odd-trailing shortcut that was self-consistent but not RFC 6962 compliant - inclusion proofs wouldn't verify against Sigstore tooling. Now ported to the canonical largest-power-of-2 left-heavy split. Added `verify_inclusion_proof` helper. Tested across asymmetric tree sizes.
+
+**Fuzz harness.** `cargo-fuzz` targets for container_parser and manifest_parser. Ready to run 24+ hours before a paid audit engagement.
+
+**Hardware key setup guide.** `docs/HARDWARE_KEYS.md` covers YubiKey / Nitrokey / OnlyKey end-to-end - PIN/PUK setup, PIV slot provisioning, curve choice rationale, revocation, threat model, deployment checklist.
+
+**Everything from v0.3 is still here.** FreeTSA RFC 3161 timestamps, cross-language conformance, Python↔Rust bit-for-bit compatibility, PQ hybrid, multi-recipient sealing, registry with signed bundles.
+
+## Repository layout
+
+```
+oversight/ Python reference (6,800 LOC)
+├── oversight_core/
+│ ├── crypto.py X25519 + Ed25519 + XChaCha20 + HKDF + PQ hybrid
+│ ├── container.py .sealed binary format
+│ ├── manifest.py signed canonical-JSON manifest
+│ ├── watermark.py L1 zero-width, L2 whitespace
+│ ├── semantic.py L3 synonyms + punctuation
+│ ├── synonyms_v2.py 151-class expanded dictionary
+│ ├── policy.py not_after / max_opens / jurisdiction
+│ ├── beacon.py DNS / HTTP / OCSP / license beacons
+│ ├── tlog.py Merkle transparency log
+│ ├── timestamp.py RFC 3161 (FreeTSA + DigiCert)
+│ ├── decoy.py Ollama-powered decoy files
+│ └── formats/{text,image,pdf,docx}.py
+├── oversight_dns/server.py authoritative NS for beacon domain
+├── registry/server.py FastAPI - tlog, signed bundles, rate limit
+├── integrations/
+│ ├── flywheel_oversight_match.py Flywheel scraper hook
+│ └── perseus_canarykeeper.py Perseus Discord alert agent
+├── cli/oversight.py
+├── tests/{test_e2e.py,test_e2e_v2.py,test_pq.py}
+└── docs/{SPEC.md,ROADMAP.md,RUNBOOK.md}
+
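+oversight-rust/ Rust port (core + tlog, policy, semantic crates)
+├── Cargo.toml workspace
+├── oversight-crypto/ X25519, Ed25519, XChaCha20, HKDF, zeroize
+├── oversight-manifest/ JCS canonical JSON, Ed25519 sign/verify
+├── oversight-container/ .sealed format parser, hard caps
+├── oversight-watermark/ L1 + L2
+├── oversight-tlog/ RFC 6962 Merkle log, signed tree heads, inclusion proofs
+├── oversight-policy/ max_opens / jurisdiction / not_after / not_before
+├── oversight-semantic/ L3 synonym watermarking
+├── oversight-cli/ keygen / seal / open / inspect
+├── fuzz/ cargo-fuzz targets (excluded from the main workspace)
+└── tests/conformance_cross_lang.sh bit-for-bit Python<->Rust conformance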
+```
+
+## Quickstart
+
+### Python reference (all features)
+
+```bash
+pip install -r requirements.txt
+python tests/test_e2e.py # 11 checks
+python tests/test_e2e_v2.py # 13 checks
+python tests/test_pq.py # 7 checks (needs liboqs)
+```
+
+### Rust core (crypto, container, manifest, watermark, CLI)
+
+```bash
+cd oversight-rust
+cargo test --workspace # 42 checks
+cargo run -- keygen --out alice.json
+cargo run -- seal --input doc.txt --output doc.sealed \
+ --issuer issuer.json --recipient-pub <hex> --recipient-id alice@test
+cargo run -- open --input doc.sealed --output - --recipient alice.json
+```
+
+### Cross-language conformance
+
+```bash
+bash oversight-rust/tests/conformance_cross_lang.sh
+```
+
+## Test coverage
+
+| Layer | Checks | Status |
+|---|---|---|
+| Python test_e2e | 11 | green |
+| Python test_e2e_v2 | 13 | green |
+| Python test_pq | 7 | green |
+| Rust oversight-crypto | 7 | green |
+| Rust oversight-manifest | 2 | green |
+| Rust oversight-container | 8 | green |
+| Rust oversight-watermark | 4 | green |
+| Rust oversight-tlog | 7 | green |
+| Rust oversight-policy | 6 | green |
+| Rust oversight-semantic | 8 | green |
+| Cross-language conformance | 3 | green |
+| Total | 76 | all green |
+
+## Design principles (what Oversight never does)
+
+- **No custom cryptography.** Every primitive is NIST-standardized or equivalent. `x25519-dalek`, `ed25519-dalek`, `chacha20poly1305`, `hkdf`, `sha2`, ML-KEM-768, ML-DSA-65 via liboqs. That's the whole list.
+- **No cloud vendor lock-in.** Dropped the original AWS Nitro Enclaves plan. Hardware-key protection uses any PIV-capable device (YubiKey, OnlyKey, Nitrokey). Transparency log can run on public Sigstore Rekor or self-hosted; your choice.
+- **No RATs, no defensive malware.** Every "phone home" mechanism is a passive beacon - the kind of network call a normal document reader makes during rendering (image fetch, OCSP lookup, DNS resolution). We never execute code on a reader's machine.
+- **No tracking of personal identifiers.** Mark IDs are random 128-bit tokens. The registry maps them to recipient IDs that the issuer chose - the issuer decides how much identity binding to apply.
+- **No paid service required.** Year-1 all-in cost estimate: ~$6,200 (YubiKeys + domain + one conference). See `docs/ROADMAP.md`.
+
+## Honest limitations
+
+- **Human paraphrasing defeats watermarks.** Someone who reads the document and rewrites it in their own words leaves no trace. Fundamental, not an engineering gap.
+- **Beacons fire only when the reader has network access.** Airgapped readers leave no callback. L3 semantic watermarking is the attribution path for that case.
+- **The Python tlog was not RFC 6962 compliant before v0.4** (it used promote-odd-trailing, not the left-heavy split). It now uses the canonical split, but inclusion proofs from the older log format won't verify against Sigstore tooling. Migration to Rekor v2 is planned for v0.5.
+- **No independent security audit yet.** Planned for 2027. Until then: user-beware, cryptographer-review welcome. Open an issue.
+- **Rust port is partial.** The core, tlog, policy, and semantic crates are ported. Format adapters, the registry server, and integrations remain Python-only and are multi-release scope. Python is still the canonical reference.
+
+## License
+
+Apache 2.0. See `LICENSE`.
cli/oversight.py +272 -0
@@ -0,0 +1,272 @@
+#!/usr/bin/env python3
+"""
+OVERSIGHT CLI.
+
+Usage:
+    oversight keygen --out identity.json
+        Generate a new classic identity (X25519 + Ed25519).
+
+    oversight seal INPUT --recipient-pub PUB.json --issuer-id ID \\
+        --issuer-key ISSUER.json --registry-url URL --out OUT.sealed [--watermark]
+        Produce a .sealed file for a recipient.
+
+    oversight open INPUT.sealed --identity IDENT.json --out PLAINTEXT
+        Decrypt a .sealed file.
+
+    oversight inspect INPUT.sealed
+        Dump the (signed) manifest without decrypting.
+
+    oversight attribute --leak LEAK.txt --registry URL
+        Read watermark marks out of leaked text and query registry
+        to identify the source recipient.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+import httpx
+
+# Make oversight_core importable when running from repo root
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+
+from oversight_core import (
+    ClassicIdentity,
+    Manifest,
+    Recipient,
+    WatermarkRef,
+    content_hash,
+    seal,
+    open_sealed,
+    beacon,
+    watermark,
+)
+from oversight_core.container import SealedFile
+
+
+# ---------------- keygen ----------------
+
+def cmd_keygen(args):
+    ident = ClassicIdentity.generate()
+    out = {
+        "id": args.id or "identity",
+        "x25519_priv": ident.x25519_priv.hex(),
+        "x25519_pub": ident.x25519_pub.hex(),
+        "ed25519_priv": ident.ed25519_priv.hex(),
+        "ed25519_pub": ident.ed25519_pub.hex(),
+    }
+    Path(args.out).write_text(json.dumps(out, indent=2))
+    # also write a public-only sibling
+    pub_path = Path(args.out).with_suffix(".pub.json")
+    pub_path.write_text(json.dumps({
+        "id": out["id"],
+        "x25519_pub": out["x25519_pub"],
+        "ed25519_pub": out["ed25519_pub"],
+    }, indent=2))
+    print(f"[+] wrote private identity to {args.out}")
+    print(f"[+] wrote public identity to {pub_path}")
+
+
+# ---------------- seal ----------------
+
+def cmd_seal(args):
+    plaintext = Path(args.input).read_bytes()
+    issuer = json.loads(Path(args.issuer_key).read_text())
+    rec_pub = json.loads(Path(args.recipient_pub).read_text())
+
+    # Optional watermarking (text files only, MVP)
+    watermarks_for_manifest: list[WatermarkRef] = []
+    if args.watermark:
+        try:
+            text = plaintext.decode("utf-8")
+        except UnicodeDecodeError:
+            print("[!] --watermark requires UTF-8 text input; skipping marks")
+            text = None
+
+        if text is not None:
+            mark_id_zw = watermark.new_mark_id()
+            mark_id_ws = watermark.new_mark_id()
+            text = watermark.embed_zw(text, mark_id_zw)
+            text = watermark.embed_ws(text, mark_id_ws)
+            plaintext = text.encode("utf-8")
+            watermarks_for_manifest.append(WatermarkRef(
+                layer="L1_zero_width", mark_id=mark_id_zw.hex()
+            ))
+            watermarks_for_manifest.append(WatermarkRef(
+                layer="L2_whitespace", mark_id=mark_id_ws.hex()
+            ))
+            print(f"[+] embedded L1 mark {mark_id_zw.hex()}")
+            print(f"[+] embedded L2 mark {mark_id_ws.hex()}")
+
+    # Recipient
+    recipient = Recipient(
+        recipient_id=rec_pub["id"],
+        x25519_pub=rec_pub["x25519_pub"],
+        ed25519_pub=rec_pub.get("ed25519_pub"),
+    )
+
+    manifest = Manifest.new(
+        original_filename=Path(args.input).name,
+        content_hash=content_hash(plaintext),
+        size_bytes=len(plaintext),
+        issuer_id=args.issuer_id,
+        issuer_ed25519_pub_hex=issuer["ed25519_pub"],
+        recipient=recipient,
+        registry_url=args.registry_url,
+        content_type=args.content_type,
+    )
+
+    # Beacons: generated after Manifest.new so they carry the real file_id
+    # instead of a "pending" placeholder that was never replaced.
+    beacons = beacon.gen_beacons(
+        registry_domain=args.registry_domain,
+        file_id=manifest.file_id,
+        recipient_id=rec_pub["id"],
+    )
+    manifest.watermarks = watermarks_for_manifest
+    manifest.beacons = [b.to_dict() for b in beacons]
+
+    blob = seal(
+        plaintext=plaintext,
+        manifest=manifest,
+        issuer_ed25519_priv=bytes.fromhex(issuer["ed25519_priv"]),
+        recipient_x25519_pub=bytes.fromhex(rec_pub["x25519_pub"]),
+    )
+
+    Path(args.out).write_bytes(blob)
+    print(f"[+] wrote {args.out} ({len(blob)} bytes)")
+    print(f"[+] file_id={manifest.file_id}")
+    print(f"[+] recipient={recipient.recipient_id}")
+    print(f"[+] beacons={len(beacons)} watermarks={len(watermarks_for_manifest)}")
+
+    # Register with registry (optional)
+    if args.register:
+        reg_payload = {
+            "manifest": manifest.to_dict(),
+            "beacons": [b.to_dict() for b in beacons],
+            "watermarks": [w.__dict__ for w in watermarks_for_manifest],
+        }
+        try:
+            resp = httpx.post(
+                f"{args.register.rstrip('/')}/register",
+                json=reg_payload,
+                timeout=10,
+            )
+            resp.raise_for_status()
+            print(f"[+] registered with {args.register}: {resp.json()}")
+        except Exception as e:
+            print(f"[!] registry registration failed: {e}")
+
+
+# ---------------- open ----------------
+
+def cmd_open(args):
+    blob = Path(args.input).read_bytes()
+    ident = json.loads(Path(args.identity).read_text())
+    plaintext, manifest = open_sealed(
+        blob,
+        recipient_x25519_priv=bytes.fromhex(ident["x25519_priv"]),
+    )
+    Path(args.out).write_bytes(plaintext)
+    print(f"[+] decrypted to {args.out}")
+    print(f"[+] file_id = {manifest.file_id}")
+    print(f"[+] issuer = {manifest.issuer_id}")
+    print(f"[+] recipient = {manifest.recipient.recipient_id if manifest.recipient else '?'}")
+    print(f"[+] marks = {len(manifest.watermarks)}")
+    print(f"[+] beacons = {len(manifest.beacons)}")
+
+
+# ---------------- inspect ----------------
+
+def cmd_inspect(args):
+    blob = Path(args.input).read_bytes()
+    sf = SealedFile.from_bytes(blob)
+    print(json.dumps(sf.manifest.to_dict(), indent=2, default=str))
+    print()
+    print(f"[valid manifest signature] {sf.manifest.verify()}")
+
+
+# ---------------- attribute ----------------
+
+def cmd_attribute(args):
+    text = Path(args.leak).read_text(encoding="utf-8", errors="replace")
+    marks = watermark.recover_marks(text)
+    print("[*] recovered marks:")
+    any_found = False
+    for layer, mlist in marks.items():
+        for m in mlist:
+            print(f" {layer}: {m.hex()}")
+            any_found = True
+    if not any_found:
+        print(" (none)")
+        return
+
+    print(f"[*] querying registry {args.registry} ...")
+    for layer, mlist in marks.items():
+        for m in mlist:
+            try:
+                resp = httpx.post(
+                    f"{args.registry.rstrip('/')}/attribute",
+                    json={"mark_id": m.hex(), "layer": layer},
+                    timeout=10,
+                )
+                # Surface HTTP errors instead of parsing an error body as a result
+                resp.raise_for_status()
+                data = resp.json()
+                if data.get("found"):
+                    print(f"\n[!!] ATTRIBUTION: mark {m.hex()} ({layer})")
+                    print(f" file_id = {data['file_id']}")
+                    print(f" recipient = {data['recipient_id']}")
+                    print(f" issuer = {data['issuer_id']}")
+            except Exception as e:
+                print(f"[!] registry query failed: {e}")
+
+
+# ---------------- main ----------------
+
+def main():
+    p = argparse.ArgumentParser(prog="oversight")
+    sub = p.add_subparsers(dest="cmd", required=True)
+
+    k = sub.add_parser("keygen")
+    k.add_argument("--out", required=True)
+    k.add_argument("--id", default=None)
+
+    s = sub.add_parser("seal")
+    s.add_argument("input")
+    s.add_argument("--recipient-pub", required=True)
+    s.add_argument("--issuer-id", required=True)
+    s.add_argument("--issuer-key", required=True)
+    s.add_argument("--registry-url", required=True)
+    s.add_argument("--registry-domain", default="oversight.example")
+    s.add_argument("--out", required=True)
+    s.add_argument("--content-type", default="application/octet-stream")
+    s.add_argument("--watermark", action="store_true", help="embed text watermarks")
+    s.add_argument("--register", default=None, help="POST manifest to this registry URL")
+
+    o = sub.add_parser("open")
+    o.add_argument("input")
+    o.add_argument("--identity", required=True)
+    o.add_argument("--out", required=True)
+
+    i = sub.add_parser("inspect")
+    i.add_argument("input")
+
+    a = sub.add_parser("attribute")
+    a.add_argument("--leak", required=True)
+    a.add_argument("--registry", required=True)
+
+    args = p.parse_args()
+
+    {
+        "keygen": cmd_keygen,
+        "seal": cmd_seal,
+        "open": cmd_open,
+        "inspect": cmd_inspect,
+        "attribute": cmd_attribute,
+    }[args.cmd](args)
+
+
+if __name__ == "__main__":
+    main()
docker-compose.yml +41 -0
@@ -0,0 +1,41 @@
+services:
+  oversight-registry:
+    build: .
+    image: oversight-registry:0.1.0
+    container_name: oversight-registry
+    restart: unless-stopped
+    ports:
+      # bind to loopback only by default - put a reverse proxy (Caddy/nginx/Traefik)
+      # in front to terminate TLS and reach the public beacon domain
+      - "127.0.0.1:8765:8765"
+    volumes:
+      - oversight_data:/data
+    environment:
+      OVERSIGHT_DB: /data/oversight-registry.sqlite
+    healthcheck:
+      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8765/health').read()"]
+      interval: 30s
+      timeout: 5s
+      retries: 3
+
+  # Optional: Caddy TLS terminator + beacon domain fronting.
+  # Uncomment once you have a real domain + DNS pointing at this host.
+  #
+  # caddy:
+  #   image: caddy:2-alpine
+  #   container_name: oversight-caddy
+  #   restart: unless-stopped
+  #   ports:
+  #     - "80:80"
+  #     - "443:443"
+  #   volumes:
+  #     - ./Caddyfile:/etc/caddy/Caddyfile:ro
+  #     - caddy_data:/data
+  #     - caddy_config:/config
+  #   depends_on:
+  #     - oversight-registry
+
+volumes:
+  oversight_data:
+  # caddy_data:
+  # caddy_config:
docs/HARDWARE_KEYS.md +236 -0
@@ -0,0 +1,236 @@
+# Hardware Security Keys for Oversight
+
+Vendor-neutral guide for storing Oversight recipient private keys on a hardware
+device (YubiKey, OnlyKey, Nitrokey) rather than a disk file.
+
+## Why
+
+When a recipient's `.key` file lives on disk, full compromise of that
+recipient's laptop gives an attacker the private key forever. That attacker
+can decrypt every sealed file addressed to that recipient, past and future,
+with no way to tell the issuer it happened.
+
+A hardware-backed key eliminates this. The private key is generated inside
+the device's secure element and never leaves it. All ECDH and signing
+operations happen on-device (P-256 in PIV mode; see "Curve choice").
+The host OS gets ECDH outputs, never the raw key. To decrypt, an adversary
+needs physical possession of the device - and typically a touch, PIN, or
+biometric.
+
+This doesn't give you enclave-grade guarantees (a compromised client
+running while the YubiKey is plugged in can still open files via the device).
+What it does give you:
+
+- **Vendor-neutral** - any PIV-capable device works (FIDO2-only keys do not).
+- **Theft is discrete** - physical device loss is noticeable; disk theft may not be.
+- **Revocation is simple** - deauthorize the device's pubkey in the registry.
+- **Works offline** - no cloud service.
+- **No recurring cost** - $50-$80 once.
+
+## Supported devices
+
+Any device exposing **PIV** (Personal Identity Verification, PKCS#11-compatible)
+slots works. Tested:
+
+| Device | Cost (USD) | PIV slots | Notes |
+|---|---|---|---|
+| YubiKey 5C NFC | ~$75 | yes | Most tested; widely available |
+| YubiKey 5 NFC | ~$55 | yes | USB-A version |
+| YubiKey Security Key NFC | ~$29 | FIDO2 only | Cheapest but limited |
+| Nitrokey 3 NFC | ~$80 | yes | Fully open-source firmware |
+| OnlyKey | ~$50 | yes | Open hardware + firmware |
+
+Recommendation: **YubiKey 5C NFC** for most users (best tooling), **Nitrokey
+3** if firmware openness matters more than ecosystem support.
+
+## First-time setup
+
+### 1. Install the tooling
+
+```bash
+# Debian / Ubuntu
+sudo apt install yubikey-manager pcscd opensc
+sudo systemctl enable --now pcscd
+
+# macOS
+brew install yubikey-manager opensc
+
+# Arch
+sudo pacman -S yubikey-manager opensc ccid
+```
+
+### 2. Verify the device is seen
+
+```bash
+ykman info
+# Should print serial, firmware version, and enabled applications.
+
+pkcs11-tool --list-slots --module /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so
+# Should list the YubiKey as slot 0.
+```
+
+### 3. Set a PIN and management key
+
+**Do not skip this.** The factory defaults (PIN 123456, PUK 12345678) are
+publicly known. Change both now.
+
+```bash
+# PIV PIN (6-8 digits)
+ykman piv access change-pin
+
+# PIV PUK (used to unblock if you lock yourself out)
+ykman piv access change-puk
+
+# Management key (used for admin ops; 24-byte hex)
+ykman piv access change-management-key --generate --protect
+# --protect stores the new key on the device, protected by the PIN, so you don't need to manage it
+```
+
+### 4. Generate an Oversight recipient key on-device
+
+PIV has four main slots. Use **slot 9d (Key Management)** for Oversight - it's
+meant for decryption operations and doesn't require PIN on every use (only
+first use per session, via cached auth).
+
+```bash
+# Generate an ECC P-256 key in slot 9d and save its public key to a file
+ykman piv keys generate 9d --algorithm ECCP256 pubkey.pem
+# Note: P-256, not Curve25519. See "Curve choice" below.
+
+# Self-sign a cert (it needs the public key from the previous step)
+# so PIV treats the slot as initialized
+ykman piv certificates generate 9d \
+    --subject "CN=oversight-recipient" \
+    --valid-days 3650 pubkey.pem
+```
+
+### 5. Export the public key in Oversight format
+
+Oversight identities are JSON. We need to convert the PIV slot's public key
+to the format Oversight uses.
+
+```bash
+# Export the cert, extract the pubkey (openssl reads stdin when -in is omitted)
+ykman piv certificates export 9d - | \
+    openssl x509 -pubkey -noout | \
+    openssl ec -pubin -text -noout
+```
+
+Write the resulting pubkey hex into an Oversight identity file with a
+`hardware: true` marker:
+
+```json
+{
+  "hardware": true,
+  "provider": "piv",
+  "piv_slot": "9d",
+  "x25519_pub_equivalent": "<hex>",
+  "ed25519_pub": null,
+  "device_serial": "<yubikey-serial>"
+}
+```
+
+The `x25519_pub_equivalent` is the P-256 pubkey. Oversight's hardware mode
+uses **P-256 ECDH** instead of X25519 for this recipient, because P-256 is
+what PIV supports natively (Curve25519 PIV support exists but is limited -
+see below).
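Not part of the commit: a small sketch of turning the exported pubkey into the identity file shown above. It assumes the hex comes from the `openssl ec -text` output as an uncompressed point (`04:...`, 65 bytes) and only checks the encoding, not that the point lies on the curve. `hardware_identity` is a hypothetical helper, not an existing Oversight function.

```python
import json


def hardware_identity(p256_pub_hex: str, device_serial: str,
                      piv_slot: str = "9d") -> dict:
    """Build the hardware identity dict from an uncompressed P-256 point.

    Accepts openssl-style output with colons and whitespace, for example
    "04:ab:cd:...". Validation here is format-only (assumption: the
    caller trusts the device that produced the key).
    """
    cleaned = "".join(ch for ch in p256_pub_hex if ch not in ": \n\t")
    raw = bytes.fromhex(cleaned)
    if len(raw) != 65 or raw[0] != 0x04:
        raise ValueError("expected a 65-byte uncompressed point (0x04 || X || Y)")
    return {
        "hardware": True,
        "provider": "piv",
        "piv_slot": piv_slot,
        # Field name follows the document; the value is a P-256 point.
        "x25519_pub_equivalent": raw.hex(),
        "ed25519_pub": None,
        "device_serial": device_serial,
    }


if __name__ == "__main__":
    demo_point = "04:" + ":".join(["11"] * 64)  # placeholder bytes, not a real key
    print(json.dumps(hardware_identity(demo_point, "<yubikey-serial>"), indent=2))
```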
+
+## Curve choice: why P-256 for hardware-backed recipients
+
+The default Oversight suite uses X25519 for key agreement. PIV-compatible
+hardware devices historically only supported P-256 and P-384 for PIV slots.
+YubiKey firmware 5.7+ adds X25519 and Ed25519 to PIV, but older YubiKeys
+do not, and support on other vendors' devices varies.
+
+To stay compatible with the broadest set of devices (Nitrokey, OnlyKey,
+older YubiKeys), Oversight uses **P-256 ECDH** for hardware-backed
+recipients. The suite identifier in the manifest becomes `OSGT-HW-P256-v1`
+instead of `OSGT-CLASSIC-v1`. Security is comparable (both curves target
+roughly 128-bit strength), and P-256 ECDH is NIST-standardized and FIPS-approved.
+
+Open clients that want to decrypt for hardware-backed recipients must
+support both suites. The default file-backed provider stays on X25519.
+
+## Opening a sealed file with a hardware-backed key (CLI)
+
+```bash
+# Insert YubiKey. You may be prompted for PIN.
+oversight open --input secret.sealed --output secret.txt \
+ --recipient-hw piv:9d
+
+# First op prompts for PIN; subsequent ops within the session don't.
+```
+
+Under the hood, this calls PKCS#11 `C_DeriveKey` to run ECDH against the
+on-device private key, then runs the standard Oversight HKDF + AEAD decrypt
+on the host. The raw private key never leaves the device.
+
+## Revocation
+
+If a device is lost, stolen, or retired:
+
+1. POST to the registry:
+ ```
+ POST /recipients/{recipient_id}/revoke
+ Authorization: Bearer <issuer_token>
+ {"reason": "device_lost", "replaced_by": "<new_pubkey_hex>"}
+ ```
+2. The registry appends a revocation event to the tlog with a qualified
+ RFC 3161 timestamp. Anyone verifying future sealed files addressed to
+ the old pubkey will see the revocation in the event history and reject
+ the file.
+3. Issue new sealed files to the recipient's new pubkey.
+
+Note: the revocation does NOT un-seal already-delivered ciphertext. Any file
+the lost device opened before it was lost is already out. Revocation
+protects against *future* misuse of the device.
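Not part of the commit: a sketch of constructing the revocation request from step 1 above using only the standard library. The registry base URL and the helper name `build_revocation_request` are assumptions; the path, bearer header, and JSON fields are taken from the example in this section. The request is built but not sent.

```python
import json
import urllib.parse
import urllib.request


def build_revocation_request(registry_url: str, recipient_id: str,
                             issuer_token: str, reason: str,
                             replaced_by: str) -> urllib.request.Request:
    """Build POST /recipients/{recipient_id}/revoke without sending it."""
    # Percent-encode the ID so characters like "@" or "/" cannot alter the path.
    path = f"/recipients/{urllib.parse.quote(recipient_id, safe='')}/revoke"
    body = json.dumps({"reason": reason, "replaced_by": replaced_by}).encode("utf-8")
    return urllib.request.Request(
        registry_url.rstrip("/") + path,
        data=body,
        headers={
            "Authorization": f"Bearer {issuer_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_revocation_request(
        "https://registry.example",  # placeholder host
        "alice@test",
        "<issuer_token>",
        "device_lost",
        "<new_pubkey_hex>",
    )
    print(req.get_method(), req.full_url)
    # To send: urllib.request.urlopen(req, timeout=10)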
+
+## Threat model for hardware-backed keys
+
+**What hardware keys defend against:**
+- Recipient laptop fully compromised, attacker has root, keylogger running:
+ attacker cannot exfiltrate the private key. Can only ECDH while device is
+ plugged in. Discrete events.
+- Recipient's encrypted laptop is stolen while powered off. Attacker brute-
+ forces disk. Gets nothing useful because the PIV key is on the YubiKey.
+- Malware on recipient's machine installs a background decryption job.
+ Hardware-backed means each ECDH requires the device to be plugged in and
+ (optionally) a touch. Attacker can't do it passively.
+
+**What hardware keys do NOT defend against:**
+- Recipient's laptop compromised WHILE YubiKey is plugged in. Attacker can
+ call PKCS#11 to do ECDH against any file the legitimate client could.
+  Mitigation: require touch-to-decrypt (YubiKey PIV touch policy `always`).
+- Physical theft of both laptop + YubiKey. Attacker has everything needed.
+ Mitigation: strong PIN; device auto-locks after N wrong PINs.
+- A supply-chain-compromised YubiKey. Vendor-independence is the only
+ mitigation - and is why Oversight supports Nitrokey / OnlyKey alongside.
+
+## Known hardware caveats
+
+- **Wrong PIN entries count against the device's retry counter.** YubiKey
+  PIV defaults to 3 PIN attempts before the PIN is blocked. Set a reasonable
+  limit and keep the PUK so you can unblock it.
+- **Touch policy trade-off.** `always` is more secure but requires user
+  interaction on every open. `cached` (a touch is remembered for 15
+  seconds) is the usual compromise.
+- **No post-quantum yet.** Current hardware keys don't support ML-KEM /
+ ML-DSA. Hardware-backed recipients are CLASSIC-only for now. For PQ
+ protection, use a file-backed recipient with a PQ suite, or wait for
+ hardware-native ML-KEM support (YubiKey and Nitrokey have hinted at
+ late-2026 / 2027 firmware).
+
+## Checklist before deploying to real recipients
+
+- [ ] PIN and PUK changed from factory defaults.
+- [ ] Management key rotated.
+- [ ] Touch policy decided (always vs cached vs never).
+- [ ] Device serial recorded in a separate, encrypted inventory.
+- [ ] Recovery procedure documented (if device lost, who is notified and how).
+- [ ] Backup strategy: issue each recipient TWO devices (primary + backup),
+ register BOTH pubkeys, seal to both, store backup in a safe.
+- [ ] Revocation playbook tested end-to-end on a test recipient.
+
+## Further reading
+
+- YubiKey PIV documentation: https://developers.yubico.com/PIV/
+- NIST SP 800-73-4 (PIV Interfaces): https://csrc.nist.gov/pubs/sp/800/73/4
+- Nitrokey 3 PIV: https://docs.nitrokey.com/nitrokey3/
docs/ROADMAP.md +398 -0
@@ -0,0 +1,398 @@
+# Oversight - External Roadmap (research-backed, v0.3)
+
+This was the plan for the external items that couldn't be built inside a
+single code session. Revised in v0.3 to reflect: (a) Nitro Enclaves dropped
+in favor of open-source hardware keys, (b) audit budget deferred to 2027,
+(c) FreeTSA integration shipped, (d) Rust port core shipped.
+
+Every item has real vendor links, current pricing, and timelines so you can
+decide, not guess.
+
+Dates and prices as of April 17, 2026. Re-verify before committing money.
+
+---
+
+## 1. RFC 3161 qualified timestamps ✅ SHIPPED in v0.3
+
+**Status:** `oversight_core/timestamp.py` + `registry/server.py`
+`qualified_timestamp_or_stub()` now perform real RFC 3161 requests.
+Default TSA chain tries FreeTSA first, falls back to DigiCert, falls back
+to the registry's own clock if both are unreachable. Verified working with
+a real FreeTSA round-trip: 4667-byte signed TimeStampToken, valid P-384
+signature, correct gen_time, correct nonce.
+
+**Why this mattered:** for an evidence bundle to be admissible in court
+under EU eIDAS (qualified) or US Federal Rules of Evidence 901
+(authenticated), the timestamp must come from a Time Stamp Authority whose
+clock and signing key are independently auditable. RFC 3161 defines the
+request/response format; ETSI EN 319 421 defines the operational
+requirements for qualified status.
+
+### What's wired (all free, no account, no vendor lock-in)
+
+| Vendor | URL | Status | eIDAS-qualified? |
+|---|---|---|---|
+| **FreeTSA** | https://freetsa.org/tsr | Primary, tested working | No (research-grade) |
+| **DigiCert** | http://timestamp.digicert.com | Fallback, tested working | No (RFC 3161) |
+| Self-host sigstore/timestamp-authority | github.com/sigstore/timestamp-authority | Optional | No (your own CA) |
+
+### What's NOT wired (left as optional for users with eIDAS needs)
+
+| Vendor | Pricing | eIDAS-qualified? |
+|---|---|---|
+| GlobalSign Timestamping SaaS | ~$1K-5K/yr | Yes (AATL + eIDAS) |
+| GLOBALTRUST | Contact sales | Yes (eIDAS) |
+
+To add a paid qualified TSA, edit `DEFAULT_TSA_CHAIN` in
+`oversight_core/timestamp.py` and put the paid endpoint first. Only the
+URL changes; the same RFC 3161 client code works.
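
The fallback order described above (FreeTSA, then DigiCert, then the registry's own clock) can be sketched as follows. This is an illustration, not the code in `oversight_core/timestamp.py`: the list shape of `DEFAULT_TSA_CHAIN`, the function name `timestamp_with_fallback`, the injected `fetch` callable, and the stub format are all assumptions.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

# Assumed shape: RFC 3161 endpoints, most preferred first.
DEFAULT_TSA_CHAIN = [
    "https://freetsa.org/tsr",
    "http://timestamp.digicert.com",
]


def timestamp_with_fallback(
    digest: bytes,
    fetch: Callable[[str, bytes], bytes],
    chain: Optional[Sequence[str]] = None,
) -> Tuple[Optional[str], bytes]:
    """Try each TSA in order and return (tsa_url, token).

    `fetch(url, digest)` is supplied by the caller; it performs the
    RFC 3161 round-trip and raises on any failure. If every TSA fails,
    return (None, stub) where the stub is an unsigned local-clock reading.
    """
    for url in (DEFAULT_TSA_CHAIN if chain is None else chain):
        try:
            return url, fetch(url, digest)
        except Exception:
            continue  # unreachable or bad response: try the next TSA
    stub = f"local-clock:{int(time.time())}".encode()
    return None, stub
```

Under these assumptions, adding a paid qualified TSA is a one-line change: pass `["https://<paid-tsa-endpoint>"] + DEFAULT_TSA_CHAIN` as `chain`, or prepend it to the default list.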
+
+### Notes on what FreeTSA gives you
+
+- FreeTSA modernized to P-384 curve in March 2026, valid until 2040.
+- Timestamps are self-consistent and court-useful as long as the
+ examiner trusts FreeTSA's published root cert. Not eIDAS-qualified,
+ so EU litigation may reject it.
+- For US litigation under FRE 901, FreeTSA's timestamps satisfy
+ "evidence that the item is what the proponent claims it is" as long as
+ the chain of custody is documented. Sufficient for most legal purposes
+ short of financial-regulatory disputes.
+
+---
+
+## 2. Sigstore Rekor v2 transparency log
+
+**Status in the code:** `oversight_core/tlog.py` is a custom Merkle tree that's
+self-consistent but NOT RFC 6962 compliant (it uses "promote odd trailing"
+instead of left-heavy split). Inclusion proofs won't verify against a
+standard RFC 6962 verifier like Sigstore's.
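
For reference, the tree hash that an RFC 6962 verifier recomputes can be sketched in a few lines of standard-library Python. This follows the definition in RFC 6962 section 2.1 (domain-separated leaf and node hashes, split at the largest power of two below `n`); the function names are illustrative and it makes no claim about how `oversight_core/tlog.py` is written.

```python
import hashlib
from typing import Sequence


def leaf_hash(data: bytes) -> bytes:
    # Leaves are prefixed with 0x00 so they can never be confused with nodes.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Interior nodes are prefixed with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_tree_hash(entries: Sequence[bytes]) -> bytes:
    """RFC 6962 MTH over an ordered list of entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # k is the largest power of two strictly smaller than n (left-heavy split).
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_tree_hash(entries[:k]), merkle_tree_hash(entries[k:]))
```

Roots produced this way are what Sigstore-ecosystem tooling expects, so a log that wants third-party verification has to match it byte for byte, including the `0x00`/`0x01` prefixes.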
+
+**Why migrate:** Rekor v2 went GA October 2025. It's a tile-backed transparency
+log, cheaper to run, simpler to maintain, and its inclusion proofs are RFC 6962
+compliant so any Sigstore-ecosystem tool (rekor-cli, rekor-monitor,
+sigstore-python/go/java) can verify OVERSIGHT entries without custom code.
+
+### Two deployment options
+
+**Option A - Use the public Sigstore Rekor instance.**
+ - URL: `https://rekor.sigstore.dev` - 99.5% SLO, monitored oncall
+ - Free for open source / reasonable use
+ - Entry size limit: 100 KB (our manifests fit easily)
+ - Pros: zero ops burden, ecosystem tooling works out of the box
+ - Cons: public log - registry events are visible to anyone watching
+
+**Option B - Run our own Rekor v2 instance on [container].**
+ - Self-hosted, private to CumpsterMedia, full control over retention
+ - Built on Trillian-Tessera, the tile-backed successor to the Trillian backend that Rekor v1 used
+ - Pros: private events, can enforce jurisdictional retention rules
+ - Cons: you operate the signing key, downtime = blind spot, SLO is yours
+
+**Recommendation:** start with our own `oversight_core/tlog.py` for dev/test,
+migrate to **Option B (self-hosted Rekor v2)** for production. Rekor v2 is
+now simple enough to self-host that the operational cost is modest - small
+Postgres DB + the Rekor server binary.
+
+### Integration plan
+
+Replace `oversight_core/tlog.py` with a thin client wrapper:
+
+```python
+import httpx
+
+
+class RekorClient:
+    def __init__(self, rekor_url: str, signer_key):
+        self.url = rekor_url
+        self.signer = signer_key  # Ed25519 signing key for OVERSIGHT events
+
+    def append(self, event: dict) -> dict:
+        # Build a sigstore-bundle-formatted entry with the event as payload
+        # POST /api/v2/log/entries
+        ...
+
+    def get_inclusion_proof(self, uuid: str) -> dict: ...
+
+    def get_signed_tree_head(self) -> dict: ...
+```
+
+Keep the signed tree head verification in `integrations/perseus_canarykeeper.py`;
+the Rekor public key is distributed via TUF.
+
+**Effort:** ~1 week to swap in a full Rekor client + self-host Rekor v2 +
+wire rekor-monitor for alerting.
+
+---
+
+## 3. Rust port of the hot path
+
+**Status in the code:** all Python. Fine for a reference implementation;
+not what I'd run on 10K seal operations/second or put in a kernel-adjacent
+security product.
+
+**Why Rust:** 37% of cryptographic library vulnerabilities are memory safety
+issues per the Blessing et al. study on crypto library CVEs. Safe Rust rules
+out most of that class. For OVERSIGHT specifically, the seal/open hot path is
+small enough (~1K LOC) to port quickly.
+
+### Crate selection (verified current as of April 2026)
+
+| Function | Python (today) | Rust (port) | Status |
+|---|---|---|---|
+| X25519 + Ed25519 + AEAD | `cryptography` + `pynacl` | `aws-lc-rs` | Production, audited |
+| ML-KEM-768 | `liboqs-python` | `ml-kem` (RustCrypto, pure Rust) | FIPS 203, NIST vectors, **not independently audited** |
+| ML-DSA-65 | `liboqs-python` | `aws-lc-rs` (unstable flag) or `ml-dsa` (RustCrypto) | FIPS 204, gated behind flag |
+| HKDF / SHA-2 | `cryptography` | `hkdf` + `sha2` (RustCrypto) | Audited |
+| JSON canonicalization | `json` | `serde_json` + `canonical_json` | Fine |
+| Merkle log | custom | `rs-merkle` or defer to Rekor | Fine |
+
+**Recommendation:** `aws-lc-rs` for classical + `liboqs` bindings
+(`oqs-sys` crate) for PQ in v0.3. Switch PQ to pure-Rust `ml-kem`/`ml-dsa`
+in v0.4 once those crates receive independent audits. Dual-stack build option
+lets us diff outputs between AWS-LC and liboqs and catch bugs.
+
+### Port plan
+
+Scope: `oversight_core/crypto.py` + `oversight_core/container.py` +
+`oversight_core/manifest.py` + `oversight_core/policy.py`. About 1K LOC total,
+so a realistic timeline is:
+
+- Week 1: `oversight-crypto` crate - X25519/Ed25519/AEAD/HKDF + unit tests against
+ Python-generated vectors
+- Week 2: `oversight-container` crate - binary format parser + seal/open +
+ fuzz with `cargo-fuzz`
+- Week 3: PQ crypto + hybrid wrap
+- Week 4: Cross-language test: seal in Python, open in Rust, and vice versa
+ (conformance test vectors).
+
+Don't port the registry server - Python + FastAPI is fine there, and the
+perf-critical path is client-side anyway.
+
+**Decision needed from you:** now, or after v0.2 is spec-frozen?
+
+**My recommendation:** freeze v0.2 spec first. Don't have two moving targets.
+Rust port targets v0.3 or v1.0.
+
+---
+
+## 4. Open-source strong-key protection (YubiKey / hardware security keys)
+
+**Why this replaces the original AWS Nitro Enclaves plan:**
+
+An earlier draft proposed AWS Nitro Enclaves for confidentiality - a TEE
+would hold recipient private keys, release them only when a KMS policy matched
+the enclave's measured boot hash. That design works, but it tied Oversight to
+a single cloud vendor. Antithetical to "truly open source." Dropped.
+
+The threat we're still defending against: adversary steals BOTH a ciphertext
+AND a recipient's private key. With plain X25519, they win - the key
+decrypts. We need a story where key theft alone isn't enough.
+
+The open-source answer is a **hardware security key** - YubiKey 5 series,
+OnlyKey, Nitrokey, or any FIDO2/PIV device. All are vendor-independent,
+all are ~$50-$80 one-time, no cloud account needed.
+
+The recipient's X25519 private key is generated on the device and never
+leaves it. All ECDH operations happen inside the device's secure element.
+The host OS has access only to ECDH outputs, never the raw private key. Even
+with root on the recipient's laptop, the adversary can only do ECDH while the
+device is physically plugged in (often plus a touch-to-confirm or PIN).
+
+### What this does and doesn't buy us
+
+Compared to the Nitro plan:
+- **Weaker:** we don't get "specific code running is proven, plaintext never
+ touches the host." An attacker with the plugged-in device can still open
+ Oversight files via a compromised client.
+- **Equal:** an attacker who only stole a `~/.oversight/alice.key` file
+ off a dead hard drive gets nothing.
+- **Better for open source:** no cloud vendor, no recurring bill, no
+ attestation-based key revocation puzzle (just deauthorize the device pub
+ key in the registry).
+
+### Integration plan
+
+1. Define a `KeyProvider` trait in the Rust `oversight-crypto` crate:
+   ```rust
+   pub trait KeyProvider {
+       fn x25519_public(&self) -> [u8; 32];
+       fn ecdh(&self, peer_pub: &[u8; 32]) -> Result<[u8; 32], KeyError>;
+       fn ed25519_sign(&self, msg: &[u8]) -> Result<[u8; 64], KeyError>;
+   }
+   ```
+2. Ship two providers out of the box:
+ - `FileKeyProvider` - current behavior, keys in a 0600 JSON file.
+ - `PivKeyProvider` - PKCS#11 to a YubiKey / Nitrokey slot. Uses
+ `yubikey` crate or `pcsc` crate in Rust.
+3. The registry records whether a recipient pubkey is `file_backed` or
+ `hardware_backed` in the `recipients` table, so issuers can require
+ hardware-backed recipients for sensitive material.
+4. Document a vendor-neutral setup guide in `docs/HARDWARE_KEYS.md`:
+ same instructions work for YubiKey 5C, OnlyKey, Nitrokey 3.
+
+### Costs
+- $50-$80 per recipient, one-time.
+- Zero recurring, zero vendor account.
+
+### When the Nitro path still makes sense
+Only when you need the "a specific signed binary is what decrypts, not a
+specific person who has the key" guarantee - e.g., a confidential-computing
+service offered to third parties where YOU operate the open client and want
+to prove it to auditors. That's out of scope for an open protocol. Users who
+want that can layer Nitro, Azure Confidential VMs, or Google Confidential
+Computing on top of Oversight themselves. We won't bake AWS in.
+
+---
+
+## 5. Spec publication (GitHub + arXiv + IETF)
+
+### Timeline
+
+**Month 0 (now): GitHub.**
+- Public repo: `github.com/<you>/oversight` OR new org `oversight-protocol`.
+- Apache 2.0 license (already in the code).
+- Tag `v0.2.1`, write a first release with test vectors.
+- Create a GitHub Discussions or Matrix channel for questions.
+
+**Month 1: arXiv.**
+- Write a ~15-page paper. Target: `cs.CR` category.
+- Structure: motivation → threat model → protocol → cryptographic
+ construction → security arguments → implementation → evaluation →
+ limitations → related work.
+- arXiv will publish within 1-2 days after endorsement. No peer review.
+ This establishes date-of-invention and gives something to cite.
+
+**Month 2-4: Internet-Draft.**
+- Format spec as an IETF I-D (`draft-<lastname>-oversight-00`) using
+ xml2rfc or mmark.
+- Submit to datatracker.ietf.org. Present at an informal BoF of a
+ security working group (SUIT? OHAI? LAKE? CFRG? - pick based on the
+ angle you lead with).
+- Iterate for 6-12 months before pushing for RFC publication. Expect
+  reviewers to ask for multiple independent implementations before then.
+
+**Month 6+: conference paper** (see section 7).
+
+### Decision needed from you
+
+- Personal GitHub or new `oversight-protocol` org?
+- Any conflict between publishing as "OVERSIGHT" and your existing
+ HackerOne handle `artemispwns1`? (Answer probably no, but worth stating.)
+- Do you want your real name or a pseudonym on the arXiv submission?
+
+---
+
+## 6. Independent security review
+
+**Research:** I looked at who would be the right fit. Trail of Bits has the
+best track record on Sigstore ecosystem work - they built rekor-monitor and
+have done OpenSSF-funded Sigstore tooling work. They also have dedicated
+cryptography engineers with post-quantum experience. NCC Group and Cure53 are
+in a comparable tier.
+
+### Typical engagement shape
+
+- Scope: full code + spec review of `oversight_core/crypto.py`,
+ `oversight_core/container.py`, `oversight_core/manifest.py`,
+ `oversight_core/policy.py`, plus the SPEC.md document.
+- Duration: 4-8 engineer-weeks of review.
+- Cost: **$75K-$200K** depending on firm and depth. Trail of Bits' publicly
+ documented engagements have run in that band.
+- Deliverable: private report, then a 60-day-disclosure window, then a public
+ blog-post version with findings + fixes.
+
+### Prerequisites (do these BEFORE asking for a quote)
+
+1. Freeze the spec at v0.2.1 (no changes during review).
+2. Publish test vectors.
+3. Write a threat model document (STRIDE or similar). 5-10 pages.
+4. Fuzz the container parser for 24+ hours and fix anything that trips.
+5. Run your own internal review pass - catching your own bugs first makes
+ the paid review far more valuable.
+
+### Decision needed from you
+
+- **Deferred to 2027+** (per budget constraint - not this year).
+- When you're ready: Zellic, NCC Group, Cure53 also do comparable work;
+ do 2-3 quote calls before picking.
+
+---
+
+## 7. Conference talks (Black Hat, USENIX, WOOT)
+
+### What's already closed (re-verified April 2026)
+
+- **Black Hat USA 2026 Briefings**: CFP closed March 20, 2026. Missed.
+- **WOOT '26**: academic track and up-and-coming track both closed March 3.
+- **USENIX Security '26 Cycle 1**: closed February 5.
+
+### What's open or upcoming
+
+- **USENIX Security '26 Cycle 2**: full papers due ~early June 2026
+ (timeline: re-verify at usenix.org/sec26/cfp, but the cycle-2 window is
+ typically 3-4 months after cycle-1). **This is the realistic academic
+ target for Oversight v0.3.**
+- **Black Hat Europe 2026** (Dec 2026, London): CFP typically opens July
+ and closes August. Industry-track audience - perfect for a
+ "defensive watermarking + attribution" talk.
+- **Black Hat USA 2027 Briefings**: CFP opens ~January 2027, closes ~March.
+- **WOOT '27**: academic track closes ~December 2026.
+- **ACSAC 2026**: submissions typically open May-June.
+
+### Talk framing (so the CFP reviewer says yes)
+
+Frame as: "Open protocol for data provenance, attribution, and leak
+detection for the post-quantum era. Vendor-neutral alternative to
+proprietary DRM. Rust implementation, peer-reviewed crypto, no cloud
+lock-in, no custom cryptography."
+
+Concrete demo for the talk:
+- Live seal + open with DEK wrapping - shown in both Python and Rust
+ for cross-language compatibility.
+- Live leak simulation: paste watermarked text into a webform, scraper
+ picks it up, attribution fires in real time.
+- Hybrid PQ → show size overhead + future-proofing.
+- Airgap-strip demo: open in a VM, retype, paste to pastebin, attribution
+ still fires via L3 semantic.
+- YubiKey demo: pull the YubiKey out mid-open → open fails cleanly.
+
+### Decision needed from you
+
+- Which venue first? My recommendation:
+ 1. arXiv preprint now (month 1).
+ 2. USENIX Security '26 Cycle 2 submission (June 2026) - academic cred.
+ 3. Black Hat Europe 2026 (Dec 2026) - industry reach.
+ 4. Black Hat USA 2027 Briefings (Aug 2027) - flagship.
+
+---
+
+## Phased action plan (tldr)
+
+| Phase | Timeline | Items | Decision gates |
+|---|---|---|---|
+| 0 - now | week 1 | Freeze v0.3 spec; GitHub repo public; write SECURITY.md | GitHub org name |
+| 1 - soon | month 1 | arXiv preprint; conformance vectors; threat model | Real name or pseudonym |
+| 2 - near | month 2 | Wire FreeTSA (done) + DigiCert fallback (done); swap tlog → Rekor v2 self-hosted | - |
+| 3 - near | month 3 | Internet-Draft submission to datatracker | Which WG to target |
+| 4 - mid | month 4-6 | USENIX Security Cycle 2 paper submission | - |
+| 5 - mid | month 4-9 | Complete Rust port (watermark L3 + registry + formats) | - |
+| 6 - mid | month 6-9 | YubiKey / hardware KeyProvider in Rust crate | - |
+| 7 - late | month 9-12 | Black Hat Europe 2026 CFP | - |
+| 8 - 2027 | year 2 | Paid security audit (Trail of Bits tier) | Budget available |
+| 9 - 2027 | year 2 | v1.0 release; RFC shepherding; Black Hat USA 2027 | - |
+
+## Budget estimate (12-month horizon, year 1 only)
+
+| Item | Cost |
+|---|---|
+| FreeTSA (free, tested, working) | $0 |
+| DigiCert fallback TSA (free) | $0 |
+| Rekor v2 self-hosting ([container] on existing Proxmox) | $0 |
+| Rust toolchain + CI (GitHub Actions free tier) | $0 |
+| YubiKey 5C for development/testing (2 units) | $100 |
+| Domain + DNS + public beacon hosting (1 yr) | $60 |
+| Conference registration + travel (USENIX Sec + Black Hat EU) | $6K |
+| **Year-1 total** | **~$6K** |
+
+Year 2 (2027) adds:
+
+| Item | Cost |
+|---|---|
+| Trail of Bits / NCC / Cure53 audit | $75K-$200K |
+| Extended conference / travel | $5K-$10K |
+
+**The audit is deferred to 2027 per your constraint.** Year 1 ships for
+under $6,200, all-in, with no cloud-vendor dependencies and no custom
+cryptography.
docs/SPEC.md +297 -0
@@ -0,0 +1,297 @@
+# OVERSIGHT Protocol Specification
+
+**Sealed Entity, Notarized Trust, Integrity & Evidence Layer**
+
+Version 0.1 - Draft - April 2026
+
+---
+
+## 1. Status
+
+This document is a draft specification for an open protocol for data provenance, attribution, and leak detection. It is intended for eventual submission as a standards-track RFC following independent cryptographic review.
+
+## 2. Goals and non-goals
+
+### 2.1 Goals
+
+The protocol MUST:
+
+- Produce a file container format (`.sealed`) that wraps arbitrary payloads in an authenticated, recipient-bound cryptographic envelope.
+- Allow post-quantum cryptographic agility without breaking existing sealed files.
+- Bind every sealed file to a specific recipient identity via a signed manifest.
+- Carry per-recipient watermarking identifiers that survive plaintext escape.
+- Carry per-recipient passive beacon tokens that fire on open via standard rendering behaviors (DNS resolution, image fetch, certificate check) without executing code on the reader.
+- Support distributed, jurisdiction-aware attribution registries.
+- Produce evidence artifacts suitable as the foundation of a court-admissible chain-of-custody report.
+- Be format-agnostic: the payload is opaque bytes; the protocol does not care whether it wraps DOCX, PDF, MP4, JSON, or raw bytes.
+- Be open, reviewable, and free of proprietary dependencies.
+
+### 2.2 Non-goals
+
+The protocol does NOT:
+
+- Execute code of any kind on the reader's machine. No active payloads. No RATs.
+- Prevent all leaks. Plaintext, once decrypted, can be retyped, photographed, or OCR'd. The protocol's defense is attribution, not prevention.
+- Provide DRM in the film-industry sense (playback restrictions, output protection). It provides attribution and revocation.
+- Authenticate the truth of content. Like C2PA, OVERSIGHT proves who signed what for whom; it does not verify the claims in the content itself.
+
+## 3. Threat model
+
+### 3.1 Assumptions
+
+- The issuer controls its signing keys and operates a registry (or delegates to a federated operator).
+- The intended recipient controls its decryption keys.
+- The network between recipient and registry is untrusted but standard TLS is available.
+
+### 3.2 Adversaries
+
+The protocol defends against:
+
+| Adversary | Capability | Defense |
+|-----------|------------|---------|
+| Passive interceptor | Captures sealed file in transit | AEAD, recipient-bound DEK |
+| Curious insider | Receives file, shares with third party | Per-recipient watermarking → attribution |
+| Thief with wrong key | Steals sealed file, has no decryption key | ECDH/KEM unwrap fails |
+| Tamperer | Modifies ciphertext or manifest | AEAD tag + manifest signature + content-hash verify |
+| Format-conversion attacker | Decrypts, converts to PDF/screenshot, posts plaintext | Multi-layer watermarking; attribution via registry match |
+| Metadata-stripping attacker | Re-serializes file to remove marks | L2 whitespace marks are lost; L1 zero-width and L3 semantic marks survive |
+| Nation-state with quantum computer (future) | Decrypts classical ciphertexts | Hybrid mode: ML-KEM + X25519 |
+
+The protocol does NOT defend against:
+
+- The fully-airgapped attacker who also OCR/retypes the document and distributes only the retyped copy. (Semantic/synonym watermarks are the only defense; they are probabilistic.)
+- An attacker who compromises the issuer's signing key. (Key rotation and revocation logs are the mitigation.)
+- An attacker who owns the registry infrastructure. (Use a federated/transparency-log registry; mitigate with jurisdictional profiles.)
+
+## 4. Cryptographic primitives
+
+### 4.1 Algorithm suites
+
+Every sealed file declares a `suite` in its manifest. Implementations MUST reject unknown suites.
+
+#### 4.1.1 `OSGT-CLASSIC-v1` (suite_id = 1)
+
+- Key agreement: X25519 (RFC 7748)
+- KDF: HKDF-SHA256 (RFC 5869), info = `"oversight-v1-dek-wrap"`
+- AEAD: XChaCha20-Poly1305 (draft-irtf-cfrg-xchacha)
+- Signature: Ed25519 (RFC 8032)
+- Hash: SHA-256
+
+#### 4.1.2 `OSGT-HYBRID-v1` (suite_id = 2)
+
+All primitives of CLASSIC-v1, plus:
+
+- KEM: ML-KEM-768 (FIPS 203), combined with X25519 using hybrid KDF
+- Signature: ML-DSA-65 (FIPS 204), combined with Ed25519 (dual signatures)
+
+Hybrid key establishment combines the two shared secrets:
+
+```
+hybrid_ss = HKDF-SHA256(
+ salt = "oversight-hybrid-v1",
+ ikm = x25519_ss || mlkem_ss,
+ info = "oversight-hybrid-dek-wrap",
+ len = 32
+)
+```
+
+Hybrid signatures attach both signatures to the manifest. Verification requires BOTH to validate.
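
The combiner above can be sketched with the standard library alone. HKDF is written out per RFC 5869 rather than imported; the salt, info, and output length are the ones stated above, while the function names and the 32-byte length checks are assumptions for illustration. How the two shared secrets are produced (X25519 and ML-KEM-768) is out of scope here.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if length > 255 * 32:
        raise ValueError("HKDF-SHA256 output is limited to 8160 bytes")
    if not salt:
        salt = b"\x00" * 32  # RFC 5869: a missing salt means HashLen zero bytes
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_shared_secret(x25519_ss: bytes, mlkem_ss: bytes) -> bytes:
    """Combine the classical and post-quantum shared secrets into one key."""
    # Both inputs are fixed-length, so plain concatenation is unambiguous.
    if len(x25519_ss) != 32 or len(mlkem_ss) != 32:
        raise ValueError("both shared secrets must be 32 bytes")
    return hkdf_sha256(
        ikm=x25519_ss + mlkem_ss,
        salt=b"oversight-hybrid-v1",
        info=b"oversight-hybrid-dek-wrap",
        length=32,
    )
```

Because both secrets feed one extract step, an attacker has to recover both `x25519_ss` and `mlkem_ss` to learn `hybrid_ss`; breaking only the classical half is not enough.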
+
+### 4.2 Custom cryptography is PROHIBITED
+
+Implementations MUST NOT introduce new cryptographic primitives. The suite identifiers are reserved; new suites may only be added via specification update after independent review.
+
+## 5. Container format
+
+### 5.1 Wire layout
+
+All integers are unsigned big-endian.
+
+```
+offset  length  field            notes
+------  ------  ---------------  ----------------------------------------------
+0       6       magic            0x4F 0x53 0x47 0x54 0x01 0x00 ("OSGT\x01\x00")
+6       1       format_version   MUST be 0x01
+7       1       suite_id         1 = CLASSIC_v1, 2 = HYBRID_v1
+8       4       manifest_len     length of manifest JSON in bytes
+12      M       manifest         canonical JSON (signed)
+12+M    4       wrapped_dek_len  length of wrapped_dek in bytes
+...     W       wrapped_dek      JSON: {ephemeral_pub, nonce, wrapped_dek}
+...     24      aead_nonce       XChaCha20-Poly1305 nonce
+...     4       ciphertext_len   length of ciphertext in bytes
+...     C       ciphertext       AEAD output, includes 16-byte tag
+```
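
A minimal parser for this layout, as a sketch. It assumes the ASCII magic `OSGT\x01\x00` and treats trailing bytes after the ciphertext as an error, which the layout does not state explicitly; `SealedBlob` and `parse_sealed` are illustrative names, not the reference implementation's API.

```python
import struct
from typing import NamedTuple

MAGIC = b"OSGT\x01\x00"


class SealedBlob(NamedTuple):
    format_version: int
    suite_id: int
    manifest: bytes      # canonical JSON, still to be signature-checked
    wrapped_dek: bytes   # JSON, still to be parsed
    aead_nonce: bytes
    ciphertext: bytes


def parse_sealed(buf: bytes) -> SealedBlob:
    """Split a .sealed container into its fields; no crypto is performed."""
    pos = 0

    def take(n: int) -> bytes:
        nonlocal pos
        if pos + n > len(buf):
            raise ValueError("truncated container")
        chunk = buf[pos:pos + n]
        pos += n
        return chunk

    def take_u32() -> int:
        return struct.unpack(">I", take(4))[0]  # unsigned big-endian

    if take(6) != MAGIC:
        raise ValueError("bad magic")
    format_version, suite_id = struct.unpack(">BB", take(2))
    if format_version != 0x01:
        raise ValueError("unsupported format_version")
    if suite_id not in (1, 2):
        raise ValueError("unknown suite_id")  # unknown suites are rejected
    manifest = take(take_u32())
    wrapped_dek = take(take_u32())
    aead_nonce = take(24)
    ciphertext = take(take_u32())
    if pos != len(buf):
        raise ValueError("trailing bytes after ciphertext")  # assumption
    return SealedBlob(format_version, suite_id, manifest, wrapped_dek,
                      aead_nonce, ciphertext)
```

Every length is checked against the remaining buffer before slicing, so a hostile `manifest_len` or `ciphertext_len` fails cleanly instead of producing a short read.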
+
+### 5.2 Manifest
+
+The manifest is canonical JSON (sorted keys, no whitespace, UTF-8). Required fields:
+
+- `file_id` (UUID v4)
+- `issued_at` (unix seconds, UTC)
+- `version` (`"OVERSIGHT-v1"`)
+- `suite` (suite identifier string)
+- `content_hash` (hex SHA-256 of plaintext)
+- `size_bytes` (plaintext length)
+- `issuer_id` (string)
+- `issuer_ed25519_pub` (hex)
+- `recipient` (object: `recipient_id`, `x25519_pub`, optional `ed25519_pub`)
+- `signature_ed25519` (hex, Ed25519 over canonical bytes without signature fields)
+
+Optional fields:
+
+- `original_filename`, `content_type`
+- `watermarks` (array of `{layer, mark_id}`)
+- `beacons` (array of beacon descriptors)
+- `policy` (`not_after`, `max_opens`, `jurisdiction`, `registry_url`, `require_attestation`)
+- `signature_ml_dsa` (hex, for HYBRID suites)
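
One way to produce the canonical bytes that get signed, as a sketch. The rules above say only "sorted keys, no whitespace, UTF-8"; emitting non-ASCII characters unescaped (`ensure_ascii=False`) and the choice of which fields count as "signature fields" are assumptions here, and any two implementations have to agree on both to stay bit-identical.

```python
import json

# Assumed set of fields excluded from the signed bytes.
SIGNATURE_FIELDS = ("signature_ed25519", "signature_ml_dsa")


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def signing_input(manifest: dict) -> bytes:
    """Canonical bytes of the manifest with its signature fields removed."""
    unsigned = {k: v for k, v in manifest.items() if k not in SIGNATURE_FIELDS}
    return canonical_json(unsigned)
```

A verifier recomputes `signing_input(manifest)` from the parsed manifest and checks `signature_ed25519` against it, so key order and spacing in the stored file never affect the result.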
+
+### 5.3 DEK wrapping
+
+A fresh 32-byte DEK is generated per file. The wrapping procedure for CLASSIC-v1:
+
+1. Generate ephemeral X25519 keypair `(eph_sk, eph_pk)`.
+2. Compute `ss = X25519(eph_sk, recipient_x25519_pub)`.
+3. Derive `kek = HKDF-SHA256(ss, salt=nil, info="oversight-v1-dek-wrap", len=32)`.
+4. Encrypt DEK: `(nonce, ct) = XChaCha20-Poly1305(kek, DEK, aad="oversight-dek")`.
+5. Store `{eph_pk, nonce, ct}` as `wrapped_dek`.
+
+### 5.4 AEAD binding
+
+The ciphertext AEAD takes `AAD = content_hash` (the hex string from the manifest). This binds the ciphertext to the signed manifest; an attacker cannot swap ciphertexts between manifests without breaking the AEAD tag.
+
+### 5.5 Post-decrypt verification
+
+After decryption, the implementation MUST verify that `SHA-256(plaintext) == manifest.content_hash`. If not, discard the plaintext.
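
A sketch of that check. The function name `verify_plaintext` is illustrative, and it assumes `content_hash` is stored as lowercase hex, which is what `hexdigest()` produces.

```python
import hashlib
import hmac


def verify_plaintext(plaintext: bytes, manifest: dict) -> bytes:
    """Return the plaintext only if it matches manifest['content_hash']."""
    actual = hashlib.sha256(plaintext).hexdigest().encode("ascii")
    expected = str(manifest["content_hash"]).encode("utf-8")
    # compare_digest avoids leaking where the first mismatching byte is.
    if not hmac.compare_digest(actual, expected):
        raise ValueError("content_hash mismatch: plaintext discarded")
    return plaintext
```

Raising instead of returning a flag makes it harder for a caller to use the plaintext by accident after a failed check.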
+
+## 6. Watermarking
+
+Watermarking is optional but RECOMMENDED. Each applied layer registers a `mark_id` in the manifest.
+
+### 6.1 Layer identifiers
+
+- `L1_zero_width` - zero-width unicode characters scattered through text payloads
+- `L2_whitespace` - trailing space vs tab at line endings
+- `L3_synonyms` - synonym-class rotation (reserved; MVP stub)
+- `L4_dct_visual` - reserved; for image payloads
+- `L5_layout` - reserved; for PDF/document layout perturbation
+
+### 6.2 Mark IDs
+
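+Mark IDs are 64-bit random values. The probability that one new mark collides with any of 2^32 previously issued marks is ~2^-32. By the birthday bound, however, the probability that *some* pair among 2^32 issued marks collides is roughly 39%, so a registry should check for duplicate `mark_id`s at registration.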
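
"Collision probability" can be read two ways, and for 64-bit IDs the two readings differ by many orders of magnitude. The sketch below computes both using the standard birthday approximation; the function names are illustrative.

```python
import math


def p_hit_existing(issued: int, bits: int = 64) -> float:
    """Probability that ONE fresh random ID equals any of `issued` existing IDs."""
    return issued / 2 ** bits


def p_any_collision(issued: int, bits: int = 64) -> float:
    """Birthday approximation: probability that SOME pair among `issued` IDs collides."""
    # 1 - exp(-n(n-1) / 2N), computed with expm1 for accuracy at small values.
    return -math.expm1(-issued * (issued - 1) / (2 * 2 ** bits))


print(p_hit_existing(2 ** 32))   # about 2.3e-10, i.e. 2^-32
print(p_any_collision(2 ** 32))  # about 0.39
print(p_any_collision(2 ** 20))  # about 3e-8
```

At around a million issued marks the chance of any collision is negligible; it only becomes material as a registry approaches billions of marks, which is where a uniqueness check at registration pays off.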
+
+### 6.3 Recovery
+
+A leaked plaintext is scanned by all supported layer extractors. Each recovered `mark_id` is queried against the registry. A match returns `(file_id, recipient_id, issuer_id)`.
+
+Implementations SHOULD use multiple layers so that defeating one does not defeat attribution.
+
+## 7. Beacons
+
+### 7.1 Types
+
+| Kind | Channel | Triggered by |
+|------------|---------|-------------------------------------------------------|
+| `dns` | DNS | Document rendering, network-aware readers, preview pipelines |
+| `http_img` | HTTPS | `<img>` tags in HTML/Office/PDF/SVG |
+| `ocsp` | HTTPS | Certificate revocation checks |
+| `license` | HTTPS | Explicit license-server check (policy-enforced) |
+
+### 7.2 Token format
+
+Each beacon carries a 128-bit unguessable `token_id`. The registry maps `token_id → (file_id, recipient_id, issuer_id)`.
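
A sketch of token issuance and lookup using an in-memory map. The hex encoding of `token_id`, the `TokenIndex` class, and the `Attribution` tuple are assumptions for illustration; a real registry would persist the mapping and log each registration.

```python
import secrets
from typing import Dict, NamedTuple, Optional


class Attribution(NamedTuple):
    file_id: str
    recipient_id: str
    issuer_id: str


def new_token_id() -> str:
    # 16 bytes from the OS CSPRNG = 128 bits, hex-encoded to 32 characters.
    return secrets.token_hex(16)


class TokenIndex:
    """Maps token_id -> (file_id, recipient_id, issuer_id)."""

    def __init__(self) -> None:
        self._by_token: Dict[str, Attribution] = {}

    def register(self, attribution: Attribution) -> str:
        token_id = new_token_id()
        while token_id in self._by_token:  # vanishingly unlikely, but cheap to check
            token_id = new_token_id()
        self._by_token[token_id] = attribution
        return token_id

    def attribute(self, token_id: str) -> Optional[Attribution]:
        return self._by_token.get(token_id)
```

Using `secrets` rather than `random` matters here: a token that can be predicted can be fired by someone who never held the file.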
+
+### 7.3 Passive-only requirement
+
+Beacons MUST NOT cause code execution on the reader. A beacon is a network callback that a standard renderer makes naturally; it does not require a plugin, macro, or active payload.
+
+## 8. Registry
+
+### 8.1 Endpoints
+
+A compliant registry exposes:
+
+| Method | Path | Purpose |
+|--------|----------------------------|-----------------------------------------|
+| POST | `/register` | Issuer registers a file's beacons+marks |
+| GET | `/p/{token_id}.png` | HTTP image beacon receiver |
+| GET | `/r/{token_id}` | OCSP-style beacon receiver |
+| GET | `/v/{token_id}` | License-check beacon receiver |
+| POST | `/attribute` | Query by token_id or mark_id |
+| GET | `/evidence/{file_id}` | Assemble evidence bundle |
+
+### 8.2 Qualified timestamps
+
+Production registries MUST timestamp events via RFC 3161 against at least one qualified Time Stamping Authority (TSA). Evidence bundles MUST include the TimeStampToken(s).
+
+### 8.3 Transparency log
+
+Production registries SHOULD chain events into an append-only transparency log (Sigstore-style Merkle log) so that registry operators cannot fabricate or suppress events undetected.
+
+### 8.4 Jurisdictional profiles
+
+Registries MUST publish a jurisdictional profile declaring:
+
+- Data residency (where event logs are stored)
+- Permitted field collection per event (IP, UA, geolocation, etc.)
+- Retention period
+- Cross-border data-sharing policy
+
+The manifest `policy.jurisdiction` MUST match the registry's profile or the seal MUST be rejected.
+
+## 9. Evidence bundles
+
+An evidence bundle is a JSON artifact containing:
+
+1. The original signed manifest
+2. All registered beacons and watermarks
+3. Chronologically ordered event log
+4. Qualified timestamps for each event
+5. Registry's own signature over the bundle
+6. Transparency-log inclusion proofs
+
+The bundle is the foundation for a forensic report per ISO/IEC 27037. A court-admissible final report requires additional human-in-the-loop procedures: examiner qualifications, methodology documentation, and proper preservation of the original blob.
+
+## 10. Security considerations
+
+### 10.1 Key compromise
+
+- Issuer key compromise allows forged manifests for the compromise window. Mitigation: short-lived issuer keys, certificate transparency, a revocation list.
+- Recipient key compromise allows decryption of all files ever sealed for that recipient. Mitigation: per-purpose recipient keys, forward-secret variants (future work).
+
+### 10.2 Replay
+
+Ciphertext is bound to manifest via AEAD AAD. Manifest is signed and uniquely identified by `file_id`. Replay of a full sealed blob is equivalent to possession of the blob.
+
+### 10.3 Side channels
+
+Implementations MUST use constant-time implementations for all cryptographic primitives. Watermark-embedding timing may leak whether a recipient is being marked; embed times SHOULD be bounded.
+
+### 10.4 Metadata exposure
+
+The manifest is not encrypted. An attacker who captures a sealed blob learns the recipient, issuer, beacons, and watermark IDs. This is intentional: third parties (legal discovery, compliance auditors) must be able to inspect the metadata without holding the decryption key. Sensitive fields SHOULD be hashed or omitted from the manifest if their disclosure is unacceptable.
+
+### 10.5 Traffic analysis of beacons
+
+Beacon callbacks reveal that a sealed file was opened. In hostile environments an attacker who blocks outbound traffic will suppress beacon callbacks. The protocol does not claim to defeat such an attacker; watermarking provides the post-escape attribution path.
+
+## 11. IANA considerations
+
+- Reserved media type: `application/vnd.oversight.sealed`
+- Reserved file extension: `.sealed`
+- Reserved URN namespace: `urn:oversight:file:<file_id>`
+
+## 12. References
+
+- FIPS 203: Module-Lattice-Based Key-Encapsulation Mechanism
+- FIPS 204: Module-Lattice-Based Digital Signature Standard
+- RFC 7748: Elliptic Curves for Security (X25519)
+- RFC 8032: Edwards-Curve Digital Signature Algorithm (EdDSA)
+- RFC 5869: HKDF
+- RFC 3161: Time-Stamp Protocol (TSP)
+- ISO/IEC 27037: Guidelines for identification, collection, acquisition and preservation of digital evidence
+- C2PA 2.3: Content Credentials specification
+- draft-irtf-cfrg-xchacha: XChaCha20-Poly1305
+
+## 13. Appendix A - Test vectors (normative)
+
+To follow in v0.2. Implementations SHOULD include a conformance test suite producing and verifying known sealed blobs.
docs/V05_REKOR_PLAN.md +240 -0
@@ -0,0 +1,240 @@
+# v0.5 - Sigstore Rekor v2 Migration Plan
+
+Drafted 2026-04-19. Approved scope: public Rekor v2 only (no self-host).
+USENIX Cycle 2 strategy: v0.4.1 frozen as paper artifact safety net;
+v0.5 lands as a stretch goal if evaluation work comes together first.
+
+---
+
+## 0. Source-of-truth facts (verified 2026-04-19 via web)
+
+- **Rekor v2 GA: 2025-10-10.** Tile-backed log following C2SP `tlog-tiles`.
+- **Entry types:** ONLY `hashedrekord` (artifact) and `dsse` (attestation).
+ intoto, rekord, helm, tuf, rfc3161, jar, rpm, cose, alpine are removed.
+ Custom types are **not** accepted - "additional types may be added if there is
+ demand, but this requires updating the client specification."
+- **Write API:** single endpoint `POST /api/v2/log/entries` (HTTP + gRPC).
+ Returns `TransparencyLogEntry` (protobuf) which clients persist in bundles.
+ Minimum client write timeout: 20s.
+- **Reads:** no online proof API. Clients fetch tiles per the tlog-tiles spec
+ and compute inclusion proofs locally. Inclusion proofs are bundled into the
+ `TransparencyLogEntry` returned at write time.
+- **Signed timestamps removed from Rekor** - clients fetch from a separate TSA.
+ (Oversight already uses FreeTSA RFC 3161; no change needed.)
+- **Search indexing removed** - Rekor will not answer "what entries did issuer X
+ register?". A separate verifiable-index service is planned. Oversight registry
+ must keep its own local index (it does: `registry/server.py` SQLite).
+- **Public log URL pattern:** `https://logYEAR-N.rekor.sigstore.dev/api/v2/`,
+ rotated about every 6 months. Current: `log2025-1`. **Do NOT hardcode.**
+ Discover via Sigstore TUF trusted root.
+- **Client coverage:** Python, Go, Java GA. JS + Ruby pending.
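As a rough illustration of the shard-URL bullet above, a verifier can sanity-check the `log_url` it is handed rather than compare it against a hardcoded "current" log. The regular expression is transcribed from the pattern in this list; `parse_shard_url` is a hypothetical helper name, and real discovery still goes through the Sigstore TUF trusted root, which this sketch does not cover.

```python
import re

# Pattern transcribed from the plan: https://logYEAR-N.rekor.sigstore.dev/api/v2/
_SHARD_RE = re.compile(r"^https://log(\d{4})-(\d+)\.rekor\.sigstore\.dev/api/v2/$")


def parse_shard_url(log_url: str) -> tuple[int, int]:
    """Return (year, shard_number), or raise ValueError for an unexpected URL."""
    match = _SHARD_RE.match(log_url)
    if match is None:
        raise ValueError(f"not a public Rekor v2 shard URL: {log_url!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_shard_url("https://log2025-1.rekor.sigstore.dev/api/v2/"))  # (2025, 1)
```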
+
+## 1. Goals (in order)
+
+1. Replace `oversight_core/tlog.py` calls in the issuer's registration path with
+ a Rekor v2 DSSE upload, while keeping the local tlog as a verifier fallback
+ for v0.4-era `.sealed` files.
+2. Embed the returned `TransparencyLogEntry` in the Oversight evidence bundle.
+3. Add a `verify_rekor_inclusion()` helper auditors can run with no Oversight
+ code at all - only the standard `sigstore-python` library.
+4. Maintain bit-identical Python ↔ Rust output. New conformance test:
+ `seal-then-register` round trip across both languages must produce the same
+ DSSE envelope bytes (signatures aside, since they're nondeterministic).
+
+## 2. Non-goals for v0.5
+
+- No self-hosted Rekor on [container]. Recorded as out-of-scope (revisit point 3).
+- No removal of legacy `oversight_core/tlog.py`. It stays as fallback verifier.
+- No Hardware KeyProvider work - that's v0.6 alongside format adapters.
+- No new entry-type negotiation with Sigstore. We use vanilla DSSE.
+
+## 3. Entry-type design: DSSE, not hashedrekord
+
+`hashedrekord` proves "key K signed digest D." We need more: "issuer K asserts
+that mark_id M maps to file_id F with content_hash H, recipient R, suite S,
+registered at time T, with optional policy bounds." That's an attestation, not
+a signature primitive. Use **DSSE** with a custom predicate type.
+
+**Predicate type:** `https://oversight.dev/registration/v1`
+
+**Statement payload (canonical JSON, JCS):**
+
+```json
+{
+ "_type": "https://in-toto.io/Statement/v1",
+ "subject": [{
+ "name": "mark:<mark_id>",
+ "digest": {"sha256": "<content_hash_hex>"}
+ }],
+ "predicateType": "https://oversight.dev/registration/v1",
+ "predicate": {
+ "file_id": "<uuid>",
+ "issuer_pubkey_ed25519": "<base64>",
+ "recipient_id": "<string>",
+ "recipient_pubkey_x25519": "<base64>",
+ "suite": "OSGT-CLASSIC-v1 | OSGT-PQ-HYBRID-v1 | OSGT-HW-P256-v1",
+ "policy": { "not_after": "<iso>?", "max_opens": <int>?, "jurisdiction": [...]? },
+ "watermarks": { "L1": true, "L2": true, "L3": true },
+ "registered_at": "<iso>",
+ "rfc3161_tsa": "<TSA URL used>",
+ "rfc3161_token_b64": "<base64 of TimeStampToken>"
+ }
+}
+```
+
+DSSE envelope: signed by the issuer's Ed25519 key (the same key already in the
+manifest). Sigstore Fulcio/OIDC is **not** required for v0.5; we use
+"self-managed key" mode of the Rekor v2 write API.
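The signing input for a DSSE envelope is the Pre-Authentication Encoding (PAE) of the payload type and payload, not the raw JSON. The sketch below shows that encoding and the envelope shape using only the standard library. It makes several assumptions: `canonical_json` only approximates JCS (it is adequate for statements made of ASCII keys, strings, booleans, and small integers), the Ed25519 signature is produced elsewhere and passed in as bytes, and the helper names are illustrative rather than the ones in `oversight_core/rekor.py`.

```python
import base64
import json

PAYLOAD_TYPE = "application/vnd.in-toto+json"


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE Pre-Authentication Encoding: the exact bytes that get signed."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)


def canonical_json(obj) -> bytes:
    # Approximation of JCS: sorted keys, no insignificant whitespace, UTF-8.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def envelope(payload: bytes, signature: bytes, keyid: str = "") -> dict:
    """Wrap a payload and an externally produced signature in a DSSE envelope."""
    return {
        "payload": base64.b64encode(payload).decode("ascii"),
        "payloadType": PAYLOAD_TYPE,
        "signatures": [{
            "keyid": keyid,
            "sig": base64.b64encode(signature).decode("ascii"),
        }],
    }


statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "predicateType": "https://oversight.dev/registration/v1",
    "subject": [{"name": "mark:<mark_id>", "digest": {"sha256": "<content_hash_hex>"}}],
    "predicate": {"file_id": "<uuid>"},
}
payload = canonical_json(statement)
to_sign = pae(PAYLOAD_TYPE, payload)  # feed these bytes to the Ed25519 signer
```

Because signatures are computed over `to_sign`, the cross-language conformance goal reduces to both implementations producing identical `payload` bytes.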
+
+## 4. Bundle format change
+
+Today (`v0.4`):
+```json
+{ "manifest": {...}, "manifest_sig": "...", "tlog_proof": {...}, "rfc3161_token": "..." }
+```
+
+After v0.5:
+```json
+{
+ "manifest": {...},
+ "manifest_sig": "...",
+ "tlog_kind": "rekor-v2-dsse",
+ "rekor": {
+ "log_url": "https://log2025-1.rekor.sigstore.dev/api/v2/",
+ "log_entry_b64": "<protobuf TransparencyLogEntry>",
+ "dsse_envelope_b64": "<DSSE we uploaded>"
+ },
+ "rfc3161_token": "..."
+}
+```
+
+For v0.4 backward compat, the verifier reads `tlog_kind`. Default
+(omitted/`oversight-self-merkle-v1`) → use `oversight_core/tlog.py`.
+`rekor-v2-dsse` → use Rekor verifier.
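The routing rule above can be sketched as a small dispatch function. This is an illustration under stated assumptions: `select_verifier` and its return labels are hypothetical, and the real verifier would call into `oversight_core/tlog.py` or the Rekor path instead of returning a string.

```python
LEGACY_KIND = "oversight-self-merkle-v1"
REKOR_KIND = "rekor-v2-dsse"


def select_verifier(bundle: dict) -> str:
    """Pick the verification path from the bundle's tlog_kind field."""
    # A missing tlog_kind means a v0.4-era bundle, so fall back to the local tlog.
    kind = bundle.get("tlog_kind", LEGACY_KIND)
    if kind == LEGACY_KIND:
        return "local-tlog"
    if kind == REKOR_KIND:
        if "rekor" not in bundle:
            raise ValueError("tlog_kind is rekor-v2-dsse but the 'rekor' object is missing")
        return "rekor"
    # Unknown kinds are rejected rather than silently routed to a default.
    raise ValueError(f"unknown tlog_kind: {kind!r}")


print(select_verifier({"manifest": {}, "tlog_proof": {}}))               # local-tlog
print(select_verifier({"tlog_kind": "rekor-v2-dsse", "rekor": {}}))      # rekor
```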
+
+## 5. Code surface
+
+### New files
+- `oversight_core/rekor.py` (~250 LOC)
+  - `build_oversight_dsse(manifest, ed25519_priv) -> dsse_envelope_bytes`
+  - `upload_to_rekor(envelope, log_url) -> TransparencyLogEntry`
+  - `verify_rekor_inclusion(entry, dsse_envelope, issuer_pubkey) -> bool`
+  - Pure-stdlib HTTP client; no `sigstore-python` runtime dep (we use it only in
+    the auditor helper, which lives in a separate file).
+- `oversight_core/auditor_helper.py` (~80 LOC)
+  - Thin wrapper over `sigstore-python` so an external auditor can verify a
+    bundle with one import.
+- `oversight-rust/oversight-rekor/` (new crate, ~400 LOC)
+  - Mirrors Python rekor.py exactly; uses `sigstore` crate for verify only.
+  - Async (tokio) for upload; sync verify path for use from CLI.
+
+### Modified files
+- `oversight_core/manifest.py`: add optional `tlog_kind` field (default-omit
+ for back-compat).
+- `registry/server.py`: replace inline tlog append with `rekor.upload_to_rekor`.
+ Keep the SQLite event index - that is now the only way to answer "list marks
+ for issuer X" queries.
+- `oversight_core/tlog.py`: mark module-docstring as "fallback verifier for
+ pre-v0.5 bundles only." No new writes against it.
+- `oversight-rust/oversight-cli/`: `inspect` learns to print Rekor entry info.
+
+### New tests (must add at least 3 to keep "additions only" promise)
+- `tests/test_rekor_e2e.py` - register a mark, upload to Rekor, fetch back,
+ verify locally without Oversight code (uses `sigstore-python` only).
+- `tests/test_rekor_backcompat.py` - open a v0.4-era `.sealed` file and
+ confirm verifier falls back to local tlog.
+- `oversight-rust/tests/conformance_rekor.sh` - Python uploads, Rust
+ downloads-and-verifies. Skip when offline; mark as "online conformance."
+
+Target test count after v0.5: **79+** (76 existing + 3 new minimum).
+
+## 6. Backward compatibility rules (do not break)
+
+1. Every existing v0.4.1 `.sealed` file must still parse, open, and verify
+ exactly as it does today. The cross-language conformance script must keep
+ passing without modification on those files.
+2. Bundle format must accept missing `tlog_kind` and behave as
+ `oversight-self-merkle-v1` (the v0.4 path).
+3. Python and Rust must agree on every new field's canonical JSON ordering
+ (JCS already enforces this; just make sure the new fields are added to both
+ sides in the same commit).
+
+## 7. Risks / gotchas
+
+- **Log shard rotation.** `log2025-1` will freeze and `log2026-1` (or similar)
+ will replace it. Bundles registered against a frozen shard are still
+ verifiable - the shard URL stays read-only. We must record the URL we used
+ in the bundle and never assume "current" log.
+- **No online inclusion proof API.** Old habits die hard: there is no
+ `GET /api/v2/log/entries/{uuid}/proof`. The proof is bundled at write time.
+ If a verifier is missing one, it has to compute the proof from tiles.
+- **20s write timeout minimum.** Set the HTTP client timeouts (Python side and
+ reqwest) accordingly. Don't fail fast on registration.
+- **Rekor v2 won't accept custom predicate types via metadata** - the predicate
+ type lives inside the DSSE statement payload, which Rekor doesn't inspect.
+ This is fine; we just need to be unambiguous in our own predicate URI so
+ third parties don't collide.
+- **No Oversight code on the auditor's side.** This is a feature, not a risk.
+ The whole point of migrating is that any Sigstore-compatible client can
+ audit Oversight bundles. Don't compromise this by leaking proprietary
+ helpers into the verify path.
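Since there is no online proof endpoint, a verifier ends up checking the bundled inclusion proof locally. The sketch below follows the RFC 6962 / RFC 9162 verification procedure, on the assumption that the tile-backed log uses the same leaf and node hashing. Fetching tiles, decoding the `TransparencyLogEntry`, and checking the checkpoint signature are all out of scope here, and the function names are illustrative.

```python
import hashlib
import hmac


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def verify_inclusion(leaf: bytes, index: int, tree_size: int,
                     proof: list[bytes], root: bytes) -> bool:
    """Recompute the root from a leaf hash and its audit path."""
    if index < 0 or index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    r = leaf
    for p in proof:
        if sn == 0:
            return False  # proof is longer than the tree is deep
        if (fn & 1) or fn == sn:
            r = node_hash(p, r)
            # Skip levels where this node has no right sibling.
            while fn and not (fn & 1):
                fn >>= 1
                sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and hmac.compare_digest(r, root)


# Three-leaf tree: root = H(H(h0, h1), h2)
h = [leaf_hash(x) for x in (b"a", b"b", b"c")]
root = node_hash(node_hash(h[0], h[1]), h[2])

print(verify_inclusion(h[2], 2, 3, [node_hash(h[0], h[1])], root))  # True
print(verify_inclusion(h[0], 0, 3, [h[1], h[2]], root))             # True
print(verify_inclusion(h[1], 0, 3, [h[1], h[2]], root))             # False
```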
+
+## 8. Sequencing (3 sessions)
+
+**Session A (this one or next):**
+- Approve plan with Zion (this document).
+- Add `tlog_kind` field, keep default behavior unchanged. Land + tests.
+- Build `oversight_core/rekor.py` skeleton with the DSSE construction,
+ unit-tested against a fixture envelope (no network).
+
+**Session B:**
+- Wire `registry/server.py` to call Rekor for new registrations.
+- `tests/test_rekor_e2e.py` against `log2025-1.rekor.sigstore.dev`.
+- Backward compat test against v0.4-era fixtures.
+
+**Session C:**
+- Rust `oversight-rekor` crate.
+- Cross-language Rekor conformance.
+- Update `docs/SPEC.md`, bump version to 0.5.0, ship.
+
+## 8b. Desktop review fixes applied 2026-04-19
+
+Independent review by desktop session caught six issues; all addressed before
+Session A landed:
+
+1. **DSSE choice confirmed** - hashedrekord cannot carry structured
+ attestations; Rekor v2 forces this choice.
+2. **Predicate URI pinned** to git-tagged GitHub path
+ `https://github.com/oversight-protocol/oversight/blob/v0.5.0/docs/predicates/registration-v1.md`
+ instead of `oversight.dev` (which Zion may not own / could be squatted).
+ Predicate body now also carries `predicate_version: 1` for cheap
+ version gating without URI parsing.
+3. **Bundle gained four 5-year-replay fields:**
+ `rekor.log_pubkey_pem` (raw key at write time, lets verifiers skip TUF),
+ `rekor.checkpoint` (signed tree-head promoted out of the protobuf so a
+ strip-happy serializer can't drop it),
+ `rekor.log_entry_schema = "rekor/v1.TransparencyLogEntry"` (schema URI for
+ the opaque base64 blob), and the optional
+ `rfc3161_chain` (full TSA cert chain so 2031 verifiers can validate the
+ token after the TSA cert has expired).
+4. **`bundle_schema: 2` integer** added so pre-v0.5 verifiers fail fast with
+ "unknown schema, upgrade" instead of mis-routing on `tlog_kind`.
+5. **`sigstore-python>=4.1,<5` pin** for the auditor helper. Rekor v2 support
+ is stable since v4.0.0 (2025-09-19). No beta risk.
+6. **Privacy fix (critical):** the on-log predicate now carries
+ `recipient_pubkey_sha256` instead of the raw X25519 public key. Otherwise
+ anyone could enumerate recipients by pubkey or correlate marks across
+ issuers. The raw key stays in the local `.sealed` bundle. New unit test
+ `t8_recipient_pubkey_never_appears_raw` enforces this.
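The privacy fix in item 6 can be illustrated with a short sketch: hash the raw X25519 key before it goes into the predicate, then check that no common encoding of the raw key appears in the serialized payload. The helper names are hypothetical and the check is a simplified stand-in for the unit test named above, not a copy of it. The 32-byte input is a placeholder, not a real key.

```python
import base64
import hashlib
import json


def recipient_pubkey_sha256(x25519_pub_raw: bytes) -> str:
    """Hex SHA-256 of the raw 32-byte X25519 public key."""
    if len(x25519_pub_raw) != 32:
        raise ValueError("expected a 32-byte raw X25519 public key")
    return hashlib.sha256(x25519_pub_raw).hexdigest()


def contains_raw_key(on_log_payload: bytes, x25519_pub_raw: bytes) -> bool:
    """True if the payload carries the key raw, as hex, or as base64."""
    encodings = (
        x25519_pub_raw,
        x25519_pub_raw.hex().encode("ascii"),
        base64.b64encode(x25519_pub_raw),
        base64.urlsafe_b64encode(x25519_pub_raw),
    )
    return any(enc in on_log_payload for enc in encodings)


pub = bytes(range(32))  # placeholder bytes, not a real key
predicate = {
    "predicate_version": 1,
    "recipient_pubkey_sha256": recipient_pubkey_sha256(pub),
}
payload = json.dumps(predicate, sort_keys=True).encode("utf-8")
print(contains_raw_key(payload, pub))  # False
```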
+
+## 9. Open questions to surface to Zion before Session B
+
+1. Predicate URI: `https://oversight.dev/registration/v1` - does he own
+ oversight.dev? If not, use `https://github.com/oversight-protocol/spec/registration/v1`
+ so the URI resolves to public spec docs.
+2. Auditor helper: ship inside `oversight_core/` or as a separate
+ `oversight-auditor` PyPI package so non-issuers can `pip install` it
+ without pulling Oversight's full crypto stack?
+3. Should v0.5 also write a tiny `verify-bundle` standalone Rust binary
+ (~200 LOC, depends only on the `sigstore` crate) for distribution to
+ journalists / lawyers / non-technical leak responders?
docs/architecture.svg +163 -0
@@ -0,0 +1,163 @@
+<svg viewBox="0 0 1400 900" xmlns="http://www.w3.org/2000/svg" font-family="ui-sans-serif, system-ui, sans-serif">
+ <style>
+ .title { font-size: 26px; font-weight: 600; fill: #0b1220; }
+ .subtitle { font-size: 14px; fill: #475569; }
+ .lane { font-size: 15px; font-weight: 600; fill: #334155; }
+ .box { fill: white; stroke: #334155; stroke-width: 1.5; rx: 8; ry: 8; }
+ .box-accent { fill: #f8fafc; stroke: #0f172a; stroke-width: 2; rx: 8; ry: 8; }
+ .box-warn { fill: #fef3c7; stroke: #b45309; stroke-width: 1.5; rx: 8; ry: 8; }
+ .box-good { fill: #dcfce7; stroke: #15803d; stroke-width: 1.5; rx: 8; ry: 8; }
+ .box-leak { fill: #fee2e2; stroke: #b91c1c; stroke-width: 1.5; rx: 8; ry: 8; }
+ .label { font-size: 13px; font-weight: 600; fill: #0f172a; }
+ .sublabel { font-size: 11px; fill: #475569; }
+ .arrow { stroke: #334155; stroke-width: 1.5; fill: none; marker-end: url(#ah); }
+ .arrow-dash { stroke: #334155; stroke-width: 1.5; stroke-dasharray: 6,4; fill: none; marker-end: url(#ah); }
+ .arrow-red { stroke: #b91c1c; stroke-width: 2; fill: none; marker-end: url(#ahr); }
+ .lane-bg-a { fill: #f1f5f9; }
+ .lane-bg-b { fill: #ecfdf5; }
+ .lane-bg-c { fill: #fef3c7; opacity: 0.4; }
+ .lane-bg-d { fill: #fee2e2; opacity: 0.4; }
+ .caption { font-size: 11px; fill: #1e293b; font-style: italic; }
+ </style>
+
+ <defs>
+ <marker id="ah" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="8" markerHeight="8" orient="auto">
+ <path d="M0,0 L10,5 L0,10 z" fill="#334155"/>
+ </marker>
+ <marker id="ahr" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="8" markerHeight="8" orient="auto">
+ <path d="M0,0 L10,5 L0,10 z" fill="#b91c1c"/>
+ </marker>
+ </defs>
+
+ <!-- title -->
+ <text x="40" y="42" class="title">OVERSIGHT - Sealed Entity, Notarized Trust, Integrity &amp; Evidence Layer</text>
+ <text x="40" y="66" class="subtitle">Open protocol for data provenance, attribution, and leak detection. Format-agnostic, post-quantum-ready, jurisdiction-aware.</text>
+
+ <!-- Lane A: Issuer side -->
+ <rect x="30" y="90" width="400" height="380" class="lane-bg-a" rx="8"/>
+ <text x="50" y="115" class="lane">ISSUER SIDE (sealing)</text>
+
+ <rect x="60" y="135" width="340" height="60" class="box"/>
+ <text x="80" y="158" class="label">1. Plaintext payload</text>
+ <text x="80" y="178" class="sublabel">any format: docx, pdf, mp4, json, raw bytes</text>
+
+ <rect x="60" y="210" width="340" height="70" class="box-accent"/>
+ <text x="80" y="233" class="label">2. Apply per-recipient watermarks</text>
+ <text x="80" y="252" class="sublabel">L1 zero-width unicode · L2 whitespace patterns</text>
+ <text x="80" y="269" class="sublabel">L3 synonym rotation · L4 DCT visual (reserved)</text>
+
+ <rect x="60" y="295" width="340" height="70" class="box-accent"/>
+ <text x="80" y="318" class="label">3. Build signed manifest</text>
+ <text x="80" y="337" class="sublabel">file_id · recipient · content_hash · policy · marks</text>
+ <text x="80" y="354" class="sublabel">Ed25519 signature (+ ML-DSA in HYBRID suite)</text>
+
+ <rect x="60" y="380" width="340" height="70" class="box-accent"/>
+ <text x="80" y="403" class="label">4. Seal → .sealed blob</text>
+ <text x="80" y="422" class="sublabel">XChaCha20-Poly1305(DEK, plaintext, AAD=content_hash)</text>
+ <text x="80" y="439" class="sublabel">DEK wrapped: X25519 ECDH (+ ML-KEM-768 hybrid)</text>
+
+ <!-- Arrow down -->
+ <path d="M230,195 L230,210" class="arrow"/>
+ <path d="M230,280 L230,295" class="arrow"/>
+ <path d="M230,365 L230,380" class="arrow"/>
+
+ <!-- Lane B: Recipient -->
+ <rect x="480" y="90" width="400" height="380" class="lane-bg-b" rx="8"/>
+ <text x="500" y="115" class="lane">RECIPIENT SIDE (open)</text>
+
+ <rect x="510" y="135" width="340" height="60" class="box-good"/>
+ <text x="530" y="158" class="label">5. Alice receives .sealed blob</text>
+ <text x="530" y="178" class="sublabel">over any channel: email, S3, USB, Slack, etc.</text>
+
+ <rect x="510" y="210" width="340" height="70" class="box-good"/>
+ <text x="530" y="233" class="label">6. Verify manifest signature</text>
+ <text x="530" y="252" class="sublabel">Rejects forged issuer, tampered metadata,</text>
+ <text x="530" y="269" class="sublabel">or untrusted issuer (pinned key set).</text>
+
+ <rect x="510" y="295" width="340" height="70" class="box-good"/>
+ <text x="530" y="318" class="label">7. Unwrap DEK, decrypt, verify hash</text>
+ <text x="530" y="337" class="sublabel">AEAD tag binds ciphertext↔manifest.</text>
+ <text x="530" y="354" class="sublabel">Post-decrypt SHA-256 cross-check.</text>
+
+ <rect x="510" y="380" width="340" height="70" class="box-good"/>
+ <text x="530" y="403" class="label">8. Render → beacons fire (passive)</text>
+ <text x="530" y="422" class="sublabel">&lt;img&gt; tag, DNS, OCSP, license check.</text>
+ <text x="530" y="439" class="sublabel">No code exec. No RAT. Standard HTTP only.</text>
+
+ <path d="M680,195 L680,210" class="arrow"/>
+ <path d="M680,280 L680,295" class="arrow"/>
+ <path d="M680,365 L680,380" class="arrow"/>
+
+ <!-- Arrow from issuer to recipient -->
+ <path d="M400,415 C430,415 450,415 510,165" class="arrow"/>
+ <text x="420" y="320" class="caption" transform="rotate(-38 420 320)">sealed file</text>
+
+ <!-- Lane C: Registry -->
+ <rect x="930" y="90" width="440" height="380" class="lane-bg-c" rx="8"/>
+ <text x="950" y="115" class="lane">ATTRIBUTION REGISTRY</text>
+
+ <rect x="960" y="135" width="380" height="60" class="box-warn"/>
+ <text x="980" y="158" class="label">POST /register</text>
+ <text x="980" y="178" class="sublabel">issuer submits manifest + beacon+mark IDs</text>
+
+ <rect x="960" y="210" width="380" height="90" class="box-warn"/>
+ <text x="980" y="233" class="label">Beacon receivers</text>
+ <text x="980" y="252" class="sublabel">GET /p/{id}.png - http image beacon</text>
+ <text x="980" y="269" class="sublabel">GET /r/{id} - ocsp-style beacon</text>
+ <text x="980" y="286" class="sublabel">GET /v/{id} - license check</text>
+
+ <rect x="960" y="315" width="380" height="70" class="box-warn"/>
+ <text x="980" y="338" class="label">Event logging</text>
+ <text x="980" y="357" class="sublabel">RFC 3161 qualified timestamps</text>
+ <text x="980" y="374" class="sublabel">Transparency log (Sigstore-style Merkle)</text>
+
+ <rect x="960" y="400" width="380" height="55" class="box-warn"/>
+ <text x="980" y="423" class="label">POST /attribute GET /evidence/{file_id}</text>
+ <text x="980" y="441" class="sublabel">maps token_id | mark_id → (file, recipient, issuer)</text>
+
+ <!-- Issuer registers -->
+ <path d="M400,165 L960,165" class="arrow-dash"/>
+ <text x="570" y="158" class="caption">/register (issuer → registry)</text>
+
+ <!-- Recipient beacons call registry -->
+ <path d="M850,415 L960,245" class="arrow-dash"/>
+ <text x="875" y="330" class="caption" transform="rotate(-52 875 330)">beacon callbacks on open</text>
+
+ <!-- Lane D: Leak path -->
+ <rect x="30" y="500" width="1340" height="360" class="lane-bg-d" rx="8"/>
+ <text x="50" y="525" class="lane">LEAK &amp; ATTRIBUTION PATH</text>
+
+ <rect x="60" y="545" width="300" height="80" class="box-leak"/>
+ <text x="80" y="568" class="label">9. Recipient leaks plaintext</text>
+ <text x="80" y="587" class="sublabel">posts to breach forum, Telegram,</text>
+ <text x="80" y="604" class="sublabel">paste site, Tor market, etc.</text>
+
+ <rect x="400" y="545" width="300" height="80" class="box-leak"/>
+ <text x="420" y="568" class="label">10. Scraper detects leak</text>
+ <text x="420" y="587" class="sublabel">continuous monitoring of known</text>
+ <text x="420" y="604" class="sublabel">leak channels (Artemis pipeline)</text>
+
+ <rect x="740" y="545" width="300" height="80" class="box-leak"/>
+ <text x="760" y="568" class="label">11. Extract watermarks</text>
+ <text x="760" y="587" class="sublabel">L1/L2/L3 recovery from</text>
+ <text x="760" y="604" class="sublabel">leaked plaintext</text>
+
+ <rect x="1080" y="545" width="290" height="80" class="box-leak"/>
+ <text x="1100" y="568" class="label">12. Query registry /attribute</text>
+ <text x="1100" y="587" class="sublabel">mark_id → source recipient</text>
+ <text x="1100" y="604" class="sublabel">identified within seconds</text>
+
+ <path d="M360,585 L400,585" class="arrow-red"/>
+ <path d="M700,585 L740,585" class="arrow-red"/>
+ <path d="M1040,585 L1080,585" class="arrow-red"/>
+
+ <rect x="60" y="660" width="1310" height="100" class="box"/>
+ <text x="80" y="685" class="label">13. Evidence bundle for forensic + legal response</text>
+ <text x="80" y="707" class="sublabel">signed manifest · per-event qualified timestamps · transparency-log proofs · event log (IPs, UAs, callbacks)</text>
+ <text x="80" y="725" class="sublabel">combined with ISO/IEC 27037 chain-of-custody procedure and RFC 3161 TimeStampTokens → foundation of court-admissible report</text>
+ <text x="80" y="745" class="sublabel">NOTE: the bundle is a provenance record, not a legal finding. A qualified examiner + methodology docs are required for court admission.</text>
+
+ <rect x="60" y="780" width="1310" height="65" class="box-accent"/>
+ <text x="80" y="804" class="label">Layers not in MVP (roadmap): hardware attestation (Intel TDX / AMD SEV-SNP / Nitro Enclaves) for key-gated decryption inside attested TEEs;</text>
+ <text x="80" y="822" class="sublabel">LLM-generated decoy files scattered in sensitive folders; C2PA interop for media; federated multi-registry trust fabric; hybrid PQ suites fully linked via liboqs.</text>
+</svg>
docs/predicates/registration-v1.md +73 -0
@@ -0,0 +1,73 @@
+# Oversight Registration Predicate v1
+
+**Predicate Type URI:**
+`https://github.com/oversight-protocol/oversight/blob/v0.5.0/docs/predicates/registration-v1.md`
+
+**Statement type:** `https://in-toto.io/Statement/v1`
+**Envelope:** DSSE (`application/vnd.in-toto+json`)
+**Signature algorithm:** Ed25519 (issuer key from the Oversight manifest)
+
+## Purpose
+
+This predicate describes the act of an Oversight issuer registering a sealed
+file's mark with a public transparency log (Sigstore Rekor v2). The DSSE
+envelope is uploaded to Rekor; the returned `TransparencyLogEntry` is then
+embedded in the local evidence bundle.
+
+The predicate is intentionally minimal on the public log - recipient
+identifiers and pubkeys are hashed before publication so the log cannot be
+mined for "who got what."
+
+## Subject
+
+A statement carries exactly one subject:
+
+```json
+{
+ "name": "mark:<mark_id_hex>",
+ "digest": {"sha256": "<plaintext sha256 hex>"}
+}
+```
+
+`mark_id_hex` is the 128-bit watermark identifier in lowercase hex. It is an
+opaque random value; it is NOT a human-meaningful label and contains no PII.
+
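+`digest.sha256` is the SHA-256 of the plaintext that was sealed. This is the
+hook auditors use to find matching registrations when investigating a leak:
+hash the leaked text and look the digest up in the registry's local index
+(Rekor v2 itself offers no search API). The match is byte-exact, so a leaked
+copy that was edited or re-encoded has to be attributed via its watermarks.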
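A minimal sketch of building the subject object described in this section, assuming the mark identifier is held as 16 raw bytes. `build_subject` is a hypothetical helper name, and the inputs are placeholders.

```python
import hashlib


def build_subject(mark_id: bytes, plaintext: bytes) -> dict:
    """Build the single in-toto subject for a registration statement."""
    if len(mark_id) != 16:
        raise ValueError("mark_id must be 128 bits (16 bytes)")
    return {
        "name": f"mark:{mark_id.hex()}",  # bytes.hex() yields lowercase hex
        "digest": {"sha256": hashlib.sha256(plaintext).hexdigest()},
    }


subject = build_subject(bytes(range(16)), b"example sealed plaintext")
print(subject["name"])  # mark:000102030405060708090a0b0c0d0e0f
```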
+
+## Predicate body fields
+
+| field | type | required | notes |
+|-----------------------------|-------------|----------|------------------------------------------------------------|
+| `predicate_version` | int | yes | Always `1` for this URI. |
+| `file_id` | string UUID | yes | The Oversight manifest's `file_id`. |
+| `issuer_pubkey_ed25519` | hex string | yes | Verifying key for the DSSE envelope and the manifest. |
+| `recipient_id` | string | yes | SHOULD be a hash or UUID. Issuers MUST NOT publish raw PII.|
+| `recipient_pubkey_sha256` | hex string | yes | `sha256(recipient_x25519_pub_raw_bytes)`. NEVER the raw key.|
+| `suite` | string | yes | `OSGT-CLASSIC-v1` / `OSGT-PQ-HYBRID-v1` / `OSGT-HW-P256-v1`.|
+| `registered_at` | string | yes | ISO 8601 UTC timestamp. |
+| `policy` | object | yes | Subset of the manifest policy that bears on attribution. |
+| `watermarks` | object | yes | `{L1:bool, L2:bool, L3:bool}` - which layers were embedded.|
+| `rfc3161_tsa` | string URL | optional | TSA endpoint used. |
+| `rfc3161_token_b64` | base64 | optional | Raw RFC 3161 TimeStampToken. |
+| `rfc3161_chain_b64` | base64 | optional | Concatenated PEM cert chain for TSA validation post-expiry.|
+
+## Privacy contract
+
+The on-log payload MUST NOT contain:
+- Raw recipient public keys.
+- Email addresses, phone numbers, or other directly identifying recipient PII.
+- File content, even ciphertext.
+- Watermark mark_ids belonging to other recipients of the same source file
+ (one statement, one recipient).
+
+Issuers who need to retain the raw recipient pubkey MUST keep it in the local
+`.sealed` bundle, not in the DSSE envelope.
+
+## Versioning
+
+Backward-incompatible changes to this predicate body produce a new file at a
+new git tag, e.g. `…/blob/v0.6.0/docs/predicates/registration-v2.md`. The URI
+itself is the version anchor; never re-edit a published predicate URI's
+contents.
examples/live_demo.py +164 -0
@@ -0,0 +1,164 @@
+#!/usr/bin/env python3
+"""
+Live demo: integration with the OVERSIGHT registry.
+
+Flow:
+ 1. Seal a document for Alice and register it with the registry
+ 2. Simulate the document being opened (triggering image/OCSP/license beacons)
+ 3. Query the registry for attribution via the beacon token_id
+ 4. Simulate the plaintext leaking; recover watermarks and attribute via the registry
+ 5. Pull a full evidence bundle for the file
+"""
+
+import json
+import sys
+import time
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+
+import httpx
+
+from oversight_core import (
+ ClassicIdentity, Manifest, Recipient, WatermarkRef,
+ content_hash, seal, open_sealed, beacon, watermark,
+)
+
+REG = "http://127.0.0.1:8765"
+
+
+def banner(m): print(f"\n{'='*64}\n {m}\n{'='*64}")
+
+
+def main():
+    banner("1. Generate identities")
+    issuer = ClassicIdentity.generate()
+    alice = ClassicIdentity.generate()
+
+    banner("2. Prepare watermarked plaintext")
+    lines = [f"Acme Q3 forecast - line {i}: confidential projections." for i in range(80)]
+    original = "\n".join(lines)
+    mark_zw = watermark.new_mark_id()
+    mark_ws = watermark.new_mark_id()
+    # L1 (zero-width) is embedded first, then L2 (whitespace) on top of it.
+    wm_text = watermark.embed_ws(watermark.embed_zw(original, mark_zw), mark_ws)
+    plaintext = wm_text.encode("utf-8")
+    print(f" L1 mark = {mark_zw.hex()}")
+    print(f" L2 mark = {mark_ws.hex()}")
+
+    banner("3. Build manifest + beacons, then seal")
+    beacons = beacon.gen_beacons("oversight.local", "pending", "alice@acme.corp")
+    recipient = Recipient(
+        recipient_id="alice@acme.corp",
+        x25519_pub=alice.x25519_pub.hex(),
+        ed25519_pub=alice.ed25519_pub.hex(),
+    )
+    m = Manifest.new(
+        original_filename="q3_forecast.txt",
+        content_hash=content_hash(plaintext),
+        size_bytes=len(plaintext),
+        issuer_id="acme.corp.legal",
+        issuer_ed25519_pub_hex=issuer.ed25519_pub.hex(),
+        recipient=recipient,
+        registry_url=REG,
+        content_type="text/plain",
+    )
+    m.watermarks = [
+        WatermarkRef(layer="L1_zero_width", mark_id=mark_zw.hex()),
+        WatermarkRef(layer="L2_whitespace", mark_id=mark_ws.hex()),
+    ]
+    m.beacons = [b.to_dict() for b in beacons]
+
+    blob = seal(plaintext, m, issuer.ed25519_priv, alice.x25519_pub)
+    print(f" sealed = {len(blob)} bytes")
+    print(f" file_id = {m.file_id}")
+
+    banner("4. Register with registry")
+    r = httpx.post(f"{REG}/register", json={
+        "manifest": m.to_dict(),
+        "beacons": [b.to_dict() for b in beacons],
+        "watermarks": [{"mark_id": w.mark_id, "layer": w.layer} for w in m.watermarks],
+    })
+    print(f" POST /register -> {r.status_code} {r.json()}")
+
+    banner("5. Simulate reader opening the document (triggers HTTP beacons)")
+    # In real life the office/PDF reader fetches <img> beacons automatically
+    # against the beacon domain, which resolves to the registry operator's
+    # infrastructure. Here we rewrite beacon URLs to the local registry.
+    def local_url(b):
+        if b.kind == "http_img":
+            return f"{REG}/p/{b.token_id}.png"
+        if b.kind == "ocsp":
+            return f"{REG}/r/{b.token_id}"
+        if b.kind == "license":
+            return f"{REG}/v/{b.token_id}"
+        return None
+
+    triggered = []
+    for b in beacons:
+        if b.kind == "dns":
+            print(f" [dns ] would resolve {b.dns_name} (needs DNS server, skipped)")
+            continue
+        url = local_url(b)
+        if url is None:
+            # Unknown beacon kind: there is no local endpoint to call.
+            continue
+        r = httpx.get(url, follow_redirects=True,
+                      headers={"User-Agent": "Mozilla/5.0 OfficeDocViewer/2024"})
+        triggered.append(b.token_id)
+        print(f" [{b.kind:<8}] GET {url} -> {r.status_code}")
+    # Brief pause before querying the registry for the events just generated.
+    time.sleep(0.3)
+
+    banner("6. Query registry for attribution via beacon token_id")
+    if not triggered:
+        sys.exit("no HTTP beacons were triggered; nothing to attribute")
+    tid = triggered[0]
+    r = httpx.post(f"{REG}/attribute", json={"token_id": tid})
+    data = r.json()
+    print(f" found = {data['found']}")
+    print(f" file_id = {data['file_id']}")
+    print(f" recipient = {data['recipient_id']}")
+    print(f" issuer = {data['issuer_id']}")
+    print(f" events:")
+    for e in data["recent_events"][:5]:
+        print(f" {e['qualified_timestamp']} {e['kind']:<10} ip={e['source_ip']} ua={e['user_agent'][:40]}")
+
+    banner("7. Simulate leak: attacker posts plaintext to breach forum")
+    # Decrypt Alice's copy, pretend it ended up on BreachForums, and run attribution.
+    decrypted, _ = open_sealed(blob, recipient_x25519_priv=alice.x25519_priv)
+    leaked_text = decrypted.decode("utf-8")
+    print(f" leaked plaintext size: {len(leaked_text)} chars")
+
+    recovered = watermark.recover_marks(leaked_text)
+    for layer, mlist in recovered.items():
+        uniq = sorted({mm.hex() for mm in mlist})
+        if uniq:
+            print(f" {layer}: recovered unique IDs = {uniq}")
+
+    banner("8. Attribute leaked copy to recipient")
+    for layer, mlist in recovered.items():
+        seen = set()  # query each distinct mark only once per layer
+        for mm in mlist:
+            h = mm.hex()
+            if h in seen:
+                continue
+            seen.add(h)
+            r = httpx.post(f"{REG}/attribute", json={"mark_id": h, "layer": layer})
+            d = r.json()
+            if d.get("found"):
+                print(f" [!!] LEAK ATTRIBUTED via {layer} mark {h}")
+                print(f" file_id = {d['file_id']}")
+                print(f" recipient = {d['recipient_id']} <-- source of leak")
+                print(f" issuer = {d['issuer_id']}")
+
+    banner("9. Pull full evidence bundle")
+    r = httpx.get(f"{REG}/evidence/{m.file_id}")
+    bundle = r.json()
+    print(f" file_id = {bundle['file_id']}")
+    print(f" bundle ts = {bundle['bundle_generated_at']}")
+    print(f" manifest issuer = {bundle['manifest']['issuer_id']}")
+    print(f" beacons = {len(bundle['beacons'])}")
+    print(f" watermarks = {len(bundle['watermarks'])}")
+    print(f" events logged = {len(bundle['events'])}")
+    print(f" disclaimer = {bundle['disclaimer'][:80]}...")
+
+    banner("DEMO COMPLETE")
+
+
+if __name__ == "__main__":
+ main()
examples/live_demo_v2.py +164 -0
@@ -0,0 +1,164 @@
+#!/usr/bin/env python3
+"""OVERSIGHT v0.2 live demo - full registry integration including tlog and signed bundles."""
+
+import sys
+import time
+import json
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+
+import httpx
+from oversight_core import (
+ ClassicIdentity, Manifest, Recipient, WatermarkRef,
+ content_hash, seal, open_sealed, beacon, watermark,
+)
+from oversight_core import semantic
+
+REG = "http://127.0.0.1:8765"
+
+
+def banner(m): print(f"\n{'='*64}\n {m}\n{'='*64}")
+
+
+def main():
+    banner("1. Check registry is up, show well-known")
+    r = httpx.get(f"{REG}/.well-known/oversight-registry")
+    wk = r.json()
+    print(f" registry pub = {wk['ed25519_pub'][:32]}...")
+    print(f" version = {wk['version']}")
+    print(f" tlog_size = {wk['tlog_size']}")
+
+    banner("2. Seal a multi-layer-watermarked document for Alice")
+    issuer = ClassicIdentity.generate()
+    alice = ClassicIdentity.generate()
+
+    lines = [f"Acme Q3 forecast line {i}: we begin to show significant results and help our customers find answers." for i in range(60)]
+    original = "\n".join(lines)
+    mid_zw = watermark.new_mark_id()
+    mid_ws = watermark.new_mark_id()
+    mid_sem = watermark.new_mark_id()
+    t = semantic.apply_semantic(original, mid_sem)
+    t = watermark.embed_ws(t, mid_ws)
+    t = watermark.embed_zw(t, mid_zw)
+    plaintext = t.encode("utf-8")
+    print(f" plaintext {len(plaintext)} bytes, 3-layer watermarked")
+
+    beacons = beacon.gen_beacons("oversight.local", "pending", "alice@acme")
+    rec = Recipient(recipient_id="alice@acme", x25519_pub=alice.x25519_pub.hex(), ed25519_pub=alice.ed25519_pub.hex())
+    m = Manifest.new("q3_forecast.txt", content_hash(plaintext), len(plaintext),
+                     "acme", issuer.ed25519_pub.hex(), rec, REG, "text/plain")
+    m.watermarks = [
+        WatermarkRef(layer="L1_zero_width", mark_id=mid_zw.hex()),
+        WatermarkRef(layer="L2_whitespace", mark_id=mid_ws.hex()),
+        WatermarkRef(layer="L3_semantic", mark_id=mid_sem.hex()),
+    ]
+    m.beacons = [b.to_dict() for b in beacons]
+    seal(plaintext, m, issuer.ed25519_priv, alice.x25519_pub)
+    print(f" file_id = {m.file_id}")
+
+    banner("3. Register with v0.2 registry (tlog-backed)")
+    r = httpx.post(f"{REG}/register", json={
+        "manifest": m.to_dict(),
+        "beacons": [b.to_dict() for b in beacons],
+        "watermarks": [{"mark_id": w.mark_id, "layer": w.layer} for w in m.watermarks],
+    })
+    reg_resp = r.json()
+    print(f" /register -> {r.status_code}")
+    print(f" file_id = {reg_resp['file_id']}")
+    print(f" tlog_index = {reg_resp['tlog_index']}")
+
+    banner("4. Trigger beacons (HTTP image + OCSP + license)")
+    for b in beacons:
+        if b.kind == "dns":
+            continue
+        url_map = {
+            "http_img": f"{REG}/p/{b.token_id}.png",
+            "ocsp": f"{REG}/r/{b.token_id}",
+            "license": f"{REG}/v/{b.token_id}",
+        }
+        r = httpx.get(url_map[b.kind], headers={"User-Agent": "OfficeDocViewer/2024"})
+        print(f" [{b.kind:<8}] -> {r.status_code}")
+
+    banner("5. Query tlog head and get signed tree state")
+    r = httpx.get(f"{REG}/tlog/head")
+    head = r.json()
+    print(f" tlog size = {head['size']}")
+    print(f" tlog root = {head['root'][:32]}...")
+    print(f" signature = {head['signature'][:32]}...")
+
+    banner("6. Get inclusion proof for registration event")
+    r = httpx.get(f"{REG}/tlog/proof/{reg_resp['tlog_index']}")
+    proof = r.json()
+    print(f" proof for idx={proof['index']}:")
+    print(f" leaf hash = {proof['leaf_hash'][:32]}...")
+    print(f" root = {proof['root'][:32]}...")
+    print(f" siblings = {len(proof['proof'])} hashes")
+
+    banner("7. Simulate airgap-strip attack on a leaked copy")
+    decrypted, _ = open_sealed(seal(plaintext, m, issuer.ed25519_priv, alice.x25519_pub), alice.x25519_priv)
+    leaked = decrypted.decode()
+    # Strip L1 zero-width + normalize whitespace (defeats L1 and L2)
+    for zw in ("\u200b", "\u200c", "\u200d"):
+        leaked = leaked.replace(zw, "")
+    leaked = "\n".join(line.rstrip() for line in leaked.splitlines())
+    print(f" post-strip leaked size: {len(leaked)} chars")
+
+    banner("8. L3 semantic attribution against registry")
+    # In a real deployment, the scraper would pull candidate mark_ids from the registry.
+    # Here we just test against the mark we know.
+    result = semantic.verify_semantic(leaked, mid_sem)
+    print(f" synonyms score = {result['synonyms_score']:.3f} (match={result['synonyms_match']})")
+    print(f" overall match = {result['overall_match']}")
+    if result["overall_match"]:
+        r = httpx.post(f"{REG}/attribute", json={"mark_id": mid_sem.hex(), "layer": "L3_semantic"})
+        data = r.json()
+        if data.get("found"):
+            print(" [!!] LEAK ATTRIBUTED via L3 semantic watermark")
+            print(f" file_id = {data['file_id']}")
+            print(f" recipient = {data['recipient_id']} (leaked by)")
+            print(f" issuer = {data['issuer_id']}")
+
+    banner("9. Request SIGNED evidence bundle")
+    r = httpx.get(f"{REG}/evidence/{m.file_id}")
+    bundle = r.json()
+    print(f" file_id = {bundle['file_id']}")
+    print(f" bundle ts = {bundle['bundle_generated_at']}")
+    print(f" registry pub = {bundle['registry_pub'][:32]}...")
+    print(f" signature = {bundle['bundle_signature_ed25519'][:32]}...")
+    print(f" tlog head size = {bundle['tlog_head']['size']}")
+    print(f" beacons = {len(bundle['beacons'])}")
+    print(f" watermarks = {len(bundle['watermarks'])}")
+    print(f" events logged = {len(bundle['events'])}")
+
+    banner("10. Verify the bundle signature (as an external auditor would)")
+    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
+    pub = Ed25519PublicKey.from_public_bytes(bytes.fromhex(bundle["registry_pub"]))
+    sig = bytes.fromhex(bundle.pop("bundle_signature_ed25519"))
+    msg = json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")
+    try:
+        pub.verify(sig, msg)
+        print(" [ok] bundle signature VERIFIED - this bundle came from this registry.")
+    except Exception as e:
+        print(f" [FAIL] signature verification failed: {e}")
+
+    banner("11. Rate-limit test: hit beacon 50x rapidly")
+    ok_count = 0
+    throttled_count = 0
+    # Pick the http_img beacon explicitly: /p/<token>.png is the image route (see step 4),
+    # and relying on beacons[1] assumes a fixed ordering from gen_beacons().
+    img_beacon = next(b for b in beacons if b.kind == "http_img")
+    for _ in range(50):
+        r = httpx.get(f"{REG}/p/{img_beacon.token_id}.png")
+        if r.status_code == 200:
+            ok_count += 1
+        elif r.status_code == 429:
+            throttled_count += 1
+    print(f" allowed = {ok_count}")
+    print(f" throttled= {throttled_count}")
+    if throttled_count > 0:
+        print(" [ok] rate limiter is working")
+
+    banner("DEMO COMPLETE - v0.2")
+
+
+if __name__ == "__main__":
+    main()
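Step 6 of the demo fetches an inclusion proof (`index`, `leaf_hash`, `root`, `proof`) but only prints it. The sketch below shows one way an auditor might check such a proof offline. It assumes RFC 6962-style SHA-256 hashing (a 0x00 prefix for leaves, 0x01 for interior nodes) and a sibling list ordered from leaf to root. The registry's actual tree layout is not shown in this diff, so the hashing scheme and every helper name here (`leaf_hash`, `merkle_root`, `inclusion_proof`, `verify_inclusion`) are assumptions rather than the project's API.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(data: bytes) -> bytes:
    # Assumed RFC 6962 leaf prefix.
    return _h(b"\x00" + data)


def node_hash(left: bytes, right: bytes) -> bytes:
    # Assumed RFC 6962 interior-node prefix.
    return _h(b"\x01" + left + right)


def _split(n: int) -> int:
    """Largest power of two strictly less than n (n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: list) -> bytes:
    """Root over a non-empty list of leaf hashes."""
    if len(leaves) == 1:
        return leaves[0]
    k = _split(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def inclusion_proof(leaves: list, index: int) -> list:
    """Sibling hashes for leaves[index], ordered leaf to root."""
    if len(leaves) == 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return inclusion_proof(leaves[:k], index) + [merkle_root(leaves[k:])]
    return inclusion_proof(leaves[k:], index - k) + [merkle_root(leaves[:k])]


def verify_inclusion(leaf: bytes, index: int, size: int,
                     proof: list, root: bytes) -> bool:
    """Recompute the root from a leaf hash and its audit path."""
    if index >= size:
        return False
    fn, sn, r = index, size - 1, leaf
    for p in proof:
        if sn == 0:
            return False
        if (fn & 1) or fn == sn:
            r = node_hash(p, r)
            # Skip levels where this node has no right sibling.
            while fn and not (fn & 1):
                fn >>= 1
                sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root


if __name__ == "__main__":
    hashes = [leaf_hash(f"event-{i}".encode()) for i in range(7)]
    root = merkle_root(hashes)
    path = inclusion_proof(hashes, 4)
    print("proof length:", len(path))
    print("verifies:", verify_inclusion(hashes[4], 4, len(hashes), path, root))
```

Against a live registry, the values from `/tlog/proof/<index>` and the signed `/tlog/head` would replace the locally built tree, provided the registry really uses this hashing layout.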
integrations/__init__.py +1 -0
@@ -0,0 +1 @@
+"""OVERSIGHT integrations: Flywheel scraper hook, Perseus CanaryKeeper agent."""
integrations/flywheel_oversight_match.py +251 -0
@@ -0,0 +1,251 @@
+"""
+oversight_match - Flywheel job module.
+
+Registers a new Flywheel job kind `oversight_match` that takes scraped content
+(text, attached images, attached PDFs/DOCX) and checks it against the
+OVERSIGHT registry for leaked-file attribution.
+
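+How to register this with Flywheel (handle_scraped also takes registry_url, so
+bind it first; this assumes Flywheel calls the handler with the job dict only):
+    from functools import partial
+    from integrations.flywheel_oversight_match import handle_scraped
+    flywheel.register_job("oversight_match",
+                          partial(handle_scraped, registry_url="https://registry.example"))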
+
+Job inputs (dict):
+    {
+        "source_url": "https://breachforums.example/thread/12345",
+        "scraped_at": 1715000000,
+        "text": "<pasted leaked document text>",
+        "attachments": [
+            {"kind": "image", "bytes_hex": "...", "filename": "leaked.png"},
+            {"kind": "pdf", "bytes_hex": "...", "filename": "leaked.pdf"},
+            {"kind": "docx", "bytes_hex": "...", "filename": "leaked.docx"},
+        ],
+    }
+
+Job output (dict):
+    {
+        "matches": [
+            {"layer": "L1_zero_width", "mark_id": "...", "file_id": "...",
+             "recipient_id": "...", "issuer_id": "...", "score": 1.0},
+            {"layer": "L3_semantic", "mark_id": "...", "score": 0.89, ...},
+            {"layer": "image_DCT", "mark_id": "...", "score": 0.12, ...},
+            {"layer": "perceptual_hash", "hash": "...", "file_id": "...", ...},
+        ],
+        "scraped_at": 1715000000,
+        "source_url": "...",
+    }
+
+On match: raise a priority-1 alert through the Flywheel event bus so the
+`CanaryKeeper` Perseus agent can notify Zion via Discord.
+"""
+
+from __future__ import annotations
+
+import time
+from pathlib import Path
+from typing import Any, Optional
+
+import httpx
+
+# Add oversight_core to path - assumes Flywheel container has oversight/ available
+import sys
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from oversight_core import watermark, semantic
+from oversight_core.formats import image as img_fmt
+from oversight_core.formats import pdf as pdf_fmt
+from oversight_core.formats import docx as docx_fmt
+
+
+# ---------- registry client ----------
+
+class RegistryClient:
+    def __init__(self, url: str, timeout: float = 10.0):
+        self.url = url.rstrip("/")
+        self.client = httpx.Client(timeout=timeout)
+        self._cached_candidates: list[dict] = []
+        self._candidates_fetched_at: int = 0
+
+    def close(self):
+        self.client.close()
+
+    def attribute(self, **kwargs) -> dict:
+        """POST /attribute with any of token_id, mark_id, layer, perceptual_hash."""
+        r = self.client.post(f"{self.url}/attribute", json=kwargs)
+        r.raise_for_status()
+        return r.json()
+
+    def fetch_semantic_candidates(self, cache_ttl: int = 3600) -> list[dict]:
+        """Fetch L3 semantic candidate mark_ids (cached for cache_ttl seconds)."""
+        now = int(time.time())
+        if self._cached_candidates and now - self._candidates_fetched_at < cache_ttl:
+            return self._cached_candidates
+        r = self.client.get(f"{self.url}/candidates/semantic", params={"limit": 5000})
+        r.raise_for_status()
+        data = r.json()
+        self._cached_candidates = data["candidates"]
+        self._candidates_fetched_at = now
+        return self._cached_candidates
+
+
+# ---------- text layer extractors ----------
+
+def _check_text(text: str, registry: RegistryClient) -> list[dict]:
+    """
+    Run L1 / L2 / L3 extractors against leaked text.
+    L1 and L2 give direct mark_ids (look them up).
+    L3 requires iterating candidate mark_ids and verifying.
+    """
+    matches: list[dict] = []
+
+    # L1 - direct mark_id hit
+    for m in watermark.extract_zw(text):
+        r = registry.attribute(mark_id=m.hex(), layer="L1_zero_width")
+        if r.get("found"):
+            matches.append({"layer": "L1_zero_width", "score": 1.0, **r})
+
+    # L2 - direct mark_id hit
+    l2 = watermark.extract_ws(text)
+    if l2:
+        r = registry.attribute(mark_id=l2.hex(), layer="L2_whitespace")
+        if r.get("found"):
+            matches.append({"layer": "L2_whitespace", "score": 1.0, **r})
+
+    # L3 - verify against every candidate mark_id (probabilistic)
+    candidates = registry.fetch_semantic_candidates()
+    for cand in candidates:
+        mark_bytes = bytes.fromhex(cand["mark_id"])
+        result = semantic.verify_semantic(text, mark_bytes)
+        if result["overall_match"]:
+            r = registry.attribute(mark_id=cand["mark_id"], layer="L3_semantic")
+            if r.get("found"):
+                matches.append({
+                    "layer": "L3_semantic",
+                    "score": result["synonyms_score"],
+                    "punct_score": result["punctuation_score"],
+                    **r,
+                })
+    return matches
+
+
+# ---------- image layer ----------
+
+def _check_image(image_bytes: bytes, registry: RegistryClient) -> list[dict]:
+    """Perceptual-hash lookup. DCT watermark verification is not implemented yet."""
+    matches: list[dict] = []
+
+    # Perceptual hash - fast lookup (exact match on the phash string, not fuzzy)
+    try:
+        phash = img_fmt.perceptual_hash(image_bytes)
+        r = registry.attribute(perceptual_hash=phash)
+        if r.get("found"):
+            matches.append({"layer": "perceptual_hash", "hash": phash, "score": 1.0, **r})
+    except Exception:
+        pass
+
+    # DCT verify - requires candidate list to know which marks to test.
+    # For MVP: iterate every known L4_image_dct mark (TODO: layer tag in registry)
+    # Skipped for now - the perceptual hash usually suffices for fast triage.
+    return matches
+
+
+# ---------- PDF / DOCX ----------
+
+def _check_pdf(pdf_bytes: bytes, registry: RegistryClient) -> list[dict]:
+    matches: list[dict] = []
+    # Metadata-level mark
+    ext = pdf_fmt.extract(pdf_bytes)
+    if ext.get("mark_id"):
+        r = registry.attribute(mark_id=ext["mark_id"])
+        if r.get("found"):
+            matches.append({"layer": "pdf_metadata", "score": 1.0, **r})
+    # Body-text extraction → run L1/L2/L3 on recovered text
+    try:
+        body_text = pdf_fmt.extract_text_for_watermark_recovery(pdf_bytes)
+        matches.extend(_check_text(body_text, registry))
+    except Exception:
+        pass
+    return matches
+
+
+def _check_docx(docx_bytes: bytes, registry: RegistryClient) -> list[dict]:
+    matches: list[dict] = []
+    ext = docx_fmt.extract(docx_bytes)
+    if ext.get("mark_id"):
+        r = registry.attribute(mark_id=ext["mark_id"])
+        if r.get("found"):
+            matches.append({"layer": "docx_metadata", "score": 1.0, **r})
+    try:
+        body_text = docx_fmt.extract_text_for_watermark_recovery(docx_bytes)
+        matches.extend(_check_text(body_text, registry))
+    except Exception:
+        pass
+    return matches
+
+
+# ---------- top-level handler ----------
+
+def handle_scraped(job_input: dict, registry_url: str) -> dict:
+    """
+    Flywheel job entrypoint. Processes one scraped blob and returns
+    a list of OVERSIGHT attribution matches (empty if nothing matches).
+    """
+    registry = RegistryClient(registry_url)
+    try:
+        all_matches: list[dict] = []
+
+        # Text body
+        text = job_input.get("text", "") or ""
+        if text:
+            all_matches.extend(_check_text(text, registry))
+
+        # Attachments
+        for att in job_input.get("attachments", []):
+            kind = att.get("kind")
+            raw = att.get("bytes_hex")
+            if not raw:
+                continue
+            blob = bytes.fromhex(raw)
+            if kind == "image":
+                all_matches.extend(_check_image(blob, registry))
+            elif kind == "pdf":
+                all_matches.extend(_check_pdf(blob, registry))
+            elif kind == "docx":
+                all_matches.extend(_check_docx(blob, registry))
+
+        # Deduplicate by (layer, file_id)
+        seen = set()
+        unique: list[dict] = []
+        for m in all_matches:
+            key = (m.get("layer"), m.get("file_id"))
+            if key not in seen:
+                seen.add(key)
+                unique.append(m)
+
+        return {
+            "matches": unique,
+            "scraped_at": job_input.get("scraped_at"),
+            "source_url": job_input.get("source_url"),
+        }
+    finally:
+        registry.close()
+
+
+# ---------- quick standalone test ----------
+
+if __name__ == "__main__":
+    import argparse
+    import json
+
+    p = argparse.ArgumentParser()
+    p.add_argument("--registry", required=True)
+    p.add_argument("--text", default="")
+    p.add_argument("--url", default="(cli test)")
+    args = p.parse_args()
+
+    job = {
+        "source_url": args.url,
+        "scraped_at": int(time.time()),
+        "text": args.text,
+        "attachments": [],
+    }
+    print(json.dumps(handle_scraped(job, args.registry), indent=2))
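The module docstring describes the job input as a dict whose attachments carry hex-encoded bytes, while the standalone test above only ever sends text. A minimal sketch of building such a job dict from raw attachment bytes follows. `build_job` and its tuple-based attachment argument are hypothetical helpers introduced for illustration, not part of the module; only the dict shape is taken from the docstring.

```python
import json
import time

# Attachment kinds that handle_scraped dispatches on, per the module docstring.
ALLOWED_KINDS = {"image", "pdf", "docx"}


def build_job(source_url, text="", attachments=()):
    """Build an oversight_match job dict.

    attachments: iterable of (kind, filename, raw_bytes) tuples
    (a hypothetical convention chosen for this sketch).
    """
    items = []
    for kind, filename, raw in attachments:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unsupported attachment kind: {kind!r}")
        items.append({
            "kind": kind,
            "bytes_hex": raw.hex(),  # decoded again via bytes.fromhex()
            "filename": filename,
        })
    return {
        "source_url": source_url,
        "scraped_at": int(time.time()),
        "text": text,
        "attachments": items,
    }


if __name__ == "__main__":
    job = build_job(
        "https://example.invalid/thread/1",
        text="pasted text",
        attachments=[("pdf", "leaked.pdf", b"%PDF-1.7")],
    )
    print(json.dumps(job, indent=2))
```

Rejecting unknown kinds up front is a design choice of this sketch; `handle_scraped` itself silently ignores kinds it does not recognise.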
integrations/perseus_canarykeeper.py +293 -0
@@ -0,0 +1,293 @@
+"""
+CanaryKeeper - OVERSIGHT-attribution → Discord-alert agent for Perseus.
+
+Role: sole owner of the "trap recipient" identities (decoy file recipient
+keys), and sole escalation path for OVERSIGHT attribution hits. Runs as a
+Perseus agent alongside Grok / DMCA Shield / etc.
+
+Responsibilities:
+    1. Poll the registry's tlog for new beacon events (any kind: http_img, dns,
+       ocsp, license). A beacon fire = a sealed file was opened somewhere.
+    2. For each event, pull the signed evidence bundle for the file_id.
+    3. Verify the bundle's registry Ed25519 signature against the pinned
+       well-known pubkey (no blind trust).
+    4. Classify: is this a decoy file (trap), a real-recipient file, or unknown?
+    5. For trap hits → DM Zion on Discord immediately (P1).
+    6. For real-recipient hits → P3 today; P2 for unexpected geography/time is planned.
+    7. For Flywheel-discovered leaks → P1.
+
+Trap recipient storage:
+    Keys stay encrypted at rest under a Perseus Vault master key.
+    Only CanaryKeeper has the decrypt role - not the main brain, not DMCA Shield.
+
+Usage:
+    python -m integrations.perseus_canarykeeper \\
+        --registry https://beacon.example.com \\
+        --pinned-key <hex> \\
+        --discord-webhook https://discord.com/api/webhooks/... \\
+        --owner-id 682818191990587393 \\
+        --poll-interval 60
+
+Config can also come from env vars:
+ OVERSIGHT_REGISTRY_URL, OVERSIGHT_PINNED_KEY, DISCORD_WEBHOOK, OWNER_DISCORD_ID
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Optional
+
+import httpx
+from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
+from cryptography.exceptions import InvalidSignature
+
+
+log = logging.getLogger("canarykeeper")
+
+STATE_PATH = Path(
+    os.environ.get("CANARYKEEPER_STATE", "/var/lib/canarykeeper/state.json")
+)
+
+
+# --------- state ----------
+
+def load_state() -> dict:
+    if not STATE_PATH.exists():
+        return {
+            "last_tlog_seen": 0,
+            "known_file_ids": [],
+            "trap_file_ids": [],
+        }
+    try:
+        return json.loads(STATE_PATH.read_text())
+    except (ValueError, OSError):
+        return {"last_tlog_seen": 0, "known_file_ids": [], "trap_file_ids": []}
+
+
+def save_state(state: dict):
+    STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
+    tmp = STATE_PATH.with_suffix(".tmp")
+    tmp.write_text(json.dumps(state, indent=2))
+    tmp.replace(STATE_PATH)
+
+
+# --------- registry client ----------
+
+class RegistryMonitor:
+    def __init__(self, url: str, pinned_pubkey_hex: str):
+        self.url = url.rstrip("/")
+        self.pinned_pub = Ed25519PublicKey.from_public_bytes(
+            bytes.fromhex(pinned_pubkey_hex)
+        )
+        self.client = httpx.Client(timeout=15.0)
+
+    def close(self):
+        self.client.close()
+
+    def tlog_head(self) -> dict:
+        r = self.client.get(f"{self.url}/tlog/head")
+        r.raise_for_status()
+        head = r.json()
+        # Verify the signature against the pinned key
+        sig = bytes.fromhex(head["signature"])
+        msg = head["signed_message"].encode("utf-8")
+        try:
+            self.pinned_pub.verify(sig, msg)
+        except InvalidSignature:
+            raise RuntimeError(
+                "registry /tlog/head signature does not verify under pinned key! "
+                "possible tampering or key rotation - refusing to proceed"
+            )
+        return head
+
+    def evidence_bundle(self, file_id: str) -> Optional[dict]:
+        try:
+            r = self.client.get(f"{self.url}/evidence/{file_id}")
+            if r.status_code == 404:
+                return None
+            r.raise_for_status()
+            bundle = r.json()
+        except httpx.HTTPError as e:
+            log.warning(f"evidence fetch failed for {file_id}: {e}")
+            return None
+        # Verify bundle signature
+        sig_hex = bundle.pop("bundle_signature_ed25519", None)
+        if not sig_hex:
+            log.warning(f"bundle for {file_id} has no signature")
+            return None
+        msg = json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")
+        try:
+            self.pinned_pub.verify(bytes.fromhex(sig_hex), msg)
+        except InvalidSignature:
+            log.error(f"bundle signature invalid for {file_id} - IGNORING")
+            return None
+        bundle["bundle_signature_ed25519"] = sig_hex  # restore
+        return bundle
+
+    def raw_tlog_entries(self, start_index: int) -> list[dict]:
+        """Fetch up to 500 raw tlog leaves starting at start_index via /tlog/range.
+        Returns an empty list if the request fails (there is no whole-log fallback)."""
+        try:
+            r = self.client.get(
+                f"{self.url}/tlog/range",
+                params={"start": start_index, "limit": 500},
+            )
+            r.raise_for_status()
+            return r.json().get("entries", [])
+        except httpx.HTTPError:
+            # No fallback yet: the registry may not expose /tlog/range, so report nothing new
+            return []
+
+
+# --------- Discord notifier ----------
+
+class DiscordNotifier:
+    def __init__(self, webhook_url: str, owner_id: str):
+        self.webhook = webhook_url
+        self.owner_id = owner_id
+        self.client = httpx.Client(timeout=10.0)
+
+    def close(self):
+        self.client.close()
+
+    def alert(self, priority: str, title: str, body: str):
+        """Post an alert to Discord. Priority = P1/P2/P3."""
+        colors = {"P1": 0xFF0000, "P2": 0xFF9900, "P3": 0xFFFF00}
+        mention = f"<@{self.owner_id}>" if priority == "P1" else ""
+        payload = {
+            "content": mention,
+            "embeds": [{
+                "title": f"[{priority}] {title}",
+                "description": body[:4000],
+                "color": colors.get(priority, 0x0099FF),
+                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+                "footer": {"text": "OVERSIGHT CanaryKeeper"},
+            }],
+        }
+        try:
+            r = self.client.post(self.webhook, json=payload)
+            r.raise_for_status()
+        except httpx.HTTPError as e:
+            log.error(f"Discord alert failed: {e}")
+
+
+# --------- main loop ----------
+
+def process_event(event: dict, state: dict, registry: RegistryMonitor,
+                  notifier: DiscordNotifier):
+    """Classify a single tlog event and escalate if it's interesting."""
+    kind = event.get("event")
+    if kind != "beacon":
+        return  # registrations and attribution queries are log-only
+
+    file_id = event.get("file_id")
+    if not file_id:
+        return
+
+    is_trap = file_id in state.get("trap_file_ids", [])
+    beacon_kind = event.get("kind", "unknown")
+    source_ip = event.get("source_ip") or "unknown"
+
+    if is_trap:
+        # Trap beacon fire = intruder. Always P1.
+        title = f"TRAP FILE OPENED: {file_id[:8]}..."
+        body = (
+            f"A decoy file's beacon fired. This is a high-confidence intrusion signal.\n"
+            f"• beacon kind: `{beacon_kind}`\n"
+            f"• source IP: `{source_ip}`\n"
+            f"• file_id: `{file_id}`\n"
+            f"• timestamp: `{event.get('timestamp', '?')}`\n\n"
+            f"Action: investigate source IP, pull evidence bundle, consider containment."
+        )
+        notifier.alert("P1", title, body)
+    else:
+        # Real file beacon. P3 for now; upgrade to P2 if it has suspicious features
+        # (source IP geolocation, unusual time, etc. - future work).
+        title = f"Real file beacon: {file_id[:8]}..."
+        body = (
+            f"A legitimate sealed file's beacon fired (expected behavior on open).\n"
+            f"• kind: `{beacon_kind}`, source: `{source_ip}`, "
+            f"recipient: `{event.get('recipient_id', '?')}`"
+        )
+        notifier.alert("P3", title, body)
+
+
+def run_once(state: dict, registry: RegistryMonitor, notifier: DiscordNotifier):
+    """One polling cycle. Fetches new tlog entries and processes each."""
+    try:
+        head = registry.tlog_head()
+    except RuntimeError as e:
+        notifier.alert("P1", "Registry signature check FAILED", str(e))
+        raise
+    except httpx.HTTPError as e:
+        log.warning(f"registry unreachable: {e}")
+        return state
+
+    new_size = head["size"]
+    old_seen = state.get("last_tlog_seen", 0)
+    if new_size <= old_seen:
+        return state  # no new entries
+
+    new_entries = registry.raw_tlog_entries(old_seen)
+    for entry in new_entries:
+        try:
+            event = json.loads(entry.get("leaf_data", "{}"))
+            process_event(event, state, registry, notifier)
+        except Exception as e:
+            log.error(f"event processing failed: {e}")
+
+    # Advance only past entries actually fetched. raw_tlog_entries returns at most
+    # 500 leaves per call and [] on HTTP errors, so jumping straight to new_size
+    # would silently skip unread events. Assumes entries are contiguous from old_seen.
+    state["last_tlog_seen"] = old_seen + len(new_entries)
+    save_state(state)
+    return state
+
+
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--registry", default=os.environ.get("OVERSIGHT_REGISTRY_URL"))
+    p.add_argument("--pinned-key", default=os.environ.get("OVERSIGHT_PINNED_KEY"))
+    p.add_argument("--discord-webhook", default=os.environ.get("DISCORD_WEBHOOK"))
+    p.add_argument("--owner-id", default=os.environ.get("OWNER_DISCORD_ID", "682818191990587393"))
+    p.add_argument("--poll-interval", type=int, default=60)
+    p.add_argument("--log-level", default="INFO")
+    args = p.parse_args()
+
+    if not args.registry or not args.pinned_key or not args.discord_webhook:
+        print("Missing required config: --registry, --pinned-key, --discord-webhook")
+        sys.exit(2)
+
+    logging.basicConfig(level=args.log_level,
+                        format="%(asctime)s %(levelname)s %(name)s %(message)s")
+
+    registry = RegistryMonitor(args.registry, args.pinned_key)
+    notifier = DiscordNotifier(args.discord_webhook, args.owner_id)
+    state = load_state()
+
+    log.info(f"CanaryKeeper starting (registry={args.registry}, poll={args.poll_interval}s)")
+    log.info(f" tracking {len(state.get('trap_file_ids', []))} trap files")
+    log.info(f" last tlog seen: {state.get('last_tlog_seen', 0)}")
+
+    notifier.alert("P3", "CanaryKeeper online",
+                   f"Monitoring {args.registry}, polling every {args.poll_interval}s.")
+
+    try:
+        while True:
+            try:
+                state = run_once(state, registry, notifier)
+            except Exception as e:
+                log.exception(f"poll cycle error: {e}")
+            time.sleep(args.poll_interval)
+    except KeyboardInterrupt:
+        log.info("shutting down")
+    finally:
+        registry.close()
+        notifier.close()
+
+
+if __name__ == "__main__":
+    main()
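`run_once` requests new tlog entries through `raw_tlog_entries`, which asks `/tlog/range` for at most 500 leaves per call. When the backlog is larger than one page, a poller has to keep requesting pages and move its cursor only past what it actually received. The sketch below illustrates that pattern in isolation. `drain_log`, `fetch_page`, and `handle` are hypothetical names, and the in-memory list stands in for the registry; it is not CanaryKeeper's code.

```python
from typing import Callable, List


def drain_log(cursor: int, head_size: int,
              fetch_page: Callable[[int, int], List[dict]],
              handle: Callable[[dict], None],
              page_limit: int = 500) -> int:
    """Process entries from `cursor` up to `head_size`, one page at a time.

    Returns the new cursor. The cursor only moves past entries that were
    really fetched, so an empty page (endpoint missing, transient error)
    leaves it in place and the next cycle retries from the same index.
    """
    while cursor < head_size:
        page = fetch_page(cursor, page_limit)
        if not page:
            break  # nothing fetched: do not pretend these entries were seen
        for entry in page:
            handle(entry)
        cursor += len(page)
    return cursor


if __name__ == "__main__":
    # Fake log with more entries than fit in one page.
    fake_log = [{"index": i, "event": "beacon"} for i in range(1234)]

    def fetch_page(start: int, limit: int) -> List[dict]:
        return fake_log[start:start + limit]

    seen: List[dict] = []
    new_cursor = drain_log(0, len(fake_log), fetch_page, seen.append)
    print("cursor:", new_cursor, "processed:", len(seen))
```

Persisting the returned cursor after each cycle keeps the poller restart-safe, at the cost of possibly re-processing a page if the process dies between handling and saving.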
oversight-rust/Cargo.lock +1258 -0
@@ -0,0 +1,1258 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 3
+
+[[package]]
+name = "aead"
+version = "0.5.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d122413f284cf2d62fb1b7db97e02edb8cda96d769b16e443a4f6195e35662b0"
+dependencies = [
+ "crypto-common",
+ "generic-array",
+]
+
+[[package]]
+name = "aho-corasick"
+version = "1.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ddd31a130427c27518df266943a5308ed92d4b226cc639f5a8f1002816174301"
+dependencies = [
+ "memchr",
+]
+
+[[package]]
+name = "anstream"
+version = "1.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "824a212faf96e9acacdbd09febd34438f8f711fb84e09a8916013cd7815ca28d"
+dependencies = [
+ "anstyle",
+ "anstyle-parse",
+ "anstyle-query",
+ "anstyle-wincon",
+ "colorchoice",
+ "is_terminal_polyfill",
+ "utf8parse",
+]
+
+[[package]]
+name = "anstyle"
+version = "1.0.14"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "940b3a0ca603d1eade50a4846a2afffd5ef57a9feac2c0e2ec2e14f9ead76000"
+
+[[package]]
+name = "anstyle-parse"
+version = "1.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "52ce7f38b242319f7cabaa6813055467063ecdc9d355bbb4ce0c68908cd8130e"
+dependencies = [
+ "utf8parse",
+]
+
+[[package]]
+name = "anstyle-query"
+version = "1.1.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "40c48f72fd53cd289104fc64099abca73db4166ad86ea0b4341abe65af83dadc"
+dependencies = [
+ "windows-sys",
+]
+
+[[package]]
+name = "anstyle-wincon"
+version = "3.0.11"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "291e6a250ff86cd4a820112fb8898808a366d8f9f58ce16d1f538353ad55747d"
+dependencies = [
+ "anstyle",
+ "once_cell_polyfill",
+ "windows-sys",
+]
+
+[[package]]
+name = "anyhow"
+version = "1.0.102"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c"
+
+[[package]]
+name = "base64ct"
+version = "1.8.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2af50177e190e07a26ab74f8b1efbfe2ef87da2116221318cb1c2e82baf7de06"
+
+[[package]]
+name = "bitflags"
+version = "2.11.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c4512299f36f043ab09a583e57bceb5a5aab7a73db1805848e8fef3c9e8c78b3"
+
+[[package]]
+name = "block-buffer"
+version = "0.10.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3078c7629b62d3f0439517fa394996acacc5cbc91c5a20d8c658e77abd503a71"
+dependencies = [
+ "generic-array",
+]
+
+[[package]]
+name = "bumpalo"
+version = "3.20.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5d20789868f4b01b2f2caec9f5c4e0213b41e3e5702a50157d699ae31ced2fcb"
+
+[[package]]
+name = "cfg-if"
+version = "1.0.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801"
+
+[[package]]
+name = "chacha20"
+version = "0.9.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c3613f74bd2eac03dad61bd53dbe620703d4371614fe0bc3b9f04dd36fe4e818"
+dependencies = [
+ "cfg-if",
+ "cipher",
+ "cpufeatures",
+]
+
+[[package]]
+name = "chacha20poly1305"
+version = "0.10.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "10cd79432192d1c0f4e1a0fef9527696cc039165d729fb41b3f4f4f354c2dc35"
+dependencies = [
+ "aead",
+ "chacha20",
+ "cipher",
+ "poly1305",
+ "zeroize",
+]
+
+[[package]]
+name = "cipher"
+version = "0.4.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "773f3b9af64447d2ce9850330c473515014aa235e6a783b02db81ff39e4a3dad"
+dependencies = [
+ "crypto-common",
+ "inout",
+ "zeroize",
+]
+
+[[package]]
+name = "clap"
+version = "4.6.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1ddb117e43bbf7dacf0a4190fef4d345b9bad68dfc649cb349e7d17d28428e51"
+dependencies = [
+ "clap_builder",
+ "clap_derive",
+]
+
+[[package]]
+name = "clap_builder"
+version = "4.6.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "714a53001bf66416adb0e2ef5ac857140e7dc3a0c48fb28b2f10762fc4b5069f"
+dependencies = [
+ "anstream",
+ "anstyle",
+ "clap_lex",
+ "strsim",
+]
+
+[[package]]
+name = "clap_derive"
+version = "4.6.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f2ce8604710f6733aa641a2b3731eaa1e8b3d9973d5e3565da11800813f997a9"
+dependencies = [
+ "heck",
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "clap_lex"
+version = "1.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c8d4a3bb8b1e0c1050499d1815f5ab16d04f0959b233085fb31653fbfc9d98f9"
+
+[[package]]
+name = "colorchoice"
+version = "1.0.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1d07550c9036bf2ae0c684c4297d503f838287c83c53686d05370d0e139ae570"
+
+[[package]]
+name = "const-oid"
+version = "0.9.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c2459377285ad874054d797f3ccebf984978aa39129f6eafde5cdc8315b612f8"
+
+[[package]]
+name = "cpufeatures"
+version = "0.2.17"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "59ed5838eebb26a2bb2e58f6d5b5316989ae9d08bab10e0e6d103e656d1b0280"
+dependencies = [
+ "libc",
+]
+
+[[package]]
+name = "crypto-common"
+version = "0.1.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "78c8292055d1c1df0cce5d180393dc8cce0abec0a7102adb6c7b1eef6016d60a"
+dependencies = [
+ "generic-array",
+ "rand_core",
+ "typenum",
+]
+
+[[package]]
+name = "curve25519-dalek"
+version = "4.1.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "97fb8b7c4503de7d6ae7b42ab72a5a59857b4c937ec27a3d4539dba95b5ab2be"
+dependencies = [
+ "cfg-if",
+ "cpufeatures",
+ "curve25519-dalek-derive",
+ "digest",
+ "fiat-crypto",
+ "rustc_version",
+ "subtle",
+ "zeroize",
+]
+
+[[package]]
+name = "curve25519-dalek-derive"
+version = "0.1.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f46882e17999c6cc590af592290432be3bce0428cb0d5f8b6715e4dc7b383eb3"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "der"
+version = "0.7.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e7c1832837b905bbfb5101e07cc24c8deddf52f93225eee6ead5f4d63d53ddcb"
+dependencies = [
+ "const-oid",
+ "zeroize",
+]
+
+[[package]]
+name = "digest"
+version = "0.10.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292"
+dependencies = [
+ "block-buffer",
+ "crypto-common",
+ "subtle",
+]
+
+[[package]]
+name = "ed25519"
+version = "2.2.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "115531babc129696a58c64a4fef0a8bf9e9698629fb97e9e40767d235cfbcd53"
+dependencies = [
+ "pkcs8",
+ "signature",
+]
+
+[[package]]
+name = "ed25519-dalek"
+version = "2.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "70e796c081cee67dc755e1a36a0a172b897fab85fc3f6bc48307991f64e4eca9"
+dependencies = [
+ "curve25519-dalek",
+ "ed25519",
+ "rand_core",
+ "serde",
+ "sha2",
+ "subtle",
+ "zeroize",
+]
+
+[[package]]
+name = "equivalent"
+version = "1.0.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f"
+
+[[package]]
+name = "errno"
+version = "0.3.14"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb"
+dependencies = [
+ "libc",
+ "windows-sys",
+]
+
+[[package]]
+name = "fastrand"
+version = "2.4.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9f1f227452a390804cdb637b74a86990f2a7d7ba4b7d5693aac9b4dd6defd8d6"
+
+[[package]]
+name = "fiat-crypto"
+version = "0.2.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "28dea519a9695b9977216879a3ebfddf92f1c08c05d984f8996aecd6ecdc811d"
+
+[[package]]
+name = "foldhash"
+version = "0.1.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2"
+
+[[package]]
+name = "fs2"
+version = "0.4.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9564fc758e15025b46aa6643b1b77d047d1a56a1aea6e01002ac0c7026876213"
+dependencies = [
+ "libc",
+ "winapi",
+]
+
+[[package]]
+name = "generic-array"
+version = "0.14.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "85649ca51fd72272d7821adaf274ad91c288277713d9c18820d8499a7ff69e9a"
+dependencies = [
+ "typenum",
+ "version_check",
+]
+
+[[package]]
+name = "getrandom"
+version = "0.2.17"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ff2abc00be7fca6ebc474524697ae276ad847ad0a6b3faa4bcb027e9a4614ad0"
+dependencies = [
+ "cfg-if",
+ "libc",
+ "wasi",
+]
+
+[[package]]
+name = "getrandom"
+version = "0.4.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0de51e6874e94e7bf76d726fc5d13ba782deca734ff60d5bb2fb2607c7406555"
+dependencies = [
+ "cfg-if",
+ "libc",
+ "r-efi",
+ "wasip2",
+ "wasip3",
+]
+
+[[package]]
+name = "hashbrown"
+version = "0.15.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1"
+dependencies = [
+ "foldhash",
+]
+
+[[package]]
+name = "hashbrown"
+version = "0.17.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4f467dd6dccf739c208452f8014c75c18bb8301b050ad1cfb27153803edb0f51"
+
+[[package]]
+name = "heck"
+version = "0.5.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea"
+
+[[package]]
+name = "hex"
+version = "0.4.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7f24254aa9a54b5c858eaee2f5bccdb46aaf0e486a595ed5fd8f86ba55232a70"
+
+[[package]]
+name = "hkdf"
+version = "0.12.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7b5f8eb2ad728638ea2c7d47a21db23b7b58a72ed6a38256b8a1849f15fbbdf7"
+dependencies = [
+ "hmac",
+]
+
+[[package]]
+name = "hmac"
+version = "0.12.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6c49c37c09c17a53d937dfbb742eb3a961d65a994e6bcdcf37e7399d0cc8ab5e"
+dependencies = [
+ "digest",
+]
+
+[[package]]
+name = "id-arena"
+version = "2.3.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3d3067d79b975e8844ca9eb072e16b31c3c1c36928edf9c6789548c524d0d954"
+
+[[package]]
+name = "indexmap"
+version = "2.14.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d466e9454f08e4a911e14806c24e16fba1b4c121d1ea474396f396069cf949d9"
+dependencies = [
+ "equivalent",
+ "hashbrown 0.17.0",
+ "serde",
+ "serde_core",
+]
+
+[[package]]
+name = "inout"
+version = "0.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "879f10e63c20629ecabbb64a8010319738c66a5cd0c29b02d63d272b03751d01"
+dependencies = [
+ "generic-array",
+]
+
+[[package]]
+name = "is_terminal_polyfill"
+version = "1.70.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a6cb138bb79a146c1bd460005623e142ef0181e3d0219cb493e02f7d08a35695"
+
+[[package]]
+name = "itoa"
+version = "1.0.18"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8f42a60cbdf9a97f5d2305f08a87dc4e09308d1276d28c869c684d7777685682"
+
+[[package]]
+name = "js-sys"
+version = "0.3.95"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "2964e92d1d9dc3364cae4d718d93f227e3abb088e747d92e0395bfdedf1c12ca"
+dependencies = [
+ "once_cell",
+ "wasm-bindgen",
+]
+
+[[package]]
+name = "leb128fmt"
+version = "0.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "09edd9e8b54e49e587e4f6295a7d29c3ea94d469cb40ab8ca70b288248a81db2"
+
+[[package]]
+name = "libc"
+version = "0.2.185"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "52ff2c0fe9bc6cb6b14a0592c2ff4fa9ceb83eea9db979b0487cd054946a2b8f"
+
+[[package]]
+name = "linux-raw-sys"
+version = "0.12.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "32a66949e030da00e8c7d4434b251670a91556f4144941d37452769c25d58a53"
+
+[[package]]
+name = "log"
+version = "0.4.29"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897"
+
+[[package]]
+name = "memchr"
+version = "2.8.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79"
+
+[[package]]
+name = "once_cell"
+version = "1.21.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50"
+
+[[package]]
+name = "once_cell_polyfill"
+version = "1.70.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "384b8ab6d37215f3c5301a95a4accb5d64aa607f1fcb26a11b5303878451b4fe"
+
+[[package]]
+name = "opaque-debug"
+version = "0.3.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c08d65885ee38876c4f86fa503fb49d7b507c2b62552df7c70b2fce627e06381"
+
+[[package]]
+name = "oversight-cli"
+version = "0.4.1"
+dependencies = [
+ "clap",
+ "hex",
+ "oversight-container",
+ "oversight-crypto",
+ "oversight-manifest",
+ "oversight-watermark",
+ "serde",
+ "serde_json",
+ "thiserror",
+]
+
+[[package]]
+name = "oversight-container"
+version = "0.4.1"
+dependencies = [
+ "hex",
+ "oversight-crypto",
+ "oversight-manifest",
+ "serde",
+ "serde_json",
+ "thiserror",
+]
+
+[[package]]
+name = "oversight-crypto"
+version = "0.4.1"
+dependencies = [
+ "chacha20poly1305",
+ "ed25519-dalek",
+ "hex",
+ "hkdf",
+ "rand",
+ "rand_core",
+ "serde_json",
+ "sha2",
+ "thiserror",
+ "x25519-dalek",
+ "zeroize",
+]
+
+[[package]]
+name = "oversight-manifest"
+version = "0.4.1"
+dependencies = [
+ "hex",
+ "oversight-crypto",
+ "serde",
+ "serde_jcs",
+ "serde_json",
+ "thiserror",
+ "uuid",
+]
+
+[[package]]
+name = "oversight-policy"
+version = "0.4.1"
+dependencies = [
+ "fs2",
+ "oversight-manifest",
+ "serde",
+ "serde_json",
+ "tempfile",
+ "thiserror",
+]
+
+[[package]]
+name = "oversight-semantic"
+version = "0.4.1"
+dependencies = [
+ "once_cell",
+ "regex",
+ "sha2",
+]
+
+[[package]]
+name = "oversight-tlog"
+version = "0.4.1"
+dependencies = [
+ "ed25519-dalek",
+ "hex",
+ "oversight-crypto",
+ "serde",
+ "serde_jcs",
+ "serde_json",
+ "sha2",
+ "tempfile",
+ "thiserror",
+]
+
+[[package]]
+name = "oversight-watermark"
+version = "0.4.1"
+dependencies = [
+ "rand",
+]
+
+[[package]]
+name = "pkcs8"
+version = "0.10.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f950b2377845cebe5cf8b5165cb3cc1a5e0fa5cfa3e1f7f55707d8fd82e0a7b7"
+dependencies = [
+ "der",
+ "spki",
+]
+
+[[package]]
+name = "poly1305"
+version = "0.8.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8159bd90725d2df49889a078b54f4f79e87f1f8a8444194cdca81d38f5393abf"
+dependencies = [
+ "cpufeatures",
+ "opaque-debug",
+ "universal-hash",
+]
+
+[[package]]
+name = "ppv-lite86"
+version = "0.2.21"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9"
+dependencies = [
+ "zerocopy",
+]
+
+[[package]]
+name = "prettyplease"
+version = "0.2.37"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "479ca8adacdd7ce8f1fb39ce9ecccbfe93a3f1344b3d0d97f20bc0196208f62b"
+dependencies = [
+ "proc-macro2",
+ "syn",
+]
+
+[[package]]
+name = "proc-macro2"
+version = "1.0.106"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
+dependencies = [
+ "unicode-ident",
+]
+
+[[package]]
+name = "quote"
+version = "1.0.45"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924"
+dependencies = [
+ "proc-macro2",
+]
+
+[[package]]
+name = "r-efi"
+version = "6.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f8dcc9c7d52a811697d2151c701e0d08956f92b0e24136cf4cf27b57a6a0d9bf"
+
+[[package]]
+name = "rand"
+version = "0.8.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5ca0ecfa931c29007047d1bc58e623ab12e5590e8c7cc53200d5202b69266d8a"
+dependencies = [
+ "libc",
+ "rand_chacha",
+ "rand_core",
+]
+
+[[package]]
+name = "rand_chacha"
+version = "0.3.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88"
+dependencies = [
+ "ppv-lite86",
+ "rand_core",
+]
+
+[[package]]
+name = "rand_core"
+version = "0.6.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c"
+dependencies = [
+ "getrandom 0.2.17",
+]
+
+[[package]]
+name = "regex"
+version = "1.12.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e10754a14b9137dd7b1e3e5b0493cc9171fdd105e0ab477f51b72e7f3ac0e276"
+dependencies = [
+ "aho-corasick",
+ "memchr",
+ "regex-automata",
+ "regex-syntax",
+]
+
+[[package]]
+name = "regex-automata"
+version = "0.4.14"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6e1dd4122fc1595e8162618945476892eefca7b88c52820e74af6262213cae8f"
+dependencies = [
+ "aho-corasick",
+ "memchr",
+ "regex-syntax",
+]
+
+[[package]]
+name = "regex-syntax"
+version = "0.8.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a"
+
+[[package]]
+name = "rustc_version"
+version = "0.4.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cfcb3a22ef46e85b45de6ee7e79d063319ebb6594faafcf1c225ea92ab6e9b92"
+dependencies = [
+ "semver",
+]
+
+[[package]]
+name = "rustix"
+version = "1.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b6fe4565b9518b83ef4f91bb47ce29620ca828bd32cb7e408f0062e9930ba190"
+dependencies = [
+ "bitflags",
+ "errno",
+ "libc",
+ "linux-raw-sys",
+ "windows-sys",
+]
+
+[[package]]
+name = "rustversion"
+version = "1.0.22"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b39cdef0fa800fc44525c84ccb54a029961a8215f9619753635a9c0d2538d46d"
+
+[[package]]
+name = "ryu-js"
+version = "0.2.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6518fc26bced4d53678a22d6e423e9d8716377def84545fe328236e3af070e7f"
+
+[[package]]
+name = "semver"
+version = "1.0.28"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8a7852d02fc848982e0c167ef163aaff9cd91dc640ba85e263cb1ce46fae51cd"
+
+[[package]]
+name = "serde"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e"
+dependencies = [
+ "serde_core",
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_core"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad"
+dependencies = [
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_derive"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "serde_jcs"
+version = "0.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cacecf649bc1a7c5f0e299cc813977c6a78116abda2b93b1ee01735b71ead9a8"
+dependencies = [
+ "ryu-js",
+ "serde",
+ "serde_json",
+]
+
+[[package]]
+name = "serde_json"
+version = "1.0.149"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "83fc039473c5595ace860d8c4fafa220ff474b3fc6bfdb4293327f1a37e94d86"
+dependencies = [
+ "itoa",
+ "memchr",
+ "serde",
+ "serde_core",
+ "zmij",
+]
+
+[[package]]
+name = "sha2"
+version = "0.10.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283"
+dependencies = [
+ "cfg-if",
+ "cpufeatures",
+ "digest",
+]
+
+[[package]]
+name = "signature"
+version = "2.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "77549399552de45a898a580c1b41d445bf730df867cc44e6c0233bbc4b8329de"
+dependencies = [
+ "rand_core",
+]
+
+[[package]]
+name = "spki"
+version = "0.7.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d91ed6c858b01f942cd56b37a94b3e0a1798290327d1236e4d9cf4eaca44d29d"
+dependencies = [
+ "base64ct",
+ "der",
+]
+
+[[package]]
+name = "strsim"
+version = "0.11.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f"
+
+[[package]]
+name = "subtle"
+version = "2.6.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "13c2bddecc57b384dee18652358fb23172facb8a2c51ccc10d74c157bdea3292"
+
+[[package]]
+name = "syn"
+version = "2.0.117"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "unicode-ident",
+]
+
+[[package]]
+name = "tempfile"
+version = "3.27.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "32497e9a4c7b38532efcdebeef879707aa9f794296a4f0244f6f69e9bc8574bd"
+dependencies = [
+ "fastrand",
+ "getrandom 0.4.2",
+ "once_cell",
+ "rustix",
+ "windows-sys",
+]
+
+[[package]]
+name = "thiserror"
+version = "1.0.69"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b6aaf5339b578ea85b50e080feb250a3e8ae8cfcdff9a461c9ec2904bc923f52"
+dependencies = [
+ "thiserror-impl",
+]
+
+[[package]]
+name = "thiserror-impl"
+version = "1.0.69"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "4fee6c4efc90059e10f81e6d42c60a18f76588c3d74cb83a0b242a2b6c7504c1"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "typenum"
+version = "1.19.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "562d481066bde0658276a35467c4af00bdc6ee726305698a55b86e61d7ad82bb"
+
+[[package]]
+name = "unicode-ident"
+version = "1.0.24"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
+
+[[package]]
+name = "unicode-xid"
+version = "0.2.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853"
+
+[[package]]
+name = "universal-hash"
+version = "0.5.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "fc1de2c688dc15305988b563c3854064043356019f97a4b46276fe734c4f07ea"
+dependencies = [
+ "crypto-common",
+ "subtle",
+]
+
+[[package]]
+name = "utf8parse"
+version = "0.2.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "06abde3611657adf66d383f00b093d7faecc7fa57071cce2578660c9f1010821"
+
+[[package]]
+name = "uuid"
+version = "1.23.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ddd74a9687298c6858e9b88ec8935ec45d22e8fd5e6394fa1bd4e99a87789c76"
+dependencies = [
+ "getrandom 0.4.2",
+ "js-sys",
+ "serde_core",
+ "wasm-bindgen",
+]
+
+[[package]]
+name = "version_check"
+version = "0.9.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a"
+
+[[package]]
+name = "wasi"
+version = "0.11.1+wasi-snapshot-preview1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b"
+
+[[package]]
+name = "wasip2"
+version = "1.0.3+wasi-0.2.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "20064672db26d7cdc89c7798c48a0fdfac8213434a1186e5ef29fd560ae223d6"
+dependencies = [
+ "wit-bindgen 0.57.1",
+]
+
+[[package]]
+name = "wasip3"
+version = "0.4.0+wasi-0.3.0-rc-2026-01-06"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5428f8bf88ea5ddc08faddef2ac4a67e390b88186c703ce6dbd955e1c145aca5"
+dependencies = [
+ "wit-bindgen 0.51.0",
+]
+
+[[package]]
+name = "wasm-bindgen"
+version = "0.2.118"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0bf938a0bacb0469e83c1e148908bd7d5a6010354cf4fb73279b7447422e3a89"
+dependencies = [
+ "cfg-if",
+ "once_cell",
+ "rustversion",
+ "wasm-bindgen-macro",
+ "wasm-bindgen-shared",
+]
+
+[[package]]
+name = "wasm-bindgen-macro"
+version = "0.2.118"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "eeff24f84126c0ec2db7a449f0c2ec963c6a49efe0698c4242929da037ca28ed"
+dependencies = [
+ "quote",
+ "wasm-bindgen-macro-support",
+]
+
+[[package]]
+name = "wasm-bindgen-macro-support"
+version = "0.2.118"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9d08065faf983b2b80a79fd87d8254c409281cf7de75fc4b773019824196c904"
+dependencies = [
+ "bumpalo",
+ "proc-macro2",
+ "quote",
+ "syn",
+ "wasm-bindgen-shared",
+]
+
+[[package]]
+name = "wasm-bindgen-shared"
+version = "0.2.118"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5fd04d9e306f1907bd13c6361b5c6bfc7b3b3c095ed3f8a9246390f8dbdee129"
+dependencies = [
+ "unicode-ident",
+]
+
+[[package]]
+name = "wasm-encoder"
+version = "0.244.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "990065f2fe63003fe337b932cfb5e3b80e0b4d0f5ff650e6985b1048f62c8319"
+dependencies = [
+ "leb128fmt",
+ "wasmparser",
+]
+
+[[package]]
+name = "wasm-metadata"
+version = "0.244.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "bb0e353e6a2fbdc176932bbaab493762eb1255a7900fe0fea1a2f96c296cc909"
+dependencies = [
+ "anyhow",
+ "indexmap",
+ "wasm-encoder",
+ "wasmparser",
+]
+
+[[package]]
+name = "wasmparser"
+version = "0.244.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "47b807c72e1bac69382b3a6fb3dbe8ea4c0ed87ff5629b8685ae6b9a611028fe"
+dependencies = [
+ "bitflags",
+ "hashbrown 0.15.5",
+ "indexmap",
+ "semver",
+]
+
+[[package]]
+name = "winapi"
+version = "0.3.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419"
+dependencies = [
+ "winapi-i686-pc-windows-gnu",
+ "winapi-x86_64-pc-windows-gnu",
+]
+
+[[package]]
+name = "winapi-i686-pc-windows-gnu"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6"
+
+[[package]]
+name = "winapi-x86_64-pc-windows-gnu"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f"
+
+[[package]]
+name = "windows-link"
+version = "0.2.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5"
+
+[[package]]
+name = "windows-sys"
+version = "0.61.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ae137229bcbd6cdf0f7b80a31df61766145077ddf49416a728b02cb3921ff3fc"
+dependencies = [
+ "windows-link",
+]
+
+[[package]]
+name = "wit-bindgen"
+version = "0.51.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d7249219f66ced02969388cf2bb044a09756a083d0fab1e566056b04d9fbcaa5"
+dependencies = [
+ "wit-bindgen-rust-macro",
+]
+
+[[package]]
+name = "wit-bindgen"
+version = "0.57.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1ebf944e87a7c253233ad6766e082e3cd714b5d03812acc24c318f549614536e"
+
+[[package]]
+name = "wit-bindgen-core"
+version = "0.51.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ea61de684c3ea68cb082b7a88508a8b27fcc8b797d738bfc99a82facf1d752dc"
+dependencies = [
+ "anyhow",
+ "heck",
+ "wit-parser",
+]
+
+[[package]]
+name = "wit-bindgen-rust"
+version = "0.51.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b7c566e0f4b284dd6561c786d9cb0142da491f46a9fbed79ea69cdad5db17f21"
+dependencies = [
+ "anyhow",
+ "heck",
+ "indexmap",
+ "prettyplease",
+ "syn",
+ "wasm-metadata",
+ "wit-bindgen-core",
+ "wit-component",
+]
+
+[[package]]
+name = "wit-bindgen-rust-macro"
+version = "0.51.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0c0f9bfd77e6a48eccf51359e3ae77140a7f50b1e2ebfe62422d8afdaffab17a"
+dependencies = [
+ "anyhow",
+ "prettyplease",
+ "proc-macro2",
+ "quote",
+ "syn",
+ "wit-bindgen-core",
+ "wit-bindgen-rust",
+]
+
+[[package]]
+name = "wit-component"
+version = "0.244.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9d66ea20e9553b30172b5e831994e35fbde2d165325bec84fc43dbf6f4eb9cb2"
+dependencies = [
+ "anyhow",
+ "bitflags",
+ "indexmap",
+ "log",
+ "serde",
+ "serde_derive",
+ "serde_json",
+ "wasm-encoder",
+ "wasm-metadata",
+ "wasmparser",
+ "wit-parser",
+]
+
+[[package]]
+name = "wit-parser"
+version = "0.244.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ecc8ac4bc1dc3381b7f59c34f00b67e18f910c2c0f50015669dde7def656a736"
+dependencies = [
+ "anyhow",
+ "id-arena",
+ "indexmap",
+ "log",
+ "semver",
+ "serde",
+ "serde_derive",
+ "serde_json",
+ "unicode-xid",
+ "wasmparser",
+]
+
+[[package]]
+name = "x25519-dalek"
+version = "2.0.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c7e468321c81fb07fa7f4c636c3972b9100f0346e5b6a9f2bd0603a52f7ed277"
+dependencies = [
+ "curve25519-dalek",
+ "rand_core",
+ "serde",
+ "zeroize",
+]
+
+[[package]]
+name = "zerocopy"
+version = "0.8.48"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "eed437bf9d6692032087e337407a86f04cd8d6a16a37199ed57949d415bd68e9"
+dependencies = [
+ "zerocopy-derive",
+]
+
+[[package]]
+name = "zerocopy-derive"
+version = "0.8.48"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "70e3cd084b1788766f53af483dd21f93881ff30d7320490ec3ef7526d203bad4"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "zeroize"
+version = "1.8.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b97154e67e32c85465826e8bcc1c59429aaaf107c1e4a9e53c8d8ccd5eff88d0"
+dependencies = [
+ "zeroize_derive",
+]
+
+[[package]]
+name = "zeroize_derive"
+version = "1.4.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "85a5b4158499876c763cb03bc4e49185d3cccbabb15b33c627f7884f43db852e"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "zmij"
+version = "1.0.21"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa"
oversight-rust/Cargo.toml +54 -0
@@ -0,0 +1,54 @@
+[workspace]
+resolver = "2"
+members = [
+ "oversight-crypto",
+ "oversight-container",
+ "oversight-manifest",
+ "oversight-watermark",
+ "oversight-tlog",
+ "oversight-policy",
+ "oversight-semantic",
+ "oversight-cli",
+]
+exclude = ["fuzz"]
+
+[workspace.package]
+version = "0.4.1"
+edition = "2021"
+rust-version = "1.75"
+license = "Apache-2.0"
+repository = "https://github.com/oversight-protocol/oversight"
+authors = ["Oversight contributors"]
+
+[workspace.dependencies]
+# Cryptography - pure Rust, memory-safe, audited where noted: RustCrypto crates plus dalek-cryptography (x25519/ed25519).
+# x25519-dalek: battle-tested, used by signal-protocol, age, TLS 1.3 impls.
+# ed25519-dalek: same. Both have been independently audited.
+x25519-dalek = { version = "2", features = ["static_secrets"] }
+ed25519-dalek = { version = "2", features = ["rand_core"] }
+chacha20poly1305 = { version = "0.10", features = ["alloc"] }
+hkdf = "0.12"
+sha2 = "0.10"
+rand = "0.8"
+rand_core = "0.6"
+
+# Serialization / encoding
+serde = { version = "1", features = ["derive"] }
+serde_json = "1"
+hex = "0.4"
+zeroize = { version = "1", features = ["zeroize_derive"] }
+
+# CLI / errors / IDs
+clap = { version = "4", features = ["derive"] }
+thiserror = "1"
+uuid = { version = "1", features = ["v4", "serde"] }
+
+# For canonical JSON
+serde_jcs = "0.1"
+
+[profile.release]
+opt-level = 3
+lto = true
+codegen-units = 1
+panic = "abort"
+strip = true
oversight-rust/fuzz/Cargo.toml +27 -0
@@ -0,0 +1,27 @@
+[package]
+name = "oversight-fuzz"
+version = "0.0.0"
+edition = "2021"
+publish = false
+
+[package.metadata]
+cargo-fuzz = true
+
+[dependencies]
+libfuzzer-sys = "0.4"
+oversight-container = { path = "../oversight-container" }
+oversight-manifest = { path = "../oversight-manifest" }
+
+[[bin]]
+name = "container_parser"
+path = "fuzz_targets/container_parser.rs"
+test = false
+doc = false
+bench = false
+
+[[bin]]
+name = "manifest_parser"
+path = "fuzz_targets/manifest_parser.rs"
+test = false
+doc = false
+bench = false
oversight-rust/fuzz/README.md +34 -0
@@ -0,0 +1,34 @@
+# Oversight fuzz harnesses
+
+Two libFuzzer-based harnesses for the security-critical parsers:
+- `container_parser` - hammers the `.sealed` binary format parser
+- `manifest_parser` - hammers the canonical-JSON manifest parser
+
+## Setup (one time)
+
+```bash
+cargo install cargo-fuzz
+```
+
+cargo-fuzz needs a nightly Rust toolchain (it relies on unstable sanitizer and coverage flags):
+```bash
+rustup install nightly
+```
+
+## Run
+
+```bash
+cd oversight-rust/fuzz
+cargo +nightly fuzz run container_parser -- -max_total_time=300
+cargo +nightly fuzz run manifest_parser -- -max_total_time=300
+```
+
+## What "pass" looks like
+
+Without `-max_total_time`, a harness runs until you stop it. "Pass" means: no
+panics, no hangs, no OOMs, and no memory-safety violations (safe Rust plus
+AddressSanitizer, which cargo-fuzz enables by default, catches memory bugs).
+Any crash input is saved to `fuzz/artifacts/...` for reproduction.
+
+Target: run continuously for at least 24 hours before a paid security audit
+engagement, per our ROADMAP.md prerequisites.
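Review note: the "no panics" criterion above can also be spot-checked on a stable toolchain. The sketch below is an assumption-laden complement to the libFuzzer harnesses, not a replacement: it is not coverage-guided, and the names `XorShift64` and `smoke_test` are hypothetical helpers that do not exist in this repository. It feeds deterministic pseudo-random byte strings to any closure and reports the first input that makes it panic.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Minimal xorshift64 generator. Deterministic, so a failing run can be
/// reproduced from its seed. (Hypothetical helper, not part of Oversight.)
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> XorShift64 {
        // xorshift must not start from zero, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Feed `iterations` pseudo-random inputs, each between 0 and `max_len`
/// bytes long, to `target`. Returns the first input that made `target`
/// panic, or `None` if every call returned normally.
pub fn smoke_test<F>(target: F, iterations: usize, max_len: usize, seed: u64) -> Option<Vec<u8>>
where
    F: Fn(&[u8]),
{
    let mut rng = XorShift64::new(seed);
    for _ in 0..iterations {
        let len = (rng.next_u64() as usize) % max_len.saturating_add(1);
        let input: Vec<u8> = (0..len).map(|_| rng.next_u64() as u8).collect();
        // A panic inside `target` is caught here instead of aborting the run.
        let outcome = catch_unwind(AssertUnwindSafe(|| target(&input)));
        if outcome.is_err() {
            return Some(input);
        }
    }
    None
}
```

A call such as `smoke_test(|d| { let _ = Manifest::from_json(d); }, 10_000, 4096, 1)` would be the intended shape, assuming `Manifest::from_json` accepts a byte slice as it does in the fuzz target. `catch_unwind` only works when panics unwind, so this belongs in a test or dev build rather than the workspace's release profile, which sets `panic = "abort"`.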
oversight-rust/fuzz/fuzz_targets/container_parser.rs +10 -0
@@ -0,0 +1,10 @@
+#![no_main]
+use libfuzzer_sys::fuzz_target;
+use oversight_container::SealedFile;
+
+// Hammer the binary parser with arbitrary bytes.
+// Must never panic, OOM, or infinite-loop: all malformed inputs should
+// return a clean ContainerError.
+fuzz_target!(|data: &[u8]| {
+ let _ = SealedFile::from_bytes(data);
+});
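Review note: the comment in this target requires every malformed input to come back as a clean error. A minimal sketch of the parsing style that makes this achievable is shown below. The layout (`TOY1` magic, big-endian `u32` length, body) is a made-up toy format and `parse_toy` / `ParseError` are hypothetical names; none of this reflects the real `.sealed` layout or `ContainerError`.

```rust
/// Errors for the toy format below. (Hypothetical; not `ContainerError`.)
#[derive(Debug, PartialEq)]
pub enum ParseError {
    Truncated,
    BadMagic,
    LengthMismatch,
}

// Toy layout, NOT the real `.sealed` format:
//   4-byte magic "TOY1" | u32 big-endian body length | body bytes
const MAGIC: &[u8] = b"TOY1";

/// Parse the toy container and return a borrowed view of the body.
///
/// Every access goes through `get`, which returns `None` instead of
/// panicking on out-of-range indices, and the declared length is only
/// compared against the bytes actually present, never used to allocate.
pub fn parse_toy(data: &[u8]) -> Result<&[u8], ParseError> {
    let magic = data.get(..4).ok_or(ParseError::Truncated)?;
    if magic != MAGIC {
        return Err(ParseError::BadMagic);
    }

    // The slice pattern only matches when exactly four bytes are present.
    let declared_len = match data.get(4..8) {
        Some(&[a, b, c, d]) => u32::from_be_bytes([a, b, c, d]) as usize,
        _ => return Err(ParseError::Truncated),
    };

    // `data` has at least 8 bytes here, so `get(8..)` yields the remainder.
    let body = data.get(8..).ok_or(ParseError::Truncated)?;
    if body.len() != declared_len {
        return Err(ParseError::LengthMismatch);
    }
    Ok(body)
}
```

For example, `parse_toy(b"TOY1\x00\x00\x00\x02hi")` yields the two-byte body, while an input that declares `0xffffffff` bytes but carries one is rejected with `LengthMismatch` without any large allocation.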
oversight-rust/fuzz/fuzz_targets/manifest_parser.rs +7 -0
@@ -0,0 +1,7 @@
+#![no_main]
+use libfuzzer_sys::fuzz_target;
+use oversight_manifest::Manifest;
+
+fuzz_target!(|data: &[u8]| {
+ let _ = Manifest::from_json(data);
+});
oversight-rust/oversight-cli/Cargo.toml +23 -0
@@ -0,0 +1,23 @@
+[package]
+name = "oversight-cli"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "Command-line tool for Oversight: keygen, seal, open, inspect"
+default-run = "oversight"
+
+[[bin]]
+name = "oversight"
+path = "src/main.rs"
+
+[dependencies]
+oversight-crypto = { path = "../oversight-crypto" }
+oversight-container = { path = "../oversight-container" }
+oversight-manifest = { path = "../oversight-manifest" }
+oversight-watermark = { path = "../oversight-watermark" }
+clap.workspace = true
+serde.workspace = true
+serde_json.workspace = true
+hex.workspace = true
+thiserror.workspace = true
oversight-rust/oversight-cli/src/main.rs +222 -0
@@ -0,0 +1,222 @@
+//! # oversight CLI
+//!
+//! `oversight keygen | seal | open | inspect` for Oversight sealed files.
+
+use std::path::PathBuf;
+use std::process::ExitCode;
+
+use clap::{Parser, Subcommand};
+use oversight_container::{open_sealed, seal, SealedFile};
+use oversight_crypto::{self as crypto, ClassicIdentity};
+use oversight_manifest::{Manifest, Recipient};
+
+#[derive(Parser)]
+#[command(name = "oversight")]
+#[command(about = "Oversight - open protocol for provenance, attribution, and leak detection")]
+#[command(version)]
+struct Cli {
+ #[command(subcommand)]
+ command: Commands,
+}
+
+#[derive(Subcommand)]
+enum Commands {
+ /// Generate a new classical identity (X25519 + Ed25519)
+ Keygen {
+ /// Output path for the identity JSON file
+ #[arg(short, long)]
+ out: PathBuf,
+ },
+
+ /// Seal a plaintext file for a recipient
+ Seal {
+ /// Plaintext input file
+ #[arg(short, long)]
+ input: PathBuf,
+
+ /// Sealed output path
+ #[arg(short, long)]
+ output: PathBuf,
+
+ /// Issuer identity JSON (from `keygen`)
+ #[arg(short = 'I', long)]
+ issuer: PathBuf,
+
+ /// Recipient X25519 public key (32 bytes, hex-encoded)
+ #[arg(short = 'R', long)]
+ recipient_pub: String,
+
+ /// Recipient ID (stable identifier, e.g. email)
+ #[arg(long, default_value = "recipient")]
+ recipient_id: String,
+
+ /// Registry URL to bake into the manifest
+ #[arg(long, default_value = "https://registry.example.com")]
+ registry: String,
+ },
+
+ /// Open a sealed file
+ Open {
+ /// Sealed input file
+ #[arg(short, long)]
+ input: PathBuf,
+
+ /// Plaintext output path (use `-` for stdout)
+ #[arg(short, long)]
+ output: PathBuf,
+
+ /// Recipient identity JSON
+ #[arg(short = 'R', long)]
+ recipient: PathBuf,
+ },
+
+ /// Print the signed manifest + structural metadata of a sealed file
+ Inspect {
+ #[arg(short, long)]
+ input: PathBuf,
+ },
+}
+
+fn save_identity(id: &ClassicIdentity, path: &PathBuf) -> std::io::Result<()> {
+ let json = serde_json::json!({
+ "x25519_priv": hex::encode(id.x25519_priv.as_ref()),
+ "x25519_pub": hex::encode(id.x25519_pub),
+ "ed25519_priv": hex::encode(id.ed25519_priv.as_ref()),
+ "ed25519_pub": hex::encode(id.ed25519_pub),
+ });
+ // Mode 0600 on Unix. `mode` only applies when the file is newly created; an existing file keeps its permissions.
+ #[cfg(unix)]
+ {
+ use std::os::unix::fs::OpenOptionsExt;
+ let mut f = std::fs::OpenOptions::new()
+ .write(true)
+ .create(true)
+ .truncate(true)
+ .mode(0o600)
+ .open(path)?;
+ use std::io::Write;
+ f.write_all(serde_json::to_string_pretty(&json)?.as_bytes())?;
+ }
+ #[cfg(not(unix))]
+ {
+ std::fs::write(path, serde_json::to_string_pretty(&json)?)?;
+ }
+ Ok(())
+}
+
+fn load_identity(path: &PathBuf) -> Result<ClassicIdentity, Box<dyn std::error::Error>> {
+ let text = std::fs::read_to_string(path)?;
+ let v: serde_json::Value = serde_json::from_str(&text)?;
+ let x_priv = hex::decode(v["x25519_priv"].as_str().ok_or("missing x25519_priv")?)?;
+ let ed_priv = hex::decode(v["ed25519_priv"].as_str().ok_or("missing ed25519_priv")?)?;
+ if x_priv.len() != 32 || ed_priv.len() != 32 {
+ return Err("malformed identity file: x25519_priv and ed25519_priv must each decode to 32 bytes".into());
+ }
+ let mut x_arr = [0u8; 32];
+ x_arr.copy_from_slice(&x_priv);
+ let mut ed_arr = [0u8; 32];
+ ed_arr.copy_from_slice(&ed_priv);
+ Ok(ClassicIdentity::from_raw(x_arr, ed_arr))
+}
+
+fn run() -> Result<(), Box<dyn std::error::Error>> {
+ let cli = Cli::parse();
+ match cli.command {
+ Commands::Keygen { out } => {
+ let id = ClassicIdentity::generate();
+ save_identity(&id, &out)?;
+ println!("✓ new identity written to {}", out.display());
+ println!(" x25519_pub: {}", hex::encode(id.x25519_pub));
+ println!(" ed25519_pub: {}", hex::encode(id.ed25519_pub));
+ println!(" (created with mode 0600 on Unix; a pre-existing file keeps its mode)");
+ }
+
+ Commands::Seal {
+ input,
+ output,
+ issuer,
+ recipient_pub,
+ recipient_id,
+ registry,
+ } => {
+ let issuer_id = load_identity(&issuer)?;
+ let plaintext = std::fs::read(&input)?;
+ let recipient_pub_bytes = hex::decode(recipient_pub)?;
+ if recipient_pub_bytes.len() != 32 {
+ return Err("recipient_pub must decode to 32 bytes".into());
+ }
+
+ let mut manifest = Manifest::new(
+ input.file_name().and_then(|n| n.to_str()).unwrap_or("file"),
+ crypto::content_hash(&plaintext),
+ plaintext.len() as u64,
+ "cli-issuer",
+ hex::encode(issuer_id.ed25519_pub),
+ Recipient {
+ recipient_id,
+ x25519_pub: hex::encode(&recipient_pub_bytes),
+ ed25519_pub: None,
+ },
+ registry,
+ "application/octet-stream",
+ None,
+ None,
+ "GLOBAL",
+ );
+ let blob = seal(
+ &plaintext,
+ &mut manifest,
+ issuer_id.ed25519_priv.as_ref(),
+ &recipient_pub_bytes,
+ )?;
+ std::fs::write(&output, &blob)?;
+ println!("✓ sealed {} -> {} ({} bytes)", input.display(), output.display(), blob.len());
+ println!(" file_id: {}", manifest.file_id);
+ }
+
+ Commands::Open {
+ input,
+ output,
+ recipient,
+ } => {
+ let recipient_id = load_identity(&recipient)?;
+ let blob = std::fs::read(&input)?;
+ let (plaintext, manifest) =
+ open_sealed(&blob, recipient_id.x25519_priv.as_ref(), None)?;
+ if output.as_os_str() == "-" {
+ use std::io::Write;
+ std::io::stdout().write_all(&plaintext)?;
+ } else {
+ std::fs::write(&output, &plaintext)?;
+ }
+ eprintln!("✓ opened {} ({} bytes)", input.display(), plaintext.len());
+ eprintln!(" file_id: {}", manifest.file_id);
+ eprintln!(" issuer: {}", manifest.issuer_id);
+ }
+
+ Commands::Inspect { input } => {
+ let blob = std::fs::read(&input)?;
+ let sf = SealedFile::from_bytes(&blob)?;
+ let pretty = serde_json::to_string_pretty(&sf.manifest)?;
+ println!("=== Manifest ===");
+ println!("{}", pretty);
+ println!();
+ println!("=== Structure ===");
+ println!(" suite_id: {}", sf.suite_id);
+ println!(" ciphertext_len: {} bytes", sf.ciphertext.len());
+ println!(" aead_nonce: {}", hex::encode(sf.aead_nonce));
+ println!(" signature valid: {}", sf.manifest.verify().unwrap_or(false));
+ }
+ }
+ Ok(())
+}
+
+fn main() -> ExitCode {
+ match run() {
+ Ok(()) => ExitCode::SUCCESS,
+ Err(e) => {
+ eprintln!("error: {}", e);
+ ExitCode::FAILURE
+ }
+ }
+}
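
Note on `load_identity`: it hex-decodes each key and then checks for exactly 32 bytes before copying into a fixed array. The same check can be sketched with the standard library alone. `decode_key32` is a hypothetical helper written for illustration; it is not part of the Oversight crates, which use `hex::decode`.

```rust
/// Decode a hex string into exactly 32 bytes.
/// Hypothetical helper: rejects wrong lengths and non-hex digits instead of panicking.
fn decode_key32(hex_str: &str) -> Result<[u8; 32], String> {
    let bytes = hex_str.as_bytes();
    if bytes.len() != 64 {
        return Err(format!("expected 64 hex chars, got {}", bytes.len()));
    }
    let mut out = [0u8; 32];
    for (i, pair) in bytes.chunks_exact(2).enumerate() {
        // to_digit(16) accepts 0-9, a-f and A-F; anything else is None.
        let hi = (pair[0] as char).to_digit(16).ok_or("invalid hex digit")?;
        let lo = (pair[1] as char).to_digit(16).ok_or("invalid hex digit")?;
        out[i] = ((hi << 4) | lo) as u8;
    }
    Ok(out)
}
```

Returning a fixed-size array from the decoder moves the length check to one place, so callers cannot forget it before `copy_from_slice`.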
oversight-rust/oversight-container/Cargo.toml +15 -0
@@ -0,0 +1,15 @@
+[package]
+name = "oversight-container"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "Binary .sealed container format for Oversight"
+
+[dependencies]
+oversight-crypto = { path = "../oversight-crypto" }
+oversight-manifest = { path = "../oversight-manifest" }
+serde.workspace = true
+serde_json.workspace = true
+hex.workspace = true
+thiserror.workspace = true
oversight-rust/oversight-container/src/lib.rs +487 -0
@@ -0,0 +1,487 @@
+//! # oversight-container
+//!
+//! The `.sealed` container format: binary layout with magic bytes, signed
+//! manifest, AEAD-encrypted payload, and DEK-wrapped-for-recipient.
+//!
+//! Binary layout:
+//! ```text
+//! offset  length  field
+//! ------  ------  ---------------------------------------
+//! 0       6       magic: b"OSGT\x01\x00"
+//! 6       1       format_version (=1)
+//! 7       1       suite_id (1=CLASSIC_V1, 2=HYBRID_V1)
+//! 8       4       manifest_len (u32 BE)
+//! 12      M       manifest (canonical JSON, signed)
+//! 12+M    4       wrapped_dek_len (u32 BE)
+//! ...     W       wrapped_dek (JSON)
+//! ...     24      aead_nonce
+//! ...     4       ciphertext_len (u32 BE)
+//! ...     C       ciphertext (XChaCha20-Poly1305(plaintext))
+//! ```
+
+use oversight_crypto::{self as crypto, CryptoError, WrappedDek};
+use oversight_manifest::{Manifest, ManifestError};
+use thiserror::Error;
+
+pub const MAGIC: [u8; 6] = *b"OSGT\x01\x00";
+pub const SUITE_CLASSIC_V1_ID: u8 = 1;
+pub const SUITE_HYBRID_V1_ID: u8 = 2;
+
+// Hard caps to prevent DoS via attacker-controlled length fields.
+pub const MAX_MANIFEST_BYTES: usize = 4 * 1024 * 1024;
+pub const MAX_WRAPPED_DEK_BYTES: usize = 1 * 1024 * 1024;
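+// ciphertext_len is a u32, so the cap is the largest encodable length. Writing it as
+// u32::MAX (rather than 4 * 1024^3) also avoids a usize overflow on 32-bit targets.
+pub const MAX_CIPHERTEXT_BYTES: usize = u32::MAX as usize;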
+
+#[derive(Debug, Error)]
+pub enum ContainerError {
+ #[error("bad magic: expected {:?}, got {got:?}", MAGIC)]
+ BadMagic { got: Vec<u8> },
+ #[error("unsupported format version: {0}")]
+ UnsupportedVersion(u8),
+ #[error("truncated file: wanted {wanted} bytes for {field}, got {got}")]
+ Truncated {
+ wanted: usize,
+ got: usize,
+ field: &'static str,
+ },
+ #[error("oversized field {field}: {got} > {max}")]
+ Oversized {
+ field: &'static str,
+ got: usize,
+ max: usize,
+ },
+ #[error(transparent)]
+ Manifest(#[from] ManifestError),
+ #[error(transparent)]
+ Crypto(#[from] CryptoError),
+ #[error("json: {0}")]
+ Json(#[from] serde_json::Error),
+ #[error("invalid utf-8: {0}")]
+ Utf8(#[from] std::string::FromUtf8Error),
+ #[error("precondition failed: {0}")]
+ Precondition(&'static str),
+ #[error("no decryptable slot found (tried {slots} slots)")]
+ NoDecryptableSlot { slots: usize },
+ #[error("plaintext hash mismatch - manifest and plaintext disagree")]
+ HashMismatch,
+}
+
+#[derive(Debug)]
+pub struct SealedFile {
+ pub manifest: Manifest,
+ pub wrapped_dek: serde_json::Value,
+ pub aead_nonce: [u8; 24],
+ pub ciphertext: Vec<u8>,
+ pub suite_id: u8,
+}
+
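+fn read_exact<'a>(buf: &'a [u8], at: &mut usize, n: usize, field: &'static str) -> Result<&'a [u8], ContainerError> {
+ // checked_add keeps an attacker-supplied length from wrapping the cursor on 32-bit targets.
+ let end = match (*at).checked_add(n) {
+ Some(end) if end <= buf.len() => end,
+ _ => {
+ return Err(ContainerError::Truncated {
+ wanted: n,
+ got: buf.len().saturating_sub(*at),
+ field,
+ });
+ }
+ };
+ let slice = &buf[*at..end];
+ *at = end;
+ Ok(slice)
+}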
+
+fn read_u32_be(buf: &[u8], at: &mut usize, field: &'static str) -> Result<u32, ContainerError> {
+ let slice = read_exact(buf, at, 4, field)?;
+ Ok(u32::from_be_bytes([slice[0], slice[1], slice[2], slice[3]]))
+}
+
+impl SealedFile {
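+ pub fn to_bytes(&self) -> Result<Vec<u8>, ContainerError> {
+ // Length prefixes are u32; reject larger fields instead of truncating with `as u32`.
+ fn len_prefix(len: usize, field: &'static str) -> Result<[u8; 4], ContainerError> {
+ u32::try_from(len)
+ .map(u32::to_be_bytes)
+ .map_err(|_| ContainerError::Oversized { field, got: len, max: u32::MAX as usize })
+ }
+
+ let mut out = Vec::new();
+ out.extend_from_slice(&MAGIC);
+ out.push(1); // format_version
+ out.push(self.suite_id);
+
+ let manifest_json = self.manifest.to_json()?;
+ out.extend_from_slice(&len_prefix(manifest_json.len(), "manifest")?);
+ out.extend_from_slice(&manifest_json);
+
+ let wrapped_bytes = serde_json::to_vec(&self.wrapped_dek)?;
+ out.extend_from_slice(&len_prefix(wrapped_bytes.len(), "wrapped_dek")?);
+ out.extend_from_slice(&wrapped_bytes);
+
+ out.extend_from_slice(&self.aead_nonce);
+ out.extend_from_slice(&len_prefix(self.ciphertext.len(), "ciphertext")?);
+ out.extend_from_slice(&self.ciphertext);
+
+ Ok(out)
+ }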
+
+ pub fn from_bytes(data: &[u8]) -> Result<Self, ContainerError> {
+ let mut at = 0usize;
+ let magic = read_exact(data, &mut at, 6, "magic")?;
+ if magic != MAGIC {
+ return Err(ContainerError::BadMagic { got: magic.to_vec() });
+ }
+ let hdr = read_exact(data, &mut at, 2, "version/suite")?;
+ let fmt_ver = hdr[0];
+ let suite_id = hdr[1];
+ if fmt_ver != 1 {
+ return Err(ContainerError::UnsupportedVersion(fmt_ver));
+ }
+
+ let mlen = read_u32_be(data, &mut at, "manifest_len")? as usize;
+ if mlen > MAX_MANIFEST_BYTES {
+ return Err(ContainerError::Oversized {
+ field: "manifest",
+ got: mlen,
+ max: MAX_MANIFEST_BYTES,
+ });
+ }
+ let manifest_bytes = read_exact(data, &mut at, mlen, "manifest")?;
+ let manifest = Manifest::from_json(manifest_bytes)?;
+
+ let wlen = read_u32_be(data, &mut at, "wrapped_dek_len")? as usize;
+ if wlen > MAX_WRAPPED_DEK_BYTES {
+ return Err(ContainerError::Oversized {
+ field: "wrapped_dek",
+ got: wlen,
+ max: MAX_WRAPPED_DEK_BYTES,
+ });
+ }
+ let wrapped_bytes = read_exact(data, &mut at, wlen, "wrapped_dek")?;
+ let wrapped_dek: serde_json::Value = serde_json::from_slice(wrapped_bytes)?;
+
+ let nonce_slice = read_exact(data, &mut at, 24, "aead_nonce")?;
+ let mut aead_nonce = [0u8; 24];
+ aead_nonce.copy_from_slice(nonce_slice);
+
+ let clen = read_u32_be(data, &mut at, "ciphertext_len")? as usize;
+ if clen > MAX_CIPHERTEXT_BYTES {
+ return Err(ContainerError::Oversized {
+ field: "ciphertext",
+ got: clen,
+ max: MAX_CIPHERTEXT_BYTES,
+ });
+ }
+ let ciphertext = read_exact(data, &mut at, clen, "ciphertext")?.to_vec();
+
+ Ok(SealedFile {
+ manifest,
+ wrapped_dek,
+ aead_nonce,
+ ciphertext,
+ suite_id,
+ })
+ }
+}
+
+// -------------------------- High-level API --------------------------
+
+/// Seal plaintext for a single recipient.
+pub fn seal(
+ plaintext: &[u8],
+ manifest: &mut Manifest,
+ issuer_ed25519_priv: &[u8],
+ recipient_x25519_pub: &[u8],
+) -> Result<Vec<u8>, ContainerError> {
+ // Preconditions are explicit checks rather than asserts, for parity with the Python
+ // implementation, where `python -O` would strip asserts.
+ if manifest.content_hash != crypto::content_hash(plaintext) {
+ return Err(ContainerError::Precondition(
+ "manifest.content_hash != sha256(plaintext)",
+ ));
+ }
+ if manifest.size_bytes != plaintext.len() as u64 {
+ return Err(ContainerError::Precondition(
+ "manifest.size_bytes != len(plaintext)",
+ ));
+ }
+ let recipient = manifest
+ .recipient
+ .as_ref()
+ .ok_or(ContainerError::Precondition("manifest.recipient is None"))?;
+ if recipient.x25519_pub != hex::encode(recipient_x25519_pub) {
+ return Err(ContainerError::Precondition(
+ "manifest.recipient.x25519_pub mismatch with recipient pubkey",
+ ));
+ }
+ if recipient_x25519_pub.len() != 32 {
+ return Err(ContainerError::Precondition("recipient pubkey must be 32 bytes"));
+ }
+ if issuer_ed25519_priv.len() != 32 {
+ return Err(ContainerError::Precondition("issuer priv key must be 32 bytes"));
+ }
+
+ manifest.sign(issuer_ed25519_priv)?;
+
+ let dek = crypto::random_dek();
+ let wrapped = crypto::wrap_dek_for_recipient(dek.as_ref(), recipient_x25519_pub)?;
+ let aad = manifest.content_hash.as_bytes();
+ let (nonce, ct) = crypto::aead_encrypt(dek.as_ref(), plaintext, aad)?;
+
+ let sf = SealedFile {
+ manifest: manifest.clone(),
+ wrapped_dek: wrapped.to_json_hex(),
+ aead_nonce: nonce,
+ ciphertext: ct,
+ suite_id: SUITE_CLASSIC_V1_ID,
+ };
+ sf.to_bytes()
+}
+
+/// Open a sealed blob. Returns (plaintext, manifest).
+pub fn open_sealed(
+ blob: &[u8],
+ recipient_x25519_priv: &[u8],
+ trusted_issuer_pubs: Option<&[String]>,
+) -> Result<(Vec<u8>, Manifest), ContainerError> {
+ if recipient_x25519_priv.len() != 32 {
+ return Err(ContainerError::Precondition("recipient priv key must be 32 bytes"));
+ }
+
+ let sf = SealedFile::from_bytes(blob)?;
+ if !sf.manifest.verify()? {
+ // verify() returned Ok(false): a signature is present but does not validate.
+ return Err(ContainerError::Crypto(CryptoError::BadSignature));
+ }
+
+ if let Some(trusted) = trusted_issuer_pubs {
+ if !trusted.iter().any(|p| p == &sf.manifest.issuer_ed25519_pub) {
+ return Err(ContainerError::Precondition("issuer not in trusted set"));
+ }
+ }
+
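+ // Policy enforcement (time-based); an expanded version is planned for a later oversight-policy crate.
+ // A clock before the UNIX epoch is an error: defaulting to 0 would let expired files open.
+ let now = std::time::SystemTime::now()
+ .duration_since(std::time::UNIX_EPOCH)
+ .map(|d| d.as_secs() as i64)
+ .map_err(|_| ContainerError::Precondition("system clock is before the UNIX epoch"))?;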
+ if let Some(na) = sf.manifest.policy.get("not_after").and_then(|v| v.as_i64()) {
+ if now > na {
+ return Err(ContainerError::Precondition("file expired (not_after)"));
+ }
+ }
+ if let Some(nb) = sf.manifest.policy.get("not_before").and_then(|v| v.as_i64()) {
+ if now < nb {
+ return Err(ContainerError::Precondition("file not yet released (not_before)"));
+ }
+ }
+
+ // DEK unwrap: try slots if present, else single wrap
+ let dek = if let Some(slots) = sf.wrapped_dek.get("slots").and_then(|v| v.as_array()) {
+ let mut recovered = None;
+ for slot in slots {
+ let wrapped = WrappedDek::from_json_hex(slot)?;
+ if let Ok(dek) = crypto::unwrap_dek(&wrapped, recipient_x25519_priv) {
+ recovered = Some(dek);
+ break;
+ }
+ }
+ recovered.ok_or(ContainerError::NoDecryptableSlot { slots: slots.len() })?
+ } else {
+ let wrapped = WrappedDek::from_json_hex(&sf.wrapped_dek)?;
+ crypto::unwrap_dek(&wrapped, recipient_x25519_priv)?
+ };
+
+ let aad = sf.manifest.content_hash.as_bytes();
+ let plaintext = crypto::aead_decrypt(dek.as_ref(), &sf.aead_nonce, &sf.ciphertext, aad)?;
+
+ if crypto::content_hash(&plaintext) != sf.manifest.content_hash {
+ return Err(ContainerError::HashMismatch);
+ }
+
+ Ok((plaintext, sf.manifest))
+}
+
+/// Seal for multiple recipients (compact storage: one ciphertext, N key wraps).
+pub fn seal_multi(
+ plaintext: &[u8],
+ manifest: &mut Manifest,
+ issuer_ed25519_priv: &[u8],
+ recipient_x25519_pubs: &[&[u8]],
+) -> Result<Vec<u8>, ContainerError> {
+ if manifest.content_hash != crypto::content_hash(plaintext) {
+ return Err(ContainerError::Precondition(
+ "manifest.content_hash != sha256(plaintext)",
+ ));
+ }
+ if manifest.size_bytes != plaintext.len() as u64 {
+ return Err(ContainerError::Precondition(
+ "manifest.size_bytes != len(plaintext)",
+ ));
+ }
+ if recipient_x25519_pubs.is_empty() {
+ return Err(ContainerError::Precondition("need at least one recipient"));
+ }
+ if recipient_x25519_pubs.iter().any(|p| p.len() != 32) {
+ return Err(ContainerError::Precondition(
+ "recipient pubkey must be 32 bytes",
+ ));
+ }
+
+ manifest.sign(issuer_ed25519_priv)?;
+ let dek = crypto::random_dek();
+ let slots: Result<Vec<_>, _> = recipient_x25519_pubs
+ .iter()
+ .map(|p| crypto::wrap_dek_for_recipient(dek.as_ref(), p))
+ .collect();
+ let slots = slots?;
+ let slots_json: Vec<_> = slots.iter().map(|s| s.to_json_hex()).collect();
+
+ let aad = manifest.content_hash.as_bytes();
+ let (nonce, ct) = crypto::aead_encrypt(dek.as_ref(), plaintext, aad)?;
+
+ let sf = SealedFile {
+ manifest: manifest.clone(),
+ wrapped_dek: serde_json::json!({ "slots": slots_json }),
+ aead_nonce: nonce,
+ ciphertext: ct,
+ suite_id: SUITE_CLASSIC_V1_ID,
+ };
+ sf.to_bytes()
+}
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+ use oversight_crypto::ClassicIdentity;
+ use oversight_manifest::Recipient;
+
+ fn make_manifest(issuer: &ClassicIdentity, recipient: &ClassicIdentity, plaintext: &[u8]) -> Manifest {
+ Manifest::new(
+ "doc.txt",
+ crypto::content_hash(plaintext),
+ plaintext.len() as u64,
+ "issuer@test",
+ hex::encode(issuer.ed25519_pub),
+ Recipient {
+ recipient_id: "alice@test".into(),
+ x25519_pub: hex::encode(recipient.x25519_pub),
+ ed25519_pub: None,
+ },
+ "https://registry.test",
+ "text/plain",
+ None,
+ None,
+ "GLOBAL",
+ )
+ }
+
+ #[test]
+ fn seal_open_round_trip() {
+ let issuer = ClassicIdentity::generate();
+ let recipient = ClassicIdentity::generate();
+ let plaintext = b"This is my secret document.";
+ let mut m = make_manifest(&issuer, &recipient, plaintext);
+ let blob = seal(plaintext, &mut m, issuer.ed25519_priv.as_ref(), &recipient.x25519_pub).unwrap();
+ let (pt, manifest) = open_sealed(&blob, recipient.x25519_priv.as_ref(), None).unwrap();
+ assert_eq!(pt, plaintext);
+ assert_eq!(manifest.file_id, m.file_id);
+ }
+
+ #[test]
+ fn wrong_recipient_rejected() {
+ let issuer = ClassicIdentity::generate();
+ let alice = ClassicIdentity::generate();
+ let bob = ClassicIdentity::generate();
+ let plaintext = b"secret";
+ let mut m = make_manifest(&issuer, &alice, plaintext);
+ let blob = seal(plaintext, &mut m, issuer.ed25519_priv.as_ref(), &alice.x25519_pub).unwrap();
+ // Bob tries to open - the DEK unwrap should fail its AEAD tag check
+ assert!(open_sealed(&blob, bob.x25519_priv.as_ref(), None).is_err());
+ }
+
+ #[test]
+ fn ciphertext_tamper_rejected() {
+ let issuer = ClassicIdentity::generate();
+ let alice = ClassicIdentity::generate();
+ let plaintext = b"secret";
+ let mut m = make_manifest(&issuer, &alice, plaintext);
+ let mut blob = seal(plaintext, &mut m, issuer.ed25519_priv.as_ref(), &alice.x25519_pub).unwrap();
+ let len = blob.len();
+ blob[len - 1] ^= 0x01;
+ assert!(open_sealed(&blob, alice.x25519_priv.as_ref(), None).is_err());
+ }
+
+ #[test]
+ fn bad_magic_rejected() {
+ let mut blob = vec![0u8; 100];
+ blob[0..6].copy_from_slice(b"FAKE\x00\x00");
+ assert!(SealedFile::from_bytes(&blob).is_err());
+ }
+
+ #[test]
+ fn oversized_manifest_rejected() {
+ let mut blob = Vec::new();
+ blob.extend_from_slice(&MAGIC);
+ blob.push(1);
+ blob.push(1);
+ // Claim a 5MB manifest
+ blob.extend_from_slice(&(5u32 * 1024 * 1024).to_be_bytes());
+ blob.resize(100, 0);
+ match SealedFile::from_bytes(&blob) {
+ Err(ContainerError::Oversized { field: "manifest", .. }) => (),
+ other => panic!("expected Oversized manifest error, got {:?}", other),
+ }
+ }
+
+ #[test]
+ fn truncated_file_rejected() {
+ // Just the 6 magic bytes, nothing else
+ let blob = MAGIC.to_vec();
+ assert!(SealedFile::from_bytes(&blob).is_err());
+ }
+
+ #[test]
+ fn expired_file_rejected() {
+ let issuer = ClassicIdentity::generate();
+ let alice = ClassicIdentity::generate();
+ let plaintext = b"secret";
+ let mut m = make_manifest(&issuer, &alice, plaintext);
+ m.policy["not_after"] = serde_json::json!(1000); // long ago
+ let blob = seal(plaintext, &mut m, issuer.ed25519_priv.as_ref(), &alice.x25519_pub).unwrap();
+ match open_sealed(&blob, alice.x25519_priv.as_ref(), None) {
+ Err(ContainerError::Precondition("file expired (not_after)")) => (),
+ other => panic!("expected expiry error, got {:?}", other.map(|_| ())),
+ }
+ }
+
+ #[test]
+ fn seal_multi_three_recipients() {
+ let issuer = ClassicIdentity::generate();
+ let alice = ClassicIdentity::generate();
+ let bob = ClassicIdentity::generate();
+ let carol = ClassicIdentity::generate();
+ let stranger = ClassicIdentity::generate();
+
+ let plaintext = b"shared document for cohort";
+ // For seal_multi, we use a placeholder recipient in the manifest
+ let mut m = Manifest::new(
+ "cohort.txt",
+ crypto::content_hash(plaintext),
+ plaintext.len() as u64,
+ "issuer@test",
+ hex::encode(issuer.ed25519_pub),
+ Recipient {
+ recipient_id: "cohort".into(),
+ x25519_pub: hex::encode(alice.x25519_pub), // placeholder
+ ed25519_pub: None,
+ },
+ "https://registry.test",
+ "text/plain",
+ None,
+ None,
+ "GLOBAL",
+ );
+ let recipients: Vec<&[u8]> = vec![&alice.x25519_pub, &bob.x25519_pub, &carol.x25519_pub];
+ let blob = seal_multi(plaintext, &mut m, issuer.ed25519_priv.as_ref(), &recipients).unwrap();
+
+ // All three should decrypt
+ for r in [&alice, &bob, &carol] {
+ let (pt, _) = open_sealed(&blob, r.x25519_priv.as_ref(), None).unwrap();
+ assert_eq!(pt, plaintext);
+ }
+ // Stranger should fail
+ assert!(open_sealed(&blob, stranger.x25519_priv.as_ref(), None).is_err());
+ }
+}
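
Note on the `.sealed` layout: the framing documented at the top of `oversight-container/src/lib.rs` (magic, version, suite, then u32 big-endian length-prefixed fields around a fixed 24-byte nonce) can be exercised without any crypto. The sketch below is std-only and illustrative; `Parts`, `put_field`, `take`, `take_field`, `frame`, and `unframe` are hypothetical names, errors are plain strings, and the per-field size caps from the real parser are omitted.

```rust
// Std-only sketch of the `.sealed` framing; every name here is illustrative.
const MAGIC: [u8; 6] = *b"OSGT\x01\x00";

/// Borrowed view of the fields in one container.
#[derive(Debug)]
struct Parts<'a> {
    suite_id: u8,
    manifest: &'a [u8],
    wrapped_dek: &'a [u8],
    nonce: &'a [u8],
    ciphertext: &'a [u8],
}

/// Append a u32 big-endian length prefix, then the field bytes.
fn put_field(out: &mut Vec<u8>, field: &[u8]) -> Result<(), String> {
    let len = u32::try_from(field.len())
        .map_err(|_| "field too large for u32 prefix".to_string())?;
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(field);
    Ok(())
}

/// Bounds-checked read of `n` bytes at cursor `at`.
fn take<'a>(buf: &'a [u8], at: &mut usize, n: usize) -> Result<&'a [u8], String> {
    let end = (*at).checked_add(n).ok_or("length overflow")?;
    let slice = buf.get(*at..end).ok_or("truncated")?;
    *at = end;
    Ok(slice)
}

/// Read a u32 big-endian length prefix, then that many bytes.
fn take_field<'a>(buf: &'a [u8], at: &mut usize) -> Result<&'a [u8], String> {
    let l = take(buf, at, 4)?;
    let len = u32::from_be_bytes([l[0], l[1], l[2], l[3]]) as usize;
    take(buf, at, len)
}

/// magic | version | suite | manifest | wrapped_dek | nonce(24) | ciphertext
fn frame(
    suite_id: u8,
    manifest: &[u8],
    wrapped_dek: &[u8],
    nonce: &[u8; 24],
    ciphertext: &[u8],
) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    out.extend_from_slice(&MAGIC);
    out.push(1); // format_version
    out.push(suite_id);
    put_field(&mut out, manifest)?;
    put_field(&mut out, wrapped_dek)?;
    out.extend_from_slice(nonce);
    put_field(&mut out, ciphertext)?;
    Ok(out)
}

fn unframe(buf: &[u8]) -> Result<Parts<'_>, String> {
    let mut at = 0usize;
    if take(buf, &mut at, 6)? != MAGIC {
        return Err("bad magic".into());
    }
    let hdr = take(buf, &mut at, 2)?;
    if hdr[0] != 1 {
        return Err(format!("unsupported format version {}", hdr[0]));
    }
    let suite_id = hdr[1];
    let manifest = take_field(buf, &mut at)?;
    let wrapped_dek = take_field(buf, &mut at)?;
    let nonce = take(buf, &mut at, 24)?;
    let ciphertext = take_field(buf, &mut at)?;
    Ok(Parts { suite_id, manifest, wrapped_dek, nonce, ciphertext })
}
```

Using `slice::get` with a checked end offset means a hostile length prefix yields an error rather than a panic or a wrapped index. A production parser would still want the size caps, since a large but valid length can force a large allocation.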
oversight-rust/oversight-crypto/Cargo.toml +20 -0
@@ -0,0 +1,20 @@
+[package]
+name = "oversight-crypto"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "Cryptographic primitives for Oversight: X25519, Ed25519, XChaCha20-Poly1305, HKDF, hybrid PQ hooks"
+
+[dependencies]
+x25519-dalek.workspace = true
+ed25519-dalek.workspace = true
+chacha20poly1305.workspace = true
+hkdf.workspace = true
+sha2.workspace = true
+rand.workspace = true
+rand_core.workspace = true
+hex.workspace = true
+zeroize.workspace = true
+thiserror.workspace = true
+serde_json.workspace = true
oversight-rust/oversight-crypto/src/lib.rs +395 -0
@@ -0,0 +1,395 @@
+//! # oversight-crypto
+//!
+//! Cryptographic primitives for Oversight.
+//!
+//! ## Design
+//!
+//! Published, peer-reviewed primitives only. NO custom crypto. ML-KEM and ML-DSA are
+//! NIST standards; XChaCha20-Poly1305 is not, though it is widely deployed.
+//!
+//! ### Classical suite (SNTL-CLASSIC-v1 on-the-wire, maintained for compatibility)
+//! - **X25519** - ECDH key agreement
+//! - **Ed25519** - digital signatures
+//! - **XChaCha20-Poly1305** - authenticated encryption (AEAD)
+//! - **HKDF-SHA256** - key derivation
+//!
+//! ### Post-quantum hybrid suite (OSGT-HYBRID-v1)
+//! - **X25519 + ML-KEM-768** - hybrid key encapsulation (requires both be broken)
+//! - **Ed25519 + ML-DSA-65** - hybrid signatures
+//!
+//! PQ primitives are gated behind the `pq` feature and require `liboqs`.
+//!
+//! ## Memory safety
+//!
+//! Long-lived secret bytes (identity keys, DEKs, derived KEKs) are wrapped in
+//! `zeroize::Zeroizing` so they scrub on drop; transient stack copies may persist.
+//! Rust's ownership rules prevent the classic "use-after-free" class of bugs
+//! that plague C cryptographic libraries.
+
+use chacha20poly1305::{
+ aead::{Aead, AeadCore, KeyInit, Payload},
+ XChaCha20Poly1305,
+};
+use ed25519_dalek::{
+ Signature as EdSignature, Signer, SigningKey as EdSigningKey, Verifier,
+ VerifyingKey as EdVerifyingKey,
+};
+use hkdf::Hkdf;
+use rand::rngs::OsRng;
+use rand_core::RngCore;
+use sha2::{Digest, Sha256};
+use thiserror::Error;
+use x25519_dalek::{PublicKey as X25519PublicKey, StaticSecret as X25519StaticSecret};
+use zeroize::{Zeroize, Zeroizing};
+
+pub const XCHACHA_KEY_LEN: usize = 32;
+pub const XCHACHA_NONCE_LEN: usize = 24;
+pub const X25519_KEY_LEN: usize = 32;
+pub const ED25519_KEY_LEN: usize = 32;
+pub const ED25519_SIG_LEN: usize = 64;
+pub const DEK_LEN: usize = 32;
+
+pub const SUITE_CLASSIC_V1: &str = "OSGT-CLASSIC-v1";
+pub const SUITE_HYBRID_V1: &str = "OSGT-HYBRID-v1";
+
+#[derive(Debug, Error)]
+pub enum CryptoError {
+ #[error("invalid key length: expected {expected}, got {got}")]
+ InvalidKeyLength { expected: usize, got: usize },
+ #[error("AEAD decryption failed (tag mismatch or key wrong)")]
+ AeadFailed,
+ #[error("signature verification failed")]
+ BadSignature,
+ #[error("malformed hex: {0}")]
+ Hex(#[from] hex::FromHexError),
+ #[error("HKDF error")]
+ Hkdf,
+ #[error("missing wrapped-DEK field: {0}")]
+ MissingField(&'static str),
+}
+
+// -------------------------- Identity --------------------------
+
+/// A recipient or issuer identity: X25519 for encryption, Ed25519 for signing.
+///
+/// Secret material lives in `Zeroizing` so it scrubs on drop.
+pub struct ClassicIdentity {
+ pub x25519_priv: Zeroizing<[u8; X25519_KEY_LEN]>,
+ pub x25519_pub: [u8; X25519_KEY_LEN],
+ pub ed25519_priv: Zeroizing<[u8; ED25519_KEY_LEN]>,
+ pub ed25519_pub: [u8; ED25519_KEY_LEN],
+}
+
+impl ClassicIdentity {
+ pub fn generate() -> Self {
+ let mut rng = OsRng;
+
+ // X25519
+ let mut x_priv_bytes = [0u8; X25519_KEY_LEN];
+ rng.fill_bytes(&mut x_priv_bytes);
+ let x_static = X25519StaticSecret::from(x_priv_bytes);
+ let x_pub = X25519PublicKey::from(&x_static);
+
+ // Ed25519
+ let mut ed_seed = [0u8; ED25519_KEY_LEN];
+ rng.fill_bytes(&mut ed_seed);
+ let ed_signing = EdSigningKey::from_bytes(&ed_seed);
+ let ed_verifying = ed_signing.verifying_key();
+
+ Self {
+ x25519_priv: Zeroizing::new(x_static.to_bytes()),
+ x25519_pub: x_pub.to_bytes(),
+ ed25519_priv: Zeroizing::new(ed_seed),
+ ed25519_pub: ed_verifying.to_bytes(),
+ }
+ }
+
+ pub fn from_raw(
+ x25519_priv: [u8; X25519_KEY_LEN],
+ ed25519_priv: [u8; ED25519_KEY_LEN],
+ ) -> Self {
+ let x_static = X25519StaticSecret::from(x25519_priv);
+ let x_pub = X25519PublicKey::from(&x_static);
+ let ed_signing = EdSigningKey::from_bytes(&ed25519_priv);
+ let ed_verifying = ed_signing.verifying_key();
+ Self {
+ x25519_priv: Zeroizing::new(x25519_priv),
+ x25519_pub: x_pub.to_bytes(),
+ ed25519_priv: Zeroizing::new(ed25519_priv),
+ ed25519_pub: ed_verifying.to_bytes(),
+ }
+ }
+}
+
+// -------------------------- AEAD --------------------------
+
+/// XChaCha20-Poly1305 encrypt. Returns (nonce, ciphertext||tag).
+/// The 24-byte (192-bit) nonce is safe to generate at random: by the birthday bound,
+/// a collision under one key only becomes likely after roughly 2^96 messages.
+pub fn aead_encrypt(
+ key: &[u8],
+ plaintext: &[u8],
+ aad: &[u8],
+) -> Result<([u8; XCHACHA_NONCE_LEN], Vec<u8>), CryptoError> {
+ if key.len() != XCHACHA_KEY_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: XCHACHA_KEY_LEN,
+ got: key.len(),
+ });
+ }
+ let cipher = XChaCha20Poly1305::new(key.into());
+ let nonce = XChaCha20Poly1305::generate_nonce(&mut OsRng);
+ let ct = cipher
+ .encrypt(&nonce, Payload { msg: plaintext, aad })
+ .map_err(|_| CryptoError::AeadFailed)?;
+ let mut nonce_arr = [0u8; XCHACHA_NONCE_LEN];
+ nonce_arr.copy_from_slice(&nonce);
+ Ok((nonce_arr, ct))
+}
+
+pub fn aead_decrypt(
+ key: &[u8],
+ nonce: &[u8],
+ ciphertext: &[u8],
+ aad: &[u8],
+) -> Result<Vec<u8>, CryptoError> {
+ if key.len() != XCHACHA_KEY_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: XCHACHA_KEY_LEN,
+ got: key.len(),
+ });
+ }
+ if nonce.len() != XCHACHA_NONCE_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: XCHACHA_NONCE_LEN,
+ got: nonce.len(),
+ });
+ }
+ let cipher = XChaCha20Poly1305::new(key.into());
+ cipher
+ .decrypt(nonce.into(), Payload { msg: ciphertext, aad })
+ .map_err(|_| CryptoError::AeadFailed)
+}
+
+// -------------------------- Key agreement --------------------------
+
+/// Envelope produced by the classical ECIES-style DEK wrap
+/// (X25519 + HKDF-SHA256 + XChaCha20-Poly1305).
+///
+/// `to_json_hex` renders the fields as hex strings suitable for embedding in JSON.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct WrappedDek {
+ pub ephemeral_pub: [u8; X25519_KEY_LEN],
+ pub nonce: [u8; XCHACHA_NONCE_LEN],
+ pub wrapped_dek: Vec<u8>,
+}
+
+impl WrappedDek {
+ pub fn to_json_hex(&self) -> serde_json::Value {
+ serde_json::json!({
+ "ephemeral_pub": hex::encode(self.ephemeral_pub),
+ "nonce": hex::encode(self.nonce),
+ "wrapped_dek": hex::encode(&self.wrapped_dek),
+ })
+ }
+
+ pub fn from_json_hex(v: &serde_json::Value) -> Result<Self, CryptoError> {
+ fn field(v: &serde_json::Value, name: &'static str) -> Result<String, CryptoError> {
+ v.get(name)
+ .and_then(|x| x.as_str())
+ .map(str::to_string)
+ .ok_or(CryptoError::MissingField(name))
+ }
+ let eph_bytes = hex::decode(field(v, "ephemeral_pub")?)?;
+ let nonce_bytes = hex::decode(field(v, "nonce")?)?;
+ let wrapped = hex::decode(field(v, "wrapped_dek")?)?;
+ if eph_bytes.len() != X25519_KEY_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: X25519_KEY_LEN,
+ got: eph_bytes.len(),
+ });
+ }
+ if nonce_bytes.len() != XCHACHA_NONCE_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: XCHACHA_NONCE_LEN,
+ got: nonce_bytes.len(),
+ });
+ }
+ let mut eph = [0u8; X25519_KEY_LEN];
+ eph.copy_from_slice(&eph_bytes);
+ let mut nonce = [0u8; XCHACHA_NONCE_LEN];
+ nonce.copy_from_slice(&nonce_bytes);
+ Ok(WrappedDek { ephemeral_pub: eph, nonce, wrapped_dek: wrapped })
+ }
+}
+
+pub fn wrap_dek_for_recipient(
+ dek: &[u8],
+ recipient_x25519_pub: &[u8],
+) -> Result<WrappedDek, CryptoError> {
+ if recipient_x25519_pub.len() != X25519_KEY_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: X25519_KEY_LEN,
+ got: recipient_x25519_pub.len(),
+ });
+ }
+ let mut eph_bytes = [0u8; X25519_KEY_LEN];
+ OsRng.fill_bytes(&mut eph_bytes);
+ let eph = X25519StaticSecret::from(eph_bytes);
+ let eph_pub = X25519PublicKey::from(&eph);
+
+ let mut peer_arr = [0u8; X25519_KEY_LEN];
+ peer_arr.copy_from_slice(recipient_x25519_pub);
+ let peer = X25519PublicKey::from(peer_arr);
+
+ let shared = Zeroizing::new(eph.diffie_hellman(&peer).to_bytes());
+
+ let hk = Hkdf::<Sha256>::new(None, shared.as_ref());
+ let mut kek = Zeroizing::new([0u8; 32]);
+ hk.expand(b"oversight-v1-dek-wrap", kek.as_mut())
+ .map_err(|_| CryptoError::Hkdf)?;
+
+ let (nonce, wrapped) = aead_encrypt(kek.as_ref(), dek, b"oversight-dek")?;
+ Ok(WrappedDek { ephemeral_pub: eph_pub.to_bytes(), nonce, wrapped_dek: wrapped })
+}
+
+pub fn unwrap_dek(
+ wrapped: &WrappedDek,
+ recipient_x25519_priv: &[u8],
+) -> Result<Zeroizing<Vec<u8>>, CryptoError> {
+ if recipient_x25519_priv.len() != X25519_KEY_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: X25519_KEY_LEN,
+ got: recipient_x25519_priv.len(),
+ });
+ }
+ let mut priv_arr = [0u8; X25519_KEY_LEN];
+ priv_arr.copy_from_slice(recipient_x25519_priv);
+ let sk = X25519StaticSecret::from(priv_arr);
+ priv_arr.zeroize();
+
+ let peer = X25519PublicKey::from(wrapped.ephemeral_pub);
+ let shared = Zeroizing::new(sk.diffie_hellman(&peer).to_bytes());
+
+ let hk = Hkdf::<Sha256>::new(None, shared.as_ref());
+ let mut kek = Zeroizing::new([0u8; 32]);
+ hk.expand(b"oversight-v1-dek-wrap", kek.as_mut())
+ .map_err(|_| CryptoError::Hkdf)?;
+
+ let plaintext = aead_decrypt(
+ kek.as_ref(),
+ &wrapped.nonce,
+ &wrapped.wrapped_dek,
+ b"oversight-dek",
+ )?;
+ Ok(Zeroizing::new(plaintext))
+}
+
+// -------------------------- Signatures --------------------------
+
+pub fn sign_message(msg: &[u8], ed25519_priv: &[u8]) -> Result<[u8; ED25519_SIG_LEN], CryptoError> {
+ if ed25519_priv.len() != ED25519_KEY_LEN {
+ return Err(CryptoError::InvalidKeyLength {
+ expected: ED25519_KEY_LEN,
+ got: ed25519_priv.len(),
+ });
+ }
+ let mut seed = [0u8; ED25519_KEY_LEN];
+ seed.copy_from_slice(ed25519_priv);
+ let signing = EdSigningKey::from_bytes(&seed);
+ seed.zeroize();
+ Ok(signing.sign(msg).to_bytes())
+}
+
+pub fn verify_message(msg: &[u8], sig: &[u8], ed25519_pub: &[u8]) -> bool {
+ if sig.len() != ED25519_SIG_LEN || ed25519_pub.len() != ED25519_KEY_LEN {
+ return false;
+ }
+ let mut pub_arr = [0u8; ED25519_KEY_LEN];
+ pub_arr.copy_from_slice(ed25519_pub);
+ let verifying = match EdVerifyingKey::from_bytes(&pub_arr) {
+ Ok(v) => v,
+ Err(_) => return false,
+ };
+ let mut sig_arr = [0u8; ED25519_SIG_LEN];
+ sig_arr.copy_from_slice(sig);
+ let signature = EdSignature::from_bytes(&sig_arr);
+ verifying.verify(msg, &signature).is_ok()
+}
+
+// -------------------------- Utility --------------------------
+
+pub fn random_dek() -> Zeroizing<[u8; DEK_LEN]> {
+ let mut dek = Zeroizing::new([0u8; DEK_LEN]);
+ OsRng.fill_bytes(dek.as_mut());
+ dek
+}
+
+pub fn content_hash(data: &[u8]) -> String {
+ let mut h = Sha256::new();
+ h.update(data);
+ hex::encode(h.finalize())
+}
+
+// -------------------------- Tests --------------------------
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+
+ #[test]
+ fn aead_round_trip() {
+ let key = [42u8; XCHACHA_KEY_LEN];
+ let (nonce, ct) = aead_encrypt(&key, b"hello world", b"aad-test").unwrap();
+ let pt = aead_decrypt(&key, &nonce, &ct, b"aad-test").unwrap();
+ assert_eq!(pt, b"hello world");
+ }
+
+ #[test]
+ fn aead_tamper_rejected() {
+ let key = [42u8; XCHACHA_KEY_LEN];
+ let (nonce, mut ct) = aead_encrypt(&key, b"hello world", b"").unwrap();
+ ct[0] ^= 0x01;
+ assert!(aead_decrypt(&key, &nonce, &ct, b"").is_err());
+ }
+
+ #[test]
+ fn aead_wrong_aad_rejected() {
+ let key = [42u8; XCHACHA_KEY_LEN];
+ let (nonce, ct) = aead_encrypt(&key, b"hello world", b"correct").unwrap();
+ assert!(aead_decrypt(&key, &nonce, &ct, b"wrong").is_err());
+ }
+
+ #[test]
+ fn wrap_unwrap_round_trip() {
+ let alice = ClassicIdentity::generate();
+ let dek = random_dek();
+ let wrapped = wrap_dek_for_recipient(dek.as_ref(), &alice.x25519_pub).unwrap();
+ let recovered = unwrap_dek(&wrapped, alice.x25519_priv.as_ref()).unwrap();
+ assert_eq!(&recovered[..], dek.as_ref());
+ }
+
+ #[test]
+ fn wrap_wrong_recipient_rejected() {
+ let alice = ClassicIdentity::generate();
+ let bob = ClassicIdentity::generate();
+ let dek = random_dek();
+ let wrapped = wrap_dek_for_recipient(dek.as_ref(), &alice.x25519_pub).unwrap();
+ // Bob tries to unwrap -- AEAD tag check will fail
+ assert!(unwrap_dek(&wrapped, bob.x25519_priv.as_ref()).is_err());
+ }
+
+ #[test]
+ fn sign_verify_round_trip() {
+ let id = ClassicIdentity::generate();
+ let sig = sign_message(b"test message", id.ed25519_priv.as_ref()).unwrap();
+ assert!(verify_message(b"test message", &sig, &id.ed25519_pub));
+ assert!(!verify_message(b"tampered message", &sig, &id.ed25519_pub));
+ }
+
+ #[test]
+ fn json_round_trip() {
+ let alice = ClassicIdentity::generate();
+ let dek = random_dek();
+ let wrapped = wrap_dek_for_recipient(dek.as_ref(), &alice.x25519_pub).unwrap();
+ let json = wrapped.to_json_hex();
+ let parsed = WrappedDek::from_json_hex(&json).unwrap();
+ assert_eq!(wrapped, parsed);
+ }
+}
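
Note on the random-nonce margin: the `aead_encrypt` doc comment cites a 2^96 figure for random 24-byte nonces. As a quick check, here is the standard union (birthday) bound for $n$ messages encrypted under one key with uniformly random 192-bit nonces. This is general background, not something stated in the commit.

```latex
\Pr[\text{nonce collision}]
  \;\le\; \binom{n}{2}\, 2^{-192}
  \;<\; \frac{n^{2}}{2^{193}},
\qquad
n = 2^{48} \;\Rightarrow\; \Pr < 2^{-97},
\qquad
n = 2^{96} \;\Rightarrow\; \frac{n^{2}}{2^{193}} = \frac{1}{2}.
```

In this code the margin is larger still in practice: `seal` and `seal_multi` draw a fresh DEK per file, and `wrap_dek_for_recipient` derives each KEK from a fresh ephemeral key, so each key encrypts a single message.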
oversight-rust/oversight-manifest/Cargo.toml +16 -0
@@ -0,0 +1,16 @@
+[package]
+name = "oversight-manifest"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "Signed canonical-JSON manifest for Oversight"
+
+[dependencies]
+oversight-crypto = { path = "../oversight-crypto" }
+serde.workspace = true
+serde_json.workspace = true
+serde_jcs.workspace = true
+hex.workspace = true
+uuid.workspace = true
+thiserror.workspace = true
oversight-rust/oversight-manifest/src/lib.rs +236 -0
@@ -0,0 +1,236 @@
+//! # oversight-manifest
+//!
+//! The signed metadata that binds a sealed file to its recipient, watermarks,
+//! beacons, and policy. It's the artifact a registry stores and a verifier checks.
+//!
+//! Wire format: canonical JSON (JCS, RFC 8785), UTF-8, Ed25519-signed.
+
+use oversight_crypto::{self as crypto, CryptoError};
+use serde::{Deserialize, Serialize};
+use thiserror::Error;
+
+#[derive(Debug, Error)]
+pub enum ManifestError {
+ #[error(transparent)]
+ Crypto(#[from] CryptoError),
+ #[error("json error: {0}")]
+ Json(#[from] serde_json::Error),
+ #[error("signature missing or empty")]
+ MissingSignature,
+ #[error("issuer pubkey missing or empty")]
+ MissingIssuer,
+ #[error("hex decode: {0}")]
+ Hex(#[from] hex::FromHexError),
+ #[error("canonicalization failed")]
+ Canonicalization,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
+pub struct Recipient {
+ pub recipient_id: String,
+ pub x25519_pub: String,
+ #[serde(default, skip_serializing_if = "Option::is_none")]
+ pub ed25519_pub: Option<String>,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
+pub struct WatermarkRef {
+ pub layer: String,
+ pub mark_id: String,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
+#[serde(default)]
+pub struct Manifest {
+ pub file_id: String,
+ pub issued_at: i64,
+ pub version: String,
+ pub suite: String,
+ pub original_filename: String,
+ pub content_hash: String,
+ pub content_type: String,
+ pub size_bytes: u64,
+ pub issuer_id: String,
+ pub issuer_ed25519_pub: String,
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub recipient: Option<Recipient>,
+ pub watermarks: Vec<WatermarkRef>,
+ pub beacons: Vec<serde_json::Value>,
+ pub policy: serde_json::Value,
+ pub signature_ed25519: String,
+ pub signature_ml_dsa: String,
+}
+
+impl Default for Manifest {
+ fn default() -> Self {
+ Self {
+ file_id: String::new(),
+ issued_at: 0,
+ version: "OVERSIGHT-v1".into(),
+ suite: crypto::SUITE_CLASSIC_V1.into(),
+ original_filename: String::new(),
+ content_hash: String::new(),
+ content_type: "application/octet-stream".into(),
+ size_bytes: 0,
+ issuer_id: String::new(),
+ issuer_ed25519_pub: String::new(),
+ recipient: None,
+ watermarks: Vec::new(),
+ beacons: Vec::new(),
+ policy: serde_json::json!({}),
+ signature_ed25519: String::new(),
+ signature_ml_dsa: String::new(),
+ }
+ }
+}
+
+impl Manifest {
+ pub fn new(
+ original_filename: impl Into<String>,
+ content_hash: impl Into<String>,
+ size_bytes: u64,
+ issuer_id: impl Into<String>,
+ issuer_ed25519_pub_hex: impl Into<String>,
+ recipient: Recipient,
+ registry_url: impl Into<String>,
+ content_type: impl Into<String>,
+ not_after: Option<i64>,
+ max_opens: Option<u64>,
+ jurisdiction: impl Into<String>,
+ ) -> Self {
+ let mut policy = serde_json::json!({
+ "registry_url": registry_url.into(),
+ "jurisdiction": jurisdiction.into(),
+ });
+ if let Some(na) = not_after {
+ policy["not_after"] = serde_json::json!(na);
+ }
+ if let Some(mx) = max_opens {
+ policy["max_opens"] = serde_json::json!(mx);
+ }
+
+ Self {
+ file_id: uuid::Uuid::new_v4().to_string(),
+ issued_at: std::time::SystemTime::now()
+ .duration_since(std::time::UNIX_EPOCH)
+ .map(|d| d.as_secs() as i64)
+ .unwrap_or(0),
+ original_filename: original_filename.into(),
+ content_hash: content_hash.into(),
+ content_type: content_type.into(),
+ size_bytes,
+ issuer_id: issuer_id.into(),
+ issuer_ed25519_pub: issuer_ed25519_pub_hex.into(),
+ recipient: Some(recipient),
+ policy,
+ ..Default::default()
+ }
+ }
+
+ /// Canonical bytes with both signature fields blanked - this is what gets signed.
+ pub fn canonical_bytes(&self) -> Result<Vec<u8>, ManifestError> {
+ let mut v = serde_json::to_value(self)?;
+ // Blank (not remove) both signature fields so the signed bytes never depend on them.
+ if let Some(obj) = v.as_object_mut() {
+ obj.insert("signature_ed25519".into(), serde_json::json!(""));
+ obj.insert("signature_ml_dsa".into(), serde_json::json!(""));
+ }
+ serde_jcs::to_vec(&v).map_err(|_| ManifestError::Canonicalization)
+ }
+
+ pub fn to_json(&self) -> Result<Vec<u8>, ManifestError> {
+ let v = serde_json::to_value(self)?;
+ serde_jcs::to_vec(&v).map_err(|_| ManifestError::Canonicalization)
+ }
+
+ pub fn from_json(bytes: &[u8]) -> Result<Self, ManifestError> {
+ let m: Manifest = serde_json::from_slice(bytes)?;
+ Ok(m)
+ }
+
+ pub fn sign(&mut self, issuer_ed25519_priv: &[u8]) -> Result<(), ManifestError> {
+ let bytes = self.canonical_bytes()?;
+ let sig = crypto::sign_message(&bytes, issuer_ed25519_priv)?;
+ self.signature_ed25519 = hex::encode(sig);
+ Ok(())
+ }
+
+ pub fn verify(&self) -> Result<bool, ManifestError> {
+ if self.signature_ed25519.is_empty() {
+ return Ok(false);
+ }
+ if self.issuer_ed25519_pub.is_empty() {
+ return Ok(false);
+ }
+ let bytes = self.canonical_bytes()?;
+ let sig = hex::decode(&self.signature_ed25519)?;
+ let pub_key = hex::decode(&self.issuer_ed25519_pub)?;
+ Ok(crypto::verify_message(&bytes, &sig, &pub_key))
+ }
+}
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+ use oversight_crypto::ClassicIdentity;
+
+ #[test]
+ fn sign_verify_round_trip() {
+ let issuer = ClassicIdentity::generate();
+ let recipient = ClassicIdentity::generate();
+
+ let mut m = Manifest::new(
+ "doc.txt",
+ crypto::content_hash(b"hello world"),
+ 11,
+ "issuer@test",
+ hex::encode(issuer.ed25519_pub),
+ Recipient {
+ recipient_id: "alice@test".into(),
+ x25519_pub: hex::encode(recipient.x25519_pub),
+ ed25519_pub: None,
+ },
+ "https://registry.test",
+ "text/plain",
+ None,
+ None,
+ "GLOBAL",
+ );
+
+ m.sign(issuer.ed25519_priv.as_ref()).unwrap();
+ assert!(m.verify().unwrap());
+
+ // Tamper: mutate content_hash
+ m.content_hash = "tampered".into();
+ assert!(!m.verify().unwrap());
+ }
+
+ #[test]
+ fn json_round_trip() {
+ let issuer = ClassicIdentity::generate();
+ let recipient = ClassicIdentity::generate();
+ let mut m = Manifest::new(
+ "doc.txt",
+ "abc123",
+ 42,
+ "issuer@test",
+ hex::encode(issuer.ed25519_pub),
+ Recipient {
+ recipient_id: "alice@test".into(),
+ x25519_pub: hex::encode(recipient.x25519_pub),
+ ed25519_pub: None,
+ },
+ "https://registry.test",
+ "text/plain",
+ None,
+ None,
+ "GLOBAL",
+ );
+ m.sign(issuer.ed25519_priv.as_ref()).unwrap();
+
+ let bytes = m.to_json().unwrap();
+ let parsed = Manifest::from_json(&bytes).unwrap();
+ assert_eq!(m, parsed);
+ assert!(parsed.verify().unwrap());
+ }
+}
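Note on `canonical_bytes` above: the signature fields are blanked rather than removed, so the signed bytes are the same before and after `sign()` fills them in. The following is a minimal std-only sketch of that idea, not the crate's implementation. The `BTreeMap` representation and the toy serializer are assumptions for illustration (real JCS per RFC 8785 also handles escaping, numbers, and nesting); only the two signature field names come from the source.

```rust
use std::collections::BTreeMap;

/// Toy canonical form: keys sorted, no whitespace. Assumes ASCII keys and
/// values that need no JSON escaping (hypothetical simplification of JCS).
fn canonical_bytes(fields: &BTreeMap<String, String>) -> Vec<u8> {
    let mut signable = fields.clone();
    // Blank, rather than remove, the signature fields before serializing.
    signable.insert("signature_ed25519".to_string(), String::new());
    signable.insert("signature_ml_dsa".to_string(), String::new());
    let body: Vec<String> = signable
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", k, v))
        .collect();
    format!("{{{}}}", body.join(",")).into_bytes()
}

fn main() {
    let mut m = BTreeMap::new();
    m.insert("file_id".to_string(), "f-1".to_string());
    m.insert("content_hash".to_string(), "abc123".to_string());
    let unsigned = canonical_bytes(&m);

    // Filling in a signature must not change the bytes that were signed.
    m.insert("signature_ed25519".to_string(), "deadbeef".to_string());
    assert_eq!(unsigned, canonical_bytes(&m));

    // Changing any other field must change them.
    m.insert("content_hash".to_string(), "tampered".to_string());
    assert_ne!(unsigned, canonical_bytes(&m));
}
```

Blanking keeps the set of keys stable, which is one plausible reason to prefer it over removal when two implementations must agree byte for byte.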
oversight-rust/oversight-policy/Cargo.toml +17 -0
@@ -0,0 +1,17 @@
+[package]
+name = "oversight-policy"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "Policy enforcement for Oversight: not_after / not_before / max_opens / jurisdiction"
+
+[dependencies]
+oversight-manifest = { path = "../oversight-manifest" }
+serde.workspace = true
+serde_json.workspace = true
+thiserror.workspace = true
+fs2 = "0.4"
+
+[dev-dependencies]
+tempfile = "3"
oversight-rust/oversight-policy/src/lib.rs +351 -0
@@ -0,0 +1,351 @@
+//! # oversight-policy
+//!
+//! Policy enforcement for opens. Mirrors the Python `oversight_core.policy`
+//! module with the same TOCTOU-safe atomic check-and-bump for `max_opens`.
+//!
+//! ## Enforcement modes
+//!
+//! - **LocalOnly**: counter state in a per-file JSON, protected by an
+//! OS-level flock. Write-to-temp-then-rename for crash consistency.
+//! Single-user, no network.
+//! - **Registry**: counter lives in the registry (caller handles network).
+//! - **Hybrid**: prefer registry, fall back to local if offline.
+//!
+//! The LocalOnly mode is not secure against an attacker who can tamper with
+//! the state file (they can reset the counter by deleting the file). It is,
+//! however, safe against races between concurrent honest openers.
+
+use fs2::FileExt;
+use oversight_manifest::Manifest;
+use serde::{Deserialize, Serialize};
+use std::fs::{File, OpenOptions};
+use std::io::Write;
+use std::path::PathBuf;
+use std::time::{SystemTime, UNIX_EPOCH};
+use thiserror::Error;
+
+#[derive(Debug, Error)]
+pub enum PolicyError {
+ #[error("policy violation: {0}")]
+ Violation(String),
+ #[error("I/O: {0}")]
+ Io(#[from] std::io::Error),
+ #[error("JSON: {0}")]
+ Json(#[from] serde_json::Error),
+ #[error("context required for this policy but not provided")]
+ ContextRequired,
+ #[error("invalid file_id for counter path: {0:?}")]
+ BadFileId(String),
+}
+
+pub type Result<T> = std::result::Result<T, PolicyError>;
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum Mode {
+ LocalOnly,
+ Registry,
+ Hybrid,
+}
+
+/// State the opener needs to enforce policy.
+#[derive(Debug, Clone)]
+pub struct PolicyContext {
+ pub jurisdiction: String,
+ pub state_dir: Option<PathBuf>,
+ pub registry_url: Option<String>,
+ pub mode: Mode,
+}
+
+impl Default for PolicyContext {
+ fn default() -> Self {
+ PolicyContext {
+ jurisdiction: "GLOBAL".into(),
+ state_dir: None,
+ registry_url: None,
+ mode: Mode::LocalOnly,
+ }
+ }
+}
+
+impl PolicyContext {
+ pub fn local_only(state_dir: impl Into<PathBuf>) -> Result<Self> {
+ let dir: PathBuf = state_dir.into();
+ std::fs::create_dir_all(&dir)?;
+ Ok(Self {
+ jurisdiction: "GLOBAL".into(),
+ state_dir: Some(dir),
+ registry_url: None,
+ mode: Mode::LocalOnly,
+ })
+ }
+
+ pub fn with_jurisdiction(mut self, j: impl Into<String>) -> Self {
+ self.jurisdiction = j.into();
+ self
+ }
+}
+
+#[derive(Debug, Serialize, Deserialize)]
+struct CounterState {
+ count: u64,
+ last: i64,
+}
+
+fn now_unix() -> i64 {
+ SystemTime::now()
+ .duration_since(UNIX_EPOCH)
+ .map(|d| d.as_secs() as i64)
+ .unwrap_or(0)
+}
+
+fn sanitize_file_id(file_id: &str) -> Result<()> {
+ if file_id.is_empty()
+ || file_id.contains('/')
+ || file_id.contains('\\')
+ || file_id.contains("..")
+ || file_id.contains('\0')
+ {
+ return Err(PolicyError::BadFileId(file_id.to_string()));
+ }
+ Ok(())
+}
+
+fn counter_path(ctx: &PolicyContext, file_id: &str) -> Result<PathBuf> {
+ sanitize_file_id(file_id)?;
+ let dir = ctx
+ .state_dir
+ .as_ref()
+ .ok_or(PolicyError::ContextRequired)?;
+ Ok(dir.join(format!("{}.opens.json", file_id)))
+}
+
+fn lock_path(ctx: &PolicyContext, file_id: &str) -> Result<PathBuf> {
+ sanitize_file_id(file_id)?;
+ let dir = ctx
+ .state_dir
+ .as_ref()
+ .ok_or(PolicyError::ContextRequired)?;
+ Ok(dir.join(format!("{}.opens.lock", file_id)))
+}
+
+fn read_count(ctx: &PolicyContext, file_id: &str) -> u64 {
+ let p = match counter_path(ctx, file_id) {
+ Ok(p) => p,
+ Err(_) => return 0,
+ };
+ if !p.exists() {
+ return 0;
+ }
+ let text = match std::fs::read_to_string(&p) {
+ Ok(t) => t,
+ Err(_) => return 0,
+ };
+ match serde_json::from_str::<CounterState>(&text) {
+ Ok(cs) => cs.count,
+ Err(_) => 0,
+ }
+}
+
+/// Atomic check-and-bump: grab a file lock, read the count, and if it is below
+/// max_opens, bump it and sync the new value to disk; otherwise return
+/// `PolicyError::Violation`. TOCTOU-safe across concurrent openers of the same file.
+fn local_check_and_bump(ctx: &PolicyContext, file_id: &str, max_opens: u64) -> Result<u64> {
+ let state_dir = ctx.state_dir.as_ref().ok_or(PolicyError::ContextRequired)?;
+ std::fs::create_dir_all(state_dir)?;
+
+ let lock_path_buf = lock_path(ctx, file_id)?;
+ let counter_path_buf = counter_path(ctx, file_id)?;
+
+ // Open or create the lock file and acquire an exclusive OS-level lock.
+ let lock_file = OpenOptions::new()
+ .create(true)
+ .read(true)
+ .write(true)
+ .open(&lock_path_buf)?;
+ lock_file.lock_exclusive()?;
+
+ // Critical section: read current count, check, write new.
+ let cur = read_count(ctx, file_id);
+ if cur >= max_opens {
+ // Release explicitly; the lock would also be released when lock_file is dropped.
+ FileExt::unlock(&lock_file)?;
+ return Err(PolicyError::Violation(format!(
+ "Open limit reached: max_opens={max_opens}, already opened {cur} times"
+ )));
+ }
+ let new_count = cur + 1;
+
+ // Atomic write: temp file in the same directory, then rename.
+ let state = CounterState {
+ count: new_count,
+ last: now_unix(),
+ };
+ let tmp_path = state_dir.join(format!(".{}.opens.tmp.{}", file_id, std::process::id()));
+ {
+ let mut tmp = OpenOptions::new()
+ .create(true)
+ .write(true)
+ .truncate(true)
+ .open(&tmp_path)?;
+ tmp.write_all(serde_json::to_string(&state)?.as_bytes())?;
+ tmp.flush()?;
+ tmp.sync_data()?;
+ }
+ std::fs::rename(&tmp_path, &counter_path_buf)?;
+
+ FileExt::unlock(&lock_file)?;
+ Ok(new_count)
+}
+
+/// Cheap, read-only policy checks (time window, jurisdiction; the latter only when `ctx` is Some).
+/// max_opens is enforced separately in `record_open` to prevent TOCTOU.
+pub fn check_policy(manifest: &Manifest, ctx: Option<&PolicyContext>) -> Result<()> {
+ let now = now_unix();
+
+ if let Some(na) = manifest.policy.get("not_after").and_then(|v| v.as_i64()) {
+ if now > na {
+ let ago_h = (now - na) / 3600;
+ return Err(PolicyError::Violation(format!(
+ "File expired: not_after={na}, now={now} ({ago_h}h ago)"
+ )));
+ }
+ }
+ if let Some(nb) = manifest.policy.get("not_before").and_then(|v| v.as_i64()) {
+ if now < nb {
+ let in_m = (nb - now) / 60;
+ return Err(PolicyError::Violation(format!(
+ "File not yet released: not_before={nb}, now={now} (available in {in_m}m)"
+ )));
+ }
+ }
+
+ if let Some(required) = manifest.policy.get("jurisdiction").and_then(|v| v.as_str()) {
+ if required != "GLOBAL" {
+ if let Some(ctx) = ctx {
+ if required != ctx.jurisdiction {
+ return Err(PolicyError::Violation(format!(
+ "Jurisdiction mismatch: file requires '{required}', opener is in '{}'",
+ ctx.jurisdiction
+ )));
+ }
+ }
+ }
+ }
+
+ // max_opens NOT checked here - it's checked atomically in record_open.
+ Ok(())
+}
+
+/// Atomic check-and-bump the open counter (if policy has max_opens).
+/// Call BEFORE decryption so plaintext is never computed when limit is exceeded.
+/// Returns the new count, or 0 if `ctx` is None or the policy has no max_opens.
+pub fn record_open(manifest: &Manifest, ctx: Option<&PolicyContext>) -> Result<u64> {
+ let ctx = match ctx {
+ Some(c) => c,
+ None => return Ok(0),
+ };
+ let mx = match manifest.policy.get("max_opens").and_then(|v| v.as_u64()) {
+ Some(m) => m,
+ None => return Ok(0),
+ };
+ match ctx.mode {
+ Mode::LocalOnly | Mode::Hybrid | Mode::Registry => {
+ // Registry and Hybrid currently fall back to the local counter; real registry handling would POST /policy/open.
+ local_check_and_bump(ctx, &manifest.file_id, mx)
+ }
+ }
+}
+
+// Keep the `File` import referenced so non-test builds do not emit an unused-import warning.
+#[allow(dead_code)]
+fn _unused_lock_file_param(_: &File) {}
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+ use oversight_manifest::{Manifest, Recipient};
+ use tempfile::TempDir;
+
+ fn make_manifest_with(policy: serde_json::Value) -> Manifest {
+ let mut m = Manifest::new(
+ "test.txt",
+ "abc",
+ 10,
+ "issuer",
+ "00".repeat(32),
+ Recipient {
+ recipient_id: "alice".into(),
+ x25519_pub: "00".repeat(32),
+ ed25519_pub: None,
+ },
+ "https://registry",
+ "text/plain",
+ None,
+ None,
+ "GLOBAL",
+ );
+ m.policy = policy;
+ m
+ }
+
+ #[test]
+ fn not_after_expired_rejected() {
+ let m = make_manifest_with(serde_json::json!({
+ "jurisdiction": "GLOBAL",
+ "not_after": 1000, // long ago
+ }));
+ let err = check_policy(&m, None).unwrap_err();
+ assert!(matches!(err, PolicyError::Violation(_)));
+ }
+
+ #[test]
+ fn not_before_future_rejected() {
+ let m = make_manifest_with(serde_json::json!({
+ "jurisdiction": "GLOBAL",
+ "not_before": now_unix() + 3600, // 1h from now
+ }));
+ assert!(check_policy(&m, None).is_err());
+ }
+
+ #[test]
+ fn jurisdiction_mismatch_rejected() {
+ let m = make_manifest_with(serde_json::json!({
+ "jurisdiction": "EU",
+ }));
+ let dir = TempDir::new().unwrap();
+ let ctx = PolicyContext::local_only(dir.path()).unwrap().with_jurisdiction("US");
+ assert!(check_policy(&m, Some(&ctx)).is_err());
+ }
+
+ #[test]
+ fn jurisdiction_global_ok_without_ctx() {
+ let m = make_manifest_with(serde_json::json!({
+ "jurisdiction": "GLOBAL",
+ }));
+ assert!(check_policy(&m, None).is_ok());
+ }
+
+ #[test]
+ fn max_opens_enforced() {
+ let dir = TempDir::new().unwrap();
+ let ctx = PolicyContext::local_only(dir.path()).unwrap();
+ let m = make_manifest_with(serde_json::json!({
+ "jurisdiction": "GLOBAL",
+ "max_opens": 2,
+ }));
+ assert_eq!(record_open(&m, Some(&ctx)).unwrap(), 1);
+ assert_eq!(record_open(&m, Some(&ctx)).unwrap(), 2);
+ assert!(record_open(&m, Some(&ctx)).is_err()); // 3rd exceeds
+ }
+
+ #[test]
+ fn file_id_sanitization() {
+ let dir = TempDir::new().unwrap();
+ let ctx = PolicyContext::local_only(dir.path()).unwrap();
+ let mut m = make_manifest_with(serde_json::json!({
+ "max_opens": 5,
+ }));
+ m.file_id = "../../../etc/passwd".into();
+ assert!(record_open(&m, Some(&ctx)).is_err());
+ }
+}
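Note on `local_check_and_bump` above: the counter update uses write-to-temp-then-rename so a crash never leaves a half-written state file. The following is a minimal std-only sketch of that step, under stated assumptions: the plain-text counter format, the `check_and_bump` name and signature, and the `Option` return are hypothetical, and the exclusive file lock that the crate takes via `fs2` is deliberately omitted, so this sketch alone is not safe against concurrent openers.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Read the stored count; a missing or unparsable file counts as 0.
fn read_count(path: &Path) -> u64 {
    fs::read_to_string(path)
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(0)
}

/// Returns Some(new_count) if the open is allowed, None if the limit is hit.
/// NOTE: no locking here; the real code holds an exclusive lock around this.
fn check_and_bump(dir: &Path, id: &str, max_opens: u64) -> std::io::Result<Option<u64>> {
    let counter = dir.join(format!("{id}.opens"));
    let cur = read_count(&counter);
    if cur >= max_opens {
        return Ok(None);
    }
    // Write the new value to a temp file in the same directory, sync it,
    // then rename over the counter so readers see old or new, never partial.
    let tmp = dir.join(format!(".{id}.opens.tmp.{}", std::process::id()));
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all((cur + 1).to_string().as_bytes())?;
        f.sync_data()?;
    }
    fs::rename(&tmp, &counter)?;
    Ok(Some(cur + 1))
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join(format!("opens-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    assert_eq!(check_and_bump(&dir, "demo", 2)?, Some(1));
    assert_eq!(check_and_bump(&dir, "demo", 2)?, Some(2));
    assert_eq!(check_and_bump(&dir, "demo", 2)?, None); // third open refused
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

Keeping the temp file in the same directory as the counter matters because a rename is only atomic within one filesystem.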
oversight-rust/oversight-semantic/Cargo.toml +12 -0
@@ -0,0 +1,12 @@
+[package]
+name = "oversight-semantic"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "L3 semantic watermarking for Oversight - airgap-strip-survivor synonym rotation"
+
+[dependencies]
+regex = "1"
+once_cell = "1"
+sha2.workspace = true
oversight-rust/oversight-semantic/examples/debug.rs +29 -0
@@ -0,0 +1,29 @@
+use oversight_semantic::{embed_synonyms, verify_synonyms, iter_matchable_words};
+
+const TEXT: &str = "Q3 revenue performance exceeded expectations across all business units. \
+The team plans to continue the expansion strategy outlined in our report at \
+https://internal.example.com/q3-2026.pdf and will begin hiring in \
+/home/claude/hiring_plan.docx this month. However, there are important risks \
+to consider before we commence the next phase.";
+
+fn main() {
+ let mark_a = b"\x01\x23\x45\x67\x89\xab\xcd\xef";
+ let mark_b: &[u8] = b"\xde\xad\xbe\xef\xfe\xed\xfa\xce";
+
+ for (name, mark) in [("A", &mark_a[..]), ("B", mark_b)] {
+ let marked = embed_synonyms(TEXT, mark, 5);
+ let matches_before = iter_matchable_words(TEXT).len();
+ let matches_after = iter_matchable_words(&marked).len();
+ let (ok, score) = verify_synonyms(&marked, mark, 0.70);
+ println!("mark {}: matches before={} after={}, verify ok={} score={:.3}",
+ name, matches_before, matches_after, ok, score);
+
+ // Print the first few matches before/after
+ let before: Vec<_> = iter_matchable_words(TEXT).iter().take(10)
+ .map(|m| (m.orig_word.clone(), m.class_index, m.variant_index)).collect();
+ let after: Vec<_> = iter_matchable_words(&marked).iter().take(10)
+ .map(|m| (m.orig_word.clone(), m.class_index, m.variant_index)).collect();
+ println!(" first 10 before: {:?}", before);
+ println!(" first 10 after: {:?}", after);
+ }
+}
oversight-rust/oversight-semantic/src/lib.rs +371 -0
@@ -0,0 +1,371 @@
+//! # oversight-semantic
+//!
+//! L3 semantic watermarking - airgap-strip-survivor watermarking by
+//! rotating words between synonym classes. Mirrors the Python
+//! `oversight_core.semantic` and `oversight_core.synonyms_v2` modules.
+//!
+//! ## Threat model
+//!
+//! L1 (zero-width unicode) and L2 (trailing whitespace) survive copy-paste
+//! but fall to a "normalize & retype" attacker who opens the file in an
+//! airgapped VM, strips invisibles and whitespace, and writes a clean
+//! version. L3 survives that attack because the mark lives in **which
+//! words were chosen**, not in invisible characters.
+//!
+//! ## Algorithm
+//!
+//! Per match (word that's a member of a known synonym class):
+//! - Derive a deterministic variant index from the mark_id + position counter.
+//! - Replace the word with the selected variant, preserving original case.
+//!
+//! Recovery iterates candidate mark_ids from the registry and computes the
+//! correlation score (fraction of matches that agree with the expected
+//! variant sequence). A score >= 0.70 (default threshold) counts as attribution.
+
+use once_cell::sync::Lazy;
+use regex::Regex;
+use sha2::{Digest, Sha256};
+use std::collections::HashMap;
+
+/// A synonym class: N semantically equivalent words, tagged with part of speech.
+#[derive(Debug, Clone, Copy)]
+pub struct SC {
+ pub variants: &'static [&'static str],
+ pub pos: &'static str,
+}
+
+impl SC {
+ pub const fn new(variants: &'static [&'static str], pos: &'static str) -> Self {
+ SC { variants, pos }
+ }
+}
+
+// The 151-class dictionary, generated from Python oversight_core/synonyms_v2.py.
+include!("synonyms_v2_data.rs");
+
+/// Total number of synonym classes.
+pub fn class_count() -> usize {
+ CLASSES.len()
+}
+
+/// Build a lowercase-word → (class_index, variant_index, pos) lookup.
+/// First occurrence wins for ambiguous words. Only indexes single-word variants.
+static LOOKUP: Lazy<HashMap<&'static str, (usize, usize, &'static str)>> = Lazy::new(|| {
+ let mut m = HashMap::new();
+ for (ci, cls) in CLASSES.iter().enumerate() {
+ for (vi, w) in cls.variants.iter().enumerate() {
+ if !w.contains(' ') && !m.contains_key(*w) {
+ m.insert(*w, (ci, vi, cls.pos));
+ }
+ }
+ }
+ m
+});
+
+static ZW_CHARS: &[char] = &['\u{200b}', '\u{200c}', '\u{200d}', '\u{feff}'];
+
+fn strip_zw(s: &str) -> String {
+ s.chars().filter(|c| !ZW_CHARS.contains(c)).collect()
+}
+
+// Skip regions: URLs, emails, code spans, file paths, hex blobs, base64 blobs.
+static URL_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"https?://\S+").unwrap());
+static EMAIL_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"\b[\w.+-]+@[\w.-]+\.\w+\b").unwrap());
+static INLINE_CODE_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"`[^`]+`").unwrap());
+static CODE_BLOCK_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"(?s)```.*?```").unwrap());
+static UNIX_PATH_RE: Lazy<Regex> =
+ Lazy::new(|| Regex::new(r"(?:^|\s)(?:/|~/|\./)[^\s]+").unwrap());
+static HEX_BLOB_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"\b[A-Fa-f0-9]{16,}\b").unwrap());
+static BASE64_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"\b[A-Za-z0-9+/]{32,}={0,2}\b").unwrap());
+static WORD_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"\b([A-Za-z]+)\b").unwrap());
+
+/// Compute which byte positions in `text` are inside skip regions.
+fn skip_mask(text: &str) -> Vec<bool> {
+ let mut mask = vec![false; text.len()];
+ let patterns: &[&Lazy<Regex>] = &[
+ &URL_RE, &EMAIL_RE, &INLINE_CODE_RE, &CODE_BLOCK_RE, &UNIX_PATH_RE,
+ &HEX_BLOB_RE, &BASE64_RE,
+ ];
+ for pat in patterns {
+ for m in pat.find_iter(text) {
+ for i in m.start()..m.end() {
+ if i < mask.len() {
+ mask[i] = true;
+ }
+ }
+ }
+ }
+ mask
+}
+
+/// A matchable word in the text with its class/variant assignment.
+#[derive(Debug, Clone)]
+pub struct Match {
+ pub start: usize,
+ pub end: usize,
+ pub orig_word: String,
+ pub class_index: usize,
+ pub variant_index: usize,
+ pub pos: &'static str,
+}
+
+/// Walk text and yield every word that is (a) in the synonym table,
+/// (b) not inside a skip region (URL, email, code span, path, hex or base64 blob).
+pub fn iter_matchable_words(text: &str) -> Vec<Match> {
+ let mask = skip_mask(text);
+ let mut out = Vec::new();
+ for m in WORD_RE.find_iter(text) {
+ // Skip if any byte of the match is in a skip region.
+ let mut in_skip = false;
+ for i in m.start()..m.end() {
+ if i < mask.len() && mask[i] {
+ in_skip = true;
+ break;
+ }
+ }
+ if in_skip {
+ continue;
+ }
+ let word = m.as_str();
+ let key = word.to_lowercase();
+ if let Some(&(ci, vi, pos)) = LOOKUP.get(key.as_str()) {
+ out.push(Match {
+ start: m.start(),
+ end: m.end(),
+ orig_word: word.to_string(),
+ class_index: ci,
+ variant_index: vi,
+ pos,
+ });
+ }
+ }
+ out
+}
+
+/// Is this variant safe to round-trip through our single-word matcher?
+/// Variants with whitespace or hyphens break because WORD_RE only matches
+/// [A-Za-z]+ - `write-up` gets tokenized as two words and neither is in
+/// the lookup, desyncing the variant sequence.
+fn is_round_trippable(variant: &str) -> bool {
+ !variant.contains(' ') && !variant.contains('-')
+}
+
+/// Derive a deterministic variant sequence from a mark_id using SHA-256(mark_id || counter).
+/// Yields `n_matches` indices, each a digest byte reduced modulo `class_size` (v2 uses 3 variants per class).
+fn mark_id_to_variant_sequence(mark_id: &[u8], n_matches: usize, class_size: usize) -> Vec<usize> {
+ let mut out = Vec::with_capacity(n_matches);
+ let mut counter: u64 = 0;
+ while out.len() < n_matches {
+ let mut h = Sha256::new();
+ h.update(mark_id);
+ h.update(&counter.to_be_bytes());
+ let digest = h.finalize();
+ for b in digest.iter() {
+ if out.len() >= n_matches {
+ break;
+ }
+ out.push((*b as usize) % class_size);
+ }
+ counter += 1;
+ }
+ out
+}
+
+/// Preserve the case pattern of `orig` when emitting `replacement`.
+/// - all upper: UPPERCASE replacement
+/// - first upper rest lower: Title Case replacement
+/// - otherwise: lowercase replacement
+fn case_preserve(replacement: &str, orig: &str) -> String {
+ if orig.chars().all(|c| c.is_uppercase() || !c.is_alphabetic()) && orig.len() > 1 {
+ return replacement.to_uppercase();
+ }
+ let first_upper = orig.chars().next().map(|c| c.is_uppercase()).unwrap_or(false);
+ let rest_lower = orig.chars().skip(1).all(|c| c.is_lowercase() || !c.is_alphabetic());
+ if first_upper && rest_lower {
+ let mut s = String::new();
+ for (i, c) in replacement.chars().enumerate() {
+ if i == 0 {
+ for uc in c.to_uppercase() {
+ s.push(uc);
+ }
+ } else {
+ s.push(c);
+ }
+ }
+ return s;
+ }
+ replacement.to_lowercase()
+}
+
+/// Embed a mark_id into the text via synonym rotation.
+///
+/// If the text has fewer than `min_instances` matchable words, returns the
+/// text unchanged - no silent partial marking (the Python impl prints a
+/// warning; here we just return unchanged and let the caller decide).
+pub fn embed_synonyms(text: &str, mark_id: &[u8], min_instances: usize) -> String {
+ let matches = iter_matchable_words(text);
+ if matches.len() < min_instances {
+ return text.to_string();
+ }
+ let variants = mark_id_to_variant_sequence(mark_id, matches.len(), 3);
+ let mut out = String::with_capacity(text.len());
+ let mut cursor = 0usize;
+ for (m, &target) in matches.iter().zip(variants.iter()) {
+ let cls = &CLASSES[m.class_index];
+ let mut vi = target % cls.variants.len();
+ // Skip multi-word and hyphenated variants - our matcher only sees
+ // single unbroken A-Za-z words, and these would desync verify.
+ for _ in 0..cls.variants.len() {
+ if is_round_trippable(cls.variants[vi]) {
+ break;
+ }
+ vi = (vi + 1) % cls.variants.len();
+ }
+ if !is_round_trippable(cls.variants[vi]) {
+ // All variants are non-round-trippable (shouldn't happen); keep original
+ out.push_str(&text[cursor..m.end]);
+ cursor = m.end;
+ continue;
+ }
+ let replacement = case_preserve(cls.variants[vi], &m.orig_word);
+ out.push_str(&text[cursor..m.start]);
+ out.push_str(&replacement);
+ cursor = m.end;
+ }
+ out.push_str(&text[cursor..]);
+ out
+}
+
+/// Verify whether `text` carries `candidate_mark_id`. Returns (match, score).
+///
+/// The score is the fraction of matchable words whose variant matches the
+/// expected variant for the candidate mark_id. The tests and example pass a threshold of 0.70.
+pub fn verify_synonyms(text: &str, candidate_mark_id: &[u8], threshold: f64) -> (bool, f64) {
+ let text = strip_zw(text);
+ let actual: Vec<(usize, usize)> = iter_matchable_words(&text)
+ .into_iter()
+ .map(|m| (m.class_index, m.variant_index))
+ .collect();
+ if actual.is_empty() {
+ return (false, 0.0);
+ }
+ let expected = mark_id_to_variant_sequence(candidate_mark_id, actual.len(), 3);
+ let mut matches = 0usize;
+ let mut counted = 0usize;
+ for ((ci, actual_vi), &target) in actual.iter().zip(expected.iter()) {
+ let cls = &CLASSES[*ci];
+ counted += 1;
+ // Mirror embed's round-trippability skip: if target variant is
+ // not safely single-word, advance until it is (or give up).
+ let mut exp = target % cls.variants.len();
+ for _ in 0..cls.variants.len() {
+ if is_round_trippable(cls.variants[exp]) {
+ break;
+ }
+ exp = (exp + 1) % cls.variants.len();
+ }
+ if !is_round_trippable(cls.variants[exp]) {
+ // All variants non-round-trippable - embed kept original. Count as match.
+ matches += 1;
+ continue;
+ }
+ if exp == *actual_vi {
+ matches += 1;
+ }
+ }
+ let score = matches as f64 / counted as f64;
+ (score >= threshold, score)
+}
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+
+ const TEST_TEXT: &str = "Q3 revenue performance exceeded expectations across all business units. \
+The team plans to continue the expansion strategy outlined in our report at \
+https://internal.example.com/q3-2026.pdf and will begin hiring in \
+/home/claude/hiring_plan.docx this month. However, there are important risks \
+to consider before we commence the next phase. We need to carefully review \
+the competitive situation and determine whether our current approach is the \
+right one. The board will also request that we improve internal reporting \
+and reduce operational overhead. It is difficult to know exactly how quickly \
+the market will change, but we should respond rapidly when opportunities appear. \
+Overall the results show clear momentum and a strong basis for continued growth.";
+
+ #[test]
+ fn dict_has_expected_size() {
+ // We ported 50 + 43 + 20 + 30 + 8 = 151 classes.
+ assert_eq!(class_count(), 151);
+ }
+
+ #[test]
+ fn matcher_finds_words() {
+ let matches = iter_matchable_words(TEST_TEXT);
+ assert!(
+ matches.len() >= 10,
+ "expected at least 10 matchable words, got {}",
+ matches.len()
+ );
+ }
+
+ #[test]
+ fn url_and_path_preserved_through_embed() {
+ let mark = b"\x01\x23\x45\x67\x89\xab\xcd\xef";
+ let marked = embed_synonyms(TEST_TEXT, mark, 5);
+ assert!(marked.contains("https://internal.example.com/q3-2026.pdf"),
+ "URL was munged");
+ assert!(marked.contains("/home/claude/hiring_plan.docx"),
+ "path was munged");
+ }
+
+ #[test]
+ fn correct_mark_verifies_with_high_score() {
+ let mark = b"\x01\x23\x45\x67\x89\xab\xcd\xef";
+ let marked = embed_synonyms(TEST_TEXT, mark, 5);
+ let (ok, score) = verify_synonyms(&marked, mark, 0.70);
+ assert!(ok, "correct mark failed to verify");
+ assert!(score > 0.95, "expected near-1.0 score, got {}", score);
+ }
+
+ #[test]
+ fn wrong_mark_rejected() {
+ let good = b"\x01\x23\x45\x67\x89\xab\xcd\xef";
+ let bad = b"\xff\xee\xdd\xcc\xbb\xaa\x99\x88";
+ let marked = embed_synonyms(TEST_TEXT, good, 5);
+ let (ok, score) = verify_synonyms(&marked, bad, 0.70);
+ assert!(!ok, "wrong mark verified (score={})", score);
+ assert!(score < 0.70, "wrong-mark score suspiciously high: {}", score);
+ }
+
+ #[test]
+ fn airgap_strip_survivor() {
+ // Simulate the attacker: strip zero-width chars + normalize trailing whitespace.
+ // The semantic mark should still survive.
+ let mark = b"\xde\xad\xbe\xef\xfe\xed\xfa\xce";
+ let marked = embed_synonyms(TEST_TEXT, mark, 5);
+ // Trim trailing whitespace on every line, then strip zero-width characters.
+ let stripped: String = marked
+ .lines()
+ .map(|l| l.trim_end().to_string())
+ .collect::<Vec<_>>()
+ .join("\n");
+ let stripped = strip_zw(&stripped);
+ let (ok, score) = verify_synonyms(&stripped, mark, 0.70);
+ assert!(ok, "airgap-strip broke L3 attribution (score={})", score);
+ }
+
+ #[test]
+ fn case_preserve_works() {
+ assert_eq!(case_preserve("start", "BEGIN"), "START");
+ assert_eq!(case_preserve("start", "Begin"), "Start");
+ assert_eq!(case_preserve("start", "begin"), "start");
+ }
+
+ #[test]
+ fn short_text_unchanged() {
+ let mark = b"\x01\x02\x03\x04\x05\x06\x07\x08";
+ let short = "Hello world";
+ let marked = embed_synonyms(short, mark, 5);
+ assert_eq!(marked, short); // below min_instances threshold
+ }
+}
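Note on `mark_id_to_variant_sequence` above: each digest byte is reduced modulo the class size, and since 256 is not a multiple of 3, the three variant indices are not exactly equally likely. The following is a small std-only sketch that counts how the 256 byte values fall into each index; `variant_histogram` is a hypothetical helper, not part of the crate.

```rust
/// Count how many of the 256 possible byte values map to each variant
/// index under `byte % class_size` (hypothetical helper for illustration).
fn variant_histogram(class_size: usize) -> Vec<usize> {
    let mut counts = vec![0usize; class_size];
    for b in 0u16..=255 {
        counts[(b as usize) % class_size] += 1;
    }
    counts
}

fn main() {
    // With 3 variants per class, index 0 gets one extra byte value.
    let h = variant_histogram(3);
    assert_eq!(h, vec![86, 85, 85]);
    // A power-of-two class size divides 256 evenly, so there is no skew.
    assert_eq!(variant_histogram(4), vec![64, 64, 64, 64]);
    println!("{:?}", h);
}
```

The skew is small (86 versus 85 out of 256). Assuming the digest bytes are close to uniform, a wrong mark_id should agree with roughly a third of the matches, well below the 0.70 threshold used in the tests.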
oversight-rust/oversight-semantic/src/synonyms_v2_data.rs +159 -0
@@ -0,0 +1,159 @@
+
+pub const CLASSES: &[SC] = &[
+ // verbs - 50 classes
+ SC::new(&["begin", "start", "commence"], "verb"),
+ SC::new(&["end", "finish", "conclude"], "verb"),
+ SC::new(&["use", "utilize", "employ"], "verb"),
+ SC::new(&["make", "create", "produce"], "verb"),
+ SC::new(&["get", "obtain", "acquire"], "verb"),
+ SC::new(&["find", "locate", "identify"], "verb"),
+ SC::new(&["show", "display", "present"], "verb"),
+ SC::new(&["tell", "inform", "notify"], "verb"),
+ SC::new(&["give", "provide", "supply"], "verb"),
+ SC::new(&["help", "assist", "aid"], "verb"),
+ SC::new(&["think", "believe", "consider"], "verb"),
+ SC::new(&["know", "understand", "recognize"], "verb"),
+ SC::new(&["see", "observe", "notice"], "verb"),
+ SC::new(&["want", "desire", "need"], "verb"),
+ SC::new(&["look", "appear", "seem"], "verb"),
+ SC::new(&["ask", "request", "query"], "verb"),
+ SC::new(&["send", "transmit", "deliver"], "verb"),
+ SC::new(&["allow", "permit", "enable"], "verb"),
+ SC::new(&["stop", "halt", "cease"], "verb"),
+ SC::new(&["continue", "proceed", "persist"], "verb"),
+ SC::new(&["try", "attempt", "endeavor"], "verb"),
+ SC::new(&["change", "modify", "alter"], "verb"),
+ SC::new(&["add", "append", "include"], "verb"),
+ SC::new(&["remove", "delete", "eliminate"], "verb"),
+ SC::new(&["check", "verify", "confirm"], "verb"),
+ SC::new(&["review", "examine", "evaluate"], "verb"),
+ SC::new(&["agree", "concur", "consent"], "verb"),
+ SC::new(&["decide", "determine", "resolve"], "verb"),
+ SC::new(&["require", "need", "demand"], "verb"),
+ SC::new(&["contain", "include", "hold"], "verb"),
+ SC::new(&["return", "yield", "give back"], "verb"),
+ SC::new(&["create", "generate", "build"], "verb"),
+ SC::new(&["destroy", "eliminate", "eradicate"], "verb"),
+ SC::new(&["improve", "enhance", "upgrade"], "verb"),
+ SC::new(&["protect", "safeguard", "defend"], "verb"),
+ SC::new(&["discuss", "address", "cover"], "verb"),
+ SC::new(&["explain", "clarify", "describe"], "verb"),
+ SC::new(&["propose", "suggest", "recommend"], "verb"),
+ SC::new(&["demonstrate", "show", "prove"], "verb"),
+ SC::new(&["achieve", "accomplish", "attain"], "verb"),
+ SC::new(&["manage", "handle", "administer"], "verb"),
+ SC::new(&["develop", "build", "engineer"], "verb"),
+ SC::new(&["establish", "set up", "institute"], "verb"),
+ SC::new(&["support", "back", "endorse"], "verb"),
+ SC::new(&["reject", "refuse", "decline"], "verb"),
+ SC::new(&["reduce", "decrease", "lower"], "verb"),
+ SC::new(&["increase", "raise", "boost"], "verb"),
+ SC::new(&["operate", "run", "function"], "verb"),
+ SC::new(&["execute", "perform", "run"], "verb"),
+ SC::new(&["investigate", "examine", "research"], "verb"),
+ // adjectives - 43 classes
+ SC::new(&["big", "large", "substantial"], "adj"),
+ SC::new(&["small", "tiny", "minor"], "adj"),
+ SC::new(&["fast", "quick", "rapid"], "adj"),
+ SC::new(&["slow", "gradual", "deliberate"], "adj"),
+ SC::new(&["important", "critical", "significant"], "adj"),
+ SC::new(&["hard", "difficult", "challenging"], "adj"),
+ SC::new(&["easy", "simple", "straightforward"], "adj"),
+ SC::new(&["good", "excellent", "effective"], "adj"),
+ SC::new(&["bad", "poor", "inferior"], "adj"),
+ SC::new(&["new", "recent", "current"], "adj"),
+ SC::new(&["old", "prior", "previous"], "adj"),
+ SC::new(&["common", "typical", "standard"], "adj"),
+ SC::new(&["rare", "unusual", "uncommon"], "adj"),
+ SC::new(&["safe", "secure", "protected"], "adj"),
+ SC::new(&["dangerous", "risky", "hazardous"], "adj"),
+ SC::new(&["correct", "accurate", "right"], "adj"),
+ SC::new(&["wrong", "incorrect", "mistaken"], "adj"),
+ SC::new(&["clear", "obvious", "evident"], "adj"),
+ SC::new(&["unclear", "vague", "ambiguous"], "adj"),
+ SC::new(&["strong", "robust", "powerful"], "adj"),
+ SC::new(&["weak", "fragile", "limited"], "adj"),
+ SC::new(&["full", "complete", "entire"], "adj"),
+ SC::new(&["empty", "vacant", "bare"], "adj"),
+ SC::new(&["open", "available", "accessible"], "adj"),
+ SC::new(&["closed", "sealed", "restricted"], "adj"),
+ SC::new(&["visible", "apparent", "observable"], "adj"),
+ SC::new(&["hidden", "concealed", "obscured"], "adj"),
+ SC::new(&["public", "open", "unrestricted"], "adj"),
+ SC::new(&["private", "confidential", "restricted"], "adj"),
+ SC::new(&["complete", "finished", "done"], "adj"),
+ SC::new(&["partial", "incomplete", "limited"], "adj"),
+ SC::new(&["useful", "helpful", "valuable"], "adj"),
+ SC::new(&["useless", "pointless", "ineffective"], "adj"),
+ SC::new(&["interesting", "engaging", "compelling"], "adj"),
+ SC::new(&["boring", "dull", "tedious"], "adj"),
+ SC::new(&["early", "initial", "preliminary"], "adj"),
+ SC::new(&["late", "delayed", "overdue"], "adj"),
+ SC::new(&["possible", "feasible", "viable"], "adj"),
+ SC::new(&["impossible", "unfeasible", "impractical"], "adj"),
+ SC::new(&["normal", "typical", "regular"], "adj"),
+ SC::new(&["abnormal", "unusual", "atypical"], "adj"),
+ SC::new(&["high", "elevated", "significant"], "adj"),
+ SC::new(&["low", "reduced", "minimal"], "adj"),
+ // adverbs - 20 classes
+ SC::new(&["quickly", "rapidly", "swiftly"], "adv"),
+ SC::new(&["slowly", "gradually", "steadily"], "adv"),
+ SC::new(&["carefully", "cautiously", "thoroughly"], "adv"),
+ SC::new(&["often", "frequently", "regularly"], "adv"),
+ SC::new(&["rarely", "seldom", "infrequently"], "adv"),
+ SC::new(&["usually", "typically", "generally"], "adv"),
+ SC::new(&["sometimes", "occasionally", "periodically"], "adv"),
+ SC::new(&["always", "consistently", "invariably"], "adv"),
+ SC::new(&["never", "not ever", "at no time"], "adv"),
+ SC::new(&["clearly", "obviously", "plainly"], "adv"),
+ SC::new(&["exactly", "precisely", "specifically"], "adv"),
+ SC::new(&["approximately", "roughly", "around"], "adv"),
+ SC::new(&["completely", "entirely", "fully"], "adv"),
+ SC::new(&["partially", "partly", "somewhat"], "adv"),
+ SC::new(&["immediately", "instantly", "promptly"], "adv"),
+ SC::new(&["eventually", "ultimately", "finally"], "adv"),
+ SC::new(&["recently", "lately", "newly"], "adv"),
+ SC::new(&["currently", "presently", "now"], "adv"),
+ SC::new(&["previously", "formerly", "earlier"], "adv"),
+ SC::new(&["easily", "readily", "effortlessly"], "adv"),
+ // nouns - 30 classes
+ SC::new(&["problem", "issue", "concern"], "noun"),
+ SC::new(&["answer", "response", "reply"], "noun"),
+ SC::new(&["question", "query", "inquiry"], "noun"),
+ SC::new(&["idea", "concept", "notion"], "noun"),
+ SC::new(&["plan", "strategy", "approach"], "noun"),
+ SC::new(&["result", "outcome", "consequence"], "noun"),
+ SC::new(&["method", "approach", "technique"], "noun"),
+ SC::new(&["goal", "objective", "aim"], "noun"),
+ SC::new(&["change", "modification", "alteration"], "noun"),
+ SC::new(&["system", "framework", "structure"], "noun"),
+ SC::new(&["process", "procedure", "workflow"], "noun"),
+ SC::new(&["feature", "function", "capability"], "noun"),
+ SC::new(&["effect", "impact", "influence"], "noun"),
+ SC::new(&["cause", "reason", "source"], "noun"),
+ SC::new(&["example", "instance", "case"], "noun"),
+ SC::new(&["detail", "particular", "specific"], "noun"),
+ SC::new(&["summary", "overview", "synopsis"], "noun"),
+ SC::new(&["notice", "notification", "alert"], "noun"),
+ SC::new(&["record", "log", "entry"], "noun"),
+ SC::new(&["report", "document", "write-up"], "noun"),
+ SC::new(&["data", "information", "content"], "noun"),
+ SC::new(&["value", "amount", "quantity"], "noun"),
+ SC::new(&["location", "place", "site"], "noun"),
+ SC::new(&["time", "moment", "instant"], "noun"),
+ SC::new(&["benefit", "advantage", "gain"], "noun"),
+ SC::new(&["risk", "hazard", "threat"], "noun"),
+ SC::new(&["error", "mistake", "flaw"], "noun"),
+ SC::new(&["need", "requirement", "necessity"], "noun"),
+ SC::new(&["request", "application", "petition"], "noun"),
+ SC::new(&["opportunity", "chance", "possibility"], "noun"),
+ // connectors - 8 classes
+ SC::new(&["however", "nevertheless", "nonetheless"], "conj"),
+ SC::new(&["therefore", "consequently", "thus"], "conj"),
+ SC::new(&["also", "additionally", "furthermore"], "conj"),
+ SC::new(&["but", "yet", "though"], "conj"),
+ SC::new(&["because", "since", "as"], "conj"),
+ SC::new(&["although", "while", "whereas"], "conj"),
+ SC::new(&["similarly", "likewise", "comparably"], "conj"),
+ SC::new(&["instead", "rather", "alternatively"], "conj"),
+];
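A note on the table above: a few words appear in more than one class of the same part of speech (for example "need" is listed under both the "want" and the "require" verb classes). Whether that matters depends on how the decoder maps a word back to its class, which this diff does not show. The following is a minimal sketch for spotting such overlaps. It assumes plain `(words, pos)` tuples instead of the crate's `SC` type, and the names `DEMO_CLASSES` and `overlapping_words` are hypothetical.

```rust
use std::collections::BTreeMap;

/// Three classes copied from the table above, as plain tuples.
/// (Assumption: the real table uses the crate's `SC` type instead.)
const DEMO_CLASSES: &[(&[&str], &str)] = &[
    (&["want", "desire", "need"], "verb"),
    (&["require", "need", "demand"], "verb"),
    (&["need", "requirement", "necessity"], "noun"),
];

/// Return every (pos, word) pair that occurs in more than one class,
/// together with the indices of the classes that contain it.
fn overlapping_words(classes: &[(&[&str], &str)]) -> BTreeMap<(String, String), Vec<usize>> {
    let mut seen: BTreeMap<(String, String), Vec<usize>> = BTreeMap::new();
    for (idx, (words, pos)) in classes.iter().enumerate() {
        for word in words.iter() {
            seen.entry((pos.to_string(), word.to_string()))
                .or_default()
                .push(idx);
        }
    }
    // Keep only words that were seen in at least two classes.
    seen.retain(|_, idxs| idxs.len() > 1);
    seen
}
```

Run over the demo data, this reports the verb "need" in classes 0 and 1; the noun "need" is not flagged because it sits in a different part of speech.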
oversight-rust/oversight-tlog/Cargo.toml +20 -0
@@ -0,0 +1,20 @@
+[package]
+name = "oversight-tlog"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "RFC 6962-compliant Merkle transparency log for Oversight"
+
+[dependencies]
+oversight-crypto = { path = "../oversight-crypto" }
+ed25519-dalek.workspace = true
+sha2.workspace = true
+serde.workspace = true
+serde_json.workspace = true
+hex.workspace = true
+thiserror.workspace = true
+serde_jcs.workspace = true
+
+[dev-dependencies]
+tempfile = "3"
oversight-rust/oversight-tlog/src/lib.rs +486 -0
@@ -0,0 +1,486 @@
+//! # oversight-tlog
+//!
+//! RFC 6962-compliant Merkle transparency log for Oversight.
+//!
+//! Every event (registration, beacon callback, attribution query) is appended
+//! as a leaf. The log signs a tree head with Ed25519 so auditors can verify
+//! inclusion proofs and detect any attempt to remove or reorder entries.
+//!
+//! ## RFC 6962 Compliance
+//!
+//! This implementation follows the RFC 6962 §2.1 Merkle Tree Hash and the
+//! §2.1.1 audit path (inclusion proof) construction, so proofs for non-empty
+//! trees should verify with other RFC 6962 verifiers (Sigstore Rekor,
+//! Certificate Transparency log verifiers, the Go Trillian library, etc.).
+//!
+//! ```text
+//! MTH({}) = SHA-256()
+//! MTH({d[0]}) = SHA-256(0x00 || d[0])
+//! MTH(D[0..n]) = SHA-256(0x01 || MTH(D[0..k]) || MTH(D[k..n]))
+//! where k is the largest power of 2 < n
+//! ```
+//!
+//! One deviation: `root()` reports an all-zero root for an empty log rather
+//! than `SHA-256()` of the empty string.
+//!
+//! ## Durability
+//!
+//! Every `append` writes the record and calls `sync_data` before returning, so
+//! an acknowledged entry is on disk. A crash mid-write can leave a partial
+//! final line; lines that fail to parse are skipped when the log is reopened.
+
+use ed25519_dalek::{Signer, SigningKey};
+use serde::{Deserialize, Serialize};
+use sha2::{Digest, Sha256};
+use std::fs::{File, OpenOptions};
+use std::io::{BufRead, BufReader, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Mutex;
+use thiserror::Error;
+
+#[derive(Debug, Error)]
+pub enum TlogError {
+ #[error("I/O: {0}")]
+ Io(#[from] std::io::Error),
+ #[error("JSON: {0}")]
+ Json(#[from] serde_json::Error),
+ #[error("hex: {0}")]
+ Hex(#[from] hex::FromHexError),
+ #[error("invalid signing key length: expected 32, got {0}")]
+ BadKeyLength(usize),
+ #[error("index {0} out of range (tree_size={1})")]
+ IndexOutOfRange(usize, usize),
+}
+
+pub type Result<T> = std::result::Result<T, TlogError>;
+
+/// SHA-256 of input
+#[inline]
+fn h(data: &[u8]) -> [u8; 32] {
+ let mut hasher = Sha256::new();
+ hasher.update(data);
+ hasher.finalize().into()
+}
+
+/// Largest power of 2 strictly less than n (for n >= 2). RFC 6962 §2.1.
+fn largest_power_of_2_less_than(n: usize) -> usize {
+ assert!(n >= 2);
+ let mut k = 1usize;
+ while k * 2 < n {
+ k *= 2;
+ }
+ k
+}
+
+/// RFC 6962 §2.1 Merkle Tree Hash over pre-hashed leaves.
+fn mth(leaf_hashes: &[[u8; 32]]) -> [u8; 32] {
+ let n = leaf_hashes.len();
+ assert!(n >= 1);
+ if n == 1 {
+ return leaf_hashes[0];
+ }
+ let k = largest_power_of_2_less_than(n);
+ let left = mth(&leaf_hashes[..k]);
+ let right = mth(&leaf_hashes[k..]);
+ let mut data = Vec::with_capacity(1 + 64);
+ data.push(0x01);
+ data.extend_from_slice(&left);
+ data.extend_from_slice(&right);
+ h(&data)
+}
+
+/// RFC 6962 §2.1.1 audit path for leaf at index `m`.
+/// Returns siblings from deepest (closest to leaf) to shallowest (closest to root).
+fn audit_path(leaf_hashes: &[[u8; 32]], m: usize) -> Vec<[u8; 32]> {
+ let n = leaf_hashes.len();
+ if n <= 1 {
+ return Vec::new();
+ }
+ let k = largest_power_of_2_less_than(n);
+ if m < k {
+ let mut path = audit_path(&leaf_hashes[..k], m);
+ path.push(mth(&leaf_hashes[k..]));
+ path
+ } else {
+ let mut path = audit_path(&leaf_hashes[k..], m - k);
+ path.push(mth(&leaf_hashes[..k]));
+ path
+ }
+}
+
+/// Verify a leaf's inclusion proof against an expected root. RFC 6962 §2.1.1.
+pub fn verify_inclusion_proof(
+ leaf_hash: &[u8; 32],
+ index: usize,
+ proof: &[[u8; 32]],
+ tree_size: usize,
+ expected_root: &[u8; 32],
+) -> bool {
+ if tree_size == 0 || index >= tree_size {
+ return false;
+ }
+
+ fn rec(
+ h_in: [u8; 32],
+ m: usize,
+ remaining: &[[u8; 32]],
+ n: usize,
+ ) -> Option<[u8; 32]> {
+ if n == 1 {
+ return if remaining.is_empty() { Some(h_in) } else { None };
+ }
+ if remaining.is_empty() {
+ return None;
+ }
+ let k = largest_power_of_2_less_than(n);
+ let sibling = *remaining.last().unwrap();
+ let deeper = &remaining[..remaining.len() - 1];
+ let (left, right) = if m < k {
+ (rec(h_in, m, deeper, k)?, sibling)
+ } else {
+ (sibling, rec(h_in, m - k, deeper, n - k)?)
+ };
+ let mut data = Vec::with_capacity(65);
+ data.push(0x01);
+ data.extend_from_slice(&left);
+ data.extend_from_slice(&right);
+ Some(h(&data))
+ }
+
+ rec(*leaf_hash, index, proof, tree_size)
+ .map(|computed| &computed == expected_root)
+ .unwrap_or(false)
+}
+
+/// On-disk leaf record format.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+struct LeafRecord {
+ index: usize,
+ leaf_hash: String,
+ leaf_data: String,
+}
+
+/// Signed tree head.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct SignedTreeHead {
+ pub size: usize,
+ pub root: String,
+ #[serde(default, skip_serializing_if = "String::is_empty")]
+ pub signature: String,
+ #[serde(default, skip_serializing_if = "String::is_empty")]
+ pub signed_message: String,
+}
+
+/// Inclusion proof returned to clients.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct InclusionProof {
+ pub index: usize,
+ pub leaf_hash: String,
+ pub proof: Vec<String>,
+ pub root: String,
+ pub tree_size: usize,
+}
+
+/// Append-only Merkle transparency log.
+pub struct TransparencyLog {
+ dir: PathBuf,
+ leaves_path: PathBuf,
+ leaves: Mutex<Vec<[u8; 32]>>,
+ cached_root: Mutex<Option<[u8; 32]>>,
+ signing_key: Option<SigningKey>,
+}
+
+impl TransparencyLog {
+ pub fn open(data_dir: impl AsRef<Path>) -> Result<Self> {
+ Self::open_with_signer(data_dir, None)
+ }
+
+ pub fn open_with_signer(
+ data_dir: impl AsRef<Path>,
+ signing_key_hex: Option<&str>,
+ ) -> Result<Self> {
+ let dir = data_dir.as_ref().to_path_buf();
+ std::fs::create_dir_all(&dir)?;
+ let leaves_path = dir.join("leaves.jsonl");
+
+ // Load existing leaves (recovery)
+ let mut leaves: Vec<[u8; 32]> = Vec::new();
+ if leaves_path.exists() {
+ let f = File::open(&leaves_path)?;
+ let reader = BufReader::new(f);
+ for line in reader.lines() {
+ let line = line?;
+ if line.trim().is_empty() {
+ continue;
+ }
+ if let Ok(rec) = serde_json::from_str::<LeafRecord>(&line) {
+ if let Ok(bytes) = hex::decode(&rec.leaf_hash) {
+ if bytes.len() == 32 {
+ let mut arr = [0u8; 32];
+ arr.copy_from_slice(&bytes);
+ leaves.push(arr);
+ }
+ }
+ }
+ }
+ }
+
+ let signing_key = match signing_key_hex {
+ Some(hex_str) => {
+ let bytes = hex::decode(hex_str)?;
+ if bytes.len() != 32 {
+ return Err(TlogError::BadKeyLength(bytes.len()));
+ }
+ let mut arr = [0u8; 32];
+ arr.copy_from_slice(&bytes);
+ Some(SigningKey::from_bytes(&arr))
+ }
+ None => None,
+ };
+
+ Ok(TransparencyLog {
+ dir,
+ leaves_path,
+ leaves: Mutex::new(leaves),
+ cached_root: Mutex::new(None),
+ signing_key,
+ })
+ }
+
+ /// Append an opaque leaf. Returns its 0-based index. Durable on return.
+ pub fn append(&self, leaf_data: &[u8]) -> Result<usize> {
+ let mut leaves = self.leaves.lock().unwrap();
+ let index = leaves.len();
+
+ // RFC 6962 leaf prefix
+ let mut prefixed = Vec::with_capacity(1 + leaf_data.len());
+ prefixed.push(0x00);
+ prefixed.extend_from_slice(leaf_data);
+ let leaf_hash = h(&prefixed);
+
+ // Durable append: write and fsync before touching in-memory state, so a
+ // failed write cannot leave a leaf in memory that is missing on disk.
+ let record = LeafRecord {
+ index,
+ leaf_hash: hex::encode(leaf_hash),
+ leaf_data: String::from_utf8_lossy(leaf_data).to_string(),
+ };
+ let line = serde_json::to_string(&record)? + "\n";
+ let mut f = OpenOptions::new()
+ .create(true)
+ .append(true)
+ .open(&self.leaves_path)?;
+ f.write_all(line.as_bytes())?;
+ f.flush()?;
+ f.sync_data()?;
+
+ // Publish the leaf and invalidate the cached root
+ leaves.push(leaf_hash);
+ *self.cached_root.lock().unwrap() = None;
+
+ Ok(index)
+ }
+
+ /// Append a JSON event. Helper that canonicalizes and calls append().
+ pub fn append_event(&self, event: &serde_json::Value) -> Result<usize> {
+ let bytes = serde_jcs::to_vec(event).map_err(|_| {
+ TlogError::Json(serde_json::Error::custom("canonicalization failed"))
+ })?;
+ self.append(&bytes)
+ }
+
+ pub fn size(&self) -> usize {
+ self.leaves.lock().unwrap().len()
+ }
+
+ /// RFC 6962 root. Cached after first compute, invalidated on append.
+ pub fn root(&self) -> [u8; 32] {
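+ // Lock order is leaves, then cached_root, matching append(). Taking them
+ // in the opposite order could deadlock against a concurrent append.
+ let leaves = self.leaves.lock().unwrap();
+ let mut cached = self.cached_root.lock().unwrap();
+ if let Some(r) = *cached {
+ return r;
+ }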
+ let root = if leaves.is_empty() {
+ [0u8; 32]
+ } else {
+ mth(&leaves)
+ };
+ *cached = Some(root);
+ root
+ }
+
+ /// Signed tree head. Signature present if a signing key was supplied.
+ pub fn signed_head(&self) -> SignedTreeHead {
+ let size = self.size();
+ let root = self.root();
+ let mut head = SignedTreeHead {
+ size,
+ root: hex::encode(root),
+ signature: String::new(),
+ signed_message: String::new(),
+ };
+ if let Some(ref sk) = self.signing_key {
+ let msg_value = serde_json::json!({
+ "size": size,
+ "root": head.root,
+ });
+ let msg = serde_jcs::to_vec(&msg_value).unwrap_or_default();
+ let sig = sk.sign(&msg);
+ head.signature = hex::encode(sig.to_bytes());
+ head.signed_message = String::from_utf8_lossy(&msg).to_string();
+ }
+ head
+ }
+
+ pub fn inclusion_proof(&self, index: usize) -> Option<InclusionProof> {
+ let leaves = self.leaves.lock().unwrap();
+ if index >= leaves.len() {
+ return None;
+ }
+ let leaves_copy: Vec<[u8; 32]> = leaves.clone();
+ let leaf_hash_hex = hex::encode(leaves[index]);
+ let tree_size = leaves.len();
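+ drop(leaves); // release the lock before the O(n) path computation
+
+ // Derive the root from the same snapshot as the path so the proof stays
+ // self-consistent even if another thread appends in between.
+ let path = audit_path(&leaves_copy, index);
+ let root = mth(&leaves_copy);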
+ Some(InclusionProof {
+ index,
+ leaf_hash: leaf_hash_hex,
+ proof: path.iter().map(hex::encode).collect(),
+ root: hex::encode(root),
+ tree_size,
+ })
+ }
+
+ pub fn data_dir(&self) -> &Path {
+ &self.dir
+ }
+}
+
+// serde_json needs this little helper for custom errors
+trait JsonErrorExt {
+ fn custom(msg: &'static str) -> Self;
+}
+impl JsonErrorExt for serde_json::Error {
+ fn custom(msg: &'static str) -> Self {
+ serde::de::Error::custom(msg)
+ }
+}
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+ use tempfile::TempDir;
+
+ fn mktlog() -> (TempDir, TransparencyLog) {
+ let dir = TempDir::new().unwrap();
+ let tl = TransparencyLog::open(dir.path()).unwrap();
+ (dir, tl)
+ }
+
+ #[test]
+ fn append_and_size() {
+ let (_d, tl) = mktlog();
+ assert_eq!(tl.size(), 0);
+ tl.append(b"event0").unwrap();
+ tl.append(b"event1").unwrap();
+ assert_eq!(tl.size(), 2);
+ }
+
+ #[test]
+ fn root_changes_on_append() {
+ let (_d, tl) = mktlog();
+ tl.append(b"a").unwrap();
+ let r1 = tl.root();
+ tl.append(b"b").unwrap();
+ let r2 = tl.root();
+ assert_ne!(r1, r2);
+ }
+
+ #[test]
+ fn inclusion_proofs_verify_for_every_leaf() {
+ for n in [1usize, 2, 3, 4, 5, 7, 8, 16, 17, 100] {
+ let (_d, tl) = mktlog();
+ for i in 0..n {
+ tl.append(format!("event_{i}").as_bytes()).unwrap();
+ }
+ let root = tl.root();
+ for i in 0..n {
+ let proof = tl.inclusion_proof(i).expect("proof");
+ let leaf_hash_bytes = hex::decode(&proof.leaf_hash).unwrap();
+ let mut leaf_hash = [0u8; 32];
+ leaf_hash.copy_from_slice(&leaf_hash_bytes);
+ let siblings: Vec<[u8; 32]> = proof
+ .proof
+ .iter()
+ .map(|s| {
+ let b = hex::decode(s).unwrap();
+ let mut a = [0u8; 32];
+ a.copy_from_slice(&b);
+ a
+ })
+ .collect();
+ assert!(
+ verify_inclusion_proof(&leaf_hash, i, &siblings, n, &root),
+ "n={} leaf={} failed to verify",
+ n,
+ i
+ );
+ }
+ }
+ }
+
+ #[test]
+ fn tampered_proof_rejected() {
+ let (_d, tl) = mktlog();
+ for i in 0..5 {
+ tl.append(format!("e{i}").as_bytes()).unwrap();
+ }
+ let proof = tl.inclusion_proof(2).unwrap();
+ let leaf_hash_bytes = hex::decode(&proof.leaf_hash).unwrap();
+ let mut leaf_hash = [0u8; 32];
+ leaf_hash.copy_from_slice(&leaf_hash_bytes);
+ let mut siblings: Vec<[u8; 32]> = proof
+ .proof
+ .iter()
+ .map(|s| {
+ let b = hex::decode(s).unwrap();
+ let mut a = [0u8; 32];
+ a.copy_from_slice(&b);
+ a
+ })
+ .collect();
+ if let Some(first) = siblings.first_mut() {
+ first[0] ^= 0x01;
+ }
+ let root = tl.root();
+ assert!(!verify_inclusion_proof(&leaf_hash, 2, &siblings, 5, &root));
+ }
+
+ #[test]
+ fn signed_head_with_key() {
+ let dir = TempDir::new().unwrap();
+ let key_hex = hex::encode([42u8; 32]);
+ let tl = TransparencyLog::open_with_signer(dir.path(), Some(&key_hex)).unwrap();
+ tl.append(b"some event").unwrap();
+ let head = tl.signed_head();
+ assert_eq!(head.size, 1);
+ assert!(!head.signature.is_empty());
+ assert!(!head.signed_message.is_empty());
+ }
+
+ #[test]
+ fn survives_reopen() {
+ let dir = TempDir::new().unwrap();
+ {
+ let tl = TransparencyLog::open(dir.path()).unwrap();
+ tl.append(b"event_a").unwrap();
+ tl.append(b"event_b").unwrap();
+ }
+ // Re-open - leaves should be recovered from disk
+ let tl2 = TransparencyLog::open(dir.path()).unwrap();
+ assert_eq!(tl2.size(), 2);
+ }
+
+ #[test]
+ fn empty_tree_has_zero_root() {
+ let (_d, tl) = mktlog();
+ assert_eq!(tl.root(), [0u8; 32]);
+ }
+}
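The recursive split used by `mth` and `audit_path` above can be hard to picture from the hashes alone. The following is a minimal std-only sketch, not part of the crate: it applies the same "largest power of two below n" split, but returns the leaf ranges that each audit-path sibling covers instead of their hashes. The names `split_point` and `audit_path_ranges` are hypothetical.

```rust
/// Largest power of two strictly less than `n` (requires n >= 2).
/// This is the split point k from RFC 6962 section 2.1.
fn split_point(n: usize) -> usize {
    assert!(n >= 2);
    let mut k = 1usize;
    while k * 2 < n {
        k *= 2;
    }
    k
}

/// Leaf ranges `[start, end)` covered by each sibling on the audit path for
/// leaf `m` in a tree of `n` leaves, ordered deepest first (the same order
/// the crate's `audit_path` uses for sibling hashes).
fn audit_path_ranges(n: usize, m: usize) -> Vec<(usize, usize)> {
    fn rec(start: usize, n: usize, m: usize, out: &mut Vec<(usize, usize)>) {
        if n <= 1 {
            return;
        }
        let k = split_point(n);
        if m < k {
            // Leaf is in the left subtree; the sibling is the right subtree.
            rec(start, k, m, out);
            out.push((start + k, start + n));
        } else {
            // Leaf is in the right subtree; the sibling is the left subtree.
            rec(start + k, n - k, m - k, out);
            out.push((start, start + k));
        }
    }
    assert!(m < n, "leaf index out of range");
    let mut out = Vec::new();
    rec(0, n, m, &mut out);
    out
}
```

For a 5-leaf tree, leaf 2 needs three siblings (leaf 3, the subtree over leaves 0..2, then leaf 4), while leaf 4 needs only one (the subtree over leaves 0..4). Proof length therefore varies with the leaf's position when the tree size is not a power of two.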
oversight-rust/oversight-watermark/Cargo.toml +10 -0
@@ -0,0 +1,10 @@
+[package]
+name = "oversight-watermark"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description = "Per-recipient text watermarking for Oversight (L1 zero-width, L2 whitespace)"
+
+[dependencies]
+rand.workspace = true
oversight-rust/oversight-watermark/src/lib.rs +210 -0
@@ -0,0 +1,210 @@
+//! # oversight-watermark
+//!
+//! Per-recipient text watermarking. Two MVP layers:
+//!
+//! - **L1 zero-width unicode**: embeds mark_id bits as ZWSP / ZWNJ frames.
+//! Survives copy-paste. Defeated by normalize/strip passes.
+//!
+//! - **L2 whitespace**: trailing-space vs trailing-tab on lines. Survives
+//! zero-width stripping that defeats L1, but is lost when trailing
+//! whitespace is trimmed.
+//!
+//! Higher-fidelity layers (semantic synonym rotation, DCT image watermarks,
+//! PDF/DOCX metadata) live in separate crates so each can evolve independently.
+
+use rand::{rngs::OsRng, RngCore};
+
+pub const ZW_SPACE: char = '\u{200b}'; // bit 0
+pub const ZW_NONJOIN: char = '\u{200c}'; // bit 1
+pub const ZW_JOIN: char = '\u{200d}'; // frame delimiter
+
+fn bits_of(data: &[u8]) -> Vec<u8> {
+ let mut out = Vec::with_capacity(data.len() * 8);
+ for byte in data {
+ for i in 0..8 {
+ out.push((byte >> (7 - i)) & 1);
+ }
+ }
+ out
+}
+
+fn bytes_from_bits(bits: &[u8]) -> Vec<u8> {
+ let n = (bits.len() / 8) * 8;
+ let mut out = Vec::with_capacity(n / 8);
+ let mut i = 0;
+ while i < n {
+ let mut b: u8 = 0;
+ for j in 0..8 {
+ b = (b << 1) | (bits[i + j] & 1);
+ }
+ out.push(b);
+ i += 8;
+ }
+ out
+}
+
+/// Generate a random mark_id of `n_bytes` bytes. 8 bytes (64 bits) is plenty
+/// for attribution.
+pub fn new_mark_id(n_bytes: usize) -> Vec<u8> {
+ let mut out = vec![0u8; n_bytes];
+ OsRng.fill_bytes(&mut out);
+ out
+}
+
+// -------------------------- L1: zero-width unicode --------------------------
+
+/// Embed `mark_id` as repeated zero-width frames scattered through the text.
+///
+/// Each frame is `[ZW_JOIN] [bits as ZWSP/ZWNJ] [ZW_JOIN]`. Multiple redundant
+/// frames are inserted at roughly `density`-char intervals so that any
+/// surviving segment yields an attribution.
+///
+/// # Panics
+///
+/// Panics if `density` is 0 (division by zero).
+pub fn embed_zw(text: &str, mark_id: &[u8], density: usize) -> String {
+ let bits = bits_of(mark_id);
+ let mut frame = String::with_capacity(bits.len() + 2);
+ frame.push(ZW_JOIN);
+ for b in &bits {
+ frame.push(if *b == 0 { ZW_SPACE } else { ZW_NONJOIN });
+ }
+ frame.push(ZW_JOIN);
+
+ if text.chars().count() < density {
+ let mut out = String::from(text);
+ out.push_str(&frame);
+ return out;
+ }
+
+ let mut out = String::with_capacity(text.len() + frame.len() * (text.len() / density));
+ for (i, ch) in text.chars().enumerate() {
+ out.push(ch);
+ if i > 0 && i % density == 0 {
+ out.push_str(&frame);
+ }
+ }
+ out
+}
+
+/// Recover candidate mark_ids from zero-width frames in the text.
+pub fn extract_zw(text: &str, mark_len_bytes: usize) -> Vec<Vec<u8>> {
+ let expected_bits = mark_len_bytes * 8;
+ let chars: Vec<char> = text.chars().collect();
+ let mut marks = Vec::new();
+ let mut i = 0;
+ while i < chars.len() {
+ if chars[i] == ZW_JOIN {
+ let mut bits = Vec::new();
+ let mut j = i + 1;
+ while j < chars.len() && (chars[j] == ZW_SPACE || chars[j] == ZW_NONJOIN) {
+ bits.push(if chars[j] == ZW_SPACE { 0u8 } else { 1u8 });
+ j += 1;
+ }
+ if j < chars.len() && chars[j] == ZW_JOIN && bits.len() == expected_bits {
+ marks.push(bytes_from_bits(&bits));
+ }
+ i = j + 1;
+ } else {
+ i += 1;
+ }
+ }
+ marks
+}
+
+// -------------------------- L2: trailing whitespace --------------------------
+
+/// Encode `mark_id` bits as trailing-space (0) vs trailing-tab (1) on the
+/// first N lines that don't already have trailing whitespace.
+pub fn embed_ws(text: &str, mark_id: &[u8]) -> String {
+ let bits = bits_of(mark_id);
+ let lines: Vec<&str> = text.split('\n').collect();
+ let mut out_lines = Vec::with_capacity(lines.len());
+ let mut bi = 0usize;
+ for line in lines {
+ if bi < bits.len() && line.trim_end() == line {
+ let suffix = if bits[bi] == 0 { ' ' } else { '\t' };
+ out_lines.push(format!("{}{}", line, suffix));
+ bi += 1;
+ } else {
+ out_lines.push(line.to_string());
+ }
+ }
+ out_lines.join("\n")
+}
+
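+/// Read the whitespace mark back out. Returns None if fewer than
+/// `mark_len_bytes * 8` lines end in a space or tab. Lines that already had
+/// trailing whitespace before embedding are also read as bits, so such input
+/// can corrupt the recovered mark.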
+pub fn extract_ws(text: &str, mark_len_bytes: usize) -> Option<Vec<u8>> {
+ let needed = mark_len_bytes * 8;
+ let mut bits = Vec::with_capacity(needed);
+ for line in text.split('\n') {
+ if line.ends_with('\t') {
+ bits.push(1u8);
+ } else if line.ends_with(' ') {
+ bits.push(0u8);
+ }
+ if bits.len() >= needed {
+ break;
+ }
+ }
+ if bits.len() < needed {
+ None
+ } else {
+ bits.truncate(needed);
+ Some(bytes_from_bits(&bits))
+ }
+}
+
+// -------------------------- High-level --------------------------
+
+pub fn apply_all(text: &str, mark_id: &[u8]) -> String {
+ let t = embed_zw(text, mark_id, 40);
+ embed_ws(&t, mark_id)
+}
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+
+ #[test]
+ fn l1_round_trip() {
+ let text = "The quick brown fox jumps over the lazy dog. ".repeat(20);
+ let mark = new_mark_id(8);
+ let marked = embed_zw(&text, &mark, 40);
+ let recovered = extract_zw(&marked, 8);
+ assert!(!recovered.is_empty(), "no marks recovered");
+ assert_eq!(recovered[0], mark);
+ }
+
+ #[test]
+ fn l2_round_trip() {
+ let text = (0..80)
+ .map(|i| format!("line {}", i))
+ .collect::<Vec<_>>()
+ .join("\n");
+ let mark = new_mark_id(8);
+ let marked = embed_ws(&text, &mark);
+ let recovered = extract_ws(&marked, 8).unwrap();
+ assert_eq!(recovered, mark);
+ }
+
+ #[test]
+ fn l1_survives_copy_paste_but_l2_doesnt_always() {
+ // Simulate copy-paste: ZW chars survive, trailing whitespace often doesn't.
+ // Use multi-line text so L2 has at least 64 lines to carry the mark.
+ let text = (0..80)
+ .map(|i| format!("line {} of some body text that holds a watermark", i))
+ .collect::<Vec<_>>()
+ .join("\n");
+ let mark = new_mark_id(8);
+ let marked = apply_all(&text, &mark);
+ // Before stripping, L2 recovers the mark
+ assert_eq!(extract_ws(&marked, 8), Some(mark.clone()));
+ // Strip trailing whitespace (lazy copy-paste)
+ let no_trailing: String = marked
+ .lines()
+ .map(|l| l.trim_end())
+ .collect::<Vec<_>>()
+ .join("\n");
+ // L1 should still recover the mark from the stripped text
+ let recovered = extract_zw(&no_trailing, 8);
+ assert!(recovered.contains(&mark));
+ // L2 should NOT recover (stripped)
+ assert!(extract_ws(&no_trailing, 8).is_none());
+ }
+
+ #[test]
+ fn extract_zw_returns_empty_on_unmarked_text() {
+ let text = "This text has no watermark in it.";
+ let recovered = extract_zw(text, 8);
+ assert!(recovered.is_empty());
+ }
+}
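The L1 frame layout described above (`[ZW_JOIN] [bits] [ZW_JOIN]`) and its main weakness can be shown in a few lines. The following is a minimal std-only sketch, not part of the crate: it encodes a single byte the same way `embed_zw` encodes each mark byte, and simulates a sanitizer that removes the three zero-width code points. The names `frame_for_byte` and `strip_zero_width` are hypothetical.

```rust
const ZW_SPACE: char = '\u{200b}'; // bit 0
const ZW_NONJOIN: char = '\u{200c}'; // bit 1
const ZW_JOIN: char = '\u{200d}'; // frame delimiter

/// Encode one byte as a delimited zero-width frame, most significant bit first.
fn frame_for_byte(byte: u8) -> String {
    let mut frame = String::new();
    frame.push(ZW_JOIN);
    for i in 0..8 {
        let bit = (byte >> (7 - i)) & 1;
        frame.push(if bit == 0 { ZW_SPACE } else { ZW_NONJOIN });
    }
    frame.push(ZW_JOIN);
    frame
}

/// Simulate a sanitizer that removes the three zero-width code points.
fn strip_zero_width(text: &str) -> String {
    text.chars()
        .filter(|c| !matches!(*c, ZW_SPACE | ZW_NONJOIN | ZW_JOIN))
        .collect()
}
```

Each of these code points takes three bytes in UTF-8, so a one-byte frame is 10 code points and 30 bytes, and an 8-byte mark costs 66 code points (198 bytes) per frame. That overhead is invisible on screen but shows up in file size, and a single filter pass like `strip_zero_width` removes every frame.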
oversight-rust/tests/conformance_cross_lang.sh +112 -0
@@ -0,0 +1,112 @@
+#!/bin/bash
+# Cross-language conformance test between the Python reference implementation
+# and the Rust port. Verifies that both can read each other's sealed files
+# bit-for-bit.
+
+# pipefail: without it, `cargo run ... | tail` hides a failing cargo exit code
+set -eo pipefail
+export PATH="$HOME/.cargo/bin:$PATH"
+
+WORKDIR=/tmp/oversight-conformance
+REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "$0")/../.." && pwd)}"
+RUST_CARGO="$REPO_ROOT/oversight-rust/Cargo.toml"
+PYTHON_ROOT="$REPO_ROOT"
+
+rm -rf "$WORKDIR"
+mkdir -p "$WORKDIR"
+cd "$WORKDIR"
+
+echo "=== Setup: generate identities in Rust ==="
+cargo run --manifest-path "$RUST_CARGO" --release -q -- keygen --out alice.json 2>&1 | tail -4
+cargo run --manifest-path "$RUST_CARGO" --release -q -- keygen --out issuer.json 2>&1 | tail -4
+
+ALICE_X_PUB=$(python3 -c "import json; print(json.load(open('alice.json'))['x25519_pub'])")
+ALICE_X_PRIV=$(python3 -c "import json; print(json.load(open('alice.json'))['x25519_priv'])")
+ISSUER_ED_PRIV=$(python3 -c "import json; print(json.load(open('issuer.json'))['ed25519_priv'])")
+ISSUER_ED_PUB=$(python3 -c "import json; print(json.load(open('issuer.json'))['ed25519_pub'])")
+
+echo "This is a cross-language conformance test." > plaintext.txt
+EXPECTED_HASH=$(python3 -c "
+import hashlib
+data = open('plaintext.txt', 'rb').read()
+print(hashlib.sha256(data).hexdigest())
+")
+echo "Expected hash: $EXPECTED_HASH"
+
+echo ""
+echo "=== 1. Seal in RUST, open in PYTHON ==="
+cargo run --manifest-path "$RUST_CARGO" --release -q -- seal \
+ --input plaintext.txt --output rust-sealed.bin \
+ --issuer issuer.json --recipient-pub "$ALICE_X_PUB" \
+ --recipient-id "alice@test" --registry "https://reg.test" 2>&1 | tail -3
+
+python3 <<PYEOF
+import sys
+sys.path.insert(0, '$PYTHON_ROOT')
+from oversight_core.container import open_sealed
+blob = open('rust-sealed.bin', 'rb').read()
+priv = bytes.fromhex('$ALICE_X_PRIV')
+plaintext, manifest = open_sealed(blob, priv)
+expected = open('plaintext.txt', 'rb').read()
+assert plaintext == expected, f"PLAINTEXT MISMATCH: got {plaintext!r}, expected {expected!r}"
+assert manifest.content_hash == '$EXPECTED_HASH', f"HASH MISMATCH: {manifest.content_hash}"
+print(f" ✓ Python read Rust-sealed file ({len(plaintext)} bytes)")
+print(f" ✓ content_hash matches: {manifest.content_hash[:16]}...")
+print(f" ✓ file_id from Rust manifest: {manifest.file_id}")
+print(f" ✓ signature verified: {manifest.verify()}")
+PYEOF
+
+echo ""
+echo "=== 2. Seal in PYTHON, open in RUST ==="
+python3 <<PYEOF
+import sys
+sys.path.insert(0, '$PYTHON_ROOT')
+from oversight_core import ClassicIdentity, content_hash
+from oversight_core.manifest import Manifest, Recipient
+from oversight_core.container import seal
+
+alice_pub = bytes.fromhex('$ALICE_X_PUB')
+issuer_priv = bytes.fromhex('$ISSUER_ED_PRIV')
+issuer_pub = bytes.fromhex('$ISSUER_ED_PUB')
+
+plaintext = open('plaintext.txt', 'rb').read()
+m = Manifest.new(
+ original_filename='plaintext.txt',
+ content_hash=content_hash(plaintext),
+ size_bytes=len(plaintext),
+ issuer_id='cross-test',
+ issuer_ed25519_pub_hex=issuer_pub.hex(),
+ recipient=Recipient(recipient_id='alice@test', x25519_pub=alice_pub.hex()),
+ registry_url='https://reg.test',
+ content_type='text/plain',
+)
+blob = seal(plaintext, m, issuer_priv, alice_pub)
+open('python-sealed.bin', 'wb').write(blob)
+print(f" ✓ Python sealed ({len(blob)} bytes)")
+PYEOF
+
+cargo run --manifest-path "$RUST_CARGO" --release -q -- open \
+ --input python-sealed.bin --output rust-recovered.txt --recipient alice.json 2>&1 | tail -3
+
+diff plaintext.txt rust-recovered.txt && echo " ✓ Rust read Python-sealed file, plaintext matches"
+
+echo ""
+echo "=== 3. Inspect cross-format: Python can inspect Rust-sealed, Rust can inspect Python-sealed ==="
+# Python inspect of Rust sealed
+python3 <<PYEOF
+import sys
+sys.path.insert(0, '$PYTHON_ROOT')
+from oversight_core.container import SealedFile
+blob = open('rust-sealed.bin', 'rb').read()
+sf = SealedFile.from_bytes(blob)
+assert sf.manifest.verify(), "Python couldn't verify Rust signature!"
+print(f" ✓ Python Manifest.verify() of Rust-sealed: True (suite={sf.manifest.suite})")
+PYEOF
+
+# Rust inspect of Python sealed
+cargo run --manifest-path "$RUST_CARGO" --release -q -- inspect \
+ --input python-sealed.bin 2>&1 | grep -E "(signature valid|suite|OVERSIGHT)" | head -5
+
+echo ""
+echo "=========================================="
+echo " CROSS-LANGUAGE CONFORMANCE: ALL PASS"
+echo "=========================================="
oversight_core/__init__.py +33 -0
@@ -0,0 +1,33 @@
+"""
+OVERSIGHT - open protocol for data provenance, attribution, and leak detection.
+
+Core:
+ - container: sealed file format (binary)
+ - crypto: vetted primitives + PQ hooks
+ - manifest: signed metadata
+ - watermark: per-recipient attribution marks
+ - beacon: passive callback tokens
+"""
+
+from .container import seal, open_sealed, SealedFile
+from .manifest import Manifest, Recipient, WatermarkRef
+from .crypto import ClassicIdentity, random_dek, content_hash
+from . import watermark, beacon
+
+__all__ = [
+ "seal",
+ "open_sealed",
+ "SealedFile",
+ "Manifest",
+ "Recipient",
+ "WatermarkRef",
+ "ClassicIdentity",
+ "random_dek",
+ "content_hash",
+ "watermark",
+ "beacon",
+]
+
+__version__ = "0.4.1"
oversight_core/beacon.py +110 -0
@@ -0,0 +1,110 @@
+"""
+oversight_core.beacon
+=====================
+
+Beacon / canary token generation.
+
+Per-file, per-recipient passive callbacks. When a sealed file is opened (or even
+its metadata inspected), one or more beacons fire to the attribution registry.
+
+Design principles:
+ - PASSIVE ONLY. No code execution on the reader. No RAT. No "active" payloads.
+ Beacons are network callbacks that standard document readers make naturally
+ during rendering (image fetch, URL resolution, font load, license check).
+ - DIVERSITY. Multiple beacon types per file. Stripping one doesn't defeat the others.
+ - PER-RECIPIENT. Each recipient's copy has unique beacon URLs.
+ A callback identifies not just "the file leaked" but "whose copy leaked".
+ - LEGAL. Beacons only phone home to the registry operator's infrastructure;
+ they do not exfiltrate data from the reader's machine beyond what any
+ standard web request reveals (IP, UA, timestamp).
+
+Beacon types in this MVP:
+ - DNS beacon (subdomain resolution - fires before HTTP)
+ - HTTP beacon (image-fetch URL suitable for embedding in Office/PDF docs)
+ - OCSP-style beacon (cert revocation check - survives very restrictive environments)
+ - "License check" beacon (HEAD request to a policy endpoint)
+
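+Each beacon carries:
+  - token_id : unique, unguessable; ties a callback to (file_id, recipient_id)
+  - kind     : type of callback ('dns' | 'http_img' | 'ocsp' | 'license')
+  - url      : what the reader calls (dns_name is also set for DNS beacons)
+
+first_seen is not a field on the beacon; the registry populates it on receipt.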
+"""
+
+from __future__ import annotations
+
+import secrets
+from dataclasses import dataclass, asdict
+from typing import Optional
+
+
+@dataclass
+class Beacon:
+    token_id: str                   # 128-bit unguessable
+    kind: str                       # 'dns' | 'http_img' | 'ocsp' | 'license'
+    url: str                        # what the reader calls
+    dns_name: Optional[str] = None  # for dns kind
+
+    def to_dict(self) -> dict:
+        return asdict(self)
+
+
+def _token() -> str:
+    return secrets.token_hex(16)  # 128 bits
+
+
+def gen_beacons(
+    registry_domain: str,
+    file_id: str,
+    recipient_id: str,
+    include: Optional[list[str]] = None,
+) -> list[Beacon]:
+    """
+    Generate a set of beacons for a specific (file, recipient) pair.
+
+    The registry_domain must be under the control of the sealing operator.
+    The token_id is the lookup key - the registry maps token_id -> (file_id, recipient_id).
+    This function does not record that mapping itself; the caller is expected
+    to register each returned token_id against file_id and recipient_id.
+    """
+    kinds = include or ["dns", "http_img", "ocsp", "license"]
+    out: list[Beacon] = []
+
+    for kind in kinds:
+        tid = _token()
+        if kind == "dns":
+            host = f"{tid}.t.{registry_domain}"
+            out.append(Beacon(
+                token_id=tid,
+                kind="dns",
+                url=f"dns://{host}",
+                dns_name=host,
+            ))
+        elif kind == "http_img":
+            # 1x1 PNG endpoint, suitable for <img src> in HTML/Office/PDF
+            out.append(Beacon(
+                token_id=tid,
+                kind="http_img",
+                url=f"https://b.{registry_domain}/p/{tid}.png",
+            ))
+        elif kind == "ocsp":
+            # OCSP-style POST; readers doing cert checks will hit this
+            out.append(Beacon(
+                token_id=tid,
+                kind="ocsp",
+                url=f"https://ocsp.{registry_domain}/r/{tid}",
+            ))
+        elif kind == "license":
+            out.append(Beacon(
+                token_id=tid,
+                kind="license",
+                url=f"https://lic.{registry_domain}/v/{tid}",
+            ))
+        else:
+            # Fail loudly rather than silently dropping an unrecognized kind.
+            raise ValueError(f"unknown beacon kind: {kind!r}")
+    return out
+
+
+def beacon_to_img_tag(b: Beacon) -> str:
+    """HTML snippet that many office/PDF renderers will fetch on open."""
+    return f'<img src="{b.url}" width="1" height="1" alt=""/>'
+
+
+def beacons_html_block(beacons: list[Beacon]) -> str:
+    imgs = "\n".join(beacon_to_img_tag(b) for b in beacons if b.kind == "http_img")
+    return f'<div style="display:none">\n{imgs}\n</div>'
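The beacon module says the registry maps token_id -> (file_id, recipient_id), but the registry side is not part of this file. The following is a minimal sketch of that lookup, assuming an in-memory table. `TokenRegistry` and `token_from_beacon_url` are hypothetical names used only for illustration; the URL shapes are the four produced by `gen_beacons`.

```python
from typing import Optional
from urllib.parse import urlsplit


class TokenRegistry:
    """Hypothetical in-memory stand-in for the attribution registry."""

    def __init__(self) -> None:
        self._by_token: dict[str, tuple[str, str]] = {}

    def register(self, token_id: str, file_id: str, recipient_id: str) -> None:
        self._by_token[token_id] = (file_id, recipient_id)

    def resolve(self, token_id: str) -> Optional[tuple[str, str]]:
        # Unknown tokens resolve to None rather than raising.
        return self._by_token.get(token_id)


def token_from_beacon_url(url: str) -> str:
    """Recover the token from one of the URL shapes built by gen_beacons."""
    parts = urlsplit(url)
    if parts.scheme == "dns":
        # dns://<token>.t.<registry_domain>
        return (parts.hostname or "").split(".")[0]
    # https://<sub>.<registry_domain>/<letter>/<token>[.png]
    last_segment = parts.path.rsplit("/", 1)[-1]
    return last_segment.removesuffix(".png")
```

On a callback, a registry built this way would call `resolve(token_from_beacon_url(requested_url))` and get back which recipient's copy triggered the request, or `None` for a token it never issued.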
oversight_core/container.py +277 -0
@@ -0,0 +1,277 @@
+"""
+oversight_core.container
+========================
+
+The `.sealed` container format. Binary layout:
+
+    offset  length  field
+    ------  ------  ---------------------------------------
+    0       6       magic: b"OSGT\\x01\\x00"
+    6       1       format_version (=1)
+    7       1       suite_id (1=CLASSIC_V1, 2=HYBRID_V1)
+    8       4       manifest_len (u32 big-endian)
+    12      M       manifest (canonical JSON, signed)
+    12+M    4       wrapped_dek_len (u32 BE)
+    ...     W       wrapped_dek (JSON: ephemeral_pub, nonce, wrapped_dek)
+    ...     24      aead_nonce
+    ...     4       ciphertext_len (u32 BE)
+    ...     C       ciphertext (XChaCha20-Poly1305(plaintext))
+
+Invariants:
+ * The manifest is signed BEFORE being inserted; signature is part of the manifest JSON.
+ * The AEAD associated data (AAD) = content_hash from the manifest. This ties
+ the ciphertext to the signed manifest: you can't swap ciphertexts between manifests.
+ * The manifest content_hash = sha256(plaintext). So verifying the plaintext after
+ decryption against the manifest closes the loop: you know the bytes you're reading
+ are exactly what the issuer signed for this recipient.
+"""
+
+from __future__ import annotations
+
+import io
+import json
+import struct
+from dataclasses import dataclass
+from typing import Optional
+
+from . import crypto
+from .manifest import Manifest
+
+
+MAGIC = b"OSGT\x01\x00"
+SUITE_CLASSIC_V1_ID = 1
+SUITE_HYBRID_V1_ID = 2
+
+
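+# Hard caps to prevent DoS via attacker-controlled length fields.
+MAX_MANIFEST_BYTES = 4 * 1024 * 1024           # 4 MiB
+MAX_WRAPPED_DEK_BYTES = 1 * 1024 * 1024        # 1 MiB (multi-recipient can be large)
+# NOTE: ciphertext_len is a u32, so it tops out at 4 GiB - 1 and can never
+# exceed this cap; the check against it is a backstop only.
+MAX_CIPHERTEXT_BYTES = 4 * 1024 * 1024 * 1024  # 4 GiB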
+
+
+def _read_exact(buf: io.BytesIO, n: int, field: str) -> bytes:
+    """Read exactly n bytes or raise ValueError."""
+    data = buf.read(n)
+    if len(data) != n:
+        raise ValueError(f"truncated file: wanted {n} bytes for {field}, got {len(data)}")
+    return data
+
+
+@dataclass
+class SealedFile:
+    manifest: Manifest
+    wrapped_dek: dict  # {ephemeral_pub, nonce, wrapped_dek} hex
+    aead_nonce: bytes
+    ciphertext: bytes
+    suite_id: int = SUITE_CLASSIC_V1_ID
+
+    # ---- serialize ----
+
+    def to_bytes(self) -> bytes:
+        buf = io.BytesIO()
+        buf.write(MAGIC)
+        buf.write(bytes([1, self.suite_id]))
+
+        manifest_json = self.manifest.to_json()
+        buf.write(struct.pack(">I", len(manifest_json)))
+        buf.write(manifest_json)
+
+        wrapped_json = json.dumps(
+            self.wrapped_dek, sort_keys=True, separators=(",", ":")
+        ).encode("utf-8")
+        buf.write(struct.pack(">I", len(wrapped_json)))
+        buf.write(wrapped_json)
+
+        buf.write(self.aead_nonce)
+        buf.write(struct.pack(">I", len(self.ciphertext)))
+        buf.write(self.ciphertext)
+
+        return buf.getvalue()
+
+    @classmethod
+    def from_bytes(cls, data: bytes) -> "SealedFile":
+        buf = io.BytesIO(data)
+        magic = _read_exact(buf, 6, "magic")
+        if magic != MAGIC:
+            raise ValueError(f"Not a .sealed file (bad magic: {magic!r})")
+
+        hdr = _read_exact(buf, 2, "version/suite")
+        fmt_ver, suite_id = hdr[0], hdr[1]
+        if fmt_ver != 1:
+            raise ValueError(f"Unsupported format version: {fmt_ver}")
+
+        (mlen,) = struct.unpack(">I", _read_exact(buf, 4, "manifest_len"))
+        if mlen > MAX_MANIFEST_BYTES:
+            raise ValueError(f"manifest too large: {mlen} > {MAX_MANIFEST_BYTES}")
+        manifest_json = _read_exact(buf, mlen, "manifest")
+        manifest = Manifest.from_json(manifest_json)
+
+        (wlen,) = struct.unpack(">I", _read_exact(buf, 4, "wrapped_dek_len"))
+        if wlen > MAX_WRAPPED_DEK_BYTES:
+            raise ValueError(f"wrapped_dek too large: {wlen} > {MAX_WRAPPED_DEK_BYTES}")
+        wrapped_dek = json.loads(_read_exact(buf, wlen, "wrapped_dek").decode("utf-8"))
+
+        aead_nonce = _read_exact(buf, 24, "aead_nonce")
+        (clen,) = struct.unpack(">I", _read_exact(buf, 4, "ciphertext_len"))
+        if clen > MAX_CIPHERTEXT_BYTES:
+            raise ValueError(f"ciphertext too large: {clen} > {MAX_CIPHERTEXT_BYTES}")
+        ciphertext = _read_exact(buf, clen, "ciphertext")
+
+        return cls(
+            manifest=manifest,
+            wrapped_dek=wrapped_dek,
+            aead_nonce=aead_nonce,
+            ciphertext=ciphertext,
+            suite_id=suite_id,
+        )
+
+
+# ------------- high-level API -------------
+
+def seal(
+    plaintext: bytes,
+    manifest: Manifest,
+    issuer_ed25519_priv: bytes,
+    recipient_x25519_pub: bytes,
+) -> bytes:
+    """
+    Produce a .sealed blob for `recipient_x25519_pub`.
+
+    Preconditions:
+        manifest.content_hash must already be set to sha256(plaintext).
+        manifest.size_bytes must match len(plaintext).
+        manifest.recipient.x25519_pub must match recipient_x25519_pub (hex).
+    """
+    # NOTE: use `raise` not `assert` so `python -O` can't disable checks.
+    if manifest.content_hash != crypto.content_hash(plaintext):
+        raise ValueError("manifest.content_hash does not match sha256(plaintext)")
+    if manifest.size_bytes != len(plaintext):
+        raise ValueError("manifest.size_bytes does not match len(plaintext)")
+    if manifest.recipient is None:
+        raise ValueError("manifest.recipient is required for single-recipient seal")
+    if manifest.recipient.x25519_pub != recipient_x25519_pub.hex():
+        raise ValueError("manifest.recipient.x25519_pub does not match the provided pubkey")
+    if len(recipient_x25519_pub) != 32:
+        raise ValueError(f"recipient pubkey must be 32 bytes, got {len(recipient_x25519_pub)}")
+    if len(issuer_ed25519_priv) != 32:
+        raise ValueError(f"issuer priv key must be 32 bytes, got {len(issuer_ed25519_priv)}")
+
+    manifest.sign(issuer_ed25519_priv)
+    dek = crypto.random_dek()
+    wrapped = crypto.wrap_dek_for_recipient(dek, recipient_x25519_pub)
+    aad = manifest.content_hash.encode("ascii")
+    nonce, ct = crypto.aead_encrypt(dek, plaintext, aad=aad)
+    sf = SealedFile(
+        manifest=manifest, wrapped_dek=wrapped, aead_nonce=nonce, ciphertext=ct,
+    )
+    return sf.to_bytes()
+
+
+def open_sealed(
+    blob: bytes,
+    recipient_x25519_priv: bytes,
+    trusted_issuer_pubs: Optional[set[str]] = None,
+    policy_ctx: Optional["PolicyContext"] = None,
+) -> tuple[bytes, Manifest]:
+    """
+    Decrypt a .sealed blob. Returns (plaintext, manifest).
+
+    Verification order (fail-fast):
+      1. Parse container, reject malformed.
+      2. Verify manifest signature (Ed25519).
+      3. If trusted_issuer_pubs provided, verify issuer is in set.
+      4. Policy check (not_after, not_before, jurisdiction).
+      5. Atomically check-and-bump max_opens BEFORE any decryption.
+      6. Unwrap DEK (multi-recipient: try each slot).
+      7. AEAD decrypt with AAD = content_hash (binds ciphertext to manifest).
+      8. Post-decrypt SHA-256 check.
+    """
+    from .policy import check_policy, record_open
+
+    if len(recipient_x25519_priv) != 32:
+        raise ValueError(
+            f"recipient priv key must be 32 bytes, got {len(recipient_x25519_priv)}"
+        )
+
+    sf = SealedFile.from_bytes(blob)
+
+    if not sf.manifest.verify():
+        raise ValueError("Manifest signature invalid")
+
+    if trusted_issuer_pubs is not None:
+        if sf.manifest.issuer_ed25519_pub not in trusted_issuer_pubs:
+            raise ValueError(
+                f"Issuer not trusted: {sf.manifest.issuer_ed25519_pub[:16]}..."
+            )
+
+    # Cheap, read-only policy checks (may raise PolicyViolation)
+    check_policy(sf.manifest, policy_ctx)
+
+    # Atomically check-and-bump the open counter BEFORE any crypto work.
+    # If max_opens is exceeded this raises PolicyViolation and we never decrypt.
+    record_open(sf.manifest, policy_ctx)
+
+    # Recover DEK. For multi-recipient files, wrapped_dek contains a 'slots'
+    # list; we try each slot in turn. A "wrong key" exception is expected when
+    # trying non-matching slots; we only bail if NO slot decrypts.
+    dek = None
+    if "slots" in sf.wrapped_dek:
+        last_exc: Optional[Exception] = None
+        for slot in sf.wrapped_dek["slots"]:
+            try:
+                dek = crypto.unwrap_dek(slot, recipient_x25519_priv)
+                break
+            except Exception as e:
+                last_exc = e
+                continue
+        if dek is None:
+            raise ValueError(
+                f"No decryptable slot found for this recipient "
+                f"(tried {len(sf.wrapped_dek['slots'])} slots): {last_exc}"
+            )
+    else:
+        dek = crypto.unwrap_dek(sf.wrapped_dek, recipient_x25519_priv)
+
+    aad = sf.manifest.content_hash.encode("ascii")
+    plaintext = crypto.aead_decrypt(dek, sf.aead_nonce, sf.ciphertext, aad=aad)
+
+    if crypto.content_hash(plaintext) != sf.manifest.content_hash:
+        raise ValueError("Plaintext hash does not match manifest")
+
+    return plaintext, sf.manifest
+
+
+def seal_multi(
+    plaintext: bytes,
+    manifest: Manifest,
+    issuer_ed25519_priv: bytes,
+    recipient_x25519_pubs: list[bytes],
+) -> bytes:
+    """
+    Seal a single file for multiple recipients. Each recipient gets a unique
+    wrap of the same DEK, stored as a list under wrapped_dek["slots"]; the
+    ciphertext itself is shared by all recipients.
+    """
+    if manifest.content_hash != crypto.content_hash(plaintext):
+        raise ValueError("manifest.content_hash does not match sha256(plaintext)")
+    if manifest.size_bytes != len(plaintext):
+        raise ValueError("manifest.size_bytes does not match len(plaintext)")
+    if len(recipient_x25519_pubs) < 1:
+        raise ValueError("need at least one recipient")
+    if len(issuer_ed25519_priv) != 32:
+        raise ValueError(f"issuer priv key must be 32 bytes, got {len(issuer_ed25519_priv)}")
+    for i, pub in enumerate(recipient_x25519_pubs):
+        if len(pub) != 32:
+            raise ValueError(f"recipient[{i}] pubkey must be 32 bytes, got {len(pub)}")
+
+    manifest.sign(issuer_ed25519_priv)
+    dek = crypto.random_dek()
+    slots = [crypto.wrap_dek_for_recipient(dek, pub) for pub in recipient_x25519_pubs]
+    aad = manifest.content_hash.encode("ascii")
+    nonce, ct = crypto.aead_encrypt(dek, plaintext, aad=aad)
+    sf = SealedFile(
+        manifest=manifest,
+        wrapped_dek={"slots": slots},
+        aead_nonce=nonce,
+        ciphertext=ct,
+    )
+    return sf.to_bytes()
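The binary layout in the container docstring can be walked with nothing but `struct`. The sketch below is illustrative only: `section_offsets` and `build_toy_blob` are hypothetical helpers, not part of `oversight_core`, and the toy payloads are placeholders rather than a real manifest or ciphertext. It shows where each length-prefixed section starts, following the same field order as `SealedFile.to_bytes()`.

```python
import struct

MAGIC = b"OSGT\x01\x00"  # 6-byte magic, as in the layout table
NONCE_LEN = 24           # XChaCha20-Poly1305 nonce size


def build_toy_blob(manifest: bytes, wrapped: bytes, nonce: bytes, ct: bytes) -> bytes:
    """Assemble bytes in the documented field order (format 1, suite 1)."""
    return b"".join([
        MAGIC,
        bytes([1, 1]),                    # format_version, suite_id
        struct.pack(">I", len(manifest)), manifest,
        struct.pack(">I", len(wrapped)), wrapped,
        nonce,
        struct.pack(">I", len(ct)), ct,
    ])


def section_offsets(blob: bytes) -> dict[str, tuple[int, int]]:
    """Return {section: (offset, length)} without copying any payload."""
    if blob[:6] != MAGIC:
        raise ValueError("bad magic")
    pos = 8  # skip magic (6) + format_version (1) + suite_id (1)
    out: dict[str, tuple[int, int]] = {}
    for name in ("manifest", "wrapped_dek"):
        (length,) = struct.unpack_from(">I", blob, pos)  # u32 big-endian
        pos += 4
        out[name] = (pos, length)
        pos += length
    out["aead_nonce"] = (pos, NONCE_LEN)
    pos += NONCE_LEN
    (clen,) = struct.unpack_from(">I", blob, pos)
    pos += 4
    if pos + clen > len(blob):
        raise ValueError("truncated ciphertext")
    out["ciphertext"] = (pos, clen)
    return out
```

Unlike `from_bytes`, this sketch applies no size caps and lets `struct.error` surface on a short buffer, so it is suited to inspection tooling rather than to parsing untrusted input.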
oversight_core/crypto.py +337 -0
@@ -0,0 +1,337 @@
+"""
+oversight_core.crypto
+=====================
+
+Vetted primitives only. NO custom crypto.
+
+Classical (ships today):
+ - X25519 for key agreement
+ - Ed25519 for signatures
+ - XChaCha20-Poly1305 for AEAD
+ - BLAKE2b for hashing / MAC
+ - HKDF for key derivation
+ - Argon2id for password-based KDF (via libsodium)
+
+Post-quantum hooks (design-ready; enable via `use_pq=True` once liboqs is linked):
+ - ML-KEM-768 for key encapsulation (hybrid with X25519)
+ - ML-DSA-65 for signatures (hybrid with Ed25519)
+
+The container format is crypto-agile: the algorithm suite is declared in the header,
+so we can roll forward to full PQ without breaking existing sealed files.
+"""
+
+from __future__ import annotations
+
+import os
+import secrets
+from dataclasses import dataclass
+from typing import Optional
+
+from cryptography.hazmat.primitives.asymmetric.ed25519 import (
+ Ed25519PrivateKey,
+ Ed25519PublicKey,
+)
+from cryptography.hazmat.primitives.asymmetric.x25519 import (
+ X25519PrivateKey,
+ X25519PublicKey,
+)
+from cryptography.hazmat.primitives.kdf.hkdf import HKDF
+from cryptography.hazmat.primitives import hashes, serialization
+from nacl.bindings import (
+ crypto_aead_xchacha20poly1305_ietf_encrypt,
+ crypto_aead_xchacha20poly1305_ietf_decrypt,
+ crypto_aead_xchacha20poly1305_ietf_NPUBBYTES,
+ crypto_aead_xchacha20poly1305_ietf_KEYBYTES,
+)
+
+# Try to detect PQ availability
+try:
+    import oqs  # type: ignore
+
+    PQ_AVAILABLE = True
+except Exception:
+    PQ_AVAILABLE = False
+
+
+# ---------- constants ----------
+
+SUITE_CLASSIC_V1 = "OSGT-CLASSIC-v1" # X25519 + Ed25519 + XChaCha20-Poly1305
+SUITE_HYBRID_V1 = "OSGT-HYBRID-v1" # + ML-KEM-768 + ML-DSA-65
+
+XCHACHA_NONCE_LEN = crypto_aead_xchacha20poly1305_ietf_NPUBBYTES # 24
+XCHACHA_KEY_LEN = crypto_aead_xchacha20poly1305_ietf_KEYBYTES # 32
+
+
+# ---------- keypair wrappers ----------
+
+@dataclass
+class ClassicIdentity:
+    """Recipient / issuer identity: X25519 (encryption) + Ed25519 (signing)."""
+    x25519_priv: bytes   # 32 bytes
+    x25519_pub: bytes    # 32 bytes
+    ed25519_priv: bytes  # 32 bytes (seed)
+    ed25519_pub: bytes   # 32 bytes
+
+    @classmethod
+    def generate(cls) -> "ClassicIdentity":
+        xsk = X25519PrivateKey.generate()
+        esk = Ed25519PrivateKey.generate()
+        return cls(
+            x25519_priv=xsk.private_bytes(
+                encoding=serialization.Encoding.Raw,
+                format=serialization.PrivateFormat.Raw,
+                encryption_algorithm=serialization.NoEncryption(),
+            ),
+            x25519_pub=xsk.public_key().public_bytes(
+                encoding=serialization.Encoding.Raw,
+                format=serialization.PublicFormat.Raw,
+            ),
+            ed25519_priv=esk.private_bytes(
+                encoding=serialization.Encoding.Raw,
+                format=serialization.PrivateFormat.Raw,
+                encryption_algorithm=serialization.NoEncryption(),
+            ),
+            ed25519_pub=esk.public_key().public_bytes(
+                encoding=serialization.Encoding.Raw,
+                format=serialization.PublicFormat.Raw,
+            ),
+        )
+
+    def public_bundle(self) -> dict:
+        return {
+            "x25519_pub": self.x25519_pub.hex(),
+            "ed25519_pub": self.ed25519_pub.hex(),
+        }
+
+
+# ---------- AEAD ----------
+
+def aead_encrypt(key: bytes, plaintext: bytes, aad: bytes = b"") -> tuple[bytes, bytes]:
+    """
+    XChaCha20-Poly1305. Returns (nonce, ciphertext_with_tag).
+    24-byte nonce = safe to random-generate without coordination.
+    """
+    # Use `raise`, not `assert`, so `python -O` can't strip the key-length check.
+    if len(key) != XCHACHA_KEY_LEN:
+        raise ValueError(f"XChaCha key must be {XCHACHA_KEY_LEN} bytes, got {len(key)}")
+    nonce = secrets.token_bytes(XCHACHA_NONCE_LEN)
+    ct = crypto_aead_xchacha20poly1305_ietf_encrypt(plaintext, aad, nonce, key)
+    return nonce, ct
+
+
+def aead_decrypt(key: bytes, nonce: bytes, ciphertext: bytes, aad: bytes = b"") -> bytes:
+    return crypto_aead_xchacha20poly1305_ietf_decrypt(ciphertext, aad, nonce, key)
+
+
+# ---------- key agreement: wrap the DEK for a recipient ----------
+
+def wrap_dek_for_recipient(
+    dek: bytes,
+    recipient_x25519_pub: bytes,
+    ephemeral_priv: Optional[X25519PrivateKey] = None,
+) -> dict:
+    """
+    Encrypt a Data Encryption Key (DEK) for a single recipient using ECIES-style
+    X25519 key agreement + HKDF-SHA256 + XChaCha20-Poly1305.
+
+    Returns a dict with: ephemeral_pub, nonce, wrapped_dek (all hex).
+    """
+    eph = ephemeral_priv or X25519PrivateKey.generate()
+    peer = X25519PublicKey.from_public_bytes(recipient_x25519_pub)
+    shared = eph.exchange(peer)
+
+    kek = HKDF(
+        algorithm=hashes.SHA256(),
+        length=32,
+        salt=None,
+        info=b"oversight-v1-dek-wrap",
+    ).derive(shared)
+
+    nonce, wrapped = aead_encrypt(kek, dek, aad=b"oversight-dek")
+    eph_pub = eph.public_key().public_bytes(
+        encoding=serialization.Encoding.Raw,
+        format=serialization.PublicFormat.Raw,
+    )
+    return {
+        "ephemeral_pub": eph_pub.hex(),
+        "nonce": nonce.hex(),
+        "wrapped_dek": wrapped.hex(),
+    }
+
+
+def unwrap_dek(wrapped: dict, recipient_x25519_priv: bytes) -> bytes:
+    """Recover the DEK using the recipient's X25519 private key."""
+    sk = X25519PrivateKey.from_private_bytes(recipient_x25519_priv)
+    eph_pub = X25519PublicKey.from_public_bytes(bytes.fromhex(wrapped["ephemeral_pub"]))
+    shared = sk.exchange(eph_pub)
+
+    kek = HKDF(
+        algorithm=hashes.SHA256(),
+        length=32,
+        salt=None,
+        info=b"oversight-v1-dek-wrap",
+    ).derive(shared)
+
+    return aead_decrypt(
+        kek,
+        bytes.fromhex(wrapped["nonce"]),
+        bytes.fromhex(wrapped["wrapped_dek"]),
+        aad=b"oversight-dek",
+    )
+
+
+# ---------- signatures ----------
+
+def sign_manifest(manifest_bytes: bytes, ed25519_priv: bytes) -> bytes:
+    sk = Ed25519PrivateKey.from_private_bytes(ed25519_priv)
+    return sk.sign(manifest_bytes)
+
+
+def verify_manifest(manifest_bytes: bytes, signature: bytes, ed25519_pub: bytes) -> bool:
+    try:
+        Ed25519PublicKey.from_public_bytes(ed25519_pub).verify(signature, manifest_bytes)
+        return True
+    except Exception:
+        return False
+
+
+# ---------- PQ hooks (activated when liboqs is installed) ----------
+
+def pq_kem_keypair() -> tuple[bytes, bytes]:
+    """Generate ML-KEM-768 keypair. Returns (priv, pub)."""
+    if not PQ_AVAILABLE:
+        raise RuntimeError("liboqs not available; install liboqs + liboqs-python")
+    with oqs.KeyEncapsulation("ML-KEM-768") as kem:
+        pub = kem.generate_keypair()
+        priv = kem.export_secret_key()
+        return priv, pub
+
+
+def pq_kem_encap(peer_pub: bytes) -> tuple[bytes, bytes]:
+    """Encapsulate a shared secret to peer_pub. Returns (ciphertext, shared_secret)."""
+    if not PQ_AVAILABLE:
+        raise RuntimeError("liboqs not available")
+    with oqs.KeyEncapsulation("ML-KEM-768") as kem:
+        ct, ss = kem.encap_secret(peer_pub)
+        return ct, ss
+
+
+def pq_kem_decap(priv: bytes, ct: bytes) -> bytes:
+    """Recover shared secret from ciphertext using private key."""
+    if not PQ_AVAILABLE:
+        raise RuntimeError("liboqs not available")
+    with oqs.KeyEncapsulation("ML-KEM-768", secret_key=priv) as kem:
+        return kem.decap_secret(ct)
+
+
+def pq_sig_keypair() -> tuple[bytes, bytes]:
+    """Generate ML-DSA-65 keypair. Returns (priv, pub)."""
+    if not PQ_AVAILABLE:
+        raise RuntimeError("liboqs not available")
+    with oqs.Signature("ML-DSA-65") as sig:
+        pub = sig.generate_keypair()
+        priv = sig.export_secret_key()
+        return priv, pub
+
+
+def pq_sign(msg: bytes, priv: bytes) -> bytes:
+    if not PQ_AVAILABLE:
+        raise RuntimeError("liboqs not available")
+    with oqs.Signature("ML-DSA-65", secret_key=priv) as sig:
+        return sig.sign(msg)
+
+
+def pq_verify(msg: bytes, signature: bytes, pub: bytes) -> bool:
+    """Narrowly catches signature-verification failures; propagates other errors."""
+    if not PQ_AVAILABLE:
+        return False
+    try:
+        with oqs.Signature("ML-DSA-65") as ver:
+            return ver.verify(msg, signature, pub)
+    except (ValueError, RuntimeError):
+        # liboqs surfaces failed verifies as RuntimeError in some builds, or
+        # ValueError for malformed inputs. Everything else (MemoryError,
+        # KeyboardInterrupt, etc.) propagates.
+        return False
+
+
+def hybrid_wrap_dek(dek: bytes, x25519_pub: bytes, mlkem_pub: bytes) -> dict:
+    """
+    Hybrid DEK wrap: combines X25519 and ML-KEM-768 shared secrets via HKDF.
+    An attacker must break BOTH X25519 AND ML-KEM-768 to recover the KEK.
+
+    KDF input (defense-in-depth; X-wing-style): the HKDF IKM includes both
+    shared secrets AND both ciphertexts/ephemeral pubs, binding the KEK to
+    this specific encapsulation. This prevents any future construction where
+    an attacker could substitute a valid-but-different ciphertext.
+    """
+    if not PQ_AVAILABLE:
+        raise RuntimeError("liboqs not available - cannot wrap hybrid")
+    if len(x25519_pub) != 32:
+        raise ValueError(f"x25519_pub must be 32 bytes, got {len(x25519_pub)}")
+
+    eph = X25519PrivateKey.generate()
+    peer_x = X25519PublicKey.from_public_bytes(x25519_pub)
+    ss_x = eph.exchange(peer_x)
+    mlkem_ct, ss_pq = pq_kem_encap(mlkem_pub)
+
+    eph_pub = eph.public_key().public_bytes(
+        encoding=serialization.Encoding.Raw,
+        format=serialization.PublicFormat.Raw,
+    )
+
+    # Bind KEK to the full encapsulation, not just the two shared secrets.
+    ikm = ss_x + ss_pq + eph_pub + mlkem_ct
+    kek = HKDF(
+        algorithm=hashes.SHA256(), length=32, salt=None,
+        info=b"oversight-hybrid-v1-dek-wrap",
+    ).derive(ikm)
+
+    nonce, wrapped = aead_encrypt(kek, dek, aad=b"oversight-hybrid-dek")
+    return {
+        "suite": "OSGT-HYBRID-v1",
+        "x25519_ephemeral_pub": eph_pub.hex(),
+        "mlkem_ciphertext": mlkem_ct.hex(),
+        "nonce": nonce.hex(),
+        "wrapped_dek": wrapped.hex(),
+    }
+
+
+def hybrid_unwrap_dek(wrapped: dict, x25519_priv: bytes, mlkem_priv: bytes) -> bytes:
+    """Recover DEK from a hybrid-wrapped envelope."""
+    if not PQ_AVAILABLE:
+        raise RuntimeError("liboqs not available - cannot unwrap hybrid")
+    for required in ("x25519_ephemeral_pub", "mlkem_ciphertext", "nonce", "wrapped_dek"):
+        if required not in wrapped:
+            raise ValueError(f"hybrid envelope missing field: {required}")
+
+    eph_pub_bytes = bytes.fromhex(wrapped["x25519_ephemeral_pub"])
+    mlkem_ct = bytes.fromhex(wrapped["mlkem_ciphertext"])
+
+    sk_x = X25519PrivateKey.from_private_bytes(x25519_priv)
+    eph_pub = X25519PublicKey.from_public_bytes(eph_pub_bytes)
+    ss_x = sk_x.exchange(eph_pub)
+    ss_pq = pq_kem_decap(mlkem_priv, mlkem_ct)
+
+    ikm = ss_x + ss_pq + eph_pub_bytes + mlkem_ct
+    kek = HKDF(
+        algorithm=hashes.SHA256(), length=32, salt=None,
+        info=b"oversight-hybrid-v1-dek-wrap",
+    ).derive(ikm)
+
+    return aead_decrypt(
+        kek,
+        bytes.fromhex(wrapped["nonce"]),
+        bytes.fromhex(wrapped["wrapped_dek"]),
+        aad=b"oversight-hybrid-dek",
+    )
+
+
+# ---------- utility ----------
+
+def random_dek() -> bytes:
+    return secrets.token_bytes(XCHACHA_KEY_LEN)
+
+
+def content_hash(data: bytes) -> str:
+    digest = hashes.Hash(hashes.SHA256())
+    digest.update(data)
+    return digest.finalize().hex()
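The hybrid wrap binds the KEK to the whole encapsulation by feeding both shared secrets, the ephemeral public key, and the ML-KEM ciphertext into HKDF. The sketch below reproduces only that derivation step with the standard library, so the binding can be checked without liboqs. `hkdf_sha256` and `hybrid_kek` are illustrative names, and the inputs in the usage note are dummy bytes, not real key material. The real module uses the `cryptography` package's HKDF; this is a stand-in written from RFC 5869, assuming SHA-256 and an absent salt as in `hybrid_wrap_dek`.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested HKDF output is too long")
    # Extract: an absent salt is treated as HASH_LEN zero bytes.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_kek(ss_x: bytes, ss_pq: bytes, eph_pub: bytes, mlkem_ct: bytes) -> bytes:
    """Derive the KEK from the same IKM ordering used by hybrid_wrap_dek."""
    ikm = ss_x + ss_pq + eph_pub + mlkem_ct
    return hkdf_sha256(ikm, info=b"oversight-hybrid-v1-dek-wrap")
```

Calling `hybrid_kek` twice with the same shared secrets but a one-byte difference in `mlkem_ct` yields two unrelated 32-byte keys, which is the substitution resistance the docstring describes. Plain concatenation is unambiguous here because every field has a fixed length for this suite (32-byte secrets and public key, 1088-byte ML-KEM-768 ciphertext).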
oversight_core/decoy.py +225 -0
@@ -0,0 +1,225 @@
+"""
+oversight_core.decoy
+====================
+
+LLM-powered decoy document generator.
+
+Generates N plausible-looking decoy files that sit alongside real sensitive
+content. Every decoy is sealed for a "trap" recipient whose beacons all fire
+when accessed. Any open of a decoy is a high-confidence signal of intrusion:
+no legitimate user should touch them, because the filenames are engineered
+to be interesting to an attacker browsing a compromised folder.
+
+This is the Thinkst canary pattern applied at scale with LLM-generated
+realism. Recent research (SPADE 2025, HoneyGPT) suggests this is still an
+open area with no mature commercial product.
+
+Backend options (pick via `backend` arg or OVERSIGHT_DECOY_BACKEND env):
+  - "ollama" - POST to a local Ollama server (recommended; uses GPU node)
+  - "openai" - OpenAI-compatible API (for testing; generate_decoy() has no
+               handler for it in this file, so it falls through to "static")
+  - "static" - hardcoded templates (works offline; lowest quality)
+
+Point OLLAMA_URL at any Ollama instance; default is loopback.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import random
+from dataclasses import dataclass
+from typing import Optional
+
+import httpx
+
+
+DEFAULT_OLLAMA = os.environ.get("OLLAMA_URL", "http://[redacted-rfc1918]")
+DEFAULT_MODEL = os.environ.get("OVERSIGHT_DECOY_MODEL", "dolphin-mistral:7b-v2.8")
+
+
+# Realistic decoy filenames. These are deliberately interesting to an attacker
+# skimming a compromised folder.
+DEFAULT_DECOY_NAMES = [
+ "Q4-board-deck-FINAL-v3.docx",
+ "acquisition-targets-2026.xlsx",
+ "legal-hold-privileged.pdf",
+ "compensation-bands-confidential.xlsx",
+ "incident-response-playbook-internal.docx",
+ "vendor-contracts-summary.pdf",
+ "cto-1on1-notes.docx",
+ "layoff-planning-tier1.xlsx",
+ "customer-churn-risk-2026.xlsx",
+ "M&A-pipeline-confidential.pptx",
+ "security-audit-findings-Q3.pdf",
+ "api-keys-rotation-plan.txt",
+ "lawsuit-draft-settlement.docx",
+ "executive-bonus-structure.xlsx",
+ "strategic-partnership-nda-drafts.pdf",
+]
+
+
+# Prompt template. The system prompt steers the model toward plausibility
+# without generating anything actually sensitive or real.
+DECOY_SYSTEM_PROMPT = """You are a corporate document generator for a security
+research system. You produce plausible-looking but entirely fictional business
+documents that will be used as decoys in an intrusion-detection system. All
+names, numbers, and claims must be invented - never use real company names,
+real people, or real data. The goal is realism of form, not content.
+
+Rules:
+- All dollar figures are fake.
+- All people are fictional (use generic names like "A. Smith", "J. Chen").
+- All company names are fake (use "Acme Industries", "Meridian Partners").
+- Avoid dates in the near past (the document should look "current" as of 2026).
+- Tone: dry, corporate, slightly bureaucratic. No irony.
+- Length: 250-600 words for text documents.
+"""
+
+
+@dataclass
+class DecoyRequest:
+    """A request to generate one decoy."""
+    filename: str
+    # Brief description of the kind of document to produce
+    topic_hint: str
+    # Additional context (e.g., industry, team)
+    context: Optional[str] = None
+
+
+def _prompt_for(req: DecoyRequest) -> str:
+    ctx = f"\nOrganizational context: {req.context}" if req.context else ""
+    return (
+        f"Produce a realistic but entirely fictional document that would "
+        f"plausibly be saved as the filename '{req.filename}'. The topic is: "
+        f"{req.topic_hint}.{ctx}\n\n"
+        f"Write the full document body. No preamble, no meta-commentary. "
+        f"Begin the document directly."
+    )
+
+
+def _topic_from_filename(name: str) -> str:
+    """Heuristic: guess topic from filename when not otherwise specified."""
+    n = name.lower()
+    if "board" in n or "deck" in n:
+        return "quarterly board meeting update"
+    if "acquisition" in n or "m&a" in n or "pipeline" in n:
+        return "shortlist of acquisition targets with preliminary valuations"
+    if "legal" in n or "lawsuit" in n:
+        return "legal memo with privileged work-product notation"
+    if "comp" in n or "bonus" in n or "bands" in n:
+        return "executive compensation band summary"
+    if "incident" in n or "playbook" in n:
+        return "internal incident response playbook"
+    if "audit" in n or "findings" in n:
+        return "internal security audit findings summary"
+    if "api" in n or "key" in n:
+        return "API key rotation plan with endpoint references"
+    if "layoff" in n:
+        return "workforce reduction planning notes"
+    if "churn" in n:
+        return "customer churn risk analysis"
+    if "partnership" in n or "nda" in n:
+        return "strategic partnership NDA draft negotiation notes"
+    if "1on1" in n or "notes" in n:
+        return "executive one-on-one meeting notes"
+    if "vendor" in n or "contract" in n:
+        return "vendor contract summary with renewal dates"
+    return "internal business memo"
+
+
+# ---------------------------------------------------------------------
+# Backends
+# ---------------------------------------------------------------------
+
+def _generate_ollama(
+    req: DecoyRequest,
+    ollama_url: str = DEFAULT_OLLAMA,
+    model: str = DEFAULT_MODEL,
+    timeout: float = 120.0,
+) -> str:
+    prompt = _prompt_for(req)
+    r = httpx.post(
+        f"{ollama_url.rstrip('/')}/api/generate",
+        json={
+            "model": model,
+            "prompt": prompt,
+            "system": DECOY_SYSTEM_PROMPT,
+            "stream": False,
+            "options": {"temperature": 0.8, "top_p": 0.9, "num_predict": 800},
+        },
+        timeout=timeout,
+    )
+    r.raise_for_status()
+    return r.json()["response"]
+
+
+def _generate_static(req: DecoyRequest) -> str:
+    """Offline fallback. Good enough for testing; not production."""
+    lines = [
+        f"INTERNAL - {req.filename}",
+        f"Topic: {req.topic_hint}",
+        "",
+        "Summary",
+        "-------",
+        f"This document covers the {req.topic_hint}. It is distributed to a",
+        "limited group and should not be shared externally. Figures cited below",
+        "are preliminary and subject to revision.",
+        "",
+        "Key points",
+        "----------",
+        "- Reviewed by: A. Smith, J. Chen",
+        "- Next review: Q3 2026",
+        "- Distribution: executive leadership only",
+        "- Classification: CONFIDENTIAL - RESTRICTED",
+        "",
+        "Background",
+        "----------",
+    ]
+    for i in range(30):
+        lines.append(
+            f"Paragraph {i+1}: standard corporate filler content for the "
+            f"{req.topic_hint} topic, written to give plausible body to a "
+            f"decoy document."
+        )
+    return "\n".join(lines)
+
+
+def generate_decoy(
+    req: DecoyRequest,
+    backend: Optional[str] = None,
+    ollama_url: str = DEFAULT_OLLAMA,
+    model: str = DEFAULT_MODEL,
+) -> str:
+    """Generate a single decoy document body. Returns the text content."""
+    backend = backend or os.environ.get("OVERSIGHT_DECOY_BACKEND", "ollama")
+
+    try:
+        if backend == "ollama":
+            return _generate_ollama(req, ollama_url=ollama_url, model=model)
+    except Exception as e:
+        # Fall back to static template on LLM failure.
+        print(f"[decoy] backend '{backend}' failed ({e}); falling back to static")
+
+    # "static" and any backend without a handler above (including "openai")
+    # end up here.
+    return _generate_static(req)
+
+
+def generate_decoy_set(
+    n: int = 5,
+    filenames: Optional[list[str]] = None,
+    context: Optional[str] = None,
+    backend: Optional[str] = None,
+) -> list[tuple[str, str]]:
+    """
+    Generate up to N decoys. Returns list of (filename, body) tuples.
+
+    Fewer than N are returned when fewer filenames are available (the
+    supplied list, or DEFAULT_DECOY_NAMES when none is given).
+    """
+    names = filenames or random.sample(DEFAULT_DECOY_NAMES, min(n, len(DEFAULT_DECOY_NAMES)))
+    out = []
+    for name in names[:n]:
+        req = DecoyRequest(
+            filename=name,
+            topic_hint=_topic_from_filename(name),
+            context=context,
+        )
+        body = generate_decoy(req, backend=backend)
+        out.append((name, body))
+    return out
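The decoy module lists three backends but dispatches on only one of them, with everything else falling through to the static template. One possible alternative, shown here purely as a sketch and not as the author's design, is a dispatch table that rejects unknown backend names while keeping the fall-back-on-failure behaviour. `BACKENDS`, `static_body`, and `generate` are hypothetical names, and the generator signature is simplified to `(filename, topic)`.

```python
from typing import Callable

Generator = Callable[[str, str], str]


def static_body(filename: str, topic: str) -> str:
    """Offline template; stands in for the module's static fallback."""
    return f"INTERNAL - {filename}\nTopic: {topic}\n"


# Registry of available generators; an LLM-backed one would be added here.
BACKENDS: dict[str, Generator] = {"static": static_body}


def generate(filename: str, topic: str, backend: str = "static") -> str:
    try:
        fn = BACKENDS[backend]
    except KeyError:
        # A typo in the backend name is a configuration error, not a
        # runtime failure, so it is surfaced instead of masked.
        raise ValueError(f"unknown decoy backend: {backend!r}") from None
    try:
        return fn(filename, topic)
    except Exception as exc:
        # Runtime failures (network down, model missing) still degrade
        # to the static template, as in generate_decoy().
        print(f"[decoy] backend '{backend}' failed ({exc}); falling back to static")
        return static_body(filename, topic)
```

The trade-off is that a misconfigured `OVERSIGHT_DECOY_BACKEND` would stop decoy generation outright instead of quietly producing low-quality static decoys.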
oversight_core/formats/__init__.py +24 -0
@@ -0,0 +1,24 @@
+"""
+oversight_core.formats
+======================
+
+Format-specific watermarking adapters.
+
+Each adapter knows how to embed and extract a mark_id for one file family.
+The core protocol (container.py, crypto.py, manifest.py, beacon.py) is
+format-agnostic; these adapters let watermarking work on more than plain text.
+
+MVP adapters:
+  text   - L1 zero-width + L2 whitespace + L3 semantic (already in watermark.py + semantic.py)
+  image  - DCT-domain frequency watermark (robust to recompression, resize, moderate crop)
+  pdf    - per-recipient metadata + text-layer marks
+  docx   - Office XML metadata injection
+
+Not in MVP (roadmap):
+  video  - per-keyframe DCT + audio echo-hiding
+  audio  - echo-hiding + spread-spectrum
+  xlsx   - cell-comment marks + invisible columns/rows
+  pptx   - slide-note marks + image DCT on each slide image
+"""
+
+from . import text as text # re-export for convenience
oversight_core/formats/docx.py +83 -0
@@ -0,0 +1,83 @@
+"""
+oversight_core.formats.docx - Office DOCX adapter.
+
+Embeds mark_id in the core-properties `keywords` field, which is semi-visible
+in the Word UI. Two less visible carriers are also candidates, but embed()
+below does not write either of them:
+  1. A custom property in docProps/custom.xml
+  2. A Custom XML part - not visible in normal Word UI, harder to notice
+
+For strong cross-format survival, apply L1/L2/L3 text watermarking to the
+body text itself before packaging as DOCX. The metadata mark below is a
+secondary layer that's easy to strip but fast to read.
+
+Uses python-docx. XLSX and PPTX work similarly (shared Office OOXML format)
+but need their respective libraries (openpyxl, python-pptx).
+"""
+
+from __future__ import annotations
+
+import io
+from typing import Optional
+
+from docx import Document
+from docx.oxml.ns import qn
+from docx.oxml import OxmlElement
+
+
+def embed(
+ docx_bytes: bytes,
+ mark_id: bytes,
+ issuer_id: Optional[str] = None,
+ file_id: Optional[str] = None,
+) -> bytes:
+ """
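+ Embed mark_id in the DOCX core-properties keywords field.
+ If the keywords already carry an `oversight:` tag, the document is left
+ unchanged. Returns the (possibly modified) DOCX bytes.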
+ """
+ doc = Document(io.BytesIO(docx_bytes))
+
+ # python-docx doesn't expose docProps/custom.xml directly in older
+ # versions, so instead of adding a custom part we append a tag with a
+ # known prefix to the core `keywords` property.
+
+ existing = doc.core_properties.keywords or ""
+ tag = f"oversight:{mark_id.hex()}"
+ if issuer_id:
+ tag += f";issuer:{issuer_id}"
+ if file_id:
+ tag += f";fid:{file_id}"
+ if "oversight:" not in existing:
+ doc.core_properties.keywords = (
+ (existing + " " if existing else "") + tag
+ )
+
+ buf = io.BytesIO()
+ doc.save(buf)
+ return buf.getvalue()
+
+
+def extract(docx_bytes: bytes) -> dict:
+ """
+ Extract OVERSIGHT marks from DOCX core properties.
+ """
+ doc = Document(io.BytesIO(docx_bytes))
+ keywords = doc.core_properties.keywords or ""
+
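+ out = {"mark_id": None, "issuer_id": None, "file_id": None}
+ # embed() appends the tag after any pre-existing keywords, so start
+ # parsing at the "oversight:" prefix, not at the start of the field.
+ start = keywords.find("oversight:")
+ if start < 0:
+ return out
+ for part in keywords[start:].split(";"):
+ part = part.strip()
+ if part.startswith("oversight:"):
+ fields = part[len("oversight:"):].split()
+ if fields:
+ out["mark_id"] = fields[0]
+ elif part.startswith("issuer:"):
+ out["issuer_id"] = part[len("issuer:"):].strip()
+ elif part.startswith("fid:"):
+ out["file_id"] = part[len("fid:"):].strip()
+ return out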
+
+
+def extract_text_for_watermark_recovery(docx_bytes: bytes) -> str:
+ """Pull all body text from DOCX for L1/L2/L3 recovery."""
+ doc = Document(io.BytesIO(docx_bytes))
+ return "\n".join(p.text for p in doc.paragraphs)
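The keyword tag that `embed` writes is a semicolon-separated string appended after whatever keywords the document already had. The sketch below reproduces that layout in plain Python, without python-docx, and shows a parser that tolerates pre-existing keywords. The helper names `build_tag` and `parse_keywords` are hypothetical and exist only for this illustration.

```python
from typing import Optional


def build_tag(mark_id: bytes, issuer_id: Optional[str] = None,
              file_id: Optional[str] = None) -> str:
    """Build the tag string in the same layout embed() uses."""
    tag = f"oversight:{mark_id.hex()}"
    if issuer_id:
        tag += f";issuer:{issuer_id}"
    if file_id:
        tag += f";fid:{file_id}"
    return tag


def parse_keywords(keywords: str) -> dict:
    """Parse a keywords field that may contain unrelated leading keywords."""
    out = {"mark_id": None, "issuer_id": None, "file_id": None}
    start = keywords.find("oversight:")
    if start < 0:
        return out  # no tag present
    for part in keywords[start:].split(";"):
        part = part.strip()
        if part.startswith("oversight:"):
            fields = part[len("oversight:"):].split()
            if fields:
                out["mark_id"] = fields[0]
        elif part.startswith("issuer:"):
            out["issuer_id"] = part[len("issuer:"):].strip()
        elif part.startswith("fid:"):
            out["file_id"] = part[len("fid:"):].strip()
    return out


# A document that already had keywords before it was marked.
keywords = "quarterly report " + build_tag(bytes.fromhex("a1b2c3d4"), "acme", "f-001")
parsed = parse_keywords(keywords)
print(parsed)
```

Because `;` is the field separator, an `issuer_id` or `file_id` that itself contains a semicolon would not round-trip with this layout.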
oversight_core/formats/image.py +174 -0
@@ -0,0 +1,174 @@
+"""
+oversight_core.formats.image - image format adapter.
+
+DCT-domain frequency watermarking. Survives:
+ - JPEG recompression (quality >= 60 in our testing)
+ - Moderate resizing (up to ~50%)
+ - Minor cropping
+ - Format conversion (PNG <-> JPEG)
+
+Does NOT survive:
+ - Heavy compression (quality < 30)
+ - Aggressive cropping (> 30% removed)
+ - Rotation without knowing the angle
+ - Deliberate adversarial watermark-removal attacks (out of MVP scope)
+
+Algorithm: additive spread-spectrum in the DCT mid-band, after Cox et al.
+ 1. Convert to YCbCr, take Y (luma) channel.
+ 2. Apply 2D DCT to the full Y plane.
+ 3. Select up to N coefficients from a diagonal mid-frequency band
+ (skip DC and the lowest frequencies), in row-major order.
+ 4. Embed x_i by replacing coefficient c_i with c_i + alpha * |c_i| * x_i,
+ where x_i is a deterministic +1/-1 sequence derived from mark_id.
+ 5. Inverse DCT -> write back.
+
+Recovery: magnitude-normalized correlation between the DCT mid-band of the
+suspect image and the expected +1/-1 sequence derived from a candidate mark_id.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import io
+from typing import Optional
+
+import numpy as np
+from PIL import Image
+from scipy.fft import dct, idct # type: ignore
+
+
+def _mark_to_sequence(mark_id: bytes, length: int) -> np.ndarray:
+ """Deterministic +1/-1 sequence derived from mark_id."""
+ out = np.zeros(length, dtype=np.int8)
+ i = 0
+ ctr = 0
+ while i < length:
+ h = hashlib.sha256(mark_id + ctr.to_bytes(4, "big")).digest()
+ for byte in h:
+ for bit in range(8):
+ if i >= length:
+ break
+ out[i] = 1 if (byte >> bit) & 1 else -1
+ i += 1
+ ctr += 1
+ return out
+
+
+def _dct2(a: np.ndarray) -> np.ndarray:
+ return dct(dct(a, axis=0, norm="ortho"), axis=1, norm="ortho")
+
+
+def _idct2(a: np.ndarray) -> np.ndarray:
+ return idct(idct(a, axis=0, norm="ortho"), axis=1, norm="ortho")
+
+
+def _pick_midband_indices(shape: tuple[int, int], n: int = 1000) -> np.ndarray:
+ """
+ Pick indices of mid-frequency DCT coefficients. We skip the DC and lowest
+ frequencies (too visible when perturbed) and the highest (destroyed by JPEG).
+ """
+ H, W = shape
+ # Diagonal band. Roughly keep coefficients where (i + j) is in [lo, hi].
+ lo = int(min(H, W) * 0.10)
+ hi = int(min(H, W) * 0.40)
+ coords = []
+ for i in range(H):
+ for j in range(W):
+ if lo <= (i + j) <= hi:
+ coords.append((i, j))
+ coords = coords[:n]
+ return np.array(coords)
+
+
+def embed(
+ image_bytes: bytes,
+ mark_id: bytes,
+ alpha: float = 0.10,
+ n_coeffs: int = 1500,
+) -> bytes:
+ """
+ Embed mark_id into the DCT mid-band of the image.
+
+ Algorithm: for each of n_coeffs mid-band coefficients c_i, replace with
+ c'_i = c_i + alpha * |c_i| * bit_i
+ where bit_i is a deterministic +1/-1 sequence derived from mark_id.
+
+ This additive-scaled-by-magnitude form gives reliable blind detection
+ via normalized correlation, unlike pure sign-embedding which is
+ destroyed by clipping after iDCT.
+
+ Returns PNG bytes (lossless, to preserve the watermark for distribution).
+ Caller can recompress to JPEG for transmission; watermark survives
+ JPEG quality >= 60 in our testing.
+ """
+ img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+ ycbcr = img.convert("YCbCr")
+ y, cb, cr = ycbcr.split()
+ y_arr = np.array(y, dtype=np.float64)
+
+ D = _dct2(y_arr)
+ coords = _pick_midband_indices(D.shape, n=n_coeffs)
+ bits = _mark_to_sequence(mark_id, len(coords))
+
+ for (i, j), b in zip(coords, bits):
+ mag = abs(D[i, j])
+ D[i, j] = D[i, j] + alpha * mag * b
+
+ y_marked = _idct2(D)
+ y_marked = np.clip(y_marked, 0, 255).astype(np.uint8)
+ y2 = Image.fromarray(y_marked, mode="L")
+
+ out = Image.merge("YCbCr", (y2, cb, cr)).convert("RGB")
+ buf = io.BytesIO()
+ out.save(buf, format="PNG")
+ return buf.getvalue()
+
+
+def verify(
+ image_bytes: bytes,
+ candidate_mark_id: bytes,
+ threshold: float = 0.05,
+ n_coeffs: int = 1500,
+) -> tuple[bool, float]:
+ """
+ Blind detection of candidate_mark_id in the image's DCT mid-band.
+
+ Returns (match, score).
+
+ Correlation metric:
+ score = sum(coeffs * expected) / sum(|coeffs|)
+
+ where coeffs are the actual mid-band DCT values and expected is the
+ +1/-1 sequence for candidate_mark_id. An unmarked image gives score ~ 0.
+ A freshly marked image gives score ~ alpha for the correct mark_id.
+
+ The default threshold (0.05) is half the default alpha (0.10); calibrate
+ on your test set. Score for an incorrect mark_id is distributed around 0
+ with stddev of at least 1/sqrt(n_coeffs), so ~0.026 or more for
+ n_coeffs=1500. A correctly marked image typically scores > 0.03.
+ """
+ img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+ ycbcr = img.convert("YCbCr")
+ y = ycbcr.split()[0]
+ y_arr = np.array(y, dtype=np.float64)
+
+ D = _dct2(y_arr)
+ coords = _pick_midband_indices(D.shape, n=n_coeffs)
+ expected = _mark_to_sequence(candidate_mark_id, len(coords)).astype(np.float64)
+
+ vals = np.array([D[i, j] for (i, j) in coords], dtype=np.float64)
+ # Magnitude-normalized correlation (Cox et al. blind detection):
+ # score = <vals, expected> / sum(|vals|)
+ # Expected value is ~alpha for the correct mark, ~0 otherwise.
+ score = float(np.sum(vals * expected) / (np.sum(np.abs(vals)) + 1e-9))
+ return (score > 0 and score >= threshold), score
+
+
+def perceptual_hash(image_bytes: bytes) -> str:
+ """
+ Perceptual hash (pHash) for fuzzy leak-match lookup.
+ Uses imagehash. 64-bit output, hex-encoded.
+ """
+ import imagehash # type: ignore
+ img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+ return str(imagehash.phash(img))
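To make the detection score concrete, here is a minimal simulation of the embed/verify arithmetic on synthetic coefficients, using only the standard library. It is a sketch under stated assumptions, not the adapter itself: Gaussian random numbers stand in for real mid-band DCT values, `N` is set far higher than the module's default so the noise floor is small, and the names `mark_sequence` and `score` are hypothetical.

```python
import hashlib
import random


def mark_sequence(mark_id: bytes, length: int) -> list:
    """+1/-1 sequence from SHA-256 in counter mode (mirrors _mark_to_sequence)."""
    out = []
    ctr = 0
    while len(out) < length:
        digest = hashlib.sha256(mark_id + ctr.to_bytes(4, "big")).digest()
        for byte in digest:
            for bit in range(8):
                out.append(1 if (byte >> bit) & 1 else -1)
        ctr += 1
    return out[:length]


def score(coeffs, seq) -> float:
    """Magnitude-normalized correlation: sum(c * x) / sum(|c|)."""
    num = sum(c * x for c, x in zip(coeffs, seq))
    den = sum(abs(c) for c in coeffs) + 1e-9
    return num / den


ALPHA = 0.10   # same default strength as embed()
N = 20000      # assumption: larger than the module default, to shrink noise

rng = random.Random(0)
coeffs = [rng.gauss(0.0, 50.0) for _ in range(N)]  # synthetic stand-in for DCT values

right = mark_sequence(b"recipient-A", N)
wrong = mark_sequence(b"recipient-B", N)

# Embedding rule from embed(): c' = c + alpha * |c| * x
marked = [c + ALPHA * abs(c) * x for c, x in zip(coeffs, right)]

score_right = score(marked, right)      # close to ALPHA
score_wrong = score(marked, wrong)      # close to 0
score_unmarked = score(coeffs, right)   # close to 0
print(score_right, score_wrong, score_unmarked)
```

The correct mark scores near `alpha` because the embedded term contributes `alpha * sum(|c|)` to the numerator, while any other sequence contributes only zero-mean noise.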
oversight_core/formats/pdf.py +87 -0
@@ -0,0 +1,87 @@
+"""
+oversight_core.formats.pdf - PDF format adapter.
+
+Embeds mark_id in PDF document metadata (`/OversightMark` custom field, plus
+optional `/OversightIssuer` and `/OversightFileId`). Fast to read, easy to strip.
+
+Not implemented in this module: an invisible text watermark on every page
+(zero-width unicode in a hidden text object), which would survive metadata
+stripping but die on "print to new PDF".
+
+For strong cross-format survival, the recommended workflow is:
+ - Extract PDF text
+ - Apply L1/L2/L3 text watermarking to the extracted text
+ - Use that watermarked text as the PDF content
+
+But the PDF-native mark below gives a low-cost attribution layer that works
+without touching the visible content.
+
+Note: pypdf handles most modern PDFs. For legacy or encrypted PDFs you may
+need pdfrw, pdfminer, or qpdf.
+"""
+
+from __future__ import annotations
+
+import io
+from typing import Optional
+
+from pypdf import PdfReader, PdfWriter
+from pypdf.generic import NameObject, TextStringObject
+
+
+METADATA_KEY = "/OversightMark"
+
+
+def embed(
+ pdf_bytes: bytes,
+ mark_id: bytes,
+ issuer_id: Optional[str] = None,
+ file_id: Optional[str] = None,
+) -> bytes:
+ """
+ Embed mark_id in PDF metadata. Returns the modified PDF bytes.
+ """
+ reader = PdfReader(io.BytesIO(pdf_bytes))
+ writer = PdfWriter(clone_from=reader)
+
+ # Copy existing metadata then add ours
+ metadata = dict(reader.metadata or {})
+ metadata[NameObject(METADATA_KEY)] = TextStringObject(mark_id.hex())
+ if issuer_id:
+ metadata[NameObject("/OversightIssuer")] = TextStringObject(issuer_id)
+ if file_id:
+ metadata[NameObject("/OversightFileId")] = TextStringObject(file_id)
+
+ writer.add_metadata(metadata)
+
+ buf = io.BytesIO()
+ writer.write(buf)
+ return buf.getvalue()
+
+
+def extract(pdf_bytes: bytes) -> dict:
+ """
+ Extract OVERSIGHT marks from PDF metadata.
+ Returns {"mark_id": hex or None, "issuer_id": str or None, "file_id": str or None}.
+ """
+ reader = PdfReader(io.BytesIO(pdf_bytes))
+ meta = reader.metadata or {}
+ return {
+ "mark_id": meta.get(METADATA_KEY),
+ "issuer_id": meta.get("/OversightIssuer"),
+ "file_id": meta.get("/OversightFileId"),
+ }
+
+
+def extract_text_for_watermark_recovery(pdf_bytes: bytes) -> str:
+ """
+ Pull all text from a PDF for downstream L1/L2/L3 watermark recovery.
+ The text-layer watermarks applied by formats.text survive PDF embedding
+ provided the PDF creator preserves the characters (most do).
+ """
+ reader = PdfReader(io.BytesIO(pdf_bytes))
+ parts = []
+ for page in reader.pages:
+ try:
+ parts.append(page.extract_text() or "")
+ except Exception:
+ continue
+ return "\n".join(parts)
oversight_core/formats/text.py +61 -0
@@ -0,0 +1,61 @@
+"""
+oversight_core.formats.text - text format adapter.
+
+Wraps the three watermark layers:
+ L1 zero-width unicode (watermark.py)
+ L2 trailing whitespace (watermark.py)
+ L3 semantic (semantic.py)
+
+into a single apply/recover API.
+"""
+
+from __future__ import annotations
+
+from .. import watermark, semantic
+
+
+def apply(text: str, mark_id: bytes, layers: tuple[str, ...] = ("L1", "L2", "L3")) -> str:
+ """Apply all requested watermark layers to UTF-8 text."""
+ t = text
+ if "L1" in layers:
+ t = watermark.embed_zw(t, mark_id)
+ if "L2" in layers:
+ t = watermark.embed_ws(t, mark_id)
+ if "L3" in layers:
+ t = semantic.apply_semantic(t, mark_id)
+ return t
+
+
+def recover(text: str, candidate_mark_ids: list[bytes] | None = None) -> dict:
+ """
+ Recover attribution from text.
+
+ Returns:
+ {
+ "L1_hits": [mark_id_hex, ...],
+ "L2_hits": [mark_id_hex, ...],
+ "L3_matches": [{"mark_id": ..., "syn_score": ..., "punct_score": ...}, ...]
+ }
+
+ L3_matches lists only the candidates whose semantic check matched.
+
+ L1 and L2 recover the mark_id directly from invisible content.
+ L3 requires candidate_mark_ids (usually from the registry) to verify against.
+ """
+ out = {
+ "L1_hits": [m.hex() for m in watermark.extract_zw(text)],
+ "L2_hits": [],
+ "L3_matches": [],
+ }
+ ws = watermark.extract_ws(text)
+ if ws:
+ out["L2_hits"].append(ws.hex())
+
+ if candidate_mark_ids:
+ for cm in candidate_mark_ids:
+ result = semantic.verify_semantic(text, cm)
+ if result["overall_match"]:
+ out["L3_matches"].append({
+ "mark_id": cm.hex(),
+ "syn_score": result["synonyms_score"],
+ "punct_score": result["punctuation_score"],
+ })
+ return out
oversight_core/manifest.py +178 -0
@@ -0,0 +1,178 @@
+"""
+oversight_core.manifest
+======================
+
+The manifest is the signed metadata that binds a sealed file to its recipient,
+its watermarks, its beacons, and its policy. It's the artifact a registry stores
+and a verifier checks.
+
+Wire format (v1): canonical JSON (sorted keys, no whitespace), UTF-8, Ed25519-signed.
+Post-quantum: ML-DSA signature slot reserved in the envelope.
+"""
+
+from __future__ import annotations
+
+import json
+import time
+import uuid
+from dataclasses import dataclass, field, asdict
+from typing import Optional
+
+from .crypto import sign_manifest, verify_manifest, SUITE_CLASSIC_V1
+
+
+@dataclass
+class Recipient:
+ recipient_id: str # stable identifier (email hash, user UUID, etc.)
+ x25519_pub: str # hex
+ ed25519_pub: Optional[str] = None # hex, for verifying recipient acks
+
+
+@dataclass
+class WatermarkRef:
+ layer: str # 'L1_zero_width' | 'L2_whitespace' | 'L3_synonyms'
+ mark_id: str # hex
+
+
+@dataclass
+class Manifest:
+ # identifiers
+ file_id: str # uuid4
+ issued_at: int # unix seconds
+ version: str = "OVERSIGHT-v1"
+ suite: str = SUITE_CLASSIC_V1
+
+ # file properties
+ original_filename: str = ""
+ content_hash: str = "" # sha256 of plaintext
+ content_type: str = "application/octet-stream"
+ size_bytes: int = 0
+
+ # issuer (who sealed this)
+ issuer_id: str = ""
+ issuer_ed25519_pub: str = "" # hex - used to verify the signature
+
+ # recipient binding
+ recipient: Optional[Recipient] = None
+
+ # per-recipient marks + beacons
+ watermarks: list[WatermarkRef] = field(default_factory=list)
+ beacons: list[dict] = field(default_factory=list)
+
+ # policy
+ policy: dict = field(default_factory=dict)
+ # policy fields (opt):
+ # not_after: int (unix)
+ # not_before: int (unix)
+ # max_opens: int
+ # jurisdiction: str (e.g., "EU", "US", "GLOBAL")
+ # require_attestation: bool
+ # registry_url: str
+
+ # signature slot (filled in after canonical-serialize)
+ signature_ed25519: str = "" # hex
+ signature_ml_dsa: str = "" # hex, reserved for PQ
+
+ # ---- lifecycle ----
+
+ @classmethod
+ def new(
+ cls,
+ original_filename: str,
+ content_hash: str,
+ size_bytes: int,
+ issuer_id: str,
+ issuer_ed25519_pub_hex: str,
+ recipient: Recipient,
+ registry_url: str,
+ content_type: str = "application/octet-stream",
+ not_after: Optional[int] = None,
+ max_opens: Optional[int] = None,
+ jurisdiction: str = "GLOBAL",
+ ) -> "Manifest":
+ policy = {
+ "registry_url": registry_url,
+ "jurisdiction": jurisdiction,
+ }
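+ # Compare against None so that an explicit 0 is kept rather than dropped.
+ if not_after is not None:
+ policy["not_after"] = not_after
+ if max_opens is not None:
+ policy["max_opens"] = max_opens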
+
+ return cls(
+ file_id=str(uuid.uuid4()),
+ issued_at=int(time.time()),
+ original_filename=original_filename,
+ content_hash=content_hash,
+ content_type=content_type,
+ size_bytes=size_bytes,
+ issuer_id=issuer_id,
+ issuer_ed25519_pub=issuer_ed25519_pub_hex,
+ recipient=recipient,
+ policy=policy,
+ )
+
+ # ---- canonical serialization ----
+
+ def to_dict(self, include_signatures: bool = True) -> dict:
+ d = asdict(self)
+ if not include_signatures:
+ d["signature_ed25519"] = ""
+ d["signature_ml_dsa"] = ""
+ return d
+
+ @staticmethod
+ def _strip_none(obj):
+ """Recursively drop None values from dicts.
+
+ Canonical JSON for Oversight: omit null-valued fields rather than
+ emit `"field": null`. Matches the Rust reference's `serde(skip_serializing_if)`
+ and the broader industry convention (Sigstore et al.).
+ """
+ if isinstance(obj, dict):
+ return {k: Manifest._strip_none(v) for k, v in obj.items() if v is not None}
+ if isinstance(obj, list):
+ return [Manifest._strip_none(x) for x in obj]
+ return obj
+
+ def canonical_bytes(self) -> bytes:
+ """Canonical serialization excluding signatures (what we actually sign).
+
+ Rules:
+ - Exclude the two signature fields (replace with empty string sentinel).
+ - Drop None-valued fields recursively.
+ - Sort keys lexicographically.
+ - UTF-8 encoded, no whitespace.
+ """
+ d = self.to_dict(include_signatures=False)
+ d = self._strip_none(d)
+ return json.dumps(d, sort_keys=True, separators=(",", ":")).encode("utf-8")
+
+ def to_json(self) -> bytes:
+ d = self._strip_none(self.to_dict())
+ return json.dumps(d, sort_keys=True, separators=(",", ":")).encode("utf-8")
+
+ @classmethod
+ def from_json(cls, data: bytes) -> "Manifest":
+ d = json.loads(data.decode("utf-8"))
+ rec = d.pop("recipient", None)
+ wms = d.pop("watermarks", [])
+ m = cls(**d)
+ if rec:
+ m.recipient = Recipient(**rec)
+ m.watermarks = [WatermarkRef(**w) for w in wms]
+ return m
+
+ # ---- signing & verification ----
+
+ def sign(self, issuer_ed25519_priv: bytes) -> None:
+ sig = sign_manifest(self.canonical_bytes(), issuer_ed25519_priv)
+ self.signature_ed25519 = sig.hex()
+
+ def verify(self) -> bool:
+ if not self.signature_ed25519 or not self.issuer_ed25519_pub:
+ return False
+ return verify_manifest(
+ self.canonical_bytes(),
+ bytes.fromhex(self.signature_ed25519),
+ bytes.fromhex(self.issuer_ed25519_pub),
+ )
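The canonicalization rules used for signing (blank the signature slots, drop `None` fields recursively, sort keys, no whitespace) can be tried on a plain dict. This is a minimal sketch of those rules, assuming a small hand-written example manifest; the standalone helpers `strip_none` and `canonical_bytes` are illustrative stand-ins for the `Manifest` methods.

```python
import json


def strip_none(obj):
    """Recursively drop None-valued dict entries."""
    if isinstance(obj, dict):
        return {k: strip_none(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [strip_none(x) for x in obj]
    return obj


def canonical_bytes(manifest: dict) -> bytes:
    """Bytes that get signed: signature slots blanked, keys sorted, compact."""
    d = dict(manifest)
    d["signature_ed25519"] = ""
    d["signature_ml_dsa"] = ""
    return json.dumps(strip_none(d), sort_keys=True, separators=(",", ":")).encode("utf-8")


example = {
    "version": "OVERSIGHT-v1",
    "file_id": "f-001",
    "recipient": {"recipient_id": "r-42", "ed25519_pub": None},
    "signature_ed25519": "deadbeef",
    "signature_ml_dsa": "",
}

signed_over = canonical_bytes(example)
print(signed_over.decode("utf-8"))
# {"file_id":"f-001","recipient":{"recipient_id":"r-42"},"signature_ed25519":"","signature_ml_dsa":"","version":"OVERSIGHT-v1"}
```

Because the signature slots are blanked before serialization, the bytes are the same before and after `sign()` fills them in, which is what lets `verify()` recompute them.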
oversight_core/policy.py +170 -0
@@ -0,0 +1,170 @@
+"""
+oversight_core.policy
+====================
+
+Policy enforcement at open time.
+
+The manifest carries a `policy` dict with optional fields:
+ not_after : unix seconds; decryption refused after this time
+ not_before : unix seconds; decryption refused before this time (defer release)
+ max_opens : int; decryption refused after this many successful opens
+ jurisdiction : str; required jurisdiction profile (enforced against opener config)
+ require_attestation : bool; reserved for TEE integration
+ registry_url : str; used for open-counter increments
+
+Enforcement modes:
+ LOCAL_ONLY : policy_state is read/written in a local file (single-user, stub)
+ REGISTRY : policy_state kept in registry; increments require a network roundtrip
+ HYBRID : prefer registry; fall back to local if offline (with auditable note)
+
+The LOCAL_ONLY mode is not secure against a determined attacker who tampers with
+the state file. It exists for MVP plumbing. REGISTRY is the real answer.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Optional
+
+from .manifest import Manifest
+
+
+class PolicyViolation(Exception):
+ """Raised when a .sealed file's policy forbids the attempted open."""
+
+
+@dataclass
+class PolicyContext:
+ """State the opener needs to enforce policy. Typically constructed from env/config."""
+ jurisdiction: str = "GLOBAL"
+ state_dir: Optional[Path] = None # for LOCAL_ONLY open-counter persistence
+ registry_url: Optional[str] = None # for REGISTRY mode
+ mode: str = "LOCAL_ONLY" # LOCAL_ONLY | REGISTRY | HYBRID
+
+ def __post_init__(self):
+ if self.state_dir:
+ self.state_dir = Path(self.state_dir)
+ self.state_dir.mkdir(parents=True, exist_ok=True)
+
+
+def _local_counter_path(ctx: PolicyContext, file_id: str) -> Path:
+ if ctx.state_dir is None:
+ raise ValueError("PolicyContext.state_dir is required for LOCAL_ONLY mode")
+ # file_id is a UUID string - defense against path traversal:
+ if "/" in file_id or "\\" in file_id or ".." in file_id:
+ raise ValueError(f"invalid file_id for counter filename: {file_id!r}")
+ return ctx.state_dir / f"{file_id}.opens.json"
+
+
+def _local_read_count(ctx: PolicyContext, file_id: str) -> int:
+ p = _local_counter_path(ctx, file_id)
+ if not p.exists():
+ return 0
+ try:
+ return int(json.loads(p.read_text()).get("count", 0))
+ except (OSError, ValueError, TypeError, AttributeError):
+ return 0
+
+
+def _local_check_and_bump(ctx: PolicyContext, file_id: str, max_opens: int) -> int:
+ """
+ Atomically: check count < max_opens AND bump. Uses an OS file lock
+ on a sidecar .lock file to serialize concurrent openers of the same file,
+ plus write-to-temp-then-rename for crash-consistency.
+ Raises PolicyViolation if max_opens reached.
+ Returns the new count.
+ """
+ import fcntl # POSIX only; Windows would need msvcrt.locking.
+ import tempfile
+
+ p = _local_counter_path(ctx, file_id)
+ lock_path = p.with_suffix(".lock")
+ # Open/create lock file, acquire exclusive lock for the critical section.
+ with open(lock_path, "a+") as lf:
+ fcntl.flock(lf.fileno(), fcntl.LOCK_EX)
+ try:
+ cur = _local_read_count(ctx, file_id)
+ if cur >= max_opens:
+ raise PolicyViolation(
+ f"Open limit reached: max_opens={max_opens}, already opened {cur} times"
+ )
+ new_count = cur + 1
+ # Atomic write: write to a temp file in the same directory, then rename.
+ fd, tmp = tempfile.mkstemp(
+ prefix=f".{file_id}.opens.",
+ suffix=".tmp",
+ dir=str(ctx.state_dir),
+ )
+ try:
+ with os.fdopen(fd, "w") as f:
+ json.dump({"count": new_count, "last": int(time.time())}, f)
+ f.flush()
+ os.fsync(f.fileno())
+ os.replace(tmp, p)
+ except Exception:
+ # Clean up temp if rename failed
+ try:
+ os.unlink(tmp)
+ except OSError:
+ pass
+ raise
+ return new_count
+ finally:
+ fcntl.flock(lf.fileno(), fcntl.LOCK_UN)
+
+
+def check_policy(manifest: Manifest, ctx: Optional[PolicyContext] = None) -> None:
+ """
+ Raise PolicyViolation if the manifest's policy forbids the current open.
+ Called BEFORE decryption to fail-fast.
+
+ Note: open-counter enforcement is SKIPPED here and done atomically in
+ record_open to prevent TOCTOU races. check_policy only does cheap
+ read-only checks (time, jurisdiction).
+ """
+ policy = manifest.policy or {}
+ now = int(time.time())
+
+ na = policy.get("not_after")
+ if na is not None and now > int(na):
+ raise PolicyViolation(
+ f"File expired: not_after={na}, now={now} "
+ f"({(now - int(na))//3600}h ago)"
+ )
+ nb = policy.get("not_before")
+ if nb is not None and now < int(nb):
+ raise PolicyViolation(
+ f"File not yet released: not_before={nb}, now={now} "
+ f"(available in {(int(nb) - now)//60}m)"
+ )
+
+ required = policy.get("jurisdiction")
+ if required and required != "GLOBAL" and ctx is not None:
+ if required != ctx.jurisdiction:
+ raise PolicyViolation(
+ f"Jurisdiction mismatch: file requires '{required}', "
+ f"opener is in '{ctx.jurisdiction}'"
+ )
+
+ # max_opens is enforced atomically in record_open, not here.
+
+
+def record_open(manifest: Manifest, ctx: Optional[PolicyContext]) -> int:
+ """
+ Atomically check-and-bump the open counter (if policy has max_opens).
+ Raises PolicyViolation if the limit is exceeded. Returns new count.
+ """
+ if ctx is None:
+ return 0
+ policy = manifest.policy or {}
+ mx = policy.get("max_opens")
+ if mx is None:
+ return 0
+ if ctx.mode == "LOCAL_ONLY":
+ return _local_check_and_bump(ctx, manifest.file_id, int(mx))
+ # REGISTRY/HYBRID: the registry round-trip (POST /policy/open) is left to
+ # the caller for now; the local counter is still bumped here.
+ return _local_check_and_bump(ctx, manifest.file_id, int(mx))
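The write-to-temp-then-rename step used by the local counter can be shown in isolation. The sketch below is a simplified, portable version that deliberately omits the file lock, so it is only safe for a single opener; `bump_counter` is a hypothetical name, and `PermissionError` stands in for `PolicyViolation`.

```python
import json
import os
import tempfile
from pathlib import Path


def bump_counter(state_dir: Path, file_id: str, max_opens: int) -> int:
    """Check the limit, then persist count+1 via temp file + atomic rename."""
    path = state_dir / f"{file_id}.opens.json"
    current = json.loads(path.read_text())["count"] if path.exists() else 0
    if current >= max_opens:
        raise PermissionError(f"open limit reached ({current}/{max_opens})")
    # Write to a temp file in the same directory so os.replace stays on one
    # filesystem and swaps the file in atomically.
    fd, tmp = tempfile.mkstemp(dir=str(state_dir), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"count": current + 1}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
    return current + 1


with tempfile.TemporaryDirectory() as d:
    state = Path(d)
    counts = [bump_counter(state, "demo", 2) for _ in range(2)]
    try:
        bump_counter(state, "demo", 2)
        refused = False
    except PermissionError:
        refused = True

print(counts, refused)
```

A reader of the counter file sees either the old contents or the new contents, never a partial write. Serializing concurrent openers still needs the lock that `_local_check_and_bump` takes around the whole check-and-write sequence.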
oversight_core/rekor.py +413 -0
@@ -0,0 +1,413 @@
+"""
+oversight_core.rekor
+====================
+
+Sigstore Rekor v2 integration (v0.5).
+
+Builds DSSE envelopes wrapping in-toto Statements that describe Oversight
+mark registrations, uploads them to a Rekor v2 log, and verifies inclusion
+proofs returned by the log.
+
+Key facts (verified 2026-04-19 against current upstream):
+ * Rekor v2 GA'd 2025-10-10 (tile-backed transparency log).
+ * Only entry types accepted: ``hashedrekord`` and ``dsse``.
+ * Single write endpoint: ``POST {log_url}/api/v2/log/entries``.
+ * Inclusion proofs are returned in the write response. There is no online
+ proof-by-index API; verifiers compute proofs from tiles when they need to
+ re-derive one.
+ * Public log URL pattern: ``https://logYEAR-N.rekor.sigstore.dev``. Shards
+ rotate roughly every 6 months. Never hardcode beyond a default.
+
+This module deliberately does NOT depend on ``sigstore-python`` so the issuer's
+runtime dependency footprint stays small. Auditors verify with stock
+``sigstore-python`` via :mod:`oversight_core.auditor_helper` (separate file).
+"""
+from __future__ import annotations
+
+import base64
+import json
+import time
+import urllib.error
+import urllib.request
+from dataclasses import dataclass, field
+from typing import Any, Optional
+
+from cryptography.hazmat.primitives.asymmetric.ed25519 import (
+ Ed25519PrivateKey,
+ Ed25519PublicKey,
+)
+from cryptography.exceptions import InvalidSignature
+
+
+# ---- constants ----------------------------------------------------------
+
+DSSE_PAYLOAD_TYPE = "application/vnd.in-toto+json"
+STATEMENT_TYPE = "https://in-toto.io/Statement/v1"
+# Pinned to a git-tagged content address so a 2031 verifier can resolve it
+# even if oversight.dev DNS is squatted or expired. Tag bumps when the
+# predicate body changes incompatibly.
+PREDICATE_TYPE = (
+ "https://github.com/oversight-protocol/oversight/blob/v0.5.0/"
+ "docs/predicates/registration-v1.md"
+)
+PREDICATE_VERSION = 1
+
+DEFAULT_REKOR_URL = "https://log2025-1.rekor.sigstore.dev"
+TLOG_KIND = "rekor-v2-dsse"
+LEGACY_TLOG_KIND = "oversight-self-merkle-v1"
+BUNDLE_SCHEMA = 2 # bundles produced by v0.5+ tag schema=2; v0.4 was implicit 1
+
+REKOR_WRITE_TIMEOUT_SEC = 25 # spec says >=20s
+
+
+# ---- data classes -------------------------------------------------------
+
+
+@dataclass
+class OversightRegistrationPredicate:
+ """Predicate body for an Oversight mark registration.
+
+ Privacy: the on-log predicate carries a SHA-256 hash of the recipient
+ public key, never the raw key. The raw key stays in the local ``.sealed``
+ bundle. This prevents anyone watching the public log from enumerating
+ recipients by pubkey or correlating multiple marks to the same recipient
+ across issuers. ``recipient_id`` is also expected to be an opaque hash
+ or UUID, not an email; the predicate does not validate this, so a caller
+ that passes raw PII will publish it to the log unchanged.
+ """
+
+ file_id: str
+ issuer_pubkey_ed25519: str # hex
+ recipient_id: str # opaque identifier; SHOULD be a hash, not raw email
+ recipient_pubkey_sha256: str # hex of sha256(recipient_x25519_pub_raw_bytes)
+ suite: str
+ registered_at: str # ISO 8601 UTC
+ rfc3161_tsa: Optional[str] = None
+ rfc3161_token_b64: Optional[str] = None
+ rfc3161_chain_b64: Optional[str] = None # full TSA cert chain (concatenated PEM)
+ policy: dict = field(default_factory=dict)
+ watermarks: dict = field(default_factory=dict)
+
+ def to_dict(self) -> dict:
+ d = {
+ "predicate_version": PREDICATE_VERSION,
+ "file_id": self.file_id,
+ "issuer_pubkey_ed25519": self.issuer_pubkey_ed25519,
+ "recipient_id": self.recipient_id,
+ "recipient_pubkey_sha256": self.recipient_pubkey_sha256,
+ "suite": self.suite,
+ "registered_at": self.registered_at,
+ "policy": self.policy,
+ "watermarks": self.watermarks,
+ }
+ if self.rfc3161_tsa:
+ d["rfc3161_tsa"] = self.rfc3161_tsa
+ if self.rfc3161_token_b64:
+ d["rfc3161_token_b64"] = self.rfc3161_token_b64
+ if self.rfc3161_chain_b64:
+ d["rfc3161_chain_b64"] = self.rfc3161_chain_b64
+ return d
+
+
+def hash_recipient_pubkey(x25519_pub_hex: str) -> str:
+ """Convenience: compute the recipient_pubkey_sha256 from a hex X25519 key.
+
+ Issuers should call this rather than passing the raw pubkey into the
+ predicate constructor, to avoid accidentally publishing it to Rekor.
+ """
+ import hashlib
+ raw = bytes.fromhex(x25519_pub_hex)
+ return hashlib.sha256(raw).hexdigest()
+
+
+@dataclass
+class DSSEEnvelope:
+ payload_b64: str
+ payload_type: str
+ signatures: list[dict] # [{"sig": "<b64>", "keyid": "<hex>"}, ...]
+
+ def to_json(self) -> str:
+ return json.dumps(
+ {
+ "payload": self.payload_b64,
+ "payloadType": self.payload_type,
+ "signatures": self.signatures,
+ },
+ sort_keys=True,
+ separators=(",", ":"),
+ )
+
+ @classmethod
+ def from_json(cls, raw: str) -> "DSSEEnvelope":
+ d = json.loads(raw)
+ return cls(
+ payload_b64=d["payload"],
+ payload_type=d["payloadType"],
+ signatures=d["signatures"],
+ )
+
+
+# ---- statement / envelope construction ---------------------------------
+
+
+def build_statement(
+ mark_id_hex: str,
+ content_hash_sha256_hex: str,
+ predicate: OversightRegistrationPredicate,
+) -> dict:
+ """Assemble the in-toto v1 Statement for an Oversight registration.
+
+ The subject's ``digest`` carries the plaintext sha256, so any auditor
+ who can hash the leaked text can find matching registrations by digest.
+ The subject ``name`` carries the mark_id so attribution chains can index
+ by either.
+ """
+ return {
+ "_type": STATEMENT_TYPE,
+ "subject": [
+ {
+ "name": f"mark:{mark_id_hex}",
+ "digest": {"sha256": content_hash_sha256_hex},
+ }
+ ],
+ "predicateType": PREDICATE_TYPE,
+ "predicate": predicate.to_dict(),
+ }
+
+
+def _pae(payload_type: str, payload: bytes) -> bytes:
+ """DSSE Pre-Authentication Encoding (PAEv1).
+
+ PAE = "DSSEv1" SP <len(type)> SP <type> SP <len(payload)> SP <payload>
+
+ Lengths are byte counts in ASCII decimal; the type is UTF-8 encoded.
+ """
+ type_bytes = payload_type.encode("utf-8")
+ return (
+ b"DSSEv1 "
+ + str(len(type_bytes)).encode("ascii")
+ + b" "
+ + type_bytes
+ + b" "
+ + str(len(payload)).encode("ascii")
+ + b" "
+ + payload
+ )
+
+
+def sign_dsse(
+ statement: dict,
+ issuer_ed25519_priv: bytes,
+ keyid: str = "",
+) -> DSSEEnvelope:
+ """Sign a Statement, returning a DSSE envelope.
+
+ ``keyid`` is opaque per spec; convention is the hex SHA-256 of the public
+ key. Empty string is allowed and used in tests.
+ """
+ payload = json.dumps(statement, sort_keys=True, separators=(",", ":")).encode("utf-8")
+ payload_b64 = base64.b64encode(payload).decode("ascii")
+ pae = _pae(DSSE_PAYLOAD_TYPE, payload)
+ sk = Ed25519PrivateKey.from_private_bytes(issuer_ed25519_priv)
+ sig = sk.sign(pae)
+ return DSSEEnvelope(
+ payload_b64=payload_b64,
+ payload_type=DSSE_PAYLOAD_TYPE,
+ signatures=[{"sig": base64.b64encode(sig).decode("ascii"), "keyid": keyid}],
+ )
+
+
+def verify_dsse(envelope: DSSEEnvelope, issuer_ed25519_pub: bytes) -> bool:
+ """Verify the envelope against ``issuer_ed25519_pub``.
+
+ DSSE supports multiple signatures; for Oversight v0.5 only the issuer
+ signs, so we accept the first signature that verifies. A malformed
+ payload, signature, or public key returns False rather than raising.
+ """
+ try:
+ payload = base64.b64decode(envelope.payload_b64)
+ pae = _pae(envelope.payload_type, payload)
+ pk = Ed25519PublicKey.from_public_bytes(issuer_ed25519_pub)
+ except Exception:
+ return False
+ for sig_obj in envelope.signatures:
+ try:
+ sig = base64.b64decode(sig_obj["sig"])
+ pk.verify(sig, pae)
+ return True
+ except (InvalidSignature, KeyError, TypeError, ValueError):
+ continue
+ return False
+
+
+def envelope_payload_statement(envelope: DSSEEnvelope) -> dict:
+ return json.loads(base64.b64decode(envelope.payload_b64))
+
+
+# ---- network: upload ----------------------------------------------------
+
+
+@dataclass
+class RekorUploadResult:
+ log_url: str
+ log_index: Optional[int]
+ log_id: Optional[str]
+ integrated_time: Optional[int]
+ transparency_log_entry: dict # raw response body, persisted in bundle
+ log_pubkey_pem: Optional[str] = None # captured at write time
+ checkpoint: Optional[str] = None # signed tree-head note; promoted out of the protobuf
+
+ def to_bundle_dict(self) -> dict:
+ """Shape that Oversight bundles embed under ``rekor`` key.
+
+ Always includes the four 5-year-replay fields the desktop reviewer
+ flagged: ``log_pubkey_pem``, ``checkpoint``, ``log_entry_schema``, and
+ the raw ``transparency_log_entry`` blob. A 2031 verifier can ignore
+ TUF entirely and verify directly from these fields.
+ """
+ return {
+ "log_url": self.log_url,
+ "log_index": self.log_index,
+ "log_id": self.log_id,
+ "integrated_time": self.integrated_time,
+ "log_pubkey_pem": self.log_pubkey_pem,
+ "checkpoint": self.checkpoint,
+ "log_entry_schema": "rekor/v1.TransparencyLogEntry",
+ "transparency_log_entry": self.transparency_log_entry,
+ }
+
+
+def build_bundle(
+    manifest_dict: dict,
+    manifest_sig_hex: str,
+    upload: "RekorUploadResult",
+    dsse_envelope: "DSSEEnvelope",
+    rfc3161_token_b64: Optional[str] = None,
+    rfc3161_chain_b64: Optional[str] = None,
+) -> dict:
+    """Assemble the v0.5 evidence bundle.
+
+    The integer ``bundle_schema`` field lets pre-v0.5 verifiers fail fast
+    on ``unknown schema, upgrade`` rather than silently mis-routing because
+    ``tlog_kind`` happened to default the wrong way.
+    """
+    bundle = {
+        "bundle_schema": BUNDLE_SCHEMA,
+        "tlog_kind": TLOG_KIND,
+        "manifest": manifest_dict,
+        "manifest_sig": manifest_sig_hex,
+        "rekor": upload.to_bundle_dict(),
+        "dsse_envelope": json.loads(dsse_envelope.to_json()),
+    }
+    if rfc3161_token_b64:
+        bundle["rfc3161_token"] = rfc3161_token_b64
+    if rfc3161_chain_b64:
+        bundle["rfc3161_chain"] = rfc3161_chain_b64
+    return bundle
+
+
+def upload_dsse(
+    envelope: DSSEEnvelope,
+    issuer_ed25519_pub_pem: str,
+    log_url: str = DEFAULT_REKOR_URL,
+    timeout: float = REKOR_WRITE_TIMEOUT_SEC,
+) -> RekorUploadResult:
+    """POST a DSSE envelope to Rekor v2.
+
+    ``issuer_ed25519_pub_pem`` is the issuer's verification key in PEM.
+    Rekor v2 self-managed-key submissions require a verifier key alongside
+    the envelope so the log can sanity-check that the envelope is verifiable
+    before accepting it.
+
+    Network errors raise; callers decide whether to retry or fall back to
+    the local tlog (only acceptable for development, not production).
+    """
+    body = json.dumps(
+        {
+            "dsseRequestV002": {
+                "envelope": json.loads(envelope.to_json()),
+                "verifier": {
+                    "publicKey": {
+                        "rawBytes": base64.b64encode(
+                            issuer_ed25519_pub_pem.encode("utf-8")
+                        ).decode("ascii"),
+                        "keyDetails": "PKIX_ED25519",
+                    }
+                },
+            }
+        }
+    ).encode("utf-8")
+    req = urllib.request.Request(
+        url=log_url.rstrip("/") + "/api/v2/log/entries",
+        data=body,
+        method="POST",
+        headers={
+            "Content-Type": "application/json",
+            "Accept": "application/json",
+            "User-Agent": "oversight-protocol/0.5 (+https://github.com/oversight-protocol)",
+        },
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            raw = resp.read().decode("utf-8")
+    except urllib.error.HTTPError as e:  # surface response body on failure
+        detail = ""
+        try:
+            detail = e.read().decode("utf-8", errors="replace")[:500]
+        except Exception:
+            pass
+        raise RuntimeError(f"rekor v2 upload failed: HTTP {e.code} {detail}") from e
+    parsed = json.loads(raw)
+    return RekorUploadResult(
+        log_url=log_url,
+        log_index=_first_int(parsed, ["logIndex", "logEntry", "log_index"]),
+        log_id=_first_str(parsed, ["logID", "logId", "log_id"]),
+        integrated_time=_first_int(parsed, ["integratedTime", "integrated_time"]),
+        transparency_log_entry=parsed,
+    )
+
+
+def _first_int(d: dict, keys: list[str]) -> Optional[int]:
+    for k in keys:
+        if k in d:
+            try:
+                return int(d[k])
+            except (TypeError, ValueError):
+                continue
+    return None
+
+
+def _first_str(d: dict, keys: list[str]) -> Optional[str]:
+    for k in keys:
+        if k in d and isinstance(d[k], str):
+            return d[k]
+    return None
+
+
+# ---- offline verification helpers --------------------------------------
+
+
+def verify_inclusion_offline(
+    bundle_rekor_field: dict,
+    envelope: DSSEEnvelope,
+    issuer_ed25519_pub: bytes,
+) -> tuple[bool, str]:
+    """Structurally check a bundled Rekor entry without contacting the log.
+
+    Checks (in order):
+      1. The DSSE envelope verifies under ``issuer_ed25519_pub``.
+      2. The bundled ``transparency_log_entry`` is a non-empty dict.
+      3. It carries one of the structural keys the tile-backed log returns
+         (``inclusionProof``, ``inclusion_proof``, or ``logEntry``).
+
+    This helper does not compare the payload's subject digest against the
+    bundle, and it does not recompute the inclusion proof. Full recomputation
+    requires fetching tiles; that lives in
+    :mod:`oversight_core.auditor_helper`, which uses ``sigstore-python``.
+    Returns ``(ok, reason)``.
+    """
+    if not verify_dsse(envelope, issuer_ed25519_pub):
+        return False, "dsse signature did not verify under issuer pubkey"
+    tle = bundle_rekor_field.get("transparency_log_entry") or {}
+    if not isinstance(tle, dict) or not tle:
+        return False, "bundle missing transparency_log_entry payload"
+    has_proof = any(
+        k in tle for k in ("inclusionProof", "inclusion_proof", "logEntry")
+    )
+    if not has_proof:
+        return False, "transparency_log_entry has no inclusion proof or logEntry shape"
+    return True, "ok"
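
Reviewer note on `oversight_core/rekor.py`: `verify_dsse` checks each signature over `_pae(envelope.payload_type, payload)`, and the body of `_pae` is not shown in this part of the diff. As a rough illustration of what DSSE pre-authentication encoding computes, here is a minimal standalone sketch. The name `pae_sketch` is hypothetical and is not the module's `_pae`; it assumes the DSSE v1 layout of the literal `DSSEv1`, the byte length of the type, the type, the byte length of the payload, and the payload, joined by single spaces.

```python
def pae_sketch(payload_type: str, payload: bytes) -> bytes:
    """Hypothetical stand-in for the module's _pae; DSSE v1 layout assumed."""
    type_bytes = payload_type.encode("utf-8")
    # Lengths are ASCII decimal byte counts, not character counts.
    return b"DSSEv1 %d %s %d %s" % (
        len(type_bytes), type_bytes, len(payload), payload,
    )


# Illustrative inputs only; the real payload type comes from the envelope.
encoded = pae_sketch("http://example.com/HelloWorld", b"hello world")
print(encoded)
```

Because the signed bytes cover the payload type and both lengths, a verifier that rebuilds the same encoding should reject an envelope whose type or payload was altered after signing.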
oversight_core/semantic.py +496 -0
@@ -0,0 +1,496 @@
+"""
+oversight_core.semantic
+======================
+
+L3 semantic watermarking - the airgap-strip survivor.
+
+Unlike L1 (zero-width unicode) and L2 (whitespace) which die the moment an
+attacker runs a normalization pass, semantic marks are encoded in the *choice
+of words* themselves. An attacker who opens the file in an airgapped VM and
+strips invisible characters still has the watermark, because the words ARE
+the watermark.
+
+This module describes three techniques. T1 and T2 are implemented below;
+T3 is documented as a design and has no implementation in this file yet:
+
+  T1 - Synonym-class rotation
+    For each synonym class (e.g., {begin, start, commence}), the choice made
+    in each instance encodes bits of the mark_id. The attacker cannot tell
+    whether "begin" or "start" was the original without access to the source,
+    so stripping requires paraphrasing every candidate word - which damages
+    the document and still doesn't defeat the mark if redundancy is high.
+
+  T2 - Punctuation-style fingerprint
+    Deterministic per-recipient choices of:
+      - Oxford comma (on/off) at each list
+      - Em dash vs double hyphen in parenthetical breaks
+      - Straight vs curly quotes
+    These survive copy-paste. They survive OCR (which usually preserves the
+    glyph). They can be reliably extracted from any plaintext copy.
+
+  T3 - Sentence-level structural marks
+    For lists/enumerations, the ordering of items (when semantically
+    neutral) encodes bits. For sentences, the choice of
+    active-vs-passive voice in N eligible sentences encodes bits.
+
+All three survive Unicode normalization, invisible-char stripping, whitespace
+normalization, format conversion, and most OCR passes.
+
+They do NOT survive aggressive manual paraphrasing by a human. That's the
+fundamental limit of semantic watermarking: you cannot defend against
+rewriting in someone else's words. You CAN make automated stripping
+computationally expensive and attributable.
+
+Bit capacity notes:
+  T1: ~log2(variants_per_class) bits per insertion point (~1.58 for a
+      3-variant class), ~15-40 bits per page
+  T2: ~3-5 bits per page (Oxford comma + dashes + quotes)
+  T3: 1 bit per re-orderable list, 1 bit per voice-eligible sentence
+
+Total realistic capacity: 30-80 bits per page of normal prose.
+A 64-bit mark ID needs about one page of text to encode redundantly.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import re
+from typing import Optional
+
+
+# ------------------------------------------------------------------
+# T1 - Synonym-class rotation (v2: 150 classes, URL/code skip, POS-aware)
+# ------------------------------------------------------------------
+
+# Import the v2 dictionary. Fall back to v1 in-module classes if import fails.
+try:
+    from .synonyms_v2 import (
+        ALL_CLASSES as _V2_CLASSES,
+        iter_matchable_words,
+        SYNONYM_COUNT as _V2_COUNT,
+    )
+    SYNONYMS_V2_AVAILABLE = True
+except ImportError:
+    SYNONYMS_V2_AVAILABLE = False
+
+
+# Legacy v1 table (kept for backward compatibility with files sealed before v0.2.1)
+SYNONYM_CLASSES = [
+ ("begin", "start", "commence"),
+ ("large", "big", "substantial"),
+ ("fast", "quick", "rapid"),
+ ("show", "display", "present"),
+ ("use", "utilize", "employ"),
+ ("help", "assist", "aid"),
+ ("make", "create", "produce"),
+ ("get", "obtain", "acquire"),
+ ("find", "locate", "identify"),
+ ("tell", "inform", "notify"),
+ ("give", "provide", "supply"),
+ ("end", "finish", "conclude"),
+ ("small", "tiny", "minor"),
+ ("slow", "gradual", "deliberate"),
+ ("important", "critical", "significant"),
+ ("hard", "difficult", "challenging"),
+ ("easy", "simple", "straightforward"),
+ ("problem", "issue", "concern"),
+ ("answer", "response", "reply"),
+ ("question", "query", "inquiry"),
+ ("idea", "concept", "notion"),
+ ("plan", "strategy", "approach"),
+ ("result", "outcome", "consequence"),
+ ("however", "nevertheless", "nonetheless"),
+ ("therefore", "consequently", "thus"),
+ ("also", "additionally", "furthermore"),
+ ("but", "yet", "though"),
+]
+
+
+def _build_synonym_lookup() -> dict[str, tuple[int, int]]:
+    """v1 legacy lookup used when the caller explicitly asks for v1."""
+    lookup: dict[str, tuple[int, int]] = {}
+    for ci, cls in enumerate(SYNONYM_CLASSES):
+        for vi, word in enumerate(cls):
+            lookup[word.lower()] = (ci, vi)
+    return lookup
+
+
+SYNONYM_LOOKUP = _build_synonym_lookup()
+
+
+def _bits_of(data: bytes) -> list[int]:
+    out = []
+    for byte in data:
+        for i in range(8):
+            out.append((byte >> (7 - i)) & 1)
+    return out
+
+
+def _bytes_from_bits(bits: list[int]) -> bytes:
+    n = (len(bits) // 8) * 8
+    out = bytearray()
+    for i in range(0, n, 8):
+        b = 0
+        for j in range(8):
+            b = (b << 1) | (bits[i + j] & 1)
+        out.append(b)
+    return bytes(out)
+
+
+def _mark_id_to_variant_sequence(
+    mark_id: bytes, n_instances: int, class_size: int = 3
+) -> list[int]:
+    """
+    Derive a deterministic sequence of variant indices from mark_id.
+    Expands the mark with counter-mode SHA-256 over (mark_id || counter);
+    this is a simple hash expansion, not HKDF proper.
+    Each variant index is in [0, class_size).
+    """
+    out: list[int] = []
+    ctr = 0
+    while len(out) < n_instances:
+        h = hashlib.sha256(mark_id + ctr.to_bytes(4, "big")).digest()
+        for byte in h:
+            # byte % class_size is slightly biased toward low indices when
+            # class_size does not divide 256; uniform enough for our purposes
+            out.append(byte % class_size)
+            if len(out) >= n_instances:
+                break
+        ctr += 1
+    return out
+
+
+def _case_preserve(replacement: str, original: str) -> str:
+    """Match capitalization pattern: Title, UPPER, or lower."""
+    if original.isupper():
+        return replacement.upper()
+    if original[:1].isupper():
+        return replacement[:1].upper() + replacement[1:]
+    return replacement.lower()
+
+
+_WORD_RE = re.compile(r"\b([A-Za-z]+)\b")
+
+# Zero-width chars that L1 watermarking inserts. Strip these before semantic
+# extraction so that synonym words aren't fragmented.
+_ZW_CHARS = "\u200b\u200c\u200d\ufeff"
+
+
+def _strip_zw(text: str) -> str:
+    for ch in _ZW_CHARS:
+        text = text.replace(ch, "")
+    return text
+
+
+def embed_synonyms(text: str, mark_id: bytes, min_instances: int = 8) -> str:
+    """
+    Walk the text, and at every word that is a member of a known synonym class,
+    replace it with the class variant indicated by the mark_id-derived sequence.
+
+    If the text has fewer than `min_instances` synonym-class hits, the function
+    returns the text unchanged and logs to stderr (no silent partial marks).
+
+    Note: best applied BEFORE L1 zero-width marks. If you apply it after L1,
+    the word-boundary regex may miss synonym words fragmented by ZW chars
+    (and we don't transparently strip ZW during embedding because we don't
+    want to destroy the L1 marks).
+    """
+    # First pass: find all match positions as
+    # (start, end, class_index, orig_variant_index, original_word)
+    matches: list[tuple[int, int, int, int, str]] = []
+    for m in _WORD_RE.finditer(text):
+        w = m.group(1)
+        key = w.lower()
+        if key in SYNONYM_LOOKUP:
+            ci, vi = SYNONYM_LOOKUP[key]
+            matches.append((m.start(), m.end(), ci, vi, w))
+
+    if len(matches) < min_instances:
+        # Not enough material to watermark. Return unchanged.
+        import sys
+        print(
+            f"[semantic] warning: only {len(matches)} synonym-class hits "
+            f"(need {min_instances}); skipping L3",
+            file=sys.stderr,
+        )
+        return text
+
+    # Derive a deterministic variant choice per match
+    variants = _mark_id_to_variant_sequence(mark_id, len(matches), class_size=3)
+
+    # Rewrite text with chosen variants, preserving case
+    out: list[str] = []
+    cursor = 0
+    for (start, end, ci, _orig_vi, orig_word), target_vi in zip(matches, variants):
+        cls = SYNONYM_CLASSES[ci]
+        # Bound: some classes may have fewer than 3 variants
+        target_vi = target_vi % len(cls)
+        replacement = _case_preserve(cls[target_vi], orig_word)
+        out.append(text[cursor:start])
+        out.append(replacement)
+        cursor = end
+    out.append(text[cursor:])
+    return "".join(out)
+
+
+def extract_synonyms_candidate(text: str, mark_len_bytes: int = 8) -> list[bytes]:
+    """
+    Return a fingerprint of the synonym choices observed in the text.
+
+    We don't know the original text, so we can't directly recover bits of
+    the mark_id. The eventual design checks candidate mark_ids (usually
+    from the registry) by:
+      1. Computing the expected variant sequence for each candidate
+      2. Checking how many positions match the text's actual variants
+    and keeping the candidates that match above a threshold; see
+    ``verify_synonyms_match`` for the per-candidate check.
+
+    For the MVP, this function instead returns a single-element list holding
+    a *fingerprint* of the observed variant choices; the registry can match
+    fingerprints against stored ones. ``mark_len_bytes`` is currently unused.
+    Returns ``[]`` when the text contains no synonym-class words.
+    """
+    # Fingerprint = SHA-256 over the sequence of (class_index, variant_index) tuples
+    seq = []
+    for m in _WORD_RE.finditer(text):
+        key = m.group(1).lower()
+        if key in SYNONYM_LOOKUP:
+            seq.append(SYNONYM_LOOKUP[key])
+    if not seq:
+        return []
+    fp = hashlib.sha256(repr(seq).encode()).digest()
+    return [fp]
+
+
+def verify_synonyms_match(
+    text: str, candidate_mark_id: bytes, threshold: float = 0.70
+) -> tuple[bool, float]:
+    """
+    Given a candidate mark_id, compute what variant sequence it would have
+    produced, and compare to the text's actual variant sequence.
+
+    Returns (match, score). Score is fraction of matching variants.
+    Threshold 0.70 tolerates some paraphrasing while still attributing.
+
+    Automatically strips zero-width unicode (L1 watermark residue) before
+    matching, so semantic verification works whether or not L1 was applied
+    and whether or not an attacker has stripped invisibles.
+    """
+    text = _strip_zw(text)
+    actual: list[tuple[int, int]] = []
+    for m in _WORD_RE.finditer(text):
+        key = m.group(1).lower()
+        if key in SYNONYM_LOOKUP:
+            actual.append(SYNONYM_LOOKUP[key])
+
+    if not actual:
+        return False, 0.0
+
+    expected_variants = _mark_id_to_variant_sequence(candidate_mark_id, len(actual), 3)
+    matches = 0
+    counted = 0
+    for (ci, actual_vi), expected_vi in zip(actual, expected_variants):
+        cls = SYNONYM_CLASSES[ci]
+        counted += 1
+        if (expected_vi % len(cls)) == actual_vi:
+            matches += 1
+
+    score = matches / counted if counted else 0.0
+    return (score >= threshold), score
+
+
+# ------------------------------------------------------------------
+# T2 - Punctuation-style fingerprint
+# ------------------------------------------------------------------
+
+# Bits we can set/read:
+#   bit 0: Oxford comma in 3+ item lists (1 = present, 0 = absent)
+#   bit 1: em dash (U+2014) vs double-hyphen (--) for parentheticals
+#   bit 2: curly quotes (\u201c \u201d) vs straight quotes (")
+#   bit 3: spaced vs tight em dash (reserved; not embedded or read below)
+
+def _bit_for(mark_id: bytes, bit_index: int) -> int:
+    """Deterministic bit selector from mark_id."""
+    byte = mark_id[bit_index % len(mark_id)]
+    return (byte >> (bit_index % 8)) & 1
+
+
+def embed_punctuation(text: str, mark_id: bytes) -> str:
+    """
+    Apply punctuation-style marks to text deterministically.
+
+    Idempotent: running twice produces the same output.
+    """
+    b0 = _bit_for(mark_id, 0)  # oxford comma
+    b1 = _bit_for(mark_id, 1)  # em vs double-hyphen
+    b2 = _bit_for(mark_id, 2)  # curly vs straight quotes
+
+    EM_DASH = "\u2014"
+    OPEN_Q = "\u201c"
+    CLOSE_Q = "\u201d"
+
+    # b0: Oxford comma - only in lists of 3+ items ending with ", and"
+    if b0:
+        text = re.sub(r"(\w+), (\w+) and ", r"\1, \2, and ", text)
+    else:
+        text = re.sub(r"(\w+), (\w+), and ", r"\1, \2 and ", text)
+
+    # b1: em dash vs double-hyphen. Use character in replacement, not escape.
+    if b1:
+        text = text.replace(" -- ", f" {EM_DASH} ")
+        text = re.sub(r"(\w)--(\w)", lambda m: m.group(1) + EM_DASH + m.group(2), text)
+    else:
+        text = text.replace(f" {EM_DASH} ", " -- ")
+        text = re.sub(r"(\w)" + EM_DASH + r"(\w)", r"\1--\2", text)
+
+    # b2: straight quotes -> curly, alternating open/close and starting with
+    # an opening quote. When b2 is 0 the quotes are left untouched.
+    if b2:
+        quote_state = [1]  # 1 = the next " becomes an opening quote
+
+        def _curly(_m):
+            # Read the state before toggling so the first quote opens
+            is_open = quote_state[0]
+            quote_state[0] = 1 - is_open
+            return OPEN_Q if is_open else CLOSE_Q
+
+        text = re.sub(r'"', _curly, text)
+
+    return text
+
+
+def extract_punctuation_bits(text: str) -> list[int]:
+    """
+    Read the punctuation-style fingerprint out of the text.
+    Returns the observed bits in the order [b0, b1, b2]. A bit whose signal
+    is absent from the text is omitted, so a shorter list does not by itself
+    say which bit is missing.
+    """
+    bits: list[int] = []
+
+    # Oxford comma - look for last-comma-before-and pattern
+    oxford = len(re.findall(r",\s+\w+,\s+(?:and|or)\s+", text))
+    no_oxford = len(re.findall(r"\w,\s+\w+\s+(?:and|or)\s+", text))
+    if oxford + no_oxford > 0:
+        bits.append(1 if oxford > no_oxford else 0)
+
+    # em dash vs double hyphen
+    em_count = text.count("\u2014")
+    dh_count = len(re.findall(r"\w--\w| -- ", text))
+    if em_count + dh_count > 0:
+        bits.append(1 if em_count > dh_count else 0)
+
+    # curly vs straight quotes
+    curly = text.count("\u201c") + text.count("\u201d")
+    straight = text.count('"')
+    if curly + straight > 0:
+        bits.append(1 if curly > straight else 0)
+
+    return bits
+
+
+# ------------------------------------------------------------------
+# Combined L3 API
+# ------------------------------------------------------------------
+
+def embed_synonyms_v2(text: str, mark_id: bytes, min_instances: int = 8) -> str:
+    """
+    Production v2 synonym embedding: uses the expanded ~150-class dictionary
+    AND skips URLs, email addresses, file paths, and code blocks.
+    """
+    if not SYNONYMS_V2_AVAILABLE:
+        # fall back to v1 if v2 dict isn't importable
+        return embed_synonyms(text, mark_id, min_instances)
+
+    matches = list(iter_matchable_words(text))
+    if len(matches) < min_instances:
+        import sys
+        print(
+            f"[semantic v2] only {len(matches)} matchable words "
+            f"(need {min_instances}); skipping L3",
+            file=sys.stderr,
+        )
+        return text
+
+    variants = _mark_id_to_variant_sequence(mark_id, len(matches), class_size=3)
+
+    out: list[str] = []
+    cursor = 0
+    for (start, end, orig_word, (ci, _orig_vi, _pos)), target_vi in zip(matches, variants):
+        cls_variants = _V2_CLASSES[ci].variants
+        target_vi = target_vi % len(cls_variants)
+        # Skip multi-word variants (keep substitution a single-token swap)
+        if " " in cls_variants[target_vi]:
+            target_vi = (target_vi + 1) % len(cls_variants)
+            if " " in cls_variants[target_vi]:
+                target_vi = (target_vi + 1) % len(cls_variants)
+                if " " in cls_variants[target_vi]:
+                    # all three are multi-word? skip this match
+                    out.append(text[cursor:end])
+                    cursor = end
+                    continue
+        replacement = _case_preserve(cls_variants[target_vi], orig_word)
+        out.append(text[cursor:start])
+        out.append(replacement)
+        cursor = end
+    out.append(text[cursor:])
+    return "".join(out)
+
+
+def verify_synonyms_v2(
+    text: str, candidate_mark_id: bytes, threshold: float = 0.70
+) -> tuple[bool, float]:
+    """
+    v2 verify: uses the expanded dictionary with URL/code skip.
+    Returns (match, score).
+    """
+    if not SYNONYMS_V2_AVAILABLE:
+        return verify_synonyms_match(text, candidate_mark_id, threshold)
+
+    text = _strip_zw(text)
+    actual = [(ci, vi) for (_s, _e, _w, (ci, vi, _pos)) in iter_matchable_words(text)]
+    if not actual:
+        return False, 0.0
+
+    expected_variants = _mark_id_to_variant_sequence(candidate_mark_id, len(actual), 3)
+    matches = 0
+    counted = 0
+    for (ci, actual_vi), expected_vi in zip(actual, expected_variants):
+        cls_variants = _V2_CLASSES[ci].variants
+        counted += 1
+        exp_idx = expected_vi % len(cls_variants)
+        # If the expected variant is multi-word, embed did not use it: it fell
+        # through to the next single-word variant, or left the word unchanged.
+        # We don't replay that fallback here, so count those slots as
+        # "matches" (conservative - gives attacker slight benefit, but avoids
+        # false negatives).
+        if " " in cls_variants[exp_idx]:
+            matches += 1
+            continue
+        if exp_idx == actual_vi:
+            matches += 1
+
+    score = matches / counted if counted else 0.0
+    return (score >= threshold), score
+
+
+def apply_semantic(text: str, mark_id: bytes, use_v2: bool = True) -> str:
+    """Apply all L3 layers: synonyms (v2 by default) + punctuation."""
+    if use_v2 and SYNONYMS_V2_AVAILABLE:
+        t = embed_synonyms_v2(text, mark_id)
+    else:
+        t = embed_synonyms(text, mark_id)
+    t = embed_punctuation(t, mark_id)
+    return t
+
+
+def verify_semantic(text: str, candidate_mark_id: bytes, use_v2: bool = True) -> dict:
+    """Check whether text matches candidate_mark_id. Returns per-sublayer scores."""
+    if use_v2 and SYNONYMS_V2_AVAILABLE:
+        syn_match, syn_score = verify_synonyms_v2(text, candidate_mark_id)
+    else:
+        syn_match, syn_score = verify_synonyms_match(text, candidate_mark_id)
+    punct_bits = extract_punctuation_bits(text)
+    expected_punct = [
+        _bit_for(candidate_mark_id, 0),
+        _bit_for(candidate_mark_id, 1),
+        _bit_for(candidate_mark_id, 2),
+    ]
+    # Caveat: extract_punctuation_bits omits bits whose signal is absent, so
+    # this positional comparison is only exact when all three are present.
+    punct_hits = sum(1 for a, b in zip(punct_bits, expected_punct) if a == b)
+    punct_total = len(punct_bits)
+    punct_score = punct_hits / punct_total if punct_total else 0.0
+
+    return {
+        "synonyms_match": syn_match,
+        "synonyms_score": syn_score,
+        "punctuation_score": punct_score,
+        "punctuation_hits": f"{punct_hits}/{punct_total}",
+        "overall_match": syn_match and (punct_score >= 0.5 if punct_total else True),
+        "dict_version": "v2" if (use_v2 and SYNONYMS_V2_AVAILABLE) else "v1",
+    }
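
Reviewer note on `oversight_core/semantic.py`: the T1 verifier scores a candidate by regenerating its variant sequence and counting agreeing slots. The standalone sketch below isolates that idea so the threshold can be reasoned about. `variant_sequence` and `match_score` are hypothetical names, the two mark ids are made-up 8-byte values, and the "observed" indices are simulated here instead of being read from text.

```python
import hashlib


def variant_sequence(mark_id: bytes, n: int, class_size: int = 3) -> list[int]:
    """Counter-mode SHA-256 expansion of mark_id into n indices (sketch)."""
    out: list[int] = []
    counter = 0
    while len(out) < n:
        digest = hashlib.sha256(mark_id + counter.to_bytes(4, "big")).digest()
        out.extend(b % class_size for b in digest)
        counter += 1
    return out[:n]


def match_score(observed: list[int], candidate_mark_id: bytes) -> float:
    """Fraction of observed variant indices that the candidate would produce."""
    if not observed:
        return 0.0
    expected = variant_sequence(candidate_mark_id, len(observed))
    hits = sum(1 for o, e in zip(observed, expected) if o == e)
    return hits / len(observed)


recipient_a = bytes(range(8))      # hypothetical 64-bit mark ids
recipient_b = bytes(range(8, 16))
# Pretend these 200 indices were read back from a leaked copy sealed for A.
observed = variant_sequence(recipient_a, 200)

print(match_score(observed, recipient_a))  # the true recipient agrees on every slot
print(match_score(observed, recipient_b))  # an unrelated mark should land near 1/3
```

With three variants per class, an unrelated candidate is expected to agree on roughly one slot in three by chance, which is why a 0.70 threshold leaves room for some paraphrasing while still separating recipients on texts with enough matches.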
oversight_core/synonyms_v2.py +261 -0
@@ -0,0 +1,261 @@
+"""
+oversight_core.synonyms_v2
+=========================
+
+Expanded synonym table for L3 semantic watermarking, with part-of-speech
+tagging and URL/code-block skip logic.
+
+v0.2.1 additions over the 27-class v1 list:
+  - ~150 classes (verbs, adjectives, adverbs, nouns, connectors)
+  - A static part-of-speech tag on every class (no spaCy dep)
+  - Skips matches inside URLs, file paths, email addresses, code spans
+  - Match rules: class entries are grouped by POS, but lookup is by surface
+    word only and the first class listed wins, so an ambiguous word such as
+    "change" always resolves to its verb class
+
+Bit capacity at typical prose density (one match per ~10 words):
+ v1 (27 classes): ~40-70 bits per page
+ v2 (~150 classes): ~120-180 bits per page
+This is enough to redundantly encode a 64-bit mark id multiple times per page.
+
+For cryptographer-grade rigor: keep the class table in a separate versioned
+file (`synonyms_v2.py` here) and tag each manifest with the table version
+used, so attribution reliably replays the exact variant space.
+"""
+
+from __future__ import annotations
+
+import re
+from typing import Iterator, NamedTuple
+
+
+class SynonymClass(NamedTuple):
+    variants: tuple[str, ...]
+    pos: str  # 'verb' | 'adj' | 'adv' | 'noun' | 'conj'
+
+
+# ~150 synonym classes, grouped by part of speech.
+# Each class is 3-ary (encodes ~1.58 bits of information per match).
+# Keep variants to common words that substitute cleanly in most contexts.
+
+VERBS: list[SynonymClass] = [
+ SynonymClass(("begin", "start", "commence"), "verb"),
+ SynonymClass(("end", "finish", "conclude"), "verb"),
+ SynonymClass(("use", "utilize", "employ"), "verb"),
+ SynonymClass(("make", "create", "produce"), "verb"),
+ SynonymClass(("get", "obtain", "acquire"), "verb"),
+ SynonymClass(("find", "locate", "identify"), "verb"),
+ SynonymClass(("show", "display", "present"), "verb"),
+ SynonymClass(("tell", "inform", "notify"), "verb"),
+ SynonymClass(("give", "provide", "supply"), "verb"),
+ SynonymClass(("help", "assist", "aid"), "verb"),
+ SynonymClass(("think", "believe", "consider"), "verb"),
+ SynonymClass(("know", "understand", "recognize"), "verb"),
+ SynonymClass(("see", "observe", "notice"), "verb"),
+ SynonymClass(("want", "desire", "need"), "verb"),
+ SynonymClass(("look", "appear", "seem"), "verb"),
+ SynonymClass(("ask", "request", "query"), "verb"),
+ SynonymClass(("send", "transmit", "deliver"), "verb"),
+ SynonymClass(("allow", "permit", "enable"), "verb"),
+ SynonymClass(("stop", "halt", "cease"), "verb"),
+ SynonymClass(("continue", "proceed", "persist"), "verb"),
+ SynonymClass(("try", "attempt", "endeavor"), "verb"),
+ SynonymClass(("change", "modify", "alter"), "verb"),
+ SynonymClass(("add", "append", "include"), "verb"),
+ SynonymClass(("remove", "delete", "eliminate"), "verb"),
+ SynonymClass(("check", "verify", "confirm"), "verb"),
+ SynonymClass(("review", "examine", "evaluate"), "verb"),
+ SynonymClass(("agree", "concur", "consent"), "verb"),
+ SynonymClass(("decide", "determine", "resolve"), "verb"),
+ SynonymClass(("require", "need", "demand"), "verb"),
+ SynonymClass(("contain", "include", "hold"), "verb"),
+ SynonymClass(("return", "yield", "give back"), "verb"),
+ SynonymClass(("create", "generate", "build"), "verb"),
+ SynonymClass(("destroy", "eliminate", "eradicate"), "verb"),
+ SynonymClass(("improve", "enhance", "upgrade"), "verb"),
+ SynonymClass(("protect", "safeguard", "defend"), "verb"),
+ SynonymClass(("discuss", "address", "cover"), "verb"),
+ SynonymClass(("explain", "clarify", "describe"), "verb"),
+ SynonymClass(("propose", "suggest", "recommend"), "verb"),
+ SynonymClass(("demonstrate", "show", "prove"), "verb"),
+ SynonymClass(("achieve", "accomplish", "attain"), "verb"),
+ SynonymClass(("manage", "handle", "administer"), "verb"),
+ SynonymClass(("develop", "build", "engineer"), "verb"),
+ SynonymClass(("establish", "set up", "institute"), "verb"),
+ SynonymClass(("support", "back", "endorse"), "verb"),
+ SynonymClass(("reject", "refuse", "decline"), "verb"),
+ SynonymClass(("reduce", "decrease", "lower"), "verb"),
+ SynonymClass(("increase", "raise", "boost"), "verb"),
+ SynonymClass(("operate", "run", "function"), "verb"),
+ SynonymClass(("execute", "perform", "run"), "verb"),
+ SynonymClass(("investigate", "examine", "research"), "verb"),
+]
+
+ADJECTIVES: list[SynonymClass] = [
+ SynonymClass(("big", "large", "substantial"), "adj"),
+ SynonymClass(("small", "tiny", "minor"), "adj"),
+ SynonymClass(("fast", "quick", "rapid"), "adj"),
+ SynonymClass(("slow", "gradual", "deliberate"), "adj"),
+ SynonymClass(("important", "critical", "significant"), "adj"),
+ SynonymClass(("hard", "difficult", "challenging"), "adj"),
+ SynonymClass(("easy", "simple", "straightforward"), "adj"),
+ SynonymClass(("good", "excellent", "effective"), "adj"),
+ SynonymClass(("bad", "poor", "inferior"), "adj"),
+ SynonymClass(("new", "recent", "current"), "adj"),
+ SynonymClass(("old", "prior", "previous"), "adj"),
+ SynonymClass(("common", "typical", "standard"), "adj"),
+ SynonymClass(("rare", "unusual", "uncommon"), "adj"),
+ SynonymClass(("safe", "secure", "protected"), "adj"),
+ SynonymClass(("dangerous", "risky", "hazardous"), "adj"),
+ SynonymClass(("correct", "accurate", "right"), "adj"),
+ SynonymClass(("wrong", "incorrect", "mistaken"), "adj"),
+ SynonymClass(("clear", "obvious", "evident"), "adj"),
+ SynonymClass(("unclear", "vague", "ambiguous"), "adj"),
+ SynonymClass(("strong", "robust", "powerful"), "adj"),
+ SynonymClass(("weak", "fragile", "limited"), "adj"),
+ SynonymClass(("full", "complete", "entire"), "adj"),
+ SynonymClass(("empty", "vacant", "bare"), "adj"),
+ SynonymClass(("open", "available", "accessible"), "adj"),
+ SynonymClass(("closed", "sealed", "restricted"), "adj"),
+ SynonymClass(("visible", "apparent", "observable"), "adj"),
+ SynonymClass(("hidden", "concealed", "obscured"), "adj"),
+ SynonymClass(("public", "open", "unrestricted"), "adj"),
+ SynonymClass(("private", "confidential", "restricted"), "adj"),
+ SynonymClass(("complete", "finished", "done"), "adj"),
+ SynonymClass(("partial", "incomplete", "limited"), "adj"),
+ SynonymClass(("useful", "helpful", "valuable"), "adj"),
+ SynonymClass(("useless", "pointless", "ineffective"), "adj"),
+ SynonymClass(("interesting", "engaging", "compelling"), "adj"),
+ SynonymClass(("boring", "dull", "tedious"), "adj"),
+ SynonymClass(("early", "initial", "preliminary"), "adj"),
+ SynonymClass(("late", "delayed", "overdue"), "adj"),
+ SynonymClass(("possible", "feasible", "viable"), "adj"),
+ SynonymClass(("impossible", "unfeasible", "impractical"), "adj"),
+ SynonymClass(("normal", "typical", "regular"), "adj"),
+ SynonymClass(("abnormal", "unusual", "atypical"), "adj"),
+ SynonymClass(("high", "elevated", "significant"), "adj"),
+ SynonymClass(("low", "reduced", "minimal"), "adj"),
+]
+
+ADVERBS: list[SynonymClass] = [
+ SynonymClass(("quickly", "rapidly", "swiftly"), "adv"),
+ SynonymClass(("slowly", "gradually", "steadily"), "adv"),
+ SynonymClass(("carefully", "cautiously", "thoroughly"), "adv"),
+ SynonymClass(("often", "frequently", "regularly"), "adv"),
+ SynonymClass(("rarely", "seldom", "infrequently"), "adv"),
+ SynonymClass(("usually", "typically", "generally"), "adv"),
+ SynonymClass(("sometimes", "occasionally", "periodically"), "adv"),
+ SynonymClass(("always", "consistently", "invariably"), "adv"),
+ SynonymClass(("never", "not ever", "at no time"), "adv"),
+ SynonymClass(("clearly", "obviously", "plainly"), "adv"),
+ SynonymClass(("exactly", "precisely", "specifically"), "adv"),
+ SynonymClass(("approximately", "roughly", "around"), "adv"),
+ SynonymClass(("completely", "entirely", "fully"), "adv"),
+ SynonymClass(("partially", "partly", "somewhat"), "adv"),
+ SynonymClass(("immediately", "instantly", "promptly"), "adv"),
+ SynonymClass(("eventually", "ultimately", "finally"), "adv"),
+ SynonymClass(("recently", "lately", "newly"), "adv"),
+ SynonymClass(("currently", "presently", "now"), "adv"),
+ SynonymClass(("previously", "formerly", "earlier"), "adv"),
+ SynonymClass(("easily", "readily", "effortlessly"), "adv"),
+]
+
+NOUNS: list[SynonymClass] = [
+ SynonymClass(("problem", "issue", "concern"), "noun"),
+ SynonymClass(("answer", "response", "reply"), "noun"),
+ SynonymClass(("question", "query", "inquiry"), "noun"),
+ SynonymClass(("idea", "concept", "notion"), "noun"),
+ SynonymClass(("plan", "strategy", "approach"), "noun"),
+ SynonymClass(("result", "outcome", "consequence"), "noun"),
+ SynonymClass(("method", "approach", "technique"), "noun"),
+ SynonymClass(("goal", "objective", "aim"), "noun"),
+ SynonymClass(("change", "modification", "alteration"), "noun"),
+ SynonymClass(("system", "framework", "structure"), "noun"),
+ SynonymClass(("process", "procedure", "workflow"), "noun"),
+ SynonymClass(("feature", "function", "capability"), "noun"),
+ SynonymClass(("effect", "impact", "influence"), "noun"),
+ SynonymClass(("cause", "reason", "source"), "noun"),
+ SynonymClass(("example", "instance", "case"), "noun"),
+ SynonymClass(("detail", "particular", "specific"), "noun"),
+ SynonymClass(("summary", "overview", "synopsis"), "noun"),
+ SynonymClass(("notice", "notification", "alert"), "noun"),
+ SynonymClass(("record", "log", "entry"), "noun"),
+ SynonymClass(("report", "document", "write-up"), "noun"),
+ SynonymClass(("data", "information", "content"), "noun"),
+ SynonymClass(("value", "amount", "quantity"), "noun"),
+ SynonymClass(("location", "place", "site"), "noun"),
+ SynonymClass(("time", "moment", "instant"), "noun"),
+ SynonymClass(("benefit", "advantage", "gain"), "noun"),
+ SynonymClass(("risk", "hazard", "threat"), "noun"),
+ SynonymClass(("error", "mistake", "flaw"), "noun"),
+ SynonymClass(("need", "requirement", "necessity"), "noun"),
+ SynonymClass(("request", "application", "petition"), "noun"),
+ SynonymClass(("opportunity", "chance", "possibility"), "noun"),
+]
+
+CONNECTORS: list[SynonymClass] = [
+ SynonymClass(("however", "nevertheless", "nonetheless"), "conj"),
+ SynonymClass(("therefore", "consequently", "thus"), "conj"),
+ SynonymClass(("also", "additionally", "furthermore"), "conj"),
+ SynonymClass(("but", "yet", "though"), "conj"),
+ SynonymClass(("because", "since", "as"), "conj"),
+ SynonymClass(("although", "while", "whereas"), "conj"),
+ SynonymClass(("similarly", "likewise", "comparably"), "conj"),
+ SynonymClass(("instead", "rather", "alternatively"), "conj"),
+]
+
+
+ALL_CLASSES: list[SynonymClass] = VERBS + ADJECTIVES + ADVERBS + NOUNS + CONNECTORS
+
+# Lookup: lowercased word -> (class_index, variant_index, pos)
+_LOOKUP: dict[str, tuple[int, int, str]] = {}
+for ci, cls in enumerate(ALL_CLASSES):
+    for vi, word in enumerate(cls.variants):
+        # only index simple single-word variants (skip multi-word like "not ever")
+        if " " not in word:
+            if word.lower() not in _LOOKUP:  # first entry wins for ambiguous words
+                _LOOKUP[word.lower()] = (ci, vi, cls.pos)
+
+
+SYNONYM_COUNT = len(ALL_CLASSES)
+
+
+# ------------------------------------------------------------------
+# Skip regions: URLs, emails, file paths, code spans, numbers
+# ------------------------------------------------------------------
+
+# Patterns for regions where we should NOT swap words.
+_SKIP_PATTERNS = [
+ re.compile(r"https?://\S+"), # URLs
+ re.compile(r"\b[\w.+-]+@[\w.-]+\.\w+\b"), # emails
+ re.compile(r"`[^`]+`"), # inline code
+ re.compile(r"```[\s\S]*?```"), # code blocks
+ re.compile(r"(?:^|\s)(?:/|~/|\./)[^\s]+"), # unix paths
+ re.compile(r"\b[A-Za-z]:\\[^\s]+"), # windows paths (one literal backslash after the drive colon)
+ re.compile(r"\b[A-Fa-f0-9]{16,}\b"), # hex blobs (hashes, keys)
+ re.compile(r"\b[A-Za-z0-9+/]{32,}={0,2}\b"), # base64 blobs
+]
+
+
+def iter_matchable_words(text: str) -> Iterator[tuple[int, int, str, tuple[int, int, str]]]:
+ """
+ Walk text and yield (start, end, word, (class_index, variant_index, pos))
+ for each word that's in the synonym table AND not inside a skip region.
+
+ This is the production entry point for L3 embedding and verification.
+ """
+ # Build a mask of skip regions
+ skip_mask = [False] * len(text)
+ for pat in _SKIP_PATTERNS:
+ for m in pat.finditer(text):
+ for i in range(m.start(), m.end()):
+ if i < len(skip_mask):
+ skip_mask[i] = True
+
+ word_re = re.compile(r"\b([A-Za-z]+)\b")
+ for m in word_re.finditer(text):
+ # Skip if any part of the word is inside a skip region
+ if any(skip_mask[i] for i in range(m.start(), m.end())):
+ continue
+ key = m.group(1).lower()
+ if key in _LOOKUP:
+ yield m.start(), m.end(), m.group(1), _LOOKUP[key]
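The skip-region masking above can be illustrated with a minimal, self-contained sketch. It assumes a hypothetical four-word `LOOKUP` table and only two of the skip patterns (URLs and inline code); it is not the module's real table or its full pattern list.

```python
import re
from typing import Iterator

# Hypothetical four-entry lookup standing in for the full synonym table.
LOOKUP = {"start": (0, 0), "begin": (0, 1), "record": (1, 0), "log": (1, 1)}

# Two of the skip patterns from the module: URLs and inline code.
SKIP_PATTERNS = [re.compile(r"https?://\S+"), re.compile(r"`[^`]+`")]

WORD_RE = re.compile(r"\b([A-Za-z]+)\b")


def matchable_words(text: str) -> Iterator[tuple[int, int, str]]:
    """Yield (start, end, word) for table words outside every skip region."""
    # One boolean per character; True means "inside a skip region".
    skip = [False] * len(text)
    for pat in SKIP_PATTERNS:
        for m in pat.finditer(text):
            skip[m.start():m.end()] = [True] * (m.end() - m.start())
    for m in WORD_RE.finditer(text):
        # Reject the word if any of its characters is masked.
        if any(skip[m.start():m.end()]):
            continue
        if m.group(1).lower() in LOOKUP:
            yield m.start(), m.end(), m.group(1)


sample = "Start the job, see https://example.com/log and `record` it in the log."
words = [w for _, _, w in matchable_words(sample)]
print(words)
```

Only the plain-prose occurrences are yielded; the `log` inside the URL and the `record` inside backticks are masked out, so swapping synonyms cannot corrupt links or code.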
oversight_core/timestamp.py +156 -0
@@ -0,0 +1,156 @@
+"""
+oversight_core.timestamp
+========================
+
+RFC 3161 qualified timestamp client. Used by the registry to get
+independently-auditable timestamps from a Time Stamp Authority, rather than
+relying on the registry's own clock.
+
+Free, no-account TSA options (tested and working):
+ - https://freetsa.org/tsr - FreeTSA, P-384 EC, valid to 2040
+ - http://timestamp.digicert.com - DigiCert, RFC 3161 compliant, widely used
+
+Every timestamp:
+ - is signed by the TSA's private key (independently verifiable)
+ - contains gen_time from the TSA's clock
+ - contains a nonce to prevent replay
+ - commits to our chosen hash of the input
+
+We store the raw bytes of the TimeStampToken as BLOB in the registry's events
+table. A court examiner can independently verify the timestamp offline using
+`openssl ts -verify` + the TSA's public cert, without trusting us.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import os
+from dataclasses import dataclass
+from typing import Optional
+
+import httpx
+
+try:
+ from rfc3161_client import TimestampRequestBuilder, decode_timestamp_response
+ RFC3161_AVAILABLE = True
+except ImportError:
+ RFC3161_AVAILABLE = False
+
+
+# Default TSA chain: try them in order.
+# Both are free, require no account, and are RFC 3161 compliant.
+DEFAULT_TSA_CHAIN = [
+ # FreeTSA: modernized March 2026 to P-384, valid until 2040.
+ "https://freetsa.org/tsr",
+ # DigiCert: commercial-grade free endpoint used by Authenticode.
+ "http://timestamp.digicert.com",
+]
+
+
+@dataclass
+class QualifiedTimestamp:
+ """Represents a signed RFC 3161 timestamp that can be independently verified."""
+ tsa_url: str
+ token_bytes: bytes # raw ASN.1 TimeStampToken - opaque, verifiable offline
+ gen_time_iso: str # ISO 8601 "2026-04-17T23:11:04+00:00"
+ serial_number: int
+ nonce: int
+ policy_oid: str # TSA policy OID
+ message_hash: bytes # SHA-512 of what was timestamped
+
+ def to_dict(self) -> dict:
+ """Serialize for storage in the registry evidence bundle."""
+ return {
+ "tsa_url": self.tsa_url,
+ "token_hex": self.token_bytes.hex(),
+ "gen_time": self.gen_time_iso,
+ "serial": self.serial_number,
+ "nonce": self.nonce,
+ "policy_oid": self.policy_oid,
+ "message_hash_hex": self.message_hash.hex(),
+ }
+
+
+def qualified_timestamp(
+ data: bytes,
+ tsa_chain: Optional[list[str]] = None,
+ timeout: float = 15.0,
+) -> Optional[QualifiedTimestamp]:
+ """
+ Request a qualified timestamp for `data`. Tries each TSA in the chain
+ until one succeeds; returns None if all fail (offline / network down).
+
+ This is a BEST-EFFORT operation: the caller should proceed even if
+ qualification fails, and annotate the event as "self-timestamped" rather
+ than "qualified-timestamped". The registry's signed tree head still provides
+ tamper evidence for the sequence of events, just not clock independence.
+
+ Example:
+ ts = qualified_timestamp(event_canonical_bytes)
+ if ts:
+ event["qualified_timestamp"] = ts.to_dict()
+ else:
+ event["qualified_timestamp"] = None # fell back to self-timestamped
+
+ The returned QualifiedTimestamp contains the raw TSA token. An external
+ auditor can verify it with `openssl ts -verify -in token.tsr -data data`
+ + the TSA's CA certificate (which both FreeTSA and DigiCert publish).
+ """
+ if not RFC3161_AVAILABLE:
+ return None
+
+ for tsa_url in (tsa_chain or DEFAULT_TSA_CHAIN):
+ try:
+ req = TimestampRequestBuilder().data(data).nonce(nonce=True).build()
+ resp = httpx.post(
+ tsa_url,
+ content=req.as_bytes(),
+ headers={"Content-Type": "application/timestamp-query"},
+ timeout=timeout,
+ )
+ if resp.status_code != 200:
+ continue
+ tsr = decode_timestamp_response(resp.content)
+ if tsr.status != 0: # 0 == granted
+ continue
+
+ tst_info = tsr.tst_info
+ mi = tst_info.message_imprint
+
+ return QualifiedTimestamp(
+ tsa_url=tsa_url,
+ token_bytes=tsr.time_stamp_token(),
+ gen_time_iso=tst_info.gen_time.isoformat(),
+ serial_number=tst_info.serial_number,
+ nonce=tst_info.nonce,
+ policy_oid=tst_info.policy.dotted_string if tst_info.policy else "",
+ message_hash=bytes(mi.message),
+ )
+ except (httpx.HTTPError, ValueError, TimeoutError, OSError):
+ # Network failure or malformed response - try next TSA.
+ continue
+
+ # All TSAs unreachable; caller falls back to self-timestamp.
+ return None
+
+
+def verify_qualified_timestamp(
+ ts: QualifiedTimestamp,
+ original_data: bytes,
+) -> tuple[bool, str]:
+ """
+ Light verification: checks that the TSA's claimed message hash matches
+ sha-512 of original_data. Does NOT verify the TSA's signature or cert
+ chain - that needs `openssl ts -verify` or equivalent with the TSA's
+ root cert, which Oversight doesn't ship (users obtain from the TSA).
+
+ Returns (ok, reason).
+ """
+ computed = hashlib.sha512(original_data).digest()
+ if computed != ts.message_hash:
+ return False, (
+ f"message-hash mismatch: TSA committed to "
+ f"{ts.message_hash[:16].hex()}..., computed "
+ f"{computed[:16].hex()}..."
+ )
+ return True, "TSA message-hash matches data; signature verification requires TSA root cert"
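The light verification step can be sketched in isolation. The `StoredTimestamp` class below is a hypothetical stand-in for two of the fields that `to_dict()` persists. Like `verify_qualified_timestamp`, the sketch only compares the message imprint and does not check the TSA signature or certificate chain.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredTimestamp:
    """Hypothetical stand-in for fields persisted in the evidence bundle."""
    token_hex: str
    message_hash_hex: str


def imprint_matches(stored: StoredTimestamp, original_data: bytes) -> bool:
    """True if the stored imprint equals SHA-512 of the data (no signature check)."""
    return hashlib.sha512(original_data).hexdigest() == stored.message_hash_hex


event = b'{"kind":"registration","file_id":"abc"}'
stored = StoredTimestamp(
    token_hex="",  # the raw TSA token would go here; unused by this check
    message_hash_hex=hashlib.sha512(event).hexdigest(),
)
ok = imprint_matches(stored, event)
tampered = imprint_matches(stored, event + b" ")
print(ok, tampered)
```

A match here only shows that the stored record refers to the same bytes; proving that the TSA actually issued the token still requires `openssl ts -verify` or an equivalent check against the TSA's root certificate.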
oversight_core/tlog.py +239 -0
@@ -0,0 +1,239 @@
+"""
+oversight_core.tlog
+==================
+
+Append-only Merkle transparency log for the OVERSIGHT registry.
+
+Every event (registration, beacon callback, attribution query) is appended
+as a leaf. The log signs a tree head periodically; auditors can verify
+inclusion proofs for any event and detect if the registry ever attempted to
+remove or reorder entries.
+
+This is a simplified version of Sigstore Rekor / Google Trillian. For
+production at scale, delegate to one of those - the code below is sufficient
+for single-registry integrity and audit.
+
+Schema:
+ leaf_hash = SHA-256(0x00 || leaf_bytes) (RFC 6962 leaf prefix)
+ internal_hash = SHA-256(0x01 || left || right) (RFC 6962 node prefix)
+ root = top hash at any tree size
+ signed head = Ed25519 signature over canonical JSON {"root", "size"} by registry's tlog key
+
+Storage: flat append-only file of leaves + in-memory tree of hashes.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json
+import os
+import threading
+from pathlib import Path
+from typing import Optional
+
+from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
+
+
+def _h(data: bytes) -> bytes:
+ return hashlib.sha256(data).digest()
+
+
+def _largest_power_of_2_less_than(n: int) -> int:
+ """Largest k = 2^j such that k < n (for n >= 2). RFC 6962 §2.1."""
+ assert n >= 2
+ k = 1
+ while k * 2 < n:
+ k *= 2
+ return k
+
+
+def _rfc6962_mth(leaf_hashes: list[bytes]) -> bytes:
+ """Merkle Tree Hash over pre-hashed leaves, RFC 6962 §2.1.
+
+ Assumes `leaf_hashes` are already _h(0x00 || leaf_bytes) (the leaf prefix
+ is applied at append time). This function only handles internal node
+ combining with 0x01 prefix and left-heavy splits.
+ """
+ n = len(leaf_hashes)
+ if n == 1:
+ return leaf_hashes[0]
+ k = _largest_power_of_2_less_than(n)
+ left = _rfc6962_mth(leaf_hashes[:k])
+ right = _rfc6962_mth(leaf_hashes[k:])
+ return _h(b"\x01" + left + right)
+
+
+def _rfc6962_path(leaf_hashes: list[bytes], m: int) -> list[bytes]:
+ """Compute the audit path (inclusion proof) for leaf index m, RFC 6962 §2.1.1.
+
+ Returns a list of sibling hashes that, combined with the leaf, rebuild the root.
+ """
+ n = len(leaf_hashes)
+ if n <= 1:
+ return []
+ k = _largest_power_of_2_less_than(n)
+ if m < k:
+ # target is in the left subtree; sibling is the right subtree root
+ return _rfc6962_path(leaf_hashes[:k], m) + [_rfc6962_mth(leaf_hashes[k:])]
+ else:
+ # target is in the right subtree; sibling is the left subtree root
+ return _rfc6962_path(leaf_hashes[k:], m - k) + [_rfc6962_mth(leaf_hashes[:k])]
+
+
+class TransparencyLog:
+ """Append-only Merkle log with signed tree heads.
+
+ Improvements in v0.2.1:
+ - fsync on append so entries survive crashes
+ - Merkle root cached between appends (invalidated on append, then recomputed from all leaves in O(n))
+ """
+
+ def __init__(self, data_dir: str | Path, signing_key_hex: Optional[str] = None):
+ self.dir = Path(data_dir)
+ self.dir.mkdir(parents=True, exist_ok=True)
+ self.leaves_path = self.dir / "leaves.jsonl"
+ self.head_path = self.dir / "head.json"
+ self._lock = threading.Lock()
+ self._leaves: list[bytes] = []
+ # cached root; invalidated on append
+ self._cached_root: Optional[bytes] = None
+ self._load()
+
+ if signing_key_hex:
+ self._sk = Ed25519PrivateKey.from_private_bytes(bytes.fromhex(signing_key_hex))
+ else:
+ self._sk = None
+
+ def _load(self):
+ if not self.leaves_path.exists():
+ return
+ with self.leaves_path.open("r") as f:
+ for line in f:
+ try:
+ rec = json.loads(line)
+ self._leaves.append(bytes.fromhex(rec["leaf_hash"]))
+ except (ValueError, KeyError):
+ continue
+
+ def append(self, leaf_data: bytes | str | dict) -> int:
+ """Append a leaf. Durable: fsync before return."""
+ if isinstance(leaf_data, dict):
+ leaf_bytes = json.dumps(
+ leaf_data, sort_keys=True, separators=(",", ":")
+ ).encode("utf-8")
+ elif isinstance(leaf_data, str):
+ leaf_bytes = leaf_data.encode("utf-8")
+ else:
+ leaf_bytes = leaf_data
+
+ with self._lock:
+ index = len(self._leaves)
+ leaf_hash = _h(b"\x00" + leaf_bytes) # RFC 6962 leaf prefix
+ self._leaves.append(leaf_hash)
+ self._cached_root = None # invalidate cache
+ record = json.dumps({
+ "index": index,
+ "leaf_hash": leaf_hash.hex(),
+ "leaf_data": leaf_bytes.decode("utf-8", errors="replace"),
+ }) + "\n"
+ with self.leaves_path.open("a") as f:
+ f.write(record)
+ f.flush()
+ os.fsync(f.fileno())
+ return index
+
+ def root(self) -> bytes:
+ """Compute current Merkle root per RFC 6962. Cached after first compute.
+
+ RFC 6962 formula:
+ MTH({}) = SHA-256() (this implementation returns 32 zero bytes for an empty log instead)
+ MTH({d[0]}) = SHA-256(0x00 || d[0]) (leaf hash, handled at append)
+ MTH(D[0:n]) = SHA-256(0x01 || MTH(D[0:k]) || MTH(D[k:n]))
+ where k is the largest power of 2 < n
+
+ This produces a left-heavy tree where the last subtree may be smaller,
+ which is the canonical form verifiable by any RFC 6962 client (Sigstore
+ Rekor, CT log verifiers, etc.).
+ """
+ with self._lock:
+ if self._cached_root is not None:
+ return self._cached_root
+ if not self._leaves:
+ self._cached_root = b"\x00" * 32
+ return self._cached_root
+ self._cached_root = _rfc6962_mth(self._leaves)
+ return self._cached_root
+
+ def size(self) -> int:
+ return len(self._leaves)
+
+ def signed_head(self) -> dict:
+ size = self.size()
+ root = self.root()
+ head = {"size": size, "root": root.hex()}
+ msg = json.dumps(head, sort_keys=True, separators=(",", ":")).encode("utf-8")
+ if self._sk:
+ sig = self._sk.sign(msg)
+ head["signature"] = sig.hex()
+ head["signed_message"] = msg.decode("utf-8")
+ return head
+
+ def inclusion_proof(self, index: int) -> Optional[dict]:
+ """RFC 6962 inclusion proof for the leaf at `index`.
+
+ Use `verify_inclusion_proof()` to check the returned proof against
+ a signed root. The proof order matches RFC 6962 §2.1.1 - deepest
+ sibling first, root-level sibling last.
+ """
+ if index < 0 or index >= len(self._leaves):
+ return None
+ path = _rfc6962_path(list(self._leaves), index)
+ return {
+ "index": index,
+ "leaf_hash": self._leaves[index].hex(),
+ "proof": [h.hex() for h in path],
+ "root": self.root().hex(),
+ "tree_size": len(self._leaves),
+ }
+
+
+def verify_inclusion_proof(
+ leaf_hash: bytes,
+ index: int,
+ proof: list[bytes],
+ tree_size: int,
+ expected_root: bytes,
+) -> bool:
+ """RFC 6962 §2.1.1 inclusion proof verifier.
+
+ Recursive structure mirrors the prover: at each level, decide whether the
+ target leaf is in the left or right subtree based on (index, largest-power-
+ of-2 split), and combine the sibling from the proof path accordingly.
+ """
+ if tree_size < 1 or index < 0 or index >= tree_size:
+ return False
+
+ def rec(h: bytes, m: int, remaining: list[bytes], n: int) -> Optional[bytes]:
+ if n == 1:
+ return h if not remaining else None
+ if not remaining:
+ return None
+ k = _largest_power_of_2_less_than(n)
+ # The last element of `remaining` is the sibling at THIS level;
+ # deeper siblings come before it in the list.
+ sibling = remaining[-1]
+ deeper = remaining[:-1]
+ if m < k:
+ left = rec(h, m, deeper, k)
+ if left is None:
+ return None
+ right = sibling
+ else:
+ left = sibling
+ right = rec(h, m - k, deeper, n - k)
+ if right is None:
+ return None
+ return _h(b"\x01" + left + right)
+
+ computed = rec(leaf_hash, index, list(proof), tree_size)
+ return computed == expected_root
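The prover and verifier in this file can be exercised end to end with a small self-contained sketch. The function names here (`split`, `mth`, `path`, `root_from_path`) are illustrative, not the module's API. It assumes the same conventions as the code above: leaves pre-hashed with the `0x00` prefix, internal nodes with `0x01`, and proofs ordered deepest sibling first.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def split(n: int) -> int:
    """Largest power of two strictly less than n (n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def mth(leaves: list[bytes]) -> bytes:
    """Merkle Tree Hash over a non-empty list of pre-hashed leaves."""
    if len(leaves) == 1:
        return leaves[0]
    k = split(len(leaves))
    return h(b"\x01" + mth(leaves[:k]) + mth(leaves[k:]))


def path(leaves: list[bytes], m: int) -> list[bytes]:
    """Audit path for leaf m, deepest sibling first."""
    if len(leaves) <= 1:
        return []
    k = split(len(leaves))
    if m < k:
        return path(leaves[:k], m) + [mth(leaves[k:])]
    return path(leaves[k:], m - k) + [mth(leaves[:k])]


def root_from_path(leaf: bytes, m: int, n: int, proof: list[bytes]) -> bytes:
    """Rebuild the root from a leaf hash and its audit path."""
    if n == 1:
        if proof:
            raise ValueError("proof too long")
        return leaf
    if not proof:
        raise ValueError("proof too short")
    k = split(n)
    sibling, deeper = proof[-1], proof[:-1]  # last element is this level's sibling
    if m < k:
        return h(b"\x01" + root_from_path(leaf, m, k, deeper) + sibling)
    return h(b"\x01" + sibling + root_from_path(leaf, m - k, n - k, deeper))


# Seven leaves gives an unbalanced (left-heavy) tree.
leaves = [h(b"\x00" + f"event-{i}".encode()) for i in range(7)]
root = mth(leaves)
proofs_ok = all(
    root_from_path(leaves[i], i, len(leaves), path(leaves, i)) == root
    for i in range(len(leaves))
)
print(proofs_ok, len(path(leaves, 0)), len(path(leaves, 6)))
```

With seven leaves the tree is unbalanced, so leaf 0 needs three siblings while leaf 6 needs only two; a verifier must therefore derive the proof length from `(index, tree_size)` rather than assume a fixed depth.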
oversight_core/watermark.py +208 -0
@@ -0,0 +1,208 @@
+"""
+oversight_core.watermark
+=======================
+
+Per-recipient watermarking. The point is attribution after plaintext escape:
+if a sealed file is decrypted and leaked, the recovered plaintext still contains
+marks that identify WHICH recipient's copy it was.
+
+This MVP ships three mark layers. Each is independently keyed, so an attacker
+stripping one doesn't defeat the others. The `mark_id` is a random per-recipient
+tag registered in the manifest - matching it in leaked content proves the source.
+
+Layers:
+ L1 (zero-width unicode stego):
+ Embeds mark_id bits as ZWSP / ZWNJ / ZWJ in text content. Survives copy-paste
+ and most format conversions. Defeated by "normalize/strip invisibles" passes.
+
+ L2 (whitespace pattern):
+ Encodes bits as trailing space vs tab at line endings. Survives more aggressive
+ cleaning than L1 because linters often don't touch trailing whitespace in
+ content-bearing fields.
+
+ L3 (synonym rotation, stub):
+ Placeholder for semantic watermarking - swap between {start/begin/commence}
+ style synonym classes per-bit. Survives format conversion completely because
+ the mark is in the *words chosen*. Real implementation needs an NLP pass;
+ the stub here demonstrates the hook.
+
+Future (not in MVP):
+ - Visual DCT-domain watermarks for images (robust to recompression + screenshot)
+ - Layout perturbation for PDFs (micro-kerning, line-spacing)
+ - Structural marks for code files (whitespace + comment ordering)
+
+All mark IDs are random per-recipient. Decoder returns the first matching ID
+from the registry - that's your attribution.
+"""
+
+from __future__ import annotations
+
+import secrets
+from typing import Iterable, Optional
+
+
+# Zero-width characters used for L1
+ZW_SPACE = "\u200b" # bit 0
+ZW_NONJOIN = "\u200c" # bit 1
+ZW_JOIN = "\u200d" # separator / frame
+ZW_ALL = (ZW_SPACE, ZW_NONJOIN, ZW_JOIN)
+
+
+def _bits_of(data: bytes) -> list[int]:
+ out = []
+ for byte in data:
+ for i in range(8):
+ out.append((byte >> (7 - i)) & 1)
+ return out
+
+
+def _bytes_from_bits(bits: Iterable[int]) -> bytes:
+ bits = list(bits)
+ # truncate to whole-byte boundary
+ n = (len(bits) // 8) * 8
+ bits = bits[:n]
+ out = bytearray()
+ for i in range(0, n, 8):
+ b = 0
+ for j in range(8):
+ b = (b << 1) | (bits[i + j] & 1)
+ out.append(b)
+ return bytes(out)
+
+
+def new_mark_id(n_bytes: int = 8) -> bytes:
+ """A per-recipient mark ID. 8 bytes = 64 bits = plenty for attribution."""
+ return secrets.token_bytes(n_bytes)
+
+
+# ---------------- L1: zero-width unicode ----------------
+
+def embed_zw(text: str, mark_id: bytes, density: int = 40) -> str:
+ """
+ Embed mark_id into text as zero-width unicode characters.
+ density = number of characters between frame insertions (a 1000-char doc gets 24 frames at the default of 40).
+
+ Encoding: a frame of [ZW_JOIN] [bits of mark_id as ZWSP/ZWNJ] [ZW_JOIN].
+ Multiple redundant frames are scattered through the text.
+ """
+ bits = _bits_of(mark_id)
+ frame = ZW_JOIN + "".join(ZW_SPACE if b == 0 else ZW_NONJOIN for b in bits) + ZW_JOIN
+
+ if len(text) <= density:
+ return text + frame # too short to reach a density boundary; just append
+
+ out = []
+ for i, ch in enumerate(text):
+ out.append(ch)
+ # insert full frame at each density-boundary
+ if i > 0 and i % density == 0:
+ out.append(frame)
+ return "".join(out)
+
+
+def extract_zw(text: str, mark_len_bytes: int = 8) -> list[bytes]:
+ """
+ Recover all candidate mark_ids from zero-width marks in text.
+ Returns a list (may have repeats if multiple frames survived).
+ """
+ marks = []
+ expected_bits = mark_len_bytes * 8
+ i = 0
+ while i < len(text):
+ if text[i] == ZW_JOIN:
+ # start of frame
+ bits = []
+ j = i + 1
+ while j < len(text) and text[j] in (ZW_SPACE, ZW_NONJOIN):
+ bits.append(0 if text[j] == ZW_SPACE else 1)
+ j += 1
+ if j < len(text) and text[j] == ZW_JOIN and len(bits) == expected_bits:
+ marks.append(_bytes_from_bits(bits))
+ i = j + 1
+ else:
+ i += 1
+ return marks
+
+
+# ---------------- L2: trailing whitespace ----------------
+
+def embed_ws(text: str, mark_id: bytes) -> str:
+ """
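+ Encode bits as a trailing space (bit 0) or trailing tab (bit 1), one bit per line.
+ Non-destructive: lines that already end in whitespace are left untouched, and if the
+ text has fewer eligible lines than bits, only a prefix of the mark is embedded.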
+ """
+ bits = _bits_of(mark_id)
+ lines = text.split("\n")
+ out_lines = []
+ bi = 0
+ for line in lines:
+ if bi < len(bits) and line.rstrip() == line: # no existing trailing ws
+ suffix = " " if bits[bi] == 0 else "\t"
+ out_lines.append(line + suffix)
+ bi += 1
+ else:
+ out_lines.append(line)
+ return "\n".join(out_lines)
+
+
+def extract_ws(text: str, mark_len_bytes: int = 8) -> Optional[bytes]:
+ """Read the whitespace mark back out. Returns None if incomplete."""
+ needed = mark_len_bytes * 8
+ bits: list[int] = []
+ for line in text.split("\n"):
+ if line.endswith(" "):
+ bits.append(0)
+ elif line.endswith("\t"):
+ bits.append(1)
+ if len(bits) >= needed:
+ break
+ if len(bits) < needed:
+ return None
+ return _bytes_from_bits(bits[:needed])
+
+
+# ---------------- L3: synonym-class (stub) ----------------
+
+# Illustrative only. Real deployment needs a curated synonym table + NLP-aware insertion.
+SYNONYM_CLASSES = [
+ ("begin", "start", "commence"), # 3-ary, encodes log2(3) ≈ 1.58 bits
+ ("large", "big", "substantial"),
+ ("fast", "quick", "rapid"),
+ ("show", "display", "present"),
+]
+
+
+def embed_synonyms_stub(text: str, mark_id: bytes) -> str:
+ """
+ Stub: demonstrates the hook. A production version walks the text with an NLP
+ tagger, finds matches in SYNONYM_CLASSES, and rotates them deterministically
+ based on bits of mark_id.
+ """
+ # Deliberately a no-op placeholder - clearly flagged so it's not mistaken for real.
+ return text
+
+
+def extract_synonyms_stub(text: str) -> Optional[bytes]:
+ return None
+
+
+# ---------------- high-level apply/recover ----------------
+
+def apply_all(text: str, mark_id: bytes) -> str:
+ """Apply all available watermark layers to text."""
+ t = embed_zw(text, mark_id)
+ t = embed_ws(t, mark_id)
+ t = embed_synonyms_stub(t, mark_id)
+ return t
+
+
+def recover_marks(text: str, mark_len_bytes: int = 8) -> dict:
+ """
+ Try every layer; return a dict of {layer: [candidate_mark_bytes]} for the registry
+ to match against known recipient IDs.
+ """
+ return {
+ "L1_zero_width": extract_zw(text, mark_len_bytes),
+ "L2_whitespace": [m for m in [extract_ws(text, mark_len_bytes)] if m],
+ "L3_synonyms": [m for m in [extract_synonyms_stub(text)] if m],
+ }
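The L1 framing can be shown as a round trip in a few lines. This is a minimal sketch that assumes the same three code points as the module (ZWSP for bit 0, ZWNJ for bit 1, ZWJ as frame delimiter). The helper names `make_frame` and `read_frames` are illustrative, and the sketch inserts a single frame rather than scattering redundant copies.

```python
ZWSP, ZWNJ, ZWJ = "\u200b", "\u200c", "\u200d"  # bit 0, bit 1, frame delimiter


def make_frame(mark_id: bytes) -> str:
    """Encode mark_id MSB-first as ZWSP/ZWNJ between two ZWJ delimiters."""
    bits = [(byte >> (7 - i)) & 1 for byte in mark_id for i in range(8)]
    return ZWJ + "".join(ZWNJ if b else ZWSP for b in bits) + ZWJ


def read_frames(text: str, mark_len_bytes: int) -> list[bytes]:
    """Return every complete frame of the expected length found in text."""
    found = []
    i = 0
    while i < len(text):
        if text[i] != ZWJ:
            i += 1
            continue
        j = i + 1
        bits = []
        while j < len(text) and text[j] in (ZWSP, ZWNJ):
            bits.append(1 if text[j] == ZWNJ else 0)
            j += 1
        if j < len(text) and text[j] == ZWJ and len(bits) == mark_len_bytes * 8:
            value = int("".join(map(str, bits)), 2)
            found.append(value.to_bytes(mark_len_bytes, "big"))
            i = j + 1
        else:
            i += 1
    return found


mark = bytes.fromhex("00ff10a5")
visible = "Quarterly summary for review."
marked = visible[:10] + make_frame(mark) + visible[10:]
recovered = read_frames(marked, len(mark))

# Stripping the three code points restores the visible text and removes the mark.
stripped = "".join(ch for ch in marked if ch not in (ZWSP, ZWNJ, ZWJ))
print(recovered == [mark], stripped == visible)
```

The last two lines show the weakness the module docstring mentions: a single pass that strips invisible characters removes the mark entirely, which is why the other layers exist.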
oversight_dns/__init__.py +1 -0
@@ -0,0 +1 @@
+"""OVERSIGHT DNS beacon server. Run as `python -m oversight_dns.server`."""
oversight_dns/server.py +142 -0
@@ -0,0 +1,142 @@
+"""
+OVERSIGHT DNS beacon server.
+
+Runs as an authoritative nameserver for the beacon domain (e.g. `beacon.example.com`).
+Every DNS lookup against `<token_id>.t.<beacon_domain>` is logged as an event in
+the registry, then answered with a generic A record so the resolver is satisfied.
+
+Why DNS beacons?
+ - They fire on document preview in tools that do hostname resolution for
+ linked images even when the HTTP fetch is blocked (many security sandboxes).
+ - They fire before any HTTP request, giving us earlier detection.
+ - They work through DNS-over-HTTPS resolvers, which are often allowed in
+ restricted environments where direct HTTP is blocked.
+
+Deployment:
+ - Run on a public IP (same host as the registry is fine).
+ - Delegate the beacon domain: add NS records (and glue A records, if
+ needed) in the parent zone that point at this host, which answers on port 53.
+ - The registry must publish an HTTP endpoint `POST /dns_event` that this
+ server calls for every incoming query.
+
+Startup:
+ sudo python -m oversight_dns.server \\
+ --beacon-domain beacon.example.com \\
+ --registry-url http://localhost:8765 \\
+ --answer-ip 203.0.113.10
+
+Run as root to bind :53, or use authbind/setcap to avoid root.
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import sys
+import time
+from pathlib import Path
+
+try:
+ from dnslib import DNSRecord, DNSHeader, RR, QTYPE, A
+ from dnslib.server import DNSServer, BaseResolver
+except ImportError:
+ print("dnslib not installed. pip install dnslib")
+ sys.exit(1)
+
+import httpx
+
+
+log = logging.getLogger("oversight_dns")
+
+
+class OversightResolver(BaseResolver):
+ """Resolves queries matching <token_id>.t.<beacon_domain> and logs them."""
+
+ def __init__(self, beacon_domain: str, registry_url: str, answer_ip: str):
+ self.beacon_domain = beacon_domain.rstrip(".").lower()
+ self.registry_url = registry_url.rstrip("/")
+ self.answer_ip = answer_ip
+ self.token_suffix = f".t.{self.beacon_domain}"
+
+ def resolve(self, request, handler):
+ reply = request.reply()
+ qname = str(request.q.qname).rstrip(".").lower()
+ qtype = QTYPE[request.q.qtype]
+
+ client_ip = handler.client_address[0] if handler.client_address else "unknown"
+
+ # Extract token_id if the query matches our beacon pattern
+ token_id = None
+ if qname.endswith(self.token_suffix):
+ prefix = qname[: -len(self.token_suffix)]
+ # The prefix should be the token_id (128-bit hex = 32 chars)
+ if all(c in "0123456789abcdef" for c in prefix) and len(prefix) == 32:
+ token_id = prefix
+
+ if token_id:
+ log.info(f"DNS beacon fired: token={token_id[:16]}... client={client_ip} qtype={qtype}")
+ # Report to registry synchronously (best-effort, 2 s timeout - we still answer the query)
+ try:
+ httpx.post(
+ f"{self.registry_url}/dns_event",
+ json={
+ "token_id": token_id,
+ "client_ip": client_ip,
+ "qtype": qtype,
+ "qname": qname,
+ },
+ timeout=2.0,
+ )
+ except Exception as e:
+ log.warning(f"registry report failed: {e}")
+
+ # Answer every A query with the configured IP so the resolver is satisfied
+ # (whether or not it was a beacon - unmatched queries get the same
+ # response and aren't logged). Other query types get an empty reply.
+ if request.q.qtype == QTYPE.A:
+ reply.add_answer(RR(qname, QTYPE.A, rdata=A(self.answer_ip), ttl=60))
+ return reply
+
+
+def main():
+ p = argparse.ArgumentParser(description="OVERSIGHT DNS beacon server")
+ p.add_argument("--beacon-domain", required=True,
+ help="your beacon domain, e.g. beacon.example.com")
+ p.add_argument("--registry-url", required=True,
+ help="URL of the OVERSIGHT registry, e.g. http://localhost:8765")
+ p.add_argument("--answer-ip", required=True,
+ help="A-record answer IP (usually this server's public IP)")
+ p.add_argument("--port", type=int, default=53)
+ p.add_argument("--address", default="0.0.0.0")
+ p.add_argument("--log-level", default="INFO")
+ args = p.parse_args()
+
+ logging.basicConfig(level=args.log_level,
+ format="%(asctime)s %(levelname)s %(name)s %(message)s")
+
+ resolver = OversightResolver(args.beacon_domain, args.registry_url, args.answer_ip)
+ server = DNSServer(resolver, port=args.port, address=args.address,
+ tcp=False)
+ tcp_server = DNSServer(resolver, port=args.port, address=args.address,
+ tcp=True)
+
+ log.info(f"OVERSIGHT DNS beacon server starting on {args.address}:{args.port}")
+ log.info(f" beacon domain: {args.beacon_domain}")
+ log.info(f" token pattern: <token>.t.{args.beacon_domain}")
+ log.info(f" registry: {args.registry_url}")
+ log.info(f" answer IP: {args.answer_ip}")
+
+ server.start_thread()
+ tcp_server.start_thread()
+
+ try:
+ while True:
+ time.sleep(60)
+ except KeyboardInterrupt:
+ log.info("shutting down")
+ server.stop()
+ tcp_server.stop()
+
+
+if __name__ == "__main__":
+ main()
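The token-matching rule in `OversightResolver.resolve` can be isolated as a pure function, which makes it easy to test without binding a socket. This is a minimal sketch; `parse_token` is a hypothetical helper name, not part of the module.

```python
from typing import Optional

HEX = set("0123456789abcdef")


def parse_token(qname: str, beacon_domain: str) -> Optional[str]:
    """Return the 32-hex-char token from <token>.t.<beacon_domain>, else None."""
    # DNS names are case-insensitive and may carry a trailing root dot.
    name = qname.rstrip(".").lower()
    suffix = ".t." + beacon_domain.rstrip(".").lower()
    if not name.endswith(suffix):
        return None
    prefix = name[: -len(suffix)]
    # 128-bit token rendered as hex: exactly 32 lowercase hex characters.
    if len(prefix) == 32 and set(prefix) <= HEX:
        return prefix
    return None


token = "0123456789abcdef0123456789abcdef"
hit = parse_token(f"{token.upper()}.t.Beacon.Example.com.", "beacon.example.com")
miss_short = parse_token("abc123.t.beacon.example.com", "beacon.example.com")
miss_other = parse_token(f"{token}.www.beacon.example.com", "beacon.example.com")
print(hit, miss_short, miss_other)
```

Normalizing case and the trailing dot before matching matters because some resolvers randomize query-name case (0x20 encoding), so a case-sensitive comparison would silently drop real beacon hits.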
registry/server.py +642 -0
@@ -0,0 +1,642 @@
+"""
+OVERSIGHT attribution registry - v0.2 (security-hardened)
+
+Upgrades over initial v0.2:
+ - Registry identity private key written with 0600 permissions.
+ - /register requires a valid Ed25519 signature from the issuer over the
+ canonical manifest; INSERT OR REPLACE is only permitted when the new
+ signature re-verifies for the SAME issuer pubkey already on file.
+ - Rate limiter supports X-Forwarded-For when TRUSTED_PROXY env is set.
+ - Rate limiter bounded with an LRU cap to prevent memory growth.
+ - SQLite opens with journal_mode=WAL for concurrency.
+ - FastAPI lifespan (not deprecated on_event).
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sqlite3
+import sys
+import threading
+import time
+from collections import OrderedDict
+from contextlib import asynccontextmanager, contextmanager
+from pathlib import Path
+from typing import Optional
+
+from cryptography.hazmat.primitives.asymmetric.ed25519 import (
+ Ed25519PrivateKey, Ed25519PublicKey,
+)
+from cryptography.hazmat.primitives import serialization
+from fastapi import FastAPI, Request, HTTPException
+from fastapi.responses import Response, JSONResponse
+from pydantic import BaseModel
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from oversight_core.tlog import TransparencyLog
+from oversight_core.manifest import Manifest
+
+
+DB_PATH = Path(os.environ.get("OVERSIGHT_DB", "/tmp/oversight-registry.sqlite"))
+DATA_DIR = Path(os.environ.get("OVERSIGHT_DATA", "/tmp/oversight-data"))
+TLOG_DIR = DATA_DIR / "tlog"
+IDENTITY_PATH = DATA_DIR / "registry-identity.json"
+TRUSTED_PROXY = bool(int(os.environ.get("TRUSTED_PROXY", "0")))
+# When TRUSTED_PROXY=1, honor X-Forwarded-For for rate limiting.
+
+
+SCHEMA = """
+CREATE TABLE IF NOT EXISTS beacons (
+ token_id TEXT PRIMARY KEY,
+ file_id TEXT NOT NULL,
+ recipient_id TEXT NOT NULL,
+ issuer_id TEXT NOT NULL,
+ kind TEXT NOT NULL,
+ registered_at INTEGER NOT NULL
+);
+CREATE TABLE IF NOT EXISTS watermarks (
+ mark_id TEXT NOT NULL,
+ layer TEXT NOT NULL,
+ file_id TEXT NOT NULL,
+ recipient_id TEXT NOT NULL,
+ issuer_id TEXT NOT NULL,
+ registered_at INTEGER NOT NULL,
+ PRIMARY KEY (mark_id, layer)
+);
+CREATE TABLE IF NOT EXISTS manifests (
+ file_id TEXT PRIMARY KEY,
+ recipient_id TEXT NOT NULL,
+ issuer_id TEXT NOT NULL,
+ issuer_ed25519_pub TEXT NOT NULL,
+ manifest_json TEXT NOT NULL,
+ registered_at INTEGER NOT NULL
+);
+CREATE TABLE IF NOT EXISTS events (
+ id INTEGER PRIMARY KEY AUTOINCREMENT,
+ token_id TEXT NOT NULL,
+ file_id TEXT,
+ recipient_id TEXT,
+ issuer_id TEXT,
+ kind TEXT NOT NULL,
+ source_ip TEXT,
+ user_agent TEXT,
+ extra TEXT,
+ timestamp INTEGER NOT NULL,
+ qualified_timestamp TEXT,
+ tlog_index INTEGER
+);
+CREATE TABLE IF NOT EXISTS corpus (
+ file_id TEXT NOT NULL,
+ hash_kind TEXT NOT NULL,
+ hash_value TEXT NOT NULL,
+ metadata TEXT,
+ registered_at INTEGER NOT NULL,
+ PRIMARY KEY (file_id, hash_kind, hash_value)
+);
+CREATE INDEX IF NOT EXISTS idx_events_token ON events(token_id);
+CREATE INDEX IF NOT EXISTS idx_events_file ON events(file_id);
+CREATE INDEX IF NOT EXISTS idx_corpus_hash ON corpus(hash_kind, hash_value);
+"""
+
+
+def load_or_create_identity() -> dict:
+ DATA_DIR.mkdir(parents=True, exist_ok=True)
+ if IDENTITY_PATH.exists():
+ return json.loads(IDENTITY_PATH.read_text())
+ sk = Ed25519PrivateKey.generate()
+ pk = sk.public_key()
+ ident = {
+ "ed25519_priv": sk.private_bytes(
+ encoding=serialization.Encoding.Raw,
+ format=serialization.PrivateFormat.Raw,
+ encryption_algorithm=serialization.NoEncryption(),
+ ).hex(),
+ "ed25519_pub": pk.public_bytes(
+ encoding=serialization.Encoding.Raw,
+ format=serialization.PublicFormat.Raw,
+ ).hex(),
+ "created_at": int(time.time()),
+ }
+ # Write private key file with 0600 permissions (owner-only read/write).
+ fd = os.open(str(IDENTITY_PATH), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
+ with os.fdopen(fd, "w") as f:
+ json.dump(ident, f, indent=2)
+ return ident
+
+
+IDENTITY: Optional[dict] = None
+TLOG: Optional[TransparencyLog] = None
+
+
+@contextmanager
+def db():
+ con = sqlite3.connect(DB_PATH)
+ con.row_factory = sqlite3.Row
+ # WAL for concurrent readers/writer. Safe to set every connection.
+ con.execute("PRAGMA journal_mode=WAL")
+ con.execute("PRAGMA synchronous=NORMAL")
+ try:
+ yield con
+ con.commit()
+ finally:
+ con.close()
+
+
+def init_db():
+ with db() as con:
+ con.executescript(SCHEMA)
+
+
+def timestamp_stub() -> str:
+ """Fallback: self-timestamp from registry clock when TSA is unreachable."""
+ return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+
+
+def qualified_timestamp_or_stub(data: bytes) -> tuple[str, Optional[dict]]:
+ """
+ Attempt a qualified RFC 3161 timestamp via the default TSA chain
+ (FreeTSA, DigiCert - both free, no account). Falls back to a
+ self-timestamp if all TSAs are unreachable.
+
+ Returns (iso_string, qualified_details_dict_or_None).
+
+ The registry persists the qualified_details dict (if present) in the
+ events table so external auditors can independently verify the timestamp
+ against the TSA's root cert, without trusting the registry operator.
+ """
+ try:
+ from oversight_core.timestamp import qualified_timestamp
+ ts = qualified_timestamp(data)
+ if ts is not None:
+ return ts.gen_time_iso, ts.to_dict()
+ except ImportError:
+ pass
+ return timestamp_stub(), None
+
+
+# ---- rate limiting with LRU bound ----
+
+class TokenBucket:
+ """Per-key token bucket with an LRU bound on state size."""
+
+ def __init__(self, rate: float = 10.0, burst: int = 30, max_keys: int = 100_000):
+ self.rate = rate
+ self.burst = burst
+ self.max_keys = max_keys
+ self._state: "OrderedDict[str, tuple[float, float]]" = OrderedDict()
+ self._lock = threading.Lock()
+
+ def allow(self, key: str) -> bool:
+ now = time.monotonic()
+ with self._lock:
+ if key in self._state:
+ tokens, last = self._state.pop(key)
+ else:
+ tokens, last = (float(self.burst), now)
+ tokens = min(self.burst, tokens + (now - last) * self.rate)
+ if tokens < 1.0:
+ self._state[key] = (tokens, now)
+ self._evict_if_needed()
+ return False
+ self._state[key] = (tokens - 1.0, now)
+ self._evict_if_needed()
+ return True
+
+ def _evict_if_needed(self):
+ while len(self._state) > self.max_keys:
+ self._state.popitem(last=False)
+
+
+BUCKET = TokenBucket(rate=10.0, burst=30, max_keys=100_000)
+
+
+def _client_key(request: Request) -> str:
+ """Extract the client identifier used for rate limiting."""
+ if TRUSTED_PROXY:
+ xff = request.headers.get("x-forwarded-for", "")
+ if xff:
+ # Last hop is the most recent proxy, first is the original client.
+ # For rate limiting the original client IP is what we want. The first
+ # entry is client-supplied unless the proxy overwrites the header.
+ return xff.split(",")[0].strip()
+ return request.client.host if request.client else "unknown"
+
+
+# ---- app + lifespan ----
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    global IDENTITY, TLOG
+    init_db()
+    IDENTITY = load_or_create_identity()
+    TLOG = TransparencyLog(TLOG_DIR, signing_key_hex=IDENTITY["ed25519_priv"])
+    yield
+
+
+app = FastAPI(title="OVERSIGHT Registry", version="0.2.1", lifespan=lifespan)
+
+
+class RegistrationRequest(BaseModel):
+    manifest: dict
+    beacons: list[dict]
+    watermarks: list[dict]
+    corpus: Optional[dict] = None
+
+
+class AttributionQuery(BaseModel):
+    token_id: Optional[str] = None
+    mark_id: Optional[str] = None
+    layer: Optional[str] = None
+    perceptual_hash: Optional[str] = None
+
+
+def _append_tlog(event: dict) -> int:
+    return TLOG.append(event) if TLOG else -1
+
+
+def _rate_limit(request: Request):
+    if not BUCKET.allow(_client_key(request)):
+        raise HTTPException(429, "rate limit exceeded")
+
+
+def _verify_manifest_signature(manifest_dict: dict) -> tuple[bool, str]:
+    """
+    Parse and verify the manifest's embedded Ed25519 signature.
+    Returns (ok, issuer_pub_hex). issuer_pub_hex is the claimed issuer key.
+    """
+    try:
+        m = Manifest.from_json(
+            json.dumps(manifest_dict, sort_keys=True, separators=(",", ":")).encode("utf-8")
+        )
+    except Exception:  # malformed manifest: report it as unverifiable
+        return False, ""
+    return m.verify(), m.issuer_ed25519_pub
+
+
+@app.post("/register")
+def register(req: RegistrationRequest, request: Request):
+    """
+    Register a sealed file's beacons + watermarks.
+
+    Security requirements:
+    - The manifest's embedded Ed25519 signature MUST verify.
+    - If the file_id already exists in our DB, the re-registration's issuer
+      pubkey MUST match the original. This prevents hostile overwrites of
+      another issuer's attribution record.
+    - A per-client rate limit applies.
+    """
+    _rate_limit(request)
+
+    m = req.manifest
+    file_id = m.get("file_id")
+    recipient = m.get("recipient") or {}
+    recipient_id = recipient.get("recipient_id", "unknown")
+    issuer_id = m.get("issuer_id", "unknown")
+
+    if not file_id:
+        raise HTTPException(400, "manifest missing file_id")
+
+    sig_ok, issuer_pub = _verify_manifest_signature(m)
+    if not sig_ok:
+        raise HTTPException(400, "manifest signature invalid")
+    if not issuer_pub:
+        raise HTTPException(400, "manifest missing issuer_ed25519_pub")
+
+    now = int(time.time())
+    with db() as con:
+        existing = con.execute(
+            "SELECT issuer_ed25519_pub FROM manifests WHERE file_id=?",
+            (file_id,),
+        ).fetchone()
+        if existing and existing["issuer_ed25519_pub"] != issuer_pub:
+            raise HTTPException(
+                409,
+                f"file_id already registered under a different issuer pubkey "
+                f"(claimed={issuer_pub[:16]}..., existing={existing['issuer_ed25519_pub'][:16]}...)",
+            )
+
+        con.execute(
+            "INSERT OR REPLACE INTO manifests VALUES (?,?,?,?,?,?)",
+            (file_id, recipient_id, issuer_id, issuer_pub, json.dumps(m), now),
+        )
+        for b in req.beacons:
+            con.execute(
+                "INSERT OR REPLACE INTO beacons VALUES (?,?,?,?,?,?)",
+                (b["token_id"], file_id, recipient_id, issuer_id, b["kind"], now),
+            )
+        for w in req.watermarks:
+            con.execute(
+                "INSERT OR REPLACE INTO watermarks VALUES (?,?,?,?,?,?)",
+                (w["mark_id"], w["layer"], file_id, recipient_id, issuer_id, now),
+            )
+        if req.corpus:
+            for hash_kind, hash_value in req.corpus.items():
+                if hash_value:
+                    con.execute(
+                        "INSERT OR REPLACE INTO corpus VALUES (?,?,?,?,?)",
+                        (file_id, hash_kind, str(hash_value), None, now),
+                    )
+
+    tlog_idx = _append_tlog({
+        "event": "register",
+        "file_id": file_id,
+        "recipient_id": recipient_id,
+        "issuer_id": issuer_id,
+        "issuer_pub": issuer_pub,
+        "n_beacons": len(req.beacons),
+        "n_watermarks": len(req.watermarks),
+        "timestamp": timestamp_stub(),
+    })
+
+    return {
+        "ok": True,
+        "file_id": file_id,
+        "registered_beacons": len(req.beacons),
+        "tlog_index": tlog_idx,
+    }
+
+
+ONE_PX_PNG = bytes.fromhex(
+    "89504e470d0a1a0a0000000d49484452000000010000000108060000001f15c489"
+    "0000000d49444154789c626000000000050001a5f645400000000049454e44ae426082"
+)
+
+
+def _record_event(request: Request, token_id: str, kind: str) -> int:
+    with db() as con:
+        row = con.execute(
+            "SELECT file_id, recipient_id, issuer_id FROM beacons WHERE token_id=?",
+            (token_id,),
+        ).fetchone()
+        file_id = row["file_id"] if row else None
+        recipient_id = row["recipient_id"] if row else None
+        issuer_id = row["issuer_id"] if row else None
+
+        client_ip = request.client.host if request.client else None
+        ua = request.headers.get("user-agent", "")
+        qts = timestamp_stub()
+
+        tlog_idx = _append_tlog({
+            "event": "beacon",
+            "kind": kind,
+            "token_id": token_id,
+            "file_id": file_id,
+            "recipient_id": recipient_id,
+            "source_ip": client_ip,
+            "user_agent": ua,
+            "timestamp": qts,
+        })
+
+        con.execute(
+            "INSERT INTO events (token_id,file_id,recipient_id,issuer_id,kind,"
+            "source_ip,user_agent,extra,timestamp,qualified_timestamp,tlog_index) "
+            "VALUES (?,?,?,?,?,?,?,?,?,?,?)",
+            (token_id, file_id, recipient_id, issuer_id, kind,
+             client_ip, ua, "{}", int(time.time()), qts, tlog_idx),
+        )
+    return tlog_idx
+
+
+@app.get("/p/{token_id}.png")
+async def beacon_png(token_id: str, request: Request):
+    _rate_limit(request)
+    _record_event(request, token_id, "http_img")
+    return Response(content=ONE_PX_PNG, media_type="image/png")
+
+
+@app.api_route("/ocsp/r/{token_id}", methods=["GET", "POST"])
+@app.api_route("/r/{token_id}", methods=["GET", "POST"])
+async def beacon_ocsp(token_id: str, request: Request):
+    _rate_limit(request)
+    _record_event(request, token_id, "ocsp")
+    return Response(status_code=200)
+
+
+@app.get("/lic/v/{token_id}")
+@app.get("/v/{token_id}")
+async def beacon_license(token_id: str, request: Request):
+    _rate_limit(request)
+    _record_event(request, token_id, "license")
+    return JSONResponse({"valid": True})
+
+
+@app.post("/attribute")
+def attribute(q: AttributionQuery):
+    with db() as con:
+        row = None
+        if q.token_id:
+            row = con.execute(
+                "SELECT * FROM beacons WHERE token_id=?", (q.token_id,)
+            ).fetchone()
+        elif q.mark_id and q.layer:
+            row = con.execute(
+                "SELECT * FROM watermarks WHERE mark_id=? AND layer=?",
+                (q.mark_id, q.layer),
+            ).fetchone()
+        elif q.mark_id:
+            row = con.execute(
+                "SELECT * FROM watermarks WHERE mark_id=?", (q.mark_id,)
+            ).fetchone()
+        elif q.perceptual_hash:
+            row = con.execute(
+                "SELECT c.file_id as file_id, b.recipient_id as recipient_id, "
+                "b.issuer_id as issuer_id "
+                "FROM corpus c LEFT JOIN beacons b ON c.file_id = b.file_id "
+                "WHERE c.hash_kind='perceptual' AND c.hash_value=? LIMIT 1",
+                (q.perceptual_hash,),
+            ).fetchone()
+        else:
+            raise HTTPException(400, "provide token_id, mark_id, or perceptual_hash")
+
+        if not row:
+            return {"found": False}
+
+        file_id = row["file_id"]
+        manifest_row = con.execute(
+            "SELECT manifest_json FROM manifests WHERE file_id=?", (file_id,)
+        ).fetchone()
+        manifest = json.loads(manifest_row["manifest_json"]) if manifest_row else None
+        events = con.execute(
+            "SELECT kind, source_ip, user_agent, timestamp, qualified_timestamp, tlog_index "
+            "FROM events WHERE file_id=? ORDER BY timestamp DESC LIMIT 50",
+            (file_id,),
+        ).fetchall()
+
+    return {
+        "found": True,
+        "file_id": file_id,
+        "recipient_id": row["recipient_id"],
+        "issuer_id": row["issuer_id"],
+        "manifest": manifest,
+        "recent_events": [dict(e) for e in events],
+    }
+
+
+@app.get("/evidence/{file_id}")
+def evidence_bundle(file_id: str):
+    with db() as con:
+        m = con.execute(
+            "SELECT manifest_json FROM manifests WHERE file_id=?", (file_id,)
+        ).fetchone()
+        if not m:
+            raise HTTPException(404, "unknown file_id")
+        events = con.execute(
+            "SELECT * FROM events WHERE file_id=? ORDER BY timestamp ASC", (file_id,),
+        ).fetchall()
+        beacons = con.execute(
+            "SELECT * FROM beacons WHERE file_id=?", (file_id,)
+        ).fetchall()
+        watermarks = con.execute(
+            "SELECT * FROM watermarks WHERE file_id=?", (file_id,)
+        ).fetchall()
+
+    bundle = {
+        "file_id": file_id,
+        "bundle_generated_at": timestamp_stub(),
+        "registry_pub": IDENTITY["ed25519_pub"],
+        "manifest": json.loads(m["manifest_json"]),
+        "beacons": [dict(b) for b in beacons],
+        "watermarks": [dict(w) for w in watermarks],
+        "events": [dict(e) for e in events],
+        "tlog_head": TLOG.signed_head() if TLOG else None,
+        "disclaimer": (
+            "This bundle is a provenance record, not a legal finding. For court use, "
+            "supplement with RFC 3161 qualified timestamps and ISO/IEC 27037 chain-of-custody."
+        ),
+    }
+    sk = Ed25519PrivateKey.from_private_bytes(bytes.fromhex(IDENTITY["ed25519_priv"]))
+    msg = json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")
+    bundle["bundle_signature_ed25519"] = sk.sign(msg).hex()
+    return bundle
+
+
+@app.get("/tlog/head")
+def tlog_head():
+    if not TLOG:
+        raise HTTPException(503, "tlog not initialized")
+    return TLOG.signed_head()
+
+
+@app.get("/tlog/proof/{index}")
+def tlog_proof(index: int):
+    if not TLOG:
+        raise HTTPException(503, "tlog not initialized")
+    proof = TLOG.inclusion_proof(index)
+    if proof is None:
+        raise HTTPException(404, "index out of range")
+    return proof
+
+
+@app.get("/tlog/range")
+def tlog_range(start: int = 0, limit: int = 500):
+    """Return tlog leaf entries in [start, start+limit). For CanaryKeeper polling."""
+    if not TLOG:
+        raise HTTPException(503, "tlog not initialized")
+    limit = min(max(1, limit), 1000)
+    leaves_path = TLOG.leaves_path
+    if not leaves_path.exists():
+        return {"start": start, "count": 0, "entries": []}
+    entries = []
+    with leaves_path.open("r") as f:
+        for i, line in enumerate(f):
+            if i < start:
+                continue
+            if len(entries) >= limit:
+                break
+            try:
+                entries.append(json.loads(line))
+            except ValueError:
+                continue
+    return {"start": start, "count": len(entries), "entries": entries}
+
+
+class DnsEvent(BaseModel):
+    token_id: str
+    client_ip: Optional[str] = None
+    qtype: Optional[str] = None
+    qname: Optional[str] = None
+
+
+@app.post("/dns_event")
+def dns_event(evt: DnsEvent, request: Request):
+    """Called by the oversight_dns server when a beacon DNS query arrives."""
+    _rate_limit(request)
+    with db() as con:
+        row = con.execute(
+            "SELECT file_id, recipient_id, issuer_id FROM beacons WHERE token_id=?",
+            (evt.token_id,),
+        ).fetchone()
+        file_id = row["file_id"] if row else None
+        recipient_id = row["recipient_id"] if row else None
+        issuer_id = row["issuer_id"] if row else None
+
+        qts = timestamp_stub()
+        tlog_idx = _append_tlog({
+            "event": "beacon",
+            "kind": "dns",
+            "token_id": evt.token_id,
+            "file_id": file_id,
+            "recipient_id": recipient_id,
+            "source_ip": evt.client_ip,
+            "qname": evt.qname,
+            "qtype": evt.qtype,
+            "timestamp": qts,
+        })
+        con.execute(
+            "INSERT INTO events (token_id,file_id,recipient_id,issuer_id,kind,"
+            "source_ip,user_agent,extra,timestamp,qualified_timestamp,tlog_index) "
+            "VALUES (?,?,?,?,?,?,?,?,?,?,?)",
+            (evt.token_id, file_id, recipient_id, issuer_id, "dns",
+             evt.client_ip, "", json.dumps({"qtype": evt.qtype, "qname": evt.qname}),
+             int(time.time()), qts, tlog_idx),
+        )
+    return {"ok": True, "tlog_index": tlog_idx}
+
+
+@app.get("/candidates/semantic")
+def candidates_semantic(limit: int = 1000, since: Optional[int] = None):
+    """
+    Flywheel-friendly endpoint: returns recent L3 semantic mark_ids so the
+    scraper can verify them against leaked text without shipping the whole
+    watermark table over the wire repeatedly.
+    """
+    limit = min(max(1, limit), 10_000)
+    with db() as con:
+        if since:
+            rows = con.execute(
+                "SELECT mark_id, file_id, recipient_id, registered_at FROM watermarks "
+                "WHERE layer='L3_semantic' AND registered_at>=? "
+                "ORDER BY registered_at DESC LIMIT ?",
+                (since, limit),
+            ).fetchall()
+        else:
+            rows = con.execute(
+                "SELECT mark_id, file_id, recipient_id, registered_at FROM watermarks "
+                "WHERE layer='L3_semantic' ORDER BY registered_at DESC LIMIT ?",
+                (limit,),
+            ).fetchall()
+    return {
+        "generated_at": timestamp_stub(),
+        "count": len(rows),
+        "candidates": [dict(r) for r in rows],
+    }
+
+
+@app.get("/health")
+def health():
+    return {
+        "status": "ok",
+        "service": "oversight-registry",
+        "version": "0.2.1",
+        "tlog_size": TLOG.size() if TLOG else 0,
+    }
+
+
+@app.get("/.well-known/oversight-registry")
+def well_known():
+    return {
+        "ed25519_pub": IDENTITY["ed25519_pub"] if IDENTITY else None,
+        "version": "0.2.1",
+        "jurisdiction": os.environ.get("OVERSIGHT_JURISDICTION", "GLOBAL"),
+        "tlog_size": TLOG.size() if TLOG else 0,
+    }
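The refill arithmetic in the `TokenBucket` limiter above can be checked without sleeping if the clock is injectable. What follows is a minimal sketch under that assumption, not the committed class: `TokenBucketSketch`, its `clock` parameter, and the demo rate/burst values are illustrative names and numbers chosen here.

```python
import threading
from collections import OrderedDict
from typing import Callable


class TokenBucketSketch:
    """Per-key token bucket with an LRU bound; the clock is injected (assumption)."""

    def __init__(self, rate: float, burst: int, max_keys: int,
                 clock: Callable[[], float]):
        self.rate = rate            # tokens added per second
        self.burst = burst          # bucket capacity
        self.max_keys = max_keys    # LRU bound on tracked keys
        self.clock = clock
        self._state: "OrderedDict[str, tuple[float, float]]" = OrderedDict()
        self._lock = threading.Lock()

    def allow(self, key: str) -> bool:
        now = self.clock()
        with self._lock:
            # Unknown keys start with a full bucket.
            tokens, last = self._state.pop(key, (float(self.burst), now))
            # Refill proportionally to elapsed time, capped at the burst size.
            tokens = min(float(self.burst), tokens + (now - last) * self.rate)
            allowed = tokens >= 1.0
            # Re-inserting moves the key to the most-recently-used end.
            self._state[key] = (tokens - 1.0 if allowed else tokens, now)
            while len(self._state) > self.max_keys:
                self._state.popitem(last=False)   # drop least recently used
            return allowed


# Demo with a fake clock: 2 tokens/second, burst of 3, at most 2 tracked keys.
fake_now = [0.0]
bucket = TokenBucketSketch(rate=2.0, burst=3, max_keys=2, clock=lambda: fake_now[0])

burst_results = [bucket.allow("a") for _ in range(4)]   # 4th call exceeds the burst
fake_now[0] = 0.5                                       # half a second refills one token
after_refill = [bucket.allow("a"), bucket.allow("a")]

bucket.allow("b")
bucket.allow("c")                     # third key evicts "a", the least recently used
after_eviction = bucket.allow("a")    # "a" starts over with a full bucket
print(burst_results, after_refill, after_eviction)
```

One consequence of the LRU bound is visible in the last step: a key that is evicted loses its history and comes back with a full bucket, so `max_keys` should comfortably exceed the number of concurrently active clients.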
requirements.txt +20 -0
@@ -0,0 +1,20 @@
+cryptography>=42.0.0
+pynacl>=1.5.0
+fastapi>=0.110.0
+uvicorn>=0.29.0
+pydantic>=2.0.0
+httpx>=0.27.0
+python-multipart>=0.0.9
+
+# Format adapters
+Pillow>=10.0.0
+numpy>=1.26.0
+scipy>=1.11.0
+pypdf>=4.0.0
+python-docx>=1.1.0
+imagehash>=4.3.1
+
+# Post-quantum (optional but tested).
+# Requires liboqs C library installed on the system. See RUNBOOK.md phase 7.
+# liboqs-python>=0.14.0
+
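Because `liboqs-python` is commented out in the requirements above, any code that uses it has to tolerate its absence. A minimal sketch of one way to feature-detect an optional dependency without importing it; `oqs` is the import name of liboqs-python, while `optional_dependency_available` and `PQ_AVAILABLE_SKETCH` are hypothetical names, and this is not necessarily how `oversight_core.crypto.PQ_AVAILABLE` is computed.

```python
import importlib.util


def optional_dependency_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical flag; the real project exposes PQ_AVAILABLE from its crypto module.
PQ_AVAILABLE_SKETCH = optional_dependency_available("oqs")
print(f"post-quantum backend importable: {PQ_AVAILABLE_SKETCH}")
```

Locating the module only shows that the Python wrapper is installed; the liboqs C library can still be missing at import time, so a real check would likely also guard the import itself.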
tests/test_e2e.py +182 -0
@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+"""
+End-to-end test of the OVERSIGHT MVP.
+
+Exercises:
+ 1. Identity generation (issuer + two recipients)
+ 2. Sealing a text file for recipient Alice with watermarks + beacons
+ 3. Inspecting the sealed file (manifest visible, ciphertext opaque)
+ 4. Alice opens it successfully
+ 5. Bob (wrong key) fails to open it
+ 6. Tampering with the ciphertext is detected
+ 7. Tampering with the manifest is detected
+ 8. Watermark recovery from leaked plaintext identifies Alice
+"""
+
+import os
+import sys
+import tempfile
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+
+from oversight_core import (
+    ClassicIdentity,
+    Manifest,
+    Recipient,
+    WatermarkRef,
+    content_hash,
+    seal,
+    open_sealed,
+    beacon,
+    watermark,
+)
+from oversight_core.container import SealedFile
+
+
+def banner(msg):
+    print(f"\n{'=' * 60}\n {msg}\n{'=' * 60}")
+
+
+def main():
+    banner("1. Generate identities")
+    issuer = ClassicIdentity.generate()
+    alice = ClassicIdentity.generate()
+    bob = ClassicIdentity.generate()
+    print(f" issuer ed25519_pub = {issuer.ed25519_pub.hex()[:32]}...")
+    print(f" alice x25519_pub = {alice.x25519_pub.hex()[:32]}...")
+    print(f" bob x25519_pub = {bob.x25519_pub.hex()[:32]}...")
+
+    banner("2. Prepare & watermark plaintext")
+    # Multi-line text so the per-line L2 watermark has enough lines to encode 64 bits.
+    lines = [
+        "CONFIDENTIAL - Q2 Revenue Memo",
+        "Revenue for Q2 exceeded projections by 18%.",
+        "Do not distribute externally.",
+        "",
+    ]
+    for i in range(80):
+        lines.append(f"Supporting detail line {i}: filler content for watermark room.")
+    original_text = "\n".join(lines)
+    mark_zw = watermark.new_mark_id()
+    mark_ws = watermark.new_mark_id()
+    wm_text = watermark.embed_zw(original_text, mark_zw)
+    wm_text = watermark.embed_ws(wm_text, mark_ws)
+    plaintext = wm_text.encode("utf-8")
+    print(f" original bytes = {len(original_text.encode())}")
+    print(f" watermarked = {len(plaintext)}")
+    print(f" L1 mark (zw) = {mark_zw.hex()}")
+    print(f" L2 mark (ws) = {mark_ws.hex()}")
+
+    banner("3. Build manifest + beacons for Alice")
+    beacons = beacon.gen_beacons(
+        registry_domain="oversight.test",
+        file_id="will-be-assigned",
+        recipient_id="alice@example.com",
+    )
+    recipient = Recipient(
+        recipient_id="alice@example.com",
+        x25519_pub=alice.x25519_pub.hex(),
+        ed25519_pub=alice.ed25519_pub.hex(),
+    )
+    manifest = Manifest.new(
+        original_filename="q2_memo.txt",
+        content_hash=content_hash(plaintext),
+        size_bytes=len(plaintext),
+        issuer_id="acme.corp.legal",
+        issuer_ed25519_pub_hex=issuer.ed25519_pub.hex(),
+        recipient=recipient,
+        registry_url="https://registry.oversight.test",
+        content_type="text/plain",
+    )
+    manifest.watermarks = [
+        WatermarkRef(layer="L1_zero_width", mark_id=mark_zw.hex()),
+        WatermarkRef(layer="L2_whitespace", mark_id=mark_ws.hex()),
+    ]
+    manifest.beacons = [b.to_dict() for b in beacons]
+    print(f" file_id = {manifest.file_id}")
+    print(f" beacons = {len(beacons)}")
+    print(f" marks = {len(manifest.watermarks)}")
+
+    banner("4. Seal")
+    blob = seal(
+        plaintext=plaintext,
+        manifest=manifest,
+        issuer_ed25519_priv=issuer.ed25519_priv,
+        recipient_x25519_pub=alice.x25519_pub,
+    )
+    print(f" sealed blob = {len(blob)} bytes")
+    print(f" magic OK = {blob[:6] == bytes([ord('S'),ord('N'),ord('T'),ord('L'),1,0])}")
+    print(f" manifest signed = {manifest.verify()}")
+
+    banner("5. Inspect (no key needed for metadata)")
+    sf = SealedFile.from_bytes(blob)
+    print(f" manifest.file_id = {sf.manifest.file_id}")
+    print(f" manifest.recipient = {sf.manifest.recipient.recipient_id}")
+    print(f" manifest sig valid = {sf.manifest.verify()}")
+
+    banner("6. Alice opens (correct key)")
+    recovered, m = open_sealed(blob, recipient_x25519_priv=alice.x25519_priv)
+    print(f" decrypted = {len(recovered)} bytes")
+    print(f" exact match to original plaintext = {recovered == plaintext}")
+
+    banner("7. Bob (wrong key) attempts to open")
+    try:
+        open_sealed(blob, recipient_x25519_priv=bob.x25519_priv)
+        print(" FAIL - bob should not have been able to decrypt")
+        sys.exit(1)
+    except Exception as e:
+        print(f" correctly rejected: {type(e).__name__}: {str(e)[:60]}")
+
+    banner("8. Tamper with ciphertext")
+    bad = bytearray(blob)
+    # flip the last byte (inside the ciphertext/tag region)
+    bad[-1] ^= 0x01
+    try:
+        open_sealed(bytes(bad), recipient_x25519_priv=alice.x25519_priv)
+        print(" FAIL - ciphertext tamper should have been caught")
+        sys.exit(1)
+    except Exception as e:
+        print(f" correctly rejected: {type(e).__name__}: {str(e)[:60]}")
+
+    banner("9. Tamper with manifest (flip a byte inside the manifest region)")
+    bad2 = bytearray(blob)
+    # manifest starts at offset 12
+    bad2[30] ^= 0x01
+    try:
+        # this will probably fail at JSON parse or sig-verify
+        open_sealed(bytes(bad2), recipient_x25519_priv=alice.x25519_priv)
+        print(" FAIL - manifest tamper should have been caught")
+        sys.exit(1)
+    except Exception as e:
+        print(f" correctly rejected: {type(e).__name__}: {str(e)[:60]}")
+
+    banner("10. Watermark recovery from leaked plaintext")
+    leaked = recovered.decode("utf-8")
+    marks = watermark.recover_marks(leaked)
+    for layer, mlist in marks.items():
+        uniq = sorted({m.hex() for m in mlist})
+        print(f" {layer}: {len(mlist)} frame(s), unique IDs: {uniq}")
+    # Assert Alice's marks are among them
+    found_zw = mark_zw in marks["L1_zero_width"]
+    found_ws = any(m == mark_ws for m in marks["L2_whitespace"])
+    print(f" L1 recovered = {found_zw}")
+    print(f" L2 recovered = {found_ws}")
+    assert found_zw, "L1 watermark recovery failed"
+    assert found_ws, "L2 watermark recovery failed"
+
+    banner("11. Watermark survives format stripping (paste into new doc)")
+    # Simulate "attacker pastes plaintext into a new document" - plain string ops
+    pasted = "\n".join(line.rstrip() for line in leaked.splitlines())
+    # rstrip() keeps zero-width characters (not whitespace) but strips our trailing-ws marks
+    marks2 = watermark.recover_marks(pasted)
+    print(f" L1 (zw) survived copy-paste: {mark_zw in marks2['L1_zero_width']}")
+    print(f" L2 (ws) survived copy-paste: "
+          f"{any(m == mark_ws for m in marks2['L2_whitespace'])}")
+
+    banner("ALL TESTS PASSED")
+
+
+if __name__ == "__main__":
+ main()
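Step 11 of the test above hinges on a property of the two text layers: trimming trailing whitespace removes a per-line whitespace mark but leaves zero-width characters alone, because Python does not classify them as whitespace. A minimal sketch of that behaviour follows. The encodings here are hypothetical (one zero-width character per bit inserted after the first character, one trailing space or tab per line) and are not the wire format of `oversight_core.watermark`; all `*_sketch` names are illustrative.

```python
from typing import Optional

ZW0, ZW1 = "\u200b", "\u200c"   # zero-width space = bit 0, zero-width non-joiner = bit 1


def _bits(mark: bytes) -> str:
    return "".join(f"{byte:08b}" for byte in mark)


def _from_bits(bits: str) -> Optional[bytes]:
    if not bits or len(bits) % 8:
        return None
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def embed_zw_sketch(text: str, mark: bytes) -> str:
    """Hide the mark as zero-width characters after the first visible character."""
    payload = "".join(ZW1 if bit == "1" else ZW0 for bit in _bits(mark))
    return text[:1] + payload + text[1:]


def extract_zw_sketch(text: str) -> Optional[bytes]:
    return _from_bits("".join("1" if ch == ZW1 else "0"
                              for ch in text if ch in (ZW0, ZW1)))


def embed_ws_sketch(text: str, mark: bytes) -> str:
    """One bit per line: trailing space = 0, trailing tab = 1."""
    bits = _bits(mark)
    lines = text.split("\n")
    if len(lines) < len(bits):
        raise ValueError("not enough lines to carry the mark")
    for i, bit in enumerate(bits):
        lines[i] += "\t" if bit == "1" else " "
    return "\n".join(lines)


def extract_ws_sketch(text: str, n_bytes: int) -> Optional[bytes]:
    bits = ""
    for line in text.split("\n")[: n_bytes * 8]:
        if line.endswith("\t"):
            bits += "1"
        elif line.endswith(" "):
            bits += "0"
        else:
            return None          # a carrier line lost its trailing whitespace
    return _from_bits(bits) if len(bits) == n_bytes * 8 else None


mark_zw, mark_ws = bytes.fromhex("a5"), bytes.fromhex("3c")   # 1-byte marks for brevity
text = "\n".join(f"line {i}" for i in range(10))
marked = embed_ws_sketch(embed_zw_sketch(text, mark_zw), mark_ws)

# "Paste into a new document": trailing whitespace is trimmed line by line.
stripped = "\n".join(line.rstrip() for line in marked.splitlines())
print(extract_zw_sketch(stripped) == mark_zw, extract_ws_sketch(stripped, 1))
```

This is why layering matters: each layer fails under a different transformation, so a leak that defeats one can still be attributed through another.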
tests/test_e2e_v2.py +326 -0
@@ -0,0 +1,326 @@
+#!/usr/bin/env python3
+"""
+OVERSIGHT v0.2 comprehensive end-to-end test.
+
+Covers:
+ 1. Identity + keygen
+ 2. Text watermarking (L1, L2, L3 semantic)
+ 3. Image DCT watermarking
+ 4. PDF metadata marks
+ 5. DOCX metadata marks
+ 6. Seal + open (single recipient)
+ 7. Multi-recipient seal
+ 8. Policy enforcement (not_after expired)
+ 9. Policy enforcement (max_opens counter)
+ 10. Semantic watermark verification (airgap-strip survivor)
+ 11. Tamper detection
+ 12. Merkle transparency log correctness
+ 13. Perceptual hash determinism
+"""
+
+import io
+import sys
+import time
+import tempfile
+import hashlib
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+
+from oversight_core import (
+    ClassicIdentity, Manifest, Recipient, WatermarkRef,
+    content_hash, seal, open_sealed, beacon, watermark,
+)
+from oversight_core import semantic
+from oversight_core.container import seal_multi
+from oversight_core.policy import PolicyContext, PolicyViolation
+from oversight_core.tlog import TransparencyLog
+
+
+def banner(m): print(f"\n{'=' * 64}\n {m}\n{'=' * 64}")
+def ok(msg): print(f" [ok] {msg}")
+def fail(msg): print(f" [FAIL] {msg}"); sys.exit(1)
+
+
+def main():
+    banner("1. Identities")
+    issuer = ClassicIdentity.generate()
+    alice = ClassicIdentity.generate()
+    bob = ClassicIdentity.generate()
+    carol = ClassicIdentity.generate()
+    ok("generated 4 identities")
+
+    banner("2. Text watermarking - L1 + L2 + L3")
+    text_lines = [f"Supporting paragraph {i}: we begin to show how this is significant and we must help users find answers." for i in range(60)]
+    original_text = "\n".join(text_lines)
+    mid_zw = watermark.new_mark_id()
+    mid_ws = watermark.new_mark_id()
+    mid_sem = watermark.new_mark_id()
+
+    # Apply L3 FIRST (rewrites words), then L2 (trailing whitespace),
+    # then L1 (zero-width unicode). This order preserves semantic marks
+    # even after L1 insertion, because semantic verification strips ZW chars.
+    t = semantic.apply_semantic(original_text, mid_sem)
+    t = watermark.embed_ws(t, mid_ws)
+    t = watermark.embed_zw(t, mid_zw)
+    plaintext = t.encode("utf-8")
+    ok(f"applied L3/L2/L1 marks; bytes={len(plaintext)}")
+
+    banner("3. Semantic recovery survives airgap-strip")
+    # Simulate airgap-strip: remove zero-width, normalize whitespace
+    airgap_stripped = t
+    for zw in ("\u200b", "\u200c", "\u200d"):
+        airgap_stripped = airgap_stripped.replace(zw, "")
+    # normalize trailing whitespace
+    airgap_stripped = "\n".join(line.rstrip() for line in airgap_stripped.splitlines())
+
+    # Verify L1 and L2 are dead
+    l1_survived = len(watermark.extract_zw(airgap_stripped)) > 0
+    l2_result = watermark.extract_ws(airgap_stripped)
+    l2_survived = l2_result is not None and l2_result == mid_ws
+    print(f" L1 survived airgap-strip: {l1_survived} (expected False)")
+    print(f" L2 survived airgap-strip: {l2_survived} (expected False)")
+    if l1_survived or l2_survived:
+        fail("L1 or L2 unexpectedly survived airgap-strip - test setup bug")
+
+    # Verify L3 semantic DID survive
+    result = semantic.verify_semantic(airgap_stripped, mid_sem)
+    print(f" L3 synonym score: {result['synonyms_score']:.3f} (match={result['synonyms_match']})")
+    print(f" L3 punctuation hits: {result['punctuation_hits']}")
+    print(f" L3 overall match: {result['overall_match']}")
+    if not result["overall_match"]:
+        fail("L3 semantic watermark failed to survive airgap-strip")
+    ok("L3 semantic watermark SURVIVED airgap-strip - attribution possible")
+
+    # Negative: a DIFFERENT mark_id should NOT match
+    wrong_result = semantic.verify_semantic(airgap_stripped, watermark.new_mark_id())
+    if wrong_result["overall_match"] and wrong_result["synonyms_score"] > 0.65:
+        fail(f"random mark_id matched (score={wrong_result['synonyms_score']}) - false positive")
+    ok(f"L3 rejects wrong mark_id (score={wrong_result['synonyms_score']:.3f})")
+
+    banner("4. Image DCT watermarking")
+    try:
+        from PIL import Image
+        import numpy as np
+        from oversight_core.formats import image as img_fmt
+        # Make a test image
+        arr = np.random.RandomState(42).randint(64, 200, (256, 256, 3), dtype=np.uint8)
+        pil = Image.fromarray(arr)
+        buf = io.BytesIO(); pil.save(buf, format="PNG")
+        orig_bytes = buf.getvalue()
+
+        img_mark = watermark.new_mark_id()
+        marked_bytes = img_fmt.embed(orig_bytes, img_mark, alpha=0.10)
+        ok(f"embedded into image: {len(marked_bytes)} bytes")
+
+        match, score = img_fmt.verify(marked_bytes, img_mark)
+        print(f" correct mark score: {score:+.4f} (match={match})")
+        if not match:
+            fail("DCT watermark verify FAILED for correct mark_id")
+
+        wrong_mark = watermark.new_mark_id()
+        wrong_match, wrong_score = img_fmt.verify(marked_bytes, wrong_mark)
+        print(f" wrong mark score: {wrong_score:+.4f} (match={wrong_match})")
+        if wrong_match:
+            fail("DCT watermark verify matched WRONG mark_id (false positive)")
+        ok("image DCT watermark verifies correctly and rejects wrong marks")
+
+        # JPEG recompression attack
+        from PIL import Image as _I
+        pil2 = _I.open(io.BytesIO(marked_bytes))
+        jpeg_buf = io.BytesIO(); pil2.save(jpeg_buf, format="JPEG", quality=75)
+        match_after_jpeg, score_after_jpeg = img_fmt.verify(jpeg_buf.getvalue(), img_mark)
+        print(f" post-JPEG-q75 score: {score_after_jpeg:+.4f} (match={match_after_jpeg})")
+        if match_after_jpeg:
+            ok("image watermark SURVIVED JPEG recompression (q=75)")
+        else:
+            print(" [note] image watermark weakened by JPEG recompression (score below threshold)")
+
+        phash = img_fmt.perceptual_hash(marked_bytes)
+        ok(f"perceptual hash: {phash}")
+    except Exception as e:
+        fail(f"image test error: {e}")
+
+    banner("5. PDF marks")
+    try:
+        from pypdf import PdfWriter as _PW
+        from oversight_core.formats import pdf as pdf_fmt
+        # Build a simple PDF using reportlab if available, else skip
+        try:
+            from reportlab.pdfgen import canvas
+            buf = io.BytesIO()
+            c = canvas.Canvas(buf)
+            c.drawString(100, 750, "Confidential - test document")
+            c.save()
+            pdf_bytes = buf.getvalue()
+        except ImportError:
+            # minimal PDF - pypdf can write an empty doc
+            w = _PW()
+            w.add_blank_page(width=612, height=792)
+            buf = io.BytesIO(); w.write(buf)
+            pdf_bytes = buf.getvalue()
+
+        pdf_mark = watermark.new_mark_id()
+        marked_pdf = pdf_fmt.embed(pdf_bytes, pdf_mark, issuer_id="acme", file_id="pdf-test-1")
+        ok(f"embedded into PDF: {len(marked_pdf)} bytes")
+
+        extracted = pdf_fmt.extract(marked_pdf)
+        if extracted["mark_id"] != pdf_mark.hex():
+            fail(f"PDF mark mismatch: got {extracted['mark_id']}, expected {pdf_mark.hex()}")
+        if extracted["issuer_id"] != "acme":
+            fail("PDF issuer mismatch")
+        ok(f"PDF mark recovered: {extracted['mark_id']}")
+    except Exception as e:
+        fail(f"PDF test error: {e}")
+
+    banner("6. DOCX marks")
+    try:
+        from docx import Document
+        from oversight_core.formats import docx as docx_fmt
+        doc = Document()
+        doc.add_paragraph("Confidential test document")
+        doc.add_paragraph("Second paragraph of content")
+        buf = io.BytesIO(); doc.save(buf)
+        docx_bytes = buf.getvalue()
+
+        docx_mark = watermark.new_mark_id()
+        marked_docx = docx_fmt.embed(docx_bytes, docx_mark, issuer_id="acme", file_id="docx-test-1")
+        ok(f"embedded into DOCX: {len(marked_docx)} bytes")
+
+        ext = docx_fmt.extract(marked_docx)
+        if ext["mark_id"] != docx_mark.hex():
+            fail(f"DOCX mark mismatch: got {ext['mark_id']}, expected {docx_mark.hex()}")
+        ok(f"DOCX mark recovered: {ext['mark_id']}")
+    except Exception as e:
+        fail(f"DOCX test error: {e}")
+
+    banner("7. Single-recipient seal + open (regression)")
+    rec = Recipient(recipient_id="alice@corp", x25519_pub=alice.x25519_pub.hex(), ed25519_pub=alice.ed25519_pub.hex())
+    m = Manifest.new("test.txt", content_hash(plaintext), len(plaintext), "acme", issuer.ed25519_pub.hex(), rec, "http://localhost:8765", "text/plain")
+    m.watermarks = [
+        WatermarkRef(layer="L1_zero_width", mark_id=mid_zw.hex()),
+        WatermarkRef(layer="L2_whitespace", mark_id=mid_ws.hex()),
+        WatermarkRef(layer="L3_semantic", mark_id=mid_sem.hex()),
+    ]
+    blob = seal(plaintext, m, issuer.ed25519_priv, alice.x25519_pub)
+    recovered, mm = open_sealed(blob, alice.x25519_priv)
+    if recovered != plaintext:
+        fail("recovered plaintext mismatch")
+    ok(f"seal/open round-trip OK ({len(blob)} bytes)")
+
+    banner("8. Multi-recipient seal")
+    m2 = Manifest.new("multi.txt", content_hash(plaintext), len(plaintext), "acme", issuer.ed25519_pub.hex(), rec, "http://localhost:8765", "text/plain")
+    multi_blob = seal_multi(
+        plaintext, m2, issuer.ed25519_priv,
+        [alice.x25519_pub, bob.x25519_pub, carol.x25519_pub],
+    )
+    ok(f"multi-recipient blob = {len(multi_blob)} bytes")
+    # Each recipient can decrypt
+    for name, ident in [("alice", alice), ("bob", bob), ("carol", carol)]:
+        pt, _ = open_sealed(multi_blob, ident.x25519_priv)
+        if pt != plaintext:
+            fail(f"multi-recipient decrypt FAILED for {name}")
+        ok(f" {name} decrypts multi-recipient blob")
+    # Stranger cannot
+    stranger = ClassicIdentity.generate()
+    try:
+        open_sealed(multi_blob, stranger.x25519_priv)
+        fail("stranger should NOT have been able to decrypt multi-recipient blob")
+    except Exception as e:
+        ok(f"stranger correctly rejected: {type(e).__name__}")
+
+    banner("9. Policy: not_after (expired)")
+    expired_m = Manifest.new("exp.txt", content_hash(plaintext), len(plaintext), "acme", issuer.ed25519_pub.hex(), rec, "http://localhost:8765")
+    expired_m.policy["not_after"] = int(time.time()) - 60
+    expired_blob = seal(plaintext, expired_m, issuer.ed25519_priv, alice.x25519_pub)
+    try:
+        open_sealed(expired_blob, alice.x25519_priv)
+        fail("expired file should NOT open")
+    except PolicyViolation as e:
+        ok(f"expired file correctly rejected: {e}")
+
+    banner("10. Policy: max_opens counter")
+    with tempfile.TemporaryDirectory() as td:
+        ctx = PolicyContext(state_dir=Path(td), mode="LOCAL_ONLY", jurisdiction="GLOBAL")
+        capped_m = Manifest.new("capped.txt", content_hash(plaintext), len(plaintext), "acme", issuer.ed25519_pub.hex(), rec, "http://localhost:8765")
+        capped_m.policy["max_opens"] = 2
+        capped_blob = seal(plaintext, capped_m, issuer.ed25519_priv, alice.x25519_pub)
+
+        # First two opens succeed
+        for i in range(2):
+            pt, _ = open_sealed(capped_blob, alice.x25519_priv, policy_ctx=ctx)
+            if pt != plaintext:
+                fail(f"open {i+1} recovered wrong plaintext")
+        ok("first 2 opens succeeded")
+
+        # Third open should fail
+        try:
+            open_sealed(capped_blob, alice.x25519_priv, policy_ctx=ctx)
+            fail("3rd open should have been rejected")
+        except PolicyViolation as e:
+            ok(f"3rd open correctly rejected: {e}")
+
+    banner("11. Tamper detection (ciphertext + manifest)")
+    bad = bytearray(blob)
+    bad[-1] ^= 0x01
+    try:
+        open_sealed(bytes(bad), alice.x25519_priv)
+        fail("ciphertext tamper should have been caught")
+    except Exception as e:
+        ok(f"ciphertext tamper rejected: {type(e).__name__}")
+
+    bad2 = bytearray(blob)
+    bad2[30] ^= 0x01
+    try:
+        open_sealed(bytes(bad2), alice.x25519_priv)
+        fail("manifest tamper should have been caught")
+    except Exception as e:
+        ok(f"manifest tamper rejected: {type(e).__name__}")
+
+    banner("12. Merkle transparency log")
+    with tempfile.TemporaryDirectory() as td:
+        reg_key = ClassicIdentity.generate()
+        tl = TransparencyLog(td, signing_key_hex=reg_key.ed25519_priv.hex())
+        idx0 = tl.append({"event": "test", "i": 0})
+        idx1 = tl.append({"event": "test", "i": 1})
+        idx2 = tl.append({"event": "test", "i": 2})
+        idx3 = tl.append({"event": "test", "i": 3})
+        if tl.size() != 4:
+            fail(f"tlog size {tl.size()} != 4")
+        ok(f"appended 4 entries, size={tl.size()}")
+
+        head = tl.signed_head()
+        ok(f"signed head: size={head['size']} root={head['root'][:16]}...")
+
+        # Inclusion proof for index 2
+        proof = tl.inclusion_proof(idx2)
+        if proof is None:
+            fail("inclusion proof for valid index returned None")
+        if proof["index"] != idx2:
+            fail("proof index mismatch")
+        ok(f"inclusion proof for idx={idx2}: {len(proof['proof'])} sibling hashes")
+
+        # Adding a new entry changes the root
+        root_before = tl.root()
+        tl.append({"event": "test", "i": 4})
+        if tl.root() == root_before:
+            fail("root did not change after append")
+        ok("root changes on append (append-only integrity)")
+
+    banner("13. Perceptual hash deterministic")
+    try:
+        from oversight_core.formats import image as img_fmt
+        ph1 = img_fmt.perceptual_hash(marked_bytes)
+        ph2 = img_fmt.perceptual_hash(marked_bytes)
+        if ph1 != ph2:
+            fail("perceptual hash not deterministic")
+        ok(f"phash deterministic: {ph1}")
+    except Exception as e:
+        fail(f"phash test error: {e}")
+
+    banner("ALL TESTS PASSED")
+
+
+if __name__ == "__main__":
+ main()
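Section 12 of the test above checks that an inclusion proof is returned and counts its sibling hashes, but it does not recompute the root from the proof. A minimal sketch of what an external auditor would do follows, assuming RFC 6962-style hashing (SHA-256 with `0x00` leaf and `0x01` node prefixes). The hashing scheme of `oversight_core.tlog.TransparencyLog` is not shown in this diff, so this is an illustration of the technique and not a verifier for that log; every function name here is hypothetical.

```python
import hashlib
import json


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n: int) -> int:
    """Largest power of two strictly less than n (n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: list) -> bytes:
    if not leaves:
        raise ValueError("empty tree")
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = _split(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def inclusion_proof(leaves: list, index: int) -> list:
    """Sibling hashes from the leaf up to the root."""
    if len(leaves) == 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return inclusion_proof(leaves[:k], index) + [merkle_root(leaves[k:])]
    return inclusion_proof(leaves[k:], index - k) + [merkle_root(leaves[:k])]


def verify_inclusion(leaf: bytes, index: int, size: int, proof: list, root: bytes) -> bool:
    """Recompute the root from the leaf and its audit path, then compare."""
    def walk(idx: int, n: int, path: list):
        if n == 1:
            return leaf_hash(leaf) if not path else None
        if not path:
            return None
        k = _split(n)
        sibling, rest = path[-1], path[:-1]
        if idx < k:
            sub = walk(idx, k, rest)
            return None if sub is None else node_hash(sub, sibling)
        sub = walk(idx - k, n - k, rest)
        return None if sub is None else node_hash(sibling, sub)

    if not 0 <= index < size:
        return False
    return walk(index, size, list(proof)) == root


# Five canonical-JSON events, mirroring the entries appended in the test.
events = [json.dumps({"event": "test", "i": i}, sort_keys=True,
                     separators=(",", ":")).encode("utf-8") for i in range(5)]
root = merkle_root(events)
proof = inclusion_proof(events, 2)
genuine = verify_inclusion(events[2], 2, len(events), proof, root)
forged = verify_inclusion(b"forged entry", 2, len(events), proof, root)
print(genuine, forged, len(proof))
```

The verifier needs only the leaf, its index, the tree size, the sibling hashes, and a trusted root (in practice, a signed head), which is what lets an auditor check an entry without trusting the registry operator.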
tests/test_pq.py +132 -0
@@ -0,0 +1,132 @@
+#!/usr/bin/env python3
+"""
+Post-quantum hybrid round-trip test.
+
+Proves:
+ 1. liboqs is linked and ML-KEM-768 / ML-DSA-65 work.
+ 2. Hybrid DEK wrap (X25519 + ML-KEM-768) round-trips correctly.
+ 3. Tampering with either the classical or PQ component fails.
+ 4. A full hybrid-sealed file can be built and opened.
+"""
+
+import sys
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+
+from oversight_core import crypto
+from oversight_core.crypto import (
+ PQ_AVAILABLE, ClassicIdentity, random_dek,
+ pq_kem_keypair, pq_sig_keypair, pq_sign, pq_verify,
+ hybrid_wrap_dek, hybrid_unwrap_dek,
+)
+
+
+def banner(m): print(f"\n{'='*60}\n {m}\n{'='*60}")
+def ok(m): print(f" [ok] {m}")
+def fail(m): print(f" [FAIL] {m}"); sys.exit(1)
+
+
+def main():
+ banner("0. Check PQ availability")
+ if not PQ_AVAILABLE:
+ fail("liboqs not linked - install liboqs + liboqs-python")
+ ok("liboqs available")
+
+ banner("1. ML-KEM-768 raw round-trip")
+ priv, pub = pq_kem_keypair()
+ ok(f"keypair: pub={len(pub)}B priv={len(priv)}B")
+ from oversight_core.crypto import pq_kem_encap, pq_kem_decap
+ ct, ss1 = pq_kem_encap(pub)
+ ss2 = pq_kem_decap(priv, ct)
+ if ss1 != ss2:
+ fail("ML-KEM shared secrets don't match")
+ ok(f"ML-KEM-768 round-trip OK ({len(ss1)}B shared secret)")
+
+ banner("2. ML-DSA-65 raw round-trip")
+ sig_priv, sig_pub = pq_sig_keypair()
+ ok(f"keypair: pub={len(sig_pub)}B priv={len(sig_priv)}B")
+ msg = b"OVERSIGHT v0.2 post-quantum hybrid test"
+ signature = pq_sign(msg, sig_priv)
+ ok(f"signature: {len(signature)}B")
+ if not pq_verify(msg, signature, sig_pub):
+ fail("ML-DSA verify failed for valid signature")
+ ok("ML-DSA-65 verify accepts valid signature")
+ if pq_verify(b"tampered message", signature, sig_pub):
+ fail("ML-DSA verify accepted signature over different message")
+ ok("ML-DSA-65 verify rejects tampered message")
+
+ banner("3. Hybrid DEK wrap (classical + PQ)")
+ alice_classical = ClassicIdentity.generate()
+ alice_mlkem_priv, alice_mlkem_pub = pq_kem_keypair()
+
+ dek = random_dek()
+ print(f" DEK: {len(dek)}B")
+
+ wrapped = hybrid_wrap_dek(
+ dek,
+ x25519_pub=alice_classical.x25519_pub,
+ mlkem_pub=alice_mlkem_pub,
+ )
+ ok(f"wrapped: suite={wrapped['suite']}")
+ ok(f" x25519_ephemeral_pub = {len(bytes.fromhex(wrapped['x25519_ephemeral_pub']))}B")
+ ok(f" mlkem_ciphertext = {len(bytes.fromhex(wrapped['mlkem_ciphertext']))}B")
+ ok(f" wrapped_dek = {len(bytes.fromhex(wrapped['wrapped_dek']))}B")
+
+ recovered = hybrid_unwrap_dek(
+ wrapped,
+ x25519_priv=alice_classical.x25519_priv,
+ mlkem_priv=alice_mlkem_priv,
+ )
+ if recovered != dek:
+ fail("hybrid unwrap recovered wrong DEK")
+ ok("hybrid unwrap recovered original DEK exactly")
+
+ banner("4. Tamper with classical half")
+ bad = dict(wrapped)
+ # Replace X25519 ephemeral pub with a random one
+ other_classic = ClassicIdentity.generate()
+ bad["x25519_ephemeral_pub"] = other_classic.x25519_pub.hex()
+ try:
+ hybrid_unwrap_dek(bad, alice_classical.x25519_priv, alice_mlkem_priv)
+ fail("tamper of classical half should have failed")
+ except Exception as e:
+ ok(f"classical tamper correctly rejected: {type(e).__name__}")
+
+ banner("5. Tamper with PQ half")
+ bad2 = dict(wrapped)
+ # Corrupt a byte of the mlkem ciphertext
+ ct_bytes = bytearray(bytes.fromhex(bad2["mlkem_ciphertext"]))
+ ct_bytes[100] ^= 0x01
+ bad2["mlkem_ciphertext"] = bytes(ct_bytes).hex()
+ try:
+ hybrid_unwrap_dek(bad2, alice_classical.x25519_priv, alice_mlkem_priv)
+ fail("tamper of PQ half should have failed")
+ except Exception as e:
+ ok(f"PQ tamper correctly rejected: {type(e).__name__}")
+
+ banner("6. Wrong recipient")
+ bob_classical = ClassicIdentity.generate()
+ bob_mlkem_priv, _ = pq_kem_keypair()
+ try:
+ hybrid_unwrap_dek(wrapped, bob_classical.x25519_priv, bob_mlkem_priv)
+ fail("wrong recipient should have failed")
+ except Exception as e:
+ ok(f"wrong recipient correctly rejected: {type(e).__name__}")
+
+ banner("7. Size comparison: CLASSIC vs HYBRID")
+ classic_wrap = crypto.wrap_dek_for_recipient(dek, alice_classical.x25519_pub)
+ classic_size = sum(len(bytes.fromhex(v)) for v in classic_wrap.values())
+ hybrid_size = sum(
+ len(bytes.fromhex(v)) for k, v in wrapped.items() if k != "suite"
+ )
+ print(f" CLASSIC wrap: {classic_size} bytes (X25519 ephemeral + nonce + wrapped DEK)")
+ print(f" HYBRID wrap: {hybrid_size} bytes (X25519 eph + ML-KEM ct + nonce + wrapped DEK)")
+ print(f" overhead: {hybrid_size - classic_size} bytes per file")
+
+ banner("ALL PQ TESTS PASSED - OVERSIGHT is post-quantum-ready")
+
+
+if __name__ == "__main__":
+ main()
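
Sections 4 and 5 above expect tampering with either half of the wrap to make the unwrap fail. One way to get that property is to derive the key-encryption key from both shared secrets together, so that a change in either one yields a different key and the authenticated unwrap of the DEK then fails. This matters for the PQ half in particular, because ML-KEM decapsulation of a corrupted ciphertext returns a pseudorandom secret instead of raising (implicit rejection). The sketch below uses only the standard library. The combiner, the `info` label, and the function names are assumptions for illustration and are not taken from `oversight_core.crypto`.

```python
import hashlib
import hmac
import os


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    # Extract: an empty salt is replaced by a block of zeros, per the RFC.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_hybrid_kek(ss_x25519: bytes, ss_mlkem: bytes) -> bytes:
    """Hypothetical combiner: the KEK depends on both shared secrets."""
    return hkdf_sha256(ss_x25519 + ss_mlkem, salt=b"", info=b"example-hybrid-kek-v1")


def flip_first_bit(data: bytes) -> bytes:
    return bytes([data[0] ^ 0x01]) + data[1:]


# Stand-ins for the two 32-byte shared secrets (X25519 DH output, ML-KEM-768 secret).
ss_classical = os.urandom(32)
ss_pq = os.urandom(32)

kek = derive_hybrid_kek(ss_classical, ss_pq)

# A change to either input produces a different KEK, so an AEAD unwrap of the
# DEK under the wrong KEK would fail its authentication check.
assert derive_hybrid_kek(flip_first_bit(ss_classical), ss_pq) != kek
assert derive_hybrid_kek(ss_classical, flip_first_bit(ss_pq)) != kek
assert derive_hybrid_kek(ss_classical, ss_pq) == kek
```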
tests/test_rekor_unit.py +226 -0
@@ -0,0 +1,226 @@
+"""
+test_rekor_unit
+===============
+
+Offline unit tests for oversight_core.rekor.
+
+Covers (no network):
+ 1. DSSE PAE construction matches the spec byte-for-byte against a fixture.
+ 2. sign_dsse + verify_dsse round trip.
+ 3. verify_dsse rejects a tampered payload.
+ 4. verify_dsse rejects a wrong-key signature.
+ 5. build_statement produces the expected in-toto v1 shape.
+ 6. Envelope JSON serialization is canonical (sorted keys, no whitespace).
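+ 7. verify_inclusion_offline rejects a bundle whose transparency_log_entry is empty.
+ 8. The raw recipient X25519 public key never appears in the on-log payload;
+ only its SHA-256 hash does.
+ 9. The predicate body carries an integer predicate_version.
+ 10. build_bundle output carries log_pubkey, checkpoint, schema URI, and bundle_schema.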
+
+Running this requires no external services; e2e Rekor tests live in
+test_rekor_e2e.py (added in v0.5 Session B).
+"""
+from __future__ import annotations
+
+import base64
+import json
+import os
+import sys
+
+# allow running without installing the package
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
+
+from oversight_core import rekor as R
+
+
+def _new_keypair() -> tuple[bytes, bytes]:
+ sk = Ed25519PrivateKey.generate()
+ return (
+ sk.private_bytes_raw(),
+ sk.public_key().public_bytes_raw(),
+ )
+
+
+def t1_pae_byte_exact():
+ pae = R._pae("application/vnd.in-toto+json", b'{"a":1}')
+ expect = b"DSSEv1 28 application/vnd.in-toto+json 7 " + b'{"a":1}'
+ assert pae == expect, f"PAE mismatch:\n got {pae!r}\n expect {expect!r}"
+ print(" [PASS] 1. PAE byte-exact match against spec")
+
+
+def t2_sign_verify_roundtrip():
+ priv, pub = _new_keypair()
+ pred = R.OversightRegistrationPredicate(
+ file_id="00000000-0000-4000-8000-000000000001",
+ issuer_pubkey_ed25519=pub.hex(),
+ recipient_id="alice@test",
+ recipient_pubkey_sha256="00" * 32,
+ suite="OSGT-CLASSIC-v1",
+ registered_at="2026-04-19T07:00:00Z",
+ )
+ stmt = R.build_statement("aa" * 16, "bb" * 32, pred)
+ env = R.sign_dsse(stmt, priv)
+ assert R.verify_dsse(env, pub), "valid envelope failed verification"
+ print(" [PASS] 2. sign_dsse + verify_dsse round trip")
+
+
+def t3_tamper_payload_rejected():
+ priv, pub = _new_keypair()
+ pred = R.OversightRegistrationPredicate(
+ file_id="x", issuer_pubkey_ed25519=pub.hex(),
+ recipient_id="r", recipient_pubkey_sha256="00" * 32,
+ suite="s", registered_at="t",
+ )
+ env = R.sign_dsse(R.build_statement("a", "b", pred), priv)
+ tampered = R.DSSEEnvelope(
+ payload_b64=base64.b64encode(b'{"evil":1}').decode(),
+ payload_type=env.payload_type,
+ signatures=env.signatures,
+ )
+ assert not R.verify_dsse(tampered, pub), "tampered payload accepted!"
+ print(" [PASS] 3. tampered payload rejected")
+
+
+def t4_wrong_key_rejected():
+ priv, _ = _new_keypair()
+ _, other_pub = _new_keypair()
+ pred = R.OversightRegistrationPredicate(
+ file_id="x", issuer_pubkey_ed25519="zz",
+ recipient_id="r", recipient_pubkey_sha256="00" * 32,
+ suite="s", registered_at="t",
+ )
+ env = R.sign_dsse(R.build_statement("a", "b", pred), priv)
+ assert not R.verify_dsse(env, other_pub), "wrong-key sig verified!"
+ print(" [PASS] 4. wrong public key rejected")
+
+
+def t5_statement_shape():
+ pred = R.OversightRegistrationPredicate(
+ file_id="fid", issuer_pubkey_ed25519="pp",
+ recipient_id="rid", recipient_pubkey_sha256="rxhash",
+ suite="OSGT-CLASSIC-v1", registered_at="2026-04-19T00:00:00Z",
+ )
+ s = R.build_statement("mark1234", "deadbeef" * 8, pred)
+ assert s["_type"] == R.STATEMENT_TYPE
+ assert s["predicateType"] == R.PREDICATE_TYPE
+ assert s["subject"][0]["name"] == "mark:mark1234"
+ assert s["subject"][0]["digest"]["sha256"].startswith("deadbeef")
+ assert s["predicate"]["suite"] == "OSGT-CLASSIC-v1"
+ print(" [PASS] 5. in-toto v1 statement shape correct")
+
+
+def t6_canonical_envelope_json():
+ priv, _ = _new_keypair()
+ pred = R.OversightRegistrationPredicate(
+ file_id="x", issuer_pubkey_ed25519="pp",
+ recipient_id="r", recipient_pubkey_sha256="00" * 32,
+ suite="s", registered_at="t",
+ )
+ env = R.sign_dsse(R.build_statement("a", "b" * 32, pred), priv)
+ raw = env.to_json()
+ # canonical: keys sorted alphabetically (payload, payloadType, signatures)
+ parsed_keys_in_order = list(json.loads(raw))
+ assert parsed_keys_in_order == sorted(parsed_keys_in_order), "envelope keys not sorted"
+ # round-trip must produce identical bytes
+ again = R.DSSEEnvelope.from_json(raw).to_json()
+ assert raw == again, "envelope JSON not canonical (round-trip differs)"
+ # no whitespace
+ assert " " not in raw and "\n" not in raw, "envelope JSON has whitespace"
+ print(" [PASS] 6. envelope JSON is canonical and round-trip stable")
+
+
+def t8_recipient_pubkey_never_appears_raw():
+ """Privacy: raw X25519 recipient key must never end up in the on-log payload."""
+ priv, _ = _new_keypair()
+ raw_pub_hex = "11" * 32 # pretend recipient X25519 pub
+ pred = R.OversightRegistrationPredicate(
+ file_id="x", issuer_pubkey_ed25519="pp",
+ recipient_id="r",
+ recipient_pubkey_sha256=R.hash_recipient_pubkey(raw_pub_hex),
+ suite="s", registered_at="t",
+ )
+ stmt = R.build_statement("a", "b" * 32, pred)
+ env = R.sign_dsse(stmt, priv)
+ raw_payload = base64.b64decode(env.payload_b64).decode()
+ assert raw_pub_hex not in raw_payload, "RAW recipient pubkey leaked into on-log payload"
+ assert pred.recipient_pubkey_sha256 in raw_payload
+ assert pred.recipient_pubkey_sha256 != raw_pub_hex
+ print(" [PASS] 8. raw recipient pubkey is hashed before going on-log")
+
+
+def t9_predicate_carries_version_int():
+ pred = R.OversightRegistrationPredicate(
+ file_id="x", issuer_pubkey_ed25519="pp",
+ recipient_id="r", recipient_pubkey_sha256="00" * 32,
+ suite="s", registered_at="t",
+ )
+ d = pred.to_dict()
+ assert d.get("predicate_version") == 1, d
+ print(" [PASS] 9. predicate body carries integer predicate_version")
+
+
+def t10_bundle_has_5year_replay_fields():
+ """Bundle must carry log_pubkey, checkpoint, schema URI, schema int."""
+ priv, _ = _new_keypair()
+ pred = R.OversightRegistrationPredicate(
+ file_id="x", issuer_pubkey_ed25519="pp",
+ recipient_id="r", recipient_pubkey_sha256="00" * 32,
+ suite="s", registered_at="t",
+ )
+ env = R.sign_dsse(R.build_statement("a", "b" * 32, pred), priv)
+ upload = R.RekorUploadResult(
+ log_url="https://log2025-1.rekor.sigstore.dev",
+ log_index=42, log_id="abc", integrated_time=1776600000,
+ transparency_log_entry={"logEntry": "..."},
+ log_pubkey_pem="-----BEGIN PUBLIC KEY-----\nFAKE\n-----END PUBLIC KEY-----",
+ checkpoint="rekor.sigstore.dev\n42\nABC=\n- rekor sig...",
+ )
+ bundle = R.build_bundle(
+ manifest_dict={"file_id": "x"},
+ manifest_sig_hex="aa" * 64,
+ upload=upload,
+ dsse_envelope=env,
+ rfc3161_token_b64="dummy",
+ rfc3161_chain_b64="chainpem",
+ )
+ assert bundle["bundle_schema"] == 2
+ assert bundle["tlog_kind"] == "rekor-v2-dsse"
+ rekor = bundle["rekor"]
+ assert rekor["log_pubkey_pem"], "log_pubkey missing"
+ assert rekor["checkpoint"], "checkpoint missing"
+ assert rekor["log_entry_schema"] == "rekor/v1.TransparencyLogEntry"
+ assert bundle["rfc3161_chain"] == "chainpem"
+ print(" [PASS] 10. bundle carries log_pubkey + checkpoint + schema URI + schema=2")
+
+
+def t7_offline_verify_rejects_empty_tle():
+ priv, pub = _new_keypair()
+ pred = R.OversightRegistrationPredicate(
+ file_id="x", issuer_pubkey_ed25519="pp",
+ recipient_id="r", recipient_pubkey_sha256="00" * 32,
+ suite="s", registered_at="t",
+ )
+ env = R.sign_dsse(R.build_statement("a", "b" * 32, pred), priv)
+ ok, reason = R.verify_inclusion_offline({}, env, pub)
+ assert not ok and "transparency_log_entry" in reason, reason
+ print(f" [PASS] 7. offline verify rejects empty bundle ({reason})")
+
+
+def main():
+ print("=" * 60)
+ print(" oversight_core.rekor - unit tests (offline, no network)")
+ print("=" * 60)
+ t1_pae_byte_exact()
+ t2_sign_verify_roundtrip()
+ t3_tamper_payload_rejected()
+ t4_wrong_key_rejected()
+ t5_statement_shape()
+ t6_canonical_envelope_json()
+ t7_offline_verify_rejects_empty_tle()
+ t8_recipient_pubkey_never_appears_raw()
+ t9_predicate_carries_version_int()
+ t10_bundle_has_5year_replay_fields()
+ print()
+ print(" ALL TESTS PASSED - 10/10")
+ print()
+
+
+if __name__ == "__main__":
+ main()
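
Tests 1 and 6 above depend on two encodings: the DSSE pre-authentication encoding (PAE) and a canonical JSON form with sorted keys and no whitespace. As a reference point, here is a minimal standalone sketch of both. The PAE layout follows the DSSE specification and reproduces the fixture used in test 1. The function names are hypothetical, and whether `R._pae` and `DSSEEnvelope.to_json` are written this way is an assumption.

```python
import json


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE PAE: "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body.

    Lengths are ASCII decimal byte counts with no leading zeros.
    """
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (
        len(type_bytes), type_bytes, len(payload), payload,
    )


def canonical_json(obj) -> str:
    """Sorted keys and compact separators, so equal objects serialize identically."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


# Same inputs and expected bytes as the fixture in t1_pae_byte_exact.
encoded = pae("application/vnd.in-toto+json", b'{"a":1}')
assert encoded == b'DSSEv1 28 application/vnd.in-toto+json 7 {"a":1}'

# Key order in the input does not affect the output, and a round trip is stable.
envelope = {"signatures": [], "payloadType": "t", "payload": "cGF5bG9hZA=="}
raw = canonical_json(envelope)
assert raw == canonical_json(json.loads(raw))
assert list(json.loads(raw)) == ["payload", "payloadType", "signatures"]
```

The signature in a DSSE envelope is computed over the PAE bytes and not over the raw payload, which is why swapping the payload while keeping the signatures (as test 3 does) must fail verification.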