Zion Boggan
repos/CTI Detection Automation/docs/architecture.md
# Architecture
 
## Indicator model

Everything from every feed is normalized into one `Indicator`:
 
```
type        ip | domain | url | sha256 | md5 | sha1 | email
value       the indicator itself
source      comma-joined list of feeds that reported it
threat_type botnet_cc | phishing | payload_delivery | ...
confidence  0-100
malware     family name when known
techniques  list of ATT&CK technique IDs
tags        free-form labels from the source
```
 
A feed connector does two things: `fetch_raw()` talks to the API, and `parse()`
turns the raw payload into `Indicator` objects. They're separate so the parsing
logic can be tested against fixtures without network access, and so that most of
the work in adding a feed is one new `parse()`.
 
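A minimal sketch of how the model and the connector split might look in Python. The dataclass fields follow the table above; the `FeedConnector` base class, its `collect()` helper, the `ExampleFeed` class and its payload shape are all hypothetical and not taken from the repository.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Indicator:
    type: str                 # ip | domain | url | sha256 | md5 | sha1 | email
    value: str                # the indicator itself
    source: str               # comma-joined list of feeds that reported it
    threat_type: str = ""     # botnet_cc | phishing | payload_delivery | ...
    confidence: int = 0       # 0-100
    malware: str = ""         # family name when known
    techniques: List[str] = field(default_factory=list)  # ATT&CK technique IDs
    tags: List[str] = field(default_factory=list)        # free-form labels


class FeedConnector(ABC):
    """Hypothetical base class: one network method, one pure parsing method."""

    name = "unnamed"

    @abstractmethod
    def fetch_raw(self) -> Any:
        """Talk to the feed's API and return the payload untouched."""

    @abstractmethod
    def parse(self, raw: Any) -> List[Indicator]:
        """Turn a raw payload into Indicator objects. No network access."""

    def collect(self) -> List[Indicator]:
        return self.parse(self.fetch_raw())


class ExampleFeed(FeedConnector):
    """Made-up feed whose payload is a list of {"ioc", "kind", "score"} dicts."""

    name = "example"

    def fetch_raw(self) -> Any:
        raise NotImplementedError("network call omitted in this sketch")

    def parse(self, raw: Any) -> List[Indicator]:
        return [
            Indicator(
                type=row["kind"],
                value=row["ioc"],
                source=self.name,
                confidence=int(row.get("score", 0)),
            )
            for row in raw
        ]


# parse() can be exercised against a fixture without touching the network.
fixture = [{"ioc": "203.0.113.7", "kind": "ip", "score": 80}]
parsed = ExampleFeed().parse(fixture)
```

Because `parse()` takes the payload as an argument instead of fetching it, a test only has to load a saved response and call it.
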
## Deduplication

Feeds overlap heavily - the same C2 IP shows up in ThreatFox and Feodo, the same
domain in OTX and URLhaus. Dedup keys on `(type, lowercased value)` and merges
duplicates as follows:
 
- confidence becomes the max across sources
- techniques and tags become the union
- sources are concatenated into the comma-joined `source` field, so provenance is preserved
- malware family and first-seen are taken from the first source that had them
 
The result is one record per indicator that carries more context than any single
feed's view of it.
 
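The merge rules above can be sketched roughly as follows. This is an illustration, not the project's code: the trimmed `Indicator` here carries only the fields the merge touches, `first_seen` is an assumed field name (the text only says "first-seen"), and skipping feed names that are already listed when joining sources is a choice made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Indicator:
    type: str
    value: str
    source: str
    confidence: int = 0
    malware: str = ""
    first_seen: str = ""   # assumed field name
    techniques: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def _union(a: List[str], b: List[str]) -> List[str]:
    """Order-preserving union of two lists."""
    return list(dict.fromkeys([*a, *b]))


def dedupe(indicators: Iterable[Indicator]) -> List[Indicator]:
    merged: Dict[Tuple[str, str], Indicator] = {}
    for ind in indicators:
        key = (ind.type, ind.value.lower())
        seen = merged.get(key)
        if seen is None:
            # Copy so the caller's objects are not mutated by later merges.
            merged[key] = Indicator(
                ind.type, ind.value, ind.source, ind.confidence, ind.malware,
                ind.first_seen, list(ind.techniques), list(ind.tags),
            )
            continue
        seen.confidence = max(seen.confidence, ind.confidence)
        seen.techniques = _union(seen.techniques, ind.techniques)
        seen.tags = _union(seen.tags, ind.tags)
        seen.source = ",".join(_union(seen.source.split(","), ind.source.split(",")))
        # Keep the first non-empty value; later sources only fill gaps.
        seen.malware = seen.malware or ind.malware
        seen.first_seen = seen.first_seen or ind.first_seen
    return list(merged.values())


feeds = [
    Indicator("ip", "198.51.100.9", "threatfox", 75, "", "", ["T1071"], ["c2"]),
    Indicator("ip", "198.51.100.9", "feodo", 90, "Emotet", "2024-01-02", [], ["botnet"]),
    Indicator("domain", "Evil.example", "otx", 50),
    Indicator("domain", "evil.example", "urlhaus", 60),
]
result = dedupe(feeds)
```

With this fixture the four inputs collapse to two records: the IP ends up with confidence 90 and source `threatfox,feodo`, and the two spellings of the domain merge because the key lowercases the value.
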
## TTP extraction

Techniques come from three places, in descending order of trust:
 
1. **Directly from the feed** - OTX pulses carry `attack_ids`, which are used as-is.
2. **From the malware family** - a CobaltStrike or AgentTesla indicator maps to that
   family's common techniques (`src/cti/mitre.py`).
3. **From the threat type** - a phishing URL maps to T1566.002 / T1204.001 even when
   no family is named.
 
Extraction then collapses these across all indicators into a ranked
`Technique` list with per-technique indicator counts and contributing sources, which
becomes the coverage report attached to every bundle.
 
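A rough sketch of the lookup and the ranking step, under a few assumptions: indicators are plain dicts here to keep the block short, the three origins are combined as a union (the text does not say whether lower-trust origins are skipped when a higher-trust one is present), and the family table entries are illustrative. Only the phishing mapping comes from the text above; the real tables live in `src/cti/mitre.py` and will differ.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set

# Illustrative lookup tables, not the project's actual mappings.
FAMILY_TECHNIQUES = {
    "cobaltstrike": ["T1071.001", "T1055"],
    "agenttesla": ["T1056.001", "T1555"],
}
THREAT_TYPE_TECHNIQUES = {
    "phishing": ["T1566.002", "T1204.001"],
}


@dataclass
class Technique:
    id: str
    indicator_count: int
    sources: List[str]


def techniques_for(ind: dict) -> List[str]:
    """Union of the three origins, most trusted first, without duplicates."""
    found = list(ind.get("techniques", []))                               # 1. feed
    found += FAMILY_TECHNIQUES.get(ind.get("malware", "").lower(), [])    # 2. family
    found += THREAT_TYPE_TECHNIQUES.get(ind.get("threat_type", ""), [])   # 3. type
    return list(dict.fromkeys(found))


def rank(indicators: List[dict]) -> List[Technique]:
    counts: Counter = Counter()
    sources: Dict[str, Set[str]] = defaultdict(set)
    for ind in indicators:
        for tid in techniques_for(ind):
            counts[tid] += 1
            sources[tid].update(ind["source"].split(","))
    # Highest indicator count first; ties broken by technique ID for stability.
    ordered = sorted(counts, key=lambda tid: (-counts[tid], tid))
    return [Technique(tid, counts[tid], sorted(sources[tid])) for tid in ordered]


report = rank([
    {"source": "otx", "techniques": ["T1566.002"], "threat_type": "phishing"},
    {"source": "urlhaus", "threat_type": "phishing"},
    {"source": "threatfox,feodo", "malware": "CobaltStrike"},
])
```

In this fixture the two phishing techniques each count two indicators and rank ahead of the two family-derived ones, which count one each.
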
## Rule generation

Two artifacts come out of a bundle:
 
**CDB lists** - one file per indicator type (`cti-malicious-ip`,
`cti-malicious-domain`, `cti-malicious-url`, `cti-malware-hash`, `cti-leaked-email`),
in Wazuh's `key:value` format where the value is the malware family or threat type.
These are the actual lookup tables.
 
**XML rules** - list-lookup rules (one per type present) that fire when a field
matches an entry in the corresponding CDB list. Each rule is tagged with the
techniques that dominate that bucket, so an alert arrives already mapped to ATT&CK.
Rule IDs start at a configurable base (default 100300) to stay clear of the bundled
Wazuh ruleset and the lab's hand-written rules.
 
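One way the two artifacts could be rendered, as a sketch only. The list names and the default rule ID base come from the text above; everything else is an assumption: that the three hash types share `cti-malware-hash`, the `unknown` fallback label, the event field names in `FIELD_FOR_LIST` (placeholders that depend on the decoders in use), the rule level, the group name, and quoting keys that contain a colon, which should be checked against the Wazuh version in use.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

LIST_FOR_TYPE = {
    "ip": "cti-malicious-ip",
    "domain": "cti-malicious-domain",
    "url": "cti-malicious-url",
    "sha256": "cti-malware-hash",   # assumption: all hash types share one list
    "md5": "cti-malware-hash",
    "sha1": "cti-malware-hash",
    "email": "cti-leaked-email",
}

# Placeholder (field, lookup) pairs; real field names depend on the decoders.
FIELD_FOR_LIST = {
    "cti-malicious-ip": ("srcip", "address_match_key"),
    "cti-malicious-domain": ("domain", "match_key"),
    "cti-malicious-url": ("url", "match_key"),
    "cti-malware-hash": ("hash", "match_key"),
    "cti-leaked-email": ("email", "match_key"),
}


def build_cdb_lists(indicators: List[dict]) -> Dict[str, List[str]]:
    """Group indicators into list name -> sorted `key:value` lines."""
    lists: Dict[str, List[str]] = defaultdict(list)
    for ind in indicators:
        label = ind.get("malware") or ind.get("threat_type") or "unknown"
        key = ind["value"]
        if ":" in key:
            # Assumption: keys containing a colon (URLs, IPv6) need quoting.
            key = f'"{key}"'
        lists[LIST_FOR_TYPE[ind["type"]]].append(f"{key}:{label}")
    return {name: sorted(lines) for name, lines in lists.items()}


def build_rules(lists: Dict[str, List[str]], techniques: Dict[str, List[str]],
                base_id: int = 100300) -> str:
    """One list-lookup rule per list present, numbered upward from base_id."""
    group = ET.Element("group", name="cti,")
    for offset, name in enumerate(sorted(lists)):
        field, lookup = FIELD_FOR_LIST[name]
        rule = ET.SubElement(group, "rule", id=str(base_id + offset), level="10")
        ET.SubElement(rule, "list", field=field, lookup=lookup).text = f"etc/lists/{name}"
        ET.SubElement(rule, "description").text = f"CTI match in {name}"
        tids = techniques.get(name, [])
        if tids:
            mitre = ET.SubElement(rule, "mitre")
            for tid in tids:
                ET.SubElement(mitre, "id").text = tid
    return ET.tostring(group, encoding="unicode")


lists = build_cdb_lists([
    {"type": "ip", "value": "198.51.100.9", "malware": "Emotet"},
    {"type": "sha256", "value": "ab" * 32, "threat_type": "payload_delivery"},
    {"type": "url", "value": "http://bad.example/x", "threat_type": "phishing"},
])
rules_xml = build_rules(lists, {"cti-malicious-ip": ["T1071"]})
```

Sorting the list names before assigning IDs keeps rule numbering stable between runs as long as the same set of types is present.
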
## Staging and promotion

`run` writes a candidate under `output/candidates/<bundle-id>/` with the lists, the
rules, the coverage report, and a manifest carrying the diff against the last
approved bundle. Nothing is active yet. `promote` (called when the analyst approves)
copies the candidate into `output/active/`, records the indicator set as the new
baseline for the next diff, and - if `wazuh_etc_dir` is configured - writes the lists
and rules into the Wazuh manager's directories.
 
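The promotion step could look roughly like this. It is a sketch under stated assumptions: the `lists/` and `rules/` subdirectories, the `manifest.json` file with an `indicators` key, the `baseline.json` file and the `demo-bundle` ID are all hypothetical names. Only `output/candidates/<bundle-id>/`, `output/active/` and `wazuh_etc_dir` come from the text above.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def promote(bundle_id: str, output_dir: Path,
            wazuh_etc_dir: Optional[Path] = None) -> Path:
    candidate = output_dir / "candidates" / bundle_id
    active = output_dir / "active"

    # Replace the active bundle with the approved candidate.
    if active.exists():
        shutil.rmtree(active)
    shutil.copytree(candidate, active)

    # Record the approved indicator set as the baseline for the next diff.
    manifest = json.loads((candidate / "manifest.json").read_text())
    baseline = sorted(manifest["indicators"])
    (output_dir / "baseline.json").write_text(json.dumps(baseline))

    # Optionally write lists and rules into the manager's directories.
    if wazuh_etc_dir is not None:
        for sub in ("lists", "rules"):
            dest = wazuh_etc_dir / sub
            dest.mkdir(parents=True, exist_ok=True)
            for src in sorted((candidate / sub).glob("*")):
                shutil.copy2(src, dest / src.name)
    return active


# Build a throwaway candidate in a temp directory and promote it.
root = Path(tempfile.mkdtemp())
cand = root / "output" / "candidates" / "demo-bundle"
(cand / "lists").mkdir(parents=True)
(cand / "rules").mkdir()
(cand / "lists" / "cti-malicious-ip").write_text("198.51.100.9:Emotet\n")
(cand / "rules" / "cti_rules.xml").write_text('<group name="cti,"></group>\n')
(cand / "manifest.json").write_text(json.dumps({"indicators": ["ip:198.51.100.9"]}))
active = promote("demo-bundle", root / "output", wazuh_etc_dir=root / "etc")
```

The candidate directory is copied, not moved, so a promoted bundle stays available under `candidates/` for later comparison.
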
## Why a human in the loop

Auto-generated detections from public feeds are a false-positive risk: a shared
CDN IP, a sinkholed domain, a hash collision in a list. The approval gate means a
bad bundle is a rejected email, not a pager storm. The cost is a few minutes of
analyst time per run, which is cheap next to chasing phantom alerts across the fleet.