docs/architecture.md · CTI Detection Automation

# Architecture
 
## Indicator model
 
Everything from every feed is normalized into one `Indicator`:
 
```
type        ip | domain | url | sha256 | md5 | sha1 | email
value       the indicator itself
source      comma-joined list of feeds that reported it
threat_type botnet_cc | phishing | payload_delivery | ...
confidence  0-100
malware     family name when known
techniques  list of ATT&CK technique IDs
tags        free-form labels from the source
```
 
A feed connector does two things: `fetch_raw()` talks to the API, and `parse()`
turns the raw payload into `Indicator` objects. They're separate so the parsing
logic can be tested against fixtures without a network, and so adding a feed is
just one new `parse()`.
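A minimal sketch of that split, assuming a dataclass for `Indicator` and a hypothetical `ExampleFeedConnector`; the payload keys (`ioc_type`, `ioc`) are invented for illustration and do not correspond to any real feed:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Indicator:
    type: str                 # ip | domain | url | sha256 | md5 | sha1 | email
    value: str                # the indicator itself
    source: str               # comma-joined list of feeds that reported it
    threat_type: str = ""     # botnet_cc | phishing | payload_delivery | ...
    confidence: int = 0       # 0-100
    malware: str = ""         # family name when known
    techniques: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


class ExampleFeedConnector:
    """Hypothetical connector; the payload shape is an assumption."""

    name = "examplefeed"

    def fetch_raw(self) -> list[dict[str, Any]]:
        # The only method that touches the network. Left unimplemented
        # because endpoints and auth are feed-specific.
        raise NotImplementedError

    def parse(self, raw: list[dict[str, Any]]) -> list[Indicator]:
        # Pure function of the payload, so it runs against a saved fixture.
        return [
            Indicator(
                type=row["ioc_type"],
                value=row["ioc"],
                source=self.name,
                threat_type=row.get("threat_type", ""),
                confidence=int(row.get("confidence", 0)),
                malware=row.get("malware", ""),
                tags=list(row.get("tags", [])),
            )
            for row in raw
        ]


# A fixture stands in for the API response; no network is needed.
fixture = [{"ioc_type": "ip", "ioc": "203.0.113.7", "threat_type": "botnet_cc",
            "confidence": 75, "malware": "CobaltStrike", "tags": ["c2"]}]
parsed = ExampleFeedConnector().parse(fixture)
```

Because `parse()` never calls `fetch_raw()`, a unit test only needs a captured payload on disk.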
 
## Deduplication
 
Feeds overlap heavily - the same C2 IP shows up in ThreatFox and Feodo, the same
domain in OTX and URLhaus. Dedup keys on `(type, lowercased value)` and merges:
 
- confidence becomes the max across sources
- techniques and tags become the union
- sources are joined into the comma-separated `source` field, so provenance is preserved
- malware family and first-seen are taken from the first source that had them
 
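The result is one record per indicator that carries the highest confidence any feed
gave it plus the combined techniques, tags, and sources, so it says more than any
single feed's view of it.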
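The merge rules can be sketched roughly as follows. The `first_seen` field name and the choice to skip a feed name that is already present in `source` are assumptions, not confirmed details of the project's code:

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class Indicator:
    type: str
    value: str
    source: str
    confidence: int = 0
    malware: str = ""
    first_seen: Optional[str] = None   # assumed field name
    techniques: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


def _union(a, b):
    # Order-preserving union: keeps the first occurrence of each item.
    return list(dict.fromkeys([*a, *b]))


def dedupe(indicators):
    merged: dict[tuple[str, str], Indicator] = {}
    for ind in indicators:
        key = (ind.type, ind.value.lower())
        seen = merged.get(key)
        if seen is None:
            # Copy so the caller's objects are not mutated by later merges.
            merged[key] = replace(ind, techniques=list(ind.techniques),
                                  tags=list(ind.tags))
            continue
        seen.confidence = max(seen.confidence, ind.confidence)
        seen.techniques = _union(seen.techniques, ind.techniques)
        seen.tags = _union(seen.tags, ind.tags)
        # Assumption: a feed name already listed is not appended again.
        seen.source = ",".join(_union(seen.source.split(","),
                                      ind.source.split(",")))
        # First source that had a value wins.
        seen.malware = seen.malware or ind.malware
        seen.first_seen = seen.first_seen or ind.first_seen
    return list(merged.values())


records = dedupe([
    Indicator("ip", "203.0.113.7", "threatfox", 60, "", None, ["T1071.001"], ["c2"]),
    Indicator("ip", "203.0.113.7", "feodo", 90, "Emotet", "2024-01-01", [], ["botnet"]),
    Indicator("domain", "Evil.example", "otx", 50),
    Indicator("domain", "evil.example", "urlhaus", 40),
])
```

Here the two IP entries collapse into one record with confidence 90 and source `threatfox,feodo`, and the two domain entries collapse despite the case difference.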
 
## TTP extraction
 
Techniques come from three places, in descending order of trust:
 
1. **Directly from the feed** - OTX pulses carry `attack_ids`, which are used as-is.
2. **From the malware family** - a CobaltStrike or AgentTesla indicator maps to that
   family's common techniques (`src/cti/mitre.py`).
3. **From the threat type** - a phishing URL maps to T1566.002 / T1204.001 even when
   no family is named.
 
Extraction then collapses these across all indicators into a ranked
`Technique` list with per-technique indicator counts and contributing sources, which
becomes the coverage report attached to every bundle.
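A rough sketch of the three-source lookup and the ranking step. The mapping tables below are illustrative placeholders, not the contents of `src/cti/mitre.py`; the `Technique` field names, the use of plain dicts for indicators, and the choice to combine all three sources (rather than fall back from one to the next) are assumptions:

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Placeholder mappings for illustration only.
FAMILY_TECHNIQUES = {
    "cobaltstrike": ["T1071.001", "T1055"],
    "agenttesla": ["T1056.001", "T1555"],
}
THREAT_TYPE_TECHNIQUES = {
    "phishing": ["T1566.002", "T1204.001"],
}


@dataclass
class Technique:
    id: str
    indicator_count: int
    sources: list[str]


def techniques_for(ind: dict) -> list[str]:
    # Most trusted first: feed-supplied IDs, then family, then threat type.
    ids = list(ind.get("techniques", []))
    ids += FAMILY_TECHNIQUES.get(ind.get("malware", "").lower(), [])
    ids += THREAT_TYPE_TECHNIQUES.get(ind.get("threat_type", ""), [])
    return list(dict.fromkeys(ids))  # drop repeats, keep trust order


def rank(indicators: list[dict]) -> list[Technique]:
    counts: Counter = Counter()
    sources = defaultdict(set)
    for ind in indicators:
        for tid in techniques_for(ind):
            counts[tid] += 1
            sources[tid].update(ind["source"].split(","))
    # Highest indicator count first; ties broken by technique ID.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [Technique(tid, n, sorted(sources[tid])) for tid, n in ordered]


ranked = rank([
    {"value": "http://login.example/a", "source": "urlhaus",
     "threat_type": "phishing"},
    {"value": "http://mail.example/b", "source": "otx,urlhaus",
     "threat_type": "phishing", "techniques": ["T1566.002"]},
    {"value": "203.0.113.7", "source": "threatfox",
     "malware": "CobaltStrike", "threat_type": "botnet_cc"},
])
```

The second indicator already carries T1566.002 from its feed, so the threat-type mapping adds only T1204.001 and the technique is counted once per indicator.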
 
## Rule generation
 
Two artifacts come out of a bundle:
 
**CDB lists** - one file per indicator type (`cti-malicious-ip`,
`cti-malicious-domain`, `cti-malicious-url`, `cti-malware-hash`, `cti-leaked-email`),
in Wazuh's `key:value` format where the value is the malware family or threat type.
These are the actual lookup tables.
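Rendering the lists might look like the sketch below. Several details are assumptions: that all three hash types share `cti-malware-hash`, that indicators arrive as plain dicts, that `unknown` is the fallback value, and that keys containing a colon (URLs, IPv6 addresses) are wrapped in double quotes. Check the Wazuh CDB list documentation for the quoting rules of the version in use.

```python
from collections import defaultdict

# Assumed mapping from indicator type to list name.
LIST_FOR_TYPE = {
    "ip": "cti-malicious-ip",
    "domain": "cti-malicious-domain",
    "url": "cti-malicious-url",
    "sha256": "cti-malware-hash",
    "md5": "cti-malware-hash",
    "sha1": "cti-malware-hash",
    "email": "cti-leaked-email",
}


def cdb_line(key: str, value: str) -> str:
    # Assumption: quote keys containing ':' so the separator stays unambiguous.
    if ":" in key:
        key = f'"{key}"'
    return f"{key}:{value}"


def render_cdb_lists(indicators: list[dict]) -> dict[str, str]:
    entries = defaultdict(dict)
    for ind in indicators:
        name = LIST_FOR_TYPE[ind["type"]]
        # Value is the malware family when known, else the threat type.
        label = ind.get("malware") or ind.get("threat_type") or "unknown"
        entries[name][ind["value"]] = label
    return {
        name: "".join(cdb_line(k, v) + "\n" for k, v in sorted(rows.items()))
        for name, rows in entries.items()
    }


lists = render_cdb_lists([
    {"type": "ip", "value": "203.0.113.7", "malware": "CobaltStrike",
     "threat_type": "botnet_cc"},
    {"type": "url", "value": "http://login.example/a", "threat_type": "phishing"},
    {"type": "sha256", "value": "ab" * 32, "malware": "AgentTesla"},
])
```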
 
**XML rules** - list-lookup rules (one per type present) that fire when a field
matches an entry in the corresponding CDB list. Each rule is tagged with the
techniques that dominate that bucket, so an alert arrives already mapped to ATT&CK.
Rule IDs start at a configurable base (default 100300) to stay clear of the bundled
Wazuh ruleset and the lab's hand-written rules.
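As a sketch of the rule generator, the following builds one list-lookup rule per bucket using Wazuh's `<list>` and `<mitre>` rule elements. The field names in `RULE_TARGETS`, the alert level, the group name, the `top_n` cutoff, and the alphabetical ID assignment are all assumptions chosen for illustration; the real field names depend on which decoders feed the rules.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical (field, lookup) per list; adjust to the decoders in use.
RULE_TARGETS = {
    "cti-malicious-ip": ("srcip", "address_match_key"),
    "cti-malicious-url": ("url", "match_key"),
}


def build_rules(buckets: dict[str, list[dict]],
                base_id: int = 100300, top_n: int = 3) -> str:
    group = ET.Element("group", {"name": "cti,"})
    for offset, (list_name, indicators) in enumerate(sorted(buckets.items())):
        field_name, lookup = RULE_TARGETS[list_name]
        rule = ET.SubElement(group, "rule",
                             {"id": str(base_id + offset), "level": "10"})
        lookup_el = ET.SubElement(rule, "list",
                                  {"field": field_name, "lookup": lookup})
        lookup_el.text = f"etc/lists/{list_name}"
        ET.SubElement(rule, "description").text = f"CTI match in {list_name}"

        # Tag the rule with the most frequent techniques in this bucket.
        counts = Counter(t for ind in indicators
                         for t in ind.get("techniques", []))
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        if top:
            mitre = ET.SubElement(rule, "mitre")
            for tid, _ in top:
                ET.SubElement(mitre, "id").text = tid
    return ET.tostring(group, encoding="unicode")


xml_text = build_rules({
    "cti-malicious-ip": [{"techniques": ["T1071.001", "T1055"]},
                         {"techniques": ["T1071.001"]}],
    "cti-malicious-url": [{"techniques": ["T1566.002"]}],
})
```

Building the tree with `xml.etree.ElementTree` rather than string formatting means list names and descriptions are escaped automatically.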
 
## Staging and promotion
 
`run` writes a candidate under `output/candidates/<bundle-id>/` with the lists, the
rules, the coverage report, and a manifest carrying the diff against the last
approved bundle. Nothing is active yet. `promote` (called when the analyst approves)
copies the candidate into `output/active/`, records the indicator set as the new
baseline for the next diff, and - if `wazuh_etc_dir` is configured - writes the lists
and rules into the Wazuh manager's directories.
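The staging flow could be sketched as below. File names such as `manifest.json` and `baseline.json`, the manifest layout, and the `type:value` indicator keys are assumptions, and the optional write into `wazuh_etc_dir` is left out to keep the example short:

```python
import json
import shutil
import tempfile
from pathlib import Path


def diff_against_baseline(candidate: set[str], baseline: set[str]) -> dict:
    return {
        "added": sorted(candidate - baseline),
        "removed": sorted(baseline - candidate),
        "unchanged": len(candidate & baseline),
    }


def stage(root: Path, bundle_id: str, files: dict[str, str],
          indicators: set[str]) -> Path:
    # Write the candidate only; nothing under active/ is touched.
    cand = root / "candidates" / bundle_id
    cand.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (cand / name).write_text(content)
    baseline_file = root / "baseline.json"
    baseline = (set(json.loads(baseline_file.read_text()))
                if baseline_file.exists() else set())
    manifest = {
        "bundle_id": bundle_id,
        "indicators": sorted(indicators),
        "diff": diff_against_baseline(indicators, baseline),
    }
    (cand / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return cand


def promote(root: Path, bundle_id: str) -> None:
    cand = root / "candidates" / bundle_id
    active = root / "active"
    if active.exists():
        shutil.rmtree(active)
    shutil.copytree(cand, active)
    # The approved indicator set becomes the baseline for the next diff.
    manifest = json.loads((cand / "manifest.json").read_text())
    (root / "baseline.json").write_text(json.dumps(manifest["indicators"]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "output"
    first = stage(root, "b1", {"cti-malicious-ip": "203.0.113.7:CobaltStrike\n"},
                  {"ip:203.0.113.7"})
    first_diff = json.loads((first / "manifest.json").read_text())["diff"]
    promote(root, "b1")
    second = stage(root, "b2", {"cti-malicious-ip": "203.0.113.7:CobaltStrike\n"},
                   {"ip:203.0.113.7", "domain:evil.example"})
    diff = json.loads((second / "manifest.json").read_text())["diff"]
    active_promoted = (root / "active" / "cti-malicious-ip").exists()
```

With no baseline yet, the first bundle's diff lists everything as added; after `promote`, the second bundle's diff shows only the new domain.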
 
## Why a human in the loop
 
Auto-generated detections from public feeds are a false-positive risk: a shared
CDN IP, a sinkholed domain, a hash collision in a list. The approval gate means a
bad bundle is a rejected email, not a pager storm. The cost is a few minutes of
analyst time per run, which is cheap next to chasing phantom alerts across the fleet.
