# Architecture

## Indicator model

Everything from every feed is normalized into one `Indicator`:

```
type         ip | domain | url | sha256 | md5 | sha1 | email
value        the indicator itself
source       comma-joined list of feeds that reported it
threat_type  botnet_cc | phishing | payload_delivery | ...
confidence   0-100
malware      family name when known
techniques   list of ATT&CK technique IDs
tags         free-form labels from the source
```

A feed connector does two things: `fetch_raw()` talks to the API, and `parse()`
turns the raw payload into `Indicator` objects. They're separate so the parsing
logic can be tested against fixtures without a network, and so adding a feed is
just one new `parse()`.
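
The split between fetching and parsing can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `FeedConnector` base class, the `ExampleFeed` connector, its JSON field names (`kind`, `ioc`, `threat`, `score`, `family`), and the `collect()` helper are all hypothetical. Only `fetch_raw()`, `parse()`, and the `Indicator` fields come from the description above.

```python
from __future__ import annotations

import json
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Indicator:
    type: str                      # ip | domain | url | sha256 | md5 | sha1 | email
    value: str
    source: str
    threat_type: str = ""
    confidence: int = 0            # 0-100
    malware: str | None = None
    techniques: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


class FeedConnector(ABC):
    """Hypothetical base class: one network method, one pure parsing method."""

    name: str

    @abstractmethod
    def fetch_raw(self) -> str:
        """Talk to the feed's API and return the raw payload."""

    @abstractmethod
    def parse(self, raw: str) -> list[Indicator]:
        """Turn a raw payload into Indicator objects; no network access."""

    def collect(self) -> list[Indicator]:
        return self.parse(self.fetch_raw())


class ExampleFeed(FeedConnector):
    """Made-up feed whose payload is a JSON array of objects."""

    name = "example"

    def fetch_raw(self) -> str:
        raise NotImplementedError("network call omitted in this sketch")

    def parse(self, raw: str) -> list[Indicator]:
        return [
            Indicator(
                type=row["kind"],
                value=row["ioc"],
                source=self.name,
                threat_type=row.get("threat", ""),
                confidence=int(row.get("score", 0)),
                malware=row.get("family"),
            )
            for row in json.loads(raw)
        ]


# Fixture-driven check: parse() runs without touching the network.
FIXTURE = '[{"kind": "ip", "ioc": "203.0.113.7", "threat": "botnet_cc", "score": 75}]'
parsed = ExampleFeed().parse(FIXTURE)
```

Because `parse()` only sees a string, a test can feed it a saved payload and never touch `fetch_raw()`.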

## Deduplication

Feeds overlap heavily - the same C2 IP shows up in ThreatFox and Feodo, the same
domain in OTX and URLhaus. Dedup keys on `(type, lowercased value)` and merges:

- confidence becomes the max across sources
- techniques and tags become the union
- sources are concatenated into the comma-joined `source` field, so provenance is preserved
- malware family and first-seen are taken from the first source that had them

The result is one record per indicator that's stronger than any single feed's view
of it.
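
The merge rules above could look something like this in code. It is a sketch under assumptions: the `first_seen` field name, the `dedupe()` and `_union()` helpers, and the choice to skip a feed name that is already present when joining sources are illustrative, not taken from the project.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace


@dataclass
class Indicator:
    type: str
    value: str
    source: str
    confidence: int = 0
    malware: str | None = None
    first_seen: str | None = None   # assumed field name
    techniques: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


def _union(a: list[str], b: list[str]) -> list[str]:
    """Order-preserving union of two lists."""
    return list(dict.fromkeys([*a, *b]))


def dedupe(indicators: list[Indicator]) -> list[Indicator]:
    merged: dict[tuple[str, str], Indicator] = {}
    for ind in indicators:
        key = (ind.type, ind.value.lower())
        seen = merged.get(key)
        if seen is None:
            # Copy so the caller's objects are never mutated.
            merged[key] = replace(
                ind, techniques=list(ind.techniques), tags=list(ind.tags)
            )
            continue
        seen.confidence = max(seen.confidence, ind.confidence)
        seen.techniques = _union(seen.techniques, ind.techniques)
        seen.tags = _union(seen.tags, ind.tags)
        # Assumption: a feed name already in the list is not appended again.
        seen.source = ",".join(_union(seen.source.split(","), ind.source.split(",")))
        # First source that had a value wins.
        seen.malware = seen.malware or ind.malware
        seen.first_seen = seen.first_seen or ind.first_seen
    return list(merged.values())


a = Indicator("domain", "Bad.Example.com", "otx", confidence=60,
              techniques=["T1566.002"], tags=["phish"])
b = Indicator("domain", "bad.example.com", "urlhaus", confidence=85,
              malware="AgentTesla", techniques=["T1566.002", "T1204.001"])
(record,) = dedupe([a, b])
```

Here the two reports differ only in letter case, so they collapse into one record that carries both feed names and the higher confidence.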

## TTP extraction

Techniques come from three places, in order of trust:

1. **Directly from the feed** - OTX pulses carry `attack_ids`, which are used as-is.
2. **From the malware family** - a CobaltStrike or AgentTesla indicator maps to that
   family's common techniques (`src/cti/mitre.py`).
3. **From the threat type** - a phishing URL maps to T1566.002 / T1204.001 even when
   no family is named.

Extraction then collapses these across all indicators into a ranked
`Technique` list with per-technique indicator counts and contributing sources, which
becomes the coverage report attached to every bundle.
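
One way to picture the collapse step is sketched below. It assumes the three sources are unioned per indicator (rather than used as fallbacks) and that ranking is by indicator count; neither is confirmed by the text. The `FAMILY_TECHNIQUES` entries are placeholders rather than the contents of `src/cti/mitre.py`, plain dicts stand in for `Indicator`, and the `Technique` field names are guesses.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder mappings for illustration only.
FAMILY_TECHNIQUES = {"CobaltStrike": ["T1071.001", "T1059.001"]}
THREAT_TYPE_TECHNIQUES = {"phishing": ["T1566.002", "T1204.001"]}


@dataclass
class Technique:
    id: str
    indicator_count: int = 0
    sources: set[str] = field(default_factory=set)


def techniques_for(ind: dict) -> list[str]:
    """Feed-supplied IDs first, then family mapping, then threat-type mapping."""
    ids = list(ind.get("techniques", []))
    ids += FAMILY_TECHNIQUES.get(ind.get("malware") or "", [])
    ids += THREAT_TYPE_TECHNIQUES.get(ind.get("threat_type") or "", [])
    return list(dict.fromkeys(ids))  # drop duplicates, keep order


def extract(indicators: list[dict]) -> list[Technique]:
    table: dict[str, Technique] = {}
    for ind in indicators:
        for tid in techniques_for(ind):
            tech = table.setdefault(tid, Technique(tid))
            tech.indicator_count += 1
            tech.sources.update(ind["source"].split(","))
    # Most-seen techniques first; ties broken by ID for a stable report.
    return sorted(table.values(), key=lambda t: (-t.indicator_count, t.id))


ranked = extract([
    {"type": "url", "value": "http://a.example/login", "source": "urlhaus",
     "threat_type": "phishing"},
    {"type": "url", "value": "http://b.example/login", "source": "otx",
     "threat_type": "phishing", "techniques": ["T1566.002"]},
    {"type": "ip", "value": "192.0.2.10", "source": "threatfox,feodo",
     "malware": "CobaltStrike", "threat_type": "botnet_cc"},
])
```

In this example the two phishing URLs push T1566.002 and T1204.001 to the top with a count of two each, and the CobaltStrike IP contributes the family-derived techniques with both of its feeds listed as sources.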

## Rule generation

Two artifacts come out of a bundle:

**CDB lists** - one file per indicator type (`cti-malicious-ip`,
`cti-malicious-domain`, `cti-malicious-url`, `cti-malware-hash`, `cti-leaked-email`),
in Wazuh's `key:value` format where the value is the malware family or threat type.
These are the actual lookup tables.

**XML rules** - list-lookup rules (one per type present) that fire when a field
matches an entry in the corresponding CDB list. Each rule is tagged with the
techniques that dominate that bucket, so an alert arrives already mapped to ATT&CK.
Rule IDs start at a configurable base (default 100300) to stay clear of the bundled
Wazuh ruleset and the lab's hand-written rules.
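
A rough sketch of how both artifacts might be rendered follows. Several details are assumptions rather than facts about this project: that the three hash types share `cti-malware-hash`, the `unknown` fallback label, double-quoting keys that contain a colon, the `srcip` field, alert level 10, the top-three technique cut-off, and the absence of a parent rule. A real rule would likely need an `if_sid` or `if_group` to scope it.

```python
from __future__ import annotations

from collections import Counter
from xml.sax.saxutils import escape

# Assumed mapping from indicator type to list file name.
LIST_FOR_TYPE = {
    "ip": "cti-malicious-ip",
    "domain": "cti-malicious-domain",
    "url": "cti-malicious-url",
    "sha256": "cti-malware-hash",
    "md5": "cti-malware-hash",
    "sha1": "cti-malware-hash",
    "email": "cti-leaked-email",
}


def render_cdb(indicators: list[dict], list_name: str) -> str:
    """Render one CDB list as key:value lines."""
    lines = []
    for ind in indicators:
        if LIST_FOR_TYPE[ind["type"]] != list_name:
            continue
        key = ind["value"]
        if ":" in key:
            key = f'"{key}"'  # assumption: colon-bearing keys are quoted
        label = ind.get("malware") or ind.get("threat_type") or "unknown"
        lines.append(f"{key}:{label}")
    return "\n".join(sorted(lines)) + "\n"


def render_rule(rule_id: int, list_name: str, field: str, lookup: str,
                indicators: list[dict], top_n: int = 3) -> str:
    """Render one list-lookup rule tagged with the bucket's dominant techniques."""
    counts = Counter(t for ind in indicators for t in ind.get("techniques", []))
    ids = "".join(f"    <id>{escape(t)}</id>\n" for t, _ in counts.most_common(top_n))
    return (
        f'<rule id="{rule_id}" level="10">\n'
        f'  <list field="{field}" lookup="{lookup}">etc/lists/{list_name}</list>\n'
        f'  <description>CTI match in {list_name}</description>\n'
        f'  <mitre>\n{ids}  </mitre>\n'
        f'</rule>'
    )


ip_indicators = [
    {"type": "ip", "value": "192.0.2.10", "malware": "CobaltStrike",
     "threat_type": "botnet_cc", "techniques": ["T1071.001"]},
    {"type": "ip", "value": "198.51.100.9", "threat_type": "botnet_cc",
     "techniques": ["T1071.001", "T1105"]},
]
cdb = render_cdb(ip_indicators, "cti-malicious-ip")
rule = render_rule(100300, "cti-malicious-ip", "srcip", "address_match_key",
                   ip_indicators)
```

The rule text would normally sit inside a `<group>` element in a rules file; that wrapper is left out here to keep the sketch short.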

## Staging and promotion

`run` writes a candidate under `output/candidates/<bundle-id>/` with the lists, the
rules, the coverage report, and a manifest carrying the diff against the last
approved bundle. Nothing is active yet. `promote` (called when the analyst approves)
copies the candidate into `output/active/`, records the indicator set as the new
baseline for the next diff, and - if `wazuh_etc_dir` is configured - writes the lists
and rules into the Wazuh manager's directories.
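
The stage-then-promote flow can be sketched like this. The file names `manifest.json` and `baseline.json`, the manifest keys, and the `stage()` signature are hypothetical, and the optional copy into `wazuh_etc_dir` is left out; only the directory layout and the baseline-diff behaviour follow the description above.

```python
from __future__ import annotations

import json
import shutil
import tempfile
from pathlib import Path


def stage(output: Path, bundle_id: str, files: dict[str, str],
          indicators: set[str]) -> Path:
    """Write a candidate bundle plus a manifest with the diff against the baseline."""
    baseline_file = output / "baseline.json"  # hypothetical location
    baseline = set(json.loads(baseline_file.read_text())) if baseline_file.exists() else set()
    cand = output / "candidates" / bundle_id
    cand.mkdir(parents=True, exist_ok=True)
    for name, text in files.items():
        (cand / name).write_text(text)
    manifest = {
        "bundle_id": bundle_id,
        "indicators": sorted(indicators),
        "added": sorted(indicators - baseline),
        "removed": sorted(baseline - indicators),
    }
    (cand / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return cand


def promote(output: Path, bundle_id: str) -> None:
    """Make an approved candidate active and record it as the new baseline."""
    cand = output / "candidates" / bundle_id
    active = output / "active"
    if active.exists():
        shutil.rmtree(active)
    shutil.copytree(cand, active)
    manifest = json.loads((cand / "manifest.json").read_text())
    (output / "baseline.json").write_text(json.dumps(manifest["indicators"]))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    stage(out, "b1", {"cti-malicious-ip": "192.0.2.10:botnet_cc\n"}, {"ip:192.0.2.10"})
    promote(out, "b1")
    second = stage(out, "b2", {"cti-malicious-ip": "198.51.100.9:botnet_cc\n"},
                   {"ip:198.51.100.9"})
    diff = json.loads((second / "manifest.json").read_text())
    active_files = sorted(p.name for p in (out / "active").iterdir())
```

Staging the second bundle changes nothing under `active/`; its manifest simply reports one indicator added and one removed relative to the promoted baseline.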

## Why a human in the loop

Auto-generated detections from public feeds are a false-positive risk: a shared
CDN IP, a sinkholed domain, a hash collision in a list. The approval gate means a
bad bundle is a rejected email, not a pager storm. The cost is a few minutes of
analyst time per run, which is cheap next to chasing phantom alerts across the fleet.