AI→Sec
Module 3: Malware Analysis
Malware Analysis with Deep Learning & Transformers
Static/dynamic signals → embeddings, families, campaigns
Outcome: Classify, cluster, and attribute malware using learned representations of static and dynamic behavior.
Learning Objectives (3)
- Build static byte/metadata embeddings and cluster at scale
- Model dynamic traces as sequences to capture tactics/behaviors
- Draft analyst-ready summaries and IOCs with guardrails
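One possible guardrail for the third objective is a grounding check: every IOC in a drafted summary must also appear in the collected evidence. The sketch below is a minimal illustration, not a complete IOC extractor. The regex patterns and the names `extract_iocs` and `ungrounded_iocs` are assumptions made for this example.

```python
import re

# Hypothetical, deliberately narrow patterns; production IOC extraction
# needs defanging support, URL handling, and false-positive filtering.
IOC_PATTERNS = {
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
}


def extract_iocs(text):
    """Return the set of (kind, value) IOC candidates found in text."""
    found = set()
    for kind, pattern in IOC_PATTERNS.items():
        for match in pattern.findall(text):
            found.add((kind, match.lower()))
    return found


def ungrounded_iocs(draft, evidence):
    """IOCs that appear in the draft but nowhere in the evidence."""
    return extract_iocs(draft) - extract_iocs(evidence)
```

A non-empty result from `ungrounded_iocs` would be a reason to hold the draft for analyst review rather than publish it.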
Topic Map (5)
- Byte/IR embeddings (PE/ELF)
- Syscall/behavior sequences
- Family clustering/linkage
- Packing/obfuscation robustness
- LLM assist for triage
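As a concrete static signal for the byte-level and packing topics above, a sliding-window entropy map can be computed with the standard library alone. High entropy is a common, though not conclusive, hint of packed or encrypted regions. This is a minimal sketch; the window and step sizes are arbitrary assumptions, and the function names are chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_map(data: bytes, window: int = 256, step: int = 128) -> list:
    """Entropy of each sliding window over data, in file order."""
    if len(data) <= window:
        # Input shorter than one window: score it as a single chunk.
        return [shannon_entropy(data)]
    return [
        shannon_entropy(data[i:i + window])
        for i in range(0, len(data) - window + 1, step)
    ]
```

The resulting list can be plotted per section or fed to a model as an extra feature channel alongside raw bytes.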
Topic Map — Deep Dives (3)
- Static signals
  - Headers/imports/sections; byte n-grams; entropy maps
  - Bytes→CNN/Transformer vs. feature models; IR path when needed
- Dynamic behavior
  - Traces via ETW/Sysmon/eBPF; PID trees; rare-event handling
  - Temporal motifs (beacons); feature joins (file/registry/net)
- Linkage
  - ANN (FAISS/ScaNN); graph builds; community detection; centroids → IOCs
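The linkage steps can be sketched in a few lines under simplifying assumptions: exact all-pairs cosine similarity stands in for an ANN index such as FAISS or ScaNN, and connected components stand in for community detection. The similarity threshold and the names `link_samples` and `centroid` are hypothetical choices for this example.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def link_samples(embeddings, threshold=0.9):
    """Group sample indices whose similarity graph is connected."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Exact O(n^2) comparison; an ANN index would replace this at scale.
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def centroid(embeddings, members):
    """Mean vector of the listed members, usable as a cluster summary."""
    dim = len(embeddings[0])
    return [sum(embeddings[m][d] for m in members) / len(members) for d in range(dim)]
```

Samples nearest to a cluster centroid are reasonable candidates to mine for shared IOCs, though that selection step is not shown here.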
Key Shifts Powered by AI (3)
- From Engineered Features to Learned Representations: Large-scale learned embeddings can generalize better than handcrafted features. [ember] [sorel] [malconv] Why it matters: Improved transfer across campaigns; scalable triage.
- Behavioral Modeling: Sequence models over dynamic traces capture tactics that are hard to encode with rules. [usenix23_humans] Why it matters: Earlier detection and family attribution.
- Robustness Matters: Adversarial and packed samples stress detectors, so evaluation must include adaptive attackers. [provninja] [wolf24] Why it matters: Hardened pipelines and realistic reporting.
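One simple way to report robustness is an evasion rate: of the samples a detector initially flags, what fraction slip past after a transformation? The sketch below assumes a detector exposed as a boolean callable and uses a fixed transformation, which is weaker than a truly adaptive attacker. The toy detector and XOR "packer" are hypothetical stand-ins, not real tools.

```python
def evasion_rate(detector, samples, transform):
    """Fraction of initially detected samples that evade after transform."""
    detected = [s for s in samples if detector(s)]
    if not detected:
        return 0.0  # nothing was detected, so nothing could evade
    evaded = sum(1 for s in detected if not detector(transform(s)))
    return evaded / len(detected)


def toy_detector(sample: bytes) -> bool:
    """Hypothetical signature-style detector: flags a fixed byte marker."""
    return b"EVIL" in sample


def toy_xor_pack(sample: bytes) -> bytes:
    """Hypothetical packing stand-in: XOR every byte with a constant."""
    return bytes(b ^ 0x5A for b in sample)
```

Reporting this rate next to clean-set accuracy makes it harder to overstate a detector's real-world performance.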
Still Hard (4)
- Label drift & vendor disagreement
- Packed/obfuscated samples; OOD generalization
- Sandbox vs. endpoint behavioral gaps; anti-VM
- Explainability for human review
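Vendor disagreement can at least be measured before labels are used for training. The sketch below takes a majority vote over per-vendor family labels and withholds the label when agreement is low. The 0.5 threshold, the input shape (vendor name mapped to label or `None`), and the function name are assumptions; family alias normalization across vendors is not handled.

```python
from collections import Counter


def consensus_label(vendor_labels, min_agreement=0.5):
    """Return (label, agreement); label is None when vendors disagree."""
    # Drop vendors with no verdict and normalize case and whitespace.
    votes = [v.strip().lower() for v in vendor_labels.values() if v]
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement
```

Samples that come back with `None` can be excluded from training or routed to manual labeling, and tracking the agreement score over time gives a rough view of label drift.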
References
- Fu et al. “Toward a Robust Detection of PowerShell Malware against Code Mixing and Obfuscation using Sentence Transformer and Similarity Learning.” ACM, 2024.
- Anderson & Roth. “EMBER: An Open Dataset for Training Static PE Malware ML Models.” 2018.
- Harang & Rudd. “SOREL-20M: A Large-Scale Benchmark Dataset for Malicious PE Detection.” 2020.
- Raff et al. “Malware Detection by Eating a Whole EXE (MalConv).” AAAI 2018.
- Aonzo et al. “Humans vs. Machines in Malware Classification.” USENIX Security 2023.
- Mukherjee et al. “Evading Provenance-Based ML Detectors with Adversarial System Actions (PROVNINJA).” USENIX Security 2023.
- Ling et al. “A Wolf in Sheep’s Clothing: Practical Black-box Adversarial Attacks for Evading Learning-based Windows Malware Detection in the Wild.” USENIX Security 2024.