AI→Sec

Module 3: Malware Analysis

Malware Analysis with Deep Learning & Transformers

Static/dynamic signals → embeddings, families, campaigns

Outcome: Classify, cluster, and attribute malware using learned representations of static and dynamic behavior.

Learning Objectives (3)
  • Build static byte/metadata embeddings and cluster at scale
  • Model dynamic traces as sequences to capture tactics/behaviors
  • Draft analyst-ready summaries and IOCs with guardrails
Topic Map (5)
  • Byte/IR embeddings (PE/ELF)
  • Syscall/behavior sequences
  • Family clustering/linkage
  • Packing/obfuscation robustness
  • LLM assist for triage
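
Sketch: sliding-window byte entropy. The byte-level and packing topics above meet in one cheap static signal, a per-window Shannon entropy profile, where long runs near 8 bits/byte often indicate compressed or encrypted (possibly packed) regions. This is a minimal sketch rather than a reference implementation: the function names and the 256-byte window and stride are illustrative assumptions, not values taken from the module.

```python
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_map(data: bytes, window: int = 256, stride: int = 256) -> list[float]:
    """Entropy of each fixed-size window over the input.

    window and stride are assumed defaults for illustration; inputs no longer
    than one window yield a single value.
    """
    if len(data) <= window:
        return [shannon_entropy(data)]
    return [
        shannon_entropy(data[i:i + window])
        for i in range(0, len(data) - window + 1, stride)
    ]
```

The resulting list can be used directly as a fixed-stride feature vector or rendered as a heat strip for analyst review. High entropy alone does not prove packing, since legitimate compressed resources look the same.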
Topic Map — Deep Dives (3)
  • Static signals
    • Headers/imports/sections; byte n-grams; entropy maps
    • Bytes→CNN/Transformer vs. feature models; IR path when needed
  • Dynamic behavior
    • Traces via ETW/Sysmon/eBPF; PID trees; rare event handling
    • Temporal motifs (beacons); feature joins (file/registry/net)
  • Linkage
    • ANN (FAISS/ScaNN); graph builds; community detection; centroids → IOCs
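
Sketch: linkage from embeddings to clusters and centroids. The linkage path above (neighbor search, graph build, grouping, centroids) can be illustrated end to end in a few lines. This sketch makes loud simplifications: brute-force cosine similarity stands in for an ANN index such as FAISS or ScaNN, and connected components via union-find stand in for a real community-detection algorithm. The function names and the 0.9 threshold are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def link_clusters(embeddings, threshold=0.9):
    """Group sample indices whose embeddings are linked by similarity edges.

    Brute-force O(n^2) comparison; an ANN index would replace this at scale.
    Connected components are the simplest grouping, not community detection.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def centroid(embeddings, members):
    """Mean vector of the given member indices (members must be non-empty)."""
    dim = len(embeddings[members[0]])
    return [
        sum(embeddings[m][d] for m in members) / len(members)
        for d in range(dim)
    ]
```

A centroid is only a summary vector. Turning it into IOCs still requires mapping back to the concrete artifacts (hashes, imports, domains) shared by the cluster members.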
Key Shifts Powered by AI (3)
  • From Engineered Features to Learned Representations: Large-scale learned embeddings can generalize better than handcrafted features. [ember] [sorel] [malconv]
    Why it matters: Improved transfer across campaigns; scalable triage.
  • Behavioral Modeling: Sequence models over dynamic traces capture tactics that are hard to encode with rules. [usenix23_humans]
    Why it matters: Earlier detection and family attribution.
  • Robustness Matters: Adversarial and packed samples stress detectors; evaluation must include adaptive attackers. [provninja] [wolf24]
    Why it matters: Hardened pipelines and realistic reporting.
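
Sketch: a rule-style baseline for the behavioral shift. To judge what a learned sequence model adds over rules, it helps to have a simple hand-written baseline for one temporal motif. The sketch below scores beacon-like regularity from event timestamps using the coefficient of variation of the inter-arrival gaps. The function name and the 1 / (1 + CV) scoring form are illustrative assumptions, not a method taken from the cited papers.

```python
import statistics


def beacon_score(timestamps: list[float]) -> float:
    """Regularity score in (0, 1]; values near 1 mean near-constant intervals.

    Returns 0.0 when there are fewer than three events or no elapsed time,
    since periodicity cannot be judged from so little data.
    """
    if len(timestamps) < 3:
        return 0.0
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    mean_gap = statistics.fmean(gaps)
    if mean_gap <= 0:
        return 0.0
    # Coefficient of variation: spread of the gaps relative to their mean.
    cv = statistics.pstdev(gaps) / mean_gap
    return 1.0 / (1.0 + cv)
```

Deliberate jitter in callback intervals lowers this score, which is one reason such fixed heuristics are easy to evade and why adaptive attackers belong in the evaluation.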
Still Hard (4)
  • Label drift & vendor disagreement
  • Packed/obfuscated samples; OOD generalization
  • Sandbox vs. endpoint behavioral gaps; anti-VM
  • Explainability for human review

References

  1. Fu et al. “Toward a Robust Detection of PowerShell Malware against Code Mixing and Obfuscation using Sentence Transformer and Similarity Learning.” ACM, 2024.
  2. Anderson & Roth. “EMBER: An Open Dataset for Training Static PE Malware ML Models.” 2018.
  3. Harang & Rudd. “SOREL-20M: A Large-Scale Benchmark Dataset for Malicious PE Detection.” 2020.
  4. Raff et al. “Malware Detection by Eating a Whole EXE (MalConv).” AAAI 2018.
  5. Aonzo et al. “Humans vs. Machines in Malware Classification.” USENIX Security 2023.
  6. Mukherjee et al. “Evading Provenance-Based ML Detectors with Adversarial System Actions (PROVNINJA).” USENIX Security 2023.
  7. Ling et al. “A Wolf in Sheep’s Clothing: Practical Black-box Adversarial Attacks for Evading Learning-based Windows Malware Detection in the Wild.” USENIX Security 2024.