Sec→AI
Module 6: Guardrails & Security of Agentic AI
Guardrails & Security of Agentic AI Systems (Model/LLM Security)
Threats, trust boundaries, policy controls, and red-teaming for agents
Outcome: Secure models/agents and limit the blast radius of automated actions.
Learning Objectives (3)
- Identify and mitigate prompt/tool injection and jailbreak paths
- Enforce least-privilege tool scopes with isolation, budgets, and approvals
- Build red-team/eval harnesses and governance for agentic systems
Topic Map (5)
- Prompt/tool injection & jailbreaks
- Data poisoning & deceptive fine-tuning
- Model/data exfiltration
- Least-privilege tools & isolation
- Red teaming & governance
Topic Map — Deep Dives (5)
- Prompt/tool injection & jailbreaks
  - Trust boundaries: user input, retrieved docs, and tool outputs are untrusted.
  - Mitigations: content sanitization, structured prompts, restricted tool schemas, allow/deny lists.
  - Execution gates: dry-run + explain plan, human approvals for high-impact actions.
  - Evaluation: jailbreak/injection corpora, attack success rate (ASR), residual risk after filters.
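The execution-gate idea above can be sketched as a deny-by-default check over proposed tool calls. This is a minimal illustration, not a production design; `ToolCall`, `execution_gate`, and the tool names are all hypothetical.

```python
# Hypothetical execution gate: tool calls are checked against an allow-list,
# and high-impact actions require an explicit human approval flag before they
# leave dry-run. All names here are illustrative.
from dataclasses import dataclass, field

ALLOWED_TOOLS = {"search_docs", "read_file", "send_email"}
HIGH_IMPACT = {"send_email"}  # actions that need human sign-off

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    approved: bool = False  # set True only after a human reviews the plan

def execution_gate(call: ToolCall) -> str:
    """Return 'execute', 'needs_approval', or 'deny' for a proposed call."""
    if call.name not in ALLOWED_TOOLS:
        return "deny"                # deny-by-default for unknown tools
    if call.name in HIGH_IMPACT and not call.approved:
        return "needs_approval"      # surface the plan, wait for approval
    return "execute"
```

Deny-by-default matters because injected instructions typically try to invoke tools outside the agent's intended scope.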
- Data poisoning & deceptive fine-tuning
  - Dataset supply-chain risks: scraped corpora, RAG indices, synthetic data feedback loops.
  - Controls: provenance/signing, data cards, canaries, per-sample influence checks.
  - Fine-tuning hygiene: isolation of SFT/LoRA, DP options, evals for deception/goal misgeneralization.
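One of the controls above, canaries, can be sketched as a pre-training filter: planted marker strings must never appear in an incoming batch, and any hit flags the source for quarantine. The canary values and function name below are hypothetical.

```python
# Illustrative dataset-hygiene canary check: samples containing a planted
# canary string indicate a contaminated or poisoned source. Canary values
# here are made up for the example.
CANARIES = {"CANARY-7f3a-do-not-train", "CANARY-91bc-do-not-train"}

def flag_contaminated(samples: list[str]) -> list[int]:
    """Return indices of samples that contain any planted canary token."""
    return [i for i, s in enumerate(samples)
            if any(c in s for c in CANARIES)]
```

In practice this check would sit alongside provenance verification, since canaries only catch sources you seeded in advance.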
- Model/data exfiltration
  - Leak paths: prompt/embedding inversion, long-term memory bleed, retrieval oversharing.
  - Controls: output redaction, secret scanning, rate-limits, canary tokens, tenant isolation.
  - Monitoring: egress policies, anomaly detection on tool/API calls, immutable audit trails.
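Output redaction plus canary tokens can be combined in one egress filter, as in this minimal sketch. The patterns are illustrative shapes (an AWS-style key id, a generic `api_key=` assignment, a planted canary), not an exhaustive secret-scanning ruleset.

```python
import re

# Minimal output-redaction sketch: secret-shaped spans and planted exfil
# canaries are masked before a response crosses the trust boundary.
# Patterns are illustrative, not exhaustive.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),              # AWS access-key-id shape
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),  # generic key assignment
    re.compile(r"CANARY-[0-9a-f]{8}"),            # planted exfil canary
]

def redact(text: str) -> tuple[str, int]:
    """Mask secret-shaped spans; return (clean_text, number_of_hits)."""
    hits = 0
    for pat in SECRET_PATTERNS:
        text, n = pat.subn("[REDACTED]", text)
        hits += n
    return text, hits
```

A nonzero hit count is itself a signal: canary matches should page the anomaly-detection pipeline, not just be silently scrubbed.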
- Least-privilege tools & isolation
  - Capability scoping per agent; short-lived credentials; vault-backed secrets.
  - Sandboxing: network/file-system policies, containers/microVMs, read-only mounts.
  - Policy engines: schemas/timeouts/budgets; idempotency and compensating actions.
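The policy-engine bullet above can be sketched as a per-agent envelope that validates call arguments against a schema and debits a call budget. `Policy`, `PolicyEngine`, and `BudgetExceeded` are hypothetical names for the example.

```python
# Sketch of a per-agent policy envelope: each tool call is validated against
# a simple argument schema and debited from a fixed call budget; the caller
# is expected to run the tool under the returned timeout. Illustrative only.
from dataclasses import dataclass

@dataclass
class Policy:
    max_calls: int
    timeout_s: float
    allowed_args: frozenset

class BudgetExceeded(Exception):
    pass

class PolicyEngine:
    def __init__(self, policy: Policy):
        self.policy = policy
        self.calls_used = 0

    def authorize(self, args: dict) -> dict:
        """Admit a call or raise; returns the enforcement parameters."""
        if self.calls_used >= self.policy.max_calls:
            raise BudgetExceeded("per-agent call budget exhausted")
        unknown = set(args) - self.policy.allowed_args
        if unknown:
            raise ValueError(f"args outside schema: {unknown}")
        self.calls_used += 1
        return {"timeout_s": self.policy.timeout_s, "args": args}
```

Keeping the budget in the engine rather than the agent means a compromised agent cannot simply "forget" its limits.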
- Red teaming & governance
  - Continuous eval: curated attack sets (injection, jailbreak, exfil) and ADR tracking.
  - Change management for prompts/tools; canary rollouts and rollback.
  - AI incident response: taxonomy, severity (P1–P5), containment/kill-switch, disclosure workflows.
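The continuous-eval bullet reduces to a small harness: replay a curated attack corpus against the system and report attack success rate (ASR). In this sketch, `model` and `succeeded` are stand-ins for a real endpoint and a real success oracle (e.g., a classifier or string check for the attacker's goal).

```python
# Toy red-team harness: replay attack prompts against a model callable and
# report the fraction on which the success oracle fires. The callables are
# placeholders for a real endpoint and oracle.
def attack_success_rate(attacks, model, succeeded) -> float:
    """ASR = successful attacks / total attacks (0.0 on an empty corpus)."""
    if not attacks:
        return 0.0
    wins = sum(1 for a in attacks if succeeded(model(a)))
    return wins / len(attacks)
```

Tracked per release, this number is what makes "residual risk after filters" a measurable gate rather than a judgment call.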
Key Shifts Powered by AI (4)
- Prompt injection is formalized & testable. We now have principled attack/defense benchmarks for LLM-integrated apps and agents. [usenix24_promptinj] [agentdojo] Why it matters: repeatable red-teaming and measurable hardening.
- Jailbreaks are universal & transferable. Adversarial suffixes transfer across models and APIs despite alignment. [neurips23_universal] Why it matters: policies must assume cross-model exploitability.
- Backdoors & deceptive behavior persist. Fine-tuning can create sleeper agents that survive safety training. [iclr24_sleeper] Why it matters: require data hygiene, evals for deception, and defense-in-depth.
- Memorization enables data extraction. LLMs can leak verbatim training data via prompting. [usenix21_extract] Why it matters: strict data governance, rate limits, and auditing are mandatory.
Still Hard (5)
- Tool scoping, credentials, and sandboxing for agent actions at scale.
- Robust defenses against indirect prompt injection via tool outputs/RAG.
- Measuring residual jailbreak risk and test coverage after mitigations.
- Governance, incident response, and disclosure specific to AI failures.
- Supply-chain trust for weights/datasets and drift across provider updates.
References
- Liu et al. "Formalizing and Benchmarking Prompt Injection Attacks against LLM-Integrated Applications." USENIX Security 2024.
- Debenedetti et al. "AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents." 2024.
- Zou et al. "Universal and Transferable Adversarial Attacks on Aligned Language Models." NeurIPS 2023.
- Hubinger et al. "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." ICLR 2024.
- Carlini et al. "Extracting Training Data from Large Language Models." USENIX Security 2021.