Sec→AI

Module 6: Guardrails & Security of Agentic AI

Guardrails & Security of Agentic AI Systems (Model/LLM Security)

Threats, trust boundaries, policy controls, and red-teaming for agents

Outcome: Secure models/agents and limit the blast radius of automated actions.

Learning Objectives (3)
  • Identify and mitigate prompt/tool injection and jailbreak paths
  • Enforce least-privilege tool scopes with isolation, budgets, and approvals
  • Build red-team/eval harnesses and governance for agentic systems
Topic Map (5)
  • Prompt/tool injection & jailbreaks
  • Data poisoning & deceptive fine-tuning
  • Model/data exfiltration
  • Least-privilege tools & isolation
  • Red teaming & governance
Topic Map — Deep Dives (5)
  • Prompt/tool injection & jailbreaks
    • Trust boundaries: user input, retrieved docs, and tool outputs are untrusted.
    • Mitigations: content sanitization, structured prompts, restricted tool schemas, allow/deny lists.
    • Execution gates: dry-run + explain plan, human approvals for high-impact actions.
    • Evaluation: jailbreak/injection corpora, attack success rate (ASR), residual risk after filters.
  • Data poisoning & deceptive fine-tuning
    • Dataset supply-chain risks: scraped corpora, RAG indices, synthetic data feedback loops.
    • Controls: provenance/signing, data cards, canaries, per-sample influence checks.
    • Fine-tuning hygiene: isolation of SFT/LoRA, DP options, evals for deception/goal misgeneralization.
  • Model/data exfiltration
    • Leak paths: prompt/embedding inversion, long-term memory bleed, retrieval oversharing.
    • Controls: output redaction, secret scanning, rate-limits, canary tokens, tenant isolation.
    • Monitoring: egress policies, anomaly detection on tool/API calls, immutable audit trails.
  • Least-privilege tools & isolation
    • Capability scoping per agent; short-lived credentials; vault-backed secrets.
    • Sandboxing: network/file-system policies, containers/microVMs, read-only mounts.
    • Policy engines: schemas/timeouts/budgets; idempotency and compensating actions.
  • Red teaming & governance
    • Continuous eval: curated attack sets (injection, jailbreak, exfil) and ADR tracking.
    • Change management for prompts/tools; canary rollouts and rollback.
    • AI incident response: taxonomy, severity (P1–P5), containment/kill-switch, disclosure workflows.
Key Shifts Powered by AI (4)
  • Prompt injection is formalized & testable We now have principled attack/defense benchmarks for LLM-integrated apps and agents. [usenix24_promptinj] [agentdojo]
    Why it matters: Repeatable red-teaming and measurable hardening.
  • Jailbreaks are universal & transferable Adversarial suffixes transfer across models and APIs despite alignment. [neurips23_universal]
    Why it matters: Policies must assume cross-model exploitability.
  • Backdoors & deceptive behavior persist Fine-tuning can create sleeper agents that hide through safety training. [iclr24_sleeper]
    Why it matters: Require data hygiene, evals for deception, and defense-in-depth.
  • Memorization enables data extraction LLMs can leak verbatim training data via prompting. [usenix21_extract]
    Why it matters: Strict data governance, rate-limits, and auditing are mandatory.
Still Hard (5)
  • Tool scoping, credentials, and sandboxing for agent actions at scale.
  • Robust defenses against indirect prompt injection via tool outputs/RAG.
  • Measuring residual jailbreak risk and test coverage after mitigations.
  • Governance, incident response, and disclosure specific to AI failures.
  • Supply-chain trust for weights/datasets and drift across provider updates.

References

  1. Liu et al. "Formalizing and Benchmarking Prompt Injection Attacks against LLM-Integrated Applications." USENIX Security 2024.
  2. Debenedetti et al. "AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents." 2024.
  3. Zou et al. "Universal and Transferable Adversarial Attacks on Aligned Language Models." NeurIPS 2023.
  4. Hubinger et al. "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." ICLR 2024.
  5. Carlini et al. "Extracting Training Data from Large Language Models." USENIX Security 2021.