Ai-hacking - a Yoai Collection

updated 18 days ago

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Paper • 2401.05566 • Published Jan 10 • 23

Weak-to-Strong Jailbreaking on Large Language Models

Paper • 2401.17256 • Published Jan 30 • 14

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

Paper • 2402.13220 • Published Feb 20 • 12

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Paper • 2404.13208 • Published Apr 19 • 37

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Paper • 2404.16873 • Published Apr 21 • 26

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Paper • 2405.08317 • Published 20 days ago • 8