Skeleton Key: New AI Jailbreak Technique Bypasses Safety Guardrails

Microsoft has uncovered a worrisome new generative AI jailbreak technique dubbed ‘Skeleton Key.’ This prompt injection method enables malicious actors to effectively circumvent a chatbot’s safety measures, those security features designed to keep ChatGPT from veering into inappropriate territory. Skeleton Key is an example of prompt injection or prompt engineering attacks. It’s a multi-turn strategy aimed at convincing an AI model to disregard its ingrained safety guardrails, resulting in the system violating its operators’ policies, making decisions influenced by a user, or executing malicious instructions, as described by Mark Russinovich, CTO of Microsoft Azure. It could also be tricked into revealing harmful or dangerous information, such as instructions for building makeshift explosives or the most effective methods of dismemberment. The attack functions by first asking the model to enhance its guardrails, rather than outright altering them, and to issue warnings in response to prohibited requests, instead of outright refusing them. Once the jailbreak is successfully accepted, the system will acknowledge the update to its guardrails and proceed to follow the user’s instructions to generate any requested content, regardless of subject matter. The research team successfully tested this exploit across a range of topics, including explosives, bioweapons, politics, racism, drugs, self-harm, graphic sexual content, and violence. While malicious actors may be able to manipulate the system into producing inappropriate content, Russinovich was quick to emphasize that the technique’s access capabilities are limited. He explained that like all jailbreaks, the impact can be understood as narrowing the gap between what the model is capable of doing, given the user’s credentials, and what it is willing to do. This being an attack on the model itself, it does not introduce other risks to the AI system, such as granting access to another user’s data, taking control of the system, or exfiltrating data. As part of its investigation, Microsoft researchers tested the Skeleton Key technique on a variety of leading AI models, including Meta’s Llama3-70b-instruct, Google’s Gemini Pro, OpenAI’s GPT-3.5 Turbo and GPT-4, Mistral Large, Anthropic’s Claude 3 Opus, and Cohere Commander R Plus. The research team has already reported the vulnerability to these developers and has implemented Prompt Shields to detect and block this jailbreak in its Azure-managed AI models, including Copilot.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top