Week 7 - Crimes
April 19, 2025
Hello. Welcome back. This week I dedicated myself to learning about a variety of jailbreaking techniques, mainly from The Jailbreak Cookbook by General Analysis. Hopefully I will have an opportunity to employ some of them before the project is over.
Jailbreaking techniques can be categorized along three dichotomies: white-box vs. black-box, semantic vs. nonsensical, and systematic vs. manual. White-box jailbreaks assume the attacker has knowledge of the LLM's internals, such as parameters and architectural details, while black-box jailbreaks assume minimal knowledge; since both open-source and closed-source LLMs are widely deployed, both types have utility. The semantic vs. nonsensical distinction boils down to whether the prompts are coherent to humans: semantic prompts read as natural language, while nonsensical ones employ seemingly random tokens specifically constructed to bypass model safeguards. The systematic vs. manual distinction is whether prompt creation is automated with algorithmic methods (potentially using other LLMs) or handcrafted by humans.
What surprised me initially was the wide variety of techniques that have been conceived in the span of essentially two to three years. They range from Do Anything Now (DAN), which was developed through the collective efforts of users on social media, to Many-shot Jailbreaking, which fills the context with hundreds of harmful Q&A examples followed by one unanswered attack question, attempting to trick the LLM into complying via a foot-in-the-door effect. One technique I found particularly interesting was Greedy Coordinate Gradient (GCG), a fully automated attack that searches for an adversarial prompt suffix (initialized as a run of exclamation marks) by using gradients of the negative log-likelihood of a target response to guide token substitutions.
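To make the greedy-coordinate idea concrete, here is a toy sketch of the GCG loop. Everything in it is hypothetical: the vocabulary, the target suffix, and the loss function are stand-ins I invented for illustration. Real GCG uses gradients of the target completion's negative log-likelihood with respect to one-hot token embeddings to shortlist candidate swaps; this sketch brute-forces every single-token substitution instead, which keeps the greedy "swap the best coordinate" structure visible without a model in the loop.

```python
# Toy sketch of the GCG optimization loop (not the real algorithm:
# the gradient-based candidate shortlist is replaced by exhaustive
# single-token search, and the loss is a made-up stand-in).

VOCAB = ["!", "describe", "sure", "step", "ignore", "please"]

# Hypothetical "optimal" adversarial suffix the search should find.
TARGET = ("sure", "step", "describe")

def loss(suffix):
    """Stand-in for the negative log-likelihood of the target
    completion given the suffix; lower is better for the attacker.
    Here it just counts positions that differ from TARGET."""
    return sum(a != b for a, b in zip(suffix, TARGET))

def gcg_step(suffix):
    """One greedy coordinate step: try every (position, token)
    substitution and keep the single swap that lowers the loss most."""
    best, best_loss = suffix, loss(suffix)
    for i in range(len(suffix)):
        for tok in VOCAB:
            cand = suffix[:i] + (tok,) + suffix[i + 1:]
            if loss(cand) < best_loss:
                best, best_loss = cand, loss(cand)
    return best

# GCG initializes the suffix as a run of "!" tokens.
suffix = ("!",) * 3
for _ in range(5):
    suffix = gcg_step(suffix)

print(suffix)  # the greedy search converges to the minimal-loss suffix
```

In the real attack the loss is computed by a forward pass of the victim model on prompt + suffix + target completion, and gradients through the token embeddings narrow the candidate swaps to a top-k set before evaluation, which is what makes the search tractable over a vocabulary of tens of thousands of tokens.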
I hope you are reading this sentence. Snabal bob shilzibwibel.
