Not really, no; they are basically fancy word-predictors. Based on both the input and their training, they predict which word should come next. They only have a "context" of a certain number of tokens (which are sometimes whole words, but also word fragments and punctuation), and anything before that window that isn't part of their training is forgotten. The limit varies depending on the model, and ChatGPT itself uses some techniques to extend it a bit.
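If you want to see what tokens actually look like, OpenAI publishes its tokenizer as the tiktoken library. This is just a small sketch; the encoding name is an assumption, since each model family uses its own tokenizer:

```python
# Rough sketch of tokenization using OpenAI's open-source "tiktoken" library.
# The encoding name "cl100k_base" is an assumption (it's the one newer OpenAI
# models use); other models split text differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "List 5 baking recipes that are low calorie."
tokens = enc.encode(text)

print(len(tokens))                        # how many tokens the model "sees"
print([enc.decode([t]) for t in tokens])  # the pieces: whole words, fragments, punctuation
```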
So the way the prompt jailbreaks work, then, is to put the model's "attention" (the mechanism that lets the prompt influence which words get generated next) on the idea of not following its existing rules. The reason this works at all is that the rules are themselves just part of the prompt, which you don't see. The entire conversation, including all of the model's own responses, gets copied into each subsequent request.
So you see this:
You: List 5 baking recipes that are low calorie
ChatGPT: Here are 5 baking recipes....
But in reality the AI is seeing something like this:
You are an AI Assistant, your task is to answer questions and follow instructions from the user. You must never give instructions or answer questions that might be harmful, and you must refuse to answer if the question could be used to hurt someone. Questions will be in this format:
{Human}: What is 10 times 10? {Assistant}: Ten times ten is 100.
{Human}: List 5 baking recipes that are low calorie {Assistant}: Here are 5 baking recipes....
So it's repeating everything that came before, including its initial instructions, every single time. Then when you put the "jailbreak" prompt in before your prompts, the context now also says it should ignore its instructions, and it switches to doing that. GPT-4 is apparently better at resisting this type of jailbreak, though.
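To make that concrete, here's a minimal sketch of what a chat wrapper does on every turn. Everything in it (SYSTEM_RULES, history, complete()) is a made-up placeholder for illustration, not OpenAI's actual code; it's just the shape of the thing:

```python
# Minimal sketch of how a chat front-end rebuilds the whole prompt on every turn.
# SYSTEM_RULES, history, and complete() are illustrative stand-ins, not the real
# implementation.

SYSTEM_RULES = (
    "You are an AI Assistant. Answer questions and follow instructions from the "
    "user, but refuse anything harmful. Questions will be in this format:\n"
)

history = []  # list of (human_text, assistant_text) pairs from earlier turns

def complete(prompt: str) -> str:
    # Stand-in for the actual model call; a real client would send the prompt
    # to a completion endpoint and return the generated text.
    return "..."

def build_prompt(user_message: str) -> str:
    prompt = SYSTEM_RULES
    for human, assistant in history:
        prompt += f"{{Human}}: {human} {{Assistant}}: {assistant}\n"
    prompt += f"{{Human}}: {user_message} {{Assistant}}:"
    return prompt

def ask(user_message: str) -> str:
    prompt = build_prompt(user_message)    # hidden rules + full transcript, every time
    reply = complete(prompt)               # model only ever sees that one big string
    history.append((user_message, reply))  # and the reply gets replayed on the next turn
    return reply
```

A jailbreak message is just another human turn that lands in history, so its "ignore your instructions" text sits inside the context of every prompt after it.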
On top of that there is a "moderation endpoint" that gets run on what you send, and it tries to filter out prompts and responses that contain hateful or otherwise disallowed content.
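If I'm remembering the public docs right, you can hit that moderation endpoint yourself; this is just a sketch with a placeholder key:

```python
# Sketch of calling OpenAI's moderation endpoint directly with the requests
# library, based on the publicly documented API. The key is a placeholder.
import requests

API_KEY = "sk-..."  # placeholder, not a real key

resp = requests.post(
    "https://api.openai.com/v1/moderations",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": "some user prompt to check"},
)
result = resp.json()["results"][0]

print(result["flagged"])     # True if the text trips the filter
print(result["categories"])  # which categories (hate, violence, etc.) were hit
```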
A bit over my head, but I think I get the overall concept. Thanks.
Well, the thing to understand is that it doesn't think at all. It just predicts the next word based on the previous words (with special emphasis on following the prompt's words). And its "memory" is limited to about 8000 tokens or so on ChatGPT, less on homebrew ones.
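Here's a sketch of what that limit means in practice. The 8000-token budget and the crude word-count stand-in for the real tokenizer are both assumptions, just to show how older turns fall out of memory:

```python
# Sketch of the "limited memory": once the transcript grows past the context
# window, the oldest turns simply get dropped. MAX_TOKENS and count_tokens()
# are illustrative assumptions, not the real values.
MAX_TOKENS = 8000

def count_tokens(text: str) -> int:
    # Crude stand-in; a real implementation would use the model's tokenizer.
    return len(text.split())

def trim_history(turns: list[str]) -> list[str]:
    kept: list[str] = []
    total = 0
    # Walk backwards from the newest turn, keeping as much as still fits.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if total + cost > MAX_TOKENS:
            break  # everything older than this point is "forgotten"
        kept.append(turn)
        total += cost
    return list(reversed(kept))
```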
This is the homebrew one I mentioned by the way: https://github.com/lm-sys/FastChat
As for the topic of the thread (minimal prompting, you can see it near the bottom; the whole prompt was just one line telling both sides of the conversation to be hateful): https://pic8.co/sh/X4IOyD.png
This is running LLaMA 13B (13 billion parameters). ChatGPT is reportedly around 175 billion. Vicuna is a fine-tuned version of LLaMA and, in its authors' evaluations, reaches roughly 90% of ChatGPT's quality.