This is great. I'm surprised they're disclosing this vulnerability. Poisoning major LLMs just got a lot easier. Making them spew out gibberish is only one possible attack: you could make them spit out promotional text for your products, or even malicious code.

I'm already thinking about how they could mitigate these attacks, and it doesn't sound easy. At best, they'd have to run another AI over every piece of training data to detect malicious content, and that would be expensive. Even then, you'd need samples of an actual attack to train the detector on.
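To make that concrete, here's a rough sketch in Python of what the scanning step could look like. Everything here is made up for illustration: `PoisonDetector` is a hypothetical stand-in for a model trained on known attack samples, and the `<SUDO>` trigger phrase is just an example of the kind of pattern such a detector might learn to flag. It also shows the chicken-and-egg problem: the toy detector only catches what it already knows about.

```python
# Sketch of the mitigation described above: run every candidate training
# document through a separately trained detector and drop whatever it flags.
# PoisonDetector and its scoring rule are hypothetical, not a real API.

from dataclasses import dataclass


@dataclass
class PoisonDetector:
    """Placeholder for a model trained on known poisoning samples."""
    threshold: float = 0.8  # reject documents scoring at or above this

    def score(self, text: str) -> float:
        # A real detector would return an estimated P(malicious) from a
        # learned model; this toy version just matches a known trigger
        # phrase, so it can only catch attacks it was "trained" on.
        return 1.0 if "<SUDO>" in text else 0.0


def filter_corpus(docs: list[str], detector: PoisonDetector) -> list[str]:
    # Keep only documents the detector scores below the rejection threshold.
    # Running this over an entire pretraining corpus is the expensive part.
    return [d for d in docs if detector.score(d) < detector.threshold]


corpus = [
    "ordinary scraped web text",
    "poisoned page <SUDO> followed by the gibberish payload",
]
clean = filter_corpus(corpus, PoisonDetector())
print(len(clean))  # 1 -- the poisoned document was dropped
```

A novel trigger phrase the detector has never seen would sail straight through, which is why you'd need fresh attack samples every time the technique changes.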

