The forward pass through the LLM should actually be differentiable, right? However, differentiating through the LLM may yield gradients that are too noisy to be usable. In that case, you need a differentiable surrogate function that replaces the LLM and is faithful to it in the vicinity of the specimen:
The first thing you should try is to just ignore the LLM, i.e., use a surrogate function that is constant and always outputs what the LLM outputted for the specimen. This constant surrogate of course has 0 gradient everywhere, so the gradient flows only through the rest of the defense. If you can find a successful attack with this surrogate function, you've also shown that using an LLM was actually pointless, and the defense is just as robust without it.
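A minimal sketch of the constant-surrogate idea on a toy problem. Everything here is a hypothetical stand-in rather than part of any real defense: `black_box_llm` is a step function playing the role of the LLM, `defense` is a made-up linear score, and the weights, `eps`, and step size are arbitrary. Finite differences are used only to keep the example dependency-free; in an autodiff framework you would instead treat the LLM output as a constant.

```python
def black_box_llm(x):
    # Hypothetical stand-in for the non-differentiable LLM: a hard step.
    return 1.0 if sum(x) > 0 else 0.0

def defense(x, llm=black_box_llm):
    # Hypothetical defense: a linear score on x plus a term from the LLM output.
    # A positive score means the input is classified correctly.
    w = [1.0, -2.0]
    return sum(wi * xi for wi, xi in zip(w, x)) + 0.5 * llm(x)

def numeric_grad(f, x, h=1e-4):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def attack_with_constant_surrogate(x0, eps=0.5, step=0.1, n_steps=10):
    # Query the LLM once at the specimen and freeze its output.
    frozen = black_box_llm(x0)
    surrogate = lambda x: defense(x, llm=lambda _x: frozen)
    x = list(x0)
    for _ in range(n_steps):
        g = numeric_grad(surrogate, x)
        # Signed gradient step that lowers the surrogate score.
        x = [xi - step * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the L-infinity ball of radius eps around x0.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x

x0 = [1.0, 0.2]
x_adv = attack_with_constant_surrogate(x0)
# Always re-check the candidate against the real defense, LLM included.
attack_succeeded = defense(x0) > 0 and defense(x_adv) < 0
```

The final check matters: the surrogate is only used to find a candidate, and success is judged on the unmodified defense.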
Otherwise, a basic idea could be: for each specimen, sample the LLM in the vicinity of the specimen, and use those samples to train a tiny neural network to imitate it in this local neighborhood. You might need to replace not just the LLM but also the pre- and postprocessing around it. This approach is highly dependent on your use case.
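A rough sketch of such a local surrogate, under loud assumptions: `black_box` is a hypothetical step function standing in for the LLM plus its pre- and postprocessing, and a single sigmoid neuron stands in for the tiny neural network to keep the example dependency-free. The sampling radius, learning rate, and step count are arbitrary choices, not recommendations.

```python
import math
import random

def black_box(x):
    # Hypothetical non-differentiable component: its gradient is zero
    # almost everywhere, so it gives an attack nothing to follow.
    return 1.0 if x[0] + 2.0 * x[1] > 1.0 else 0.0

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_local_surrogate(x0, n_samples=200, radius=0.3, lr=1.0,
                        n_steps=300, seed=0):
    rng = random.Random(seed)
    # Sample offsets around the specimen and label them with the black box.
    deltas = [[rng.gauss(0.0, radius) for _ in x0] for _ in range(n_samples)]
    labels = [black_box([c + d for c, d in zip(x0, delta)])
              for delta in deltas]
    # Surrogate: sigmoid(w . (x - x0) + b), trained by full-batch
    # gradient descent on the cross-entropy loss.
    w, b = [0.0] * len(x0), 0.0
    for _ in range(n_steps):
        gw, gb = [0.0] * len(x0), 0.0
        for delta, y in zip(deltas, labels):
            z = sum(wi * di for wi, di in zip(w, delta)) + b
            err = sigmoid(z) - y
            for i, di in enumerate(delta):
                gw[i] += err * di
            gb += err
        w = [wi - lr * gi / n_samples for wi, gi in zip(w, gw)]
        b -= lr * gb / n_samples
    return w, b, deltas, labels

def surrogate_grad_at_specimen(w, b):
    # Gradient of sigmoid(w . (x - x0) + b) with respect to x, at x = x0.
    p = sigmoid(b)
    return [p * (1.0 - p) * wi for wi in w]

x0 = [0.5, 0.3]
w, b, deltas, labels = fit_local_surrogate(x0)
grad = surrogate_grad_at_specimen(w, b)
```

The surrogate's gradient at the specimen then replaces the missing gradient of the black box inside the attack. Before trusting it, check how well the surrogate agrees with the black box on fresh samples from the same neighborhood.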
Or maybe you can think of some other clever surrogate. Again, this is highly dependent on the defense.
Hello, if the defense method involves non-differentiable modules, such as large language models, do you have any suggestions for designing the attack?