How to design adaptive attacks against complex models or even non-differentiable models? #4

Open
zhongjian-zhang opened this issue Aug 2, 2024 · 1 comment

Comments

@zhongjian-zhang

Hello, if the defense method involves non-differentiable modules, such as large language models, do you have any suggestions for designing the attack?

@LoadingByte
Owner

The forward pass through the LLM should actually be differentiable, right? However, differentiating through the LLM might of course yield gradients that are too noisy to be usable. If that is the case, you need to come up with some differentiable surrogate function that replaces the LLM and is faithful to it in the vicinity of the specimen:

  1. The first thing you should try is to just ignore the LLM, i.e., use a surrogate function that is constant and always outputs what the LLM output for the specimen. This constant surrogate of course has zero gradient everywhere. If you can find a successful attack with this surrogate function, you've also shown that using an LLM was actually pointless, and the defense is just as robust without it.
  2. Otherwise, a basic idea could be: for each specimen, sample the LLM in the vicinity of the specimen, and use those samples to train a tiny neural network that imitates it in this local neighborhood. You might need to widen the part of the defense you replace from just the LLM to also include the pre- and postprocessing around it. This approach is highly dependent on your use case.
  3. Or maybe you can think of some other clever surrogate. Again, this is highly dependent on the defense.
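
As a rough illustration of points 1 and 2, here is a minimal sketch in NumPy. Everything in it is an assumption made for the example: `black_box` is a hypothetical quantized function standing in for the LLM (plus any pre- and postprocessing), the specimen is a small real-valued vector, and a linear least-squares fit is used as a simpler stand-in for the tiny neural network mentioned in point 2. It is not code from this repository.

```python
import numpy as np


def black_box(x):
    """Hypothetical stand-in for the non-differentiable module.

    Rounding makes the gradient zero almost everywhere, so it cannot be
    attacked with plain gradient descent.
    """
    return np.round(3.0 * x[..., 0] - 2.0 * x[..., 1], 1)


def constant_surrogate(specimen):
    """Point 1: freeze the module at its output for the specimen."""
    y0 = black_box(specimen)
    # The returned function ignores its input, so its gradient is zero.
    return lambda x: np.full(np.shape(x)[:-1], y0)


def fit_local_linear_surrogate(specimen, radius=0.5, n_samples=256, seed=0):
    """Point 2 (simplified): fit a linear model to samples near the specimen.

    Returns the surrogate function and its gradient, which is constant
    because the surrogate is linear.
    """
    rng = np.random.default_rng(seed)
    # Sample the black box in a box-shaped neighborhood of the specimen.
    offsets = rng.uniform(-radius, radius, size=(n_samples, specimen.size))
    ys = black_box(specimen + offsets)
    # Least-squares fit of y ~ offsets @ w + b (last column is the bias).
    design = np.hstack([offsets, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, ys, rcond=None)
    w, b = coef[:-1], coef[-1]
    return (lambda x: (x - specimen) @ w + b), w


specimen = np.array([0.2, -0.1])
const = constant_surrogate(specimen)
surrogate, grad = fit_local_linear_surrogate(specimen)
print("constant surrogate output:", const(specimen + 0.3))
print("local surrogate gradient:", grad)
```

In an actual attack, the surrogate's gradient would be used in place of the module's own gradient during the backward pass, and the surrogate would be refit for each specimen. How large the sampling radius should be, and whether a linear fit is expressive enough, depends on the defense.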
