Integration of Large Language Models (LLMs) with Ghidra #7088
Replies: 9 comments 2 replies
-
Hello @everyone, I would like to know from the community: why has this idea received so many downvotes so far?
-
I would say just forge ahead and start making such a thing as a Ghidra extension on your own, and ask for help with particular things you get stuck on along the way. Maybe your extension will need changes to Ghidra itself. I use LLMs all the time when I'm coding, but I've never tried to interface with an LLM programmatically, so I have little clue where to start. This will work best if you have a very clear vision of where you would start. In the meantime you can definitely already copy and paste from the listing or the decompiler into an LLM site and ask how the code works, etc. So far LLMs are based on text, so at present they wouldn't be able to help with resolving jump tables or other things that involve analysing binary bytes rather than already disassembled code. In the future they will probably be able to find exploits, see through obfuscation, follow values passing through registers and memory addresses, etc. Probably the most obvious and straightforward place to start would be an "add comments" function.
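A minimal sketch of what such an "add comments" script could look like, written as a GhidraScript (the same API an extension would build on). The `http://localhost:8080/complete` endpoint and the plain-text request/response format are assumptions standing in for whatever LLM API you actually use; the Ghidra calls (`getFunctionContaining`, `getListing().getInstructions`, `setPlateComment`) are the real ones.

```java
// AddLlmCommentScript.java - sketch only; the LLM endpoint below is hypothetical.
//@category LLM
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.Instruction;
import ghidra.program.model.listing.InstructionIterator;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AddLlmCommentScript extends GhidraScript {

    @Override
    protected void run() throws Exception {
        Function func = getFunctionContaining(currentAddress);
        if (func == null) {
            println("Place the cursor inside a function first.");
            return;
        }

        // Collect the disassembly of the current function as plain text.
        StringBuilder asm = new StringBuilder();
        InstructionIterator it =
            currentProgram.getListing().getInstructions(func.getBody(), true);
        while (it.hasNext() && !monitor.isCancelled()) {
            Instruction ins = it.next();
            asm.append(ins.getAddress()).append("  ").append(ins).append('\n');
        }

        // Ask the model for a short summary. The endpoint and payload format
        // are placeholders - substitute your provider's real API here.
        String prompt = "Summarize what this function does in two sentences:\n" + asm;
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
            .newBuilder(URI.create("http://localhost:8080/complete"))
            .header("Content-Type", "text/plain")
            .POST(HttpRequest.BodyPublishers.ofString(prompt))
            .build();
        String summary = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Attach the answer as a plate comment above the function.
        setPlateComment(func.getEntryPoint(), "LLM summary:\n" + summary);
        println("Comment added to " + func.getName());
    }
}
```

Run it from the Script Manager with the cursor inside the function of interest. A production version would also need JSON encoding, authentication, and some way to truncate very large functions before sending them to the model.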
-
Thank you for the encouragement and helpful suggestions! I'll start developing the Ghidra extension independently to integrate LLMs, beginning with the "add comments" feature. I'll reach out if I need any assistance along the way. Looking forward to contributing!
-
You might take a look at discussion #6045, which speculates that this kind of work could help make sense of binaries compiled with vector extensions enabled. The GCC test suite provides a decent training set of C source code compiled to vector instruction sequences for different machine architectures. Optimizing compilers can turn simple loops over arrays of structures into code sequences Ghidra struggles with.
-
I just had an idea. Maybe the next easiest AI feature to implement would be naming variables and functions, and it might be more useful than adding comments too.
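The Ghidra side of that would be small once the model has suggested names: it comes down to `setName` with `SourceType.USER_DEFINED` on the function, its parameters, and its locals. In the sketch below, `suggestName` is a placeholder for the prompt/parse step that would query the model; only the `Function` and `Variable` calls are the real Ghidra API.

```java
// RenameFromLlmScript.java - sketch; suggestName() stands in for the model call.
//@category LLM
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.Parameter;
import ghidra.program.model.listing.Variable;
import ghidra.program.model.symbol.SourceType;

public class RenameFromLlmScript extends GhidraScript {

    @Override
    protected void run() throws Exception {
        Function func = getFunctionContaining(currentAddress);
        if (func == null) {
            println("Place the cursor inside a function first.");
            return;
        }

        // Hypothetical: ask the model for a better function name based on its body.
        func.setName(suggestName("function", func.getName()), SourceType.USER_DEFINED);

        // Same idea for parameters and local variables.
        for (Parameter param : func.getParameters()) {
            param.setName(suggestName("parameter", param.getName()), SourceType.USER_DEFINED);
        }
        for (Variable local : func.getLocalVariables()) {
            local.setName(suggestName("local", local.getName()), SourceType.USER_DEFINED);
        }
        println("Renamed " + func.getName() + " and its variables.");
    }

    // Placeholder: a real implementation would send the decompiled function to an
    // LLM and parse a suggested identifier out of the response.
    private String suggestName(String kind, String oldName) {
        return oldName + "_suggested";
    }
}
```

A real version would need to handle `DuplicateNameException` and validate that the model's suggestions are legal identifiers before applying them.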
-
Thank you @thixotropist and @hippietrail. Great insights so far.
-
You likely need some good examples of what you would like your tool to generate. I like using Whisper.cpp as the C and C++ source, built with the latest GCC compiler with vector extensions enabled and various machine-architecture settings. Your tool might start by analyzing the Ghidra assembly and decompiler windows and saying 'that looks like an inlined vector dot product'. The trick is to go further and have it say '... but this implementation includes a serious buffer overrun/underrun error.'
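A sketch of that "what does this look like, and is it buggy?" step: pull the decompiler's output for the current function and wrap it in a prompt asking for both an identification and any memory-safety concerns. The `DecompInterface` usage is the real Ghidra API; `askModel` is a placeholder for the HTTP call to whichever model is used.

```java
// IdentifyPatternScript.java - sketch; askModel() is a placeholder for the LLM call.
//@category LLM
import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileResults;
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;

public class IdentifyPatternScript extends GhidraScript {

    @Override
    protected void run() throws Exception {
        Function func = getFunctionContaining(currentAddress);
        if (func == null) {
            println("Place the cursor inside a function first.");
            return;
        }

        // Get the decompiler's C output for the current function.
        DecompInterface decomp = new DecompInterface();
        try {
            decomp.openProgram(currentProgram);
            DecompileResults results = decomp.decompileFunction(func, 60, monitor);
            if (!results.decompileCompleted()) {
                println("Decompilation failed: " + results.getErrorMessage());
                return;
            }
            String c = results.getDecompiledFunction().getC();

            // Ask for a high-level identification plus potential memory-safety bugs.
            String prompt = "This C code was produced by a decompiler. "
                + "What well-known routine does it resemble (e.g. a vector dot product), "
                + "and do you see any buffer overrun/underrun risks?\n\n" + c;
            println(askModel(prompt));
        }
        finally {
            decomp.dispose();
        }
    }

    // Placeholder: swap in a real request to your LLM provider of choice.
    private String askModel(String prompt) {
        return "(model response would appear here)";
    }
}
```

Feeding the decompiled C rather than raw disassembly keeps the prompt closer to the text the model was trained on, which is likely to matter for pattern recognition like the dot-product example above.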
-
There are certainly some cases where LLMs can be useful in the context of reverse engineering. However, they do not always provide useful information, which makes their integration into Ghidra contentious. I would like to see more evidence of consistent and reliable results before we support such a capability.
-
Thanks @ghidra2, point taken.
-
Description:
Propose integrating LLMs (e.g., OpenAI's GPT) to enhance Ghidra's analysis capabilities by providing natural language explanations of complex assembly code and automating the generation of documentation.
Benefits:
Proposed Implementation:
Additional Context:
Integrating LLMs can bridge the gap between low-level code analysis and high-level understanding, making Ghidra an even more powerful tool for both novice and experienced analysts.