Integration of Large Language Models (LLMs) with Ghidra #7088
Replies: 9 comments 2 replies
-
Hello @everyone, I would like to know from the community: why has this idea received so many downvotes so far?
-
I would say just forge ahead and start making such a thing as a Ghidra extension on your own, and ask for help with particular things you get stuck on along the way. Maybe your extension will need changes to Ghidra itself. I use LLMs all the time when I'm coding, but I've never tried to interface with an LLM programmatically, so I have little clue where to start. This will work best if you have a very clear vision of where you would start. In the meantime you can definitely already copy and paste from the listing or the decompiler into an LLM site and ask how the code works, etc. So far LLMs are based on text, so at present they wouldn't be able to help with resolving jump tables or other things that involve analysing binary bytes rather than already disassembled code. In the future they will probably be able to find exploits, see through obfuscation, follow values passing through registers and memory addresses, etc. Probably the most obvious and straightforward place to start would be an "add comments" function.
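A minimal sketch of what such an "add comments" script could look like, written as a GhidraScript (the same API an extension would build on). The `http://localhost:8080/complete` endpoint and the plain-text request/response format are assumptions standing in for whatever LLM API you actually use; the Ghidra calls (`getFunctionContaining`, `getListing().getInstructions`, `setPlateComment`) are the real ones.

```java
// AddLlmCommentScript.java - sketch only; the LLM endpoint below is hypothetical.
//@category LLM
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.Instruction;
import ghidra.program.model.listing.InstructionIterator;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AddLlmCommentScript extends GhidraScript {

    @Override
    protected void run() throws Exception {
        Function func = getFunctionContaining(currentAddress);
        if (func == null) {
            println("Place the cursor inside a function first.");
            return;
        }

        // Collect the disassembly of the current function as plain text.
        StringBuilder asm = new StringBuilder();
        InstructionIterator it =
            currentProgram.getListing().getInstructions(func.getBody(), true);
        while (it.hasNext() && !monitor.isCancelled()) {
            Instruction ins = it.next();
            asm.append(ins.getAddress()).append("  ").append(ins).append('\n');
        }

        // Ask the model for a short summary. The endpoint and payload format
        // are placeholders - substitute your provider's real API here.
        String prompt = "Summarize what this function does in two sentences:\n" + asm;
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
            .newBuilder(URI.create("http://localhost:8080/complete"))
            .header("Content-Type", "text/plain")
            .POST(HttpRequest.BodyPublishers.ofString(prompt))
            .build();
        String summary = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Attach the answer as a plate comment above the function.
        setPlateComment(func.getEntryPoint(), "LLM summary:\n" + summary);
        println("Comment added to " + func.getName());
    }
}
```

Run it from the Script Manager with the cursor inside the function of interest. A production version would also need JSON encoding, authentication, and some way to truncate very large functions before sending them to the model.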
-
Thank you for the encouragement and helpful suggestions! I'll start developing the Ghidra extension independently to integrate LLMs, beginning with the "add comments" feature. I'll reach out if I need any assistance along the way. Looking forward to contributing!
-
You might take a look at discussion #6045, which speculates that this kind of work could help make sense of binaries compiled with vector extensions enabled. The GCC test suite provides a decent training set of C source code compiled to vector instruction sequences for different machine architectures. Optimizing compilers can turn simple loops over arrays of structures into code sequences Ghidra struggles with.
-
I just had an idea. Maybe the next easiest AI feature to implement would be naming variables and functions, and it might be more useful than adding comments too.
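The Ghidra side of that would be small once the model has suggested names: it comes down to `setName` with `SourceType.USER_DEFINED` on the function, its parameters, and its locals. In the sketch below, `suggestName` is a placeholder for the prompt/parse step that would query the model; only the `Function` and `Variable` calls are the real Ghidra API.

```java
// RenameFromLlmScript.java - sketch; suggestName() stands in for the model call.
//@category LLM
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.Parameter;
import ghidra.program.model.listing.Variable;
import ghidra.program.model.symbol.SourceType;

public class RenameFromLlmScript extends GhidraScript {

    @Override
    protected void run() throws Exception {
        Function func = getFunctionContaining(currentAddress);
        if (func == null) {
            println("Place the cursor inside a function first.");
            return;
        }

        // Hypothetical: ask the model for a better function name based on its body.
        func.setName(suggestName("function", func.getName()), SourceType.USER_DEFINED);

        // Same idea for parameters and local variables.
        for (Parameter param : func.getParameters()) {
            param.setName(suggestName("parameter", param.getName()), SourceType.USER_DEFINED);
        }
        for (Variable local : func.getLocalVariables()) {
            local.setName(suggestName("local", local.getName()), SourceType.USER_DEFINED);
        }
        println("Renamed " + func.getName() + " and its variables.");
    }

    // Placeholder: a real implementation would send the decompiled function to an
    // LLM and parse a suggested identifier out of the response.
    private String suggestName(String kind, String oldName) {
        return oldName + "_suggested";
    }
}
```

A real version would need to handle `DuplicateNameException` and validate that the model's suggestions are legal identifiers before applying them.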
-
Thank you @thixotropist and @hippietrail. Great insights so far.
-
You likely need some good examples of what you would like your tool to generate. I like using Whisper.cpp as the C and C++ source, built with the latest GCC compiler with vector extensions enabled and various machine-architecture settings. Your tool might start by analyzing the Ghidra assembly and decompiler windows and saying 'that looks like an inlined vector dot product'. The trick is to go further and have it say '... but this implementation includes a serious buffer overrun/underrun error.'
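A sketch of that "what does this look like, and is it buggy?" step: pull the decompiler's output for the current function and wrap it in a prompt asking for both an identification and any memory-safety concerns. The `DecompInterface` usage is the real Ghidra API; `askModel` is a placeholder for the HTTP call to whichever model is used.

```java
// IdentifyPatternScript.java - sketch; askModel() is a placeholder for the LLM call.
//@category LLM
import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileResults;
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;

public class IdentifyPatternScript extends GhidraScript {

    @Override
    protected void run() throws Exception {
        Function func = getFunctionContaining(currentAddress);
        if (func == null) {
            println("Place the cursor inside a function first.");
            return;
        }

        // Get the decompiler's C output for the current function.
        DecompInterface decomp = new DecompInterface();
        try {
            decomp.openProgram(currentProgram);
            DecompileResults results = decomp.decompileFunction(func, 60, monitor);
            if (!results.decompileCompleted()) {
                println("Decompilation failed: " + results.getErrorMessage());
                return;
            }
            String c = results.getDecompiledFunction().getC();

            // Ask for a high-level identification plus potential memory-safety bugs.
            String prompt = "This C code was produced by a decompiler. "
                + "What well-known routine does it resemble (e.g. a vector dot product), "
                + "and do you see any buffer overrun/underrun risks?\n\n" + c;
            println(askModel(prompt));
        }
        finally {
            decomp.dispose();
        }
    }

    // Placeholder: swap in a real request to your LLM provider of choice.
    private String askModel(String prompt) {
        return "(model response would appear here)";
    }
}
```

Feeding the decompiled C rather than raw disassembly keeps the prompt closer to the text the model was trained on, which is likely to matter for pattern recognition like the dot-product example above.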
-
There are certainly some cases where LLMs can be useful in the context of reverse engineering. However, they do not always provide useful information, which makes their integration into Ghidra contentious. I would like to see more evidence of consistent and reliable results before we support such a capability.
-
Thanks @ghidra2, point taken.
-
Description:
Propose integrating LLMs (e.g., OpenAI's GPT) to enhance Ghidra's analysis capabilities by providing natural language explanations of complex assembly code and automating the generation of documentation.
Benefits:
Proposed Implementation:
Additional Context:
Integrating LLMs can bridge the gap between low-level code analysis and high-level understanding, making Ghidra an even more powerful tool for both novice and experienced analysts.