-
Notifications
You must be signed in to change notification settings - Fork 625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Github Coding Agent - 5% on SWE Bench #150
base: main
Are you sure you want to change the base?
Conversation
726fcab
to
e63865c
Compare
e63865c
to
6dc1f87
Compare
@aidando73 Thanks for the PR, I think this is a neat use case and it would be helpful to showcase llama-stack capability. Will take a look at your PR soon. |
@heyjustinai Sweet that's good to hear 👍. I'm keen to also add:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall a great PR, I think there are quite a bit of potential here and having this as an agentic app showcasing a popular use case amongst developers, I do think there are some changes that needs to be made to get this PR across the finish line.
Since the current app only uses llama-stack for inference, the most important one being adhering to the our notion of tools and agentic flow.
from subprocess import run | ||
|
||
# Currently only supports 3.3-70B-Instruct at the moment since it depends on the 3.3/3.2 tool prompt format | ||
MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add this alongside other constant to an .env file
# Check if repo already exists and remove it if it does | ||
if not os.path.exists(repo_path): | ||
print("Cloning repo...") | ||
os.system( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: switching to subprocess.run() would be safer and gives better control of return codes and exceptions
def validate_not_symlink(path: str) -> Optional[str]: | ||
if os.path.islink(translate_path(path)): | ||
return f"ERROR - File {path} is a symlink. Simlinks not allowed" | ||
return None | ||
|
||
def validate_file_exists(path: str) -> Optional[str]: | ||
if not os.path.exists(translate_path(path)): | ||
return f"ERROR - File {path} does not exist. Please ensure the path is an absolute path and that the file exists." | ||
return None | ||
|
||
def validate_not_a_directory(path: str) -> Optional[str]: | ||
if os.path.isdir(translate_path(path)): | ||
return f"ERROR - File {path} is a directory. Please ensure the path references a file, not a directory." | ||
return None | ||
|
||
def validate_directory_exists(path: str) -> Optional[str]: | ||
if not os.path.exists(translate_path(path)): | ||
return f"ERROR - Directory {path} does not exist. Please ensure the path is an absolute path and that the directory exists." | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consider merging these function into one validate_safe_file() function, would reduce duplication
|
||
SANDBOX_DIR = os.path.join(REPO_DIR, "sandbox") | ||
# We give the agent a virtual working directory so it doesn't have to worry about long absolute paths | ||
AGENT_WORKING_DIR = "/workspace/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: on the comment about constant
this could be: AGENT_WORKING_DIR = os.getenv("AGENT_WORKING_DIR", "/workspace/")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you are currently only using llama-stack for inference - given that we want this to be a llama-stack-app implementation and also example for the community, lets refactor so that it follow llama-stack agent paradigm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see.
I did consider using the agent framework for this. But I found that I needed too much low-level control. Even the chat_completion api was not enough control as I needed access to the raw tool call prompt & tool call parsing.
It's likely refactoring to the agent paradigm will negatively impact performance on SWE bench lite since a lot of performance relies on the custom tool call parsing and tool prompt.
In this case, wdyt of this being put into llama-recipes? Since there isn't that constraint, while still being accessible to the community.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those are good feedback, could you share a bit more about "But I found that I needed too much low-level control." what are some of the features you need that you found insufficient from llama-stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @heyjustinai sorry for the slow replies - I will find the time to properly respond to this message.
I've been working on improving performance aidando73/l2-llama#2, on this coding agent and I've gotten to 11.3% on 3.3 70B! But not currently in this PR yet - still prototyping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those are good feedback, could you share a bit more about "But I found that I needed too much low-level control." what are some of the features you need that you found insufficient from llama-stack?
Sure 👍:
-
Tool Call Syntax Error Handling:
- The default chat_completions parser struggles with common syntax patterns like trailing brackets or incomplete tool calls
- I found many cases where simple inputs failed, like:
view_file(path='django/contrib/admin/views/main.py')] view_file(path='django/contrib/admin/views/main.py')
- I implemented a basic bracket-patching solution that significantly reduced errors, but ideally this would be handled natively
-
Pre-emptive Tool Call Formatting:
- I achieved a substantial reduction in syntax errors (~50-60%) by pre-seeding the response with the tool call format:
message += '{"type": "function", "name": "' response = client.inference.completion( model_id=MODEL_ID, content=message, )
- This prevented problematic formatted responses (like JSON with comments) that the parser couldn't handle
- However, this kind of pre-empting isn't easily achievable through the current chat_completions API
- I achieved a substantial reduction in syntax errors (~50-60%) by pre-seeding the response with the tool call format:
-
Mixed Content Handling:
- The current parser design doesn't support mixing reasoning/thinking with tool calls in the same response
- I've found for complex tasks, being able to include chain-of-thought reasoning before tool calls increases performance
- I worked around this using XML tags (
<thinking>
and<tool>
), but this had varying success across model sizes
^ wdyt? Could be useful to loop in Ashwin & the team to see what they think as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heyjustinai wdyt of this approach:
- We start with this agent in llama-recipes, where it can take whatever form it needs
- We create a separate version in Llama-stack, that specifically uses the Llama-stack agent paradigm - we accept whatever performance we can get.
- Overtime, we add the features required to get more and more performance out of it - using the agent in llama-recipes as reference.
- We can create issues, saying - we need feature X, to get Z% uplift, as shown in the llama-recipes agent.
That way the examples will be accessible to the community, we don't compromise on usefulness/performance, while being a useful reference to guide Llama-stack development.
wdyt? If that sgty, I'll submit a PR to llama-recipes and find the time to create the separate version in Llama-stack, evaluate it, and then get you to review?
formatter = ChatFormat(Tokenizer.get_instance()) | ||
|
||
|
||
def run_agent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider breaking it down into smaller helper functions for better readability and maintainability, this run_agent function is quite large
|
||
# Currently only supports 3.3-70B-Instruct at the moment since it depends on the 3.3/3.2 tool prompt format | ||
MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct" | ||
ITERATIONS = 15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to constant comment - AGENT_ITERATIONS as constant
Hey @heyjustinai, thanks for the review. I will address your comments but thought I would wait on the outcome of this discussion: #150 (comment) before going ahead. |
Update: synced with @heyjustinai
Will work on refactoring 👍. Going to mark this as draft until it's ready again. |
What does this PR do?
Given a GitHub issue, agent can produce a fix and submit a PR. Demo:
demo1.mp4
Wdyt of merging adding this as llama-stack-app?
Feature/Issue validation/testing/test plan
Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration or test plan.
Unit tests
Follow the README.md for setuppytest --ignore=sandbox
Manual tests
Issue: https://github.com/aidando73/django/issues
PR: aidando73/django#14
Logs for the demo scenario:
Sources
Before submitting
Pull Request section?
to it if that's the case.
Thanks for contributing 🎉!