This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.
Releases
[email protected]
Patch Changes
#14
05da890
Thanks @zaripych! - feat: evaluate refactor outcomes using an LLM to decide whether a file edit should be accepted or discarded
This is a big change that adds extra steps to the refactor process. Every time an LLM produces a file edit, we pass that edit through an evaluation algorithm to assess whether it should be accepted or discarded. Previously, this decision depended only on the presence or absence of eslint errors. This will make the final result higher quality and more reliable.
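As a rough illustration of the accept-or-discard step described above, the decision could be sketched as follows. This is not the package's actual code: the names `EditChecks`, `DecisionOptions`, `decideOnEdit`, the numeric `evaluationScore`, and the `minScore` threshold are all hypothetical, and the sketch assumes that lint errors still cause an edit to be discarded before the LLM evaluation is consulted.

```typescript
// Hypothetical types for illustration only; not the package's real API.
type EditChecks = {
  // Number of eslint errors reported after applying the edit.
  lintErrorCount: number;
  // Score in [0, 1] produced by an LLM-based evaluator (assumed shape).
  evaluationScore: number;
};

type DecisionOptions = {
  // Mirrors the `evaluate` setting mentioned in the release notes.
  evaluate: boolean;
  // Assumed threshold below which an edit is discarded.
  minScore: number;
};

type Decision = 'accept' | 'discard';

function decideOnEdit(checks: EditChecks, opts: DecisionOptions): Decision {
  // Assumption: lint errors still disqualify an edit, as before this change.
  if (checks.lintErrorCount > 0) {
    return 'discard';
  }
  // With evaluation disabled, lint errors alone drive the decision.
  if (!opts.evaluate) {
    return 'accept';
  }
  // Otherwise the LLM evaluation decides.
  return checks.evaluationScore >= opts.minScore ? 'accept' : 'discard';
}
```

With `evaluate` set to false, the sketch falls back to the lint-only behavior, which is how the release notes describe the previous logic.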
The new behavior can be disabled by setting `evaluate: false` in the `goal.md` file.
In addition, this change adds a new CLI command for internal use that lets us compare the results of multiple refactor runs. This is useful for benchmarking purposes.
To run the benchmark, use the following command:
Where the config:
This performs multiple refactor runs and compares the results. At the moment no statistical analysis is performed, as I'm not convinced we can reach statistical significance with a number of runs that doesn't also make you poor.