Right now, I'm using a fork of OpenAI evals over here: https://github.com/tju01/evals. This is for multiple reasons:

- OpenAI evals doesn't log model outputs and other information to disk; it only prints the final results at the end. However, I do believe that looking at the actual outputs can be very useful, so I added code to also record model outputs, prompts, etc. to disk for later use on the website (see the first sketch after this list).
- OpenAI evals is not really designed to evaluate other models like OpenAssistant; it was mostly open-sourced so that people can contribute evals for OpenAI's GPT models. Some small changes were therefore needed to evaluate OpenAssistant (see the second sketch after this list).
- Other modifications that I would like to make, such as limiting the number of evaluated samples per task to 20, require changes to the OpenAI evals codebase itself and can't easily be done from the outside.
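For the first point, the idea is roughly the following: for every evaluated sample, write one JSON line containing the prompt, the model output and the grading result, so it can be inspected later on the website. This is only a minimal sketch with made-up field names and a hypothetical call site, not the actual code in the fork:

```python
import json
from pathlib import Path

def record_sample(log_path: Path, task: str, prompt: str, completion: str, result: dict) -> None:
    """Append one evaluated sample as a JSON line so it can be inspected later."""
    record = {
        "task": task,
        "prompt": prompt,
        "completion": completion,
        "result": result,  # e.g. {"correct": True, "expected": "4"}
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical usage, next to wherever the eval grades a sample:
# record_sample(Path("outputs/arithmetic.jsonl"), "arithmetic",
#               "What is 2 + 2?", "4", {"correct": True, "expected": "4"})
```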
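For the second point, the small change essentially amounts to putting a non-OpenAI model behind the same "give me a completion for this prompt" interface that the evals expect. The sketch below uses Hugging Face transformers; the class name, default model name and prompt handling are illustrative assumptions, not the fork's actual implementation:

```python
# Sketch only: wrap an OpenAssistant-style model behind a minimal
# completion interface. Class and model names are illustrative.
from transformers import pipeline

class OpenAssistantCompletionFn:
    def __init__(self, model_name: str = "OpenAssistant/oasst-sft-1-pythia-12b"):
        # Real prompt formatting (e.g. <|prompter|>/<|assistant|> tokens) is omitted here.
        self.generator = pipeline("text-generation", model=model_name)

    def __call__(self, prompt: str, max_new_tokens: int = 256) -> str:
        out = self.generator(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
        return out[0]["generated_text"]

# completion_fn = OpenAssistantCompletionFn()
# print(completion_fn("What is the capital of France?"))
```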
However, OpenAI evals currently moves very quickly, and they recently seem to have added more people to merge new tasks.
This codebase and https://github.com/tju01/evals should continuously keep up with OpenAI evals. Right now, there seems to be a merge conflict, despite the fact that I tried to keep the changes minimal. One also needs to figure out which new evals were added and whether they would be useful for OpenAssistant.