Right now, I'm using a fork of OpenAI evals over here: https://github.com/tju01/evals. This is for multiple reasons:

- OpenAI evals doesn't log model outputs and other information to disk; it only prints the final results at the end. However, I do believe that looking at the actual outputs can be very useful, so I added code to also record model outputs, prompts, etc. to disk for later use on the website (see the first sketch after this list).
- OpenAI evals is not really designed to evaluate other models like OpenAssistant; it was mostly open-sourced so that people can contribute evals for OpenAI's GPT models. Some small changes were therefore needed to evaluate OpenAssistant (see the second sketch after this list).
- Other modifications that I would like to make, such as limiting the number of evaluated samples per task to 20, require changes to the OpenAI evals codebase itself and can't easily be done from the outside.
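For the first point, the idea is roughly the following: for every evaluated sample, write one JSON line containing the prompt, the model output and the grading result, so it can be inspected later on the website. This is only a minimal sketch with made-up field names and a hypothetical call site, not the actual code in the fork:

```python
import json
from pathlib import Path

def record_sample(log_path: Path, task: str, prompt: str, completion: str, result: dict) -> None:
    """Append one evaluated sample as a JSON line so it can be inspected later."""
    record = {
        "task": task,
        "prompt": prompt,
        "completion": completion,
        "result": result,  # e.g. {"correct": True, "expected": "4"}
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical usage, next to wherever the eval grades a sample:
# record_sample(Path("outputs/arithmetic.jsonl"), "arithmetic",
#               "What is 2 + 2?", "4", {"correct": True, "expected": "4"})
```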
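For the second point, the small change essentially amounts to putting a non-OpenAI model behind the same "give me a completion for this prompt" interface that the evals expect. The sketch below uses Hugging Face transformers; the class name, default model name and prompt handling are illustrative assumptions, not the fork's actual implementation:

```python
# Sketch only: wrap an OpenAssistant-style model behind a minimal
# completion interface. Class and model names are illustrative.
from transformers import pipeline

class OpenAssistantCompletionFn:
    def __init__(self, model_name: str = "OpenAssistant/oasst-sft-1-pythia-12b"):
        # Real prompt formatting (e.g. <|prompter|>/<|assistant|> tokens) is omitted here.
        self.generator = pipeline("text-generation", model=model_name)

    def __call__(self, prompt: str, max_new_tokens: int = 256) -> str:
        out = self.generator(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
        return out[0]["generated_text"]

# completion_fn = OpenAssistantCompletionFn()
# print(completion_fn("What is the capital of France?"))
```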
However, OpenAI evals currently moves very quickly, and they recently seem to have added more people to merge new tasks.
This codebase and https://github.com/tju01/evals should continuously keep up with OpenAI evals. Right now, there seems to be a merge conflict, despite the fact that I tried to keep the changes minimal. One also needs to figure out which new evals were added and whether they would be useful for OpenAssistant.