Describe the feature or improvement you're requesting
An eval set is useful for running a group of evals at the same time. Currently, an eval set is just a collection of independent evals, and the `oaievalset` command is simply a wrapper that runs multiple `oaieval` commands concurrently.
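(For example, a command like `oaievalset gpt-3.5-turbo test` runs every eval in the `test` set concurrently, which is equivalent to launching `oaieval gpt-3.5-turbo <eval>` once per eval in the set; the model and set names here are just illustrative.)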
I think it would be useful to analyze the data from an eval set as a whole, especially when all evals in the set share the same metric. In that case, we are effectively running the same experiment with similar questions, split into separate evals only because the data is classified differently. For example, suppose we want to evaluate an LLM's performance at detecting spam in different languages: we want the accuracy for each language, as well as the overall detection accuracy across all spam. It would be great if an eval set could generate this kind of overall report automatically, as in the sketch below.
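To make the request concrete, here is a minimal sketch of the kind of aggregation an eval set could perform. It assumes each eval run writes a JSONL log whose final record carries an accuracy and a sample count; the field names (`final_report`, `accuracy`, `counts`), the eval names, and the log paths are all illustrative assumptions, not the framework's actual schema.

```python
import json
from pathlib import Path


def load_final_report(log_path: Path) -> dict:
    """Return the final report record from one eval's JSONL log.

    Assumes (hypothetically) the log contains a line with a
    "final_report" key written at the end of the run.
    """
    with log_path.open() as f:
        for line in f:
            record = json.loads(line)
            if "final_report" in record:
                return record["final_report"]
    raise ValueError(f"no final report found in {log_path}")


def aggregate_accuracy(log_paths: dict[str, Path]) -> None:
    """Print per-eval accuracy plus a sample-weighted overall accuracy.

    `log_paths` maps an eval name (e.g. a language) to its log file.
    The "accuracy" and "counts" fields are assumed names for this
    sketch, not guaranteed by the framework.
    """
    total_correct = 0.0
    total_samples = 0
    for name, path in log_paths.items():
        report = load_final_report(path)
        accuracy = report["accuracy"]
        n = report["counts"]
        total_correct += accuracy * n
        total_samples += n
        print(f"{name}: accuracy={accuracy:.3f} (n={n})")
    print(f"overall: accuracy={total_correct / total_samples:.3f} "
          f"(n={total_samples})")


if __name__ == "__main__":
    # Hypothetical per-language spam-detection evals from one eval set.
    aggregate_accuracy({
        "spam-detection-en": Path("logs/spam_en.jsonl"),
        "spam-detection-fr": Path("logs/spam_fr.jsonl"),
    })
```

Weighting each eval's accuracy by its sample count keeps the overall number meaningful when the per-language datasets differ in size; a plain average of the per-eval accuracies would overweight small evals.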
Additional context
This feature request is an idea for Evals, the framework itself, not for adding new evals.