Dataset format #3

rajivpoddar · 2023-09-21T17:03:32Z

Is there a particular dataset format required for fine-tuning CodeLlama? My dataset is in the format OpenAI suggests: a JSONL file in which each entry is an object of the form {messages: [{role: 'system', content: '<system prompt>'}, {role: 'user', content: '<user prompt>'}, {role: 'assistant', content: '<assistant reply>'}]}. Will this format work?

mzbac · 2023-09-22T06:57:29Z

You have to map your dataset to the prompt text yourself in formatting_prompts_func.
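A minimal sketch of what that mapping could look like for the OpenAI-style records from the question. It assumes the function receives a batch (a dict of columns) with a `messages` column, that each record has at most one turn per role, and it uses a `### Input` / `### Output` template purely for illustration; the template your training script expects may differ.

```python
def formatting_prompts_func(example):
    """Turn a batch of OpenAI-style chat records into plain training strings."""
    output_texts = []
    for messages in example["messages"]:
        # Index the turns by role. Assumption: at most one system, user,
        # and assistant message per record (later turns would overwrite).
        parts = {m["role"]: m["content"] for m in messages}
        # Illustrative template only; swap in whatever your model expects.
        text = (
            f"{parts.get('system', '')}\n"
            f"### Input: {parts.get('user', '')}\n"
            f"### Output: {parts.get('assistant', '')}"
        )
        output_texts.append(text)
    return output_texts


# Hypothetical one-record batch in the shape described in the question.
batch = {"messages": [[
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Reverse a string in Python."},
    {"role": "assistant", "content": "s[::-1]"},
]]}
print(formatting_prompts_func(batch)[0])
```

Multi-turn conversations would need a loop over the turns instead of the role lookup used here.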

chartgod · 2024-01-10T01:29:45Z

    def formatting_prompts_func(example):
        output_texts = []
        for i in range(len(example['prompt'])):
            text = f"An AI tool that corrects and rephrase user text grammar errors delimited by triple backticks to standard English.\n### Input: `{example['prompt'][i]}`\n ### Output: {example['completion'][i]}"
            output_texts.append(text)
        return output_texts

To create a training format from this source code, ~~[INST] < <>{text} [/INST]~~

text = """
A: , B: ""

Can I use it like this?
