Dataset format #3

Open
rajivpoddar opened this issue Sep 21, 2023 · 2 comments

rajivpoddar commented Sep 21, 2023

Is there a particular dataset format required for fine-tuning Code Llama? My dataset is in the format OpenAI suggests: a JSONL file in which each entry is an object of the form {messages: [{role: 'system', content: '<system prompt>'}, {role: 'user', content: '<user prompt>'}, {role: 'assistant', content: '<assistant reply>'}]}. Will this format work?

mzbac (Owner) commented Sep 22, 2023

No particular format is required, but you have to map your dataset to the training text yourself in formatting_prompts_func.
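As an illustration of that mapping, here is a minimal sketch that flattens OpenAI-style `messages` entries into one training string per example. It assumes `formatting_prompts_func` receives a batch whose `messages` field is a list of message lists; the helper name `messages_to_text` and the `### Role:` layout are hypothetical choices, not something the repository prescribes.

```python
import json


def messages_to_text(messages):
    """Flatten one OpenAI-style `messages` list into a single training string."""
    parts = []
    for message in messages:
        # Label each turn with its role; this layout is an assumption.
        role = message["role"].capitalize()
        parts.append(f"### {role}:\n{message['content']}")
    return "\n\n".join(parts)


def formatting_prompts_func(example):
    # Assumes a batched example: example["messages"] is a list of message lists.
    return [messages_to_text(messages) for messages in example["messages"]]


# One JSONL line in the OpenAI chat format (made-up sample data).
line = json.dumps({
    "messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
})
batch = {"messages": [json.loads(line)["messages"]]}
texts = formatting_prompts_func(batch)
print(texts[0])
```

Any other layout works the same way, as long as the function returns one string per example in the batch.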

@chartgod

def formatting_prompts_func(example):
    # `example` is a batch: each column holds a list of values.
    output_texts = []
    for i in range(len(example['prompt'])):
        # Build one training string per row from the prompt and completion columns.
        text = f"An AI tool that corrects and rephrase user text grammar errors delimited by triple backticks to standard English.\n### Input: {example['prompt'][i]}\n ### Output: {example['completion'][i]}"
        output_texts.append(text)
    return output_texts

Starting from this code, I want to build a training format that uses an instruction template such as [INST] < <>{text} [/INST], with:

text = """
A: , B: ""

Can I use it like this?
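One way to try the idea is sketched below. It assumes the template in the question was meant to be the Llama 2 chat layout with `<<SYS>>` tags (the tags in the comment above are garbled, so this is a guess), and that the dataset has `prompt` and `completion` columns as in the function above. `SYSTEM_PROMPT` and `build_inst_prompt` are hypothetical names, and the system prompt text is a placeholder.

```python
# Placeholder system prompt; replace it with your own instruction text.
SYSTEM_PROMPT = "Correct the grammar of the user's text."


def build_inst_prompt(system, user, answer):
    """Wrap one example in a Llama 2 chat-style template (assumed layout)."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST] {answer} </s>"


def formatting_prompts_func(example):
    # `example` is a batch, so iterate over the two columns in parallel.
    output_texts = []
    for prompt, completion in zip(example["prompt"], example["completion"]):
        output_texts.append(build_inst_prompt(SYSTEM_PROMPT, prompt, completion))
    return output_texts


# Made-up sample batch to show the resulting string.
batch = {"prompt": ["he go home"], "completion": ["He goes home."]}
texts = formatting_prompts_func(batch)
print(texts[0])
```

If the tokenizer already adds the beginning-of-sequence token, the literal `<s>` in the template may need to be dropped so it is not duplicated.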
