Specify length of arrays in structured outputs #372

Open
JaimeArboleda opened this issue Dec 4, 2024 · 1 comment

I am giving GPT-4o a paged document in Markdown format, with page separators like these:

{PAGE 1}------------------------------------------------

![](_page_0_Picture_2.png)

**PLIEGO DE CLÁUSULAS ADMINISTRATIVAS PARTICULARES QUE HA DE REGIR EN EL CONTRATO DE SUMINISTRO E INSTALACIÓN DE DIVERSO EQUIPAMIENTO ELECTROMÉDICO (I) A ADJUDICAR POR PROCEDIMIENTO ABIERTO CON PLURALIDAD DE CRITERIOS** 
...
{PAGE 2}------------------------------------------------
whatever...
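
Since the goal depends on knowing how many pages a document has, here is a minimal sketch of counting them from the separators. The regular expression is an assumption based only on the two separator lines shown above, and `page_numbers` is a hypothetical helper name, not part of any library.

```python
import re

# Matches separator lines such as "{PAGE 12}-----". The exact pattern is an
# assumption inferred from the two separators shown in the sample document.
PAGE_SEPARATOR = re.compile(r"^\{PAGE (\d+)\}-+\s*$", re.MULTILINE)


def page_numbers(markdown: str) -> list[int]:
    """Return the page numbers found in separator lines, in document order."""
    return [int(match.group(1)) for match in PAGE_SEPARATOR.finditer(markdown)]


sample = (
    "{PAGE 1}------------------------------------------------\n"
    "first page text\n"
    "{PAGE 2}------------------------------------------------\n"
    "whatever...\n"
)
num_pages = len(page_numbers(sample))
```

The resulting `num_pages` is the value the page-summary list would need to match.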

Now I am asking GPT to create a global summary plus a summary of each page. I am providing this schema:

from pydantic import BaseModel

class PageSummary(BaseModel):
    page_number: int
    summary: str
    questions: list[str]

class GlobalSummary(BaseModel):
    summary: str

class PagedSummary(BaseModel):
    global_summary: GlobalSummary
    page_summaries: list[PageSummary]

I would like to force the output to contain, in the list of PageSummary objects, exactly as many entries as the document has pages. I don't know whether that is possible. Should I create a different schema for each document dynamically, or is there another way?
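
Whatever the schema allows, the length can at least be checked after the response is parsed. This is only a sketch of such a check: it detects a mismatch but does not force the model to produce the right number of entries. `check_page_coverage` is a hypothetical helper, and the input is assumed to be the parsed `page_summaries` list as plain dictionaries.

```python
def check_page_coverage(page_summaries: list[dict], num_pages: int) -> dict:
    """Compare returned page numbers against the expected range 1..num_pages."""
    seen = [item["page_number"] for item in page_summaries]
    expected = set(range(1, num_pages + 1))
    return {
        "missing": sorted(expected - set(seen)),
        "unexpected": sorted(set(seen) - expected),
        "duplicated": sorted({n for n in seen if seen.count(n) > 1}),
    }


# Hypothetical parsed output for a 3-page document:
# page 2 is missing and page 3 appears twice.
parsed = [
    {"page_number": 1, "summary": "...", "questions": []},
    {"page_number": 3, "summary": "...", "questions": []},
    {"page_number": 3, "summary": "...", "questions": []},
]
report = check_page_coverage(parsed, num_pages=3)
```

If any of the three lists in `report` is non-empty, the request could be retried or the missing pages requested separately.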


JaimeArboleda commented Dec 4, 2024

After asking ChatGPT, I think that creating the model dynamically could be a solution. My preferred approach would be this one:

from typing import Tuple, Type

from pydantic import BaseModel, create_model

class PageSummary(BaseModel):
    page_number: int
    summary: str
    questions: list[str]

class GlobalSummary(BaseModel):
    summary: str

def get_paged_summary_model(num_pages: int, model_name: str = "PagedSummary") -> Type[BaseModel]:
    # Fixed-length tuple type: Tuple[PageSummary, ..., PageSummary] with num_pages entries
    tuple_type = Tuple[(PageSummary,) * num_pages]

    # Both fields are required ("..." means there is no default value)
    return create_model(
        model_name,
        global_summary=(GlobalSummary, ...),
        page_summaries=(tuple_type, ...),
    )

However, I believe it won't work because, according to the documentation:

Some type-specific keywords are not yet supported

Notable keywords not supported include:

  • For strings: minLength, maxLength, pattern, format
  • For numbers: minimum, maximum, multipleOf
  • For objects: patternProperties, unevaluatedProperties, propertyNames, minProperties, maxProperties
  • For arrays: unevaluatedItems, contains, minContains, maxContains, minItems, maxItems, uniqueItems

In particular, the JSON schema generated for this model uses minItems and maxItems to fix the length of the tuple.
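
A small sketch to confirm this locally, assuming Pydantic is installed. It builds the same kind of fixed-length tuple field for a hypothetical 3-page document and inspects the schema of that field. `model_json_schema()` is the Pydantic v2 method and `schema()` the v1 one, so the code tries both.

```python
from typing import Tuple

from pydantic import BaseModel, create_model


class PageSummary(BaseModel):
    page_number: int
    summary: str
    questions: list[str]


num_pages = 3  # hypothetical page count, for illustration only
tuple_type = Tuple[(PageSummary,) * num_pages]
Model = create_model("PagedSummary", page_summaries=(tuple_type, ...))

# Use the v2 method when available, otherwise fall back to the v1 method.
if hasattr(Model, "model_json_schema"):
    schema = Model.model_json_schema()
else:
    schema = Model.schema()

field_schema = schema["properties"]["page_summaries"]
print({key: field_schema.get(key) for key in ("type", "minItems", "maxItems")})
```

On the Pydantic versions I am assuming here, the printed field schema is an array with both minItems and maxItems equal to the page count, which are exactly the keywords listed as unsupported.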

Is supporting those keywords on the roadmap?

The second approach, which is uglier but could work, would be this one:

class PageSummary(BaseModel):
    summary: str
    questions: list[str]

def get_base_model(num_items: int, model_name: str = "Questions") -> Type[BaseModel]:
    # One required field for the global summary, plus one required field per page
    fields = {"global_summary": (str, ...)}
    for i in range(1, num_items + 1):
        fields[f"page_num_{i}_summary"] = (PageSummary, ...)

    # Build the model with a unique name
    return create_model(model_name, **fields)

# schema_json() is the Pydantic v1 method name; Pydantic v2 provides model_json_schema()
get_base_model(3, "QuestionsSet1").schema_json()

However, the problem here is the limit on the number of properties. Again, according to the documentation:

A schema may have up to 100 object properties total, with up to 5 levels of nesting.

I guess this implies that I won't be able to handle documents with more than 50 pages... :(
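
The exact ceiling depends on how properties are counted, which the quoted sentence does not spell out. As a rough budget for the second approach, here are two hypothetical counting rules. Both rules and the `max_pages` helper are my own assumptions, not something stated in the documentation.

```python
PROPERTY_LIMIT = 100  # from the documentation sentence quoted above


def max_pages(limit: int, root_fixed: int, per_page: int, shared: int = 0) -> int:
    """Largest N such that root_fixed + shared + N * per_page <= limit."""
    return (limit - root_fixed - shared) // per_page


# Rule A (assumption): PageSummary's 2 properties are counted again for every
# page, so each page costs 1 (its field on the root model) + 2 = 3 properties.
pages_if_counted_per_use = max_pages(PROPERTY_LIMIT, root_fixed=1, per_page=3)

# Rule B (assumption): PageSummary is counted once as a shared definition,
# so each page costs only its own field on the root model.
pages_if_counted_once = max_pages(PROPERTY_LIMIT, root_fixed=1, per_page=1, shared=2)

print(pages_if_counted_per_use, pages_if_counted_once)
```

Under either rule the approach stops scaling somewhere below 100 pages, so long documents would still need to be split across several requests.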
