Specify length of arrays in structured outputs #372

JaimeArboleda · 2024-12-04T07:01:00Z

I am giving GPT-4o a paged document in Markdown format, with page separators like these:

{PAGE 1}------------------------------------------------

![](_page_0_Picture_2.png)

**PLIEGO DE CLÁUSULAS ADMINISTRATIVAS PARTICULARES QUE HA DE REGIR EN EL CONTRATO DE SUMINISTRO E INSTALACIÓN DE DIVERSO EQUIPAMIENTO ELECTROMÉDICO (I) A ADJUDICAR POR PROCEDIMIENTO ABIERTO CON PLURALIDAD DE CRITERIOS** 
...
{PAGE 2}------------------------------------------------
whatever...

Now I am asking the model to create a global summary plus a summary of each page. I am providing this schema:

class PageSummary(BaseModel):
    page_number: int
    summary: str
    questions: list[str]

class PagedSummary(BaseModel):
    global_summary: GlobalSummary
    page_summaries: list[PageSummary]

I would like to force the output to contain, in the list of PageSummary objects, exactly as many entries as the document has pages. I don't know whether that is possible. Should I create a different schema for each document dynamically, or is there another way?
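Independently of what the schema can enforce, the page count can be checked after the response arrives. The following is a minimal sketch, assuming the separators always look like the `{PAGE n}---` lines shown above and that pages are numbered from 1; the helper names `count_pages` and `page_numbers_match` are hypothetical.

```python
import re

# Matches separator lines such as "{PAGE 1}--------" (assumed format, taken from the sample above).
PAGE_SEPARATOR = re.compile(r"^\{PAGE (\d+)\}-+[ \t]*$", re.MULTILINE)


def count_pages(markdown: str) -> int:
    """Count the page separators found in the paged Markdown document."""
    return len(PAGE_SEPARATOR.findall(markdown))


def page_numbers_match(markdown: str, page_numbers: list[int]) -> bool:
    """Return True if page_numbers is exactly 1..N for the N pages in the document."""
    expected = list(range(1, count_pages(markdown) + 1))
    return sorted(page_numbers) == expected


doc = "{PAGE 1}-----\nfirst page\n{PAGE 2}-----\nsecond page\n"
print(count_pages(doc))
print(page_numbers_match(doc, [1, 2]))
print(page_numbers_match(doc, [1]))
```

Here `page_numbers` would be the `page_number` values collected from the parsed `page_summaries`; a mismatch could trigger a retry. This detects a wrong page count but does not prevent it.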

JaimeArboleda · 2024-12-04T15:15:15Z

After asking ChatGPT, I thought that creating the model dynamically could be a solution. My preferred approach would be this one:

from pydantic import BaseModel, create_model
from typing import Tuple, Type

class PageSummary(BaseModel):
    page_number: int
    summary: str
    questions: list[str]

class GlobalSummary(BaseModel):
    summary: str

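def get_paged_summary_model(num_pages: int, model_name: str = "PagedSummary") -> Type[BaseModel]:
    # Fixed-length tuple type: Tuple[PageSummary, ..., PageSummary] with num_pages entries
    tuple_type = Tuple[tuple(PageSummary for _ in range(num_pages))]
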
    return create_model(
        model_name,
        global_summary=(GlobalSummary, ...),
        page_summaries=(tuple_type, ...),
    )

However, I believe it won't work because, according to the documentation:

Some type-specific keywords are not yet supported

Notable keywords not supported include:

For strings: minLength, maxLength, pattern, format
For numbers: minimum, maximum, multipleOf
For objects: patternProperties, unevaluatedProperties, propertyNames, minProperties, maxProperties
For arrays: unevaluatedItems, contains, minContains, maxContains, minItems, maxItems, uniqueItems

In particular, this model will generate a JSON schema that uses minItems and maxItems.

Is support for those keywords on the roadmap?
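To see in advance whether a generated schema uses any of the keywords listed above, the schema dict can be scanned before sending it. This is a rough sketch using only the standard library; `find_unsupported` is a hypothetical helper, and the `sample` schema is hand-written to illustrate a fixed-length array, not captured output from any library.

```python
# Keywords listed as unsupported in the documentation excerpt quoted above.
UNSUPPORTED = {
    "minLength", "maxLength", "pattern", "format",
    "minimum", "maximum", "multipleOf",
    "patternProperties", "unevaluatedProperties", "propertyNames",
    "minProperties", "maxProperties",
    "unevaluatedItems", "contains", "minContains", "maxContains",
    "minItems", "maxItems", "uniqueItems",
}

# Keys whose values map user-chosen names to subschemas; those names are not keywords.
NAME_MAPS = {"properties", "$defs", "definitions"}


def find_unsupported(schema, path="$"):
    """Return (path, keyword) pairs, in traversal order, for unsupported keywords."""
    found = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key in UNSUPPORTED:
                found.append((path, key))
            if key in NAME_MAPS and isinstance(value, dict):
                # Skip the names themselves and descend into each subschema.
                for name, sub in value.items():
                    found.extend(find_unsupported(sub, f"{path}.{key}.{name}"))
            else:
                found.extend(find_unsupported(value, f"{path}.{key}"))
    elif isinstance(schema, list):
        for i, item in enumerate(schema):
            found.extend(find_unsupported(item, f"{path}[{i}]"))
    return found


# Hand-written illustration of a two-element fixed-length array schema.
sample = {
    "type": "object",
    "properties": {
        "page_summaries": {
            "type": "array",
            "prefixItems": [
                {"$ref": "#/$defs/PageSummary"},
                {"$ref": "#/$defs/PageSummary"},
            ],
            "minItems": 2,
            "maxItems": 2,
        },
    },
}
print(find_unsupported(sample))
```

The scan is deliberately naive: it would also flag these names if they appeared inside `enum` or `default` values, so treat a hit as a prompt to inspect the schema rather than as proof.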

The second approach is uglier but could work:

class PageSummary(BaseModel):
    summary: str
    questions: list[str]

def get_base_model(num_items: int, model_name: str = "Questions") -> Type[BaseModel]:

    fields = {}
    fields['global_summary'] = (str, ...)
    for i in range(1, num_items + 1):
        fields[f"page_num_{i}_summary"] = (PageSummary, ...)
    
    # Generate the model with a unique name
    return create_model(model_name, **fields)

get_base_model(3, "QuestionsSet1").schema_json()

However, the problem here is the limit on the number of properties. Quoting the documentation again:

A schema may have up to 100 object properties total, with up to 5 levels of nesting.

I guess this implies that I won't be able to handle documents with more than 50 pages... :(
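The page ceiling depends on how many properties each page is counted as using, which the quoted sentence does not spell out. The sketch below only does the arithmetic under stated assumptions: `max_pages` is a hypothetical helper, `fixed_properties=1` stands for the single `global_summary` field, and `properties_per_page` is a guess to vary.

```python
def max_pages(
    property_limit: int = 100,
    fixed_properties: int = 1,
    properties_per_page: int = 2,
) -> int:
    """Largest page count that fits in the property budget under the assumed per-page cost."""
    return (property_limit - fixed_properties) // properties_per_page


# Compare a few assumed per-page costs against the 100-property limit.
for cost in (1, 2, 3):
    print(cost, max_pages(properties_per_page=cost))
```

A cost of 2 per page lands close to the 50-page estimate; if the page field and both nested `PageSummary` fields each count separately (a cost of 3), the ceiling would be lower.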
