Commit

Merge branch 'main' into infra/add-telemetry-information
baskaryan authored Jan 8, 2025
2 parents 768fd52 + f71402c commit da7d17f
Showing 101 changed files with 6,047 additions and 1,820 deletions.
3 changes: 2 additions & 1 deletion .prettierignore
@@ -1,4 +1,5 @@
node_modules
build
.docusaurus
docs/api
docs/api
docs/evaluation
24 changes: 24 additions & 0 deletions Makefile
@@ -0,0 +1,24 @@
install-vercel-deps:
yum -y update
yum install gcc bzip2-devel libffi-devel zlib-devel wget tar gzip rsync -y

PYTHON = .venv/bin/python

build-api-ref:
git clone --depth=1 https://github.com/langchain-ai/langsmith-sdk.git
python3 -m venv .venv
. .venv/bin/activate
$(PYTHON) -m pip install --upgrade pip
$(PYTHON) -m pip install --upgrade uv
cd langsmith-sdk && ../$(PYTHON) -m uv pip install -r python/docs/requirements.txt
$(PYTHON) langsmith-sdk/python/docs/create_api_rst.py
LC_ALL=C $(PYTHON) -m sphinx -T -E -b html -d langsmith-sdk/python/docs/_build/doctrees -c langsmith-sdk/python/docs langsmith-sdk/python/docs langsmith-sdk/python/docs/_build/html -j auto
$(PYTHON) langsmith-sdk/python/docs/scripts/custom_formatter.py langsmith-sdk/docs/_build/html/


vercel-build: install-vercel-deps build-api-ref
mkdir -p static/reference/python
mv langsmith-sdk/python/docs/_build/html/* static/reference/python/
rm -rf langsmith-sdk
NODE_OPTIONS="--max-old-space-size=5000" yarn run docusaurus build

8 changes: 8 additions & 0 deletions docs/administration/concepts/index.mdx
@@ -176,6 +176,14 @@ Roles can be managed in organization settings under the `Roles` tab:

For more details on assigning and creating roles, see the [access control setup guide](../how_to_guides/organization_management/set_up_access_control.mdx).

## Best Practices

### Environment Separation

Use [resource tags](#resource-tags) to organize resources by environment, using the default tag key `Environment` with a different value per environment (e.g. `dev`, `staging`, `prod`). This tagging structure lets you organize your tracing projects today and will make it easy to enforce
permissions once we release attribute-based access control (ABAC). ABAC on the resource tag will provide a fine-grained way to restrict access to production tracing projects, for example. We do not recommend using Workspaces for environment separation, because resources cannot be shared
across Workspaces. If you would like to promote a prompt from `staging` to `prod`, use prompt tags instead. See the [prompt tags docs](../prompt_engineering/concepts#tags) for more information.
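
For the prompt-promotion flow, a minimal sketch with the Python SDK might look like the following. The prompt name is hypothetical, and the `name:tag` identifier syntax for `pull_prompt` is an assumption to verify against the prompt tags docs linked above:

```python
from langsmith import Client

client = Client()

# Hypothetical prompt name; assumes "staging" and "prod" tags have been
# applied to commits of this prompt in LangSmith.
staging_prompt = client.pull_prompt("my-prompt:staging")

# Once the staging version is validated, move the "prod" tag to that commit
# in the LangSmith UI; production code keeps pulling the same identifier:
prod_prompt = client.pull_prompt("my-prompt:prod")
```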

## Usage and Billing

### Data Retention
8 changes: 4 additions & 4 deletions docs/administration/pricing.mdx
@@ -208,13 +208,13 @@ If you’ve consumed the monthly allotment of free traces in your account, you c

Every user will have a unique personal account on the Developer plan. <b>We cannot upgrade a Developer account to the Plus or Enterprise plans.</b> If you’re interested in working as a team, create a separate LangSmith Organization on the Plus plan. This plan can be upgraded to the Enterprise plan at a later date.

### How will billing work?
### How does billing work?

<b>Seats</b>
<br />
Seats are billed monthly on the first of the month in the future will be
pro-rated if additional seats are purchased in the middle of the month. Seats
removed mid-month will not be credited.
Seats are billed monthly on the first of the month. Additional seats purchased
mid-month are pro-rated and billed within one day of the purchase. Seats removed
mid-month will not be credited.
<br />
<br />
<b>Traces</b>
421 changes: 166 additions & 255 deletions docs/evaluation/concepts/index.mdx

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.
Binary file added docs/evaluation/concepts/static/offline.png
Binary file added docs/evaluation/concepts/static/online.png
22 changes: 21 additions & 1 deletion docs/evaluation/how_to_guides/annotation_queues.mdx
@@ -14,11 +14,28 @@ While you can always [annotate runs inline](./annotate_traces_inline), annotatio
To create an annotation queue, navigate to the **Annotation queues** section through the homepage or left-hand navigation bar.
Then click **+ New annotation queue** in the top right corner.

![](./static/annotation_queue_form.png)
![](./static/create_annotation_queue_new.png)

### Basic Details

Fill in the form with the **name** and **description** of the queue.
You can also assign a **default dataset** to the queue, which will streamline the process of sending the inputs and outputs of certain runs to datasets in your LangSmith workspace.

### Annotation Rubric

Begin by drafting some high-level instructions for your annotators, which will be shown in the sidebar on every run.

Next, click "+ Desired Feedback" to add feedback keys to your annotation queue. Annotators will be presented with these feedback keys on each run.
Add a description for each, as well as a short description of each category if the feedback is categorical.

![annotation queue rubric](./static/create_annotation_rubric.png)

Reviewers will see this:

![rubric for annotators](./static/rubric_for_annotators.png)

### Collaborator Settings

There are a few settings related to multiple annotators:

- **Number of reviewers per run**: This determines the number of reviewers that must mark a run as "Done" for it to be removed from the queue. If you check "All workspace members review each run," then a run will remain in the queue until all workspace members have marked it "Done".
@@ -56,6 +73,9 @@ To assign runs to an annotation queue, either:

3. [Set up an automation rule](../../../observability/how_to_guides/monitoring/rules) that automatically assigns runs which pass a certain filter and sampling condition to an annotation queue.

4. Select one or multiple experiments from the dataset page and click **Annotate**. From the resulting popup, you may either create a new queue or add the runs to an existing one:
![](./static/annotate_experiment.png)
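
If you prefer to do this from code, the Python SDK exposes annotation-queue helpers. A rough sketch, assuming `create_annotation_queue` and `add_runs_to_annotation_queue` are available in your SDK version and that you already have the IDs of runs you want reviewed:

```python
from langsmith import Client

client = Client()

# Hypothetical queue; adjust the name and description to your workspace.
queue = client.create_annotation_queue(
    name="Thumbs-down responses",
    description="Runs flagged by users for manual review",
)

run_ids = ["<run-id-1>", "<run-id-2>"]  # placeholder run IDs
client.add_runs_to_annotation_queue(queue.id, run_ids=run_ids)
```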

:::tip

It is often a good idea to assign runs that have received a particular user feedback score (e.g., thumbs up or thumbs down) in your application to an annotation queue. This way, you can identify and address issues that are causing user dissatisfaction.
17 changes: 10 additions & 7 deletions docs/evaluation/how_to_guides/async.mdx
@@ -8,8 +8,8 @@ import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
We can run evaluations asynchronously via the SDK using [aevaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._arunner.aevaluate),
which accepts all of the same arguments as [evaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](./evaluate_llm_application).
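
Before the full walkthrough, here is a compact, self-contained sketch of the pattern (requires `langsmith>=0.2.0`; it assumes a dataset named `my-dataset` already exists and that your LangSmith and model API keys are set in the environment):

```python
import asyncio

from langsmith import Client

client = Client()

async def my_app(inputs: dict) -> dict:
    # Stand-in for your real (async) application call.
    return {"answer": "Paris" if "capital" in inputs["question"].lower() else "Not sure"}

def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

async def main() -> None:
    await client.aevaluate(
        my_app,
        data="my-dataset",                  # assumed existing dataset name
        evaluators=[correct],
        max_concurrency=4,                  # optional, but recommended
        experiment_prefix="async-baseline", # optional, random by default
    )

asyncio.run(main())
```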

:::info Python only
@@ -25,8 +25,8 @@ You can see how to use it [here](./evaluate_llm_application).
<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
from langsmith import aevaluate, wrappers, Client
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import wrappers, Client
from openai import AsyncOpenAI
# Optionally wrap the OpenAI client to trace all model calls.
@@ -61,12 +61,15 @@ list 5 concrete questions that should be investigated to determine if the idea i
inputs=[{"idea": e} for e in examples,
)
results = await aevaluate(
# Can equivalently use the 'aevaluate' function directly:
# from langsmith import aevaluate
# await aevaluate(...)
results = await ls_client.aevaluate(
researcher_app,
data=dataset,
evaluators=[concise],
# Optional, no max_concurrency by default but it is recommended to set one.
max_concurrency=2,
# Optional, add concurrency.
max_concurrency=2, # Optional, add concurrency.
experiment_prefix="gpt-4o-mini-baseline" # Optional, random by default.
)
`,
14 changes: 14 additions & 0 deletions docs/evaluation/how_to_guides/compare_experiment_results.mdx
@@ -70,3 +70,17 @@ You can adjust the display settings for comparison view by clicking on "Display"
Here, you'll be able to toggle feedback, metrics, summary charts, and expand full text.

![](./static/update_display.png)

## Use experiment metadata as chart labels

With the summary charts enabled, you can configure the x-axis labels based on [experiment metadata](./filter_experiments_ui#background-add-metadata-to-your-experiments). First, click the three dots in the top right of the charts (note that you will only see them if your experiments have metadata attached).

![](./static/three_dots_charts.png)

Next, select a metadata key. Note that this key must contain string values in order to render in the charts.

![](./static/select_metadata_key.png)

You will now see your metadata in the x-axis of the charts:

![](./static/metadata_in_charts.png)
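
For the charts to have something to label, the experiments need metadata attached at creation time. A hedged sketch of how that might look with the Python SDK, assuming `evaluate()` accepts a `metadata` dict as described in the linked guide:

```python
from langsmith import evaluate

def my_app(inputs: dict) -> dict:
    # Stand-in for your real application.
    return {"answer": "..."}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")

# "model" is an example metadata key; keys with string values can then be
# selected as x-axis labels in the comparison charts.
evaluate(
    my_app,
    data="my-dataset",                  # assumed existing dataset name
    evaluators=[exact_match],
    metadata={"model": "gpt-4o-mini"},  # hypothetical metadata
)
```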
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@
sidebar_position: 10
---

How to create few-shot evaluators
# How to create few-shot evaluators

Using LLM-as-a-Judge evaluators can be very helpful when you can't evaluate your system programmatically. However, improving/iterating on these prompts can add unnecessary
overhead to the development process of an LLM-based application - you now need to maintain both your application **and** your evaluators. To make this process easier, LangSmith allows
138 changes: 117 additions & 21 deletions docs/evaluation/how_to_guides/custom_evaluator.mdx
Original file line number Diff line number Diff line change
@@ -13,14 +13,14 @@ import {
:::

Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics.
These functions can be passed directly into [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html).
These functions can be passed directly into [evaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate) / [aevaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._arunner.aevaluate).

## Basic example

<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import evaluate
def correct(outputs: dict, reference_outputs: dict) -> bool:
@@ -36,12 +36,14 @@ These functions can be passed directly into [evaluate()](https://langsmith-sdk.r
evaluators=[correct]
)
`,
typescript`
typescript({caption: "Requires `langsmith>=0.2.9`"})`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
function correct(run: Run, example: Example): EvaluationResult {
const score = run.outputs?.output === example.outputs?.output;
const correct = async ({ outputs, referenceOutputs }: {
outputs: Record<string, any>;
referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
const score = outputs?.answer === referenceOutputs?.answer;
return { key: "correct", score };
}
`,
@@ -53,28 +55,25 @@ These functions can be passed directly into [evaluate()](https://langsmith-sdk.r

Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:

Python and JS/TS

- `run: langsmith.schemas.Run`: The full Run object generated by the application on the given example.
- `example: langsmith.schemas.Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).

Currently Python only

- `run: Run`: The full [Run](/reference/data_formats/run_data_format) object generated by the application on the given example.
- `example: Example`: The full dataset [Example](/reference/data_formats/example_data_format), including the example inputs, outputs (if available), and metadata (if available).
- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: dict`: A dictionary of the outputs generated by the application on the given `inputs`.
- `reference_outputs: dict`: A dictionary of the reference outputs associated with the example, if available.
- `reference_outputs/referenceOutputs: dict`: A dictionary of the reference outputs associated with the example, if available.

For most use cases you'll only need `inputs`, `outputs`, and `reference_outputs`. `run` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.

When using JS/TS these should all be passed in as part of a single object argument.
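
For example, in Python both of the following are valid evaluators, since each takes a subset of the supported argument names (the metric logic here is just illustrative):

```python
def has_answer(outputs: dict) -> bool:
    # Only inspects the application output.
    return bool(outputs.get("answer"))

def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # Uses the example inputs, the app outputs, and the reference outputs.
    return outputs.get("answer") == reference_outputs.get("answer")
```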

## Evaluator output

Custom evaluators are expected to return one of the following types:

Python and JS/TS

- `dict`: dicts of the form `{"score" | "value": ..., "name": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.
- `dict`: dicts of the form `{"score" | "value": ..., "key": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Currently Python only
Python only

- `int | float | bool`: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
- `str`: this is interpreted as a categorical metric. The function name is used as the name of the metric.
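
To illustrate the dict form concretely (the metric names here are just examples):

```python
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    # "key" sets the metric name; "score" records a numerical/boolean metric.
    return {
        "key": "exact_match",
        "score": outputs.get("answer") == reference_outputs.get("answer"),
    }

def tone(outputs: dict) -> dict:
    # "value" records a categorical metric instead.
    return {"key": "tone", "value": "formal"}
```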
@@ -85,16 +84,17 @@ Currently Python only
<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel
# Compare actual and reference outputs
def correct(outputs: dict, reference_outputs: dict) -> bool:
# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
"""Check if the answer exactly matches the expected answer."""
return outputs["answer"] == reference_outputs["answer"]
return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}
# Just evaluate actual outputs
def concision(outputs: dict) -> int:
@@ -129,9 +129,105 @@ answer is logically valid and consistent with question and the answer."""
results = evaluate(
dummy_app,
data="dataset_name",
evaluators=[correct, concision, valid_reasoning]
evaluators=[correct_old_signature, concision, valid_reasoning]
)
`,
typescript`
import { Client } from "langsmith";
import { evaluate } from "langsmith/evaluation";
import { Run, Example } from "langsmith/schemas";
import OpenAI from "openai";
// Type definitions
interface AppInputs {
question: string;
}
interface AppOutputs {
answer: string;
reasoning: string;
}
interface Response {
reasoning_is_valid: boolean;
}
// Old signature evaluator
function correctOldSignature(run: Run, example: Example) {
return {
key: "correct",
score: run.outputs?.["answer"] === example.outputs?.["answer"],
};
}
// Output-only evaluator
function concision({ outputs }: { outputs: AppOutputs }) {
return {
key: "concision",
score: Math.min(Math.floor(outputs.answer.length / 1000), 4) + 1,
};
}
// LLM-as-judge evaluator
const openai = new OpenAI();
async function validReasoning({
inputs,
outputs
}: {
inputs: AppInputs;
outputs: AppOutputs;
}) {
const instructions = \`\
Given the following question, answer, and reasoning, determine if the reasoning for the \
answer is logically valid and consistent with question and the answer.\`;
const msg = \`Question: \${inputs.question}\nAnswer: \${outputs.answer}\\nReasoning: \${outputs.reasoning}\`;
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: instructions },
{ role: "user", content: msg }
],
response_format: { type: "json_object" },
functions: [{
name: "parse_response",
parameters: {
type: "object",
properties: {
reasoning_is_valid: {
type: "boolean",
description: "Whether the reasoning is valid"
}
},
required: ["reasoning_is_valid"]
}
}]
});
const parsed = JSON.parse(response.choices[0].message.content ?? "{}") as Response;
return {
key: "valid_reasoning",
score: parsed.reasoning_is_valid ? 1 : 0
};
}
// Example application
function dummyApp(inputs: AppInputs): AppOutputs {
return {
answer: "hmm i'm not sure",
reasoning: "i didn't understand the question"
};
}
const results = await evaluate(dummyApp, {
data: "dataset_name",
evaluators: [correctOldSignature, concision, validReasoning],
client: new Client()
});
`

]}
/>
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/dataset_subset.mdx
@@ -85,4 +85,4 @@ You can use the `list_examples` / `listExamples` method to evaluate on one or mu

## Related

- More on [how to filter datasets](./manage_datasets_programmatically#list-examples-by-structured-filter)
- Learn more about how to fetch views of a dataset [here](./manage_datasets_programmatically#fetch-datasets)