Commit

Merge branch 'main' into infra/add-telemetry-information
baskaryan authored Jan 8, 2025
2 parents 768fd52 + f71402c commit da7d17f
Showing 101 changed files with 6,047 additions and 1,820 deletions.
3 changes: 2 additions & 1 deletion .prettierignore
@@ -1,4 +1,5 @@
node_modules
build
.docusaurus
docs/api
docs/api
docs/evaluation
24 changes: 24 additions & 0 deletions Makefile
@@ -0,0 +1,24 @@
install-vercel-deps:
yum -y update
yum install gcc bzip2-devel libffi-devel zlib-devel wget tar gzip rsync -y

PYTHON = .venv/bin/python

build-api-ref:
git clone --depth=1 https://github.com/langchain-ai/langsmith-sdk.git
python3 -m venv .venv
. .venv/bin/activate
$(PYTHON) -m pip install --upgrade pip
$(PYTHON) -m pip install --upgrade uv
cd langsmith-sdk && ../$(PYTHON) -m uv pip install -r python/docs/requirements.txt
$(PYTHON) langsmith-sdk/python/docs/create_api_rst.py
LC_ALL=C $(PYTHON) -m sphinx -T -E -b html -d langsmith-sdk/python/docs/_build/doctrees -c langsmith-sdk/python/docs langsmith-sdk/python/docs langsmith-sdk/python/docs/_build/html -j auto
$(PYTHON) langsmith-sdk/python/docs/scripts/custom_formatter.py langsmith-sdk/docs/_build/html/


vercel-build: install-vercel-deps build-api-ref
mkdir -p static/reference/python
mv langsmith-sdk/python/docs/_build/html/* static/reference/python/
rm -rf langsmith-sdk
NODE_OPTIONS="--max-old-space-size=5000" yarn run docusaurus build

8 changes: 8 additions & 0 deletions docs/administration/concepts/index.mdx
@@ -176,6 +176,14 @@ Roles can be managed in organization settings under the `Roles` tab:

For more details on assigning and creating roles, see the [access control setup guide](../how_to_guides/organization_management/set_up_access_control.mdx).

## Best Practices

### Environment Separation

Use [resource tags](#resource-tags) to organize resources by environment, using the default tag key `Environment` with a different value per environment (e.g. `dev`, `staging`, `prod`). This tagging structure lets you organize your tracing projects today and will make it easy to enforce
permissions once we release attribute-based access control (ABAC). ABAC on the resource tag will provide a fine-grained way to restrict access to production tracing projects, for example. We do not recommend using Workspaces for environment separation, because resources cannot be shared
across Workspaces. If you would like to promote a prompt from `staging` to `prod`, use prompt tags instead. See the [prompt tags docs](../prompt_engineering/concepts#tags) for more information.
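
For the prompt-promotion flow, a minimal sketch with the Python SDK might look like the following. The prompt name is hypothetical, and the `name:tag` identifier syntax for `pull_prompt` is an assumption to verify against the prompt tags docs linked above:

```python
from langsmith import Client

client = Client()

# Hypothetical prompt name; assumes "staging" and "prod" tags have been
# applied to commits of this prompt in LangSmith.
staging_prompt = client.pull_prompt("my-prompt:staging")

# Once the staging version is validated, move the "prod" tag to that commit
# in the LangSmith UI; production code keeps pulling the same identifier:
prod_prompt = client.pull_prompt("my-prompt:prod")
```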

## Usage and Billing

### Data Retention
8 changes: 4 additions & 4 deletions docs/administration/pricing.mdx
@@ -208,13 +208,13 @@ If you’ve consumed the monthly allotment of free traces in your account, you c

Every user will have a unique personal account on the Developer plan. <b>We cannot upgrade a Developer account to the Plus or Enterprise plans.</b> If you’re interested in working as a team, create a separate LangSmith Organization on the Plus plan. This plan can be upgraded to the Enterprise plan at a later date.

### How will billing work?
### How does billing work?

<b>Seats</b>
<br />
Seats are billed monthly on the first of the month in the future will be
pro-rated if additional seats are purchased in the middle of the month. Seats
removed mid-month will not be credited.
Seats are billed monthly on the first of the month. Additional seats purchased
mid-month are pro-rated and billed within one day of the purchase. Seats removed
mid-month will not be credited.
<br />
<br />
<b>Traces</b>
421 changes: 166 additions & 255 deletions docs/evaluation/concepts/index.mdx

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.
Binary file added docs/evaluation/concepts/static/offline.png
Binary file added docs/evaluation/concepts/static/online.png
22 changes: 21 additions & 1 deletion docs/evaluation/how_to_guides/annotation_queues.mdx
@@ -14,11 +14,28 @@ While you can always [annotate runs inline](./annotate_traces_inline), annotatio
To create an annotation queue, navigate to the **Annotation queues** section through the homepage or left-hand navigation bar.
Then click **+ New annotation queue** in the top right corner.

![](./static/annotation_queue_form.png)
![](./static/create_annotation_queue_new.png)

### Basic Details

Fill in the form with the **name** and **description** of the queue.
You can also assign a **default dataset** to the queue, which will streamline the process of sending the inputs and outputs of certain runs to datasets in your LangSmith workspace.

### Annotation Rubric

Begin by drafting some high-level instructions for your annotators, which will be shown in the sidebar on every run.

Next, click "+ Desired Feedback" to add feedback keys to your annotation queue. Annotators will be presented with these feedback keys on each run.
Add a description for each, as well as a short description of each category if the feedback is categorical.

![annotation queue rubric](./static/create_annotation_rubric.png)

Reviewers will see this:

![rubric for annotators](./static/rubric_for_annotators.png)

### Collaborator Settings

There are a few settings related to multiple annotators:

- **Number of reviewers per run**: This determines the number of reviewers that must mark a run as "Done" for it to be removed from the queue. If you check "All workspace members review each run," then a run will remain in the queue until all workspace members have marked it "Done".
@@ -56,6 +73,9 @@ To assign runs to an annotation queue, either:

3. [Set up an automation rule](../../../observability/how_to_guides/monitoring/rules) that automatically assigns runs which pass a certain filter and sampling condition to an annotation queue.

4. Select one or multiple experiments from the dataset page and click **Annotate**. From the resulting popup, you may either create a new queue or add the runs to an existing one:
![](./static/annotate_experiment.png)
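
If you prefer to do this from code, the Python SDK exposes annotation-queue helpers. A rough sketch, assuming `create_annotation_queue` and `add_runs_to_annotation_queue` are available in your SDK version and that you already have the IDs of runs you want reviewed:

```python
from langsmith import Client

client = Client()

# Hypothetical queue; adjust the name and description to your workspace.
queue = client.create_annotation_queue(
    name="Thumbs-down responses",
    description="Runs flagged by users for manual review",
)

run_ids = ["<run-id-1>", "<run-id-2>"]  # placeholder run IDs
client.add_runs_to_annotation_queue(queue.id, run_ids=run_ids)
```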

:::tip

It is often a good idea to assign runs that have received a particular user feedback score (e.g., thumbs up or thumbs down) in your application to an annotation queue. This way, you can identify and address issues that are causing user dissatisfaction.
17 changes: 10 additions & 7 deletions docs/evaluation/how_to_guides/async.mdx
@@ -8,8 +8,8 @@ import { CodeTabs, python } from "@site/src/components/InstructionsWithCode";

:::

We can run evaluations asynchronously via the SDK using [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html),
which accepts all of the same arguments as [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) but expects the application function to be asynchronous.
We can run evaluations asynchronously via the SDK using [aevaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._arunner.aevaluate),
which accepts all of the same arguments as [evaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate) but expects the application function to be asynchronous.
You can learn more about how to use the `evaluate()` function [here](./evaluate_llm_application).
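
Before the full walkthrough, here is a compact, self-contained sketch of the pattern (requires `langsmith>=0.2.0`; it assumes a dataset named `my-dataset` already exists and that your LangSmith and model API keys are set in the environment):

```python
import asyncio

from langsmith import Client

client = Client()

async def my_app(inputs: dict) -> dict:
    # Stand-in for your real (async) application call.
    return {"answer": "Paris" if "capital" in inputs["question"].lower() else "Not sure"}

def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

async def main() -> None:
    await client.aevaluate(
        my_app,
        data="my-dataset",                  # assumed existing dataset name
        evaluators=[correct],
        max_concurrency=4,                  # optional, but recommended
        experiment_prefix="async-baseline", # optional, random by default
    )

asyncio.run(main())
```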

:::info Python only
@@ -25,8 +25,8 @@ You can see how to use it [here](./evaluate_llm_application).
<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
from langsmith import aevaluate, wrappers, Client
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import wrappers, Client
from openai import AsyncOpenAI
# Optionally wrap the OpenAI client to trace all model calls.
@@ -61,12 +61,15 @@ list 5 concrete questions that should be investigated to determine if the idea i
inputs=[{"idea": e} for e in examples,
)
results = await aevaluate(
# Can equivalently use the 'aevaluate' function directly:
# from langsmith import aevaluate
# await aevaluate(...)
results = await ls_client.aevaluate(
researcher_app,
data=dataset,
evaluators=[concise],
# Optional, no max_concurrency by default but it is recommended to set one.
max_concurrency=2,
# Optional, add concurrency.
max_concurrency=2, # Optional, add concurrency.
experiment_prefix="gpt-4o-mini-baseline" # Optional, random by default.
)
`,
14 changes: 14 additions & 0 deletions docs/evaluation/how_to_guides/compare_experiment_results.mdx
@@ -70,3 +70,17 @@ You can adjust the display settings for comparison view by clicking on "Display"
Here, you'll be able to toggle feedback, metrics, summary charts, and expand full text.

![](./static/update_display.png)

## Use experiment metadata as chart labels

With the summary charts enabled, you can configure the x-axis labels based on [experiment metadata](./filter_experiments_ui#background-add-metadata-to-your-experiments). First, click the three dots in the top right of the charts (note that you will only see them if your experiments have metadata attached).

![](./static/three_dots_charts.png)

Next, select a metadata key. Note that this key must contain string values in order to render in the charts.

![](./static/select_metadata_key.png)

You will now see your metadata in the x-axis of the charts:

![](./static/metadata_in_charts.png)
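
For the charts to have something to label, the experiments need metadata attached at creation time. A hedged sketch of how that might look with the Python SDK, assuming `evaluate()` accepts a `metadata` dict as described in the linked guide:

```python
from langsmith import evaluate

def my_app(inputs: dict) -> dict:
    # Stand-in for your real application.
    return {"answer": "..."}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")

# "model" is an example metadata key; keys with string values can then be
# selected as x-axis labels in the comparison charts.
evaluate(
    my_app,
    data="my-dataset",                  # assumed existing dataset name
    evaluators=[exact_match],
    metadata={"model": "gpt-4o-mini"},  # hypothetical metadata
)
```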
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@
sidebar_position: 10
---

How to create few-shot evaluators
# How to create few-shot evaluators

Using LLM-as-a-Judge evaluators can be very helpful when you can't evaluate your system programmatically. However, improving/iterating on these prompts can add unnecessary
overhead to the development process of an LLM-based application - you now need to maintain both your application **and** your evaluators. To make this process easier, LangSmith allows
138 changes: 117 additions & 21 deletions docs/evaluation/how_to_guides/custom_evaluator.mdx
Original file line number Diff line number Diff line change
@@ -13,14 +13,14 @@ import {
:::

Custom evaluators are just functions that take a dataset example and the resulting application output, and return one or more metrics.
These functions can be passed directly into [evaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) / [aevaluate()](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html).
These functions can be passed directly into [evaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate) / [aevaluate()](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._arunner.aevaluate).

## Basic example

<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import evaluate
def correct(outputs: dict, reference_outputs: dict) -> bool:
@@ -36,12 +36,14 @@ These functions can be passed directly into [evaluate()](https://langsmith-sdk.r
evaluators=[correct]
)
`,
typescript`
typescript({caption: "Requires `langsmith>=0.2.9`"})`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
function correct(run: Run, example: Example): EvaluationResult {
const score = run.outputs?.output === example.outputs?.output;
const correct = async ({ outputs, referenceOutputs }: {
outputs: Record<string, any>;
referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
const score = outputs?.answer === referenceOutputs?.answer;
return { key: "correct", score };
}
`,
@@ -53,28 +55,25 @@ These functions can be passed directly into [evaluate()](https://langsmith-sdk.r

Custom evaluator functions must have specific argument names. They can take any subset of the following arguments:

Python and JS/TS

- `run: langsmith.schemas.Run`: The full Run object generated by the application on the given example.
- `example: langsmith.schemas.Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).

Currently Python only

- `run: Run`: The full [Run](/reference/data_formats/run_data_format) object generated by the application on the given example.
- `example: Example`: The full dataset [Example](/reference/data_formats/example_data_format), including the example inputs, outputs (if available), and metadata (if available).
- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: dict`: A dictionary of the outputs generated by the application on the given `inputs`.
- `reference_outputs: dict`: A dictionary of the reference outputs associated with the example, if available.
- `reference_outputs/referenceOutputs: dict`: A dictionary of the reference outputs associated with the example, if available.

For most use cases you'll only need `inputs`, `outputs`, and `reference_outputs`. `run` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application.

When using JS/TS these should all be passed in as part of a single object argument.
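
For example, in Python both of the following are valid evaluators, since each takes a subset of the supported argument names (the metric logic here is just illustrative):

```python
def has_answer(outputs: dict) -> bool:
    # Only inspects the application output.
    return bool(outputs.get("answer"))

def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    # Uses the example inputs, the app outputs, and the reference outputs.
    return outputs.get("answer") == reference_outputs.get("answer")
```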

## Evaluator output

Custom evaluators are expected to return one of the following types:

Python and JS/TS

- `dict`: dicts of the form `{"score" | "value": ..., "name": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.
- `dict`: dicts of the form `{"score" | "value": ..., "key": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Currently Python only
Python only

- `int | float | bool`: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
- `str`: this is interpreted as a categorical metric. The function name is used as the name of the metric.
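
To illustrate the dict form concretely (the metric names here are just examples):

```python
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    # "key" sets the metric name; "score" records a numerical/boolean metric.
    return {
        "key": "exact_match",
        "score": outputs.get("answer") == reference_outputs.get("answer"),
    }

def tone(outputs: dict) -> dict:
    # "value" records a categorical metric instead.
    return {"key": "tone", "value": "formal"}
```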
@@ -85,16 +84,17 @@ Currently Python only
<CodeTabs
groupId="client-language"
tabs={[
python({caption: "Requires `langsmith>=0.1.145`"})`
python({caption: "Requires `langsmith>=0.2.0`"})`
from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel
# Compare actual and reference outputs
def correct(outputs: dict, reference_outputs: dict) -> bool:
# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
"""Check if the answer exactly matches the expected answer."""
return outputs["answer"] == reference_outputs["answer"]
return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}
# Just evaluate actual outputs
def concision(outputs: dict) -> int:
@@ -129,9 +129,105 @@ answer is logically valid and consistent with question and the answer."""
results = evaluate(
dummy_app,
data="dataset_name",
evaluators=[correct, concision, valid_reasoning]
evaluators=[correct_old_signature, concision, valid_reasoning]
)
`,
typescript`
import { Client } from "langsmith";
import { evaluate } from "langsmith/evaluation";
import { Run, Example } from "langsmith/schemas";
import OpenAI from "openai";
// Type definitions
interface AppInputs {
question: string;
}
interface AppOutputs {
answer: string;
reasoning: string;
}
interface Response {
reasoning_is_valid: boolean;
}
// Old signature evaluator
function correctOldSignature(run: Run, example: Example) {
return {
key: "correct",
score: run.outputs?.["answer"] === example.outputs?.["answer"],
};
}
// Output-only evaluator
function concision({ outputs }: { outputs: AppOutputs }) {
return {
key: "concision",
score: Math.min(Math.floor(outputs.answer.length / 1000), 4) + 1,
};
}
// LLM-as-judge evaluator
const openai = new OpenAI();
async function validReasoning({
inputs,
outputs
}: {
inputs: AppInputs;
outputs: AppOutputs;
}) {
const instructions = \`\
Given the following question, answer, and reasoning, determine if the reasoning for the \
answer is logically valid and consistent with question and the answer.\`;
const msg = \`Question: \${inputs.question}\nAnswer: \${outputs.answer}\\nReasoning: \${outputs.reasoning}\`;
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: instructions },
{ role: "user", content: msg }
],
response_format: { type: "json_object" },
functions: [{
name: "parse_response",
parameters: {
type: "object",
properties: {
reasoning_is_valid: {
type: "boolean",
description: "Whether the reasoning is valid"
}
},
required: ["reasoning_is_valid"]
}
}]
});
const parsed = JSON.parse(response.choices[0].message.content ?? "{}") as Response;
return {
key: "valid_reasoning",
score: parsed.reasoning_is_valid ? 1 : 0
};
}
// Example application
function dummyApp(inputs: AppInputs): AppOutputs {
return {
answer: "hmm i'm not sure",
reasoning: "i didn't understand the question"
};
}
const results = await evaluate(dummyApp, {
data: "dataset_name",
evaluators: [correctOldSignature, concision, validReasoning],
client: new Client()
});
`

]}
/>
2 changes: 1 addition & 1 deletion docs/evaluation/how_to_guides/dataset_subset.mdx
@@ -85,4 +85,4 @@ You can use the `list_examples` / `listExamples` method to evaluate on one or mu

## Related

- More on [how to filter datasets](./manage_datasets_programmatically#list-examples-by-structured-filter)
- Learn more about how to fetch views of a dataset [here](./manage_datasets_programmatically#fetch-datasets)