diff --git a/docs/docs/integrations/document_loaders/box.ipynb b/docs/docs/integrations/document_loaders/box.ipynb index 5f65ab5a0745c..c381815410fec 100644 --- a/docs/docs/integrations/document_loaders/box.ipynb +++ b/docs/docs/integrations/document_loaders/box.ipynb @@ -13,32 +13,38 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# BoxLoader\n", + "# BoxLoader and BoxBlobLoader\n", "\n", - "This notebook provides a quick overview for getting started with Box [document loader](/docs/integrations/document_loaders/). For detailed documentation of all BoxLoader features and configurations head to the [API reference](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.document_loaders.box.BoxLoader.html).\n", "\n", + "The `langchain-box` package provides two methods to index your files from Box: `BoxLoader` and `BoxBlobLoader`. `BoxLoader` allows you to ingest text representations of files that have a text representation in Box. `BoxBlobLoader` allows you to download the blob for any document or image file for processing with the blob parser of your choice.\n", + "\n", + "This notebook details getting started with both of these. For detailed documentation of all features and configurations, head to the API reference pages for [BoxLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.document_loaders.box.BoxLoader.html) and [BoxBlobLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.blob_loaders.box.BoxBlobLoader.html).\n", "\n", "## Overview\n", "\n", "The `BoxLoader` class helps you get your unstructured content from Box in Langchain's `Document` format. You can do this with either a `List[str]` containing Box file IDs, or with a `str` containing a Box folder ID. \n", "\n", - "You must provide either a `List[str]` containing Box file Ids, or a `str` containing a folder ID. 
If getting files from a folder with folder ID, you can also set a `Bool` to tell the loader to get all sub-folders in that folder, as well. \n", + "The `BoxBlobLoader` class helps you get your unstructured content from Box in Langchain's `Blob` format. You can do this with a `List[str]` containing Box file IDs, a `str` containing a Box folder ID, a search query, or a `BoxMetadataQuery`. \n", + "\n", + "If getting files from a folder with folder ID, you can also set a `Bool` to tell the loader to get all sub-folders in that folder, as well. \n", "\n", ":::info\n", "A Box instance can contain Petabytes of files, and folders can contain millions of files. Be intentional when choosing what folders you choose to index. And we recommend never getting all files from folder 0 recursively. Folder ID 0 is your root folder.\n", ":::\n", "\n", - "Files without a text representation will be skipped.\n", + "The `BoxLoader` will skip files without a text representation, while the `BoxBlobLoader` will return blobs for all document and image files.\n", "\n", "### Integration details\n", "\n", "| Class | Package | Local | Serializable | JS support|\n", "| :--- | :--- | :---: | :---: | :---: |\n", "| [BoxLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.document_loaders.box.BoxLoader.html) | [langchain_box](https://python.langchain.com/api_reference/box/index.html) | ✅ | ❌ | ❌ | \n", + "| [BoxBlobLoader](https://python.langchain.com/api_reference/box/document_loaders/langchain_box.blob_loaders.box.BoxBlobLoader.html) | [langchain_box](https://python.langchain.com/api_reference/box/index.html) | ✅ | ❌ | ❌ | \n", "### Loader features\n", "| Source | Document Lazy Loading | Async Support\n", "| :---: | :---: | :---: | \n", "| BoxLoader | ✅ | ❌ | \n", + "| BoxBlobLoader | ✅ | ❌ | \n", "\n", "## Setup\n", "\n", @@ -59,7 +65,7 @@ "metadata": {}, "outputs": [ { - "name": "stdout", + "name": "stdin", "output_type": "stream", "text": [ "Enter your Box 
Developer Token: ········\n" @@ -120,7 +126,9 @@ "\n", "This requires 1 piece of information:\n", "\n", - "* **box_file_ids** (`List[str]`)- A list of Box file IDs. " + "* **box_file_ids** (`List[str]`)- A list of Box file IDs.\n", + "\n", + "#### BoxLoader" ] }, { @@ -140,6 +148,28 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### BoxBlobLoader" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_box.blob_loaders import BoxBlobLoader\n", + "\n", + "box_file_ids = [\"1514555423624\", \"1514553902288\"]\n", + "\n", + "loader = BoxBlobLoader(\n", + " box_developer_token=box_developer_token, box_file_ids=box_file_ids\n", + ")" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -150,7 +180,9 @@ "\n", "This requires 1 piece of information:\n", "\n", - "* **box_folder_id** (`str`)- A string containing a Box folder ID. " + "* **box_folder_id** (`str`)- A string containing a Box folder ID.\n", + "\n", + "#### BoxLoader" ] }, { @@ -174,7 +206,113 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Load" + "#### BoxBlobLoader" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_box.blob_loaders import BoxBlobLoader\n", + "\n", + "box_folder_id = \"260932470532\"\n", + "\n", + "loader = BoxBlobLoader(\n", + " box_folder_id=box_folder_id,\n", + " recursive=False, # Optional. return entire tree, defaults to False\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Search for files with BoxBlobLoader\n", + "\n", + "If you need to search for files, the `BoxBlobLoader` offers two methods. 
First, you can perform a full-text search, with optional search options to narrow down that search.\n", + "\n", + "This requires 1 piece of information:\n", + "\n", + "* **query** (`str`)- A string containing the search query to perform.\n", + "\n", + "You can also provide a `BoxSearchOptions` object to narrow down that search:\n", + "* **box_search_options** (`BoxSearchOptions`)\n", + "\n", + "#### BoxBlobLoader search" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_box.blob_loaders import BoxBlobLoader\n", + "from langchain_box.utilities import BoxSearchOptions, DocumentFiles, SearchTypeFilter\n", + "\n", + "box_folder_id = \"260932470532\"\n", + "\n", + "box_search_options = BoxSearchOptions(\n", + " ancestor_folder_ids=[box_folder_id],\n", + " search_type_filter=[SearchTypeFilter.FILE_CONTENT],\n", + " created_date_range=[\"2023-01-01T00:00:00-07:00\", \"2024-08-01T00:00:00-07:00\"],\n", + " file_extensions=[DocumentFiles.DOCX, DocumentFiles.PDF],\n", + " k=200,\n", + " size_range=[1, 1000000],\n", + " updated_date_range=None,\n", + ")\n", + "\n", + "loader = BoxBlobLoader(\n", + " box_developer_token=box_developer_token,\n", + " query=\"Victor\",\n", + " box_search_options=box_search_options,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also search for content based on Box Metadata. 
If your Box instance uses Metadata, you can search for any documents that have a specific Metadata Template attached and meet certain criteria, like returning any invoices with a total greater than or equal to $500 that were created last quarter.\n", + "\n", + "This requires 1 piece of information:\n", + "\n", + "* **box_metadata_query** (`BoxMetadataQuery`)- A `BoxMetadataQuery` object describing the Metadata query to perform.\n", + "\n", + "The `BoxMetadataQuery` names the Metadata Template, the query string with its parameters, and the folder to search under:\n", + "* **template_key**, **query**, **query_params**, **ancestor_folder_id**\n", + "\n", + "#### BoxBlobLoader Metadata query" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_box.blob_loaders import BoxBlobLoader\n", + "from langchain_box.utilities import BoxMetadataQuery\n", + "\n", + "query = BoxMetadataQuery(\n", + " template_key=\"enterprise_1234.myTemplate\",\n", + " query=\"total >= :value\",\n", + " query_params={\"value\": 100},\n", + " ancestor_folder_id=\"260932470532\",\n", + ")\n", + "\n", + "loader = BoxBlobLoader(box_metadata_query=query)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load\n", + "\n", + "#### BoxLoader" ] }, { @@ -219,7 +357,35 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Lazy Load" + "#### BoxBlobLoader" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Blob(id='1514555423624' metadata={'source': 'https://app.box.com/0/260935730128/260931903795/Invoice-A5555.txt', 'name': 'Invoice-A5555.txt', 'file_size': 150} data=\"b'Vendor: AstroTech Solutions\\\\nInvoice Number: A5555\\\\n\\\\nLine Items:\\\\n - Gravitational Wave Detector Kit: $800\\\\n - Exoplanet Terrarium: $120\\\\nTotal: $920'\" mimetype='text/plain' path='https://app.box.com/0/260935730128/260931903795/Invoice-A5555.txt')\n", + "Blob(id='1514553902288' metadata={'source': 
'https://app.box.com/0/260935730128/260931903795/Invoice-B1234.txt', 'name': 'Invoice-B1234.txt', 'file_size': 168} data=\"b'Vendor: Galactic Gizmos Inc.\\\\nInvoice Number: B1234\\\\nPurchase Order Number: 001\\\\nLine Items:\\\\n - Quantum Flux Capacitor: $500\\\\n - Anti-Gravity Pen Set: $75\\\\nTotal: $575'\" mimetype='text/plain' path='https://app.box.com/0/260935730128/260931903795/Invoice-B1234.txt')\n" + ] + } + ], + "source": [ + "for blob in loader.yield_blobs():\n", + " print(f\"Blob({blob})\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lazy Load\n", + "\n", + "#### BoxLoader only" ] }, { @@ -238,6 +404,24 @@ " page = []" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extra fields\n", + "\n", + "All Box connectors offer the ability to select additional fields from the Box `FileFull` object to return as custom LangChain metadata. Each connector accepts an optional `List[str]` called `extra_fields` containing the JSON keys to return from that object, like `extra_fields=[\"shared_link\"]`. \n", + "\n", + "The connector will add these fields to the list of fields the integration needs to function and then add the results to the metadata returned in the `Document` or `Blob`, like `\"metadata\" : { \"source\" : \"source\", \"shared_link\" : \"shared_link\" }`. If a field is unavailable for that file, it will be returned as an empty string, like `\"shared_link\" : \"\"`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, diff --git a/docs/docs/integrations/providers/box.mdx b/docs/docs/integrations/providers/box.mdx index 3fde28d556bcb..85ffc1a79e9e6 100644 --- a/docs/docs/integrations/providers/box.mdx +++ b/docs/docs/integrations/providers/box.mdx @@ -177,3 +177,14 @@ from langchain_box.document_loaders import BoxLoader from langchain_box.retrievers import BoxRetriever ``` + +## Blob Loaders + +### BoxBlobLoader + +[See usage example](/docs/integrations/document_loaders/box) + +```python +from langchain_box.blob_loaders import BoxBlobLoader + +``` \ No newline at end of file diff --git a/docs/docs/integrations/retrievers/box.ipynb b/docs/docs/integrations/retrievers/box.ipynb index af2dd8bdd1813..e826b6e3a8656 100644 --- a/docs/docs/integrations/retrievers/box.ipynb +++ b/docs/docs/integrations/retrievers/box.ipynb @@ -563,6 +563,18 @@ "print(f\"result {result['output']}\")" ] }, + { + "cell_type": "markdown", + "id": "8b5b8adb-77ad-43e7-a41c-7880a787b43e", + "metadata": {}, + "source": [ + "## Extra fields\n", + "\n", + "All Box connectors offer the ability to select additional fields from the Box `FileFull` object to return as custom LangChain metadata. Each connector accepts an optional `List[str]` called `extra_fields` containing the JSON keys to return from that object, like `extra_fields=[\"shared_link\"]`. \n", + "\n", + "The connector will add these fields to the list of fields the integration needs to function and then add the results to the metadata returned in the `Document` or `Blob`, like `\"metadata\" : { \"source\" : \"source\", \"shared_link\" : \"shared_link\" }`. If a field is unavailable for that file, it will be returned as an empty string, like `\"shared_link\" : \"\"`." + ] + }, { "cell_type": "markdown", "id": "3a5bb5ca-c3ae-4a58-be67-2cd18574b9a3",