Offset by 1 bug on RecursiveJsonSplitter::split_json() function #29153

Open
5 tasks done
blupants opened this issue Jan 11, 2025 · 0 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

blupants commented Jan 11, 2025

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_text_splitters import RecursiveJsonSplitter


input_data = {
  "projects": {
    "AS": {
      "AS-1": {}
    },
    "DLP": {
      "DLP-7": {},
      "DLP-6": {},
      "DLP-5": {},
      "DLP-4": {},
      "DLP-3": {},
      "DLP-2": {},
      "DLP-1": {}
    },
    "GTMS": {
      "GTMS-22": {},
      "GTMS-21": {},
      "GTMS-20": {},
      "GTMS-19": {},
      "GTMS-18": {},
      "GTMS-17": {},
      "GTMS-16": {},
      "GTMS-15": {},
      "GTMS-14": {},
      "GTMS-13": {},
      "GTMS-12": {},
      "GTMS-11": {},
      "GTMS-10": {},
      "GTMS-9": {},
      "GTMS-8": {},
      "GTMS-7": {},
      "GTMS-6": {},
      "GTMS-5": {},
      "GTMS-4": {},
      "GTMS-3": {},
      "GTMS-2": {},
      "GTMS-1": {}
    },
    "IT": {
      "IT-3": {},
      "IT-2": {},
      "IT-1": {}
    },
    "ITSAMPLE": {
      "ITSAMPLE-12": {},
      "ITSAMPLE-11": {},
      "ITSAMPLE-10": {},
      "ITSAMPLE-9": {},
      "ITSAMPLE-8": {},
      "ITSAMPLE-7": {},
      "ITSAMPLE-6": {},
      "ITSAMPLE-5": {},
      "ITSAMPLE-4": {},
      "ITSAMPLE-3": {},
      "ITSAMPLE-2": {},
      "ITSAMPLE-1": {}
    },
    "MAR": {
      "MAR-2": {},
      "MAR-1": {}
    }
  }
}

splitter = RecursiveJsonSplitter(max_chunk_size=216)
json_chunks = splitter.split_json(json_data=input_data)

input_data_DLP_5 = input_data.get("projects", {}).get("DLP", {}).get("DLP-5", None)
input_data_GTMS_10 = input_data.get("projects", {}).get("GTMS", {}).get("GTMS-10", None)
input_data_ITSAMPLE_2 = input_data.get("projects", {}).get("ITSAMPLE", {}).get("ITSAMPLE-2", None)

chunk_DLP_5 = None
chunk_GTMS_10 = None
chunk_ITSAMPLE_2 = None

for chunk in json_chunks:
    print(chunk)
    node = chunk.get("projects", {}).get("DLP", {}).get("DLP-5", None)
    if isinstance(node, dict):
        chunk_DLP_5 = node
    node = chunk.get("projects", {}).get("GTMS", {}).get("GTMS-10", None)
    if isinstance(node, dict):
        chunk_GTMS_10 = node
    node = chunk.get("projects", {}).get("ITSAMPLE", {}).get("ITSAMPLE-2", None)
    if isinstance(node, dict):
        chunk_ITSAMPLE_2 = node

print("\nRESULTS:")
if isinstance(chunk_DLP_5, dict):
    print(f"[PASS] - Node DLP-5 was found both in input_data and json_chunks")
else:
    print(f"[TEST FAILED] - Node DLP-5 from input_data was NOT FOUND in json_chunks")

if isinstance(chunk_GTMS_10, dict):
    print(f"[PASS] - Node GTMS-10 was found both in input_data and json_chunks")
else:
    print(f"[TEST FAILED] - Node GTMS-10 from input_data was NOT FOUND in json_chunks")

if isinstance(chunk_ITSAMPLE_2, dict):
    print(f"[PASS] - Node ITSAMPLE-2 was found both in input_data and json_chunks")
else:
    print(f"[TEST FAILED] - Node ITSAMPLE-2 from input_data was NOT FOUND in json_chunks")

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to use the langchain_text_splitters library to split JSON content with the function RecursiveJsonSplitter::split_json().

It works in most cases, but some data is lost depending on the input JSON and the chunk size I am using.

I can consistently reproduce the issue with the input JSON in my sample code: the nodes "GTMS-10" and "ITSAMPLE-2" are always discarded when I split the JSON using max_chunk_size=216.
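
To see whether other max_chunk_size values lose data as well, one option is a round-trip check: merge the chunks back into a single dictionary and compare it with the input. The following is a minimal sketch of that idea, assuming every chunk is a nested dict as in this report. The names deep_merge, lossy_sizes, and the split_fn callable are hypothetical helpers introduced here and are not part of LangChain.

```python
from typing import Callable, Iterable


def deep_merge(target: dict, source: dict) -> dict:
    """Recursively merge `source` into `target` in place and return `target`."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        elif isinstance(value, dict):
            # Copy nested dicts so the chunks themselves are never mutated.
            target[key] = deep_merge({}, value)
        else:
            target[key] = value
    return target


def lossy_sizes(
    data: dict,
    split_fn: Callable[[dict, int], list],
    sizes: Iterable[int],
) -> list:
    """Return the sizes for which the merged chunks do not reproduce `data`."""
    bad = []
    for size in sizes:
        merged: dict = {}
        for chunk in split_fn(data, size):
            deep_merge(merged, chunk)
        if merged != data:
            bad.append(size)
    return bad


# Possible usage with the splitter from this report (requires
# langchain_text_splitters to be installed):
#
#   from langchain_text_splitters import RecursiveJsonSplitter
#   split = lambda d, s: RecursiveJsonSplitter(max_chunk_size=s).split_json(json_data=d)
#   print(lossy_sizes(input_data, split, range(100, 400)))
```

Passing the splitter in as a callable keeps the check independent of any particular library version, so the same helper can be rerun after an upgrade.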

I noticed that the issue always affects nodes that would sit at the edge of a chunk. Running my sample code prints all 5 generated chunks:

python split_json_bug.py 

{'projects': {'AS': {'AS-1': {}}, 'DLP': {'DLP-7': {}, 'DLP-6': {}, 'DLP-5': {}, 'DLP-4': {}, 'DLP-3': {}, 'DLP-2': {}, 'DLP-1': {}}}}
{'projects': {'GTMS': {'GTMS-22': {}, 'GTMS-21': {}, 'GTMS-20': {}, 'GTMS-19': {}, 'GTMS-18': {}, 'GTMS-17': {}, 'GTMS-16': {}, 'GTMS-15': {}, 'GTMS-14': {}, 'GTMS-13': {}, 'GTMS-12': {}, 'GTMS-11': {}}}}
{'projects': {'GTMS': {'GTMS-9': {}, 'GTMS-8': {}, 'GTMS-7': {}, 'GTMS-6': {}, 'GTMS-5': {}, 'GTMS-4': {}, 'GTMS-3': {}, 'GTMS-2': {}, 'GTMS-1': {}}, 'IT': {'IT-3': {}, 'IT-2': {}, 'IT-1': {}}}}
{'projects': {'ITSAMPLE': {'ITSAMPLE-12': {}, 'ITSAMPLE-11': {}, 'ITSAMPLE-10': {}, 'ITSAMPLE-9': {}, 'ITSAMPLE-8': {}, 'ITSAMPLE-7': {}, 'ITSAMPLE-6': {}, 'ITSAMPLE-5': {}, 'ITSAMPLE-4': {}, 'ITSAMPLE-3': {}}}}
{'projects': {'ITSAMPLE': {'ITSAMPLE-1': {}}, 'MAR': {'MAR-2': {}, 'MAR-1': {}}}}

RESULTS:
[PASS] - Node DLP-5 was found both in input_data and json_chunks
[TEST FAILED] - Node GTMS-10 from input_data was NOT FOUND in json_chunks
[TEST FAILED] - Node ITSAMPLE-2 from input_data was NOT FOUND in json_chunks

Please notice that the 2nd chunk ends with node "GTMS-11" and the 3rd chunk starts with "GTMS-9". The same happens between chunk 4 (ends with "ITSAMPLE-3") and chunk 5 (starts with "ITSAMPLE-1").

Because the nodes "GTMS-10" and "ITSAMPLE-2" were lost at the chunk boundaries, I believe this might be a case of an off-by-one bug in the RecursiveJsonSplitter::split_json() Python function.
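
The three hard-coded node checks in the example code can be generalized by comparing every key path in the input with every key path found in the chunks, which lists exactly which nodes went missing. This is only a minimal sketch: key_paths and missing_paths are hypothetical helpers, not LangChain APIs, and the sample data is a hand-written miniature of the reported behaviour rather than real splitter output.

```python
def key_paths(node, prefix=()):
    """Yield the path (tuple of keys) of every nested dict key under `node`."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = prefix + (key,)
            yield path
            yield from key_paths(value, path)


def missing_paths(original, chunks):
    """Return sorted paths present in `original` but absent from all chunks."""
    seen = set()
    for chunk in chunks:
        seen.update(key_paths(chunk))
    return sorted(set(key_paths(original)) - seen)


# Hand-written miniature of the reported behaviour (not real splitter output).
original = {"projects": {"GTMS": {"GTMS-11": {}, "GTMS-10": {}, "GTMS-9": {}}}}
chunks = [
    {"projects": {"GTMS": {"GTMS-11": {}}}},
    {"projects": {"GTMS": {"GTMS-9": {}}}},
]
for path in missing_paths(original, chunks):
    print("/".join(path))  # prints: projects/GTMS/GTMS-10
```

Calling missing_paths(input_data, json_chunks) with the data from the example code should report every dropped node in one pass, not only the three that were checked by hand.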

Since I am calling it exactly as described in the documentation and I couldn't find any existing bug report or discussion mentioning this, I thought I should file a bug for it.

System Info

(.venv) user@User-MacBook-Air split_json_bug % python -m langchain_core.sys_info

System Information
------------------
> OS:  Darwin
> OS Version:  Darwin Kernel Version 23.6.0: Thu Sep 12 23:34:49 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_X86_64
> Python Version:  3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information
-------------------
> langchain_core: 0.3.29
> langsmith: 0.2.10
> langchain_text_splitters: 0.3.5

Optional packages not installed
-------------------------------
> langserve

Other Dependencies
------------------
> httpx: 0.28.1
> jsonpatch: 1.33
> langsmith-pyo3: Installed. No version info available.
> orjson: 3.10.14
> packaging: 24.2
> pydantic: 2.10.5
> PyYAML: 6.0.2
> requests: 2.32.3
> requests-toolbelt: 1.0.0
> tenacity: 9.0.0
> typing-extensions: 4.12.2
> zstandard: Installed. No version info available.
(.venv) user@User-MacBook-Air split_json_bug % pip freeze
annotated-types==0.7.0
anyio==4.8.0
certifi==2024.12.14
charset-normalizer==3.4.1
h11==0.14.0
httpcore==1.0.7
httpx==0.28.1
idna==3.10
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.3.29
langchain-text-splitters==0.3.5
langsmith==0.2.10
orjson==3.10.14
packaging==24.2
pydantic==2.10.5
pydantic_core==2.27.2
PyYAML==6.0.2
requests==2.32.3
requests-toolbelt==1.0.0
sniffio==1.3.1
tenacity==9.0.0
typing_extensions==4.12.2
urllib3==2.3.0
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jan 11, 2025