Allow splitting of CSV file if it is larger than 10MB #154

FastLee · 2024-10-09T16:51:10Z

closes #151

Added Integration Test.

github-actions · 2024-10-09T16:57:04Z

✅ 38/38 passed, 2 skipped, 1m26s total

Running from acceptance #219

JCZuurmond

Added some pointers; please add more tests.

JCZuurmond · 2025-01-15T12:32:43Z

src/databricks/labs/blueprint/installation.py

@@ -40,6 +40,8 @@

 __all__ = ["Installation", "MockInstallation", "IllegalState", "NotInstalled", "SerdeError"]

+FILE_SIZE_LIMIT: int = 1024 * 1024 * 10


Type hint looks redundant

Suggested change

FILE_SIZE_LIMIT: int = 1024 * 1024 * 10

FILE_SIZE_LIMIT = 1024 * 1024 * 10

asnare · 2025-01-15

Irrespective of whether it's redundant or not, it's incorrect: it should be ClassVar[int].

JCZuurmond · 2025-01-15T12:33:28Z

src/databricks/labs/blueprint/installation.py

@@ -132,6 +134,10 @@ def check_folder(install_folder: str) -> Installation | None:
            tasks.append(functools.partial(check_folder, service_principal_folder))
        return Threads.strict(f"finding {product} installations", tasks)

+    @staticmethod
+    def extension(filename):


Missing type hints. This also looks like a redundant method; typically we use Path.suffix.
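For illustration, a minimal sketch of what a typed helper built on `suffix` might look like. The body of the `extension` method in the PR is not shown here, so the return convention (extension without the leading dot) is an assumption, as is the use of `PurePosixPath` for workspace-style paths.

```python
from pathlib import PurePosixPath


def extension(filename: str) -> str:
    """Return the file extension without the leading dot (assumed convention)."""
    # PurePosixPath only parses the string; it never touches the filesystem.
    suffix = PurePosixPath(filename).suffix
    return suffix[1:] if suffix else ""
```

With a helper this small, calling `PurePosixPath(filename).suffix` directly at the call site may be enough, which is likely the point of the comment.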

JCZuurmond · 2025-01-15T12:38:47Z

src/databricks/labs/blueprint/installation.py

@@ -804,9 +841,21 @@ def _dump_csv(raw: list[Json], type_ref: type) -> bytes:
        writer = csv.DictWriter(buffer, field_names, dialect="excel")
        writer.writeheader()
        for as_dict in raw:
+            # Check if the buffer + the current row is over the file size limit


@asnare: could you have a look at this? From experience I know you are familiar with looping over these situations in a smart way. Is there a way to chunk raw so that we get the minimum number of chunks while each chunk stays smaller than FILE_SIZE_LIMIT (after writing the chunk to the io.StringIO buffer)?

asnare · 2025-01-15

This is fiddly, partly because we're treating the limit as a hard limit and there's no way to know in advance whether a row will exceed the limit.

A few things spring to mind:

If we treat the limit as a target and don't try to back out the last line then things become a lot simpler.

Even though we try to treat the maximum size as a limit, this code can still make files larger because we're counting characters but some will be more than a single byte. (Need to use BytesIO to fix this.)

An alternative approach, but not necessarily better, would be to first encode the header and rows as lines (in bytes), and then a second pass to chunk things up. I don't think this would be less code though.
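The two-pass alternative could be sketched roughly as follows. This is not the code from the PR: `encode_rows` and `chunk_csv` are hypothetical names, and the sketch assumes every chunk repeats the header and that rows must stay in their original order. Sizes are measured on encoded bytes, so multi-byte characters are counted correctly.

```python
import csv
import io


def encode_rows(rows: list[dict], field_names: list[str]) -> tuple[bytes, list[bytes]]:
    """First pass: encode the header and every row as a separate CSV line in bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, field_names, dialect="excel")
    writer.writeheader()
    header = buffer.getvalue().encode("utf8")
    lines = []
    for row in rows:
        # Reset the buffer so it only ever holds the current row.
        buffer.seek(0)
        buffer.truncate(0)
        writer.writerow(row)
        lines.append(buffer.getvalue().encode("utf8"))
    return header, lines


def chunk_csv(rows: list[dict], field_names: list[str], limit: int) -> list[bytes]:
    """Second pass: greedily pack encoded lines into chunks of at most `limit` bytes."""
    header, lines = encode_rows(rows, field_names)
    chunks: list[bytes] = []
    parts, size = [header], len(header)
    for line in lines:
        # Start a new chunk only if the current one already holds a row,
        # so a single oversized row still ends up in a chunk of its own.
        if len(parts) > 1 and size + len(line) > limit:
            chunks.append(b"".join(parts))
            parts, size = [header], len(header)
        parts.append(line)
        size += len(line)
    chunks.append(b"".join(parts))
    return chunks
```

Because each row's encoded size is known before packing, there is no need to back out the last line. A chunk can exceed `limit` only when a single row plus the header is already larger than the limit. With the row order fixed, filling each chunk greedily yields the fewest chunks.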

JCZuurmond · 2025-01-15T12:41:30Z

src/databricks/labs/blueprint/installation.py

        """The `_dump_csv` method is a private method that is used to serialize a list of dictionaries to a CSV string.
        This method is called by the `save` method."""
+        raws = []


Move this closer to where it is used.

JCZuurmond · 2025-01-15T12:41:57Z

src/databricks/labs/blueprint/installation.py

+                with self._ws.workspace.download(f"{self.install_folder()}/{filename}") as f:
+                    return self._convert_content(filename, f)
+            except NotFound:
+                # If the file is not found, check if it is a multi-part csv file


Move this into a separate method.

JCZuurmond · 2025-01-15T12:42:23Z

src/databricks/labs/blueprint/installation.py

+        raws = converters[extension](as_dict, type_ref)
+        if len(raws) > 1:
+            for i, raw in enumerate(raws):
+                self.upload(f"{filename[0:-4]}.{i + 1}.csv", raw)


Why filename[0:-4]?
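The slice presumably strips a four-character `.csv` extension, which silently breaks for any other extension length. A sketch of a more explicit alternative, assuming the `name.N.csv` naming from the diff is kept (`part_name` is a hypothetical helper, not part of the PR):

```python
from pathlib import PurePosixPath


def part_name(filename: str, index: int) -> str:
    """Build the name of the index-th part, e.g. state.csv -> state.1.csv."""
    path = PurePosixPath(filename)
    # stem and suffix split on the last dot, whatever the extension length.
    return str(path.with_name(f"{path.stem}.{index}{path.suffix}"))
```

The upload loop could then call `part_name(filename, i + 1)` instead of slicing the string.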

JCZuurmond · 2025-01-15T12:44:21Z

src/databricks/labs/blueprint/installation.py

-        self.upload(filename, raw)
+        raws = converters[extension](as_dict, type_ref)
+        if len(raws) > 1:
+            for i, raw in enumerate(raws):


In the Spark world it is common to use a folder instead:

Name the folder like the file

Combine all files inside the folder to get the final file
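A rough sketch of the folder layout, using the local filesystem as a stand-in for the workspace. The `part-00000.csv` naming and the `write_parts` and `read_parts` helpers are assumptions for illustration, and the sketch assumes every part repeats a single-line header.

```python
from pathlib import Path


def write_parts(folder: Path, chunks: list[bytes]) -> None:
    """Write each chunk as a numbered part inside a folder named like the file."""
    folder.mkdir(parents=True, exist_ok=True)
    for i, chunk in enumerate(chunks):
        # Zero-padded indices keep lexical order equal to numeric order.
        (folder / f"part-{i:05d}.csv").write_bytes(chunk)


def read_parts(folder: Path) -> bytes:
    """Concatenate all parts in name order, keeping only the first header."""
    combined = []
    for i, part in enumerate(sorted(folder.glob("part-*.csv"))):
        data = part.read_bytes()
        if i > 0:
            # Drop the repeated header line from every part after the first.
            _, _, data = data.partition(b"\n")
        combined.append(data)
    return b"".join(combined)
```

With this layout the reader does not need to probe for `name.1.csv`, `name.2.csv`, and so on: listing the folder gives the complete set of parts.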

FastLee added 4 commits October 8, 2024 13:41

Split CSV files.

70737f1

Added Integration Test.

Added more efficient split

c011003

Added support for read.

d082038

Cleaned up code

cc9d629

FastLee requested a review from nfx as a code owner October 9, 2024 16:51

FastLee had a problem deploying to runtime October 9, 2024 16:51 — with GitHub Actions Failure

gueniai requested a review from JCZuurmond January 13, 2025 15:23

JCZuurmond requested changes Jan 15, 2025
