soto repository history is very large - 381 MiB #752

Open
vsarunas opened this issue Jan 9, 2025 · 6 comments

@vsarunas

vsarunas commented Jan 9, 2025

Describe the bug

Swift Package Manager dependency fetching is significantly slowed down by the large size of the Soto repository (~381MB) due to AWS service model files. This impacts development workflow as Xcode DerivedData or SPM caches frequently need to be cleared due to various bugs, and SPM doesn't support shallow clones (swift-package-manager#6062).

The repository size is 381.59 MiB and grows by 6MB+ each time AWS service models are synced from aws-sdk-go.

To Reproduce

$ gh repo clone soto-project/soto
Cloning into 'soto'...
remote: Enumerating objects: 114100, done.
remote: Counting objects: 100% (7887/7887), done.
remote: Compressing objects: 100% (3274/3274), done.
remote: Total 114100 (delta 5929), reused 4702 (delta 4593), pack-reused 106213 (from 3)
Receiving objects: 100% (114100/114100), 381.59 MiB | 15.17 MiB/s, done.
Resolving deltas: 100% (85429/85429), done.

The 20 largest JSON blobs across the full history (all of them versions of models/ec2.json) are:

git rev-list --objects --all | \
 git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
 grep '^blob' | \
 awk '$3 >= 500000' | \
 sort -k3nr | grep json | head -20

blob 336ecc330853ee06a307694ba425a7d2c8d5b460 6320480 models/ec2.json
blob c84199fffc6e61421dcf68711a3c275df1fa711e 6302862 models/ec2.json
blob 2c2430bf8a9bd4871684bcac269bcfc427a3c436 6039980 models/ec2.json
blob 06e20ca53d64951c669fcf97afbc0c59b62a53ad 5985734 models/ec2.json
blob 19c0dbb217c88008bcfe79e8ddad677fd6288fe3 5956200 models/ec2.json
blob 20f2ebe277b947091543fa752990c5ee6d553829 5927941 models/ec2.json
blob 9bb20e4a893300efa3e6c4d76600b73d7c59058f 5900772 models/ec2.json
blob 9ca7f79502db806873032094f08ecf13e51bdf15 5886406 models/ec2.json
blob 2433891020c34d5a6b81e5515f4c1bf256d10e10 5867251 models/ec2.json
blob 79ce1e574cddcaf5463547df42a9be5a02f3f493 5773910 models/ec2.json
blob 8d9b31253f21c67a0c3674152adc30b741eada8b 5740137 models/ec2.json
blob 34fed559480ecf58b72a354941ce1ab98eb95d1f 5546957 models/ec2.json
blob c6f2ef60bc2ca894a328654267ced949c8ad2e45 5519596 models/ec2.json
blob 47805fb8811dc399dd7edc1d457961a075a44d9d 5477188 models/ec2.json
blob ee225e13753a28ebbae96f1db7384eb5add49d09 5378103 models/ec2.json
blob 4025b25fd073468cafce5422bebddb1edabb2e3a 5302274 models/ec2.json
blob f82e2353bf30621eb0adc2f6f9d6ed8cd2c057af 5302103 models/ec2.json
blob 81e91be28c230e6bfa678307c0e6051005ed9264 5093333 models/ec2.json
blob b80941116c0cd5b07b982d0f0f6fa3feb9d3475d 5074867 models/ec2.json
blob 898e408cf2a1d285087b08d3faecf9ed315ddee0 5072090 models/ec2.json
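The list above shows individual blobs, but the cumulative cost of a path across all of its historical versions is arguably the more relevant number. A minimal sketch of that aggregation follows; the function name `sum_blob_sizes_by_path` is hypothetical, and it assumes paths contain no spaces. The sizes it reports are uncompressed object sizes, so they overstate what the packfile actually stores after delta compression (`%(objectsize:disk)` could be substituted to approximate on-disk size).

```shell
# Sum the uncompressed size of every historical version of each path.
# Reads lines of "<type> <sha> <size> <path>" on stdin, i.e. the output of
#   git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'
# Prints "<total bytes> <number of versions> <path>", largest total first.
# Assumption: paths contain no whitespace (only the first word is used).
sum_blob_sizes_by_path() {
  awk '$1 == "blob" { total[$4] += $3; count[$4]++ }
       END { for (p in total) printf "%.0f %d %s\n", total[p], count[p], p }' |
    sort -k1,1nr
}

# Usage inside a full clone:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     sum_blob_sizes_by_path | head -20
```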

@adam-fowler, would you consider moving these large model files to Git LFS to reduce repository size and improve clone times?
https://docs.github.com/en/repositories/working-with-files/managing-large-files/configuring-git-large-file-storage
https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage

@adam-fowler
Member

I was thinking about this. Having a link between the service files and the model files that generated them is useful. I might look into moving the models into a separate repo and using a submodule, so I can still access them from the soto repo if needed. I don't think submodules are pulled by SwiftPM.

@vsarunas
Author

I believe SPM does update submodules each time - https://github.com/swiftlang/swift-package-manager/blob/swift-6.0.3-RELEASE/Sources/SourceControl/GitRepository.swift#L649-L685

This repo has a submodule - https://github.com/swift-server/swift-kafka-client

git clone https://github.com/swift-server/swift-kafka-client.git

After each update, SPM then runs updateSubmoduleAndCleanNotOnQueue(), which performs the submodule fetch:

git submodule update --init --recursive
Submodule 'Sources/Crdkafka/librdkafka' (https://github.com/confluentinc/librdkafka) registered for path 'Sources/Crdkafka/librdkafka'
[...]
Submodule path 'Sources/Crdkafka/librdkafka': checked out '267367c9475c2154e72eafe6ff1957518cb2ed1a'
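To see in advance which submodules a `git submodule update --init --recursive` would register for a checked-out dependency, the `.gitmodules` file can be read directly. This is a rough sketch, not part of SwiftPM; the function name `list_submodule_paths` is hypothetical, and it assumes submodule paths contain no spaces.

```shell
# Print each submodule path declared in <repo>/.gitmodules, one per line.
# Prints nothing if the repository declares no submodules.
# Assumption: submodule paths contain no whitespace.
list_submodule_paths() {
  repo_dir=$1
  [ -f "$repo_dir/.gitmodules" ] || return 0
  # --get-regexp prints "<key> <value>"; keep only the value (the path).
  git config --file "$repo_dir/.gitmodules" \
      --get-regexp '^submodule\..*\.path$' | awk '{ print $2 }'
}

# Usage:
#   git clone https://github.com/swift-server/swift-kafka-client.git
#   list_submodule_paths swift-kafka-client
```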

Could the JSON files be stored in Git LFS and hosted by GitHub without any issues?

I tried:

git lfs migrate import --include="models/*.json" --everything
git push --force --all
git push --force --tags
git lfs prune

This is not allowed:

batch response: @vsarunas can not upload new objects to public fork vsarunas/soto

I then switched to a fully personal account, but reached a limit:

Uploading LFS objects:  97% (2984/3084), 1.5 GB | 6.3 MB/s, done.
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to push some refs to 'https://github.com/vsarunas/soto-lfs-test.git'

My billing summary (https://github.com/settings/billing/summary) shows 1 GB free for LFS.
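A likely reason the upload exceeds the packed repository size is that LFS stores every version of a file whole, without the delta compression a packfile applies. The sketch below estimates, before any force-push, how many objects and bytes a migration would move and whether that fits a given quota. It works on the `git cat-file --batch-check` output format used earlier in this thread. The function name `check_lfs_quota` is hypothetical, the path filter is an awk regular expression that only approximates the `--include` glob, and paths are assumed to contain no spaces.

```shell
# Estimate how much data an LFS migration would upload.
# stdin: lines of "<type> <sha> <size> <path>"
# $1:    awk regex selecting the paths to migrate (approximates --include)
# $2:    quota in bytes
# Prints "<objects> <bytes>"; returns 1 if <bytes> exceeds the quota.
check_lfs_quota() {
  awk -v re="$1" -v quota="$2" '
    $1 == "blob" && $4 ~ re { n++; bytes += $3 }
    END {
      printf "%d %.0f\n", n, bytes
      exit (bytes > quota) ? 1 : 0
    }'
}

# Usage inside a full clone, against a 1 GiB quota:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     check_lfs_quota '^models/.*[.]json$' 1073741824 ||
#     echo "migration would exceed the quota"
```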

@adam-fowler
Member

The problem with putting the models in a separate repo, LFS or not, is that I lose the connection between the models and the source code.

I need to have a think to see how I can get around this.

In the meantime, have you considered using the code generator plugin? If you aren't using any of the extension code, e.g. the S3 multipart upload helpers, then this is a viable alternative: https://github.com/soto-project/soto-codegenerator

@vsarunas
Author

Is the generator usable in https://github.com/soto-project/soto-s3-file-transfer ?

2025-01-13T17:28:30+0300 info SotoCodeGenerator : [SotoCodeGeneratorLib] Wrote s3_api.swift
2025-01-13T17:28:31+0300 info SotoCodeGenerator : [SotoCodeGeneratorLib] Wrote s3_shapes.swift
[...]
soto-s3-file-transfer/Sources/SotoS3FileTransfer/S3FileTransferManager.swift:140:35: error: value of type 'S3' has no member 'multipartUpload'
138 |         if fileSize > self.configuration.multipartThreshold {
139 |             let request = S3.CreateMultipartUploadRequest(bucket: to.bucket, key: to.key, options: options)
140 |             _ = try await self.s3.multipartUpload(
    |                                   `- error: value of type 'S3' has no member 'multipartUpload'
141 |                 request,
142 |                 partSize: self.configuration.multipartPartSize,

Or is this the limitation mentioned under https://github.com/soto-project/soto-codegenerator?tab=readme-ov-file#missing-code ?

@adam-fowler
Member

Sorry no

@vsarunas
Author

The project compiles and the test suite passes if I copy the following files from https://github.com/soto-project/soto/tree/7.3.0/Sources/Soto/Extensions/S3 into the generator project:
AsyncEnumeratedSequence.swift, FileByteBufferAsyncSequence.swift, ReportSizeByteBufferSequence.swift and S3+multipart.swift.

swift test | xcbeautify

----- xcbeautify -----
Version: 2.15.0
----------------------

Building for debugging...
[3/3] Applying SotoCodeGenerator-tool
Build of product 'SotoCodeGenerator' complete! (2.68s)
Building for debugging...
[1/1] Write swift-version-5BDAB9E9C0126B9D.txt
Build of product 'SotoCodeGenerator' complete! (0.15s)
[...]
[5/5] Linking soto-s3-file-transferPackageTests
Build complete! (14.90s)
All tests
soto-s3-file-transferPackageTests.xctest
S3FileTransferManagerTests
    ✔ testBigFolderUpload (0.060 seconds)
    ✔ testCancelledDownloadWithCancel (0.511 seconds)
    ✔ testCancelledSyncWithCancel (0.546 seconds)
    ✔ testCancelledSyncWithFlush (0.368 seconds)
    ✔ testCopyPathLocalToS3 (0.563 seconds)
    ✔ testDeleteFolderAndSync (0.605 seconds)
    ✔ testDeleteFolder (0.049 seconds)
    ✔ testDownloadFileToFolder (0.113 seconds)
    ✔ testDownloadFolderToFile (0.183 seconds)
    ✔ testFileFolderClash (0.034 seconds)
    ✔ testIgnoreFileFolderClash (0.048 seconds)
    ✔ testListFiles (0.006 seconds)
    ✔ testListS3Files (0.024 seconds)
    ✔ testS3CopyOfNonExistentFile (0.017 seconds)
    ✔ testS3Copy (0.122 seconds)
    ✔ testS3DownloadOfNonExistentFile (0.020 seconds)
    ✔ testS3TargetFiles (0.000 seconds)
    ✔ testS3toS3CopyPath (0.223 seconds)
    ✔ testS3UploadOfNonExistentFile (0.022 seconds)
    ✔ testSyncPathLocalToS3 (0.115 seconds)
    ✔ testTargetFiles (0.000 seconds)
    ✔ testUploadDownload (0.119 seconds)
Executed 24 tests, with 0 failures (0 unexpected) in 12.295 (12.300) seconds
S3PathTests
    ✔ testFileInFolder (0.000 seconds)
    ✔ testFileNameExtension (0.000 seconds)
    ✔ testS3File (0.000 seconds)
    ✔ testS3Folder (0.000 seconds)
    ✔ testSubFolder (0.000 seconds)
    ✔ testURL (0.000 seconds)
Executed 6 tests, with 0 failures (0 unexpected) in 0.001 (0.001) seconds
Test Suite 'soto-s3-file-transferPackageTests.xctest' passed at 2025-01-14 09:49:58.284.
Executed 30 tests, with 0 failures (0 unexpected) in 12.296 (12.301) seconds
Test Suite 'All tests' passed at 2025-01-14 09:49:58.284.
Executed 30 tests, with 0 failures (0 unexpected) in 12.296 (12.302) seconds
Test run started.
Test run with 0 tests passed after 0.001 seconds

Would it be possible to upstream these changes? For example, a script could sync the required JSON files and extensions from the main soto repository (if there is no other way to reduce the repository size).
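Such a sync script might look roughly like the sketch below. It is only an illustration, not an existing soto tool: the function name `sync_soto_files` and the destination directory in the usage note are hypothetical. It makes a shallow clone of a single tag, which avoids downloading the full history, and copies the requested files out of it. Model JSON files could be passed the same way as the extension sources.

```shell
# Copy selected files from one tag of a repository into a local directory.
# $1: repository URL   $2: tag or branch   $3: destination directory
# remaining arguments: file paths relative to the repository root
sync_soto_files() {
  repo_url=$1; ref=$2; dest=$3
  shift 3
  tmp=$(mktemp -d) || return 1
  # Shallow clone: fetch only the commit the tag points at, not the history.
  if ! git clone --quiet --depth 1 --branch "$ref" "$repo_url" "$tmp/src"; then
    rm -rf "$tmp"
    return 1
  fi
  mkdir -p "$dest"
  status=0
  for path in "$@"; do
    cp "$tmp/src/$path" "$dest/" || status=1
  done
  rm -rf "$tmp"
  return $status
}

# Usage (destination directory is a placeholder):
#   sync_soto_files https://github.com/soto-project/soto 7.3.0 Sources/S3Extensions \
#     Sources/Soto/Extensions/S3/AsyncEnumeratedSequence.swift \
#     Sources/Soto/Extensions/S3/FileByteBufferAsyncSequence.swift \
#     Sources/Soto/Extensions/S3/ReportSizeByteBufferSequence.swift \
#     Sources/Soto/Extensions/S3/S3+multipart.swift
```

The shallow clone still transfers one full snapshot of the repository, so this reduces the download rather than eliminating it.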
