WARC lookup for remote zipfile (WACZ) #23

Merged
leewesleyv merged 6 commits into main from feature/21-warc-lookup-remote-zip
Dec 31, 2024

@leewesleyv leewesleyv commented Dec 19, 2024

Resolves #21

To account for large zipfiles (WACZ) we are implementing lookup of the index/WARC record using range requests for remote files.

  • Add a storage handler class that abstracts logic to allow handling local and remote zipfiles
  • Implement logic for requesting partial zip data and parsing it at a low level instead of using Python's built-in zipfile module (which does not seem to work without access to the entire file).
  • Add unit-tests
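
As a rough illustration of the range-based lookup described above (a minimal sketch, not the code in this PR): the end-of-central-directory record is read from the tail of the file, the central directory is fetched with a second ranged read, and a slice of a member is then read directly. The names `read_central_directory`, `read_member_slice` and `read_range` are hypothetical. `read_range(start, length)` stands in for an HTTP request with a `Range` header and is backed by an in-memory zip here. The sketch assumes no ZIP64 records and that the WARC member is stored uncompressed, so an offset inside the member maps directly to an offset in the zip.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_SIZE = 22            # fixed part of the end-of-central-directory record
MAX_COMMENT_SIZE = 65535  # an archive comment of up to this size may follow it


def read_central_directory(read_range, file_size):
    """Return {member name: (local header offset, compression method, compressed size)}."""
    # Request 1: the tail of the file, which contains the EOCD record.
    tail_size = min(file_size, EOCD_SIZE + MAX_COMMENT_SIZE)
    tail = read_range(file_size - tail_size, tail_size)
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("end of central directory record not found")
    # Total entry count, central directory size and offset sit at bytes 10..19.
    total, cd_size, cd_offset = struct.unpack("<HII", tail[pos + 10:pos + 20])

    # Request 2: the central directory itself.
    cd = read_range(cd_offset, cd_size)
    entries, cursor = {}, 0
    for _ in range(total):
        header = struct.unpack("<4sHHHHHHIIIHHHHHII", cd[cursor:cursor + 46])
        if header[0] != b"PK\x01\x02":
            raise ValueError("bad central directory entry")
        method, compressed_size = header[4], header[8]
        name_len, extra_len, comment_len = header[10:13]
        local_offset = header[16]
        name = cd[cursor + 46:cursor + 46 + name_len].decode("utf-8")
        entries[name] = (local_offset, method, compressed_size)
        cursor += 46 + name_len + extra_len + comment_len
    return entries


def read_member_slice(read_range, entries, name, offset, length):
    """Read `length` bytes at `offset` inside a stored (uncompressed) member."""
    local_offset, method, compressed_size = entries[name]
    if method != 0:
        raise NotImplementedError("direct slicing only works for stored members")
    if offset + length > compressed_size:
        raise ValueError("slice exceeds member size")
    # The local header has its own name/extra lengths, so it must be read too.
    local_header = read_range(local_offset, 30)
    if local_header[:4] != b"PK\x03\x04":
        raise ValueError("bad local file header")
    name_len, extra_len = struct.unpack("<HH", local_header[26:30])
    data_start = local_offset + 30 + name_len + extra_len
    return read_range(data_start + offset, length)


# Demo: an in-memory archive stands in for the remote WACZ file.
first = b"WARC/1.1 first record\n"
second = b"WARC/1.1 second record\n"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("indexes/index.cdxj", b"placeholder index\n")
    zf.writestr("archive/data.warc", first + second)
blob = buffer.getvalue()


def read_range(start, length):
    # A remote handler would send "Range: bytes=start-(start+length-1)" instead.
    return blob[start:start + length]


entries = read_central_directory(read_range, len(blob))
record = read_member_slice(read_range, entries, "archive/data.warc", len(first), len(second))
```

With this layout, looking up one record costs a handful of small ranged reads rather than a download of the whole archive. The offset and length passed to `read_member_slice` would come from the index entry for the record.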

Implement different storage handlers

  • Implement local storage handler class
  • Implement S3 storage handler class
  • Implement GC storage handler class
  • Implement Azure storage handler class
  • Implement FTP storage handler class
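
One possible shape for the storage handler abstraction listed above, as a hedged sketch only: every class and function name below is hypothetical and not taken from `scrapy_webarchive/wacz/storages.py`. The idea is that each backend exposes just the total size and a ranged read, so the zip-parsing code does not need to know whether the bytes come from disk or from a remote store.

```python
import abc
import os
import tempfile


class ZipStorageHandler(abc.ABC):
    """Hypothetical interface: the minimum a backend needs for ranged zip reads."""

    @abc.abstractmethod
    def size(self) -> int:
        """Total size of the zip file in bytes."""

    @abc.abstractmethod
    def read_range(self, start: int, length: int) -> bytes:
        """Return `length` bytes starting at byte offset `start`."""


class LocalZipStorageHandler(ZipStorageHandler):
    """Local files: a ranged read is a seek followed by a read."""

    def __init__(self, path: str) -> None:
        self.path = path

    def size(self) -> int:
        return os.path.getsize(self.path)

    def read_range(self, start: int, length: int) -> bytes:
        with open(self.path, "rb") as f:
            f.seek(start)
            return f.read(length)


def http_range_header(start: int, length: int) -> str:
    """Range header value a remote handler could send; the end offset is inclusive."""
    return f"bytes={start}-{start + length - 1}"


# Demo with a throwaway local file standing in for a WACZ archive.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.wacz")
    with open(path, "wb") as f:
        f.write(b"0123456789")
    handler = LocalZipStorageHandler(path)
    total = handler.size()
    chunk = handler.read_range(4, 3)
```

A remote handler (S3, for example) would implement the same two methods by asking the object store for the object's size and for a byte range, using a header like the one `http_range_header` builds.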

Useful links

@leewesleyv leewesleyv requested a review from wvengen December 19, 2024 10:29
@leewesleyv leewesleyv self-assigned this Dec 19, 2024

@wvengen wvengen left a comment

Wow, impressive you got this, with a streaming zip-reading implementation! Looking good, at first glance.

I'm still a bit surprised that there is no existing streaming-zip-over-http implementation, but it happens that sometimes some real engineering is still necessary :) 🚀

scrapy_webarchive/wacz/storages.py (2 resolved review threads)

wvengen commented Dec 19, 2024

I would say GC/Azure/FTP are nice-to-haves, so feel free to add them, but I would not spend a long time on them if you run into blocking issues.

@leewesleyv leewesleyv force-pushed the feature/21-warc-lookup-remote-zip branch from 486d4ba to 7269880 Compare December 20, 2024 14:35
@leewesleyv leewesleyv marked this pull request as ready for review December 23, 2024 13:11
@leewesleyv leewesleyv merged commit 2db01ad into main Dec 31, 2024
7 checks passed
@leewesleyv leewesleyv deleted the feature/21-warc-lookup-remote-zip branch December 31, 2024 09:53
Successfully merging this pull request may close these issues.

Improve performance of get_warc_from_cdxj_record
2 participants