a lighter database? #95
Comments
Hi Dany,
The two largest parts of the DB are the PSC Diamond DB (UniRef90 cluster representative sequences) and the SQLite DB storing the ~200 million IPS sequence hashes (UniRef100) along with all pre-compiled annotations. Therefore, excluding all but one annotation DB wouldn't significantly reduce the DB size.
One option to reduce the database size (that I have already thought about) is to compile sub-databases for certain phyla. Of course, that would require developing, implementing and testing a couple of things, so it would be a mid-term effort. If this is of interest to more users, we'd happily address it.
Another option would be to host the database on more servers distributed around the globe, which might provide more bandwidth and better download times. Might that help in your case? Do you know of any free hosting services that would be eligible?
Best regards,
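To illustrate the second component described above, here is a minimal sketch of how a hash-keyed lookup of pre-compiled annotations in SQLite might look. The table name `ips`, its columns, the helper names and the choice of MD5 are all assumptions made for illustration; this is not Bakta's actual schema or code.

```python
import hashlib
import sqlite3


def seq_hash(sequence: str) -> str:
    """Return a hex digest for an exact protein sequence (MD5 is an assumption)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one hypothetical table keyed by hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ips (hash TEXT PRIMARY KEY, length INTEGER, product TEXT)"
    )
    return conn


def add_annotation(conn: sqlite3.Connection, sequence: str, product: str) -> None:
    """Store a pre-compiled annotation under the hash of its sequence."""
    conn.execute(
        "INSERT INTO ips (hash, length, product) VALUES (?, ?, ?)",
        (seq_hash(sequence), len(sequence), product),
    )


def lookup(conn: sqlite3.Connection, sequence: str):
    """Return the stored product for an identical sequence, or None if absent."""
    row = conn.execute(
        "SELECT product FROM ips WHERE hash = ? AND length = ?",
        (seq_hash(sequence), len(sequence)),
    ).fetchone()
    return row[0] if row else None
```

In a layout like this, each row is keyed by a sequence hash, so the row count follows the number of sequences rather than the number of annotation sources, which is consistent with the point above that dropping annotation DBs would save little space.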
Another idea (inspired by @tseemann) is to use a ranked set of broader protein clusters. This could be addressed by skipping the
A quick check on UniProt/UniRef50 revealed 2,660,356 UniRef50 proteins. I'd estimate that this would reduce the size of the entire database to roughly 3-4 GB.
Hi @GaioTransposon,
This lightweight version is only 1.2 GB zipped and 3 GB unzipped.
EDIT: it was a faulty conda installation (I think
Yes, the third-party dependencies needed an update. It should work now.
Hi there and thank you for the tool,
is there an option to download only part of the database?
The database (https://zenodo.org/record/5961398/files/db.tar.gz) is nearly 30 GB and takes about 12 hours to download (I am using
bakta_db download --output .
with bakta installed with conda). What if one just wants to use only one of the DBs (e.g. UniProtKB/Swiss-Prot: 2021_04)?
Kind Regards
Dany
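For a slow connection like the one described above, one workaround independent of Bakta itself is to resume an interrupted download instead of restarting it. The following is a minimal sketch using only the Python standard library. It assumes the server honors HTTP `Range` requests (not verified here for Zenodo), and the function names and the local file name are hypothetical.

```python
import os
import shutil
import urllib.request

DB_URL = "https://zenodo.org/record/5961398/files/db.tar.gz"


def build_resume_request(url: str, dest: str):
    """Build a request that asks only for the bytes missing from `dest`."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if offset > 0:
        # Ask the server to start at the first byte we do not have yet.
        req.add_header("Range", f"bytes={offset}-")
    return req, offset


def download(url: str, dest: str) -> None:
    """Download `url` to `dest`, appending to a partial file when possible."""
    req, offset = build_resume_request(url, dest)
    # Note: if `dest` is already complete, a server may answer 416 and
    # urlopen will raise urllib.error.HTTPError.
    with urllib.request.urlopen(req) as resp:
        # 206 means the server sent only the requested range; any other
        # status means it sent the whole file, so start from scratch.
        mode = "ab" if offset > 0 and resp.status == 206 else "wb"
        with open(dest, mode) as out:
            shutil.copyfileobj(resp, out)


# Example call (not executed here): download(DB_URL, "db.tar.gz")
```

Rerunning `download` after an interruption would then continue from the size of the partial file, provided the assumption about `Range` support holds.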