
ENH: containers-add-dhub - add multiple images/tags/repositories from docker hub #135

Merged · 10 commits · Apr 13, 2021

Conversation

yarikoptic
Member

This is the first truly "containers-" (as opposed to "container-") command: it is intended to add multiple container images from a Docker Hub organization, possibly across multiple repositories.

Since Docker Hub announced that its retention policy is about to change (well, the deadline has now moved to mid-2021, so it is not as urgent), we had better provide easy means to "mirror" (or back up) all (or a desired subset) of Docker containers within a DataLad dataset. We are planning to do that within https://github.com/repronim/containers , which ATM contains only Docker containers rebuilt into Singularity containers. But IMHO it could be of great benefit to have a helper tool, or even a command, so any user of datalad-container could establish a seamless backup of Docker containers from their own (or just a favorite collection of) Docker repositories.

ATM it is possible to do that with this helper script (not a containers- command yet), which was initially crafted by @kyleam and then tortured into spaghetti code by me. It works for an "official" Docker images repo (e.g. busybox or neurodebian), which is just a shortcut to library/<repo> on the hub, or for any other (non-library) collection of repositories (e.g. repronim/).
It already supports multi-architecture image collections (e.g. busybox) and annotates the architecture in the image name (if there are multiple archs for the tag). See the header of the file for more information and conventions.

If eager to try (although you might want to uncomment the TEST definitions in the file), run it in some new target dataset (probably created with -c text2git), e.g. .../tools/containers_add_dhub_tags.py <(echo busybox) or tools/containers_add_dhub_tags.py <(echo repronim/).

TODOs

  • many embedded in the code ATM, yet to be fully populated here
    • populate with URLs to layers on docker hub
    • establish a "default" datalad container config for a repo (for a "latest" and/or repo:tag) which would use latest for the default arch
    • ...
  • consider making it into a command, e.g. containers-add-dhub

kyleam and others added 10 commits October 29, 2020 21:47
It's rough, but it might be at least somewhat functional.  I let the
first few entries of

    echo repronim/ | python containers_add_dhub_tags.py

and

    echo neurodebian | python containers_add_dhub_tags.py

complete.  The lack of progress output for the underlying `docker pull` is
unfortunate (and I bet there's a datalad-container issue open about it).

The main design decision here is to name the results (the image and
downloaded manifest) based on the manifest's .config.digest value and
then use exists() checks as an indication that we already have the
result locally.  Whether that's actually valid should be revisited.
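The digest-based naming and exists() check described above can be sketched roughly as follows; the layout and function names here are hypothetical, not the script's actual API:

```python
from pathlib import Path

def result_paths(repo, manifest, base="images"):
    # Name both results (image and downloaded manifest) after the
    # manifest's .config.digest value, stripping the "sha256:" prefix.
    digest = manifest["config"]["digest"].split(":", 1)[1]
    d = Path(base) / repo
    return d / f"{digest}.tar", d / f"{digest}.manifest.json"

def already_have(repo, manifest, base="images"):
    # exists() is taken as "we already have this result locally";
    # as noted above, whether that's actually valid should be revisited.
    image, man = result_paths(repo, manifest, base)
    return image.exists() and man.exists()
```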

Ref: ReproNim/containers#48
Use directory structure to make it easy to see which repo a digest
belongs to.

Ref: ReproNim/containers#48 (comment)
The digest was used in the first pass to avoid worrying about invalid
characters but, as mentioned in the comment and on the
ReproNim/containers issue tracker [*], it doesn't make for a
recognizable name.

Instead construct the name from combining the repository and tag,
replacing any characters that containers-add doesn't allow with "--".
This introduces ambiguity and the potential for conflicts but is
probably good enough.
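A sketch of that naming scheme; the exact character set containers-add accepts is an assumption here (alphanumerics and dashes), with everything else mapped to "--":

```python
import re

# Assumption: container names may contain only alphanumerics and dashes.
DISALLOWED = re.compile(r"[^0-9a-zA-Z-]")

def name_from_repo_tag(repository, tag):
    # e.g. "repronim/reproin" + "0.6.0" -> "repronim--reproin-0--6--0".
    # Note the ambiguity: "a/b" and "a--b" collapse to the same name.
    return DISALLOWED.sub("--", f"{repository}-{tag}")
```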

[*] ReproNim/containers#48 (comment)
Make the containers-list output less noisy for official images.

Ref: ReproNim/containers#48 (comment)
Subdirectories were added a few commits back (69e7fe5).
ENH: make ugly short code into a pretty long spaghetti to support multiple architectures etc

and also
- support multiple architectures
- request a specific image (based on digest), not just the "first" one
- add a bunch of TODO comments for what to do next
RF: de-dataset it and move file under tools/ so we could ingest into datalad-container itself

I think this script would be at least a valuable tool within
datalad-container, or might even better become a proper command
of the datalad-container extension.
* local-dhub-tags/master:
  RF: de-dataset it and move file under tools/ so we could ingest into datalad-container itself
  ENH: make ugly short code into a pretty long spaghetti to support multiple architectures etc
  Update --help's description for directories
  Drop "library/" in containers-add name
  Use tag in containers-add name
  Store images and manifests under namespace/repo subdirectory
  Prototype of script to feed Docker Hub tags to containers-add
  [DATALAD] new dataset
no architecture if only one, and no last_pushed if None
@codecov

codecov bot commented Oct 31, 2020

Codecov Report

Merging #135 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #135   +/-   ##
=======================================
  Coverage   86.65%   86.65%           
=======================================
  Files          17       17           
  Lines         922      922           
=======================================
  Hits          799      799           
  Misses        123      123           


@kyleam
Contributor

kyleam commented Nov 2, 2020

Thanks for turning my initial sketch into something more useful.

I think in terms of exploration this is good, and I don't have any objections to having this script or something like it live in tools/. However, having played around with skopeo more, it's really impressive, and I think we should just focus on it going forward. For the immediate problem at hand, there is already skopeo-sync, and it takes care of a large part of what this script does. A small wrapper would still be needed to list the repos of an organization and register them with containers-add (though I'm not sure that it's a good thing to couple the local dump with registering the container). Also, I don't think skopeo-sync supports getting all the architectures yet [*].

Here's an example to sync the neurodebian repo:

$ skopeo sync --src docker --scoped --dest dir docker.io/neurodebian images/
$ tree --charset=ascii images | head
images
`-- docker.io
    `-- library
        |-- neurodebian:artful
        |   |-- 261816990a775a30f88752a13a62a52bcde56bb65e4a55f197a2e9fb9bb5920e
        |   |-- 448bb314afa553bfb1578121328bbe92d2b5ca0f411967e7a0a200f672ade92f
        |   |-- 4ccdce43d1e00fd03ac5438d39e731c16db3dfcf03c68390884b8e8c814221ca
        |   |-- 518254c3dbad5ed8bf16b404277faae75f3ba8bd5fcd69a217de42fbed22f250
        |   |-- 78ff727be57a68299558bb40b737669ca5cb9a8db948411d852ec809c14e7a1f
        |   |-- 82656eee95ad054e0aa75486e7c55b7666c26abbd9bf19373dd349f6e172ce9d

Setting aside backing up/syncing a registry, the two missing pieces for working with skopeo are the adapter and support for the URLs in the datalad special remote. I think the mechanics of both aspects are straightforward, but the trickier part of both is thinking through the design to leave room for other sources and targets. In the case of the blob downloader, that probably just boils down to including the specific registry as part of the URL. In the case of the adapter, the main thing I have in mind is execution with podman (gh-89, gh-106).
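Including the registry as part of the URL could look like the sketch below, based on the Registry HTTP API v2 blob endpoint (GET /v2/&lt;name&gt;/blobs/&lt;digest&gt;); the function name is hypothetical:

```python
def blob_url(registry, repository, digest):
    # Scoping the URL by registry leaves room for sources other than
    # docker.io, since compliant registries share the v2 endpoint layout.
    return f"https://{registry}/v2/{repository}/blobs/{digest}"
```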

I'm working on getting an initial version of the downloader and skopeo adapter set up.

[*] It looks like support for that is coming: containers/skopeo#880

@yarikoptic
Member Author

Great, thanks @kyleam for looking into it!
I also wonder: if we bet on skopeo's sync, would it still work (sync only new images, etc.) if files are annexed (symlinks)? Would we need to keep them unlocked, or unlock them explicitly first (painful/inefficient)?

@kyleam
Contributor

kyleam commented Nov 3, 2020

I also wonder, if we bet on skopeo's sync, would it still work (sync only new images etc) if files are annexed (symlinks)

Hmm, good point, but actually an update wouldn't work in general with a directory destination: it will fail, refusing to overwrite the directory. I had assumed it would skip because I saw "Copying blob X skipped: already exists" in sync output posted to the issue tracker but, looking at the code now, it seems that would only happen when copying to a registry destination, not a local directory. So it was too soon to say sync could mostly replace this script. We could still do a skopeo copy --all ..., though, rather than go through the docker adapter.

@yarikoptic
Member Author

I think we had better just merge this PR to provide the tool. All the TODOs can come later if we decide to proceed (likely with the next Docker Hub announcement ;)).

@yarikoptic yarikoptic merged commit 3f7b7b4 into datalad:master Apr 13, 2021
@yarikoptic yarikoptic added the internal Changes only affect the internal API label Apr 13, 2021
@yarikoptic yarikoptic deleted the enh-dhub-tags branch February 4, 2023 14:58