Add Dataverse RDM repository integration #19367

Open: wants to merge 64 commits into base: dev

Commits
8d49f12
chore: ignore vsc workspace file
KaiOnGitHub Nov 21, 2024
f0a21dc
feat: initial template for dataverse integration
KaiOnGitHub Nov 21, 2024
1359cfe
feat: prototype of fetching datasets
KaiOnGitHub Nov 21, 2024
99073e8
feat: adding dataverse keys where zenodo and invendio was added
KaiOnGitHub Nov 21, 2024
06e4424
feat: refactor with more abstract naming in base class and migrate fu…
KaiOnGitHub Dec 4, 2024
b260661
feat: download remote files from dataverse (prototype)
KaiOnGitHub Dec 4, 2024
571dc07
chore: renaming from "record" to "container" for coherent terminology
KaiOnGitHub Dec 6, 2024
65f9504
chore: docstring to clarify what is a container in invenio
KaiOnGitHub Dec 6, 2024
60afbf0
chore: remove reference to galaxy Collection
KaiOnGitHub Dec 6, 2024
a01c848
chore: change order of file source and repository interactor
KaiOnGitHub Dec 6, 2024
63105a5
Revert "chore: change order of file source and repository interactor"
KaiOnGitHub Dec 6, 2024
93393e7
chore: change order of repository interactor and file source in invenio
KaiOnGitHub Dec 6, 2024
975c329
chore: explain container in invenio docstrings
KaiOnGitHub Dec 6, 2024
50894b5
Revert "chore: explain container in invenio docstrings"
KaiOnGitHub Dec 6, 2024
47f0dbe
Revert "chore: change order of repository interactor and file source …
KaiOnGitHub Dec 6, 2024
bc9d671
chore: explain container in invenio docstrings
KaiOnGitHub Dec 6, 2024
6473711
chore: dataset refactoring and renaming to container term
KaiOnGitHub Dec 6, 2024
7c22927
feat: only load drafts if writeable is true
KaiOnGitHub Dec 7, 2024
73c72d1
chore: clarification regarding dataset drafts
KaiOnGitHub Dec 7, 2024
4be6109
feat: load latest version of files from datasets (this automatically …
KaiOnGitHub Dec 7, 2024
f557b13
feat: download files from draft (doesn't work yet due to missing user…
KaiOnGitHub Dec 7, 2024
4a37a21
feat: add config samples for dataverse and dataverse_sandbox
KaiOnGitHub Dec 7, 2024
1b18e90
chore: cleanup after download drafts feature, remove apparantly not n…
KaiOnGitHub Dec 7, 2024
29ee871
chore: clearer naming for file container method
KaiOnGitHub Dec 7, 2024
77f0dde
feat: api versioning for long term stability
KaiOnGitHub Dec 7, 2024
6d2081c
chore: add repository type for invenio filesource class
KaiOnGitHub Dec 7, 2024
0220c74
chore: remove todos for tested methods
KaiOnGitHub Dec 7, 2024
fd921a4
feat: export history to existing dataset
KaiOnGitHub Dec 7, 2024
20cbd1a
feat: reimport of archived datasets from dataverse
KaiOnGitHub Dec 12, 2024
ad65b27
chore: line breaks
KaiOnGitHub Dec 13, 2024
647f1f5
feat: more reliable way to reimport archives
KaiOnGitHub Dec 13, 2024
5698ed5
chore: typo
KaiOnGitHub Dec 13, 2024
234c27d
chore: typo
KaiOnGitHub Dec 13, 2024
559afcf
fix: only recognize .zip files for dataset import workaround (.tar fi…
KaiOnGitHub Dec 16, 2024
48fa830
chore: add NotFoundException
KaiOnGitHub Dec 16, 2024
f789332
chore: remove print statements
KaiOnGitHub Dec 16, 2024
21dc77d
chore: remove TODOs
KaiOnGitHub Dec 16, 2024
5898d45
chore: score_url_match function
KaiOnGitHub Dec 16, 2024
b6a4f36
chore: remove invenio specific feature
KaiOnGitHub Dec 16, 2024
4e21abd
chore: remove unused metadata parameter for file upload
KaiOnGitHub Dec 17, 2024
128f8ab
fix: add files again
KaiOnGitHub Dec 17, 2024
e22b2df
feat: export history to new dataverse dataset
KaiOnGitHub Dec 17, 2024
08cd27f
fix: remove todos, file_access_url
KaiOnGitHub Dec 17, 2024
173c445
chore: remove get creator method
KaiOnGitHub Dec 17, 2024
15c9488
chore: remove reference to directories in rdm base class
KaiOnGitHub Dec 18, 2024
144fec0
chore: refactor parse path function
KaiOnGitHub Dec 18, 2024
8442cee
chore: reordering imports
KaiOnGitHub Dec 18, 2024
f4fdf5d
chore: remove duplicated function
KaiOnGitHub Dec 18, 2024
5480e8f
chore: add TODO for tar.gz files
KaiOnGitHub Dec 18, 2024
3d0ec46
chore: add dataset download url
KaiOnGitHub Dec 18, 2024
ed8bd78
chore: private get alias function
KaiOnGitHub Dec 18, 2024
441254f
chore: payload as str parameter instead of dict
KaiOnGitHub Dec 18, 2024
06a976f
chore: public_name not optional, create alias in payload preparation
KaiOnGitHub Dec 18, 2024
9c3a632
chore: docstrings
KaiOnGitHub Dec 18, 2024
0398a76
chore: remove unusued get file url function
KaiOnGitHub Dec 18, 2024
a73f679
chore: reorder is api url function
KaiOnGitHub Dec 18, 2024
ddecf15
chore: remove unneeded function
KaiOnGitHub Dec 18, 2024
1e4135f
chore: simplify get datasets from response
KaiOnGitHub Dec 18, 2024
d778dfd
chore: simplify get_files_from_response
KaiOnGitHub Dec 18, 2024
e44c2d3
chore: simplify collection creation
KaiOnGitHub Dec 18, 2024
885c4f3
chore: add docstring to realize_to and write_from
KaiOnGitHub Dec 18, 2024
4153952
feat: improve search to only search for title
KaiOnGitHub Dec 18, 2024
be3c779
chore: simplify get_file_containers
KaiOnGitHub Dec 18, 2024
60e2055
feat: filter files of dataset with search query
KaiOnGitHub Dec 18, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -130,6 +130,7 @@ tool_test_output.json
 client/**/jsconfig.json
 vetur.config.js
 .pre-commit-config.yaml
+galaxy.code-workspace

 # Chrom len files
 *.len
1 change: 1 addition & 0 deletions client/src/utils/upload-payload.js
@@ -12,6 +12,7 @@ export const URI_PREFIXES = [
     "drs://",
     "invenio://",
     "zenodo://",
+    "dataverse://",
 ];

 export function isUrl(content) {
18 changes: 18 additions & 0 deletions lib/galaxy/config/sample/file_sources_conf.yml.sample
@@ -229,6 +229,24 @@
   public_name: ${user.preferences['zenodo_sandbox|public_name']}
   writable: true

+- type: dataverse
+  id: dataverse
+  doc: Dataverse is an open-source data repository platform designed for sharing, preserving, and managing research data, offering tools for data citation, exploration, and collaboration.
+  label: Dataverse
+  url: https://dataverse.org
+  token: ${user.user_vault.read_secret('preferences/dataverse/token')}
+  public_name: ${user.preferences['dataverse|public_name']}
+  writable: true
+
+- type: dataverse
+  id: dataverse_sandbox
+  doc: This is the sandbox instance of Dataverse. It is used for testing purposes only, content is NOT preserved. DOIs created in this instance are not real and will not resolve.
+  label: Dataverse Sandbox (use only for testing purposes)
+  url: https://demo.dataverse.org
+  token: ${user.user_vault.read_secret('preferences/dataverse_sandbox/token')}
+  public_name: ${user.preferences['dataverse_sandbox|public_name']}
+  writable: true
+
 # Note for developers: you can easily set up a minimal, dockerized Onedata environment
 # using the so-called "demo-mode": https://onedata.org/#/home/documentation/topic/stable/demo-mode
 - type: onedata
26 changes: 26 additions & 0 deletions lib/galaxy/config/sample/user_preferences_extra_conf.yml.sample
@@ -135,6 +135,32 @@ preferences:
               label: Creator name to associate with new records (formatted as "Last name, First name"). If left blank "Anonymous Galaxy User" will be used. You can always change this by editing your record directly.
               type: text
               required: False
+
+    dataverse:
+        description: Your Dataverse Integration Settings
+        inputs:
+            - name: token
+              label: API Token used to create draft records and to upload files. You can manage your tokens at https://YOUR_INSTANCE/dataverseuser.xhtml?selectTab=apiTokenTab (Replace YOUR_INSTANCE with your Dataverse instance URL)
+              type: secret
+              # store: vault # Requires setting up vault_config_file in your galaxy.yml
+              required: False
+            - name: public_name
+              label: Creator name to associate with new datasets (formatted as "Last name, First name"). If left blank "Anonymous Galaxy User" will be used. You can always change this by editing your dataset directly.
+              type: text
+              required: False
+
+    dataverse_sandbox:
+        description: Your Dataverse Integration Settings (TESTING ONLY)
+        inputs:
+            - name: token
+              label: API Token used to create draft records and to upload files. You can manage your tokens at https://demo.dataverse.org/dataverseuser.xhtml?selectTab=apiTokenTab (Replace demo.dataverse.org with your Dataverse instance URL)
+              type: secret
+              # store: vault # Requires setting up vault_config_file in your galaxy.yml
+              required: False
+            - name: public_name
+              label: Creator name to associate with new datasets (formatted as "Last name, First name"). If left blank "Anonymous Galaxy User" will be used. You can always change this by editing your dataset directly.
+              type: text
+              required: False

     # Used in file_sources_conf.yml
     onedata:
98 changes: 42 additions & 56 deletions lib/galaxy/files/sources/_rdm.py
@@ -25,16 +25,21 @@ class RDMFilesSourceProperties(FilesSourceProperties):
     public_name: str


-class RecordFilename(NamedTuple):
-    record_id: str
-    filename: str
+class ContainerAndFileIdentifier(NamedTuple):
+    """The file_identifier could be a filename or a file_id."""
+    container_id: str
+    file_identifier: str


 class RDMRepositoryInteractor:
     """Base class for interacting with an external RDM repository.

     This class is not intended to be used directly, but rather to be subclassed
     by file sources that interact with RDM repositories.
+
+    Different RDM repositories use different terminology. Also they use the same term for different things.
+    To prevent confusion, we use the term "container" in the base repository.
+    This is an abstract term for the entity that contains multiple files.
     """

     def __init__(self, repository_url: str, plugin: "RDMFilesSource"):
@@ -54,13 +59,13 @@ def repository_url(self) -> str:
         """
         return self._repository_url

-    def to_plugin_uri(self, record_id: str, filename: Optional[str] = None) -> str:
-        """Creates a valid plugin URI to reference the given record_id.
+    def to_plugin_uri(self, container_id: str, filename: Optional[str] = None) -> str:
+        """Creates a valid plugin URI to reference the given container_id.

-        If a filename is provided, the URI will reference the specific file in the record."""
+        If a filename is provided, the URI will reference the specific file in the container."""
         raise NotImplementedError()

-    def get_records(
+    def get_file_containers(
         self,
         writeable: bool,
         user_context: OptionalUserContext = None,
@@ -69,54 +74,56 @@ def get_records(
         query: Optional[str] = None,
         sort_by: Optional[str] = None,
     ) -> Tuple[List[RemoteDirectory], int]:
-        """Returns the list of records in the repository and the total count of records.
+        """Returns the list of file containers in the repository and the total count containers.

-        If writeable is True, only records that the user can write to will be returned.
+        If writeable is True, only containers that the user can write to will be returned.
         The user_context might be required to authenticate the user in the repository.
         """
         raise NotImplementedError()

-    def get_files_in_record(
-        self, record_id: str, writeable: bool, user_context: OptionalUserContext = None
+    def get_files_in_container(
+        self, container_id: str, writeable: bool, user_context: OptionalUserContext = None, query: Optional[str] = None,
     ) -> List[RemoteFile]:
-        """Returns the list of files contained in the given record.
+        """Returns the list of files of a file container.

-        If writeable is True, we are signaling that the user intends to write to the record.
+        If writeable is True, we are signaling that the user intends to write to the container.
         """
         raise NotImplementedError()

-    def create_draft_record(
+    def create_draft_file_container(
+
         self, title: str, public_name: Optional[str] = None, user_context: OptionalUserContext = None
     ):
-        """Creates a draft record (directory) in the repository with basic metadata.
+        """Creates a draft file container in the repository with basic metadata.

-        The metadata is usually just the title of the record and the user that created it.
+        The metadata is usually just the title of the container and the user that created it.
         Some plugins might also provide additional metadata defaults in the user settings."""
         raise NotImplementedError()

-    def upload_file_to_draft_record(
+    def upload_file_to_draft_container(
         self,
-        record_id: str,
+        container_id: str,
         filename: str,
         file_path: str,
         user_context: OptionalUserContext = None,
     ) -> None:
-        """Uploads a file with the provided filename (from file_path) to a draft record with the given record_id.
+        """Uploads a file with the provided filename (from file_path) to a draft container with the given container_id.

-        The draft record must have been created in advance with the `create_draft_record` method.
+        The draft container must have been created in advance with the `create_draft_file_container` method.
+
         The file must exist in the file system at the given file_path.
         The user_context might be required to authenticate the user in the repository.
         """
         raise NotImplementedError()

-    def download_file_from_record(
+    def download_file_from_container(
         self,
-        record_id: str,
-        filename: str,
+        container_id: str,
+        file_identifier: str,
         file_path: str,
         user_context: OptionalUserContext = None,
     ) -> None:
-        """Downloads a file with the provided filename from the record with the given record_id.
+        """Downloads a file with the provided filename from the container with the given container_id.

         The file will be downloaded to the file system at the given file_path.
         The user_context might be required to authenticate the user in the repository if the
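
To make the renamed interface easier to picture, here is a small self-contained sketch of how a subclass might fill in the container methods. It is not the Dataverse implementation from this PR. `InMemoryInteractor` and its dict-backed storage are hypothetical, the Galaxy types (`RemoteDirectory`, `RemoteFile`, `OptionalUserContext`) are replaced by plain dicts and strings, the parameters hidden by the hunk boundary are omitted, and uploads take raw bytes instead of a `file_path`. It also assumes that only drafts count as writable, which matches the commit "only load drafts if writeable is true" but may not hold for every repository.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryInteractor:
    """Hypothetical stand-in that mimics the container-based method names."""

    def __init__(self) -> None:
        # container_id -> {"title", "creator", "draft", "files": {filename: bytes}}
        self._containers: Dict[str, dict] = {}

    def create_draft_file_container(self, title: str, public_name: Optional[str] = None) -> str:
        # IDs are simple counters here; a real repository assigns its own identifiers.
        container_id = str(len(self._containers) + 1)
        self._containers[container_id] = {
            "title": title,
            "creator": public_name or "Anonymous Galaxy User",
            "draft": True,
            "files": {},
        }
        return container_id

    def upload_file_to_draft_container(self, container_id: str, filename: str, content: bytes) -> None:
        container = self._containers[container_id]
        if not container["draft"]:
            raise ValueError(f"Container {container_id} is not a draft.")
        container["files"][filename] = content

    def get_file_containers(self, writeable: bool, query: Optional[str] = None) -> Tuple[List[dict], int]:
        # Assumption: only drafts are writable; query matches the title, case-insensitively.
        matches = [
            {"id": cid, "title": c["title"]}
            for cid, c in self._containers.items()
            if (not writeable or c["draft"]) and (query is None or query.lower() in c["title"].lower())
        ]
        return matches, len(matches)

    def get_files_in_container(self, container_id: str, query: Optional[str] = None) -> List[str]:
        # Returns file names only, optionally filtered by a case-insensitive substring.
        names = self._containers[container_id]["files"]
        return sorted(n for n in names if query is None or query.lower() in n.lower())
```

The point of the sketch is only the shape of the API: a flat list of containers, each holding files and no subdirectories, with an optional `query` filter at both levels.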
@@ -132,13 +139,11 @@ class RDMFilesSource(BaseFilesSource):
     by file sources that interact with RDM repositories.

     A RDM file source is similar to a regular file source, but instead of tree of
-    files and directories, it provides a (one level) list of records (representing directories)
+    files and directories, it provides a (one level) list of containers
     that can contain only files (no subdirectories).

-    In addition, RDM file sources might need to create a new record (directory) in advance in the
-    repository, and then upload a file to it. This is done by calling the `create_entry`
-    method.
-
+    In addition, RDM file sources might need to create a new container in advance in the
+    repository, and then upload a file to it. This is done by calling the `_create_entry` method.
     """

     plugin_kind = PluginKind.rdm
@@ -164,35 +169,16 @@ def get_repository_interactor(self, repository_url: str) -> RDMRepositoryInterac
         This must be implemented by subclasses."""
         raise NotImplementedError()

-    def parse_path(self, source_path: str, record_id_only: bool = False) -> RecordFilename:
-        """Parses the given source path and returns the record_id and filename.
+    def parse_path(self, source_path: str, container_id_only: bool = False) -> ContainerAndFileIdentifier:
+        """Parses the given source path and returns the container_id and filename.

-        The source path must have the format '/<record_id>/<file_name>'.
-        If record_id_only is True, the source path must have the format '/<record_id>' and an
-        empty filename will be returned.
-        """
+        If container_id_only is True, an empty filename will be returned.
+
+        This must be implemented by subclasses."""
+        raise NotImplementedError()

-        def get_error_msg(details: str) -> str:
-            return f"Invalid source path: '{source_path}'. Expected format: '{expected_format}'. {details}"
-
-        expected_format = "/<record_id>"
-        if not source_path.startswith("/"):
-            raise ValueError(get_error_msg("Must start with '/'."))
-        parts = source_path[1:].split("/", 2)
-        if record_id_only:
-            if len(parts) != 1:
-                raise ValueError(get_error_msg("Please provide the record_id only."))
-            return RecordFilename(record_id=parts[0], filename="")
-        expected_format = "/<record_id>/<file_name>"
-        if len(parts) < 2:
-            raise ValueError(get_error_msg("Please provide both the record_id and file_name."))
-        if len(parts) > 2:
-            raise ValueError(get_error_msg("Too many parts. Please provide the record_id and file_name only."))
-        record_id, file_name = parts
-        return RecordFilename(record_id=record_id, filename=file_name)
-
-    def get_record_id_from_path(self, source_path: str) -> str:
-        return self.parse_path(source_path, record_id_only=True).record_id
+    def get_container_id_from_path(self, source_path: str) -> str:
+        raise NotImplementedError()

     def _serialization_props(self, user_context: OptionalUserContext = None):
         effective_props = {}
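
Because this change makes `parse_path` abstract in the base class, each RDM plugin now has to supply its own parser. A minimal sketch of what a subclass implementation might look like, adapted from the logic removed in the hunk above: `ContainerAndFileIdentifier` is redefined locally so the snippet runs on its own, and the free function `parse_container_path` is a hypothetical name that does not appear in the PR. A Dataverse plugin may need different rules, since Dataverse persistent identifiers (DOIs) can themselves contain slashes.

```python
from typing import NamedTuple


class ContainerAndFileIdentifier(NamedTuple):
    """Local stand-in for the tuple defined in _rdm.py."""

    container_id: str
    file_identifier: str


def parse_container_path(source_path: str, container_id_only: bool = False) -> ContainerAndFileIdentifier:
    """Parse '/<container_id>' or '/<container_id>/<file_identifier>' (hypothetical helper)."""
    expected_format = "/<container_id>" if container_id_only else "/<container_id>/<file_identifier>"

    def error(details: str) -> ValueError:
        return ValueError(f"Invalid source path: '{source_path}'. Expected format: '{expected_format}'. {details}")

    if not source_path.startswith("/"):
        raise error("Must start with '/'.")
    parts = source_path[1:].split("/")
    if container_id_only:
        if len(parts) != 1:
            raise error("Please provide the container_id only.")
        return ContainerAndFileIdentifier(container_id=parts[0], file_identifier="")
    if len(parts) != 2:
        raise error("Please provide the container_id and file_identifier only.")
    return ContainerAndFileIdentifier(container_id=parts[0], file_identifier=parts[1])
```

Keeping the parser as a small pure function makes the path rules of each plugin easy to unit test without a running repository.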