Manifest file for all graphs and subgraphs #1

Closed
caufieldjh opened this issue Jan 21, 2022 · 7 comments · Fixed by #2

Comments

@caufieldjh
Contributor

caufieldjh commented Jan 21, 2022

As an index of KGs, KG-HUB should provide a publicly viewable list (e.g., a manifest file) of all graphs and their component subgraphs (in most cases other than KG-OBO, these are the source transforms). The list should include metadata such as graph descriptions. In theory, these could be pulled from each project's download.yaml.

See also the draft linkml dataset distribution schema:
https://github.com/linkml/linkml-model/blob/main/linkml_model/model/schema/datasets.yaml

@caufieldjh
Contributor Author

As with KG-OBO, we would also like to keep track of broken links. This means the manifest should not be written anew with each update, at least not without first reading the existing copy to retain all obsolete IDs.
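
A minimal sketch of that read-before-write step, assuming each manifest entry is a dict with an `id` key. The `obsolete` flag and the function name are hypothetical, chosen only for illustration, and are not taken from the LinkML schema:

```python
from typing import Iterable, List


def merge_manifest(previous: List[dict], current_ids: Iterable[str]) -> List[dict]:
    """Carry over every earlier entry and flag the ones no longer found."""
    current = set(current_ids)
    merged = []
    seen = set()
    for entry in previous:
        updated = dict(entry)  # copy so the previous manifest is left untouched
        # Hypothetical flag: an entry is obsolete if its ID has disappeared.
        updated["obsolete"] = entry["id"] not in current
        merged.append(updated)
        seen.add(entry["id"])
    # Append IDs that were not in the previous manifest at all.
    for new_id in sorted(current - seen):
        merged.append({"id": new_id, "obsolete": False})
    return merged
```

Entries that have vanished from the bucket stay in the manifest with `obsolete` set, so broken links remain traceable across updates.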

@caufieldjh
Contributor Author

caufieldjh commented Jan 26, 2022

As per Slack discussion: also keep track of KGX sources, CURIE namespaces, and Biolink types.
This would help users understand how KGs overlap.
It would also serve as error-checking to confirm that Biolink types match expectations. The types are listed in the KGX stats output, so we can read them here.

(Split into issue #8)
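
A rough sketch of how namespaces and types could be tallied. Because the layout of the KGX stats output is not shown in this thread, this version reads a KGX node TSV directly instead. It assumes `id` and `category` columns and `|` as the separator for multivalued categories; the function name is hypothetical:

```python
import csv
import io
from collections import Counter
from typing import TextIO, Tuple


def summarize_nodes(handle: TextIO) -> Tuple[Counter, Counter]:
    """Count CURIE prefixes and Biolink categories in a node TSV."""
    namespaces: Counter = Counter()
    categories: Counter = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        node_id = row.get("id") or ""
        if ":" in node_id:
            # The CURIE namespace is everything before the first colon.
            namespaces[node_id.split(":", 1)[0]] += 1
        # Assumption: multiple categories are pipe-delimited.
        for category in (row.get("category") or "").split("|"):
            if category:
                categories[category] += 1
    return namespaces, categories
```

Comparing the namespace counters of two graphs gives a quick view of where they overlap, and an unexpected key in the category counter points to a type that does not match expectations.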

@caufieldjh
Contributor Author

caufieldjh commented Jan 26, 2022

Other misc TODOs:

  • Assign description and was_derived_from to objects. This may need to be extracted on a per-project basis.
  • Get the version for projects other than KG-OBO.
  • Consider using another LinkML class (DataResource) for uncompressed files.

@caufieldjh caufieldjh linked a pull request Jan 27, 2022 that will close this issue
@caufieldjh
Contributor Author

caufieldjh commented Jan 28, 2022

Will also need to set up a Jenkinsfile to run this weekly or so (I'll make that its own PR so it can have its own Jenkins test branch).

(And its own issue - see #7)

@caufieldjh
Contributor Author

caufieldjh commented Jan 28, 2022

The manifest step can also perform some structural validation, with checks like the following:

  • Verify that projects follow the expected file structure (dated builds, raw and transforms in their own dirs, stats in their own dir)
  • Verify that graph tar.gz files contain only a node list and an edge list
  • Verify that the files are, in fact, TSVs in KGX format
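
The last two checks could be sketched roughly as follows. The required column sets are an assumption about the minimum a KGX TSV needs, the `_nodes.tsv` / `_edges.tsv` suffix test is a simplification, and the function name is hypothetical:

```python
import io
import tarfile
from typing import BinaryIO, List

# Assumed minimal KGX columns; a fuller check would consult the KGX spec.
REQUIRED_NODE_COLUMNS = {"id", "category"}
REQUIRED_EDGE_COLUMNS = {"subject", "predicate", "object"}


def validate_graph_archive(fileobj: BinaryIO) -> List[str]:
    """Return a list of problems; an empty list means the archive passed."""
    problems = []
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        members = [m for m in archive.getmembers() if m.isfile()]
        nodes = [m for m in members if m.name.endswith("_nodes.tsv")]
        edges = [m for m in members if m.name.endswith("_edges.tsv")]
        if len(members) != 2 or len(nodes) != 1 or len(edges) != 1:
            names = [m.name for m in members]
            problems.append(f"expected one node list and one edge list, found {names}")
        checks = [(m, REQUIRED_NODE_COLUMNS) for m in nodes]
        checks += [(m, REQUIRED_EDGE_COLUMNS) for m in edges]
        for member, required in checks:
            # Only the header row is read, so large graphs stay cheap to check.
            header = archive.extractfile(member).readline().decode("utf-8")
            missing = required - set(header.rstrip("\r\n").split("\t"))
            if missing:
                problems.append(f"{member.name} is missing columns {sorted(missing)}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the manifest record every issue found in a single pass.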

@caufieldjh
Contributor Author

caufieldjh commented Feb 3, 2022

Need to add a click argument that controls whether to write to the bucket, in case we just want a local version.
As this will be called from a Jenkinsfile anyway, we can handle the writing there.
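
One possible shape for that switch, as a sketch only. The option names, the default file name, the placeholder bucket, and the `upload_manifest` stub are all hypothetical; the stub only prints what it would do:

```python
from pathlib import Path

import click


def upload_manifest(path: Path, bucket: str) -> None:
    """Hypothetical placeholder; a real version would call the storage client."""
    click.echo(f"Would upload {path} to {bucket}")


@click.command()
@click.option("--output", default="manifest.txt", show_default=True,
              type=click.Path(dir_okay=False), help="Local path for the manifest.")
@click.option("--bucket", default="example-bucket", show_default=True,
              help="Target bucket name (placeholder value).")
@click.option("--write-to-bucket/--local-only", default=False, show_default=True,
              help="Upload the manifest after writing it locally.")
def run(output: str, bucket: str, write_to_bucket: bool) -> None:
    """Build the manifest locally and optionally upload it."""
    path = Path(output)
    path.write_text("# manifest entries would be written here\n")
    click.echo(f"Wrote {path}")
    if write_to_bucket:
        upload_manifest(path, bucket)
```

Defaulting to local-only keeps a manual run from touching the bucket by accident; a Jenkins stage would pass `--write-to-bucket` explicitly, or skip the flag and do the upload itself.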

@caufieldjh
Contributor Author

Each DataPackage object (here, a compressed file) should include a list of its components (here, the node and edge lists).
The validation step should collect the names of these files to ensure they are merged-kg_nodes.tsv and merged-kg_edges.tsv, then pass those names to the object creator so they are recorded on the object.
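
A small sketch of that hand-off, using a plain dataclass as a hypothetical stand-in for the DataPackage object. The class, its field names, and the function name are assumptions for illustration, not the LinkML schema itself:

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Iterable, List

EXPECTED_COMPONENTS = ("merged-kg_nodes.tsv", "merged-kg_edges.tsv")


@dataclass
class PackageRecord:
    """Hypothetical stand-in for a DataPackage entry in the manifest."""
    id: str
    components: List[str] = field(default_factory=list)


def build_package_record(archive_id: str, member_names: Iterable[str]) -> PackageRecord:
    """Validate the collected member names, then attach them to the record."""
    # Compare base names so a leading "./" or directory prefix is ignored.
    names = sorted(PurePosixPath(name).name for name in member_names)
    if names != sorted(EXPECTED_COMPONENTS):
        raise ValueError(f"{archive_id}: unexpected components {names}")
    return PackageRecord(id=archive_id, components=names)
```

Validation and object creation stay in one call here, so a record can never be created with an unchecked component list.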
