Data cleaning for license identifiers #31

Open
stain opened this issue Jun 26, 2024 · 0 comments

Many WorkflowHub RO-Crates give license as a literal string rather than as an @id reference, which means we get many variants of the same license:

    schema1:license "MIT" ;
# ...
    schema1:license "https://spdx.org/licenses/MIT" ;

https://www.researchobject.org/ro-crate/specification/1.1/root-data-entity.html#direct-properties-of-the-root-data-entity says the license "SHOULD link to a Contextual Entity in the RO-Crate Metadata File with a name and description", meaning it should look like this:

"license": {
  "@id": "http://spdx.org/licenses/MIT"
}

and, after parsing to an RDF graph:

schema1:license <http://spdx.org/licenses/MIT> ;

https://about.workflowhub.eu/Workflow-RO-Crate/ro-crate-metadata.json also defines the SPDX identifiers with http URIs not https -- this is to be compatible with identifiers used in https://github.com/spdx/license-list-data/blob/main/rdfturtle/MIT.ttl etc.
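As a rough illustration of the cleaning step, here is a minimal Python sketch that maps the variants above onto the http SPDX URI. The function name normalize_license, the tiny KNOWN_SPDX_IDS set and the choice to return None for unrecognised values are all assumptions made for this example, not existing code in this project; a real implementation would presumably load the full SPDX license list.

```python
import re
from typing import Optional

# Canonical prefix, using http (not https) to match the SPDX RDF identifiers.
SPDX_PREFIX = "http://spdx.org/licenses/"

# Hypothetical, deliberately tiny sample of SPDX short identifiers.
# A real cleaner would load the complete SPDX license list instead.
KNOWN_SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0-only", "CC-BY-4.0"}

# Matches http or https SPDX license URLs, with an optional ".html"
# suffix or trailing slash, and captures the short identifier.
_SPDX_URL = re.compile(r"^https?://spdx\.org/licenses/([^/#?]+?)(?:\.html)?/?$")


def normalize_license(value: str) -> Optional[str]:
    """Return the canonical http SPDX URI for a license value, or None.

    None means the value was not recognised and should be left as it is.
    """
    text = value.strip()
    match = _SPDX_URL.match(text)
    if match:
        return SPDX_PREFIX + match.group(1)
    if text in KNOWN_SPDX_IDS:
        return SPDX_PREFIX + text
    return None


if __name__ == "__main__":
    for raw in ["MIT", "https://spdx.org/licenses/MIT", "Some custom licence"]:
        print(repr(raw), "->", normalize_license(raw))
```

Returning None rather than guessing keeps unrecognised license strings untouched, so nothing from the original crate is silently rewritten.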

From this I'm getting the feeling we need two outputs: one "raw" RDF, which may be exactly as in the RO-Crate, and one that has been data cleaned. This could simply mean that the named graphs (#29) are kept "as is" while the separate output without named graphs has had the data cleaning applied, or that the cleaned data is added as a secondary named graph.
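To make the two-output idea concrete, here is a small Python sketch using plain tuples instead of a real RDF library. The Quad type, the split_outputs function and the example crate and graph IRIs are hypothetical names invented for this illustration; it also assumes that schema1:license expands to http://schema.org/license. The raw output keeps every quad with its named graph, while the cleaned output drops the graph and replaces license literals that the supplied clean function recognises.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional, Tuple

# Assumption: schema1:license expands to this IRI.
LICENSE = "http://schema.org/license"


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str               # IRI or literal value
    is_literal: bool       # True if obj is a literal string
    graph: Optional[str]   # named graph IRI, or None for the default graph


def split_outputs(
    quads: Iterable[Quad],
    clean: Callable[[str], Optional[str]],
) -> Tuple[List[Quad], List[Quad]]:
    """Return (raw, cleaned): raw is unchanged, cleaned has no named graphs."""
    raw = list(quads)
    cleaned = []
    for quad in raw:
        if quad.predicate == LICENSE:
            iri = clean(quad.obj)
            if iri is not None:
                # Replace the literal (or https variant) with the canonical IRI.
                quad = quad._replace(obj=iri, is_literal=False)
        cleaned.append(quad._replace(graph=None))
    # Dropping the graph can create duplicates; remove them, keeping order.
    return raw, list(dict.fromkeys(cleaned))


if __name__ == "__main__":
    # Toy cleaner that only knows about MIT (an assumption for the demo).
    def toy_clean(value: str) -> Optional[str]:
        known = {"MIT", "https://spdx.org/licenses/MIT"}
        return "http://spdx.org/licenses/MIT" if value in known else None

    data = [
        Quad("https://example.org/crate/1", LICENSE, "MIT", True,
             "https://example.org/graph/1"),
        Quad("https://example.org/crate/2", LICENSE,
             "https://spdx.org/licenses/MIT", True,
             "https://example.org/graph/2"),
    ]
    raw, cleaned = split_outputs(data, toy_clean)
    for quad in cleaned:
        print(quad)
```

The alternative mentioned above, adding the cleaned statements as a secondary named graph, would only need the graph field set to a dedicated graph IRI instead of None.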
