[Feature]: Improve the reading of EML metadata #543

Open
kmexter opened this issue Nov 5, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@kmexter

kmexter commented Nov 5, 2024

Detailed Description

Following on from #542 (the same meeting with @huberrob and colleagues). I also tested out a few metadata records that use the EML schema (XML format: https://eml.ecoinformatics.org/eml-schema#). I am not sure that FUJI is reading this 100% well, and this is my analysis

https://www.eurobis.org/imis?module=dataset&dasid=848&show=eml which is in EML 2.1.1 returns initial for F, A, I and R. Comments on the failed fields:

  • F3-01M says it cannot find downloadable content but it is in there in ,
  • F4-01M it does not understand that EML is actually metadata that can be retrieved programmatically - it is xml following the eml schema,
  • I1-01M tho I am not sure if eml, strictly speaking, is metadata represented using a formal knowledge representation language, so maybe that is correct,
  • I3-01M related resources are mentioned in but maybe it is looking for other types, so look to see if this fails for record 8357 (the next one) also since that does have a related paper and related datasets,
  • R1-01MD and R1.3-02D are right as we do not provide data-file info,
  • A1-01M I really am not sure what it is looking for here with access conditions that is different to the licence, but it is true we just have licence,
  • A1-03D hmm, maybe because it does not understand that is where to look for the data?

https://marineinfo.org/id/dataset/8357-eml-2.2.0.xml for eml 2.2.0. Also returns initial for F, I, R and moderate for A. My additional comments (i.e. not repeating those of above)

  • I3-01M it did not find the related publication, which we added via (note there is a class also)
  • F1-02D it did not find the DOI however I think that is a failing in EML - the URL or DOI to the resource itself is included in but there is no way to identify this as being "the URL of the metadata record itself". So I think EML should improve here rather than FUJI

Both records fare badly on the check for semantic resources. I am not sure if this is because it cannot find them or it does not recognise those vocabularies. So FYI these are vocabs we use often in biodiversity
MarineRegions - https://www.marineregions.org/about.php
MarineSpecies (aka aphia) - https://www.marinespecies.org/
NCBI taxonomy - https://www.ncbi.nlm.nih.gov/taxonomy
BODC vocabularies via NNV - http://vocab.nerc.ac.uk/
Environmental Ontology - see https://www.ebi.ac.uk/ols4/
ASFA - see https://aims.fao.org/network-fisheries-ontologies

FYI EML version 2.1.1 is used by the biodiversity databases OBIS and GBIF (and their respective regional nodes), and GBIF have recently updated to EML 2.2.0. This new version has some extra features related to how to "tag" resources with semantics. If you don't do so already, you may want to look at this.

Context

Biodiversity databases use EML, so any improvements would be very useful for checking those metadata. We would be interested in any feedback - for example, if EML needs more standardisation so that its fields are better "tagged" as being of a particular type, we can pass that on to EML via its GH.

Possible Implementation

@kmexter kmexter added the enhancement New feature or request label Nov 5, 2024
@huberrob
Contributor

huberrob commented Nov 7, 2024

Dear Katrina,

I think the main problem is that the 848 XML is an EML variant from GBIF and includes some metadata within GBIF-specific fields. This part starts with the XML element <additionalMetadata><metadata>, which is defined as xs:any, meaning that at this point you can add GBIF-specific XML (https://eml.ecoinformatics.org/schema/).

Further, I think there is an identity problem with this dataset which seems rather to be a catalog entry linking to many representations and several different identifiers at EMODNET, EUROBIS, OBIS etc.
There is:

F-UJI was designed to assess datasets not catalog entries so it may not be the right tool for this record. Anyway, to answer your questions:

  • F3-01M : In EML, dataset-specific metadata should be provided in the <dataset><distribution> element, which is not there. Instead, the record uses <additionalMetadata><metadata><gbif><physical><distribution>, which is GBIF-specific. Further, all the links lead to catalogs or data repositories, which then actually contain the data.

  • F1-02D The DOI should be listed in alternateIdentifier

  • F4-01M This test is checking if major search engines are supported so the dataset is searchable by them. The problem here is that the EML XML does not contain a link back to the dataset (catalog entry) webpage (https://www.eurobis.org/imis?module=dataset&dasid=848) which provides this information.

  • I1-01M F-UJI is expecting a formal knowledge representation language => RDF

  • I3-01M Related resources are not well covered by F-UJI's mapping, and EML is not very well suited to including related resources. On F-UJI's TODO list should be a mapping of literatureCited, referencePublication, usageCitation and probably otherEntity, but I did not have good examples of how this is done in real life, such as 8357, which is using usageCitation. So thanks for this!

  • A1-01M is looking for information if the dataset is accessible or if it is restricted somehow. But I am not sure if this is possible with EML, maybe using intellectualRights?

  • A1-03D data links are not found
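To make the F3-01M and F1-02D points above concrete, here is a minimal Python sketch of the kind of check being described: it looks for download URLs under the standard <dataset><distribution> path, separately under the <additionalMetadata><metadata> extension block, and for DOI-like values in alternateIdentifier. The sample record, the function name and the DOI heuristic are all made up for illustration; this is not F-UJI's actual code or mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML record, written only for this illustration.
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" packageId="example" system="example">
  <dataset>
    <alternateIdentifier>https://doi.org/10.1234/example</alternateIdentifier>
    <title>Example dataset</title>
  </dataset>
  <additionalMetadata>
    <metadata>
      <gbif>
        <physical>
          <distribution>
            <online><url function="download">https://example.org/data.zip</url></online>
          </distribution>
        </physical>
      </gbif>
    </metadata>
  </additionalMetadata>
</eml:eml>"""


def _texts(elements):
    """Return the stripped, non-empty text of each element."""
    return [e.text.strip() for e in elements if e.text and e.text.strip()]


def summarise_eml(xml_text):
    """Report where distribution URLs and DOI-like identifiers sit in an EML record."""
    root = ET.fromstring(xml_text)
    # Standard location: children of the EML root are not namespace-qualified.
    standard_urls = _texts(root.findall("./dataset/distribution/online/url"))
    # Extension block (xs:any): anything found here is provider-specific.
    gbif_urls = _texts(
        root.findall("./additionalMetadata/metadata//distribution/online/url")
    )
    alternate_ids = _texts(root.findall("./dataset/alternateIdentifier"))
    # Naive DOI heuristic, an assumption made for this sketch only.
    dois = [
        i for i in alternate_ids
        if "doi.org/" in i.lower() or i.lower().startswith("doi:")
    ]
    return {"standard_urls": standard_urls, "gbif_urls": gbif_urls, "dois": dois}


if __name__ == "__main__":
    print(summarise_eml(SAMPLE_EML))
```

For the sample above, the standard list comes back empty while the extension block yields one URL, which is the situation described for record 848.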

@kmexter
Author

kmexter commented Nov 7, 2024

Further, I think there is an identity problem with this dataset which seems rather to be a catalog entry linking to many representations and several different identifiers at EMODNET, EUROBIS, OBIS etc.

Now that is interesting. I always assumed FUJI analysed metadata descriptions of datasets, which to me is what a catalogue entry is. Is there a specification that you use to define "dataset" vs "catalogue entry" -> useful for us to know if so.

For the rest - yes, GBIF annoyingly added their own things to EML, but to be fair they needed to do that at the time eml first came out. I will look at your comments and will make recommendations to EML, GBIF and EurOBIS as to improvements via their respective issues (tho I will not hold my breath for any rapid action from them) - I have been meaning to do that for a while but keep putting it off....
For VLIZ, we have our own eml 2.2 profile that we put together and I will have a look at your comments to see if we can make some improvements there, and I can pass one super-complete catalogue record on to you when we have done that, if that is useful at all? I mean, if EML is not really a good match to FUJI, is it worth it for you?

For licence: so you look for a formal licence AND access conditions separately? Fair enough - it is up to the data provider to decide if they want to provide both or not.

@huberrob
Contributor

huberrob commented Nov 7, 2024

Yes, difficult to explain: a data catalog lists a catalog entry, which refers to a dataset. In your example, a data catalog entry refers to another data catalog entry, etc.

F-UJI doesn't know which 'type' of entity it is testing; it will test whatever you give it. It is up to the user to decide what is useful or not.

For example this page https://datasetsearch.research.google.com/search?docid=L2cvMTFsdjRoa3o1NQ%3D%3D looks as if it is the same as https://obis.org/dataset/066f002f-58d5-4687-bdb8-b39cdaef0c2b but is it?
And on https://obis.org/dataset/066f002f-58d5-4687-bdb8-b39cdaef0c2b the user finds a link to https://ipt.vliz.be/upload/resource?r=arms_coi_2018-20 having the same title as the other two entities. So which one is the main entity, the 'dataset'?

@kmexter
Author

kmexter commented Nov 7, 2024

hmm right. In our catalogue entries we list all the places you can find the dataset, and yes, they will probably not all be exactly the same - for example, eurobis, obis, and gbif all have the same data (as they harvest from e/o) but they do not have exactly 100% the same content in their data (the same data format but slight difference in organisation therein) and so all those data links will be in the metadata record.
Useful to know - since it helps analyse the results!

My question about EML above - are you interested in an updated example or is this not in scope?

@huberrob
Contributor

huberrob commented Nov 7, 2024

Yes please ;) I am very interested in good EML examples

@mpo-vliz

mpo-vliz commented Nov 7, 2024

@huberrob this starts to sound like we could be fixing our eml first? (rather than this already being a clear feature / improvement for FUJI)

I mean, we kind of "know" it's about a dataset there, so how should we best make that clear to FUJI?

Also (have not checked in detail, but) the four URLs you mentioned there look like they very well could be referring to the same thing (that very dataset) -- again something we could be making clear to FUJI (some set of same-as, about, subjectOf relations to be provided?)

-- to be honest, somewhat streamlining this bulk of historic URLs that have been doubling as "(false?) identifiers" for our datasets has become a concern we would like to tackle in a nice way - so any advice, opinion, suggestion from others is highly welcome (but, granted, not your problem)

Anyway, point is: we control what goes into those eml to a large extent, so we can make it work and build some practical testcases for you ;-) (me trying to make this a win-win 😉 )

@huberrob
Contributor

huberrob commented Nov 8, 2024

Oh, you already helped to improve F-UJI ;)
As far as I understood Katrina, each of these entities may be slightly different and somehow originate from a 'master' dataset. So ideally you could indicate the provenance of these datasets as you proposed, but instead of sameAs I would propose to use isBasedOn or something like this?
I am not sure if this can be done in EML, though I assume you have something; if each entity already has something like Dublin Core or schema.org, this would be a good place.

But F-UJI would not follow these links to determine an overall FAIRness of 'that very dataset', because it is impossible to verify whether e.g. claimed sameAs links really are about the 'same' dataset.

Regarding these historic URLs I would recommend to HTTP redirect them to the 'master dataset' ?
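As a rough illustration of the isBasedOn suggestion above, the following Python sketch builds a schema.org Dataset description in JSON-LD that points from a derived copy to the dataset it originates from. The function name and all URLs are hypothetical placeholders; only the schema.org terms (Dataset, isBasedOn, url, name) are real vocabulary, and how a given portal embeds such a block in its landing page is left open.

```python
import json


def derived_dataset_jsonld(landing_page, name, source_url):
    """Describe a derived dataset copy and link it to its source via isBasedOn."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": landing_page,
        "url": landing_page,
        "name": name,
        # isBasedOn states provenance ("derived from"), not identity (sameAs).
        "isBasedOn": {"@type": "Dataset", "@id": source_url},
    }


if __name__ == "__main__":
    # Placeholder URLs, not real records.
    doc = derived_dataset_jsonld(
        "https://portal.example.org/dataset/123",
        "Example dataset (portal copy)",
        "https://source.example.org/dataset/abc",
    )
    print(json.dumps(doc, indent=2))
```

Using isBasedOn rather than sameAs keeps the claim modest: each copy says where it came from without asserting that the copies are identical.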

@kmexter
Author

kmexter commented Nov 8, 2024

Hmm, it can be hard to identify the master dataset (a posteriori) because everyone harvests from each other in all directions (plus, it takes a lot of human resources to track that down), but to do that a priori is more feasible.
I have my doubts that eml can handle any of this in its schema profile (Laurian, my colleague, and I already exhausted many of its possibilities) -> one would be going the GBIF route and adding one's own fields. But Laurian and I will have a look at what more one can do within EML, before end of year at least.

I also know that many of the data portals that use eml also export those metadata records (so again, not dataset-metadata, but metadata records about a single dataset) in JSON-LD.

@kmexter
Author

kmexter commented Dec 9, 2024

FYI Marc, they are not historic endpoints - they are active endpoints. Each one is a way to get the same data, but via a different provider (broker or harvester) in a slightly different format. So they are different versions of the same data.

So yes, in some way one can see this as the metadata record being rather a record of a catalogue, albeit a catalogue containing the result of one data activity, provided by different organisations in their own ways. I am 90% certain one cannot indicate this in the eml schema, which is very limited in what it can do wrt "tagging" elements with information. Having looked at the schema, I cannot see how one can say "this dataset url is a portal and this other one is another portal and this third one is the source data for everything". (one cannot add "annotation" to "distribution"). So other than having each access point repeated as an entire "dataset" (because you can have "annotation" in "dataset"), and then having several "datasets" in each metadata record, there is no way to provide more scope for each download URL within eml.
...But would there be any xml elements one can add, that fuji can understand?

And would that help, in any case? Well yes - the user will still want to know if the record has all the necessary FAIR elements.

FYI, there may not even be a "source" dataset for everything attached to that record, only the versions provided by the other providers.

I am pretty sure we don't want to "mess around" with EML too much - GBIF and OBIS are big users of eml and we have to conform to some good degree with what they want. But if we can add some XML elements that make the eml record more obviously FAIR without "ruining" them for GBIF/OBIS, that we can do.

OK, so my colleague and I will be looking at EML 2.2 (which is the latest version) with fresh eyes from this issue (and any other comments that are added here), but probably not until March/April next year (we are really overloaded with work). But we did a lot of work some time ago and that is the version of 2.2 that you see in the example. If I can find an "empty" version with all the classes in it, I will pass that on.

@mpo-vliz

Ok, so I find these multiple URLs or their reason of existence confusing, even as a human.

Whatever there is going on with these multiple URLs, it should be made clear:

  • are they factually the same, about the exact same data content, or derived in some way?
  • what is the relation between them?
  • who was first? or which one should be the canonical one --> also in terms of calculating impact it feels weird for us to have one dataset be referenced in 5 different ways, would that not split the referral counts, and thus hide the actual impact in the noise?

Then: coding that clarity for the machine is a matter of agreeing on some semantic predicates that are likely to exist. eml itself might not have anything, but at least eml 2.2 now has support for semantic metadata, so basically a backdoor to slide in triples as we see fit. I would look into that first.
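A rough Python sketch of what "sliding in a triple" could look like with the EML 2.2 annotation element (a propertyURI plus a valueURI, each with a label attribute). The helper name, the choice of schema.org isBasedOn as predicate and the URLs are assumptions for illustration; as far as I understand the schema, the annotated parent needs an id attribute and annotation has a fixed position in the dataset sequence, so appending at the end as done here may not validate.

```python
import xml.etree.ElementTree as ET


def add_dataset_annotation(dataset, property_uri, property_label, value_uri, value_label):
    """Append an EML 2.2 style <annotation> (predicate + object) to a dataset element."""
    annotation = ET.SubElement(dataset, "annotation")
    prop = ET.SubElement(annotation, "propertyURI", {"label": property_label})
    prop.text = property_uri
    value = ET.SubElement(annotation, "valueURI", {"label": value_label})
    value.text = value_uri
    return annotation


if __name__ == "__main__":
    # The id attribute gives the annotation its subject; the value is a placeholder.
    dataset = ET.Element("dataset", {"id": "example-dataset"})
    add_dataset_annotation(
        dataset,
        "https://schema.org/isBasedOn",            # assumed predicate choice
        "is based on",
        "https://source.example.org/dataset/abc",  # placeholder source URL
        "source dataset",
    )
    print(ET.tostring(dataset, encoding="unicode"))
```

Which predicates to use is exactly the agreement mentioned above; the mechanism itself stays the same whatever vocabulary is picked.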

@kmexter
Author

kmexter commented Dec 11, 2024

It is perhaps not common but also not infrequent that there are multiple URLs, but the level of support for annotating them in EML is extremely limited - we would have to take that up with EML and their GH issues. This part of this issue, Marc, will be for us to solve and we should move it out of here.

Regarding the points raised in the second comments box

  • F3-01M : the distribution issue is solved - for us at least - in eml 2.2.0, but I think it is worth trying to communicate again with GBIF to see if they are going to do this properly when they finally move to 2.2 as well
  • F1-02D : we will check that we are doing that properly in all our emls
  • F4-01M : the link to the metadata record itself is included in a <distribution>, e.g. <distribution><online><url function="download">https://www.vliz.be//en/imis?dasid=8357</url></online>. I can see that this is really insufficient, but again, even in eml 2.2, I cannot see how else to do it. So also something to raise with EML itself (but don't hold your breath!)
  • I1-01M: so XML (schema EML) will always fail this test ? If so, nice to know.
  • A1-01M: this puzzled me a lot when analysing all the tests - so this is not the same as the licence, right? the test is specifically looking for information additional to the licence? then that could go in intellectual rights, yes, and where there is no information other than the licence provided by the data creator, then one can copy the CC statement into there - but that will be a string then. Something we could do here, and should probably also take up with gbif (tho don't hold your breath here either)
  • A1-03D: also puzzled me a lot when analysing all our tests: if the is not what this is looking for, what is it looking for?
  • I3-01M: yes, in total we use literatureCited and usageCitation; I will have to check that we use referencePublication
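For the I3-01M point, a minimal Python sketch of how identifiers of related resources could be collected from literatureCited, referencePublication and usageCitation in an EML 2.2 record. The sample record and function name are hypothetical, and the assumption that each citation carries an alternateIdentifier is mine; real records may hold the reference in other forms (for example BibTeX text), which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Element names discussed in this thread as carriers of related resources.
RELATED_ELEMENTS = ("literatureCited", "referencePublication", "usageCitation")

# Hypothetical, heavily trimmed EML 2.2 record, written only for this illustration.
SAMPLE_EML = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" packageId="example" system="example">
  <dataset>
    <title>Example dataset</title>
    <usageCitation>
      <alternateIdentifier>https://doi.org/10.1234/paper</alternateIdentifier>
      <title>Paper using the dataset</title>
    </usageCitation>
    <literatureCited>
      <citation>
        <alternateIdentifier>https://doi.org/10.1234/cited</alternateIdentifier>
        <title>Cited paper</title>
      </citation>
    </literatureCited>
  </dataset>
</eml:eml>"""


def related_resources(xml_text):
    """Return (element name, identifier) pairs for citations found under <dataset>."""
    root = ET.fromstring(xml_text)
    found = []
    for name in RELATED_ELEMENTS:
        for element in root.findall(f"./dataset/{name}"):
            # iter() also reaches identifiers nested inside <citation>.
            for ident in element.iter("alternateIdentifier"):
                if ident.text and ident.text.strip():
                    found.append((name, ident.text.strip()))
    return found


if __name__ == "__main__":
    for relation, identifier in related_resources(SAMPLE_EML):
        print(relation, identifier)
```

Keeping the element name alongside each identifier preserves the kind of relation (cited by the dataset versus citing the dataset), which a flat list of links would lose.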

Development

No branches or pull requests

3 participants