[Feature]: Improve the reading of EML metadata #543
Comments
Dear Katrina, I think the main problem is that the 848 XML is an EML variant from GBIF and includes some metadata within GBIF-specific fields. This part starts with XML element Further, I think there is an identity problem with this dataset, which seems rather to be a catalog entry linking to many representations and several different identifiers at EMODNET, EUROBIS, OBIS etc.
F-UJI was designed to assess datasets not catalog entries so it may not be the right tool for this record. Anyway, to answer your questions:
|
Now that is interesting. I always assumed F-UJI analysed metadata descriptions of datasets, which to me is what a catalogue entry is. Is there a specification that you use to define "dataset" vs "catalogue entry"? That would be useful for us to know. For the rest: yes, GBIF annoyingly added their own things to EML, but to be fair they needed to do that at the time EML first came out. I will look at your comments and will make recommendations to EML, GBIF and EurOBIS as to improvements via their respective issues (though I will not hold my breath for any rapid action) - I have been meaning to do that for a while but keep putting it off. For licence: so you look for a formal licence AND access conditions separately? Fair enough - it is up to the data provider to decide if they want to provide both or not. |
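To make the licence question concrete, here is a minimal sketch of how the two kinds of information could be read separately from an EML 2.2.0 record. It assumes the free-text `intellectualRights` element carries the usage conditions and the structured `licensed` element carries the formal licence; the sample record and the function name `read_licence_fields` are hypothetical, and this is not a description of how F-UJI itself performs the check.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML 2.2.0 record used only for illustration.
SAMPLE = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    packageId="example-package" system="example">
  <dataset>
    <title>Example dataset</title>
    <intellectualRights>
      <para>Data are freely available; please cite the data provider.</para>
    </intellectualRights>
    <licensed>
      <licenseName>Creative Commons Attribution 4.0 International</licenseName>
      <url>https://creativecommons.org/licenses/by/4.0/</url>
      <identifier>CC-BY-4.0</identifier>
    </licensed>
  </dataset>
</eml:eml>"""


def read_licence_fields(xml_text):
    """Return the free-text rights statement and the structured licence separately."""
    dataset = ET.fromstring(xml_text).find("dataset")
    if dataset is None:
        return {"rights_text": None, "licence_name": None, "licence_url": None}

    rights = dataset.find("intellectualRights")
    rights_text = None
    if rights is not None:
        # Collapse the nested <para> text and whitespace into one string.
        rights_text = " ".join("".join(rights.itertext()).split()) or None

    return {
        "rights_text": rights_text,
        # findtext returns None when the element is absent.
        "licence_name": dataset.findtext("licensed/licenseName"),
        "licence_url": dataset.findtext("licensed/url"),
    }


fields = read_licence_fields(SAMPLE)
print(fields["licence_url"])
```

A record that provides only one of the two would simply return `None` for the other, which matches the point that it is up to the data provider whether to supply both.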
Yes, it is difficult to explain: a data catalog lists catalog entries, each of which refers to a dataset. In your example, a data catalog entry refers to another data catalog entry, and so on. F-UJI doesn't know which 'type' of entity it is testing; it will test whatever you give it. It is up to the user to decide what is useful or not. For example, this page https://datasetsearch.research.google.com/search?docid=L2cvMTFsdjRoa3o1NQ%3D%3D looks as if it is the same as https://obis.org/dataset/066f002f-58d5-4687-bdb8-b39cdaef0c2b but is it? |
Hmm, right. In our catalogue entries we list all the places you can find the dataset, and yes, they will probably not all be exactly the same. For example, EurOBIS, OBIS, and GBIF all have the same data (as they harvest from each other) but their content is not 100% identical (the same data format but slight differences in organisation therein), and so all those data links will be in the metadata record. My question about EML above: are you interested in an updated example, or is this not in scope? |
Yes please ;) I am very interested in good EML examples |
@huberrob this starts to sound like we could be fixing our EML first? (rather than this already being a clear feature / improvement for F-UJI) I mean, we kind of "know" it is about a dataset there, so how should we best make that clear to F-UJI? Also (I have not checked in detail, but) the four URLs you mentioned there look like they very well could be referring to the same thing (that very dataset) -- again something we could be making clear to F-UJI (some set of same-as, about, subjectOf relations to be provided?) -- to be honest, somewhat streamlining this bulk of historic URLs that have been doubling as "(false?) identifiers" for our datasets has become a concern we would like to tackle in a nice way - so any advice, opinion, or suggestion from others is highly welcome (but, granted, not your problem). Anyway, the point is: we control what goes into those EML records to a large extent, so we can make it work and build some practical test cases for you ;-) (me trying to make this a win-win 😉 ) |
Oh, you already helped to improve F-UJI ;) But F-UJI would not follow these links to determine an overall FAIRness of 'that very dataset', because it is impossible to verify whether e.g. claimed sameAs links really are about the 'same' dataset. Regarding these historic URLs, I would recommend HTTP-redirecting them to the 'master dataset'? |
Hmm, it can be hard to identify the master dataset (a posteriori) because everyone harvests from each other in all directions (plus, it takes a lot of human resources to track that down), but to do that a priori is more feasible. I also know that many of the data portals that use EML also export those metadata records (so again, not dataset metadata, but metadata records about a single dataset) in JSON-LD. |
FYI Marc, they are not historic endpoints - they are active endpoints. Each one is a way to get the same data, but via a different provider (broker or harvester) in a slightly different format. So they are different versions of the same data. So yes, in some way one can see this as the metadata record being rather a record of a catalogue, albeit a catalogue containing the result of one data activity, provided by different organisations in their own ways.
I am 90% certain one cannot indicate this in the EML schema, which is very limited in what it can do with respect to "tagging" elements with information. Having looked at the schema, I cannot see how one can say "this dataset URL is a portal, this other one is another portal, and this third one is the source data for everything" (one cannot add "annotation" to "distribution"). So other than having each access point repeated as an entire "dataset" (because you can have "annotation" in "dataset"), and then having several "datasets" in each metadata record, there is no way to provide more scope for each download URL within EML. And would that help, in any case? Well yes - the user will still want to know if the record has all the necessary FAIR elements. FYI, there may not even be a "source" dataset for everything attached to that record, only the versions provided by the other providers.
I am pretty sure we don't want to "mess around" with EML too much - GBIF and OBIS are big users of EML and we have to conform to some good degree with what they want. But if we can add some XML elements that make the EML record more obviously FAIR without "ruining" it for GBIF/OBIS, that we can do. OK, so my colleague and I will be looking at EML 2.2 (which is the latest version) with fresh eyes from this issue (and any other comments that are added here), but probably not until March/April next year (we are really overloaded with work). But we did a lot of work some time ago and that is the version of 2.2 that you see in the example.
If I can find an "empty" version with all the classes in it, I will pass that on. |
OK, so I find these multiple URLs, or their reason for existing, confusing, even as a human. Whatever is going on with these multiple URLs, it should be made clear:
Then, coding that clarity for the machine is a matter of agreeing on some semantic predicates that are likely to exist: EML itself might not have anything, but at least EML 2 now has support for semantic metadata, so basically a backdoor to slide in triples as we see fit. I would look into that first. |
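As a rough illustration of the "slide in triples" idea, the sketch below reads EML 2.2.0 `annotation` elements (a `propertyURI` plus a `valueURI`) and turns them into subject/predicate/object tuples, using the `id` of the annotated element as the subject. The sample record, the choice of `owl:sameAs` as predicate, the `example.org` URL and the function name `annotation_triples` are all assumptions made for illustration; whether F-UJI reads such annotations is not established in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML 2.2.0 record used only for illustration.
SAMPLE = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    packageId="example-package" system="example">
  <dataset id="example-dataset">
    <title>Example dataset</title>
    <annotation>
      <propertyURI label="same as">http://www.w3.org/2002/07/owl#sameAs</propertyURI>
      <valueURI label="copy at another portal">https://example.org/dataset/123</valueURI>
    </annotation>
  </dataset>
</eml:eml>"""


def annotation_triples(xml_text):
    """Collect (subject id, property URI, value URI) for each annotated element."""
    root = ET.fromstring(xml_text)
    triples = []
    for parent in root.iter():
        subject = parent.get("id")
        # Only direct <annotation> children describe this element.
        for annotation in parent.findall("annotation"):
            prop = annotation.findtext("propertyURI")
            value = annotation.findtext("valueURI")
            if subject and prop and value:
                triples.append((subject, prop.strip(), value.strip()))
    return triples


for triple in annotation_triples(SAMPLE):
    print(triple)
```

The annotation here sits directly under `dataset`, since, as noted earlier in the thread, `distribution` cannot carry one.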
It is perhaps not common but also not infrequent that there are multiple URLs, but the level of support for annotating them in EML is extremely limited - we would have to take that up with EML and their GH issues. This part of this issue, Marc, will be for us to solve and we should move it out of here. Regarding the points raised in the second comments box
|
Detailed Description
Following on from #542 (the same meeting with @huberrob and colleagues). I also tested out a few metadata records that use the EML schema (XML format: https://eml.ecoinformatics.org/eml-schema#). I am not sure that F-UJI is reading this 100% well, and this is my analysis:
https://www.eurobis.org/imis?module=dataset&dasid=848&show=eml (EML 2.1.1) returns "initial" for F, A, I and R. Comments on the failed fields:
https://marineinfo.org/id/dataset/8357-eml-2.2.0.xml (EML 2.2.0) also returns "initial" for F, I and R, and "moderate" for A. My additional comments (i.e. not repeating those above):
Both records fare badly on the check for semantic resources. I am not sure if this is because it cannot find them or because it does not recognise those vocabularies. So FYI, these are vocabularies we use often in biodiversity:
MarineRegions - https://www.marineregions.org/about.php
MarineSpecies (aka aphia) - https://www.marinespecies.org/
NCBI taxonomy - https://www.ncbi.nlm.nih.gov/taxonomy
BODC vocabularies via NVS - http://vocab.nerc.ac.uk/
Environmental Ontology - see https://www.ebi.ac.uk/ols4/
ASFA - see https://aims.fao.org/network-fisheries-ontologies
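One way to picture "recognising" these vocabularies is a simple lookup from a term URI to a vocabulary name. The sketch below is only an illustration under stated assumptions: the hosts for the first four entries come from the links above, while the `purl.obolibrary.org/obo/ENVO_` prefix for the Environmental Ontology is an assumption about how its term URIs are written, and ASFA is left out because only a pointer page is given. The function name `identify_vocabulary` is hypothetical and this is not how F-UJI is known to work.

```python
from urllib.parse import urlparse

# (host, lower-cased path prefix, vocabulary name); see the assumptions in the text above.
KNOWN_VOCABULARIES = [
    ("marineregions.org", "", "MarineRegions"),
    ("marinespecies.org", "", "MarineSpecies (Aphia)"),
    ("ncbi.nlm.nih.gov", "/taxonomy", "NCBI taxonomy"),
    ("vocab.nerc.ac.uk", "", "BODC vocabularies"),
    ("purl.obolibrary.org", "/obo/envo_", "Environmental Ontology"),
]


def identify_vocabulary(uri):
    """Return the name of the known vocabulary a URI appears to belong to, or None."""
    parsed = urlparse(uri)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.lower()
    for known_host, path_prefix, name in KNOWN_VOCABULARIES:
        if host == known_host and path.startswith(path_prefix):
            return name
    return None


print(identify_vocabulary("http://vocab.nerc.ac.uk/collection/P02/current/"))
```

A lookup like this only says that a URI points into a known namespace; it does not check that the term actually resolves.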
FYI, EML version 2.1.1 is used by the biodiversity databases OBIS and GBIF (and their respective regional nodes), and GBIF has recently updated to EML 2.2.0. This new version has some extra features related to how to "tag" resources with semantics. If you don't do so already, you may want to look at this.
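Since the two versions differ in what they can express, a harvester may want to know which one it has been given. A minimal sketch, assuming the version can be taken from the namespace of the root `eml:eml` element; the sample record and the function name `detect_eml_version` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the two EML versions discussed in this issue.
EML_NAMESPACES = {
    "eml://ecoinformatics.org/eml-2.1.1": "2.1.1",
    "https://eml.ecoinformatics.org/eml-2.2.0": "2.2.0",
}


def detect_eml_version(xml_text):
    """Return the EML version implied by the root element's namespace, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree writes qualified tags as "{namespace}localname".
    if not root.tag.startswith("{"):
        return None
    namespace, _, local_name = root.tag[1:].partition("}")
    if local_name != "eml":
        return None
    return EML_NAMESPACES.get(namespace)


# Hypothetical, heavily trimmed record used only for illustration.
SAMPLE = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    packageId="example-package" system="example">
  <dataset><title>Example dataset</title></dataset>
</eml:eml>"""

print(detect_eml_version(SAMPLE))
```

Only a record reported as 2.2.0 would be worth scanning for the newer semantic "tagging" features.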
Context
Biodiversity databases use EML, so any improvements would be very useful for checking those metadata. We would be interested in any feedback - for example, if EML needs more standardisation so that its fields are better "tagged" as being of a particular type, we can pass that on to EML via its GH.
Possible Implementation