Helmholtz Quality Indicator for Research Software Publications #5

Open
juckel opened this issue Oct 25, 2024 · 0 comments

As discussed in the meeting earlier this week, I wanted to provide the current status of what we are working on within the Helmholtz Association. There, a working group is defining a new indicator to be used in research reporting (and is providing the tools to collect it).

Disclaimer

Before I provide the current list of attributes, please allow me to mention a couple of things about this work up front:

  • we decided to try to capture all aspects of research software quality in the list of attributes we poll, regardless of whether they can be measured yet or are even fully defined by the research communities
  • because all other Helmholtz research indicators are centered around publications, we also count only research software publications (rather than just looking at the GitHub/GitLab/... repos)
  • we are using dimensions and attributes; for each dimension, a weighted average over its attributes is computed (the weights depend on the not yet defined priorities of the attributes within that dimension). These averages lend themselves to a radar plot that shows the value for each dimension
  • for each attribute there are cumulative levels of maturity, i.e. the criteria of all lower levels need to be fulfilled to qualify for a higher level (which gives us the chance to introduce different aspects within one attribute)
  • at the moment we are shrinking this "wish" list to a "what's possible" list, so not all attributes or levels of maturity will be included in the first version of the indicator that Helmholtz will start using in its reports for 2025 (i.e. the reports that are generated in 2026)
  • due to the large number of software publications, the indicator must be determined via automated tools (albeit allowing some creativity, e.g. using the Helmholtz RSD to collect additional (meta)data)
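
To illustrate the weighted average per dimension mentioned above, here is a minimal sketch. The weights are placeholders (the priorities are not yet defined), and the function name and the example levels are hypothetical, not part of the working group's tooling.

```python
from typing import Dict

MAX_LEVEL = 4  # maturity levels in the lists below run from 0 to 4


def dimension_score(levels: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted average of the attribute maturity levels within one dimension."""
    total_weight = sum(weights[name] for name in levels)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(levels[name] * weights[name] for name in levels) / total_weight


# Hypothetical assessment of the "Findable" dimension (levels are made up).
findable_levels = {
    "Open Publication Repository": 3,
    "Versioning": 2,
    "Identifier": 1,
    "Rich Metadata": 2,
}
# Placeholder weights: the real priorities are still to be defined.
findable_weights = {
    "Open Publication Repository": 1.0,
    "Versioning": 1.0,
    "Identifier": 2.0,
    "Rich Metadata": 1.0,
}

findable_score = dimension_score(findable_levels, findable_weights)
print(f"Findable: {findable_score:.2f} of {MAX_LEVEL}")
```

One such score per dimension would give the values for the axes of the radar plot.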

Here is the current list of dimensions, attributes and levels of maturity:

Dimension Findable

The following statements address the aspect of being able to find and uniquely identify the
software.

Open Publication Repository
(0) There is no information available on where to find the software.
(1) The software is contained in an online repository.
(2) Some kind of description is available giving further information on the software in this
repository (e.g. readme file).
(3) A structured metadata description for the software (e.g. following DataCite) is given in this
repository.
(4) The repository is listed in some overarching meta-repository (e.g. Helmholtz Research
Software Directory (RSD), re3data).
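
As a sketch of how the cumulative levels could be evaluated for an attribute such as this one: the reached level is the number of consecutive criteria fulfilled, starting from level 1. The function name and the example repository are hypothetical.

```python
from typing import Sequence


def cumulative_level(criteria_met: Sequence[bool]) -> int:
    """Return the highest level for which this and all lower criteria are met.

    criteria_met[0] belongs to level 1, criteria_met[1] to level 2, and so on.
    """
    level = 0
    for met in criteria_met:
        if not met:
            break  # a gap stops the count, even if higher criteria are met
        level += 1
    return level


# Hypothetical repository: online (1), has a readme (2),
# no structured metadata (3), but listed in a meta-repository (4).
open_publication_repository = [True, True, False, True]
level = cumulative_level(open_publication_repository)
print(level)
```

In this example the listing in a meta-repository does not count, because the level 3 criterion is missing.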

Versioning
(0) No software versioning applied.
(1) There is some kind of versioning for the software.
(2) The software uses structured (e.g. semantic) versioning.
(3) A description of the versioning scheme is available.
(4) There is documentation on release cycles for the software.
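
An automated check for level (2) could look roughly like the following sketch. It assumes that "structured" means semantic versioning and uses a simplified pattern (optional leading "v", MAJOR.MINOR.PATCH, optional pre-release and build parts) rather than the full grammar from the semantic versioning specification. The function name is hypothetical.

```python
import re

# Simplified pattern, an assumption for illustration only; the official
# specification defines a stricter grammar (e.g. no leading zeros).
SEMVER_RE = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)"      # MAJOR.MINOR.PATCH, optional "v" prefix
    r"(?:-[0-9A-Za-z.-]+)?"        # optional pre-release, e.g. -rc.1
    r"(?:\+[0-9A-Za-z.-]+)?$"      # optional build metadata, e.g. +build.5
)


def looks_like_semver(tag: str) -> bool:
    """Check whether a release tag roughly follows semantic versioning."""
    return SEMVER_RE.match(tag) is not None


for tag in ["1.2.3", "v2.0.0-rc.1", "2024-10", "1.2"]:
    print(tag, looks_like_semver(tag))
```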

Identifier
(0) No PIDs given.
(1) A handle/URL is provided to identify the software release.
(2) The handle/URL is provided with a defined metadata scheme.
(3) A persistent identifier is provided.
(4) A PID allowing for automated harvesting of metadata information is provided.

Rich Metadata
(0) No metadata given.
(1) Some metadata information is provided with the software.
(2) The metadata information completely follows a given metadata scheme (e.g. DataCite, CRAN → in FAQ).
(3) A metadata curation process reflects changes/updates.
(4) All metadata information following the given metadata scheme can be automatically harvested.

Dimension Accessible

The following questions address the aspect of being able to access research software. Accessing includes the possibility to run the software, which might also be in the form of a web service. However, accessibility does not include the possibility to adjust the code, which is instead captured under the aspect of reusability.

Access Conditions (organizational)
(0) Not specified.
(1) A contact is given from whom the right to use the software can be requested.
(2) The software has a license describing rights of use.
(3) The license allows for free/open use of the software (e.g. OSI licenses).
(4) Software license comes from the FLOSS list.

Access Options (process)
(0) There is only one specific form of accessing the software or no option at all.
(1) The software (source code or executable) is provided.
(2) The sources or executables being provided include some documentation on how to install/use the software.
(3) Provided checks make it possible to determine whether the installation worked as expected (e.g. make install with return code 0).
(4) Provided test cases make sure the software works correctly (provides correct results). (e.g. make check with return code 0)
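
Levels (3) and (4) lend themselves to automation, since both come down to checking a return code. The sketch below is one possible way to do that. The helper name is hypothetical, and stand-in commands are used so that it runs without a Makefile; in a real check they would be replaced by `["make", "install"]` and `["make", "check"]`, run in the checked-out source directory.

```python
import subprocess
import sys
from typing import Sequence


def command_succeeds(command: Sequence[str], cwd: str = ".") -> bool:
    """Run a command and report whether it exited with return code 0."""
    result = subprocess.run(command, cwd=cwd, capture_output=True)
    return result.returncode == 0


# Stand-in commands (assumption): a Python process that exits with 0 or 1,
# in place of "make install" and "make check".
install_ok = command_succeeds([sys.executable, "-c", "raise SystemExit(0)"])
tests_ok = command_succeeds([sys.executable, "-c", "raise SystemExit(1)"])

print("installation check passed:", install_ok)
print("test cases passed:", tests_ok)
```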

Technical Accessibility (run/start)
(0) No information given.
(1) “How to install” information is provided.
(2) Installation scripts are provided.
(3) The software allows for (semi-)automated installation, e.g. a Makefile or manual package
(like Python modules).
(4) A complete package that enables execution (e.g. container, app package) is available.

Dimension Interoperable

The following questions address the aspect of being interoperable, i.e., the possibilities of being able to integrate the software into one’s own software framework or execution pipelines.

Input/Output Formats
(0) Not specified.
(1) Some description of input and output formats is provided (potentially external).
(2) The software builds on standard formats for input and output.
(3) Additional options for varying input/output formats are provided.
(4) The software builds on accepted community standards for input/output data.

Adaptability/Flexibility of Use
(0) No information given.
(1) There is a way to use the software with one defined set of input data.
(2) There are parameters to adjust the way the software is working.
(3) There is some way of logging what is done during execution.
(4) There is a documented way to integrate the software into one’s own software framework or execution pipelines, e.g. via APIs, containers, web services etc.

Dimension Reusable

The following questions address the aspect of being reusable. In addition to being accessible, i.e. executable, reusability includes the possibility to actually change/adapt the code.

new: Support
Idea: add levels of support through SW developers -> still to be defined

Reusability Conditions
(0) Not clear.
(1) The software uses a custom license allowing reuse. (i.e. ask your lawyer before you use it)
(2) The software uses a FOSS/OSI approved license, and license dependencies are at least checked manually.
(3) The software uses an appropriate license for different file types (code, text, images etc.)
following e.g. the REUSE specification.
(4) There is a process available for automatically checking e.g. the REUSE specification.

Dimension Scientific basis

The following questions address the aspect of the software being scientifically well grounded.
While domain specific scientific requirements have to be assessed as part of a scientific peer review process, certain generic aspects of good scientific practice can be assessed for all research software.

Community Standards
(0) No information given.
(1) The connection to known (scientific) standards is drawn.
(2) The software follows standards of the relevant scientific community.
(3) The software complies with relevant scientific standards of the field.
(4) There is an indication on how further evolution of community standards will be addressed.

Team Expertise
(0) No information given.
(1) Clear expertise from a single, relevant domain is part of the software development team.
(2) The software development team has access to expertise in several relevant domains.
(3) The software development team has access to expertise in all relevant domains.
(4) A fixed, established, interdisciplinary team works on the software.

Scientific Embedding
(0) No information given.
(1) At least one scientific use case is documented.
(2) A broader scientific context is documented including several examples.
(3) The software development is at least loosely connected to some scientific initiative.
(4) The software development is part of a larger scientific initiative with dedicated processes for software development.

Dimension Technical basis

The following questions address aspects of professional software development leading to sustainable, high quality research software.

Project Management
(0) No information on project management and code history is provided.
(1) A version control system is used.
(2) A version control system that is part of a code project management platform (e.g. GitHub,
GitLab) and an associated ticket system are in place.
(3) A transparent process for ticket resolution, code review by other developers, and merge
requests is established.
(4) A release process with guaranteed changelog generation, testing, and product provisioning is established.

Repository Structure
(0) No information given.
(1) All files are provided in some structured/unstructured way inside the repository.
(2) The repository is structured, although possibly in such a way that every contributor is free to follow their own way of organizing files.
(3) A contribution mechanism is documented, e.g. CONTRIBUTORS.md file, as well as a defined structure for the repository and a documented onboarding process.
(4) A common template for the repository structure is available, as well as some kind of
identification of deviation.

Code Structure
(0) No information given.
(1) Every developer is free to use his/her own style of coding.
(2) There are general recommendations for coding, although every developer may still follow his/her own style.
(3) There is some harmonization of code style being enforced following common standards
including meaningful naming of functions/variables etc.
(4) The code style is enforced via a review process (e.g. manual review, failed pipelines or auto-formatting).

Reproducibility (Code)
(0) No tests, or duplicated code.
(1) The code follows a modular structure allowing for component reusability.
(2) Clear system requirements are documented with min/max versions, albeit version pinning, modularity etc. being enforced manually.
(3) Test coverage is measured, albeit tests may be written on a voluntary basis.
(4) Automated testing for different system environments, requirements for minimal test coverage, and provisioning of containerized packages is done.

Code change process
(0) No information
(1) Internal 4-eye principle for accepting changes
(2) Code changes via transparent processes, e.g. merge/pull request
(3) Approval of code changes via transparent processes and with a 4-eye principle
(4) Integration of code changes into main development branch/releases only allowed for specifically named/trained persons.

Security
(0) No security concepts given.
(1) There are at least sporadic updates and dependency checks.
(2) There is a systematic assessment of dependencies and documentation of the software stack. (this can include a documentation that dependencies have no security concept)
(3) Deployment is provided within a CI/CD framework for different environments including
tools that check for security leaks.
(4) Regular, automated security monitoring and an automated update process are in place, allowing merges only if security checks have been passed.
