
DQMaRC submission #215

Open
19 of 32 tasks
ALightNHS opened this issue Oct 31, 2024 · 4 comments

Comments

@ALightNHS commented Oct 31, 2024

Submitting Author: Anthony Lighterness (@ALightNHS)
All current maintainers: (@ALightNHS, @Lighterny)
Package Name: DQMaRC
One-Line Description of Package: A Python Tool for Structured Data Quality Profiling
Repository Link: https://github.com/christie-nhs-data-science/DQMaRC
Version submitted: v1.0.4
EiC: TBD
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:
    DQMaRC (Data Quality Markup and Ready-to-Connect) is a Python package that allows users to profile the quality of structured tabular datasets across six dimensions of data quality. These dimensions, as defined by the Data Management Association (DAMA), include Completeness, Validity, Consistency, Uniqueness, Timeliness, and Accuracy. A brief conceptual sketch of two of these dimensions follows.
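
To give a rough sense of what dimension-level profiling captures, below is a minimal, generic pandas sketch of two of the six dimensions (completeness and uniqueness). It is a conceptual illustration on assumed toy data and does not use DQMaRC's actual API:

```python
import pandas as pd

# Toy dataset with one missing value and one duplicated identifier.
df = pd.DataFrame({
    "patient_id": [101, 102, 102, 104],
    "diagnosis_date": ["2024-01-05", None, "2024-02-11", "2024-03-20"],
})

# Completeness: fraction of non-missing cells per column.
completeness = df.notna().mean()

# Uniqueness: flag rows whose identifier duplicates an earlier row.
duplicate_flags = df.duplicated(subset=["patient_id"], keep="first")

print(completeness)
print(duplicate_flags)
```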

Scope

  • Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization¹
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

Community Partnerships

If your package is associated with an existing community, please check below:

  • For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

    • Who is the target audience and what are scientific applications of this package?
      The target audience for DQMaRC is a broad range of professionals seeking to do deep-dive analysis of the quality of structured/tabular datasets. This may include data scientists, analysts, statisticians, engineers, or data managers, among others. We built this tool primarily for Python users so that it can be adapted to a broad range of data infrastructures, but we also built a user-friendly front-end graphical user interface (using Shiny for Python) so that it is accessible to both technical and non-technical users.

    • Are there other Python packages that accomplish the same thing? If so, how does yours differ?
      There are popular data validation tools such as Pydantic and ydata-profiling, but our tool differs in how it handles the data quality test parameters and in the product it generates: a cell-level binary mark-up of data quality flags joined to the source data. Specifically:
      (1) Test parameter setup
      Our tool lets users set up and maintain the data quality test parameters (i.e. the instructions telling DQMaRC which data quality tests to run and how) in a table format, which can be a CSV file or a database table. This non-programmatic approach makes it easier to set up and explain how data quality profiling is performed, and the parameter table also forms part of a data governance artefact otherwise known as "metadata". On first use, our tool lets users initiate a test parameter template tailored to the input dataset, which allows a user to immediately run the tool to profile two of the six dimensions: completeness and uniqueness. We encourage users to then take the time to specify the other parameters to make the results more meaningful.
      (2) Data quality mark-up report
      Another key difference is that our tool was designed to generate an output containing a cell-level binary mark-up of the data quality results. This mark-up is joined with the source data and contains indicators of data quality errors based on the test parameter setup. The format of this output allows detailed analysis of the data quality issues present in the source data, which can be run ad hoc or scheduled routinely (an illustrative sketch is included just before the Technical checks section).

    • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
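
To make points (1) and (2) above more concrete, here is a hedged sketch of a table-driven parameter file and of a cell-level binary mark-up joined to the source data. The column names, parameter fields, and flag naming are illustrative assumptions, not DQMaRC's actual schema or API:

```python
import pandas as pd
from io import StringIO

# Hypothetical test parameter table; in practice this could live in a CSV
# file or a database table. Field names are illustrative, not DQMaRC's schema.
params_csv = """field,completeness,uniqueness,valid_range
patient_id,True,True,
age,True,False,0-120
"""
params = pd.read_csv(StringIO(params_csv))

# Source data to profile.
source = pd.DataFrame({"patient_id": [1, 1, 3], "age": [34, None, 150]})

# Cell-level binary mark-up: 1 flags a data quality error, 0 means the cell
# passed the corresponding test. Flags are joined column-wise to the source.
markup = pd.DataFrame({
    "patient_id_uniqueness": source["patient_id"].duplicated().astype(int),
    "age_completeness": source["age"].isna().astype(int),
    "age_validity": (~source["age"].between(0, 120) & source["age"].notna()).astype(int),
})
report = source.join(markup)
print(report)
```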

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • uses an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a tutorial with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser, text-based review. It will also allow you to demonstrate addressing the issues via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • I have read the author guide.
  • I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Footnotes

  1. Please fill out a pre-submission inquiry before submitting a data visualization package.

@SimonMolinsky (Collaborator)

Hi @ALightNHS

Thanks for sending the package for review. We will do a few pre-review checks this week, so stay tuned!

@SimonMolinsky (Collaborator)

Hi @ALightNHS

Before we start a review, I need to know why you didn't check this box: I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Is this package for publication only, and then you will move on to other projects?

@Lighterny commented Nov 7, 2024 via email

@SimonMolinsky (Collaborator)

@Lighterny Thank you for your explanation! Regarding your last statement, I agree that the package is extremely useful and I see its potential. What bothers me now is the maintenance status. Active maintenance is an important requirement for pyOpenSci. If I understood correctly, you'd like to maintain the package in the future, but would you have control over the repository?

I advise discussing the future maintenance status with your previous team - we need to know who will have control over the repository, PyPI, or conda two years after submission. It could be you, but you should have admin access to the repository.

On the other hand, JOSS doesn't have this maintenance requirement, so you can send the package there without any delay.

Status: pre-review-checks