BlockingPy submission #232

T-Strojny · 2025-01-09T18:13:33Z

Submitting Author: (@T-Strojny)
All current maintainers: @T-Strojny
Package Name: BlockingPy
One-Line Description of Package: Blocking records for record linkage and deduplication with Approximate Nearest Neighbor algorithms.
Repository Link: https://github.com/ncn-foreigners/BlockingPy
Version submitted: v0.1.7
EiC: @coatless
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD

Code of Conduct & Commitment to Maintain Package

I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.

Description

Include a brief paragraph describing what your package does: BlockingPy is a package that speeds up record linkage and deduplication tasks by using Approximate Nearest Neighbor (ANN) algorithms to create blocks with candidate record pairs. When linking or deduplicating large datasets, comparing all possible record pairs becomes computationally infeasible. BlockingPy solves this by using ANN algorithms to quickly identify similar records while significantly reducing the number of required comparisons.
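To illustrate the scale of the problem described above, here is a minimal arithmetic sketch of how the number of pairwise comparisons changes once records are grouped into blocks. The function names and the block sizes are hypothetical and chosen only for illustration; they are not part of BlockingPy.

```python
from math import comb


def full_comparisons(n: int) -> int:
    """Number of unordered record pairs when every record is compared with every other."""
    return comb(n, 2)


def blocked_comparisons(block_sizes: list[int]) -> int:
    """Number of pairs when only records inside the same block are compared."""
    return sum(comb(size, 2) for size in block_sizes)


n = 1_000_000
print(full_comparisons(n))  # 499999500000

# Hypothetical blocking outcome: 250,000 blocks of 4 records each.
print(blocked_comparisons([4] * 250_000))  # 1500000
```

The full comparison count grows quadratically with the number of records, while the blocked count depends only on the block sizes.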

Scope

Please indicate which category or categories.
Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- Data retrieval
- Data extraction
- Data processing/munging
- Data deposition
- Data validation and testing
- Data visualization¹
- Workflow automation
- Citation management and bibliometrics
- Scientific software wrappers
- Database interoperability

Domain Specific

Geospatial
Education

Community Partnerships

If your package is associated with an
existing community please check below:

Astropy: My package adheres to Astropy community standards
Pangeo: My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook

For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Data processing/munging : BlockingPy transforms raw data into feature vectors and applies ANN algorithms and graphs to reduce the comparison space which enables scalable record linkage and deduplication.
- Who is the target audience and what are scientific applications of this package?
  BlockingPy is aimed at data scientists, researchers, and analysts who work with large datasets that require record matching or deduplication and who need a scalable approach.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
  There are many packages for record linkage; ours specializes in the blocking task and takes a novel approach by using ANN algorithms.
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
  No inquiry was made
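The pipeline described in the scope answer above (records to feature vectors, nearest-neighbor search, then a graph that yields blocks) can be sketched in plain Python. This is a toy illustration under stated assumptions, not BlockingPy's API: the helper names `shingles`, `cosine`, and `build_blocks` are hypothetical, character bigrams stand in for the package's feature encoding, and an exact brute-force nearest-neighbor search stands in for an ANN index.

```python
from collections import Counter
from math import sqrt


def shingles(text: str, n: int = 2) -> Counter:
    """Encode a string as a bag of character n-grams (a sparse feature vector)."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_blocks(records: list[str]) -> list[list[int]]:
    """Link each record to its nearest neighbor; connected components become blocks."""
    vectors = [shingles(r) for r in records]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, v in enumerate(vectors):
        # Assumption: exact brute-force search. An ANN index would replace
        # this O(n) scan per record with an approximate, sub-linear query.
        others = (j for j in range(len(records)) if j != i)
        best = max(others, key=lambda j: cosine(v, vectors[j]), default=None)
        if best is not None:
            parent[find(i)] = find(best)

    blocks: dict[int, list[int]] = {}
    for i in range(len(records)):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


records = ["john smith", "jon smith", "maria garcia", "maria garcya"]
print(build_blocks(records))  # [[0, 1], [2, 3]]
```

Only pairs inside the same block would then be passed to a detailed comparison step. Note that this sketch always links a record to some neighbor, even a dissimilar one; a real pipeline might apply a distance threshold.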

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

does not violate the Terms of Service of any service it interacts with.
uses an OSI approved license.
contains a README with instructions for installing the development version.
includes documentation with examples for all functions.
contains a tutorial with examples of its essential functions and uses.
has a test suite.
has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

I have read the author guide.
I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

Last but not least, please fill out our pre-review survey. This helps us track
submissions and improve our peer review process. We will also ask our reviewers
and editors to fill this out.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

¹ Please fill out a pre-submission inquiry before submitting a data visualization package.

coatless · 2025-01-21T06:53:42Z

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci
review. Below are the basic checks that your package needs to pass
to begin our review. If some of these are missing, we will ask you
to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements
below.

Initial onboarding survey was filled out
We appreciate each maintainer of the package filling out this survey individually. 🙌
Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌

Editor comments

BlockingPy is in pristine shape for moving forward with a review! Nice work on getting it packaged for Python and implemented. Happy to see both mlpack and the original note on the R blocking package being emphasized.

T-Strojny · 2025-01-21T08:56:28Z

That's great to hear! Thank you for the feedback.

T-Strojny added 0/pre-review-checks New Submission! labels Jan 9, 2025

github-project-automation bot added this to peer-review-status Jan 9, 2025

lwasser moved this to pre-review-checks in peer-review-status Jan 9, 2025

coatless added 0/seeking-editor and removed 0/pre-review-checks labels Jan 21, 2025

lwasser moved this from pre-review-checks to seeking-editor in peer-review-status Jan 21, 2025

coatless removed the New Submission! label Jan 23, 2025
