Reduce pipeline processes #298

Open
wants to merge 68 commits into
base: maint_0.4

Conversation

christian-monch
Collaborator

This PR partially fixes #261 and fixes #268.

This PR modifies the dataset traverser component to send almost all information it has about the traversed dataset elements to the processors. That reduces the number of processes that the processors have to execute.

@christian-monch christian-monch changed the base branch from master to maint_0.4 January 16, 2023 12:57
@christian-monch christian-monch force-pushed the enh-reduce-pipeline-processes branch from e18f550 to 113321d Compare January 26, 2023 13:30
@christian-monch christian-monch force-pushed the enh-reduce-pipeline-processes branch from 113321d to 6183a2c Compare February 27, 2023 08:02
@codecov-commenter

codecov-commenter commented Feb 27, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (5c2181f) 86.27% compared to head (5c2181f) 86.27%.

❗ Current head 5c2181f differs from pull request most recent head 7e2b5ab. Consider uploading reports for the commit 7e2b5ab to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           maint_0.4     #298   +/-   ##
==========================================
  Coverage      86.27%   86.27%           
==========================================
  Files             88       88           
  Lines           4830     4830           
==========================================
  Hits            4167     4167           
  Misses           663      663           


@yarikoptic
Member

that's a lot of commits/work -- is it going to be merged?

This commit adds `build` and `twine` to
`requirements-devel.txt`. It also moves
sphinx-dependencies into development
requirements.

The datalad version is updated to >=0.17

In addition, it sorts the entries in
`requirements-devel.txt` and
`requirements.txt`.

This commit introduces AnnexedFileInfo
to hold annex-status information for a
single file. To simplify handling,
the dataclasses_json package is used
and added to the requirements.

The Python version requirement has been
set to >=3.7

This commit extends the FileInfo dataclass
and derives the AnnexedFileInfo class from it.
The classes hold file information that is
returned by AnnexRepo.get_content_annexinfo()
or by GitRepo.status().

It adds a parameter to pass JSON-serialized
FileInfo or AnnexedFileInfo objects to the
extract process via arguments, thus removing
the need to invoke git-annex to
determine file status.
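
A minimal sketch of the idea described in this commit message: a base dataclass, a derived dataclass for annexed files, and a JSON round trip so that an object can travel as a command-line argument. The field names (`path`, `type`, `git_sha_sum`, `byte_size`, `key`, `has_content`) are illustrative assumptions and not the PR's actual schema. The sketch uses only the standard library; the PR itself uses the dataclasses_json package.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FileInfo:
    # Hypothetical fields; the real class may carry different ones.
    path: str
    type: str
    git_sha_sum: str
    byte_size: int

    def to_json(self) -> str:
        # Serialize all dataclass fields into a JSON object.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "FileInfo":
        # `cls` makes this work unchanged for derived classes.
        return cls(**json.loads(text))


@dataclass
class AnnexedFileInfo(FileInfo):
    # Hypothetical annex-specific additions.
    key: str = ""
    has_content: bool = False
```

The string returned by `to_json()` could be handed to the extract process as an argument value, and `from_json()` would rebuild the object on the receiving side without a call to git-annex.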

This commit uses the --file-info parameter
to provide extractors with status information
about the element from which metadata should
be extracted.

This commit adds code to handle repositories
that do not possess an ID. Usually these are
plain git repositories.

This commit adds a pipeline provider and a pipeline
processor with definable input/output behavior. Both the
content that is yielded and the rate at which it is
yielded can be defined externally. This allows
repeatable performance measurements.
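
As a rough illustration of such a probe, the sketch below yields externally supplied content at an externally supplied interval. The class name `ProbeProvider` and its parameters are hypothetical; they are not the names or interfaces used by meta-conduct.

```python
import time
from typing import Any, Iterable, Iterator


class ProbeProvider:
    """Hypothetical provider that replays fixed content at a fixed rate."""

    def __init__(self, content: Iterable[Any], interval: float = 0.0):
        # Both the items and the delay between them come from the caller.
        self.content = list(content)
        self.interval = interval

    def __iter__(self) -> Iterator[Any]:
        for item in self.content:
            if self.interval > 0:
                # Throttle the output to make timing runs repeatable.
                time.sleep(self.interval)
            yield item
```

Because the yielded items and the delay are fixed inputs, two runs with the same arguments feed a pipeline the same workload.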

This commit adds information about the object ID
and the PID of the processor and provider
probes that are executed by meta-conduct.

This commit adds an invocation counter to the processor
probe that counts the invocations of this instance of
the probe.

This commit rebases the branch on maint_0.4
and adds code to check for the existence
of dataset IDs.

This commit fixes the reporting of datasets in
traversal.

There is still something to do for datasets, i.e.
report "state", "gitshasum", and "prev_gitshasum".

This commit adds size information to the
traverser output for non-annexed files.
Because git-ls-files does not provide
size information, an additional git-ls-tree
call is used to determine the sizes of non-annexed
files.
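
A sketch of how sizes could be read from `git ls-tree`, assuming the `--format` option (available from Git 2.36) with the format string `%(objectsize) %(path)`. The format string and the function names are assumptions for illustration; the PR may use a different invocation.

```python
import subprocess
from typing import Dict


def parse_size_records(output: str) -> Dict[str, int]:
    """Parse NUL-terminated '<size> <path>' records into {path: size}."""
    sizes = {}
    for record in output.split("\0"):
        if not record:
            continue
        size, path = record.split(" ", 1)
        # Non-blob entries (for example submodule commits) report "-".
        if size != "-":
            sizes[path] = int(size)
    return sizes


def tracked_file_sizes(repo_dir: str, treeish: str = "HEAD") -> Dict[str, int]:
    """Run git ls-tree recursively and return blob sizes by path."""
    result = subprocess.run(
        ["git", "ls-tree", "-r", "-z",
         "--format=%(objectsize) %(path)", treeish],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return parse_size_records(result.stdout)
```

The `-z` option keeps paths unquoted and NUL-terminated, so paths containing spaces or newlines survive parsing.
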
@mslw
Contributor

mslw commented Apr 17, 2023

Git version 2.30.2, which is currently available in debian-stable, does not support `git ls-tree --format`. If I'm reading the docs correctly, `--format` was added in Git v2.36.0.

I very much like the spirit of this PR and I'm not sure whether debian-stable compatibility is a goal, so I'm just leaving this as an observation.
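
One possible way to stay compatible with older Git versions, sketched under assumptions: check the installed version and fall back to the long listing (`git ls-tree -l`), which also reports blob sizes. The helper names and the format string are hypothetical and not taken from the PR.

```python
import re
from typing import Dict, List, Tuple


def parse_git_version(version_output: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from `git --version` output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError(f"cannot parse git version from {version_output!r}")
    return tuple(int(part or 0) for part in match.groups())


def ls_tree_args(version: Tuple[int, ...]) -> List[str]:
    """Choose ls-tree arguments that the given Git version supports."""
    if version >= (2, 36, 0):
        return ["ls-tree", "-r", "-z", "--format=%(objectsize) %(path)"]
    # Older Git: long listing, '<mode> <type> <object> <size>\t<path>'.
    return ["ls-tree", "-r", "-z", "-l"]


def parse_long_records(output: str) -> Dict[str, int]:
    """Parse NUL-terminated `git ls-tree -l -z` records into {path: size}."""
    sizes = {}
    for record in output.split("\0"):
        if not record:
            continue
        meta, path = record.split("\t", 1)
        size = meta.split()[3]
        # Non-blob entries (for example submodule commits) report "-".
        if size != "-":
            sizes[path] = int(size)
    return sizes
```

With this split, only the argument list and the parser depend on the Git version; the code that consumes the resulting path-to-size mapping stays the same.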
