Alternative paradigm for concurrent batch processes #537
#538 has a port to Windows, and is ready to merge. Next steps:
Very interesting: nice short implementation, nice interface. A very good idea to use an additional thread to fill stdin. That makes it quite fast.

**Paradigm**

The pipeline approach is nice, but AFAICS there are two open questions:

I think it would be interesting to see how, for example,

**Implementation**

In order to evaluate how much code duplication, code change, and code reuse we would suffer or enjoy, I added the thread-based approach for stdin-writing to the

The equivalent execution with

The

**Other criteria**

Pros of
I am curious how it would compare to an async implementation that would do the same. Not that we would be able to use it within datalad, but as a data point to beware of. @jwodder would it be hard to come up with a "native" async version for such batching? I guess we might have it already to some degree within
@yarikoptic I'm not entirely clear what you're asking.
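As a data point for that question, here is a minimal sketch of what a "native" asyncio take on such batching could look like. This is illustrative only and not datalad code; the choice of command, the `keys` argument, the helper name, and the example key are assumptions. A feeder task writes one request per line to the batch process while the consumer reads one response line per request, mirroring the thread-fed stdin approach discussed above.

```python
import asyncio


async def batch_examinekey(keys):
    # start a long-running batch process (illustrative command choice)
    proc = await asyncio.create_subprocess_exec(
        'git', 'annex', 'examinekey', '--json', '--batch',
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )

    async def feed():
        # one request per line; closing stdin ends the batch session
        for key in keys:
            proc.stdin.write(key + b'\n')
            await proc.stdin.drain()
        proc.stdin.close()

    feeder = asyncio.create_task(feed())
    results = []
    # one JSON line is emitted per request
    async for line in proc.stdout:
        results.append(line.rstrip(b'\n'))
    await feeder
    await proc.wait()
    return results

# example (hypothetical key):
# asyncio.run(batch_examinekey([b'MD5E-s3--acbd18db4cc2f85cedef654fccc4a4d8.dat']))
```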
Here is a complete implementation that sits on top of #538. It goes all-in with the itertools! It is a bit slower than the sketch above, but still faster than anything else.
Note that this is using

```python
from datalad_next.iterable_subprocess.iterable_subprocess import iterable_subprocess
from re import finditer
from more_itertools import (
    intersperse,
    side_effect,
)
from queue import Queue


# takes an iterable of bytes, yields items defined by a delimiter
# tries to make as few copies as possible
def proc_byte_chunks(iter, delim):
    remainder = b''
    for chunk in iter:
        for item, incomplete in proc_byte_chunk(chunk, delim):
            if incomplete:
                remainder += item
                continue
            elif remainder:
                item = remainder + item
                remainder = b''
            yield item
    if remainder:
        yield remainder


# takes bytes, yields items defined by a delimiter,
# plus a bool to indicate incomplete items (did not end with the delimiter)
def proc_byte_chunk(chunk, delim):
    pos = 0
    for match in finditer(delim, chunk):
        yield (chunk[pos:match.start()], False)
        pos = match.end()
    remainder = chunk[pos:]
    if remainder:
        yield (remainder, True)


# replaces empty items in an iterable with a marker
def none2marker(iter, marker):
    for i in iter:
        yield i if i else marker


# replaces items that end with the marker with `None`
def marker2none(iter, marker):
    for i in iter:
        yield None if i.endswith(marker) else i


# FIFO to buffer filenames
# we buffer them, and not the key examination outcomes, to minimize memory demands
filenames = Queue()


# helper to put an end marker into the filename queue
# this is used to signal the end of iteration to zip() at the very end
def mark_filename_queue_end():
    filenames.put(None)


# fake URL annex key we feed through examinekey instead of not calling
# it at all -- lazy but fast
magic_marker = b'URL--MAGIC'

with \
        iterable_subprocess(
            # feeder of the pipeline, zero-byte delimited items
            ['git', 'ls-files', '-z'], tuple(),
        ) as glsf, \
        iterable_subprocess(
            # we get the annex key for any filename (or empty if not annexed)
            ['git', 'annex', 'find', '--anything', '--format=${key}\n', '--batch'],
            # intersperse items with newlines to trigger a batch run
            # this avoids string operations to append newlines to items
            intersperse(
                b'\n',
                # the side effect is that any filename item coming from ls-files
                # is put into a FIFO, to be later merged with the key properties
                side_effect(
                    filenames.put,
                    # we parse the raw byte chunks from the subprocess into
                    # zero-byte delimited items
                    proc_byte_chunks(glsf, rb'\0'),
                    # after the last filename, we put an end marker into the queue
                    after=mark_filename_queue_end,
                )
            ),
        ) as gaf, \
        iterable_subprocess(
            # get the key object path (example property, could also be a JSON record)
            ['git', 'annex', 'examinekey', '--format=${objectpath}\n', '--batch'],
            # again intersperse to kick off batches
            intersperse(
                b'\n',
                # feed a fake key that does no harm to the command, when there is none
                none2marker(
                    proc_byte_chunks(gaf, rb'\n'),
                    magic_marker,
                )
            ),
        ) as gek:
    # consume (the enumerate is not needed, just for decoration)
    for r in enumerate(
            # zip filenames from ls-files to key examination outcomes
            # the `None` argument to iter() stops iteration when it
            # gets the end marker from the FIFO
            zip(iter(filenames.get, None),
                # we undo the marker and turn such items into `None`
                marker2none(
                    proc_byte_chunks(gek, rb'\n'),
                    magic_marker)
                )
    ):
        print(r)
```
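A quick illustration of the chunk re-assembly performed by `proc_byte_chunks` in the script above (the input bytes are made up, and the helper definitions from the script are assumed to be in scope): chunk boundaries from the subprocess can fall anywhere, and items are only yielded once their delimiter has been seen.

```python
# assumes proc_byte_chunks()/proc_byte_chunk() from the script above
chunks = [b'file1\0fi', b'le2\0', b'file3\0file4']
# items are re-assembled across chunk boundaries; the trailing item
# without a delimiter is flushed at the very end
assert list(proc_byte_chunks(iter(chunks), rb'\0')) == [
    b'file1', b'file2', b'file3', b'file4']
```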
And here is one more variant of the script. No printing of records, but proper loading of complete JSON records from

End-to-end benchmark:
```python
from datalad_next.iterable_subprocess.iterable_subprocess import iterable_subprocess
from re import finditer
from more_itertools import (
    intersperse,
    side_effect,
)
from queue import Queue
import json


# takes an iterable of bytes, yields items defined by a delimiter
# tries to make as few copies as possible
def proc_byte_chunks(iter, delim):
    remainder = b''
    for chunk in iter:
        for item, incomplete in proc_byte_chunk(chunk, delim):
            if incomplete:
                remainder += item
                continue
            elif remainder:
                item = remainder + item
                remainder = b''
            yield item
    if remainder:
        yield remainder


# takes bytes, yields items defined by a delimiter,
# plus a bool to indicate incomplete items (did not end with the delimiter)
def proc_byte_chunk(chunk, delim):
    pos = 0
    for match in finditer(delim, chunk):
        yield (chunk[pos:match.start()], False)
        pos = match.end()
    remainder = chunk[pos:]
    if remainder:
        yield (remainder, True)


# replaces empty items in an iterable with a marker
def none2marker(iter, marker):
    for i in iter:
        yield i if i else marker


# replaces items that end with the marker with `None`
def marker2none(iter, marker):
    for i in iter:
        yield None if i.endswith(marker) else i


# load JSON-encoded byte strings
def jsonloads(iter):
    for i in iter:
        yield None if i is None else json.loads(i)


# FIFO to buffer filenames
# we buffer them, and not the key examination outcomes, to minimize memory demands
filenames = Queue()


# helper to put an end marker into the filename queue
# this is used to signal the end of iteration to zip() at the very end
def mark_filename_queue_end():
    filenames.put(None)


# fake URL annex key we feed through examinekey instead of not calling
# it at all -- lazy but fast
magic_marker = b'URL--MAGIC'

with \
        iterable_subprocess(
            # feeder of the pipeline, zero-byte delimited items
            ['git', 'ls-files', '-z'], tuple(),
        ) as glsf, \
        iterable_subprocess(
            # we get the annex key for any filename (or empty if not annexed)
            ['git', 'annex', 'find', '--anything', '--format=${key}\n', '--batch'],
            # intersperse items with newlines to trigger a batch run
            # this avoids string operations to append newlines to items
            intersperse(
                b'\n',
                # the side effect is that any filename item coming from ls-files
                # is put into a FIFO, to be later merged with the key properties
                side_effect(
                    filenames.put,
                    # we parse the raw byte chunks from the subprocess into
                    # zero-byte delimited items
                    proc_byte_chunks(glsf, rb'\0'),
                    # after the last filename, we put an end marker into the queue
                    after=mark_filename_queue_end,
                )
            ),
        ) as gaf, \
        iterable_subprocess(
            # get the key properties JSON-lines style
            ['git', 'annex', 'examinekey', '--json', '--batch'],
            # again intersperse to kick off batches
            intersperse(
                b'\n',
                # feed a fake key that does no harm to the command,
                # when there is none
                none2marker(
                    proc_byte_chunks(gaf, rb'\n'),
                    magic_marker,
                )
            ),
        ) as gek:
    # zip filenames from ls-files to key examination outcomes
    # the `None` argument to iter() stops iteration when it
    # gets the end marker from the FIFO
    lookup = dict(
        zip(iter(filenames.get, None),
            jsonloads(
                # we undo the marker and turn such items into `None`
                marker2none(
                    proc_byte_chunks(gek, rb'\n'),
                    magic_marker)))
    )
    print(f'N records={len(lookup)}')
```
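A brief, illustrative usage note (the filename is hypothetical): `lookup` maps each filename, as the raw bytes reported by `git ls-files -z`, to the decoded `examinekey` record for that file's key.

```python
# hypothetical filename; keys of `lookup` are raw bytes from `git ls-files -z`
record = lookup.get(b'data/file1.dat')
if record is not None:
    # field name as produced by git-annex's JSON output
    print(record.get('key'))
```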
I searched for alternatives to our current `ThreadedRunner`-based implementation for batch process handling (with context managers).

Related:

- `iter_annexworktree` with one run-context manager and two batch-context managers #520

and some more prior attempts by myself. All solutions so far appear to be rather heavy, and do not deliver the necessary performance.

Below is a demo of a leaner solution:

It runs the equivalent of the following pipeline:

As shown below, it connects three subprocesses via a "pipe". A custom iterator takes the conceptual place of our "protocols" and performs some output processing (pipe-inline).
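The original demo and the pipeline listing are not reproduced here. As an illustration of the chaining idea only (a sketch, not the demo itself; the downstream `tr` command is an arbitrary stand-in), the output chunk iterable of one `iterable_subprocess` can be passed directly as the stdin iterable of the next:

```python
from datalad_next.iterable_subprocess.iterable_subprocess import iterable_subprocess

with \
        iterable_subprocess(
            # zero-byte delimited filenames
            ['git', 'ls-files', '-z'], tuple(),
        ) as lsfiles, \
        iterable_subprocess(
            # arbitrary downstream consumer: turn NUL delimiters into newlines
            ['tr', '\\0', '\\n'],
            # the upstream output iterable *is* the downstream stdin
            lsfiles,
        ) as out:
    for chunk in out:
        print(chunk.decode(errors='replace'), end='')
```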
Benchmark:

So on my machine, with a test dataset of 36k files, I achieve the same performance as with the shell command.

I like this solution because it has minimal (code) overhead. The underlying implementation of `iterable_subprocess` is ~100 lines of code on top of the stdlib (a rough sketch of its mechanism is included at the end of this description).

While this looks good and solves an urgently needed use case, it is unclear to me what this code cannot do that `ThreadedRunner` does for our existing implementations.

@christian-monch please have a look!
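For readers unfamiliar with that helper, the following is a rough sketch of the mechanism being referred to, not the actual `iterable_subprocess` code (which additionally handles errors, exit codes, and process termination): a helper thread feeds the input iterable to the subprocess's stdin while the caller iterates over stdout chunks.

```python
from contextlib import contextmanager
from subprocess import Popen, PIPE
from threading import Thread


@contextmanager
def chunked_subprocess(cmd, input_chunks, chunk_size=65536):
    # hypothetical name; sketch of the thread-fed-stdin pattern only
    proc = Popen(cmd, stdin=PIPE, stdout=PIPE)

    def feed():
        try:
            for chunk in input_chunks:
                proc.stdin.write(chunk)
        finally:
            # closing stdin signals EOF to the subprocess
            proc.stdin.close()

    feeder = Thread(target=feed)
    feeder.start()
    try:
        # the caller iterates over raw stdout chunks
        yield iter(lambda: proc.stdout.read(chunk_size), b'')
    finally:
        feeder.join()
        proc.stdout.close()
        proc.wait()
```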