Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using casacore measures for computing parallactic angles acquires/drops the GIL excessively #265

Closed
JSKenyon opened this issue Jan 27, 2022 · 26 comments · Fixed by #269 · May be fixed by #266
Closed

Using casacore measures for computing parallactic angles acquires/drops the GIL excessively #265

JSKenyon opened this issue Jan 27, 2022 · 26 comments · Fixed by #269 · May be fixed by #266
Labels
enhancement New feature or request

Comments

@JSKenyon
Copy link
Collaborator

  • Codex Africanus version: 0.3.2
  • Python version: 3.8.12
  • Operating System: Pop!_OS 18.04

Description

When utilising the beam inside the fused RIME (in QuartiCal), performance takes a massive hit. @ulricharmel is seeing 60+ hour runtimes with the beam compared to around 8 without. Is this the same issue as the parallactic angle computation i.e. is the GIL to blame?

@sjperkins

This comment has been minimized.

@sjperkins sjperkins added the enhancement New feature or request label Jan 27, 2022
@sjperkins
Copy link
Member

Is this the same issue as the parallactic angle computation i.e. is the GIL to blame?

Also, I still believe from our discussions at the workshop that the use of the garbage collector in Quartical leads to poor performance. Has this changed?

@sjperkins sjperkins changed the title Beam in fused rime is very slow Support sampling of antenna terms at unique time, feed and antenna, rather than per row Jan 27, 2022
@JSKenyon
Copy link
Collaborator Author

Also, I still believe from our discussions at the workshop that the use of the garbage collector in Quartical leads to poor performance. Has this changed?

I think that I had come to an inaccurate conclusion due to the dataset in question containing several fields and correspondingly varied xds sizes. While I will make an effort to confirm that the garbage collector isn't to blame, I am pretty sure something else is afoot. Does the beam automatically compute the parallactic angle or does it rely on the user enabling it?

@sjperkins
Copy link
Member

Also, I still believe from our discussions at the workshop that the use of the garbage collector in Quartical leads to poor performance. Has this changed?

I think that I had come to an inaccurate conclusion due to the dataset in question containing several fields and correspondingly varied xds sizes. While I will make an effort to confirm that the garbage collector isn't to blame, I am pretty sure something else is afoot. Does the beam automatically compute the parallactic angle or does it rely on the user enabling it?

The above code does not compute parallactic angles, but it does consume parallactice angles computed by a transformer which computes them per unique time, feed and antenna.

@sjperkins
Copy link
Member

@JSKenyon
Copy link
Collaborator Author

This line:
https://github.com/ska-sa/codex-africanus/blob/174645637b39ea037a7b557e29cdebd55ba0146c/africanus/experimental/rime/fused/transformers/parangle.py#L47
calls
https://github.com/ska-sa/codex-africanus/blob/174645637b39ea037a7b557e29cdebd55ba0146c/africanus/experimental/rime/fused/transformers/parangle.py#L29-L34
which I believe means that the beam implementation will also exhibit the same problems as enabling the parallactic angle rotation. I may, however, be completely mistaken.

@JSKenyon
Copy link
Collaborator Author

I do have a decent test now where I have all the necessary beams etc, so I will make sure that the GC isn't involved.

@sjperkins
Copy link
Member

This line:

https://github.com/ska-sa/codex-africanus/blob/174645637b39ea037a7b557e29cdebd55ba0146c/africanus/experimental/rime/fused/transformers/parangle.py#L47

calls

https://github.com/ska-sa/codex-africanus/blob/174645637b39ea037a7b557e29cdebd55ba0146c/africanus/experimental/rime/fused/transformers/parangle.py#L29-L34

which I believe means that the beam implementation will also exhibit the same problems as enabling the parallactic angle rotation. I may, however, be completely mistaken.

Yes, if there's an issue affecting feed rotation, then it will affect the beam because they both depend on transformation of parallactic angles

@JSKenyon
Copy link
Collaborator Author

JSKenyon commented Jan 27, 2022

I can confirm that the GC is not the culprit. I have written a tiny hack for casa_parallactic_angles which spins off a process to run the measures server each time the function gets called. This seems to resolve the problem.

Edit: Though how robust it is I do not yet know. But very cool that it works from an objmode call inside the fused RIME. :-)

@sjperkins
Copy link
Member

I can confirm that the GC is not the culprit. I have written a tiny hack for casa_parallactic_angles which spins off a process to run the measures server each time the function gets called. This seems to resolve the problem.

Edit: Though how robust it is I do not yet know. But very cool that it works from an objmode call inside the fused RIME. :-)

Could you please push your changes up to a branch, in whatever exploratory state it's in?

@JSKenyon
Copy link
Collaborator Author

Will do!

@JSKenyon
Copy link
Collaborator Author

My hack currently lives in https://github.com/ska-sa/codex-africanus/tree/improve_parangles.

@JSKenyon
Copy link
Collaborator Author

This definitely doesn't seem to work with multiple workers yet. Does work from threads. So there will be some pickling required somewhere.

@sjperkins
Copy link
Member

This definitely doesn't seem to work with multiple workers yet. Does work from threads. So there will be some pickling required somewhere.

No problem, just exists to demonstrate the idea

@JSKenyon
Copy link
Collaborator Author

Actually, the problem is weirder. So the beams seem to have a problem with serialisation:

TypeError: ('Could not serialize object of type tuple.', "(<function load_beams.<locals>._load_correlation at 0x7f3722cf75e0>, <quartical.data_handling.predict.load_beams.<locals>.FITSFile object at 0x7f37230b0850>, <quartical.data_handling.predict.load_beams.<locals>.FITSFile object at 0x7f36f7f58670>, (<class 'tuple'>, [2, 1, 0]))")

That seems to be independent of the parallactic angle problem which, with the beam disabled, is:

AssertionError: daemonic processes are not allowed to have children

Which, incidentally, is the most awesome sounding error I have ever encountered.

@sjperkins
Copy link
Member

Actually, the problem is weirder. So the beams seem to have a problem with serialisation:

TypeError: ('Could not serialize object of type tuple.', "(<function load_beams.<locals>._load_correlation at 0x7f3722cf75e0>, <quartical.data_handling.predict.load_beams.<locals>.FITSFile object at 0x7f37230b0850>, <quartical.data_handling.predict.load_beams.<locals>.FITSFile object at 0x7f36f7f58670>, (<class 'tuple'>, [2, 1, 0]))")

That seems to be independent of the parallactic angle problem which, with the beam disabled, is:

AssertionError: daemonic processes are not allowed to have children

Which, incidentally, is the most awesome sounding error I have ever encountered.

Hmmmm, this means that FITSFile isn't easily pickleable -- it probably needs a custom __reduce__ method.

@sjperkins sjperkins changed the title Support sampling of antenna terms at unique time, feed and antenna, rather than per row Using casacore measures for computing parallactic angles acquires/drops the GIL excessively Jan 27, 2022
@JSKenyon
Copy link
Collaborator Author

AssertionError: daemonic processes are not allowed to have children

I have found a workaround for this - it requires making dask's worker processes non-daemonic when initialising the LocalCluster.

Doing so requires using dask.config.set({"distributed.worker.daemon": False}).

I am a little wary of this as a solution though, as it will make interrupting runs a little more likely to leave zombies (if my intuition is correct).

@sjperkins sjperkins linked a pull request Jan 27, 2022 that will close this issue
2 tasks
@sjperkins
Copy link
Member

AssertionError: daemonic processes are not allowed to have children

I have found a workaround for this - it requires making dask's worker processes non-daemonic when initialising the LocalCluster.

Doing so requires using dask.config.set({"distributed.worker.daemon": False}).

I am a little wary of this as a solution though, as it will make interrupting runs a little more likely to leave zombies (if my intuition is correct).

Hmmmm, useful to know. Off the top of my head it's probably desirable for Dask Worker's to be daemons. I feel that there may be some way around this.

@sjperkins
Copy link
Member

AssertionError: daemonic processes are not allowed to have children

I have found a workaround for this - it requires making dask's worker processes non-daemonic when initialising the LocalCluster.

Out of interest, does using multiprocessing.get_context("spawn").Pool(...) remove the need for non-daemonic dask workers?

@JSKenyon
Copy link
Collaborator Author

Out of interest, does using multiprocessing.get_context("spawn").Pool(...) remove the need for non-daemonic dask workers?

I gave it a try and got the same message - I think this is something we cannot really avoid. If the dask workers are created as daemonic I don't know if there will be a way to create processes from processes. For what it is worth, the child processes should be very short lived.

The other idea you suggested may work - having a ProcessPool associated with the RIME factory. I am not exactly sure how to hack that together quickly though and even then you may run into issues if the pool is created by a worker (although I suspect you are intending to create it at graph construction time, much like the thread pool associated with the reads/writes).

@sjperkins
Copy link
Member

sjperkins commented Jan 27, 2022

Out of interest, does using multiprocessing.get_context("spawn").Pool(...) remove the need for non-daemonic dask workers?

I gave it a try and got the same message - I think this is something we cannot really avoid. If the dask workers are created as daemonic I don't know if there will be a way to create processes from processes. For what it is worth, the child processes should be very short lived.

OK. It seems that if nanny processes are turned off then things might work, although that might interfere with other parts of the dask machinery: dask/distributed#2142

The other idea you suggested may work - having a ProcessPool associated with the RIME factory.

This is more a design decision within the RimeFactory itself and I don't think it'll escape the issue of daemon's not allowing child process.

I am not exactly sure how to hack that together quickly though and even then you may run into issues if the pool is created by a worker (although I suspect you are intending to create it at graph construction time, much like the thread pool associated with the reads/writes).

Yep, I think I'll need to devote some thought here, but essentially, RimeFactory's (like TableProxies) are embeddable in the graph and pickleable. Think of the FITSFile in Quartical we discussed. While the internal file object is heavyweight and impossible to pickle, the arguments to create the FITSFile are not

class FITSFile:
  def __init__(self, filename: str):
    # lightweight, easy to pickle
    self.filename = filename 
   # heavyweight, difficult to pickle   
    self.file = open(file)

  def __reduce__(self):
   # but we don't need self.file to represent the object for transmission
    return (FITSFile, (self.filename,))

What this means is that we can embed very complex objects in the graph, as long as the state required to reconstruct them via the pickle protocol is simple and itself pickleable. This is the case for the filename argument above.

It's also the case for RimeFactory, which contains all sorts of heavyweight objects (most notably a jitted function), but which is constructed from a lightweight RimeSpecification argument. Thus, we could have a heavyweight ProcessPoolExecutor within the RimeFactory without concerning ourselves as to whether it is pickleable.

I find that it's a very powerful paradigm for use with dask and it would be good to devote more time to explaining this in further detail. In conjunction with Multitons we can even ensure that only a single FITSFile object is created for a particular filename, within a single process.

@bennahugo
Copy link
Collaborator

My thought on this based on what we have seen in DDFacet pools are that you really want to spawn them as early in the process as possible and use them throughout. Child processes will inherit the memory pages which will be copy on write by the OS. This has lead to poor memory performance for us in the past so the best thing to do is to get a pool going outside of dask here perhaps? Another option would perhaps be IPC to support the non-shared memory distributed case (see https://docs.python.org/3/library/multiprocessing.html#multiprocessing-listeners-clients) with dask graph nodes requesting arrays from a server which can have its own pool?

@bennahugo
Copy link
Collaborator

(I think this is especially true if you are going to configure your system to enforce hugepages which typically substantially improves performance with the kind of large array sizes we work with typically)

@sjperkins
Copy link
Member

My thought on this based on what we have seen in DDFacet pools are that you really want to spawn them as early in the process as possible and use them throughout. Child processes will inherit the memory pages which will be copy on write by the OS.

I think the strategy here will be to use the spawn method, rather than fork, From the docs:

The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.

This has lead to poor memory performance for us in the past so the best thing to do is to get a pool going outside of dask here perhaps? Another option would perhaps be IPC to support the non-shared memory distributed case (see https://docs.python.org/3/library/multiprocessing.html#multiprocessing-listeners-clients) with dask graph nodes requesting arrays from a server which can have its own pool?

I think the approach you're describing is appropriate to situations where large visibility buffers need to be shared between processes. The data that must be shared here (unique times, feeds and antennas) are far smaller so the IPC costs should be negligble relative to the compute required to produce visibilities. The Pool will indeed live "outside" dask though.

@sjperkins
Copy link
Member

Going a bit further, the patterns I'm describing in #265 (comment) are very useful for spawning resource objects that "live outside dask" but are used by it.

@bennahugo
Copy link
Collaborator

Ah ok yes a spawned would go some ways to prevent the memory issues here.

@sjperkins sjperkins linked a pull request Jan 31, 2022 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants