Handle ssh failure gracefully (#724)
* handle failed attempt gracefully

johnwlambert authored Sep 28, 2023
1 parent 5193fb9 commit af7ce85
Showing 2 changed files with 23 additions and 16 deletions.
24 changes: 13 additions & 11 deletions CLUSTER.md
````diff
@@ -5,27 +5,29 @@
 GTSfM uses the [SSHCluster](https://docs.dask.org/en/stable/deploying-ssh.html#dask.distributed.SSHCluster) module of [Dask](https://distributed.dask.org/en/stable/) to provide cluster-utilization functionality for SfM execution. This readme is a step-by-step guide on how to set up your machines for a successful run on a cluster.
 
 1. Choose which machine will serve as the scheduler. The data only needs to be on the scheduler node.
-2. Enable passwordless SSH between all the workers on the cluster
-   - Log in into a machine
-   - For each of the other workers on the cluster run
+2. Create a config file listing the IP addresses of cluster machines (example in [gtsfm/configs/cluster.yaml](https://github.com/borglab/gtsfm/blob/master/gtsfm/configs/cluster.yaml)).
+   - Note that the first worker in the cluster.yaml file must be the scheduler machine where the data is hosted.
+3. Enable passwordless SSH between all the workers (machines) on the cluster.
+   - Log in individually to each machine listed in the cluster config file.
+   - For each of the other machines on the cluster, run:
     * ```bash
-      ssh-copy-id username@machine_ip_address_of_another_worker
+      ssh-copy-id {username}@{machine_ip_address_of_another_worker}
      ```
-    * Repeat the above two steps on all machines
+    * If you see `/usr/bin/ssh-copy-id: ERROR: No identities found`, then run `ssh-keygen -t rsa` first.
+    * Repeat the two steps above on all machines.
   - Note machines should be able to ssh into themselves passwordless e.g. host1 should be able to ssh into host1.
-3. Clone gtsfm and follow the main readme file to setup the environment on all nodes in the cluster at an identical path
+   - If the cluster has 5 machines, then `ssh-copy-id` must be run 5*5=25 times.
+4. Clone gtsfm and follow the main readme file to set up the environment on all nodes in the cluster at an identical path
    - ```bash
-     git clone https://github.com/borglab/gtsfm.git
+     git clone --recursive https://github.com/borglab/gtsfm.git
      conda env create -f environment_linux.yml
      conda activate gtsfm-v1
      ```
-4. Log into scheduler again and download the data to scheduler machine
-5. Create a config file listing the cluster workers (example in [gtsfm/configs/cluster.yaml](https://github.com/borglab/gtsfm/blob/master/gtsfm/configs/cluster.yaml))
-6. Run gtsfm with –cluster_config flag enabled, for example
+5. Log into scheduler again and download the data to the scheduler machine.
+6. Run gtsfm with the `--cluster_config` flag enabled, for example
    - ```
      python /home/username/gtsfm/gtsfm/runner run_scene_optimizer_colmaploader.py --images_dir /home/username/gtsfm/skydio-32/images/ --config_name sift_front_end.yaml --colmap_files_dirpath /home/hstepanyan3/gtsfm/skydio-32/colmap_crane_mast_32imgs/ --cluster_config cluster.yaml
      ```
-   - Note that the first worker in the cluster.yaml file must be the scheduler machine where the data is hosted.
    - Always provide absolute paths for all directories
 7. If you would like to check out the dask dashboard, you will need to do port forwarding from machine to your local computer:
````
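The passwordless-SSH step above is easy to under-count: every host must accept SSH from every other host *and* from itself, so the number of `ssh-copy-id` runs grows quadratically, as in the 5*5=25 example in the diff. A minimal sketch that enumerates the required (source, destination) checks (hostnames are placeholders, not part of GTSfM):

```python
from itertools import product

def ssh_check_pairs(hosts):
    """Return every (source, destination) pair that needs passwordless SSH.

    Each machine must reach every other machine and itself, so a cluster
    of n hosts requires n * n successful ssh-copy-id / login checks.
    """
    return list(product(hosts, hosts))

# Hypothetical 5-machine cluster, matching the 5*5=25 note above.
hosts = ["host1", "host2", "host3", "host4", "host5"]
pairs = ssh_check_pairs(hosts)
print(len(pairs))  # → 25
print(("host1", "host1") in pairs)  # → True (hosts must reach themselves)
```

Iterating over such a pair list (e.g. with `ssh -o BatchMode=yes` per pair) is a quick way to verify the cluster before launching a run.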
15 changes: 10 additions & 5 deletions gtsfm/runner/gtsfm_runner_base.py
````diff
@@ -59,13 +59,13 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             "--num_workers",
             type=int,
             default=1,
-            help="Number of workers to start (processes, by default)",
+            help="Number of workers to start (processes, by default).",
         )
         parser.add_argument(
             "--threads_per_worker",
             type=int,
             default=1,
-            help="Number of threads per each worker",
+            help="Number of threads per each worker.",
         )
         parser.add_argument(
             "--worker_memory_limit", type=str, default="8GB", help="Memory limit per worker, e.g. `8GB`"
@@ -106,7 +106,7 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             "--max_frame_lookahead",
             type=int,
             default=None,
-            help="maximum number of consecutive frames to consider for matching/co-visibility",
+            help="Maximum number of consecutive frames to consider for matching/co-visibility.",
         )
         parser.add_argument(
             "--num_matched",
@@ -115,7 +115,7 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             help="Number of K potential matches to provide per query. These are the top `K` matches per query.",
         )
         parser.add_argument(
-            "--share_intrinsics", action="store_true", help="Shares the intrinsics between all the cameras"
+            "--share_intrinsics", action="store_true", help="Shares the intrinsics between all the cameras."
         )
         parser.add_argument("--mvs_off", action="store_true", help="Turn off dense MVS reconstruction")
         parser.add_argument(
@@ -148,7 +148,7 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             "--num_retry_cluster_connection",
             type=int,
             default=3,
-            help="number of times to retry cluster connection if it fails",
+            help="Number of times to retry cluster connection if it fails.",
         )
         return parser
 
@@ -253,6 +253,11 @@ def setup_ssh_cluster_with_retries(self) -> SSHCluster:
             except Exception as e:
                 logger.info(f"Worker failed to start: {str(e)}")
                 retry_count += 1
+        if not connected:
+            raise ValueError(
+                f"Connection to cluster could not be established after {self.parsed_args.num_retry_cluster_connection}"
+                " attempts. Aborting..."
+            )
         return cluster
 
     def run(self) -> GtsfmData:
````
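The behavioral change in `setup_ssh_cluster_with_retries` is a retry-then-abort pattern: attempt the connection up to `num_retry_cluster_connection` times, and raise an error instead of returning an unset cluster handle when every attempt fails. A standalone sketch of the same pattern (`connect_fn` and the surrounding names are illustrative placeholders, not the GTSfM API):

```python
def connect_with_retries(connect_fn, num_retries: int = 3):
    """Call `connect_fn` up to `num_retries` times; abort if all attempts fail.

    On the first success, return the result. If every attempt raises,
    raise ValueError rather than silently returning an undefined handle,
    mirroring the graceful-failure fix in this commit.
    """
    retry_count = 0
    connected = False
    result = None
    while not connected and retry_count < num_retries:
        try:
            result = connect_fn()
            connected = True
        except Exception as e:  # a real cluster setup would log this
            print(f"Worker failed to start: {e}")
            retry_count += 1
    if not connected:
        raise ValueError(
            f"Connection to cluster could not be established after {num_retries} attempts. Aborting..."
        )
    return result

# Usage: a flaky connector that fails twice, then succeeds on attempt 3.
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient SSH failure")
    return "cluster-handle"

print(connect_with_retries(flaky_connect, num_retries=3))  # → cluster-handle
```

Raising here lets the caller fail fast with a clear message instead of crashing later on a `None`/unbound cluster object, which is exactly the failure mode the original code exhibited.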
