Handle ssh failure gracefully (#724)
* handle failed attempt gracefully

johnwlambert authored Sep 28, 2023
1 parent 5193fb9 commit af7ce85
Showing 2 changed files with 23 additions and 16 deletions.
24 changes: 13 additions & 11 deletions CLUSTER.md
````diff
@@ -5,27 +5,29 @@
 GTSfM uses the [SSHCluster](https://docs.dask.org/en/stable/deploying-ssh.html#dask.distributed.SSHCluster) module of [Dask](https://distributed.dask.org/en/stable/) to provide cluster-utilization functionality for SfM execution. This readme is a step-by-step guide on how to set up your machines for a successful run on a cluster.
 
 1. Choose which machine will serve as the scheduler. The data only needs to be on the scheduler node.
-2. Enable passwordless SSH between all the workers on the cluster
-   - Log in into a machine
-   - For each of the other workers on the cluster run
+2. Create a config file listing the IP addresses of cluster machines (example in [gtsfm/configs/cluster.yaml](https://github.com/borglab/gtsfm/blob/master/gtsfm/configs/cluster.yaml)).
+   - Note that the first worker in the cluster.yaml file must be the scheduler machine where the data is hosted.
+3. Enable passwordless SSH between all the workers (machines) on the cluster.
+   - Log in individually to each machine listed in the cluster config file.
+   - For each of the other machines on the cluster, run:
     * ```bash
-      ssh-copy-id username@machine_ip_address_of_another_worker
+      ssh-copy-id {username}@{machine_ip_address_of_another_worker}
      ```
-    * Repeat the above two steps on all machines
+    * If you see `/usr/bin/ssh-copy-id: ERROR: No identities found`, then run `ssh-keygen -t rsa` first.
+    * Repeat the two steps above on all machines.
   - Note machines should be able to ssh into themselves passwordless e.g. host1 should be able to ssh into host1.
-3. Clone gtsfm and follow the main readme file to setup the environment on all nodes in the cluster at an identical path
+   - If the cluster has 5 machines, then `ssh-copy-id` must be run 5*5=25 times.
+4. Clone gtsfm and follow the main readme file to set up the environment on all nodes in the cluster at an identical path
    - ```bash
-     git clone https://github.com/borglab/gtsfm.git
+     git clone --recursive https://github.com/borglab/gtsfm.git
      conda env create -f environment_linux.yml
      conda activate gtsfm-v1
      ```
-4. Log into scheduler again and download the data to scheduler machine
-5. Create a config file listing the cluster workers (example in [gtsfm/configs/cluster.yaml](https://github.com/borglab/gtsfm/blob/master/gtsfm/configs/cluster.yaml))
-6. Run gtsfm with –cluster_config flag enabled, for example
+5. Log into scheduler again and download the data to the scheduler machine.
+6. Run gtsfm with the `--cluster_config` flag enabled, for example
    - ```
      python /home/username/gtsfm/gtsfm/runner run_scene_optimizer_colmaploader.py --images_dir /home/username/gtsfm/skydio-32/images/ --config_name sift_front_end.yaml --colmap_files_dirpath /home/hstepanyan3/gtsfm/skydio-32/colmap_crane_mast_32imgs/ --cluster_config cluster.yaml
      ```
-   - Note that the first worker in the cluster.yaml file must be the scheduler machine where the data is hosted.
    - Always provide absolute paths for all directories
 7. If you would like to check out the dask dashboard, you will need to do port forwarding from machine to your local computer:
````
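The passwordless-SSH step above is easy to under-count: every host must accept SSH from every other host *and* from itself, so the number of `ssh-copy-id` runs grows quadratically, as in the 5*5=25 example in the diff. A minimal sketch that enumerates the required (source, destination) checks (hostnames are placeholders, not part of GTSfM):

```python
from itertools import product

def ssh_check_pairs(hosts):
    """Return every (source, destination) pair that needs passwordless SSH.

    Each machine must reach every other machine and itself, so a cluster
    of n hosts requires n * n successful ssh-copy-id / login checks.
    """
    return list(product(hosts, hosts))

# Hypothetical 5-machine cluster, matching the 5*5=25 note above.
hosts = ["host1", "host2", "host3", "host4", "host5"]
pairs = ssh_check_pairs(hosts)
print(len(pairs))  # → 25
print(("host1", "host1") in pairs)  # → True (hosts must reach themselves)
```

Iterating over such a pair list (e.g. with `ssh -o BatchMode=yes` per pair) is a quick way to verify the cluster before launching a run.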
15 changes: 10 additions & 5 deletions gtsfm/runner/gtsfm_runner_base.py
````diff
@@ -59,13 +59,13 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             "--num_workers",
             type=int,
             default=1,
-            help="Number of workers to start (processes, by default)",
+            help="Number of workers to start (processes, by default).",
         )
         parser.add_argument(
             "--threads_per_worker",
             type=int,
             default=1,
-            help="Number of threads per each worker",
+            help="Number of threads per each worker.",
         )
         parser.add_argument(
             "--worker_memory_limit", type=str, default="8GB", help="Memory limit per worker, e.g. `8GB`"
@@ -106,7 +106,7 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             "--max_frame_lookahead",
             type=int,
             default=None,
-            help="maximum number of consecutive frames to consider for matching/co-visibility",
+            help="Maximum number of consecutive frames to consider for matching/co-visibility.",
         )
         parser.add_argument(
             "--num_matched",
@@ -115,7 +115,7 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             help="Number of K potential matches to provide per query. These are the top `K` matches per query.",
         )
         parser.add_argument(
-            "--share_intrinsics", action="store_true", help="Shares the intrinsics between all the cameras"
+            "--share_intrinsics", action="store_true", help="Shares the intrinsics between all the cameras."
         )
         parser.add_argument("--mvs_off", action="store_true", help="Turn off dense MVS reconstruction")
         parser.add_argument(
@@ -148,7 +148,7 @@ def construct_argparser(self) -> argparse.ArgumentParser:
             "--num_retry_cluster_connection",
             type=int,
             default=3,
-            help="number of times to retry cluster connection if it fails",
+            help="Number of times to retry cluster connection if it fails.",
         )
         return parser
 
@@ -253,6 +253,11 @@ def setup_ssh_cluster_with_retries(self) -> SSHCluster:
             except Exception as e:
                 logger.info(f"Worker failed to start: {str(e)}")
                 retry_count += 1
+        if not connected:
+            raise ValueError(
+                f"Connection to cluster could not be established after {self.parsed_args.num_retry_cluster_connection}"
+                " attempts. Aborting..."
+            )
         return cluster
 
     def run(self) -> GtsfmData:
````
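The behavioral change in `setup_ssh_cluster_with_retries` is a retry-then-abort pattern: attempt the connection up to `num_retry_cluster_connection` times, and raise an error instead of returning an unset cluster handle when every attempt fails. A standalone sketch of the same pattern (`connect_fn` and the surrounding names are illustrative placeholders, not the GTSfM API):

```python
def connect_with_retries(connect_fn, num_retries: int = 3):
    """Call `connect_fn` up to `num_retries` times; abort if all attempts fail.

    On the first success, return the result. If every attempt raises,
    raise ValueError rather than silently returning an undefined handle,
    mirroring the graceful-failure fix in this commit.
    """
    retry_count = 0
    connected = False
    result = None
    while not connected and retry_count < num_retries:
        try:
            result = connect_fn()
            connected = True
        except Exception as e:  # a real cluster setup would log this
            print(f"Worker failed to start: {e}")
            retry_count += 1
    if not connected:
        raise ValueError(
            f"Connection to cluster could not be established after {num_retries} attempts. Aborting..."
        )
    return result

# Usage: a flaky connector that fails twice, then succeeds on attempt 3.
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient SSH failure")
    return "cluster-handle"

print(connect_with_retries(flaky_connect, num_retries=3))  # → cluster-handle
```

Raising here lets the caller fail fast with a clear message instead of crashing later on a `None`/unbound cluster object, which is exactly the failure mode the original code exhibited.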
