Trn1 Test Cluster

Info

This recipe creates an AWS ParallelCluster system where you can try out Amazon EC2 Trn1 instances. The cluster it builds has the same architecture as the one in Train a model on AWS Trn1 ParallelCluster, so it can be used to complete the exercises in that repository.

The cluster design includes the following features:

  • AWS Neuron SDK pre-installed.
  • Elastic compute queue featuring Trn1 instances.
  • High-speed, low-latency networking with Amazon Elastic Fabric Adapter (EFA).
  • Performant shared scratch storage (based on Amazon FSx for Lustre) available at /fsx.

Usage

Validate that you can use Trainium

If you are not sure whether your account can use Trn1 instances, try to launch one in the Amazon EC2 console before using this recipe. If you are unable to, reach out to your account manager to ensure Trainium access is enabled for your account.
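As a complement to a console test launch, the AWS CLI can show where Trn1 is offered and what your quotas look like. The sketch below assumes the AWS CLI v2 is installed and configured; the function names, the default instance type `trn1.32xlarge`, and the assumption that Trainium quota names contain "Trn" are illustrative and not part of this recipe. Neither check proves that a launch will succeed, so the console test above remains the definitive one.

```shell
# List the Availability Zones in a Region that offer a given instance type.
# Usage: trn1_zones <region> [instance-type]
trn1_zones() {
  local region="$1"
  local instance_type="${2:-trn1.32xlarge}"  # assumed default size
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=${instance_type}" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# Show EC2 service quotas whose name mentions "Trn" (assumed naming).
# Usage: trn_quotas <region>
trn_quotas() {
  local region="$1"
  aws service-quotas list-service-quotas \
    --region "$region" \
    --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'Trn')].[QuotaName,Value]" \
    --output text
}

# Example (the Region is a placeholder; substitute your own):
#   trn1_zones us-east-1
#   trn_quotas us-east-1
```

An empty result from `trn1_zones` suggests the instance type is not offered in that Region; a quota value of 0 suggests you need to request an increase or contact your account manager.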

Launch the Cluster

  1. Create a basic HPC networking configuration in a Region and Availability Zone where Trn1 instances are available. You can do this manually or using the net/hpc_basic recipe.
  2. Ensure you have an Amazon EC2 SSH key created in the Region where you want to launch your Trn1 cluster.
  3. Launch the cluster template.
    • Follow the instructions in the AWS CloudFormation console. When you configure the queue sizes (i.e. ComputeInstanceMax), choose a value that is consistent with your service quota.
  4. Monitor the status of the AWS CloudFormation stack. When its status reaches CREATE_COMPLETE, navigate to its Outputs tab to find information you need to access the cluster.
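Steps 2 and 4 above can also be done from a terminal. This is a minimal sketch that assumes the AWS CLI v2 is configured for your account; the function names and the stack name `trn1-cluster` are hypothetical, so substitute the name you chose when launching the template.

```shell
# Confirm that an EC2 key pair exists in the target Region.
# Prints the key name on success; the CLI returns an error if it is missing.
# Usage: key_pair_exists <key-name> <region>
key_pair_exists() {
  local key_name="$1"
  local region="$2"
  aws ec2 describe-key-pairs \
    --key-names "$key_name" --region "$region" \
    --query 'KeyPairs[0].KeyName' --output text
}

# Block until the stack reaches CREATE_COMPLETE, then print its outputs.
# Usage: cluster_outputs <stack-name> <region>
cluster_outputs() {
  local stack_name="$1"
  local region="$2"
  # The waiter exits non-zero if creation fails or rolls back.
  aws cloudformation wait stack-create-complete \
    --stack-name "$stack_name" --region "$region" || return 1
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" --region "$region" \
    --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' \
    --output table
}

# Example (names are placeholders):
#   key_pair_exists my-key us-east-1
#   cluster_outputs trn1-cluster us-east-1
```

The table printed by `cluster_outputs` contains the same values shown on the Outputs tab in the CloudFormation console.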

Access the Cluster

If you want to use SSH to access the cluster, you will need the head node's public IP address, which is listed in the stack's Outputs tab. From your local terminal, connect via SSH like so: ssh -i KeyPair.pem ubuntu@HeadNodeIp where KeyPair.pem is the path to the private key file for the EC2 key pair you specified when launching the cluster and HeadNodeIp is that IP address.
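The SSH step can be wrapped in a small helper. This sketch is only a convenience: the function name is hypothetical, and the `chmod 400` call is included because OpenSSH refuses private key files that are readable by other users.

```shell
# Restrict the key file's permissions, then open an SSH session
# to the head node as the ubuntu user.
# Usage: connect_head_node <path-to-key> <head-node-ip>
connect_head_node() {
  local key_path="$1"
  local head_node_ip="$2"
  chmod 400 "$key_path"
  ssh -i "$key_path" "ubuntu@${head_node_ip}"
}

# Example (both values are placeholders):
#   connect_head_node ~/.ssh/KeyPair.pem 203.0.113.10
```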

You can also use AWS Systems Manager to access the cluster. You can follow the link found in Outputs > SystemManagerUrl. Or, you can navigate to the Instances panel in the Amazon EC2 Console. Find the instance named HeadNode - this is your cluster's access node. Select that instance, then choose Actions followed by Connect. On the Connect to instance page, navigate to Session Manager then choose Connect.
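Session Manager can also be started from a terminal. The sketch below assumes the AWS CLI v2 and the Session Manager plugin are installed locally, and that exactly one running instance in the Region carries the Name tag HeadNode; if you run several clusters, pick the instance ID from the EC2 console instead. The function name is hypothetical.

```shell
# Look up the running instance named HeadNode and open a
# Session Manager session on it.
# Usage: head_node_session <region>
head_node_session() {
  local region="$1"
  local instance_id
  instance_id="$(aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=tag:Name,Values=HeadNode" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[0].Instances[0].InstanceId' \
    --output text)"
  # With --output text, the CLI prints "None" when nothing matches.
  if [ -z "$instance_id" ] || [ "$instance_id" = "None" ]; then
    echo "No running instance named HeadNode found in $region" >&2
    return 1
  fi
  aws ssm start-session --region "$region" --target "$instance_id"
}

# Example (the Region is a placeholder):
#   head_node_session us-east-1
```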

Once you are on the system, consult the repo Train a model on AWS Trn1 ParallelCluster to learn what to do next.

Cleaning Up

When you are done using your cluster, you can delete it and all its associated resources by navigating to the AWS CloudFormation console and deleting the relevant stack. Note that data on the /fsx volume will be deleted along with the stack. If you want to keep it, find the relevant FSx for Lustre file system in the AWS console and back it up before you delete the stack.
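The stack can also be deleted from a terminal. This is a minimal sketch assuming the AWS CLI v2 is configured; the function name and the stack name `trn1-cluster` are hypothetical. It does not back up any data, so save anything you need from /fsx first.

```shell
# Delete the cluster's CloudFormation stack and wait for deletion to finish.
# Usage: delete_cluster <stack-name> <region>
delete_cluster() {
  local stack_name="$1"
  local region="$2"
  aws cloudformation delete-stack \
    --stack-name "$stack_name" --region "$region" || return 1
  # The waiter exits non-zero if deletion fails.
  aws cloudformation wait stack-delete-complete \
    --stack-name "$stack_name" --region "$region"
}

# Example (names are placeholders):
#   delete_cluster trn1-cluster us-east-1
```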