This recipe creates a ParallelCluster system where you can try out Amazon EC2 Trn1 instances. The cluster it builds has the same architecture as the one in Train a model on AWS Trn1 ParallelCluster, so it can be used to complete the exercises in that repository.
The cluster design includes the following features:
- AWS Neuron SDK pre-installed.
- Elastic compute queue featuring Trn1 instances.
- High-speed, low-latency networking with Amazon Elastic Fabric Adapter (EFA).
- Performant shared scratch storage (based on Amazon FSx for Lustre) available at `/fsx`.
If you are not sure whether your account can use Trn1 instances, try to launch one in the Amazon EC2 console before using this recipe. If you are unable to, reach out to your account manager to ensure Trainium access is enabled for your account.
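As a complementary check, the AWS CLI can show which Availability Zones in a Region offer a Trn1 instance type. The sketch below assumes the AWS CLI is installed and configured with credentials; the function name `list_trn1_zones` and the default instance type are illustrative choices, not part of this recipe. It only reports where the instance type is offered, not whether your account is allowed to launch it.

```shell
# Sketch: list the Availability Zones in a Region that offer a Trn1 instance type.
# Assumes a configured AWS CLI; the function name is hypothetical.
list_trn1_zones() {
  local region="$1"                          # Region to query
  local instance_type="${2:-trn1.32xlarge}"  # instance type to look for
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=${instance_type}" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}
```

For example, `list_trn1_zones us-east-1` (the Region is a placeholder) prints the matching zone names, or nothing if the type is not offered there.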
- Create a basic HPC networking configuration in a Region and Availability Zone where Trn1 instances are available. You can do this manually or using the net/hpc_basic recipe.
- Ensure you have an Amazon EC2 SSH key created in the Region where you want to launch your Trn1 cluster.
- Launch the cluster template
- Follow the instructions in the AWS CloudFormation console. When you configure the queue sizes (i.e., `ComputeInstanceMax`), choose a value that is consistent with your service quota.
- Monitor the status of the AWS CloudFormation stack. When its status reaches `CREATE_COMPLETE`, navigate to its Outputs tab to find the information you need to access the cluster.
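If you prefer the command line to the console, the key pair check and the stack monitoring steps above could be sketched with the AWS CLI as follows. This assumes a configured AWS CLI; the function names are hypothetical, and the Region, key pair name, and stack name are whatever you chose when launching.

```shell
# Sketch: confirm an EC2 key pair exists in the target Region.
# Prints the key pair name on success; the AWS CLI exits non-zero if it is missing.
check_key_pair() {
  local region="$1" key_name="$2"
  aws ec2 describe-key-pairs --region "$region" --key-names "$key_name" \
    --query 'KeyPairs[0].KeyName' --output text
}

# Sketch: block until the stack reaches CREATE_COMPLETE, then print its outputs.
# The waiter returns non-zero if creation fails or it times out, in which case
# the outputs are not queried.
show_stack_outputs() {
  local region="$1" stack_name="$2"
  aws cloudformation wait stack-create-complete \
    --region "$region" --stack-name "$stack_name" &&
  aws cloudformation describe-stacks \
    --region "$region" --stack-name "$stack_name" \
    --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' --output table
}
```

A call such as `show_stack_outputs us-east-1 my-trn1-cluster` (both values are placeholders) prints the same key/value pairs shown on the Outputs tab.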
If you want to use SSH to access the cluster, you will need its public IP address (from the stack's Outputs tab). Using your local terminal, connect via SSH like so: `ssh -i KeyPair.pem ubuntu@HeadNodeIp`, where `KeyPair.pem` is the path to the EC2 key pair you specified when launching the cluster and `HeadNodeIp` is that IP address.
You can also use AWS Systems Manager to access the cluster. Either follow the link found in Outputs > SystemManagerUrl, or navigate to the Instances panel in the Amazon EC2 console and find the instance named HeadNode, which is your cluster's access node. Select that instance, then choose Actions followed by Connect. On the Connect to instance page, navigate to Session Manager, then choose Connect.
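The same Session Manager connection can be opened from a terminal. The sketch below assumes a configured AWS CLI with the Session Manager plugin installed, and that exactly one running instance in the Region is named HeadNode; the function names are hypothetical.

```shell
# Sketch: print the instance ID of the running instance tagged Name=HeadNode.
find_head_node() {
  local region="$1"
  aws ec2 describe-instances --region "$region" \
    --filters "Name=tag:Name,Values=HeadNode" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text
}

# Sketch: open a Session Manager shell on the head node.
# Assumes a single match; with several clusters in one Region, pick the ID by hand.
connect_head_node() {
  local region="$1" instance_id
  instance_id="$(find_head_node "$region")" || return 1
  [ -n "$instance_id" ] || { echo "No running HeadNode instance found" >&2; return 1; }
  aws ssm start-session --region "$region" --target "$instance_id"
}
```

Running `connect_head_node us-east-1` (the Region is a placeholder) should drop you into a shell on the head node without needing the SSH key or a public IP address.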
Once you are on the system, consult the repo Train a model on AWS Trn1 ParallelCluster to learn what to do next.
When you are done using your cluster, you can delete it and all its associated resources by navigating to the AWS CloudFormation console and deleting the relevant stack. Note that data on the `/fsx` volume will be deleted. If you want to keep it, find the relevant FSx for Lustre volume in the AWS console and back it up.
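The cleanup can also be sketched with the AWS CLI. The first function lists the FSx for Lustre file systems in a Region, with their deployment types, so you can identify the one behind `/fsx` before deciding how to preserve its data; the second deletes the stack and waits for the deletion to finish. Both assume a configured AWS CLI, and the function names are hypothetical. How you back up the data is left to you, since the options depend on the file system's deployment type.

```shell
# Sketch: list FSx for Lustre file systems in a Region with their deployment type.
list_lustre_file_systems() {
  local region="$1"
  aws fsx describe-file-systems --region "$region" \
    --query "FileSystems[?FileSystemType=='LUSTRE'].[FileSystemId,LustreConfiguration.DeploymentType]" \
    --output table
}

# Sketch: delete the cluster's CloudFormation stack and wait until it is gone.
# Run this only after you have saved anything you need from /fsx.
delete_cluster_stack() {
  local region="$1" stack_name="$2"
  aws cloudformation delete-stack \
    --region "$region" --stack-name "$stack_name" &&
  aws cloudformation wait stack-delete-complete \
    --region "$region" --stack-name "$stack_name"
}
```

For example, `delete_cluster_stack us-east-1 my-trn1-cluster`, where both values are placeholders for your own Region and stack name.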