The goal of the solution is to demonstrate a deployment of Amazon SageMaker Studio into a secure controlled environment with multi-layer security and implementation of secure MLOps CI/CD pipelines.
This solution covers the main four topics:
- Secure deployment of Amazon SageMaker Studio into a new or an existing secure environment (VPC, private subnets, VPC endpoints, security groups). We implement end-to-end data encryption and fine-grained access control
- Self-service data science environment provisioning based on AWS Service Catalog and AWS CloudFormation
- Self-service provisioning of MLOps project templates in Studio
- MLOps CI/CD automation using SageMaker projects and SageMaker Pipelines for model training and multi-account deployment
The goals of implementing MLOps for your AI/ML projects and environment are:
- Operationalization of AI/ML workloads and workflows
- Create secured, automated, and reproducible ML workflows
- Manage models with a model registry and data lineage
- Enable continious delivery with IaC and CI/CD pipelines
- Monitor performance and feedback information to your models
One of the key components of MLOps pipeline in SageMaker is the model registry.
The model registry provides the following features:
- Centralized model storage and tracking service that stores lineage, versioning, and related metadata for ML models
- Stores governance and audit data (e.g. who trained and published the model, which datasets were used)
- Stores models metrics and when the model was deployed to production
- Manages model version life cycle
- Manages the approval status of a model
This section describes the overall solution architecture.
1 – AWS Service Catalog
The end-to-end deployment of the data science environment is delivered as an AWS Service Catalog self-provisioned product. One of the main advantages of using AWS Service Catalog for self- provisioning is that authorized users can configure and deploy available products and AWS resources on their own without needing full privileges or access to AWS services. The deployment of all AWS Service Catalog products happens under a specified service role with the defined set of permissions, which are unrelated to the user’s permissions.
2 – Amazon SageMaker Studio Domain
The Data Science Environment product in the AWS Service Catalog creates an Amazon SageMaker Studio domain, which consists of a list of authorized users, configuration settings, and an Amazon Elastic File System (Amazon EFS) volume, which contains data for the users, including notebooks, resources, and artifacts.
3,4 – SageMaker MLOps project templates
The solution delivers the customized versions of SageMaker MLOps project templates. Each MLOps template provides an automated model building and deployment pipeline using continuous integration and continuous delivery (CI/CD). The delivered templates are configured for the secure multi-account model deployment and are fully integrated in the provisioned data science environment. The project templates are provisioned in the Studio via AWS Service Catalog.
5,6 – CI/CD workflows
The MLOps projects implement CI/CD using SageMaker Pipelines and AWS CodePipeline, AWS CodeCommit, and AWS CodeBuild. SageMaker Pipelines are responsible for orchestrating workflows across each step of the ML process and task automation including data loading, data transformation, training, tuning and validation, and deployment. Each model is tracked via the SageMaker model registry, which stores the model metadata, such as training and validation metrics and data lineage, manages model versions and the approval status of the model.
This solution supports secure multi-account model deployment using AWS Organizations or via simple target account lists.
7 – Secure infrastructure
Studio domain is deployed in a dedicated VPC. Each elastic network interface (ENI) used by SageMaker domain or workload is created within a private dedicated subnet and attached to the specified security groups. The data science environment VPC can be configured with internet access via an optional NAT gateway. You can also run this VPC in internet-free mode without any inbound or outbound internet access.
All access to the AWS public services is routed via AWS PrivateLink. Traffic between your VPC and the AWS services does not leave the Amazon network and is not exposed to the public internet.
8 – Data security
All data in the data science environment, which is stored in Amazon S3 buckets, Amazon EBS and EFS volumes, is encrypted at rest using customer-managed AWK Key Management Service (KMS) keys. All data transfer between platform components, API calls, and inter-container communication is protected using Transport Layer Security (TLS 1.2) protocol.
Data access from the SageMaker Studio notebooks or any SageMaker workload to the environment Amazon S3 buckets is governed by the combination of the Amazon S3 bucket and user policies and S3 VPC endpoint policy.
The following diagram shows the proposed team and AWS account structure for the solution:
The data science environment has a three-level organisational structure: Enterprise (Organizational Unit), Team (Environment), and Project.
- Enterprise level: The highest level in hierarchy, represented by the
DS Administrator
role and Data Science portfolio in AWS Service Catalog. A data science environment per team is provisioned via the AWS Service Catalog self-service into a dedicated data science AWS Account. - Team/Environment level: There is one dedicated Data Science Team AWS account and one Studio domain per region per AWS account.
DS Team Administrator
can create user profiles in Studio for different user roles with different permissions, and also provision a CI/CD MLOps pipelines per project. The DS Administrator role is responsible for approving ML models and deployment into staging and production accounts. Based on role permission setup you can implement fine-granular separation of rights and duties per project. You can find more details on permission control with IAM policies and resource tagging in the blog post Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation on AWS Machine Learning Blog - Project level: This is the individual project level and represented by CI/CD pipelines which are provisioned via SageMaker projects in Studio
The recently published AWS whitepaper Build a Secure Enterprise Machine Learning Platform on AWS gives detailed overview and outlines the best practices for more generic use case of building multi-account enterprise-level data science environments.
This is a proposed environment structure and can be adapted for your specific requirements, organizational and governance structure, and project methodology.
The best practice for implementing production and real-life data science environment is to use multiple AWS accounts. The multi-account approaches has the following benefits:
- Supports multiple unrelated teams
- Ensures fine-grained security and compliance controls
- Minimizes the blast radius
- Provides workload and data isolation
- Facilitates the billing and improves cost visibility and control
- Separates between development, test, and production
You can find a recommended multi-account best practices in the whitepaper Build a Secure Enterprise Machine Learning Platform on AWS.
The solution implements the following multi-account approach:
- A dedicated AWS account (development account) per Data Science team and one Amazon SageMaker Studio domain per region per account
- Dedicated AWS accounts for staging and production of AI/ML models
- Optional usage of AWS Organizations to enable trust and security control policies for member AWS accounts
Without loss of generality, this solution uses three account groups:
- Development (data science) account: this account is used by data scientists and ML engineers to perform experimentation and development. Data science tools such as Studio is used in the development account. Amazon S3 buckets with data and models, code repositories and CI/CD pipelines are hosted in this account. Models are built, trained, validated, and registered in the model repository in this account.
- Testing/staging/UAT accounts: Validated and approved models are first deployed to the staging account, where the automated unit and integration tests are run. Data scientists and ML engineers do have read-only access to this account.
- Production accounts: Tested models from the staging accounts are finally deployed to the production account for both online and batch inference.
❗ For real-life production setup we recommend to use additional two account groups:
- Shared services accounts: This account hosts common resources like team code repositories, CI/CD pipelines for MLOps workflows, Docker image repositories, service catalog portfolios, model registries, and library package repositories.
- Data management accounts: A dedicated AWS account to store and manage all data for the machine learning process. It is recommended to implement strong data security and governance practices using AWS Data Lake and AWS Lake Formation.
Each of these account groups can have multiple AWS accounts and environments for development and testing of services and storing different type of data.
Initial baselines for the IAM roles are taken from Secure data science reference architecture open source project on GitHub. You can adjust role permission policies based on your specific security guidelines and requirements.
This solution uses the concept of following roles in Model Development Life Cycle (MDLC):
-
Data engineer/ML Engineer: Responsible for:
- data sourcing
- data quality assurance
- data pre-processing
- Feature engineering
- develop data delivery pipelines (from the data source to destination S3 bucket where it is consumed by ML model/Data scientist)
Permission baseline:
- Data processing services (DMS, AWS Glue ETL, Athena, Kinesis, EMR, DataBrew, SageMaker, SageMaker Data Wrangler)
- Data locations (S3, Data Lakes, Lake Formation)
- Databases (RDS, Aurora, Redshift)
- BI services (Grafana, Quicksight)
- SageMaker notebooks
❗ This role is out of scope of this solution
-
Data scientist:
“Project user” role for a DataScientist. This solution implements a wide permission baseline. You might trim the permissions down for your real-life projects to reflect your security environment and requirements. The provisioned role uses the AWS managed permission policy for the job functionDataScientist
:- Model training and evaluation in SageMaker, Studio, Notebooks, SageMaker JumpStart
- Feature engineering
- Starting ML pipelines
- Deploy models to the model registry
- GitLab permissions
Permission baseline:
- AWS managed policy
DataScientist
- Data locations (S3): only a defined set of S3 buckets (e.g. Data and Model)
- SageMaker, Studio, Notebooks
Role definition: CloudFormation
-
Data science team administrator:
This is the admin role for a team or data science environment. In real-life setup you might want to add one more level of hiearchy and add an administrator role per project (or group of projects) – it creates a separation of duties between proejcts and minimize the blast radius. In this solution we are going to have one administrator role per team or "environment", effectively meaning that Project Adminstrator = Team Administrator. The main responsibilities of the role are:- ML infrastructure management and operations (endpoints, monitoring, scaling, reliability/high availability, BCM/DR)
- Model hosting
- Production integration
- Model approval for deployment in staging and production
- Project template (MLOps) provisioning
- Provision products via AWS Service Catalog
- GitLab repository access
Permission baseline:
- PROD account
- ML infrastructure
- Data locations (S3)
Role definition: CloudFormation
-
Data science administrator:
This is overarching admin role for the data science projects and setting up the secure SageMaker environment. It uses AWS managed policies only. This role has the following responsibilities:- Provisioning of the shared data science infructure
- Model approval for production
- Deploys data science environments via the AWS Service Catalog
Permission baseline:
- Approvals (MLOps, Model registry, CodePipeline)
- Permissions to deploy products from AWS Service Catalog
Role definition: CloudFormation
A dedicated IAM role is created for each persona/user. For a real-life project we recommend to start with a least possible permission set and add necessary permissions based on your security requirements and data science environment settings.
The following diagram shows the provisioned IAM roles for personas/users and execution roles for services:
More information and examples about the best practices and security setup for multi-project and multi-team environments you can find in the blog post Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation on AWS Machine Learning Blog.
For this solution we use three AWS Organizations organizational units (OUs) to simulate development, staging, and production environments. The following section describes how the IAM roles should be mapped to the accounts in the OUs.
DEV account:
- All three IAM user roles must be created in the development (data science) account:
DataScienceAdministratorRole
,DataScienceProjectAdministratorRole
,DataScientistRole
- DataScienceAdministratorRole: overall management of Data Science projects via AWS Service Catalog. Management of the AWS Service Catalog. Deployment of a Data Science environment (VPC, subnets, S3 bucket) to an AWS account
- DataScientistRole: Data Scientist role within a specific project. Provisioned on per-project and per-stage (dev/test/prod) basis
- DataScienceTeamAdministratorRole: Administrator role within a specific team or environment
The following IAM execution roles will be provisioned in the development account:
SageMakerDetectiveControlExecutionRole
: for Lambda function to implement responsive/detective security controlsSCLaunchRole
: for AWS Service Catalog to deploy a new Studio domainSageMakerExecutionRole
: execution role for the SageMaker workloads and StudioSageMakerPipelineExecutionRole
: execution role for SageMaker PipelinesSageMakerModelExecutionRole
: execution role for SageMaker model endpoint, will be created in each of dev, stating and production accountsSCProjectLaunchRole
: for AWS Service Catalog to deploy project-specific products (such as SageMaker Notebooks)AmazonSageMakerServiceCatalogProductsUseRole
: for SageMaker CI/CD execution (CodeBuild and CodePipeline)AmazonSageMakerServiceCatalogProductsLaunchRole
: for SageMaker MLOps project templates deployments via AWS Service CatalogVPCFlowLogsRole
: optional role for VPC Flow Logs to write logs into a CloudWatch log group
The following roles should be created in each of the accounts in the staging and production environments:
ModelExecutionRole
: SageMaker uses this role to run inference endpoints and access the model artifacts during the provisioning of the endpoint. This role must have a trust policy forsagemaker.amazonaws.com
serviceStackSetExecutionRole
: used for CloudFormation stack set operations in the staging and production accounts. This role is assumed byStackSetAdministrationRole
in the development accountSetupStackSetExecutionRole
: this role is needed for stac set opertions for the initial provisioning of the data science environment in multi-account deployment use case
To use multi-account ML model deployment with this solution, you have two options how to provide staging and production accounts ids:
- use AWS Organizations organizational units (OUs)
- use AWS account lists
This solution can use a multi-account AWS environment setup in AWS Organizations. Organizational units (OUs) should be based on function or common set of controls rather than mirroring company’s reporting structure.
If you use an AWS Organization setup option, you must provide two organizational unit ids (OU ids) for the staging and production unit and setup the data science account as the delegated administrator for AWS Organizations.
Your AWS Organizations structure can look like the following:
Root
`--- OU SageMaker PoC
|--- Data Science account (development)
`----OU Staging
|--- Staging acccounts
`----OU Production
|--- Production accounts
Please refer to the Deployment section for the details how to setup the delegated administrator.
Use of AWS Organizations is not needed for a multi-account MLOps setup. The same permission logic and account structure can be implemented with IAM cross-account permissions without need for AWS organizations. This solution also works without AWS Organizations setup. You can provide lists with staging and production AWS account ids during the provisioning of the data science environment.
Please refer to the Deployment section for the description of corresponding CloudFormation parameters.
The solution also implements full MLOps functionality with single-account setup. All examples, workflows and pipelines work in the single (development) data science account. We do not recommend to use a single-account setup for any production use of SageMaker and Studio, but for testing and experimentation purposes it is a fast and cost-effective option to choose.
The following deployment architecture is implemented by this solution:
The main design principles and decisions are:
- SageMaker Studio domain is deployed in a dedicated VPC. Each elastic network interface (ENI) used by SageMaker domain is created within a private dedicated subnet and attached to the specified security groups
Data Science Team VPC
can be configured with internet access by attaching a NAT gateway. You can also run this VPC in internet-free mode without any inbound or outbound internet access- All access to S3 is routed via S3 VPC endpoints
- All access to SageMaker API and runtime as welle as to all used AWS public services is routed via VPC endpoints
- A PyPI repository mirror running on Fargate is hosted within your network to provide a python package repository in internet-free mode (not implemented in this version)
- AWS Service Catalog is used to deploy a data science environment and SageMaker project templates
- All user roles are deployed into Data Science account IAM
- Provisioning of all IAM roles is completely separated from the deployment of the data science environment. You can use your own processes to provision the needed IAM roles.
All self-provisoned products in this solution such as Data Science environment, Studio user profile, SageMaker Notebooks, and MLOps project templates are delivered and deployed via AWS Service Catalog.
One of the main advantages of using AWS Service Catalog for self-service provisioning is that users can configure and deploy configured products and AWS resources without needing full privileges to AWS services. The deployment of all AWS Service Catalog products happens under a specified service role with the defined set of permissions.
The Service Catalog approach offers the following features:
- Product definition via CloudFormation templates following best-practices and compliance requirements
- Control access to underlying AWS services
- Self-service product provisioning for the entitled end users
- Implement technology recommendations and patterns
- Provide security and information privacy guardrails
Governed and secure environments are delivered via AWS Service Data catalog:
- Research and train ML models
- Share reproducible results
- Develop data processing automation
- Develop ML model training automation
- Define ML Deployment resources
- Test ML models before deployment
The following sections describe products which are delivered as part of this solution.
This product provisions end-to-end data science environment (Studio) for a specific Data Science team (Studio Domain) and stage (dev/test/prod). It deploys the following team- and stage-specific resources:
- VPC:
- Dedicated
Data Science Team VPC
- Private subnets in each of the selected Availability Zones (AZ), up to four AZs are supported
- If NAT gateway option is selected: NAT gateways and public subnets in each of the selected AZs
- Dedicated
- VPC endpoints:
- S3 VPC endpoint (
gateway
type) and endpoint policy to access the S3 data and model buckets. Only data and model buckets can be accessed - VPC endpoints (
interface
type) to access AWS public services (CloudWatch, SSM, SageMaker, CodeCommit) - VPC endpoint to access the
Shared services VPC
- S3 VPC endpoint (
- Security Groups:
- SageMaker security group for SageMaker resources. No ingress allowed
- VPC endpoint security group for all VPC endpoints. HTTPS 443 ingress only
- VPC endpoint security group to access the
Shared services VPC
. HTTP 80 ingress only
- IAM roles:
- Data Scientist IAM role
- Data Science team administrator IAM role
- AWS KMS:
- KMS key for data encryption in S3 buckets
- KMS key for data encryption on EBS volumes attached to SageMaker instances (for training, processing, batch jobs)
- Amazon S3 buckets:
- Amazon S3 data bucket with a pre-defined bucket policy. The S3 bucket is encrypted with AWS KMS key.
- Amazon S3 model bucket with a pre-defined bucket policy. The S3 bucket is encrypted with AWS KMS key
This solution deploys two SageMaker projects as Service Catalog products:
- Multi-account model deployment
- Model building, training, and validating with SageMaker Pipelines
These products are available as SageMaker projects in Studio and only deployable from the Studio.
Each provisioning of a Data Science environment product creates a Studio domain with a default user profile. You can optionally manually (from AWS CLI or SageMaker console) create new user profiles:
- Each user profile has its own dedicated compute resource with a slice of the shared EFS file system
- Each user profile can be associated with its own execution role (or use default domain execution role)
❗ There is a limit of one SageMaker domain per region per account and you can provision only one Data Science environment product per region per account.
This product is available for Data Scientist and Data Science Team Administrator roles. Each notebook is provisioned with pre-defined lifecycle configuration. The following considerations are applied to the notebook product:
- Only some instance types are allowed to use in the notebook
- Pre-defined notebook execution role is attached to the notebook
- Notebook execution role enforce use of security configurations and controls (e.g. the notebook can be started only in VPC attachment mode)
- Notebook has write access only to the projejct-specific S3 buckets (data and model as deployed by the Data Science Environment product)
- Notebook-attached EBS volume is encrypted with its own AWS KMS key
- Notebook is started in the SageMaker VPC, subnet, and security group
Functional MLOps architecture based on SageMaker MLOps Project templates as described in Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines and Multi-account model deployment with Amazon SageMaker Pipelines blog posts on the AWS Machine Learning Blog.
The following diagram shows the MLOps architecture which is implemented by this solution and delivered via MLOps SageMaker project templates:
The main design principles are:
- MLOps project templates are deployed via Studio
- Dedicated IAM user and execution roles used to perform assigned actions/tasks in the environment
- All project artifacts are connected via SageMaker ProjectId ensuring a strong data governance and lineage
- Multi-account deployment approach is used for secure deployment of your SageMaker models
See more details in the MLOps projects section.
This section describes security controls and best practices implemented by the solution.
- All network traffic is transferred over private and secure network links
- All ingress internet access is blocked for the private subnets and only allowed for NAT gateway route
- Optionally you can block all internet egress creating a completely internet-free secure environment
- SageMaker endpoints with a trained, validated, and approved model are hosted in dedicated staging and production accounts in your private VPC
- All access is managed by IAM and can be compliant with your corporate authentication standards
- All user interfaces can be integrated with your Active Directory or SSO system
- Access to any resource is disabled by default (implicit deny) and must be explicitly authorized in permission or resource policies
- You can limit access to data, code and training resources by role and job function
- All data is encrypted in-transit and at-rest using customer-managed AWS KMS keys
- You can block access to public libraries and frameworks
- Code and model artifacts are securely persisted in CodeCommit repositories
- The solution can provide end-to-end auditability with AWS CloudTrail, AWS Config, and Amazon CloudWatch
- Network traffic can be captured at individual network interface level
We use an IAM role policy which enforce usage of specific security controls. For example, all SageMaker workloads must be created in the VPC with specified security groups and subnets:
{
"Condition": {
"Null": {
"sagemaker:VpcSubnets": "true"
}
},
"Action": [
"sagemaker:CreateNotebookInstance",
"sagemaker:CreateHyperParameterTuningJob",
"sagemaker:CreateProcessingJob",
"sagemaker:CreateTrainingJob",
"sagemaker:CreateModel"
],
"Resource": [
"arn:aws:sagemaker:*:<ACCOUNT_ID>:*"
],
"Effect": "Deny"
}
List of IAM policy conditions for Amazon SageMaker
We use an Amazon S3 bucket policy explicitly denies all access which is not originated from the designated S3 VPC endpoints:
{
"Version": "2008-10-17",
"Statement": [
{
"Effect": "Deny",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<s3-bucket-name>/*",
"arn:aws:s3:::<s3-bucket-name>"
],
"Condition": {
"StringNotEquals": {
"aws:sourceVpce": ["<s3-vpc-endpoint-id1>", "<s3-vpc-endpoint-id2>"]
}
}
}
]
}
S3 VPC endpoint policy allows access only to the specified S3 project buckets with data, models and CI/CD pipeline artifacts, SageMaker-owned S3 bucket and S3 objects which are used for product provisioning.
Not implemented in this version
Not implemented in this version
This solution delivers two MLOps projects as SageMaker project templates:
- Model build, train, and validate pipeline
- Multi-account model deploy pipeline
These projects are fully functional examples which are integrated with exising multi-layer security controls such as VPC, subnets, security groups, AWS account boundaries, and the dedicated IAM execution roles.
The solution is based on the SageMaker project template for model building, training, and deployment. You can find in-depth review of this MLOps project in Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines on the AWS Machine Learning Blog.
The following diagram shows the functional components of the MLOps project.
This project provisions the following resources as part of a MLOps pipeline:
- The MLOps template is made available through SageMaker projects and is provided via an AWS Service Catalog portfolio
- CodePipeline pipeline with two stages -
Source
to get the source code andBuild
to build and execute the SageMaker pipeline - SageMaker pipeline implements a repeatable workflow which processes the data, trains, validates, and register the model
- Seed code repository in CodeCommit:
- This repository provides seed code to create a multi-step model building pipeline including the following steps: data processing, model training, model evaluation, and conditional model registration based on model accuracy. As you can see in the
pipeline.py
file, this pipeline trains a linear regression model using the XGBoost algorithm on the well-known UCI Abalone dataset. This repository also includes a build specification file, used by CodePipeline and CodeBuild to run the pipeline automatically
You can find a step-by-step instruction, implementation details, and usage patterns of the model building pipeline project in the provided notebook sagemaker-pipeline.ipynb
and sagemaker-pipelines-project.ipynb
files, delivered as part of the seed code.
To deploy the notebooks into your local environment, you must clone the CodeCommit repository with the seed code after you have deployed the SageMaker project into the Studio. Go to the project overview page, select the Repositories
tab and click the clone repo...
link:
After the clone operation finished, you can browse the repository files in Studio File view:
You can open the notebook and start experimenting with SageMaker Pipelines.
The following diagram shows the functional components of the MLOps project.
This MLOps project consists of the following parts:
- The MLOps project template deployable through SageMaker project in Studio
- CodeCommit repository with seed code
- Model deployment multi-stage CodePipeline pipeline
- Staging AWS account (can be the same account as the data science account)
- Production AWS account (can be the same account as the data science account)
- SageMaker endpoints with an approved model hosted in your private VPC
The following diagram shows how the trained and approved model is deployed into the taget accounts.
After model training and validation, the model is registered in the model registry. Model registry stores the model metadata, and all model artifacts are stored in an S3 bucket (step 1 in the preceeding diagram). The CI/CD pipeline uses CloudFormation stack sets (2) to deploy the model in the target accounts. The CloudFormation service assume the role StackSetExecutionRole
(3) in the target account to perform the deployment. SageMaker also assumes the ModelExecutionRole
(4) to access the model metadata and download the model artifacts from the S3 bucket. The StackSetExecutionRole
must have iam:PassRole
permission (5) for ModelExecutionRole
to be able to pass the role successfully. Finally, the model is deployed to a SageMaker endpoint (6).
To access the model artifacts and a KMS encryption key an additional cross-account permission setup is needed in case of the multi-account deployment:
All access to the model artifacts happens via the S3 VPC endpoint (1). This VPC endpoint allows access to the model and data S3 buckets. The model S3 bucket policy (2) grant access to the ModelExecutionRole principals (5) in each of the target accounts.
"Sid": "AllowCrossAccount",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::<staging-account>:role/SageMakerModelExecutionRole",
"arn:aws:iam::<prod-account>:role/SageMakerModelExecutionRole",
"arn:aws:iam::<dev-account>:root"
]
}
We apply the same setup for the data encryption key (3), whose policy (4) grant access to the principals in the target accounts. SageMaker model-hosting endpoints are placed in a VPC (6) in each of the target accounts. Any access to S3 buckets and KMS keys happens via the corresponding VPC endpoints. The IDs of these VPC endpoints are added to the Condition statement of the S3 bucket and KMS keys resource policies.
"Sid": "DenyNoVPC",
"Effect": "Deny",
"Principal": "*",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetObjectAcl",
"s3:PutBucketAcl",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::sm-mlops-dev-us-east-1-models/*",
"arn:aws:s3:::sm-mlops-dev-us-east-1-models"
],
"Condition": {
"StringNotEquals": {
"aws:sourceVpce": [
"vpce-0b82e29a828790da2",
"vpce-07ef65869ca950e14",
"vpce-03d9ed0a1ba396ff5"
]
}
}
Multi-account model deployment can use the AWS Organizations setup to deploy model to the staging and production organizational units (OUs) or provided staging and production account lists. For a proper functioning of the multi-account deployment process, you must configure the cross-account access and specific execution roles in the target accounts.
Execution roles SageMakerModelExecutionRole
and StackSetExecutionRole
must be deployed in all target accounts. Target accounts are accounts where models are deployed. These accoutns are member of the staging and production OUs or provided in the staging and production account lists at data science environment provisioning time.
These execution roles are deployed to the target accounts automatically during the provisioning of the data science enviroment if the parameter CreateEnvironmentIAMRoles
is set to YES
. If this parameter is set to NO
, you are responsible for provisioning of the execution roles in all target accounts. Refer to `predeploy-iam-setup.md' for step-by-step instructions.
The model execution role SageMakerModelExecutionRole
in the staging and production accounts is assumed by AmazonSageMakerServiceCatalogProductsUseRole
in the data science environment account to test the endpoints in the target accounts.
Alternatively you can choose to use single-account deployment. In this case the ML model will be deployed in the data science account (Studio account). You do not need to setup target account execution roles and provide OU IDs or account lists as deployment parameters.
❗ If you use single-account deployment, the MultiAccountDeployment
variable for MLOps Model Deploy project must be set to NO
:
The following pre-requisites are common for both single- and multi-account deployment. These pre-requisites are automatically provisioned if you use provided CLoudFormation templates.
- SageMaker must be configured with at least two subnets in two AZs, otherwise the SageMaker endpoint deployment will fail as it requires at least two AZs to deploy an endpoint
- CI/CD pipeline with model deployment uses AWS CloudFormation StackSets. It requires two IAM service roles created or provided (in case of the BYO IAM role option):
StackSetAdministrationRole
: This role must exist in the data science account and used to perform administration stack set operations in the data science account. TheAmazonSageMakerServiceCatalogProductsUseRole
must haveiam:PassRole
permission for this roleStackSetExecutionRole
: This role must exist in the data science account and each of the target accounts in staging and production environments. This role is assumed byStackSetAdministrationRole
to perform stack set operations in the target accounts. This role must haveiam:PassRole
permission for the model execution roleSageMakerModelExecutionRole
You can find step-by-step instructions, implementation details, and usage patterns of multi-account model deployment in the provided notebook.
Sign in to the console with the data scientist account. On the SageMaker console, open Studio with your user profile (default name is <environment name>-<environment type>-<region>-user-profile
).
In the Studio:
- Choose the Components and registries
- On the drop-down menu, choose Projects
- Choose Create project
- Choose Organization templates
- Choose a project template from the list
Each of the delivered MLOps projects contains a seed code which is deployed as project's CodeCommit repository.
The seed repository contains fully functional source code used by the CI/CD pipeline for model building, training, and validating, or for multi-project model deployment. Refer to README.md
for each of the provided projects.
To work with the seed repository source code you must clone the repository into your Studio environment. If you would like to develop the seed code and update the MLOps project templates with new version of the code, refer to the Appendix G
After you have finished working and experimenting with MLOps projects you should perform clean up of the provisioned SageMaker resources to avoid charges. The following billable resources should be removed:
- CloudFormation stack sets with model deployment (in case you run Model deploy pipeline)
This will delete provisioned SageMaker endpoints and associated resources - SageMaker projects and corresponding S3 buckets with project artifacts
- Any data in the data and model S3 buckets
For the full clean-up CLI script refer to the Clean up
secion in the delivered shell script.
❗ This is a destructive action. All data on in Amazon S3 buckets for MLOps pipelines, ML data, and ML models will be permanently deleted. All MLOps project seed code repositories will be permanently removed from your AWS environment.
The provided notebooks for MLOps projects - sagemaker-model-deploy and sagemaker-pipelines-project - include clean-up code to remove created resources such as SageMaker projects, SageMaker endpoints, CloudFormation stack sets, and S3 bucket. Run the code cells in the Clean up section after you finished experimenting with the project:
import time
cf = boto3.client("cloudformation")
for ss in [
f"sagemaker-{project_name}-{project_id}-deploy-{env_data['EnvTypeStagingName']}",
f"sagemaker-{project_name}-{project_id}-deploy-{env_data['EnvTypeProdName']}"
]:
accounts = [a["Account"] for a in cf.list_stack_instances(StackSetName=ss)["Summaries"]]
print(f"delete stack set instances for {ss} stack set for the accounts {accounts}")
r = cf.delete_stack_instances(
StackSetName=ss,
Accounts=accounts,
Regions=[boto3.session.Session().region_name],
RetainStacks=False,
)
print(r)
time.sleep(180)
print(f"delete stack set {ss}")
r = cf.delete_stack_set(
StackSetName=ss
)
print(f"Deleting project {project_name}:{sm.delete_project(ProjectName=project_name)}")
!aws s3 rb s3://sm-mlops-cp-{project_name}-{project_id} --force
Alternatively, you can run the commands in shell script to clean up resources of multiple projects.
After completion of all clean-up steps for MLOps projects you can proceed with data science environment clean-up steps.
To verify the access to the Amazon S3 buckets for the data science environment, you can run the following commands in the Studio terminal:
aws s3 ls
The S3 VPC endpoint policy blocks access to S3 ListBuckets
operation.
aws s3 ls s3://<sagemaker deployment data S3 bucket name>
You can access the data science environment's data or models S3 buckets.
aws s3 mb s3://<any available bucket name>
The S3 VPC endpoint policy blocks access to any other S3 bucket.
aws sts get-caller-identity
All operations are performed under the SageMaker execution role.
Try to start a training job without VPC attachment:
container_uri = sagemaker.image_uris.retrieve(region=session.region_name,
framework='xgboost',
version='1.0-1',
image_scope='training')
xgb = sagemaker.estimator.Estimator(image_uri=container_uri,
role=sagemaker_execution_role,
instance_count=2,
instance_type='ml.m5.xlarge',
output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),
sagemaker_session=sagemaker_session,
base_job_name='reorder-classifier',
subnets=network_config.subnets,
security_group_ids=network_config.security_group_ids,
encrypt_inter_container_traffic=network_config.encrypt_inter_container_traffic,
enable_network_isolation=network_config.enable_network_isolation,
volume_kms_key=ebs_kms_id,
output_kms_key=s3_kms_id
)
xgb.set_hyperparameters(objective='binary:logistic',
num_round=100)
xgb.fit({'train': train_set_pointer, 'validation': validation_set_pointer})
You will get AccessDeniedException
because of the explicit Deny
in the IAM policy:
IAM policy:
{
"Condition": {
"Null": {
"sagemaker:VpcSubnets": "true",
"sagemaker:VpcSecurityGroup": "true"
}
},
"Action": [
"sagemaker:CreateNotebookInstance",
"sagemaker:CreateHyperParameterTuningJob",
"sagemaker:CreateProcessingJob",
"sagemaker:CreateTrainingJob",
"sagemaker:CreateModel"
],
"Resource": [
"arn:aws:sagemaker:*:<ACCOUNT_ID>:*"
],
"Effect": "Deny"
}
Now add the secure network configuration to the Estimator
:
network_config = NetworkConfig(
enable_network_isolation=False,
security_group_ids=env_data["SecurityGroups"],
subnets=env_data["SubnetIds"],
encrypt_inter_container_traffic=True)
xgb = sagemaker.estimator.Estimator(
image_uri=container_uri,
role=sagemaker_execution_role,
instance_count=2,
instance_type='ml.m5.xlarge',
output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),
sagemaker_session=sagemaker_session,
base_job_name='reorder-classifier',
subnets=network_config.subnets,
security_group_ids=network_config.security_group_ids,
encrypt_inter_container_traffic=network_config.encrypt_inter_container_traffic,
enable_network_isolation=network_config.enable_network_isolation,
volume_kms_key=ebs_kms_id,
output_kms_key=s3_kms_id
)
You will be able to create and run the training job
To deploy the solution, you must have Administrator (or Power User) permissions to package the CloudFormation templates, upload templates in your Amazon S3 bucket, and run the deployment commands.
You must also have AWS CLI. If you do not have it, see Installing, updating, and uninstalling the AWS CLI. If you would like to use the multi-account model deployment option, you need access to minimum two AWS accounts, recommended three accounts for development, staging and production environments.
Please go through these step-by-step instructions to package and upload the solution templates into a S3 bucket for the deployment.
If you would like to run a security scan on the CloudFormation templates using cfn_nag
(recommended), you have to install cfn_nag
:
brew install ruby brew-gem
brew gem install cfn-nag
To initiate the security scan, run the following command:
make cfn_nag_scan
You have a choice of different independent deployment options using the delivered CloudFormation templates:
- Data science environment quickstart: deploy end-to-end Data Science Environment with majority of options set to default values. This deployment type supports single-account model deployment workflow only. You can change only few deployment options
- Two-step deployment via CloudFormation: deploy the core infrastructure in the first step and then deploy a Data Science Environment, both as CloudFormation templates. CLI
aws cloudformation create-stack
is used for deployment. You can change any deployment option - Two-step deployment via CloudFormation and AWS Service Catalog: deploy the core infrastructure in the first step via
aws cloudformation create-stack
and then deploy a Data Science Environment via AWS Service Catalog. You can change any deployment option
The following sections give step-by-step deployment instructions for each of the options.
You can also find all CLI commands in the delivered shell scripts in the project folder test
.
This special type of deployment is designed for an environment, where all IAM-altering operations, such as role and policy creation, are separated from the main deployment. All IAM roles for users and services and related IAM permission policies should be created as part of a separate process following the separation of duties principle.
The IAM part can be deployed using the delivered CloudFormation templates or completely separated out-of-stack in your own process. You will provide the ARNs for the IAM roles as CloudFormation template parameters to deploy the Data Science environment.
See Appendix B
Multi-account model deployment requires VPC infrastructure and specific execution roles to be provisioned in the target accounts. The provisioning of the infrastructure and the roles is done automatically during the deployment of the data science environment as a part of the overall deployment process. To enable multi-account setup you must provide the staging and production organizational unit (OUs) IDs OR staging and production lists as CloudFormation parameters for the deployment.
This diagram shows how the CloudFormation stack sets are used to deploy the needed infrastructure to the target accounts.
Two stack sets - one for the VPC infrastructure and another for the roles - are deployed for each envrionment type, staging and production.
One-off setup is needed to enable multi-account model deployment workflow with SageMaker MLOps projects. You don't need to perform this setup if you are going to use single-account deployment only.
The provisioning of a data science environment uses CloudFormation stack set to deploy the IAM roles and VPC infrastructure into the target accounts.
The solution uses SELF_MANAGED
stack set permission model and needs two IAM roles:
AdministratorRole
in the development account (main account)SetupStackSetExecutionRole
in each of the target accounts
You must provision these roles before starting the solution deployment. The AdministratorRole
is automatically created during the soluton deployment. For the SetupStackSetExecutionRole
you can use the delivered CloudFormation template env-iam-setup-stacksest-role.yaml
or your own process of provisioning of an IAM role.
# STEP 1:
# SELF_MANAGED stack set permission model:
# Deploy a stack set execution role to _EACH_ of the target accounts in both staging and prod OUs
# This stack set execution role is used to deploy the target accounts stack sets in env-main.yaml
# !!!!!!!!!!!! RUN THIS COMMAND IN EACH OF THE TARGET ACCOUNTS !!!!!!!!!!!!
ENV_NAME="sm-mlops"
ENV_TYPE=# use your own consistent environment stage names like "staging" and "prod"
STACK_NAME=$ENV_NAME-setup-stackset-role
ADMIN_ACCOUNT_ID=<DATA SCIENCE DEVELOPMENT ACCOUNT ID>
SETUP_STACKSET_ROLE_NAME=$ENV_NAME-setup-stackset-execution-role
# Delete stack if it exists
aws cloudformation delete-stack --stack-name $STACK_NAME
aws cloudformation deploy \
--template-file cfn_templates/env-iam-setup-stackset-role.yaml \
--stack-name $STACK_NAME \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides \
EnvName=$ENV_NAME \
EnvType=$ENV_TYPE \
StackSetExecutionRoleName=$SETUP_STACKSET_ROLE_NAME \
AdministratorAccountId=$ADMIN_ACCOUNT_ID
aws cloudformation describe-stacks \
--stack-name $STACK_NAME \
--output table \
--query "Stacks[0].Outputs[*].[OutputKey, OutputValue]"
The name of the provisioned IAM role StackSetExecutionRoleName
must be passed to the env-main.yaml
template or used in Service Catalog-based deployment as SetupStackSetExecutionRoleName
parameter.
This step is only needed if you use AWS Organizations setup.
A delegated administrator account must be registred in order to enable ListAccountsForParent
AWS Organization API call. If the data science account is already the management account in the AWS Organizations, this step must be skipped.
# STEP 2:
# Register a delegated administrator to enable AWS Organizations API permission for non-management account
# Must be run under administrator in the AWS Organizations _management account_
aws organizations register-delegated-administrator \
--service-principal=member.org.stacksets.cloudformation.amazonaws.com \
--account-id=$ADMIN_ACCOUNT_ID
aws organizations list-delegated-administrators \
--service-principal=member.org.stacksets.cloudformation.amazonaws.com
The solution is designed for multi-region deployment. You can deploy end-to-end stack in any region of a single AWS account. The following limitations and considerations apply:
- The shared IAM roles (
DSAdministratorRole
,SageMakerDetectiveControlExecutionRole
,SCLaunchRole
) are created each time you deploy a new core infrastructure (core-main
) or "quickstart" (data-science-environment-quickstart
) stack. They created with<StackName>-<RegionName>
prefix and designed to be unique within your end-to-end data science environment. For example, if you deploy one stack set (including core infrastructure and team data science environment) in one region and another stack in another region, these two stacks will not share any IAM roles and any users assuming any persona roles will have an independent set of permissions per stack set. - The environment IAM roles (
DSTeamAdministratorRole
,DataScientistRole
,SageMakerExecutionRole
,SageMakerPipelineExecutionRole
,SCProjectLaunchRole
,SageMakerModelExecutionRole
) are created with unique names. Each deployment of a new data science environment (via CloudFormation or via AWS Service Catalog) creates a set of unique roles. - SageMaker Studio uses two pre-defined roles
AmazonSageMakerServiceCatalogProductsLaunchRole
andAmazonSageMakerServiceCatalogProductsUseRole
. These roles are global for the AWS account and created by the first deployment of core infrastructure. These two roles haveRetain
deletion policy and are not deleted when you delete the stack which has created these roles.
If you created any SageMaker projects, you must clean up resources as described in the Clean up after working with MLOps project templates section.
To delete resources of the multiple projects, you can use this shell script.
CloudFormation delete-stack
will not remove any non-empty S3 bucket. You must empty data science environment S3 buckets for data and models before you can delete the data science environment stack.
First, remove VPC-only access policy from the data and model bucket to be able to delete objects from a CLI terminal.
ENV_NAME=<use default name `sm-mlops` or your data science environment name you chosen when you created the stack>
aws s3api delete-bucket-policy --bucket $ENV_NAME-dev-${AWS_DEFAULT_REGION}-data
aws s3api delete-bucket-policy --bucket $ENV_NAME-dev-${AWS_DEFAULT_REGION}-models
❗ This is a destructive action. The following command will delete all files in the data and models S3 buckets ❗
Now you can empty the buckets:
aws s3 rm s3://$ENV_NAME-dev-$AWS_DEFAULT_REGION-data --recursive
aws s3 rm s3://$ENV_NAME-dev-$AWS_DEFAULT_REGION-models --recursive
Depending on the deployment type, you must delete the corresponding CloudFormation stacks. The following commands use the default stack names. If you customized the stack names, adjust the commands correspondingly.
aws cloudformation delete-stack --stack-name ds-quickstart
aws cloudformation wait stack-delete-complete --stack-name ds-quickstart
aws cloudformation delete-stack --stack-name sagemaker-mlops-package-cfn
aws cloudformation delete-stack --stack-name sm-mlops-env
aws cloudformation wait stack-delete-complete --stack-name sm-mlops-env
aws cloudformation delete-stack --stack-name sm-mlops-core
aws cloudformation wait stack-delete-complete --stack-name sm-mlops-core
aws cloudformation delete-stack --stack-name sagemaker-mlops-package-cfn
- Assume DS Administrator IAM role via link in the CloudFormation output.
aws cloudformation describe-stacks \
--stack-name sm-mlops-core \
--output table \
--query "Stacks[0].Outputs[*].[OutputKey, OutputValue]"
- In AWS Service Catalog console go to the Provisioned Products, select your product and click Terminate from the Action button. Wait until the delete process ends.
- Delete the core infrastructure CloudFormation stack:
aws cloudformation delete-stack --stack-name sm-mlops-core
aws cloudformation wait stack-delete-complete --stack-name sm-mlops-core
aws cloudformation delete-stack --stack-name sagemaker-mlops-package-cfn
The deployment of Studio creates a new EFS in your account. This EFS is shared with all users of Studio and contains home directories for Studio users and may contain your data. When you delete the data science environment stack, the Studio domain, user profile and Apps are also deleted. However, the EFS will not be deleted and kept "as is" in your account. Additional resources are created by Studio and retained upon deletion together with the EFS:
- EFS mounting points in each private subnet of your VPC
- ENI for each mounting point
- Security groups for EFS inbound and outbound traffic
❗ To delete the EFS and EFS-related resources in your AWS account created by the deployment of this solution, do the following steps after running delete-stack
commands.
❗ This is a destructive action. All data on the EFS will be deleted (SageMaker home directories). You may want to backup the EFS before deletion. ❗
From AWS console: aGot to the EFS console and delete the SageMaker EFS. You may want to backup the EFS before deletion.
To find the SageMaker EFS, click on the file system ID and then on the Tags tab. You see a tag with the Tag Key ManagedByAmazonSageMakerResource. Its Tag Value contains the SageMaker domain ID:
Click on the Delete button to delete this EFS.
- Go to the VPC console and delete the data science VPC
Alternatively, you can remove EFS using the following AWS CLI commands:
- List SageMaker domain IDs for all EFS with SageMaker tag:
aws efs describe-file-systems \
--query 'FileSystems[].Tags[?Key==`ManagedByAmazonSageMakerResource`].Value[]'
- Copy the SageMaker domain ID and run the following script from the solution directory:
SM_DOMAIN_ID=#SageMaker domain id
pipenv run python3 functions/pipeline/clean-up-efs-cli.py $SM_DOMAIN_ID
For the full clean-up scrip please refer to the Clean up
secion in the delivered shell script and instructions in MLOps project section.
The following three sections describes each deployment type and deployment use case in detail.
This option deploys the end-to-end infrastructure and a Data Science Environment in one go. You can change only few deployment options. The majority of the options are set to their default values.
📜 Use this option if you want to provision a completely new set of the infrastructure and do not want to parametrize the deployment.
❗ With Quickstart deployment type you can uses single-account model deployment only. Do not select YES
for MultiAccountDeployment
parameter for model deploy SageMaker project template:
The only deployment options you can change are:
CreateSharedServices
: defaultNO
. Set toYES
if you want to provision a shared services VPC with a private PyPI mirror (not implemeted at this stage)VPCIDR
: default10.0.0.0/16
. CIDR block for the new VPC- Private and public subnets CIDR blocks: default
10.0.0.0/19
Make sure you specify the CIDR blocks which do not conflict with your existing network IP ranges.
❗ You cannot use existing VPC or existing IAM roles to deploy this stack. The stack will provision a new own set of network and IAM resources.
Initiate the stack deployment with the following command. Use the S3 bucket name you used to upload CloudFormation templates in Package CloudFormation templates section.
STACK_NAME="ds-quickstart"
ENV_NAME="sagemaker-mlops"
aws cloudformation create-stack \
--template-url https://s3.$AWS_DEFAULT_REGION.amazonaws.com/$S3_BUCKET_NAME/sagemaker-mlops/data-science-environment-quickstart.yaml \
--region $AWS_DEFAULT_REGION \
--stack-name $STACK_NAME \
--disable-rollback \
--capabilities CAPABILITY_NAMED_IAM \
--parameters \
ParameterKey=EnvName,ParameterValue=$ENV_NAME \
ParameterKey=EnvType,ParameterValue=dev \
ParameterKey=SeedCodeS3BucketName,ParameterValue=$S3_BUCKET_NAME
The full end-to-end deployment takes about 25 minutes.
Refer to clean up section.
Using this option you provision a Data Science environment in two steps, each with its own CloudFormation template. You can control all deployment parameters.
📜 Use this option if you want to parametrize every aspect of the deployment based on your specific requirements and enviroment.
❗ You can select your existing VPC and network resources (subnets, NAT gateways, route tables) and existing IAM resources to be used for stack set deployment. Set the correspoinding CloudFormation parameters to names and ARNs or your existing resources.
❗ You must specify the valid OU IDs for the OrganizationalUnitStagingId
/OrganizationalUnitProdId
or StagingAccountList
/ProductionAccountList
parameters for the env-main.yaml
template to enable multi-account model deployment.
You can use the provided shell script to run this deployment type or follow the commands below.
In this step you deploy the shared core infrastructure into your AWS Account. The stack (core-main.yaml
) will provision:
- Shared IAM roles for Data Science personas and services (optional if you bring your own IAM roles)
- A shared services VPC and related networking resources (optional if you bring your own network configuration)
- An ECS Fargate cluster to run a private PyPi mirror (not implemented at this stage)
- An AWS Service Catalog portfolio to provide a self-service deployment for the Data Science administrator user role
- Security guardrails for your Data Science environment (detective controls are not implemented at this stage)
The deployment options you can use are:
CreateIAMRoles
: defaultYES
. Set toNO
if you have created the IAM roles outside of the stack (e.g. via a separate process) - such as "Bring Your Own IAM Role (BYOR IAM)" use caseCreateSharedServices
: defaultNO
. Set toYES
if you would like to create a shared services VPC and an ECS Fargate cluster for a private PyPi mirror (not implemented at this stage)CreateSCPortfolio
: defaultYES
. Set toNO
if you don't want to to deploy an AWS Service Catalog portfolio with Data Science environment productsDSAdministratorRoleArn
: required ifCreateIAMRoles=NO
, otherwise will be automatically provisionedSCLaunchRoleArn
: required ifCreateIAMRoles=NO
, otherwise will be automatically provisionedSecurityControlExecutionRoleArn
: required ifCreateIAMRoles=NO
, otherwise will be automatically provisioned
The following command uses the default values for the deployment options. You can specify parameters via ParameterKey=<ParameterKey>,ParameterValue=<Value>
pairs in the aws cloudformation create-stack
call:
STACK_NAME="sm-mlops-core"
aws cloudformation create-stack \
--template-url https://s3.$AWS_DEFAULT_REGION.amazonaws.com/$S3_BUCKET_NAME/sagemaker-mlops/core-main.yaml \
--region $AWS_DEFAULT_REGION \
--stack-name $STACK_NAME \
--disable-rollback \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
--parameters \
ParameterKey=StackSetName,ParameterValue=$STACK_NAME
After a successful stack deployment, you can see the stack output:
aws cloudformation describe-stacks \
--stack-name sm-mlops-core \
--output table \
--query "Stacks[0].Outputs[*].[OutputKey, OutputValue]"
The step 2 CloudFormation template (env-main.yaml
) provides two deployment options:
- Deploy Studio into a new VPC: This option provisions a new AWS network infrastruture consisting of:
- VPC
- private subnet in each AZ
- optional public subnet in each AZ. The public subnets are provisioned only if you chose to create NAT gateways
- VPC endpoints to access public AWS services and Amazon S3 buckets for data science projects
- security groups
- optional NAT gateway in each AZ
- if you select the NAT gateway option, an internet gateway will be created and attached to the VPC
- routing tables and routes for private and public subnets
You specify the number of AZs and CIDR blocks for VPC and each of the subnets.
After provisioning the network infrastructure, the solution deploys Studio into this VPC.
- Deploy Studio into an existing VPC: This option provisions Studio in your existing AWS network infrastructure. You have several options to choose between existing or create new network resources:
- VPC: you must provide a valid existing VPC Id
- Subnets: you can choose between:
- providing existing subnet CIDR blocks (set
CreatePrivateSubnets
toNO
) - in this case no new subnets are provisioned and NAT gateway option is not available. All SageMaker resources are deployed into your existing VPC and private subnets. You use your existing NAT (if available) to access internet from the private subnets - provisioning new private (set
CreatePrivateSubnets
toYES
) and optional (only if the NAT gateway option is selected,CreateNATGateways
=YES
) public subnets. The deployment creates new subnets with specified CIDR blocks inside your existing VPC.
- providing existing subnet CIDR blocks (set
You must specify the number of AZs you would like to deploy the network resources into.
❗ A new internet gateway will be created and attached to the VPC in "existing VPC" scenario if you select the NAT gateway option by setting CreateNATGateways
to YES
. The stack creation will fail if there is an internet gateway already attached to the existing VPC and you select the NAT gateway option.
Example of the existing VPC infrastructure:
The Data Science environment deployment will provision the following resources in your AWS account:
- environment-specific IAM roles (optional, if
CreateEnvironmentIAMRoles
set toYES
) - a VPC with all network infrastructure for the environment (optional) - see the considerations above
- VPC endpoints to access the environment's Amazon S3 buckets and AWS public services via private network
- AWS KMS keys for data encryption
- two S3 buckets for environment data and model artifacts
- AWS Service Catalog portfolio with environment-specific products
- Studio domain and default user profile
If you choose the multi-account model deployment option by providing values for OrganizationalUnitStagingId
/OrganizationalUnitProdId
or StagingAccountList
/ProductionAccountList
, the deployment will provision the following resources in the target accounts:
- VPC with a private subnet in each of the AZs, no internet connectivity
- Security groups for SageMaker model hosting and VPC endpoints
- Execution roles for stack set operations and SageMaker models
You can change any deployment options via CloudFormation parameters for core-main.yaml
and env-main.yaml
templates.
Run command providing the deployment options for your environment. The following command uses the minimal set of the options:
STACK_NAME="sm-mlops-env"
ENV_NAME="sm-mlops"
aws cloudformation create-stack \
--template-url https://s3.$AWS_DEFAULT_REGION.amazonaws.com/$S3_BUCKET_NAME/sagemaker-mlops/env-main.yaml \
--region $AWS_DEFAULT_REGION \
--stack-name $STACK_NAME \
--disable-rollback \
--capabilities CAPABILITY_NAMED_IAM \
--parameters \
ParameterKey=EnvName,ParameterValue=$ENV_NAME \
ParameterKey=EnvType,ParameterValue=dev \
ParameterKey=AvailabilityZones,ParameterValue=${AWS_DEFAULT_REGION}a\\,${AWS_DEFAULT_REGION}b \
ParameterKey=NumberOfAZs,ParameterValue=2 \
ParameterKey=SeedCodeS3BucketName,ParameterValue=$S3_BUCKET_NAME
If you would like to use multi-account model deployment, you must provide the valid values for OU IDs or account lists and the name for the SetupStackSetExecutionRole
:
STACK_NAME="sm-mlops-env"
ENV_NAME="sm-mlops"
STAGING_OU_ID=<OU id>
PROD_OU_ID=<OU id>
STAGING_ACCOUNTS=<comma-delimited account list>
PROD_ACCOUNTS=<comma-delimited account list>
SETUP_STACKSET_ROLE_NAME=$ENV_NAME-setup-stackset-execution-role
aws cloudformation create-stack \
--template-url https://s3.$AWS_DEFAULT_REGION.amazonaws.com/$S3_BUCKET_NAME/sagemaker-mlops/env-main.yaml \
--region $AWS_DEFAULT_REGION \
--stack-name $STACK_NAME \
--disable-rollback \
--capabilities CAPABILITY_NAMED_IAM \
--parameters \
ParameterKey=EnvName,ParameterValue=$ENV_NAME \
ParameterKey=EnvType,ParameterValue=dev \
ParameterKey=AvailabilityZones,ParameterValue=${AWS_DEFAULT_REGION}a\\,${AWS_DEFAULT_REGION}b \
ParameterKey=NumberOfAZs,ParameterValue=2 \
ParameterKey=StartKernelGatewayApps,ParameterValue=NO \
ParameterKey=SeedCodeS3BucketName,ParameterValue=$S3_BUCKET_NAME \
ParameterKey=OrganizationalUnitStagingId,ParameterValue=$STAGING_OU_ID \
ParameterKey=OrganizationalUnitProdId,ParameterValue=$PROD_OU_ID \
ParameterKey=StagingAccountList,ParameterValue=$STAGING_ACCOUNTS \
ParameterKey=ProductionAccountList,ParameterValue=$PROD_ACCOUNTS \
ParameterKey=SetupStackSetExecutionRoleName,ParameterValue=$SETUP_STACKSET_ROLE_NAME
Refer to clean up section.
This deployment option first deploys the core infrastructure including the AWS Service Catalog portfolio of Data Science products. In the second step, the Data Science Administrator deploys a Data Science environment via the AWS Service Catalog.
📜 Use this option if you want to similate the end user experience in provisioning a Data Science environment via AWS Service Catalog
❗ You can select your existing VPC and network resources (subnets, NAT gateways, route tables) and existing IAM resources to be used for stack set deployment. Set the correspoinding CloudFormation and AWS Service Catalog product parameters to names and ARNs or your existing resources.
Same as Step 1 from Two-step deployment via CloudFormation
After the base infrastructure is provisioned, data scientists and other users must assume the DS Administrator IAM role (AssumeDSAdministratorRole
) via link in the CloudFormation output. In this role, the users can browse the AWS Service Catalog and then provision a secure Studio environment.
First, print the output from the stack deployment in Step 1:
aws cloudformation describe-stacks \
--stack-name sm-mlops-core \
--output table \
--query "Stacks[0].Outputs[*].[OutputKey, OutputValue]"
Copy and paste the AssumeDSAdministratorRole
link to a web browser and switch role to DS Administrator.
Go to AWS Service Catalog in the AWS console and select Products on the left pane:
You see the list of available products for your user role:
Click on the product name and and then on the Launch product on the product page:
Fill the product parameters with values specific for your environment. Provide the valid values for OU ids OR for the staging and production account lists and the name for the SetupStackSetExecutionRole
if you would like to enable multi-account model deployment, otherwise keep these parameters empty.
There are only two required parameters you must provide in the product launch console:
- S3 bucket name with MLOps seed code - use the S3 bucket where you packaged the CloudFormation templates:
- Availability Zones: you need at least two availability zones for SageMaker model deployment workflow:
Wait until AWS Service Catalog finishes the provisioning of the Data Science environment stack and the product status becomes Available. The data science environment provisioning takes about 20 minutes to complete.
Now you provisioned the Data Science environment and can start working with it.
Refer to clean up section.
- [R1]: Amazon SageMaker Pipelines documentation
- [R2]: Best practices for multi-account AWS environment
- [R3]: AWS Well-Architected Framework - Machine Learning Lens Whitepaper
- [R4]: Terraform provider AWS GitHub
- [R5]: Data processing options for AI/ML
- [R6]: Architect and build the full machine learning lifecycle with AWS: An end-to-end Amazon SageMaker demo
- [R7]: End-to-end Amazon SageMaker demo
- [R8]: Multi-account model deployment with Amazon SageMaker Pipelines
- [R9]: Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines
- [R10]: Best Practices for Organizational Units with AWS Organizations
- [R11]: Build a CI/CD pipeline for deploying custom machine learning models using AWS services
- [R12]: Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation
- [R13]: Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store
- [R14]: How Genworth built a serverless ML pipeline on AWS using Amazon SageMaker and AWS Glue
- [R15]: SageMaker cross-account model
- [R16]: Use Amazon CloudWatch custom metrics for real-time monitoring of Amazon Sagemaker model performance
- [R17]: Automate feature engineering pipelines with Amazon SageMaker
- [R18]: Build a Secure Enterprise Machine Learning Platform on AWS
- [R19]: Automate Amazon SageMaker Studio setup using AWS CDK
- [R20]: How to use trust policies with IAM roles
- [R21]: Use Amazon SageMaker Studio Notebooks
- [R22]: Shutting Down Amazon SageMaker Studio Apps on a Scheduled Basis
- [R23]: Register and Deploy Models with Model Registry
- [R24]: Setting up secure, well-governed machine learning environments on AWS
- [R25]: Machine learning best practices in financial services
- [R26]: Machine Learning Best Practices in Financial Services
- [R27]: Dynamic A/B testing for machine learning models with Amazon SageMaker MLOps projects
- [R28]: Hosting a private PyPI server for Amazon SageMaker Studio notebooks in a VPC
- [R29]: Automate a centralized deployment of Amazon SageMaker Studio with AWS Service Catalog
- [SOL1]: AWS MLOps Framework
- [SOL2]: Amazon SageMaker with Guardrails on AWS
- [S1]: Building secure machine learning environments with Amazon SageMaker
- [S2]: Secure Data Science Reference Architecture GitHub
- [S3]: SageMaker Notebook instance lifecycle config samples GitHub
- [S4]: Securing Amazon SageMaker Studio connectivity using a private VPC
- [S5]: Secure deployment of Amazon SageMaker resources
- [S6]: Understanding Amazon SageMaker notebook instance networking configurations and advanced routing options
- [S7]: Security group rules for different use cases
- [S8]: Data encryption at rest in SageMaker Studio
- [S9]: Connect SageMaker Studio Notebooks to Resources in a VPC
- [S10]: Control root access to Amazon SageMaker notebook instances
- [S11]: 7 ways to improve security of your machine learning workflows
- [S12]: PySparkProcessor - Unable to locate credentials for boto3 call in AppMaster
- [S13]: Private package installation in Amazon SageMaker running in internet-free mode
- [S14]: Securing Amazon SageMaker Studio internet traffic using AWS Network Firewall
- [S15]: Secure Your SageMaker Studio Access Using AWS PrivateLink and AWS IAM SourceIP Restrictions
- [S16]: Model Risk Management by Deloitte
- [S17]: Building secure Amazon SageMaker access URLs with AWS Service Catalog
- [W1]: SageMaker immersion day GitHub
- [W2]: SageMaker immersion day workshop 2.0
- [W3]: Amazon Sagemaker MLOps workshop GitHub
- [W4]: Operationalizing the ML pipeline workshop
- [W5]: Safe MLOps deployment pipeline
- [W6]: Buiding secure environments workshop
- [W7]: Amazon Managed Workflows for Apache Airflow workshop
- https://github.com/visenger/awesome-mlops
- https://github.com/EthicalML/awesome-production-machine-learning
- https://github.com/alirezadir/Production-Level-Deep-Learning
- https://www.featurestore.org/
- https://twitter.com/chipro/status/1318190833141714949?s=20
- TWIML podcast: Feature Stores for MLOps with Mike del Balso
- TWIML podcast: Enterprise Readiness, MLOps and Lifecycle Management with - - Jordan Edwards
- Full stack deep learning free online course
- Continuous Delivery for Machine Learning
- Feature Store vs Data Warehouse
- Seldon Core
- MLflow and PyTorch — Where Cutting Edge AI meets MLOps
- 5 Lessons Learned Building an Open Source MLOps Platform
Deployment into an existing VPC and with pre-provisioned IAM resources
The solution is tested end-to-end for all possible depoyment options using AWS CodePipeline and AWS developer tools.
The CodePipeline pipelines used in the solution deploy stack in different regions. You must create an Amazon S3 CodePipeline bucket per region following the naming convention: codepipeline-<ProjectName>-<AWS Region>
:
PROJECT_NAME=sm-mlops
aws s3 mb s3://codepipeline-${PROJECT_NAME}-us-east-2 --region us-east-2
aws s3 mb s3://codepipeline-${PROJECT_NAME}-eu-central-1 --region eu-central-1
aws s3 mb s3://codepipeline-${PROJECT_NAME}-eu-west-1 --region eu-west-1
aws s3 mb s3://codepipeline-${PROJECT_NAME}-eu-west-2 --region eu-west-2
To use CI/CD pipelines you must setup CodeCommit repository and configure notifications on pipeline status changes.
-
Setup CodeCommit repository
-
Create SNS topic to receive notifications
To setup all CI/CD pipelines run the following command from the solution directory:
aws cloudformation deploy \
--template-file test/cfn_templates/create-base-infra-pipeline.yaml \
--stack-name base-infra-$AWS_DEFAULT_REGION \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
--parameter-overrides \
CodeCommitRepositoryArn= \
NotificationArn=
AWS Service Catalog Terraform Reference Architecture GitHub
Q: Can I use Terraform with AWS Service Catalog?
You can leverage the AWS Service Catalog Terraform Reference Architecture. This reference architecture provides an example for using AWS Service Catalog products, an AWS CloudFormation custom resource, and Terraform to provision resources on AWS.
How to get an IAM snapshot from the account:
aws iam get-account-authorization-details > iam_snapshot.json
To enable SageMaker projects you need first to enable SageMaker AWS Service Catalog portfolio and then to associate the Studio execution role with the portfolio using https://docs.aws.amazon.com/cli/latest/reference/servicecatalog/associate-principal-with-portfolio.html.
In addition you need to make sure to create two roles (which otherwise get created through the console): AmazonSageMakerServiceCatalogProductsUseRole
and AmazonSageMakerServiceCatalogProductsLaunchRole
.
Below a sample code_snippet for boto3 for the full workflow:
studio_role_arn
is the role which is associated with Studiosc_client
is AWS Service Catalog boto3 clientclient
: is SageMaker boto3 client
def enable_projects(studio_role_arn):
# enable Project on account level (accepts portfolio share)
response = client.enable_sagemaker_servicecatalog_portfolio()
# associate studio role with portfolio
response = sc_client.list_accepted_portfolio_shares()
portfolio_id = ''
for portfolio in response['PortfolioDetails']:
if portfolio['ProviderName'] == 'Amazon SageMaker':
portfolio_id = portfolio['Id']
response = sc_client.associate_principal_with_portfolio(
PortfolioId=portfolio_id,
PrincipalARN=studio_role_arn,
PrincipalType='IAM'
)
The following architectural options for implementing MLOps pipeline are available:
- SageMaker Projects + CodePipeline (implemented by this solutions)
- Step Functions
- Apache AirFlow, SageMaker Operators for AirFlow
- SageMaker Operators for Kubernetes
You can develop and evolve the seed code for your own needs. To deliver the new version of the seed code in form of the project template, please follow the steps:
-
Update existing or create your own version of the seed code
-
Zip all files that should go into a project CodeCommit repository into a single
.zip
file -
Upload this
.zip
file to an Amazon S3 bucket of your choice. You must specify this S3 bucket name when you create a new project in Studio -
Set a special tag
servicecatalog:provisioning
on the uploaded file. This tag will enable access to the object byAmazonSageMakerServiceCatalogProductsLaunchRole
IAM role:aws s3api put-object-tagging \ --bucket <your Amazon S3 bucket name> \ --key <your project name>/seed-code/<zip-file name> \ --tagging 'TagSet=[{Key=servicecatalog:provisioning,Value=true}]'
-
Update the
AWS::CodeCommit::Repository
resource in the CloudFormation template with the CI/CD pipeline for the corresponding MLOps project (project-model-build-train.yaml
orproejct-model-deploy.yaml
) with the new zip-file name.Model deploy project
proejct-model-deploy.yaml
:ModelDeployCodeCommitRepository: Type: AWS::CodeCommit::Repository Properties: # Max allowed length: 100 chars RepositoryName: !Sub sagemaker-${SageMakerProjectName}-${SageMakerProjectId}-model-deploy # max: 10+33+15+12=70 RepositoryDescription: !Sub SageMaker Endpoint deployment infrastructure as code for the project ${SageMakerProjectName} Code: S3: Bucket: !Ref SeedCodeS3BucketName Key: <your project name>/seed-code/<zip-file name> BranchName: main
Model build, train, validate project
project-model-build-train.yaml
:ModelBuildCodeCommitRepository: Type: AWS::CodeCommit::Repository Properties: # Max allowed length: 100 chars RepositoryName: !Sub sagemaker-${SageMakerProjectName}-${SageMakerProjectId}-model-build-train # max: 10+33+15+18=76 RepositoryDescription: !Sub SageMaker Model building infrastructure as code for the project ${SageMakerProjectName} Code: S3: Bucket: !Ref SeedCodeS3BucketName Key: sagemaker-mlops/seed-code/mlops-model-build-train-v1.0.zip BranchName: main
-
Update the CloudFormation template file
env-sc-portfolio.yaml
with a new version of Service Catalog Product:DataScienceMLOpsModelBuildTrainProduct: Type: 'AWS::ServiceCatalog::CloudFormationProduct' Properties: Name: !Sub '${EnvName}-${EnvType} MLOps Model Build Train <NEW VERSION>' Description: 'This template creates a CI/CD MLOps project which implements ML build-train-validate pipeline' Owner: 'Data Science Administration Team' ProvisioningArtifactParameters: - Name: 'MLOps Model Build Train <NEW VERSION>'
-
Package and upload all changed CloudFormation template to the distribution S3 bucket
-
Update Service Catalog CloudFormation stack with the updated templates:
- Package the CloudFormation templates and upload everything to the Amazon S3 bucket:
S3_BUCKET_NAME=<YOUR S3 BUCKET NAME> make package CFN_BUCKET_NAME=$S3_BUCKET_NAME
- Update the Service Catalog portfolio and product stack:
STACK_NAME= # <generated EnvironmentSCPortfolio stack name> PRINCIPAL_ROLE_ARN= # <SageMaker Execution Role ARN> LAUNCH_ROLE_ARN= # <AmazonSageMakerServiceCatalogProductsLaunchRole ARN> aws cloudformation update-stack \ --template-url https://s3.$AWS_DEFAULT_REGION.amazonaws.com/$S3_BUCKET_NAME/sagemaker-mlops/env-sc-portfolio.yaml \ --region $AWS_DEFAULT_REGION \ --stack-name $STACK_NAME \ --parameters \ ParameterKey=EnvName,ParameterValue=sm-mlops \ ParameterKey=EnvType,ParameterValue=dev \ ParameterKey=SCMLOpsPortfolioPrincipalRoleArn,ParameterValue=$PRINCIPAL_ROLE_ARN \ ParameterKey=SCMLOpsProductLaunchRoleArn,ParameterValue=$LAUNCH_ROLE_ARN
-
Restart Studio (close the browser window with Studio and open again via AWS console)
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0