Build and Deploy Models Leveraging Cancer Gene Expression Data With SageMaker Pipelines and SageMaker Multi-Model Endpoints

10 minute read
Content level: Advanced

In this article we show how you can use SageMaker Pipelines and SageMaker Multi-Model Endpoints to orchestrate and deploy many models in a cost-effective and efficient manner. We demonstrate this in the context of cancer survival analysis, deploying many models built on gene expression signatures.

Authors: Joshua Broyde and Shamika Ariyawansa

Introduction

When building machine learning models that leverage genomic data, a key problem is how to allow users to select which features should be used when querying models. To address this, data scientists sometimes build multiple models to handle specific sub-problems within the dataset. In the context of survival analysis for cancer, a common approach is to analyze gene signatures and to predict patient survival based on those gene expression signatures. Such analyses have been applied across many different cancer types [1], and there are many different techniques for performing such survival analysis [2].

A problem arises when an application requires publishing models based on many hundreds or thousands of gene signatures: managing and deploying all of these models can become difficult and unwieldy. In this article, we show how you can leverage SageMaker Pipelines and SageMaker Multi-Model Endpoints to build and deploy many such models.

To give a specific example, we will leverage the cancer RNA expression dataset from the paper A radiogenomic dataset of non-small cell lung cancer [3]. You can follow the preprocessing steps outlined in this blog post for preprocessing the data. If you run the pipeline described in that blog post, you will get the entire gene expression profile from the raw FASTQ files; you can also access the entire gene expression dataset at GEO. To simplify the use case, we will focus on 21 co-expressed gene groups that the paper Non-Small Cell Lung Cancer Radiogenomics Map Identifies Relationships between Molecular and Imaging Phenotypes with Prognostic Implications [4] found to be clinically significant in non-small cell lung cancer (see Table 2 of that paper). These groups of genes, which the authors term metagenes, are annotated with different cellular pathways. For example, the first group of genes (LRIG1, HPGD, and GDF15) relates to the EGFR signaling pathway, while CIM, LMO2, and EFR2 are all involved in hypoxia/inflammation. In the dataset, each cancer patient (row) has gene expression values (columns). In addition, each of the 199 patients is annotated with a Survival Status (1 for deceased; 0 for alive at the time the dataset was collected).

While a data scientist could build a single model that leverages all of the data, we address how to build and deploy multiple models, each built around a separate cluster of genes. This can be useful if downstream users want to understand whether survival can be well predicted by a specific pathway alone (e.g., can survival be well predicted exclusively from the EGFR pathway?).

In this article, we will show:

  • How SageMaker Pipelines can be leveraged to build a separate model for each gene group.

  • How to deploy those models with SageMaker Multi-Model Endpoints.

  • How to query the models, specifying which gene group's model to retrieve predictions from.

This article presents a broad overview of the notebook we have published in the AWS Healthcare and Life Sciences sample notebooks here. You can clone that repository into SageMaker Studio and run it cell by cell if you wish. In this post, we highlight the architecture, key pieces of code, and processes in that notebook.

Architecture and Approach

The architecture for this approach is as follows:

[Architecture diagram]

As can be seen in the diagram, we start with data located in S3. We then create a SageMaker Pipeline. SageMaker Pipelines is a feature that allows data scientists to wrap the different components of their workload as a pipeline, so that each step of the analysis is automatically kicked off after the previous job finishes. See the associated code repository for the specific syntax for creating a SageMaker Pipeline. For model deployment, we use a SageMaker Multi-Model Endpoint (MME), which provides a scalable and cost-effective way to deploy multiple models using a shared container. Because a single endpoint and its resources host many models, this reduces cost compared with deploying each model to its own endpoint.

The pipeline consists of:

  • A SageMaker Processing job for preprocessing the data.

  • A SageMaker Training job for training the model.

  • A SageMaker Processing job for evaluating and registering the model in the SageMaker Model Registry.

  • A separate SageMaker Processing job for deploying the model to the SageMaker Multi-Model Endpoint (MME).

At this point, the trained models are stored on S3, and the Multi-Model Endpoint can dynamically retrieve the needed model based on the user request. The user specifies not only the input data to run, but which specific model to use.

Thinking back to the gene expression data, the following diagram presents an overview of the modeling process:

[Diagram: the gene expression matrix is split into N gene-group subsets, each used to train a separate survival model]

In this diagram, we start with the original gene expression data (red indicates higher expression; blue, lower expression) and split that data into N separate subsets of gene expression data. Model 1, for example, is built on genes 1, 2, and 3; Model 2 on genes 4, 5, and 6; and so on. We then train multiple models, where each subsample of gene expression data is used to predict survival. Note that each execution of the SageMaker Pipeline builds one model based on one gene signature.

As mentioned in the introduction, we are working with a small dataset covering 21 gene groups found to be significant in predicting survival in lung cancer. However, you could apply a similar analysis to other groups of genes, such as those in the KEGG pathway database or the Molecular Signatures Database.
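To make the grouping concrete, here is a minimal sketch of how the metagene groups might be represented in code. The group-to-gene mapping below is illustrative only; the actual metagene definitions are in Table 2 of [4].

import pandas as pd

# Illustrative mapping of metagene groups to member genes; the real
# assignments come from Table 2 of Zhou et al. [4].
METAGENE_GROUPS = {
    "metagene_19": ["LRIG1", "HPGD", "GDF15"],  # EGFR signaling
    "metagene_10": ["CIM", "LMO2", "EFR2"],     # hypoxia/inflammation (group number hypothetical)
}

def subset_expression(expression_df: pd.DataFrame, group: str) -> pd.DataFrame:
    # Return only the expression columns belonging to one metagene group.
    return expression_df[METAGENE_GROUPS[group]]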

Model Training, Evaluation and Registration

Preliminary Steps and Model Deployment

First, we load the training and testing sets into S3; all of the subsequent training and testing steps will point to this data.
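A minimal sketch of this upload step, assuming the preprocessed splits are saved locally as CSV files (the file names and key prefixes are placeholders):

import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()  # or your own bucket

# Upload the preprocessed splits; subsequent pipeline steps point to
# the returned S3 URIs as their inputs.
train_s3_uri = session.upload_data("train.csv", bucket=bucket, key_prefix="gene-expression/train")
test_s3_uri = session.upload_data("test.csv", bucket=bucket, key_prefix="gene-expression/test")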

Next, we create and deploy the multi-model endpoint. Because the model is a PyTorch model, we leverage the SageMaker prebuilt PyTorch container. Note that for now we are deploying an MME that points to an empty collection of models; we will populate the collection later, in a SageMaker Pipeline step. We also specify a custom inference.py script, which will allow users to choose which model to invoke.
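A sketch of that deployment follows; the endpoint name, instance type, framework versions, and S3 prefixes are illustrative, and the exact values are in the sample notebook.

import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.multidatamodel import MultiDataModel

role = sagemaker.get_execution_role()

# Base PyTorch model supplying the prebuilt container and the custom
# inference.py entry point shared by every model behind the endpoint.
pytorch_model = PyTorchModel(
    model_data=f"s3://{bucket}/gene-expression/placeholder-model.tar.gz",  # placeholder artifact
    role=role,
    entry_point="inference.py",
    framework_version="1.12",
    py_version="py38",
)

# The MME starts out empty; the pipeline's deployment step later copies
# trained models under this S3 prefix.
mme = MultiDataModel(
    name="gene-expression-mme",
    model_data_prefix=f"s3://{bucket}/gene-expression/mme-models/",
    model=pytorch_model,
)

predictor = mme.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")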

You might be wondering why we leverage an MME in the first place, rather than deploying each of these models on a standard SageMaker endpoint. While it is technically possible to deploy each model as a separate endpoint, doing so increases costs, since you are billed for each endpoint separately. MME allows you to deploy many models backed by a single endpoint, maintains a cache of the most-used models to decrease latency, and can be used to deploy even thousands of models. The models themselves are stored in .tar.gz format in S3; the MME dynamically loads models into the cache as needed.
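Because the endpoint simply serves whatever artifacts sit under its S3 prefix, adding a model amounts to a copy; the SDK also exposes helpers for this. A sketch (the source path is illustrative):

# Copy a newly trained artifact under the endpoint's prefix; no endpoint
# update or redeployment is needed for the MME to start serving it.
mme.add_model(
    model_data_source=f"s3://{bucket}/pipeline-output/model-metagene_19.tar.gz",
    model_data_path="model-metagene_19.tar.gz",
)

# List every model currently available behind the endpoint.
for model_path in mme.list_models():
    print(model_path)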

Create and Execute the SageMaker Pipeline

Next, we will create a SageMaker Pipeline that reads the data and trains a model for the gene group of choice. The key parameter here is genome_group, which defines which specific set of genes to build and train the model on. You can kick off multiple pipeline executions, each training a brand-new model for a specific genome group. Once the pipeline steps are defined, the key lines of code are:

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "Genome-Survival-Prediction-Pipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_train_data,
        input_validation_data,
        training_instance_type,
        genome_group,
        mme_model_location,
    ],
    steps=[step_train, step_eval, step_cond],
)

This piece of code builds the pipeline. The pipeline is named Genome-Survival-Prediction-Pipeline, and it has five parameters: the location of the training data, the location of the validation data, the training instance type, the group of genes to use in the model, and the location where the final model should be placed so it is visible to the MME.
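Those five parameters would be defined as pipeline parameters along these lines (the parameter names other than genomeGroup, and all default values, are illustrative):

from sagemaker.workflow.parameters import ParameterString

input_train_data = ParameterString(name="InputTrainData", default_value=train_s3_uri)
input_validation_data = ParameterString(name="InputValidationData", default_value=test_s3_uri)
training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")
genome_group = ParameterString(name="genomeGroup", default_value="ALL")
mme_model_location = ParameterString(name="MMEModelLocation", default_value=f"s3://{bucket}/gene-expression/mme-models/")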

If you are using SageMaker Studio, you can visualize what each step of the pipeline actually looks like:

[Pipeline graph as rendered in SageMaker Studio]

To execute the pipeline, an example is:

execution = pipeline.start(
    parameters=dict(
        genomeGroup="ALL"
    )
)

This will in turn execute the entire pipeline, building a model on all of the genes. If you instead want a model for just one group of genes, you would start an execution with:

execution = pipeline.start(
    parameters=dict(
        genomeGroup="metagene_19"  #points to a specific group of predefined metagenes
    )
)
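Either way, you can monitor the run from the notebook:

# Block until the pipeline run finishes, then inspect each step's status.
execution.wait()
for step in execution.list_steps():
    print(step["StepName"], step["StepStatus"])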

Once a model is trained, it is registered in the SageMaker Model Registry. An important piece of this logic is that if the performance of the model is below a certain threshold, the pipeline execution stops and does not register the model. The model is only registered and deployed to the MME if its performance is above the desired threshold (in the example code, we use a threshold of 0.4).
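This gating is what the step_cond entry in the pipeline definition implements. A minimal sketch of such a condition step, assuming the evaluation step writes its metric to a property file (the metric name, JSON path, and step names here are illustrative):

from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

# Read the metric written by the evaluation step; only register and
# deploy the model when it clears the 0.4 threshold.
cond_gte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,  # PropertyFile emitted by step_eval (assumed)
        json_path="metrics.score.value",
    ),
    right=0.4,
)

step_cond = ConditionStep(
    name="CheckEvaluationScore",
    conditions=[cond_gte],
    if_steps=[step_register, step_deploy],  # register + copy artifact to the MME prefix (assumed step names)
    else_steps=[],  # below threshold: end without registering
)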

Invocation

When invoking the model, the user will pass not only the data, but also which specific model to use. An example is:

payload = {
    "inputs" : X_val[['LRIG1', 'HPGD', 'GDF15']].iloc[0:5, :].values
}

predictor.predict(payload, target_model="/model-metagene_19.tar.gz")

In this specific call, the user selects the model for the 19th metagene cluster by pointing to that specific model artifact; the prediction thus reflects results predicted exclusively by the model built from that gene signature.
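Because every model sits behind the same endpoint, comparing gene signatures amounts to looping over target models. A sketch, reusing the illustrative METAGENE_GROUPS mapping from earlier:

# Query several signature-specific models for the same patients by
# switching only the target_model argument.
for group, genes in METAGENE_GROUPS.items():
    payload = {"inputs": X_val[genes].iloc[0:5, :].values}
    prediction = predictor.predict(payload, target_model=f"/model-{group}.tar.gz")
    print(group, prediction)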

Conclusion

In this article, we have shown how you can leverage SageMaker Pipelines in conjunction with SageMaker Multi-Model Endpoints to build and query models based on gene expression data. Specifically, leveraging the MME allows you to build applications in which data scientists deploy distinct models and users specify which features they wish to pass and which model to query. This can be useful for publishing models based on gene expression profiles, where there may be models supporting predictions for many scores of gene signatures.

References

[1] Nagy Á, Munkácsy G, Győrffy B. Pancancer survival analysis of cancer hallmark genes. Sci Rep. 2021 Mar 15;11(1):6047. doi: 10.1038/s41598-021-84787-5. PMID: 33723286; PMCID: PMC7961001.

[2] Raman P, et al. A comparison of survival analysis methods for cancer gene expression RNA-Sequencing data. Cancer Genet. 2019 Jun;235-236:1-12. doi: 10.1016/j.cancergen.2019.04.004. Epub 2019 Apr 12. PMID: 31296308.

[3] Bakr S, et al. A radiogenomic dataset of non-small cell lung cancer. Sci Data. 2018 Oct 16;5:180202. doi: 10.1038/sdata.2018.202. PMID: 30325352; PMCID: PMC6190740.

[4] Zhou M, et al. Non-Small Cell Lung Cancer Radiogenomics Map Identifies Relationships between Molecular and Imaging Phenotypes with Prognostic Implications. Radiology. 2018 Jan;286(1):307-315. doi: 10.1148/radiol.2017161845. Epub 2017 Jul 20. PMID: 28727543; PMCID: PMC5749594.

About the Authors

Joshua Broyde is a Senior AI/ML Solutions Architect within the Healthcare and Life Sciences Industry Business Unit at AWS. Shamika Ariyawansa is an AI/ML Solutions Architect within the Healthcare and Life Sciences Industry Business Unit at AWS.