Easily right-size bioinformatics workflows in AWS HealthOmics


This article provides guidance on right-sizing the compute, memory, and storage resources specified for private workflows that process omics data in AWS HealthOmics. The guidance helps customers optimize their per-sample processing cost without compromising their pace of innovation.

Article contributed by Sunil Aladhi, Senior Technical Account Manager

AWS HealthOmics helps customers run bioinformatics workflows at scale. Customers can bring their own private workflows to process their biological data without needing to manage the underlying infrastructure. HealthOmics supports the bioinformatics domain-specific languages WDL, Nextflow, and CWL. In these languages, customers specify the vCPU, memory, and storage resources needed to process their data. This often involves starting with educated guesses about resource requirements and then tuning those requirements over multiple runs of the workflow. To simplify this price-performance optimization, AWS HealthOmics announced support for detailed utilization metrics about workflow runs. These metrics show the utilization rates of vCPU, memory, and storage, so that those resources can be right-sized. The metrics are generated for every workflow run on or after Feb 23, 2024, with no intervention from customers.

Upon completion of a workflow run, utilization metrics are included in the workflow manifest and reported to the 'AWS/Omics/WorkflowLog' namespace in the customer's CloudWatch. Omics Run Analyzer (OR Analyzer) is a tool written in Python that reads those utilization metrics and generates statistics for a successful HealthOmics workflow run. OR Analyzer outputs the statistics as a CSV file. To run OR Analyzer, download it and execute it. The OR Analyzer GitHub repository has examples of how to use the tool. Note that OR Analyzer depends on the 'docopt' Python package, which should be installed with pip before using the tool.

Once OR Analyzer is confirmed to run successfully, customers can run it against a workflow run to get compute, memory, and storage recommendations based on that run's utilization metrics. Customers can find workflow runs in the AWS HealthOmics console, under 'Workflows' and then 'Runs'. They can also use the AWS CLI or the API to list workflow runs.
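As an illustration of the API route, the following is a minimal sketch that lists completed runs through the boto3 'omics' client and its list_runs call. The helper name list_completed_runs, the page size, and the choice to filter on the COMPLETED status are assumptions made for this example; the client is passed in so that the function itself does not need AWS credentials until it is called.

```python
def list_completed_runs(omics_client, max_results=50):
    """Return (run id, run name) pairs for runs whose status is COMPLETED.

    omics_client is expected to be a boto3 client created with
    boto3.client("omics"). The helper name and defaults are illustrative.
    """
    runs = []
    kwargs = {"status": "COMPLETED", "maxResults": max_results}
    while True:
        page = omics_client.list_runs(**kwargs)
        for item in page.get("items", []):
            runs.append((item["id"], item.get("name", "")))
        token = page.get("nextToken")
        if not token:
            return runs
        # Ask for the next page of results.
        kwargs["startingToken"] = token


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   for run_id, name in list_completed_runs(boto3.client("omics")):
#       print(run_id, name)
```

The run IDs returned this way are the values that OR Analyzer expects as its run argument.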

Once a run is identified, customers can run OR Analyzer using the command "omics-run-analyzer <Run-ID> -o <output.csv>". Here is a snippet of output from a sample CSV file generated by OR Analyzer for an nf-core/scrnaseq workflow run.

[Screenshot: snippet of the OR Analyzer CSV output for an nf-core/scrnaseq workflow run]

The column sizeReserved shows the omics instance type provisioned by AWS HealthOmics based on the resource requirements specified in the Nextflow script. The next column, sizeMinimum, shows the right-sized instance type for the task based on utilization metrics. Whenever the instance type differs between these two columns, there is a potential opportunity to right-size the resource requirements. In the above screenshot, there are 10 tasks, and there is an opportunity to right-size the resources for 9 of them.
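The comparison of the two columns can be automated. The sketch below reads OR Analyzer style CSV text and reports the rows where sizeReserved and sizeMinimum differ. Only those two column names come from this article; the 'name' column, the sample rows, and the instance type shown for STAR_GENOMEGENERATE are hypothetical and may need to be adjusted to match the actual file.

```python
import csv
import io

# Hypothetical sample data. Only the sizeReserved and sizeMinimum column
# names are taken from the article; everything else is illustrative.
SAMPLE_CSV = """name,sizeReserved,sizeMinimum
scrnaseq-run,,
GTF_GENE_FILTER,omics.r.large,omics.c.large
MTX_TO_SEURAT,omics.r.2xlarge,omics.c.large
STAR_GENOMEGENERATE,omics.m.4xlarge,omics.m.4xlarge
"""


def find_rightsizing_candidates(csv_text):
    """Return (name, reserved, minimum) for rows whose instance types differ."""
    candidates = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        reserved = (row.get("sizeReserved") or "").strip()
        minimum = (row.get("sizeMinimum") or "").strip()
        # Rows without instance sizes (for example a run-level row) are skipped.
        if reserved and minimum and reserved != minimum:
            candidates.append((row.get("name", ""), reserved, minimum))
    return candidates


for name, reserved, minimum in find_rightsizing_candidates(SAMPLE_CSV):
    print(f"{name}: {reserved} -> {minimum}")
```

To use it on a real file, the CSV text can be read with open("output.csv").read() and passed to the same function.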

To illustrate this further, let’s review several tasks.

  • NFCORE_SCRNASEQ:SCRNASEQ:GTF_GENE_FILTER requested resources that required HealthOmics to provision an omics.r.large instance with 2 vCPUs and 16 GB of memory. Based on utilization metrics, however, omics.c.large with 2 vCPUs and 4 GB of memory is recommended as the ideal instance type because the task's memory utilization is low. This recommendation saves ~32.5% on the instance cost for the task.
  • NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT has a potential opportunity for both vCPU and memory right-sizing. Low CPU and memory utilization suggest that omics.r.2xlarge can be right-sized to an omics.c.large instance, potentially saving up to 83% on that individual task.
  • NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_GENOMEGENERATE has no clear opportunity for reducing resources, because both CPU utilization and memory utilization are above 50%. However, the high vCPU utilization (75%) may be an opportunity to increase compute resources and shorten the turnaround time for the task.
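The savings percentages quoted above follow from comparing the hourly price of the reserved instance type with that of the recommended one. The sketch below shows the arithmetic, assuming the task runs for the same duration on both instance types. The prices are placeholders in relative cost units chosen to reproduce the percentages in this article; they are not actual HealthOmics prices, which should be taken from the pricing page for the relevant Region.

```python
def savings_percent(reserved_hourly_price, minimum_hourly_price):
    """Percentage saved by moving from the reserved to the minimum instance type.

    Assumes the task runtime is unchanged after right-sizing.
    """
    if reserved_hourly_price <= 0:
        raise ValueError("reserved_hourly_price must be positive")
    return (1 - minimum_hourly_price / reserved_hourly_price) * 100


# Placeholder relative prices, not real pricing data.
print(round(savings_percent(1.0, 0.675), 1))  # first task above (~32.5%)
print(round(savings_percent(1.0, 0.17), 1))   # second task above (~83%)
```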

Storage is specified at the workflow level, and HealthOmics can provision a static filesystem as shared storage for all the tasks. Static storage utilization metrics are at the end of the first row in the CSV output file. Here is a snippet from the nf-core/scrnaseq workflow run output showing static storage utilization.

[Screenshot: static storage utilization in the first row of the OR Analyzer CSV output]

In the screenshot, the static storage reserved is 2400 GiB, but the maximum storage utilization was only ~519 GiB. The static storage can therefore be right-sized to 1200 GiB, for a savings of 50%. Note that HealthOmics supports a minimum size of 1200 GiB for static storage, so the requested storage cannot be lower than that. Also, the storage needs of a workflow can increase with the size of the omics input data, and customers should account for that when right-sizing.
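The storage calculation can be sketched as follows. The 1200 GiB minimum and the 2400 GiB and 519 GiB figures come from this article. The 20% headroom factor is an assumption added to leave room for larger inputs, and the sketch does not model any other constraints on the storage sizes that HealthOmics accepts.

```python
import math

MIN_STATIC_STORAGE_GIB = 1200  # minimum static storage size stated in the article


def recommend_static_storage(peak_gib, headroom=1.2):
    """Return a storage size in GiB covering peak usage plus headroom.

    headroom is an assumed safety factor, not a HealthOmics setting.
    """
    needed = math.ceil(peak_gib * headroom)
    return max(MIN_STATIC_STORAGE_GIB, needed)


def storage_savings_percent(reserved_gib, recommended_gib):
    """Percentage reduction in provisioned storage."""
    return (1 - recommended_gib / reserved_gib) * 100


recommended = recommend_static_storage(519)
print(recommended, storage_savings_percent(2400, recommended))
```

With a peak of 519 GiB, the headroom-adjusted requirement is still below the minimum, so the floor of 1200 GiB determines the recommendation.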

Summary: Customers often go with default resource specifications for industry-standard workflows, or make educated guesses about resources for private workflows. Previously, they had to spend time and effort checking the utilization of those resources. With the release of the utilization metrics feature and the OR Analyzer utility, customers can right-size their resources with minimal effort and optimize their spend on AWS HealthOmics private workflows.