A Brief Primer to Onboarding Data To a Healthcare and Life Sciences Data Mesh Leveraging AWS Services

When leveraging a data mesh, Healthcare and Life Sciences organizations face unique compliance, regulatory, and other hurdles. This article describes 3 common challenges faced by pharmaceutical and healthcare companies when onboarding data to a data mesh, as well as solutions for how to address those challenges.

Authors: Disha Umarwani and Joshua Broyde

Large pharmaceutical and healthcare companies frequently face unique challenges when purchasing, organizing, and understanding data. This is further complicated by the fact that different datasets have different uses throughout the drug development process, from pre-discovery to distribution and marketing of therapeutic products.

Ultimately, many AWS Healthcare and Life Sciences (HCLS) customers want to get to the following stage:

[Figure: diagram of the target data mesh architecture]

An important paradigm that many HCLS customers can adopt to aid their data ingestion and analytics is a data mesh architecture. A key idea behind a data mesh architecture is that data is viewed as a product: it is maintained and updated by a single group, and users are then granted access to it.

As you can see from the diagram, data is provided by third-party vendors and needs to be onboarded before it can be used for analytics. Note that Lines of Business (LOBs) also create their own datasets, which can be used by other LOBs.

In this blog post, we briefly discuss 3 challenges commonly faced within the Healthcare and Life Sciences Industry when onboarding datasets to a data mesh:

  • How to prevent redundant purchase and maintenance of data.
  • How to ensure that datasets have enough context and information when they are used for analytics.
  • How to ensure that datasets are onboarded in a consistent manner.

Let's first explain what a data mesh is.

What is a Data Mesh?

A data mesh is a decentralized way of organizing data that allows different Lines of Business (LOBs) to process, catalog, and share their datasets with other teams and LOBs. A key point, though, is that while data creation and processing are decentralized, the data catalog and the governance model for data access are centralized. A data mesh stands in contrast to a purely centralized or pub-sub model for managing datasets. The data mesh paradigm has been found to be helpful in many use cases.

There are a few key concepts and personas in the data mesh:

  • A Producer creates the refined dataset, including cleaning, data quality checks, ETL, etc. Each producer owns and operates multiple Data Products with its own data and technology stack, which is independent from others.
  • The Consumer leverages the data product; note that there may be many consumers of a single dataset.
  • The Data Steward is the gatekeeper and maintainer for the data product, and approves and maintains the data glossary and access. For example, if you wanted to access the data, you would request access from the Data Steward.
  • The Central Data Platform regulates what producers and consumers can do; for example, by setting standards for producers to acquire data and providing shared capabilities across producers when needed.
  • A separate Central Data Office frequently also signs legal contracts, sets governance guiding principles, and trains technical resources on compliance and data handling.
  • A data product includes the physical data assets, all datasets from source to final transformed version, the storage infrastructure, the code to transform the data, metadata tags, a glossary, and the associated personas for management. This can be compared to a microservice, where the data is a self-sufficient unit that is part of a larger application.

You can read more information about the data mesh here or Zhamak Dehghani’s (who coined the term data mesh) discussion of the data mesh here. You can see here an example of an AWS reference architecture for a data mesh.

Next, we will dive into some challenges that need to be considered when leveraging a data mesh in HCLS industries.

How to Prevent Redundant Purchase of Healthcare and Life Science Data


In large enterprises, the same dataset can frequently help different lines of business (LOBs) and teams solve different use cases. However, this can lead different LOBs to purchase and process the same datasets separately, resulting in redundant work and money wasted on the redundant purchase. In addition, underlying datasets may themselves have duplicate records. For example, physician prescription data for the Diabetes Therapeutic Area can be obtained from pharmacy sales data as well as from insurance claims data. Because different data vendors tend to work with different pharmacies and insurance companies, it is possible for an LOB to purchase this dataset twice unless the data is analyzed and cataloged carefully.


The data mesh makes it easy to maintain data products without duplicating datasets across LOBs. Specifically, a Central Data Office, which consists of acquisition leads, collaborates with LOBs to source, evaluate, subscribe to, and provision “fit for purpose” data. It adopts data evaluation standards that are applied before data is purchased by LOBs. After the data is acquired, it should be handed over to LOB Data Stewards, who create a team to manage the data asset for ingestion, storage, and data access. The Data Steward, along with data experts who understand the data, puts restrictions on usability and access according to data platform governance standards.

The LOB handles the day-to-day activities: managing the Service-Level Agreement (SLA) for data delivery, applying data quality checks, responding to failures to deliver data, approving access, and auditing and monitoring. The datasets need to have complete information regarding the data source, subscription details, expiry of the dataset, frequency of delivery, and associated data personas. In healthcare and life sciences this becomes an even bigger responsibility, as the steward needs to be aware of data coding guidelines, be a point of contact to answer all questions about appropriate use of the data, and set data quality standards in accordance with industry standards such as FHIR and HL7.

Once the data is purchased, it is important to make it findable and accessible by authorized users. This can be achieved using an enterprise catalog where data producers list their data products and data consumers have read access to find the datasets. This requires a business glossary along with integrated connectors that provide information about data lineage, data quality, data profile, associated personas, and the hierarchical and logical grouping of databases and tables. The catalog should also integrate with the access workflow.
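As a minimal sketch of how a catalog lookup could flag a redundant purchase before a contract is signed, the Python below models a catalog entry with the fields mentioned above (source, subscription expiry, delivery frequency, and personas). The class name, function names, sample values, and the choice to match on vendor plus dataset name are illustrative assumptions, not part of any AWS service.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Hypothetical enterprise catalog record for one data product."""
    dataset_name: str
    vendor: str
    owning_lob: str
    data_steward: str
    delivery_frequency: str  # e.g. "monthly"
    subscription_expiry: date
    tags: set = field(default_factory=set)


def _key(vendor, dataset_name):
    # Normalize case and whitespace so trivial spelling differences still match.
    return (vendor.strip().lower(), dataset_name.strip().lower())


def find_existing_subscriptions(catalog, vendor, dataset_name, today):
    """Return active catalog entries that match a proposed purchase."""
    wanted = _key(vendor, dataset_name)
    return [
        entry for entry in catalog
        if _key(entry.vendor, entry.dataset_name) == wanted
        and entry.subscription_expiry >= today
    ]


# Sample catalog content (made up for illustration).
catalog = [
    CatalogEntry(
        dataset_name="Pharmacy Sales - Diabetes",
        vendor="Example Vendor A",
        owning_lob="Commercial",
        data_steward="steward@example.com",
        delivery_frequency="monthly",
        subscription_expiry=date(2030, 1, 1),
        tags={"prescriptions", "diabetes"},
    )
]

# A second LOB checks the catalog before starting a purchase.
matches = find_existing_subscriptions(
    catalog, "example vendor a", "pharmacy sales - diabetes ", date(2025, 6, 1)
)
for entry in matches:
    print(f"Already owned by {entry.owning_lob}; contact {entry.data_steward}")
```

Exact matching on vendor and name would not catch overlapping data sold by different vendors, so in practice this kind of check would likely complement, not replace, review by the acquisition leads.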

Giving Healthcare and Life Science data context by leveraging Data-as-a-Product


Data frequently lacks context; it is unclear how the dataset relates to other datasets, or how it is to be used. The data may be unprocessed, and those who wish to use the data need to go through the laborious process of cleaning the data and ensuring it lacks errors.


In order to address this challenge, it is important to maintain a data-as-a-product mentality when onboarding data to a data mesh. This solution needs to incorporate knowledge and skills from 2 separate people:

  • A data engineer or data custodian that performs the ETL and transformation of the data, necessary joining, and other cleaning of the data. The data engineer may be joined by a data scientist, who will help with any statistical analysis.
  • A Subject Matter Expert (SME) or data product owner who understands the data. This SME is expected to have a deep understanding of the underlying data, and of how the data relates to itself and to other data sources. The SME tells the data engineer and data scientist which specific operations need to be performed.

An example of this process would be if the SME identifies a column of the data that is irrelevant and not useful, and communicates with the Data Engineer to design the pipeline to remove that one column. Note that the data engineer would not be expected to identify this without SME input.

The data product then is exposed to the consumer who performs further analysis and research. For the data to be consumable, it needs to have all the information such as the source of the data, a data dictionary including column description, and other metadata.
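The collaboration described above can be sketched as follows: the SME supplies a list of columns to drop and a data dictionary, and the pipeline applies the former and checks the latter. This is only an illustration over plain Python dictionaries; the function names, column names, and sample values are assumptions.

```python
def apply_sme_rules(rows, columns_to_drop):
    """Remove SME-flagged columns from each record (a list of dicts)."""
    drop = set(columns_to_drop)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


def undocumented_columns(rows, data_dictionary):
    """Return columns present in the data that lack a dictionary description."""
    seen = set()
    for row in rows:
        seen.update(row.keys())
    return sorted(col for col in seen if not data_dictionary.get(col, "").strip())


# Sample vendor records (made up for illustration).
raw_rows = [
    {"patient_token": "p-001", "drug_code": "A10BA02", "legacy_batch_id": "x17"},
    {"patient_token": "p-002", "drug_code": "A10BB12", "legacy_batch_id": "x18"},
]

# Input provided by the SME, not inferred by the data engineer.
sme_columns_to_drop = ["legacy_batch_id"]
data_dictionary = {
    "patient_token": "De-identified patient identifier issued by the vendor.",
    # "drug_code" has no description yet, so the check below should flag it.
}

clean_rows = apply_sme_rules(raw_rows, sme_columns_to_drop)
missing = undocumented_columns(clean_rows, data_dictionary)
print(missing)  # ['drug_code']
```

A check like `undocumented_columns` could gate publication, so that a data product is not exposed to consumers until every remaining column has a description.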

An Example: Real World Evidence as a Data Product

An example of how pharmaceutical companies can leverage data-as-a-product is in the context of Real World Evidence (RWE) data sources. RWE is data that pertains to drug safety that may not be found in standard clinical trials, such as patient records from multiple sources for a specific disease, and patient illness and treatment history. In an enterprise scenario, each LOB makes strategic decisions to invest in RWE and reaches out to the RWE team (which can be centralized or decentralized) to perform market research, identify the data assets, and provide a list back to the LOB. The LOBs negotiate contracts and acquire datasets. Once the assets are acquired via different mediums, they are aggregated and augmented with other data assets to create a data product leveraging the data mesh.

Standardizing Purchasing and Data On-Boarding Procedures

To on-board data products on the data mesh, producers build Extract-Transform-Load (ETL) pipelines. A key issue is that organizations need to have standard data ingestion patterns, which are approved by the security and infrastructure team. Furthermore, there is a need to automate security checks (e.g. data encryption) to ensure the standards are followed without much manual intervention.
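One way to picture an automated check is a small validation step that runs before a pipeline is deployed. The sketch below assumes pipelines are described by a simple dictionary; that format, the pattern names, and the field names are hypothetical. The two encryption values are the ones Amazon S3 uses for server-side encryption settings.

```python
# Hypothetical organization-wide standards; names and values are assumptions.
APPROVED_PATTERNS = {"s3_batch_drop", "streaming_firehose", "data_exchange_subscription"}
ACCEPTED_ENCRYPTION = {"aws:kms", "AES256"}  # S3 server-side encryption values


def security_violations(pipeline):
    """Return human-readable violations for a hypothetical pipeline definition."""
    problems = []
    pattern = pipeline.get("pattern")
    if pattern not in APPROVED_PATTERNS:
        problems.append(f"ingestion pattern '{pattern}' is not approved")
    for target in pipeline.get("targets", []):
        name = target.get("name", "<unnamed>")
        if target.get("server_side_encryption") not in ACCEPTED_ENCRYPTION:
            problems.append(f"target '{name}' is not encrypted at rest")
        if not target.get("enforce_tls", False):
            problems.append(f"target '{name}' does not enforce TLS in transit")
    return problems


# Sample pipeline definition (made up for illustration).
pipeline = {
    "pattern": "s3_batch_drop",
    "targets": [
        {"name": "raw-zone", "server_side_encryption": "aws:kms", "enforce_tls": True},
        {"name": "curated-zone", "server_side_encryption": None, "enforce_tls": True},
    ],
}

violations = security_violations(pipeline)
for problem in violations:
    print("BLOCKED:", problem)
```

Returning a list of violations, instead of failing on the first one, lets a producer fix everything in one pass before resubmitting the pipeline for approval.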


There are many challenges when on-boarding datasets from 3rd party vendors. Major ones commonly seen in the context of HCLS are:

  • When analyzing patient data, leveraging Master Data Management (MDM) is essential. MDM ensures that de-identified patient data are correctly mapped across different datasets. For example, MDM would help keep track that a specific patient in a health insurance dataset is the same patient as a patient in a clinical trial.
  • Ensuring HIPAA and GxP compliance is critical. For example, ensuring proper encryption and security of PHI and PII is a common issue.
  • Managing data lineage from source to target data product in a drug development lifecycle is critical to respond to any regulatory enquiries for a specific drug.
  • Since some vendors deliver data in real time and others in batch, it is essential that the ingestion pipeline be able to handle both seamlessly.
  • Some data vendors have terms and conditions to use the dataset for a specific LOB. In addition, vendors may have a time-out on the data, after which the dataset may not be used.
  • Data purchasing contracts are frequently not standardized, and thus organizing different contracts is complex.
  • The external data producer may have legacy methods for delivering data.
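To make the MDM point in the list above concrete, here is a simplified sketch of a crosswalk that maps source-specific, de-identified patient tokens to a single master patient ID. Real MDM systems are considerably more involved; the crosswalk structure, identifiers, and function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical crosswalk maintained by the MDM process:
# (source dataset, source-specific de-identified token) -> master patient ID
crosswalk = {
    ("insurance_claims", "tok-9f3a"): "MPID-0001",
    ("clinical_trial", "subj-0042"): "MPID-0001",
    ("insurance_claims", "tok-77b1"): "MPID-0002",
}


def link_records(records, crosswalk):
    """Group records by master patient ID; collect records with no mapping."""
    linked = defaultdict(list)
    unmatched = []
    for record in records:
        master_id = crosswalk.get((record["source"], record["token"]))
        if master_id is None:
            unmatched.append(record)
        else:
            linked[master_id].append(record)
    return dict(linked), unmatched


# Sample records from two vendors (made up for illustration).
records = [
    {"source": "insurance_claims", "token": "tok-9f3a", "event": "claim filed"},
    {"source": "clinical_trial", "token": "subj-0042", "event": "enrolled"},
    {"source": "insurance_claims", "token": "tok-0000", "event": "claim filed"},
]

linked, unmatched = link_records(records, crosswalk)
print(len(linked["MPID-0001"]), "records linked to MPID-0001")
print(len(unmatched), "record(s) need MDM review")
```

Keeping unmatched records in a separate list, and not discarding them, gives the Data Steward a queue of tokens that still need to be resolved.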


AWS Data Exchange can help with many of these complexities. For example, AWS Data Exchange allows for signing contracts and renewing data, with no need for any ETL pipeline management or connection establishment. Within Data Exchange, a vendor puts the data in an Amazon S3 bucket along with a glossary of the data. In turn, a subscriber with the appropriate authority can sign the contract in the AWS console after reading the terms and conditions of the license. AWS Data Exchange also supports the ability to renew batches of data.

Furthermore, because AWS Data Exchange supports license management using AWS License Manager, businesses can choose which LOBs have access to the datasets, thus allowing selective use of the data. Licenses can also expire after a set period of time, thus automatically blocking access. Within the context of real-time data, AWS Data Exchange can be used for signing contracts, and a number of AWS services can be used for streaming real-time data. For example, Amazon Kinesis Data Firehose can be used to stream data to S3.
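As a rough illustration of the streaming path, the sketch below sends one JSON record to a Firehose delivery stream through the `put_record` call that boto3's Firehose client exposes. The stream name, the record contents, and the helper function are assumptions, and a local stand-in client is used so the example runs without AWS credentials; in a real pipeline the client would come from `boto3.client("firehose")`.

```python
import json


def send_to_firehose(client, stream_name, record):
    """Send one JSON record to a Firehose delivery stream.

    `client` is expected to behave like boto3.client("firehose").
    """
    # Newline-delimit records so they can be split apart again once they land in S3.
    payload = (json.dumps(record) + "\n").encode("utf-8")
    response = client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )
    return response["RecordId"]


class FakeFirehoseClient:
    """Local stand-in (not an AWS class) so the sketch runs offline."""

    def __init__(self):
        self.sent = []

    def put_record(self, DeliveryStreamName, Record):
        self.sent.append((DeliveryStreamName, Record["Data"]))
        return {"RecordId": f"fake-{len(self.sent)}"}


client = FakeFirehoseClient()
record_id = send_to_firehose(
    client,
    "hypothetical-vendor-stream",  # assumed stream name
    {"event": "claim", "amount": 120},
)
print(record_id)
```

Passing the client in as a parameter keeps the ingestion logic testable and leaves credential and region handling to whatever approved pattern the platform team provides.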

For addressing regulatory requirements for data lineage, the ETL process should be configured to not only transform the dataset, but also to preserve data lineage. The AWS Data Exchange supports revisions when publishing datasets, which can help with this requirement. Another example is the SageMaker Feature Store, which allows for multiple versions of data features in the context of machine learning.
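One simple way an ETL step could preserve lineage is to emit a small record alongside its output that captures what went in, what came out, and which source revision was used. The sketch below is an assumption-laden illustration: the field names, the revision label, and the hashing approach are not prescribed by AWS Data Exchange or any other service.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Stable SHA-256 hash of a list of JSON-serializable records."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def transform_with_lineage(rows, transform, step_name, source_revision):
    """Run a transform and return its output together with a lineage record."""
    output = transform(rows)
    lineage = {
        "step": step_name,
        "source_revision": source_revision,  # e.g. the vendor revision that was ingested
        "input_sha256": fingerprint(rows),
        "output_sha256": fingerprint(output),
        "row_count_in": len(rows),
        "row_count_out": len(output),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    return output, lineage


def keep_valid_doses(rows):
    """Example transform: drop records with a non-positive dose."""
    return [row for row in rows if row["dose_mg"] > 0]


# Sample input (made up for illustration).
rows = [
    {"patient_token": "p-001", "dose_mg": 500},
    {"patient_token": "p-002", "dose_mg": -1},
]

clean, lineage = transform_with_lineage(
    rows, keep_valid_doses, "filter_invalid_doses", "hypothetical-revision-001"
)
print(json.dumps(lineage, indent=2))
```

Storing these records with the data product makes it possible to trace a final dataset back through each step to the vendor revision it came from when a regulatory enquiry arrives.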

Putting it All Together

Putting it all together, a properly architected data mesh allows for onboarding data that can be used for complex analytics, where Lines of Business can leverage datasets and also publish their own. This reduces redundancy, streamlines the onboarding of datasets, and makes datasets discoverable within an enterprise catalog. This blog highlights some of the challenges that a data mesh helps solve with respect to acquiring external data assets and sharing them across the organization. For this model to work, it is important to have a strong Data and Platform Governance model with well-defined people and processes, thus creating an environment of accountability and responsibility in a strictly regulated Healthcare and Life Sciences enterprise.

About the Authors: Disha Umarwani is a Data/ML Engineer on the ProServe team at AWS. Joshua Broyde is a Senior AI/ML Specialist Solutions Architect within the Healthcare and Life Sciences Industry Business Unit at AWS.