How to use Azure Data Lake with SageMaker Data Wrangler

I am working on a project that uses Azure Data Lake, and I am running into issues connecting it to SageMaker Data Wrangler. I need some ideas on how to achieve this, and on what the pros and cons are compared with using S3.

asked a year ago, 400 views
2 Answers
Accepted Answer

Amazon SageMaker Data Wrangler supports Amazon S3, Amazon Athena, and other services as data sources: https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html#data-wrangler-import-storage and https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html

Please review Athena Federated Query, which can be used to connect to data stores outside S3. Athena Federated Query has a JDBC-compliant connector, and a JDBC driver is available for Azure Data Lake, so you can look into connecting through those resources: https://www.cdata.com/drivers/azuredatalake/jdbc/#:%7E:text=The%20Azure%20Data%20Lake%20Storage,with%20Azure%20Data%20Lake%20Storage.

With this solution, speed and throughput will be lower than when using S3 as a data source, because queries go through an intermediate Lambda connector. Data transfer charges will also apply. Please review the following limitations: https://github.com/awslabs/aws-athena-query-federation/wiki/Limitations_And_Issues
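
As a rough illustration of the Athena Federated Query route, here is a minimal Python sketch. It assumes a federated data source has already been registered in Athena for the Lambda connector. The catalog, database, table, and bucket names are hypothetical placeholders, and `build_query_request` and `run_query` are helper names made up for this example. It is not an official recipe, only one way the call could look with boto3.

```python
"""Sketch: query a federated Athena catalog and write the results to S3."""
import time


def build_query_request(catalog, database, table, output_s3_uri, limit=100):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    query = f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'
    return {
        "QueryString": query,
        # Catalog is the Athena data source registered for the Lambda connector
        # (hypothetical name supplied by the caller).
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def run_query(request, poll_seconds=2.0):
    """Submit the query and poll until Athena reports a terminal state."""
    import boto3  # imported lazily so the request builder works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

Calling `run_query(build_query_request("adls_catalog", "sales", "orders", "s3://my-bucket/athena-results/"))` would need valid AWS credentials and a working connector. Every request passes through the Lambda connector, which is where the throughput limitation mentioned above comes from.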

I also wanted to share the following links for reference:

https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-sagemaker-data-wrangler-over-40-third-party-applications-data-sources/

https://aws.amazon.com/blogs/machine-learning/configure-a-custom-amazon-s3-query-output-location-and-data-retention-policy-for-amazon-athena-data-sources-in-amazon-sagemaker-data-wrangler/

Hopefully this is helpful!

AWS
answered a year ago
EXPERT
reviewed a year ago

You can use Rclone to migrate data from Azure Blob Storage to Amazon S3. Once the data is in S3, you can easily access it from Amazon SageMaker Data Wrangler.
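
To make the migration step concrete, here is a minimal Python sketch that assembles and runs an `rclone copy` command. It assumes Rclone is installed and that two remotes have already been set up with `rclone config`. The remote names (`azblob`, `s3`), the container, and the bucket are hypothetical, and `build_rclone_copy` and `run` are helper names invented for this example.

```python
"""Sketch: copy an Azure Blob Storage container to an S3 bucket with rclone."""
import shlex
import subprocess


def build_rclone_copy(src_remote, container, dst_remote, bucket, prefix="", dry_run=True):
    """Return the argv list for an `rclone copy` from Azure Blob to S3."""
    source = f"{src_remote}:{container}"
    # Strip a trailing slash so an empty prefix yields "remote:bucket".
    destination = f"{dst_remote}:{bucket}/{prefix}".rstrip("/")
    command = ["rclone", "copy", source, destination, "--progress"]
    if dry_run:
        # --dry-run lists what would be copied without transferring anything.
        command.append("--dry-run")
    return command


def run(command):
    """Print the command, then execute it; raises if rclone exits non-zero."""
    print("Running:", shlex.join(command))
    return subprocess.run(command, check=True)
```

For example, `run(build_rclone_copy("azblob", "landing", "s3", "my-bucket", prefix="raw"))` performs a dry run first. Passing `dry_run=False` does the actual transfer, which is when the egress charges apply.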

The pros and cons will be highly subjective and depend on your organization's goals. Generally speaking, the cons are paying for storage twice across two cloud providers and the administrative burden of migrating data between them, compared with simply keeping it in one. You also pay cloud egress charges. A pro is that your data lives in multiple hyperscalers, so if one is unavailable you can use the other.
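
The cost side of that trade-off can be roughed out with simple arithmetic. The sketch below is only illustrative: `estimate_costs` is a hypothetical helper, and the per-GB rates are inputs you would take from your own providers' pricing pages, not real prices.

```python
"""Sketch: rough cost comparison for keeping data in two clouds versus one."""


def estimate_costs(data_gb, egress_per_gb, azure_storage_per_gb_month, s3_storage_per_gb_month):
    """Return one-time egress cost and monthly storage cost for both options.

    All rates are caller-supplied assumptions, not published prices.
    """
    return {
        # Charged once by the source cloud when data leaves it.
        "one_time_egress": data_gb * egress_per_gb,
        # Recurring cost if the data stays in both Azure and S3.
        "monthly_storage_both_clouds": data_gb
        * (azure_storage_per_gb_month + s3_storage_per_gb_month),
        # Recurring cost if the Azure copy is deleted after migration.
        "monthly_storage_s3_only": data_gb * s3_storage_per_gb_month,
    }
```

Calling `estimate_costs(1000, 0.05, 0.02, 0.02)` with those placeholder rates shows how keeping both copies doubles the recurring storage bill, while egress is a one-time charge for each full migration.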

AWS
EXPERT
pechung
answered a year ago
