How to use Azure Data Lake with SageMaker Data Wrangler


I am working on a project that uses Azure Data Lake, and I am running into issues connecting it to SageMaker Data Wrangler. I need some ideas on how to achieve this, and on what the pros and cons are compared with using S3.

Asked 1 year ago, 412 views
2 Answers
Accepted Answer

Amazon SageMaker Data Wrangler supports Amazon S3, Amazon Athena, and other services as data sources: https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html#data-wrangler-import-storage https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html

Please review Athena Federated Query, which can be used to connect to data stores outside S3. Athena Federated Query has a JDBC-compliant connector, and a JDBC driver is available for Azure Data Lake, so you can look into connecting the two with these resources: https://www.cdata.com/drivers/azuredatalake/jdbc/#:%7E:text=The%20Azure%20Data%20Lake%20Storage,with%20Azure%20Data%20Lake%20Storage.
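As a rough illustration of the Athena side, here is a minimal sketch of submitting a query against a federated data source with boto3. It assumes the connector has already been deployed and registered in Athena as a data catalog. The catalog name `adls_catalog`, the database `default`, the table `sales`, and the S3 output location are all hypothetical placeholders, not values from this thread.

```python
from typing import Any, Dict


def build_federated_query_request(
    sql: str,
    catalog: str,
    database: str,
    output_location: str,
) -> Dict[str, Any]:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        # The federated data source is addressed through its catalog name.
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_federated_query(request: Dict[str, Any]) -> str:
    """Submit the query and return its execution ID (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # All names below are hypothetical; replace them with your own.
    request = build_federated_query_request(
        sql='SELECT * FROM "sales" LIMIT 10',
        catalog="adls_catalog",
        database="default",
        output_location="s3://example-bucket/athena-results/",
    )
    print(request)  # pass this to start_federated_query(request) to run it
```

Running a small query like this first is one way to check that the connector works before pointing Data Wrangler at Athena.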

With this solution, speed and throughput will be lower than with S3 as the data source, because queries pass through an intermediate Lambda connector. Data transfer charges will also apply. Please review the following limitations: https://github.com/awslabs/aws-athena-query-federation/wiki/Limitations_And_Issues

I also wanted to share the following posts for reference:

https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-sagemaker-data-wrangler-over-40-third-party-applications-data-sources/

https://aws.amazon.com/blogs/machine-learning/configure-a-custom-amazon-s3-query-output-location-and-data-retention-policy-for-amazon-athena-data-sources-in-amazon-sagemaker-data-wrangler/

Hopefully this is helpful!

AWS
Answered 1 year ago
Expert
Reviewed 1 year ago

You can use rclone to migrate data from Azure Blob Storage to Amazon S3. Once the data is in S3, you can easily access it using Amazon SageMaker Data Wrangler.
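A minimal sketch of driving such a copy from Python follows. It assumes rclone is installed and that two remotes have already been set up with `rclone config`. The remote names `azure` and `s3`, and the container, bucket, and path names, are hypothetical.

```python
import subprocess
from typing import List


def build_rclone_copy_command(
    source: str, destination: str, dry_run: bool = True
) -> List[str]:
    """Build an `rclone copy` command from a source remote to a destination remote."""
    command = ["rclone", "copy", source, destination, "--progress"]
    if dry_run:
        # --dry-run reports what would be copied without transferring anything.
        command.append("--dry-run")
    return command


def run_rclone(command: List[str]) -> int:
    """Run the command and return its exit code (raises if rclone fails)."""
    return subprocess.run(command, check=True).returncode


if __name__ == "__main__":
    # Hypothetical remotes and paths; replace them with your own.
    cmd = build_rclone_copy_command(
        "azure:example-container/data", "s3:example-bucket/data"
    )
    print(" ".join(cmd))  # call run_rclone(cmd) to execute it
```

Starting with `dry_run=True` lets you check what would be transferred before any egress charges are incurred.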

Pros and cons will be highly subjective based on your organization's goals, but generally speaking, the cons are paying for storage with two cloud providers and the administrative burden of migrating data between them, versus simply using one. Additionally, you pay cloud egress charges when moving data out of Azure. A pro is having data in multiple hyperscalers, so that if one is unavailable you can use the other.
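To put rough numbers on the cost side, a back-of-the-envelope calculation like the sketch below can help. Every rate here is an input you supply. The values in the example are made-up placeholders, not actual Azure or AWS prices, so check each provider's current pricing pages.

```python
from typing import Dict


def estimate_monthly_cost(
    stored_gb: float,
    azure_storage_rate: float,
    s3_storage_rate: float,
    transferred_gb: float,
    egress_rate: float,
) -> Dict[str, float]:
    """Estimate the monthly cost of keeping one dataset in both clouds.

    Rates are in currency units per GB (per month for storage).
    """
    storage = stored_gb * (azure_storage_rate + s3_storage_rate)
    egress = transferred_gb * egress_rate
    return {"storage": storage, "egress": egress, "total": storage + egress}


if __name__ == "__main__":
    # Placeholder rates for illustration only, NOT real prices.
    costs = estimate_monthly_cost(
        stored_gb=1000,
        azure_storage_rate=0.02,
        s3_storage_rate=0.02,
        transferred_gb=200,
        egress_rate=0.08,
    )
    print(costs)
```

The egress term grows with how often data is re-synced, so a one-time migration and a recurring sync can look very different.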

AWS
Expert
pechung
Answered 1 year ago
