How to use Azure Data Lake with SageMaker Data Wrangler

1

I am working on a project that uses Azure Data Lake, and I am running into issues connecting it to SageMaker Data Wrangler. I need some ideas on how to achieve this. What would be the pros and cons compared with using S3?

asked a year ago · 414 views
2 Answers
1
Accepted Answer

Amazon SageMaker Data Wrangler supports Amazon S3, Amazon Athena, and other services as data sources: https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html#data-wrangler-import-storage https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html

Please review Athena Federated Query, which can be used to connect to data stores outside S3. Athena Federated Query has a JDBC-compliant connector, and a JDBC driver is available for Azure Data Lake, so you can look into connecting through those resources: https://www.cdata.com/drivers/azuredatalake/jdbc/#:%7E:text=The%20Azure%20Data%20Lake%20Storage,with%20Azure%20Data%20Lake%20Storage.

With this solution, speed and throughput will be lower than with S3 as the data source, because queries go through an intermediate Lambda connector. Data transfer charges will also apply. Please review the following limitations: https://github.com/awslabs/aws-athena-query-federation/wiki/Limitations_And_Issues
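As a rough illustration of the federated query approach, here is a minimal Python sketch that checks whether a federated catalog answers queries through Athena before you point Data Wrangler at it. It assumes a connector has already been deployed and registered as an Athena data catalog. The catalog, database, table, bucket, and region names are hypothetical placeholders, and the helper functions `build_query_request` and `run_query` are made up for this example. `run_query` needs boto3 and AWS credentials.

```python
import time


def build_query_request(catalog, database, table, output_s3_uri, limit=10):
    """Build keyword arguments for Athena's StartQueryExecution call.

    All names passed in are placeholders for your own setup.
    """
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}',
        # The catalog is the Athena data catalog backed by the Lambda connector.
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def run_query(request, region_name, poll_seconds=2.0):
    """Submit the query and poll until Athena reports a terminal state."""
    import boto3  # imported lazily so the request builder works without boto3

    athena = boto3.client("athena", region_name=region_name)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical names; replace them with your own catalog, database, and bucket.
    request = build_query_request(
        catalog="azure_adls_catalog",
        database="sales",
        table="orders",
        output_s3_uri="s3://example-bucket/athena-results/",
    )
    print(run_query(request, region_name="us-east-1"))
```

If the test query succeeds here, the same catalog and database should be selectable when you add Athena as a data source in Data Wrangler, although that depends on your connector and permissions.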

I also wanted to share the following posts for reference:

https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-sagemaker-data-wrangler-over-40-third-party-applications-data-sources/
https://aws.amazon.com/blogs/machine-learning/configure-a-custom-amazon-s3-query-output-location-and-data-retention-policy-for-amazon-athena-data-sources-in-amazon-sagemaker-data-wrangler/

Hopefully this is helpful!

AWS
answered a year ago
EXPERT
reviewed a year ago
0

You can use Rclone to migrate data from Azure Blob Storage to Amazon S3. Once the data is in S3, you can easily access it using Amazon SageMaker Data Wrangler.
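For illustration, the copy step could be scripted along these lines. This is only a sketch: it assumes Rclone is installed and that two remotes have already been set up with `rclone config`. The remote names `azure` and `aws`, the container, the bucket, and the prefix are hypothetical, and `build_rclone_copy` is a made-up helper for this example. It defaults to `--dry-run` so nothing is transferred until you have reviewed what would be copied.

```python
import subprocess


def build_rclone_copy(src_remote, container, dst_remote, bucket, prefix="", dry_run=True):
    """Build an `rclone copy` command from an Azure container to an S3 bucket.

    Remote names must match remotes you configured with `rclone config`.
    """
    source = f"{src_remote}:{container}"
    # Strip the trailing slash that appears when no prefix is given.
    destination = f"{dst_remote}:{bucket}/{prefix}".rstrip("/")
    command = ["rclone", "copy", source, destination, "--progress"]
    if dry_run:
        # Only report what would be copied; no data is transferred.
        command.append("--dry-run")
    return command


def run_copy(command):
    """Run the command and raise CalledProcessError if rclone exits non-zero."""
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical names; replace them with your own remotes, container, and bucket.
    cmd = build_rclone_copy("azure", "raw-data", "aws", "example-bucket", "wrangler/input")
    print(" ".join(cmd))
    run_copy(cmd)
```

After checking the dry-run output, call `build_rclone_copy(..., dry_run=False)` to perform the actual transfer.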

The pros and cons are highly subjective and depend on your organization's goals. Generally speaking, the cons are paying for storage twice across multiple cloud providers and the administrative burden of migrating data between them, compared with simply using one. You also pay cloud egress charges. A pro is having data in multiple hyperscalers, so that if one is unavailable, you can use the other.

AWS
EXPERT
pechung
answered a year ago
