Glue Jobs on Large Data Files


Hi Team,

I have a requirement to create an ETL process that transforms data from hundreds of data files (each with a unique schema) into a common-format CSV file. The source files are in S3 bucket folders (each folder is a unique dataset). Sometimes the requirement is to join multiple files within a folder and also apply business logic in the transformation. These files have millions of records.

I have tried Glue Crawler and Glue jobs to create target files using limited data. My question is: how will Glue perform on millions of records, and will it be cost effective? Can you please share information on this? Also, I'm planning to orchestrate each Glue crawler and Glue job from Step Functions. Is this the correct approach? Thank you.

Asked a year ago · 247 views
1 Answer
Accepted Answer

AWS Glue's main focus is exactly the kind of use case you describe, and much larger datasets.
Obviously, depending on the complexity of your joins and transformation logic, you can run into challenges if you don't have previous experience using Apache Spark (which Glue ETL is based on). It's probably worth investing some time understanding how it works and how to monitor it.
Cost effectiveness depends on how efficient your logic is and how you tune your configuration. Glue 4.0 provides a number of improvements and optimizations out of the box that should really help you with that.
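Most of that tuning happens when you define the job: pin it to Glue 4.0, pick a worker type that matches your data volume, and enable auto scaling so you only pay for workers while they are busy. A minimal sketch with boto3 (the job name, IAM role, and script location below are placeholders, not anything from your account):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job name, role, and script path -- substitute your own values.
glue.create_job(
    Name="dataset-a-transform",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="4.0",
    WorkerType="G.1X",        # 4 vCPU / 16 GB per worker; G.2X for heavier joins
    NumberOfWorkers=10,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/dataset_a_transform.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--enable-auto-scaling": "true",  # release idle workers to save DPU-hours
        "--enable-metrics": "true",       # publish job metrics to CloudWatch for monitoring
    },
)
```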
Crawlers are an optional convenience; you can read the CSV files directly if you only need to read them once (i.e., if it is not a table you want to use for other purposes).
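For instance, a Glue job script can point Spark straight at one dataset folder, join the files, apply the business logic, and write the common-format CSV, with no crawler or Data Catalog table in between. A rough sketch, where the bucket, folder names, and columns are made up for illustration:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV files in one dataset folder directly from S3 -- no crawler needed.
orders = spark.read.option("header", "true").csv("s3://my-bucket/dataset-a/orders/")
customers = spark.read.option("header", "true").csv("s3://my-bucket/dataset-a/customers/")

# Example join plus a simple business-logic transformation.
result = (
    orders.join(customers, on="customer_id", how="left")
          .withColumn(
              "order_total",
              F.col("quantity").cast("double") * F.col("unit_price").cast("double"),
          )
          .select("order_id", "customer_name", "order_total")
)

# coalesce(1) produces a single CSV part file; drop it if the output is very large.
result.coalesce(1).write.mode("overwrite").option("header", "true") \
      .csv("s3://my-bucket/output/dataset-a/")

job.commit()
```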
Step Functions requires a bit of learning but allows you to build advanced workflows; for simple workflows, Glue provides triggers and visual workflows built into Glue itself.
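If you do go with Step Functions, a common pattern is: start the crawler with the AWS SDK integration, poll until it returns to READY, then run the Glue job with the synchronous `glue:startJobRun.sync` integration so the state machine waits for the job to finish. A sketch of such a state machine created with boto3 (the crawler name, job name, and role ARN are placeholders):

```python
import json
import boto3

# Hypothetical names and ARNs -- substitute your own crawler, job, and IAM role.
definition = {
    "Comment": "Run a Glue crawler, then a Glue job, for one dataset folder",
    "StartAt": "StartCrawler",
    "States": {
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "dataset-a-crawler"},
            "Next": "WaitForCrawler",
        },
        "WaitForCrawler": {"Type": "Wait", "Seconds": 60, "Next": "CheckCrawler"},
        "CheckCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
            "Parameters": {"Name": "dataset-a-crawler"},
            "Next": "CrawlerDone",
        },
        "CrawlerDone": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.Crawler.State",
                    "StringEquals": "READY",
                    "Next": "RunGlueJob",
                }
            ],
            "Default": "WaitForCrawler",
        },
        "RunGlueJob": {
            "Type": "Task",
            # .sync makes the state machine wait for the job run to complete.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "dataset-a-transform"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="dataset-a-etl",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",
)
```

You could then run one execution of this state machine per dataset folder, or wrap it in a Map state to fan out over all folders.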

AWS
Expert
Answered a year ago
