I need to read S3 data, transform it, and load it into the Data Catalog. Should I be using a Crawler?

0

Files are uploaded to an S3 bucket every hour. I currently have a Glue ETL job that reads the S3 bucket, transforms the data, and writes it to a Glue Data Catalog table. I have seen examples where people use a Glue Crawler that reads the S3 data and writes it to a Data Catalog table, and then an ETL job reads from that table, transforms the data, and writes it to another table (or wherever it needs to go). Should I be using a Crawler? I don't see the need for it if I can just use the ETL job to go S3 -> Transform -> Data Catalog. It would seem the ETL job supports bookmarking (init/commit) just like Crawlers do.
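For context, a minimal sketch of the ETL-only pattern I mean (the bucket, database, and table names here are placeholders, and I'm assuming JSON input written out as Parquet):

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read only new files from S3; transformation_ctx is what lets the job bookmark track progress
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/incoming/"]},
    format="json",
    transformation_ctx="source",
)

# Example transform: rename and cast columns
transformed = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "id", "string"),
        ("ts", "string", "event_time", "timestamp"),
    ],
    transformation_ctx="transformed",
)

# Write to S3 and create/update the Data Catalog table in the same job, no Crawler involved
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/curated/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    transformation_ctx="sink",
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="my_table")
sink.writeFrame(transformed)

job.commit()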

bfeeny
Asked 2 years ago · 1,854 views
1 Answer
1
Accepted Answer

Hi,

AWS Glue Crawlers are used to automatically discover the schema of data in Amazon S3 or other data sources. They also help capture schema evolution.

If your schema is fixed (does not change often) and already known, and you have no issue creating your tables manually via the console or programmatically via the APIs, then you do not need to use them.
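For example, when the schema is already known, the table can be created once with the Glue APIs instead of a Crawler. A minimal sketch with boto3 (the database name, table name, columns, and S3 location are placeholders to adapt to your dataset):

import boto3

glue = boto3.client("glue")

# One-time manual creation of a Data Catalog table for a known, stable schema
glue.create_table(
    DatabaseName="my_db",
    TableInput={
        "Name": "my_table",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "string"},
                {"Name": "event_time", "Type": "timestamp"},
            ],
            "Location": "s3://my-bucket/curated/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)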

Consider also that Crawlers have a cost, so cost optimization might be another reason not to use them if you are comfortable managing the schemas of your datasets yourself.

For additional information on Crawlers, you can refer to this section of the AWS Glue documentation.

Hope this helps.

AWS
Expert
Answered 2 years ago
  • As Fabrizio correctly said, you only need to run the AWS Glue Crawler again if your schema changes. Also, ETL jobs support bookmarking, which is recommended when your data grows day by day: when your job runs, you don't want to perform the ETL operations over all of your data again, and with bookmarking the job only processes new data (it won't reprocess data that has already been processed). See the sketch below for how bookmarks are enabled.

    Read more at: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
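A minimal sketch of turning bookmarks on when defining the job (the job name, role ARN, and script location are placeholders; the script itself must also pass transformation_ctx on its sources and sinks for the bookmark to track progress):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="s3-to-catalog-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/s3_to_catalog.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    # Enable job bookmarks so each run only processes files added since the last commit
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)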
