Specify crawler to crawl from specific files?

I am keeping track of some data on S3, organized by date. The files are stored in directories like this: year=yyyy/month=mm/day=dd, and each directory contains multiple CSV files. I want a crawler to crawl only one of those files per day, across a whole month or year. The files are named in this format: regionA_yyyy_mm_dd.csv, regionB_yyyy_mm_dd.csv. I was wondering whether I can give the crawler part of the file name, like regionA, so that it crawls data only from region A. Is there a way to do this?

Asked a year ago · 1.2K views
1 Answer
Accepted Answer

AWS Glue crawlers do not offer include patterns for S3 targets, but you can get the same effect with exclude patterns. An exclude pattern is a glob evaluated relative to the crawler's include path, so a pattern such as **/regionB_*.csv makes the crawler skip every region B file under the year=/month=/day= directories, leaving only the region A files to be crawled. This keeps the crawl focused on the data subset you want and shortens the run. Alternatively, you can create a table in the AWS Glue Data Catalog for the specific files you are interested in and configure the crawler to update that table. You can also automate the setup with the AWS CLI or Boto3.
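As a rough illustration of the Boto3 route, the sketch below builds the arguments for a crawler that keeps one region by excluding the files of all the others. The bucket path, crawler name, role ARN, database name, and the list of regions are all hypothetical placeholders, and the glob patterns assume the file naming from the question (regionX_yyyy_mm_dd.csv).

```python
def build_crawler_config(name, role_arn, database, s3_path, keep_region, all_regions):
    """Build keyword arguments for glue.create_crawler.

    Every region other than keep_region gets an exclude glob, so only
    the files of keep_region remain visible to the crawler.
    """
    exclusions = [
        f"**/{region}_*.csv"  # ** crosses the year=/month=/day= folders
        for region in all_regions
        if region != keep_region
    ]
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path, "Exclusions": exclusions}]},
    }


def create_crawler(config):
    """Create the crawler in AWS (needs boto3 and valid credentials)."""
    import boto3  # imported here so the config can be built without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**config)


# Hypothetical values for illustration only.
config = build_crawler_config(
    name="regionA-crawler",
    role_arn="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    database="regions_db",
    s3_path="s3://my-example-bucket/data/",
    keep_region="regionA",
    all_regions=["regionA", "regionB", "regionC"],
)
print(config["Targets"]["S3Targets"][0]["Exclusions"])
```

Calling create_crawler(config) would then register the crawler. The same exclude patterns can be entered in the console when adding the S3 data source to a crawler.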

Expert, answered a year ago
Expert, reviewed a year ago
  • Is there a way for one crawler to generate multiple metadata tables? For example, can a crawler generate separate tables for regionA, regionB, regionC, and so on? Or can that only be done by assigning a separate crawler to each region?

  • In AWS Glue, a single crawler can generate metadata for multiple regions by using a combination of custom classifiers, filters, and partitioning strategies. If it is not too urgent, I can come up with something before tomorrow.

  • That would be awesome! Also, where do I enter the patterns so that I can specify the names of the files to crawl?
