How to specify the 'ScanAll' setting for an AWS::Glue::Crawler DynamoDBTarget

I was looking at the Glue Crawler resource creation docs, specifically the DynamoDB Target object: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-crawler-dynamodbtarget.html

The only allowed parameter for a DynamoDB Target of an AWS Glue Crawler resource is 'Path'. Interestingly, when I deployed my crawler, I noticed that the 'data sampling' setting was automatically enabled for my DDB data source. This is NOT the setting I want, so I am looking for a way to specify that the crawler should scan the entire data source (the DDB table).
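
For reference, a minimal sketch of what the documented target block looks like in a template today (names are placeholders):

Targets:
  DynamoDBTargets:
    - Path: myDynamoDBTable  # 'Path' is the DynamoDB table name; no sampling/ScanAll property is documented here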

Asked 1 year ago · 302 views
3 Answers

You need to set ScanAll to true. I agree it is not well documented, but it seems to be the correct behavior based on the underlying Glue API.

Resources:
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: myCrawler
      Role: myCrawlerRole  # required property; placeholder for an existing IAM role the crawler assumes
      DatabaseName: myGlueDatabase
      TablePrefix: myTable
      Targets:
        DynamoDBTargets:
          - Path: myDynamoDBTable
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: DELETE_FROM_DATABASE
      Configuration:
        ScanAll: true
AWS
EXPERT
Answered 1 year ago

Currently, the data sampling setting can only be switched to scan all records through the AWS console or the CLI, so you cannot do this from CloudFormation. Scanning every record in the table can take a very long time depending on its size and is generally not recommended, as it can also exhaust the table's read capacity units (RCUs).

If your intention in scanning the whole table is to account for DynamoDB's non-conformant (flexible) schema, a better approach would be to export the table to S3 using the Export to S3 feature. Since the table contents are dumped to an external system, the export does not affect your table, and you have more control over performance (you control the reads without worrying about table or partition limits).
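
As a rough sketch of that route using the AWS CLI (table ARN, bucket, and prefix are placeholders; the native export requires point-in-time recovery to be enabled on the table):

# Enable point-in-time recovery, which the native export requires
aws dynamodb update-continuous-backups \
  --table-name myDynamoDBTable \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Export the full table to S3 without consuming table RCUs
aws dynamodb export-table-to-point-in-time \
  --table-arn arn:aws:dynamodb:us-east-1:123456789012:table/myDynamoDBTable \
  --s3-bucket my-export-bucket \
  --s3-prefix ddb-exports/ \
  --export-format DYNAMODB_JSON

The exported files land under a ddb-exports/AWSDynamoDB/<export-id>/data/ prefix in the bucket, so an S3 target on the crawler would point at that path rather than at the bucket root.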

AWS
odwa_y
Answered 1 year ago
  • I tried running a crawler on an S3 bucket containing a direct export from DynamoDB, but it just ran and created nothing. The crawler didn't fail, but it didn't create any Data Catalog tables. Could you clarify how you would configure a crawler to run on a DDB export as you described?

I spoke with AWS support; they said the feature isn't currently implemented. As of now, you can only provide a 'Path' value when creating crawler resources via YAML/JSON in CloudFormation. The only workaround is to run an 'update-crawler' CLI command via a script or pipeline after deploying the resource.
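
A hedged sketch of that post-deploy step, reusing the crawler and table names from the template above (in the Glue UpdateCrawler API, the scanAll flag lives on the DynamoDB target itself):

# Flip the crawler's DynamoDB target to a full scan after the stack deploys
aws glue update-crawler \
  --name myCrawler \
  --targets '{"DynamoDBTargets": [{"Path": "myDynamoDBTable", "scanAll": true}]}'

Note that update-crawler replaces the crawler's targets with whatever you pass in, so any other targets on the crawler would need to be included in the same JSON.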

Answered 1 year ago
