How to specify the 'ScanAll' setting for AWS::Glue::Crawler DynamoDBTarget


I was looking at the Glue Crawler resource creation docs, specifically the DynamoDB Target object: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-crawler-dynamodbtarget.html

The only allowed parameter is 'Path' for a DynamoDB Target of an AWS Glue Crawler resource. Interestingly, when I deployed my crawler, I noticed that the 'data sampling' setting was automatically enabled for my DDB data source. This is NOT the setting I want, so I am looking for a way to specify that the Crawler should scan the entire data source (DDB table).

Asked a year ago · 302 views
3 Answers

You need to set ScanAll to true. I agree it is not well documented, but it appears to be the correct behavior based on the core Glue API.

Resources:
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: myCrawler
      Role: myGlueServiceRole        # IAM role the crawler assumes (placeholder name; required by CloudFormation)
      DatabaseName: myGlueDatabase
      TablePrefix: myTable
      Targets:
        DynamoDBTargets:
          - Path: myDynamoDBTable
            ScanAll: true            # mirrors the scanAll flag on the DynamoDBTarget in the Glue CreateCrawler API
      SchemaChangePolicy:
        UpdateBehavior: UpdateInPlace
        DeleteBehavior: DeleteFromMetadata
AWS
EXPERT
answered a year ago

Currently, the data sampling setting can only be changed to scanAll from the AWS console or CLI, so you would not be able to do this from CloudFormation. Scanning all the records in the table can take a very long time depending on the table's size and is generally not recommended, as it can also exhaust all of the table's RCUs.
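For reference, a minimal sketch of enabling scanAll from the CLI when creating the crawler (crawler, role, database, and table names are the same placeholders as in the template above; the --targets JSON follows the shape of the Glue CreateCrawler API):

# Create a crawler whose DynamoDB target scans the whole table instead of sampling
aws glue create-crawler \
  --name myCrawler \
  --role myGlueServiceRole \
  --database-name myGlueDatabase \
  --targets '{"DynamoDBTargets":[{"Path":"myDynamoDBTable","scanAll":true}]}'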

If your intention in scanning the whole table is to account for DynamoDB's non-conformant schema, then a better approach would be to export your table to S3 using the Export to S3 feature. Since the table contents are dumped to an external system, the export will not affect your table, and you will have more control over performance (since you can control the reads without worrying about table limits or partition limits).
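If you go the export route, a minimal sketch of the CLI calls looks like this (the table ARN, account ID, region, and bucket name are placeholders; the table must have point-in-time recovery enabled before it can be exported):

# Enable point-in-time recovery once, if it is not already on
aws dynamodb update-continuous-backups \
  --table-name myDynamoDBTable \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Export the full table to S3 without consuming the table's read capacity
aws dynamodb export-table-to-point-in-time \
  --table-arn arn:aws:dynamodb:us-east-1:123456789012:table/myDynamoDBTable \
  --s3-bucket my-export-bucket \
  --export-format DYNAMODB_JSON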

AWS
odwa_y
answered a year ago
  • I tried running a crawler on an S3 bucket containing a direct export from DynamoDB, but it just ran and created nothing. The crawler didn't fail, but it didn't create any Data Catalog tables. Could you clarify how you might configure a crawler to run on a DDB export like you mentioned?


Spoke with AWS support; they said the feature isn't currently implemented. As of now, you can only provide a 'Path' value when creating crawler resources via YAML/JSON in CloudFormation. The only workaround is to run an 'update-crawler' CLI command via a script or pipeline after deploying the resource, as sketched below.
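A minimal sketch of that workaround, reusing the crawler and table names from the template in the first answer (run it as a post-deployment step in your script or pipeline):

# CloudFormation deploys the crawler with Path only; this call then enables scanAll
aws glue update-crawler \
  --name myCrawler \
  --targets '{"DynamoDBTargets":[{"Path":"myDynamoDBTable","scanAll":true}]}'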

answered a year ago
