How to specify the 'ScanAll' setting for AWS::Glue::Crawler DynamoDBTarget


I was looking at the Glue Crawler resource creation docs, specifically the DynamoDB Target object: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-crawler-dynamodbtarget.html

The only allowed parameter is 'Path' for a DynamoDB Target of an AWS Glue Crawler resource. Interestingly, when I deployed my crawler, I noticed that the 'data sampling' setting was automatically enabled for my DDB data source. This is NOT the setting I want, so I am looking for a way to specify that the Crawler should scan the entire data source (DDB table).

Asked a year ago · 302 views
3 Answers

You need to set ScanAll to true. I agree it is not well documented, but it appears to be the correct behavior based on the core Glue API.

Resources:
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: myCrawler
      Role: myGlueServiceRole        # IAM role the crawler assumes (placeholder name; required by CloudFormation)
      DatabaseName: myGlueDatabase
      TablePrefix: myTable
      Targets:
        DynamoDBTargets:
          - Path: myDynamoDBTable
            ScanAll: true            # mirrors the scanAll flag on the DynamoDBTarget in the Glue CreateCrawler API
      SchemaChangePolicy:
        UpdateBehavior: UpdateInPlace
        DeleteBehavior: DeleteFromMetadata
AWS
EXPERT
answered a year ago

Currently, the data sampling setting can only be changed to scanAll from the AWS console or CLI, so you would not be able to do this from CloudFormation. Scanning all the records in the table can take a very long time depending on the table's size and is generally not recommended, as it can also exhaust all of the table's RCUs.
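For reference, a minimal sketch of enabling scanAll from the CLI when creating the crawler (crawler, role, database, and table names are the same placeholders as in the template above; the --targets JSON follows the shape of the Glue CreateCrawler API):

# Create a crawler whose DynamoDB target scans the whole table instead of sampling
aws glue create-crawler \
  --name myCrawler \
  --role myGlueServiceRole \
  --database-name myGlueDatabase \
  --targets '{"DynamoDBTargets":[{"Path":"myDynamoDBTable","scanAll":true}]}'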

If your intention in scanning the whole table is to account for DynamoDB's non-conformant schema, then a better approach would be to export your table to S3 using the Export to S3 feature. Since the table contents are dumped to an external system, the export will not affect your table, and you will have more control over performance (since you can control the reads without worrying about table limits or partition limits).
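If you go the export route, a minimal sketch of the CLI calls looks like this (the table ARN, account ID, region, and bucket name are placeholders; the table must have point-in-time recovery enabled before it can be exported):

# Enable point-in-time recovery once, if it is not already on
aws dynamodb update-continuous-backups \
  --table-name myDynamoDBTable \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Export the full table to S3 without consuming the table's read capacity
aws dynamodb export-table-to-point-in-time \
  --table-arn arn:aws:dynamodb:us-east-1:123456789012:table/myDynamoDBTable \
  --s3-bucket my-export-bucket \
  --export-format DYNAMODB_JSON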

AWS
odwa_y
answered a year ago
  • I tried running a crawler on an S3 bucket containing a direct export from DynamoDB, but it just ran and created nothing. The crawler didn't fail, but it didn't create any Data Catalog tables. Could you clarify how you might configure a crawler to run on a DDB export like you mentioned?


Spoke with AWS support; they said the feature isn't currently implemented. As of now, you can only provide a 'Path' value when creating crawler resources via YAML/JSON in CloudFormation. The only workaround is to run an 'update-crawler' CLI command via a script or pipeline after deploying the resource, as sketched below.
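A minimal sketch of that workaround, reusing the crawler and table names from the template in the first answer (run it as a post-deployment step in your script or pipeline):

# CloudFormation deploys the crawler with Path only; this call then enables scanAll
aws glue update-crawler \
  --name myCrawler \
  --targets '{"DynamoDBTargets":[{"Path":"myDynamoDBTable","scanAll":true}]}'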

answered a year ago
