AWS Glue Crawler Scalability for Large Number of Delta Tables

0

Question: We currently have approximately 100 tables in delta format, partitioned by yyyy, mm, dd, hh, mm. Our current process involves reading these delta tables via a crawler, cataloging them, and utilizing spectrum tables in Redshift for building business logic.

However, we are encountering scalability limitations due to the maximum of 10 tables per crawler. As we continue to add more tables, adding additional crawlers becomes cumbersome. Additionally, the data volume on some of these tables is substantial, with up to 500k records per hour.

Considering these constraints, what would be the optimal approach to read the delta tables in parallel via the crawler? Can we configure the crawler to utilize an RDS database for improved scalability? Any insights or best practices would be appreciated.

  • Could you share how you're creating these Delta tables? Where is the source data coming from for these tables?

  • We are creating the delta tables via Glue ETL. Source - API.

preguntada hace un mes356 visualizaciones
No hay respuestas

No has iniciado sesión. Iniciar sesión para publicar una respuesta.

Una buena respuesta responde claramente a la pregunta, proporciona comentarios constructivos y fomenta el crecimiento profesional en la persona que hace la pregunta.

Pautas para responder preguntas