Athena taking too long to query hudi dataset.

0

We are trying to query a hudi table stored in s3 and cataloged in glue using Athena. The table has around 400 partitions (based on date) and is of size 120 MB (5.5M records). We have used the configs:

"hoodie.metadata.enable": "true",
"hoodie.metadata.index.column.stats.enable": "true" 

while writing and added a partition index on it as well. The table properties include:

partition_filtering.enabled=TRUE
hudi.metadata-listing-enabled=TRUE

Our sql query is like:

select * from table_name where partitioned_col between 20221006 and 20230530 and created_time between timestamp'2022-10-06 12:01:23 +0530' and timestamp'2023-05-21 00:00:00 +0530';

Without using metadata/partition filtering, the read takes about 12-14 seconds (2 sec planning time, rest execution). After turning on the partition filtering and metadata use, planning phase takes about 9 seconds and execution time comes down to 3-4 seconds.

This ideally should not take so long as the data size is very small. Is there any config we might have messed up or missed? There are also a few topics which we couldn't find online:

  1. Does athena v3 use presto or trino to query on hudi? If it uses presto, is it updated to latest version, as there was recently a change to improve performance.
  2. Does the engine use Hive connector or Hudi connector while querying?
  3. Is the hudi version deployed while querying the latest (0.13.0) or older. We know that EMR is updated with latest Hudi changes but don't know about Athena.
preguntada hace un año86 visualizaciones
No hay respuestas

No has iniciado sesión. Iniciar sesión para publicar una respuesta.

Una buena respuesta responde claramente a la pregunta, proporciona comentarios constructivos y fomenta el crecimiento profesional en la persona que hace la pregunta.

Pautas para responder preguntas