Athena taking too long to query hudi dataset.

0

We are trying to query a hudi table stored in s3 and cataloged in glue using Athena. The table has around 400 partitions (based on date) and is of size 120 MB (5.5M records). We have used the configs:

"hoodie.metadata.enable": "true",
"hoodie.metadata.index.column.stats.enable": "true" 

while writing and added a partition index on it as well. The table properties include:

partition_filtering.enabled=TRUE
hudi.metadata-listing-enabled=TRUE

Our sql query is like:

select * from table_name where partitioned_col between 20221006 and 20230530 and created_time between timestamp'2022-10-06 12:01:23 +0530' and timestamp'2023-05-21 00:00:00 +0530';

Without using metadata/partition filtering, the read takes about 12-14 seconds (2 sec planning time, rest execution). After turning on the partition filtering and metadata use, planning phase takes about 9 seconds and execution time comes down to 3-4 seconds.

This ideally should not take so long as the data size is very small. Is there any config we might have messed up or missed? There are also a few topics which we couldn't find online:

  1. Does athena v3 use presto or trino to query on hudi? If it uses presto, is it updated to latest version, as there was recently a change to improve performance.
  2. Does the engine use Hive connector or Hudi connector while querying?
  3. Is the hudi version deployed while querying the latest (0.13.0) or older. We know that EMR is updated with latest Hudi changes but don't know about Athena.
demandé il y a un an86 vues
Aucune réponse

Vous n'êtes pas connecté. Se connecter pour publier une réponse.

Une bonne réponse répond clairement à la question, contient des commentaires constructifs et encourage le développement professionnel de la personne qui pose la question.

Instructions pour répondre aux questions