Athena taking too long to query hudi dataset.

0

We are trying to query a hudi table stored in s3 and cataloged in glue using Athena. The table has around 400 partitions (based on date) and is of size 120 MB (5.5M records). We have used the configs:

"hoodie.metadata.enable": "true",
"hoodie.metadata.index.column.stats.enable": "true" 

while writing and added a partition index on it as well. The table properties include:

partition_filtering.enabled=TRUE
hudi.metadata-listing-enabled=TRUE

Our sql query is like:

select * from table_name where partitioned_col between 20221006 and 20230530 and created_time between timestamp'2022-10-06 12:01:23 +0530' and timestamp'2023-05-21 00:00:00 +0530';

Without using metadata/partition filtering, the read takes about 12-14 seconds (2 sec planning time, rest execution). After turning on the partition filtering and metadata use, planning phase takes about 9 seconds and execution time comes down to 3-4 seconds.

This ideally should not take so long as the data size is very small. Is there any config we might have messed up or missed? There are also a few topics which we couldn't find online:

  1. Does athena v3 use presto or trino to query on hudi? If it uses presto, is it updated to latest version, as there was recently a change to improve performance.
  2. Does the engine use Hive connector or Hudi connector while querying?
  3. Is the hudi version deployed while querying the latest (0.13.0) or older. We know that EMR is updated with latest Hudi changes but don't know about Athena.
asked a year ago81 views
No Answers

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions