Athena taking too long to query hudi dataset.

0

We are trying to query a hudi table stored in s3 and cataloged in glue using Athena. The table has around 400 partitions (based on date) and is of size 120 MB (5.5M records). We have used the configs:

"hoodie.metadata.enable": "true",
"hoodie.metadata.index.column.stats.enable": "true" 

while writing and added a partition index on it as well. The table properties include:

partition_filtering.enabled=TRUE
hudi.metadata-listing-enabled=TRUE

Our sql query is like:

select * from table_name where partitioned_col between 20221006 and 20230530 and created_time between timestamp'2022-10-06 12:01:23 +0530' and timestamp'2023-05-21 00:00:00 +0530';

Without using metadata/partition filtering, the read takes about 12-14 seconds (2 sec planning time, rest execution). After turning on the partition filtering and metadata use, planning phase takes about 9 seconds and execution time comes down to 3-4 seconds.

This ideally should not take so long as the data size is very small. Is there any config we might have messed up or missed? There are also a few topics which we couldn't find online:

  1. Does athena v3 use presto or trino to query on hudi? If it uses presto, is it updated to latest version, as there was recently a change to improve performance.
  2. Does the engine use Hive connector or Hudi connector while querying?
  3. Is the hudi version deployed while querying the latest (0.13.0) or older. We know that EMR is updated with latest Hudi changes but don't know about Athena.
質問済み 1年前86ビュー
回答なし

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン

関連するコンテンツ