Athena taking too long to query hudi dataset.

0

We are trying to query a hudi table stored in s3 and cataloged in glue using Athena. The table has around 400 partitions (based on date) and is of size 120 MB (5.5M records). We have used the configs:

"hoodie.metadata.enable": "true",
"hoodie.metadata.index.column.stats.enable": "true" 

while writing and added a partition index on it as well. The table properties include:

partition_filtering.enabled=TRUE
hudi.metadata-listing-enabled=TRUE

Our sql query is like:

select * from table_name where partitioned_col between 20221006 and 20230530 and created_time between timestamp'2022-10-06 12:01:23 +0530' and timestamp'2023-05-21 00:00:00 +0530';

Without using metadata/partition filtering, the read takes about 12-14 seconds (2 sec planning time, rest execution). After turning on the partition filtering and metadata use, planning phase takes about 9 seconds and execution time comes down to 3-4 seconds.

This ideally should not take so long as the data size is very small. Is there any config we might have messed up or missed? There are also a few topics which we couldn't find online:

  1. Does athena v3 use presto or trino to query on hudi? If it uses presto, is it updated to latest version, as there was recently a change to improve performance.
  2. Does the engine use Hive connector or Hudi connector while querying?
  3. Is the hudi version deployed while querying the latest (0.13.0) or older. We know that EMR is updated with latest Hudi changes but don't know about Athena.
질문됨 일 년 전86회 조회
답변 없음

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠