Choosing Timestream Dimension and Query Performance

0

Hello! I was reading https://docs.aws.amazon.com/timestream/latest/developerguide/data-modeling.html

I'm trying to understand why choosing the proper set of dimensions is important for query performance. I know it sounds right but I'm more curious what exact mechanism is performed when a query with a WHERE clause with some specific dimensions. Does Timestream use some variant of traditional technique like b-tree index or hash table to find the record that containing a certain dimension? I would understand if it's like a trade secret, but I wanted to hear something like "because of <certain mechanism> Timestream will be always responding with reasonable latency when you you used a query that leverages dimensions, compare to using WHERE clause with measure columns."

Let's say I have a table with 2 dimensions (device_type, location_id) and one measure value (event_type)

time | device_id | device_type | location | event_type
00:01 | device-1 | Mark-1 | home | ON
00:02 | device-2 | Mark-1 | office | ON
00:03 | device-1 | Mark-1 | home | OFF
00:05 | device-3 | Mark-2 | home | ON
00:04 | device-2 | Mark-1 | office | OFF
00:09 | device-3 | Mark-2 | home | OFF

and there are 2 types of query for different business purposes

SELECT COUNT(DISTINCT device_id) FROM "test_db"."table"
WHERE time between ago(1h) and now() 
AND device_type='Mark-1' -- Query against dimension
SELECT COUNT(DISTINCT device_id) FROM "test_db"."table" 
WHERE time between ago(1h) and now() 
AND event_type='ON' -- Query against measure value

Do we know why the first query that leverages dimension will be responding faster than the second, and Can we list the exact steps that will be happening, like when we analyze the query plans in traditional RDBMS?

Thank you and have a good day!

Kyeong
已提問 5 個月前檢視次數 354 次
1 個回答
0
已接受的答案

Choosing the 'right' dimensions is important for query performance in Amazon Timestream because it allows the service to optimize how it stores and retrieves data. When ingesting data, Timestream partitions and stores the records based on the dimensions specified.

During a query with a WHERE clause filtering on one of these dimensions, Timestream quickly identifies the exact partitions to scan rather than having to scan all the data. This is kind of like how a B-Tree index in a relational database allows for fast lookups and retrieval of records based on indexed columns.

Things to Consider

  • Timestream stores records within internal partitions based on the ingestion dimensions like customer ID, device ID, etc.

  • When a query filters on one of these dimensions, Timestream knows the exact partitions to scan rather than a full table scan. (It's quicker!)

  • This partition-level filtering enables Timestream to achieve high query performance even for very large datasets as it doesn't have to scan irrelevant partitions.

  • In contrast, filtering only on measure columns would require a full table scan as Timestream doesn't have the dimension context to narrow down the partitions.

Reference -

AWS TimeStream Does TimeUnit influence the Query performance?

Introducing customer-defined partition keys for Amazon Timestream: Optimizing query performance

Amazon Timestream Query Performance

profile pictureAWS
已回答 5 個月前

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南