Choosing Timestream Dimension and Query Performance

0

Hello! I was reading https://docs.aws.amazon.com/timestream/latest/developerguide/data-modeling.html

I'm trying to understand why choosing the proper set of dimensions is important for query performance. I know it sounds right but I'm more curious what exact mechanism is performed when a query with a WHERE clause with some specific dimensions. Does Timestream use some variant of traditional technique like b-tree index or hash table to find the record that containing a certain dimension? I would understand if it's like a trade secret, but I wanted to hear something like "because of <certain mechanism> Timestream will be always responding with reasonable latency when you you used a query that leverages dimensions, compare to using WHERE clause with measure columns."

Let's say I have a table with 2 dimensions (device_type, location_id) and one measure value (event_type)

time | device_id | device_type | location | event_type
00:01 | device-1 | Mark-1 | home | ON
00:02 | device-2 | Mark-1 | office | ON
00:03 | device-1 | Mark-1 | home | OFF
00:05 | device-3 | Mark-2 | home | ON
00:04 | device-2 | Mark-1 | office | OFF
00:09 | device-3 | Mark-2 | home | OFF

and there are 2 types of query for different business purposes

SELECT COUNT(DISTINCT device_id) FROM "test_db"."table"
WHERE time between ago(1h) and now() 
AND device_type='Mark-1' -- Query against dimension
SELECT COUNT(DISTINCT device_id) FROM "test_db"."table" 
WHERE time between ago(1h) and now() 
AND event_type='ON' -- Query against measure value

Do we know why the first query that leverages dimension will be responding faster than the second, and Can we list the exact steps that will be happening, like when we analyze the query plans in traditional RDBMS?

Thank you and have a good day!

Kyeong
asked 5 months ago324 views
1 Answer
0
Accepted Answer

Choosing the 'right' dimensions is important for query performance in Amazon Timestream because it allows the service to optimize how it stores and retrieves data. When ingesting data, Timestream partitions and stores the records based on the dimensions specified.

During a query with a WHERE clause filtering on one of these dimensions, Timestream quickly identifies the exact partitions to scan rather than having to scan all the data. This is kind of like how a B-Tree index in a relational database allows for fast lookups and retrieval of records based on indexed columns.

Things to Consider

  • Timestream stores records within internal partitions based on the ingestion dimensions like customer ID, device ID, etc.

  • When a query filters on one of these dimensions, Timestream knows the exact partitions to scan rather than a full table scan. (It's quicker!)

  • This partition-level filtering enables Timestream to achieve high query performance even for very large datasets as it doesn't have to scan irrelevant partitions.

  • In contrast, filtering only on measure columns would require a full table scan as Timestream doesn't have the dimension context to narrow down the partitions.

Reference -

AWS TimeStream Does TimeUnit influence the Query performance?

Introducing customer-defined partition keys for Amazon Timestream: Optimizing query performance

Amazon Timestream Query Performance

profile pictureAWS
answered 5 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions