DynamoDB schema for getting sorted data across multiple partition keys

0

Hello everyone, I need your help and guidance.

I have an application that keeps saving task logs (helpdesk) for different machines. Two things are relevant to the data schema:

  1. the machine id
  2. the date when the task occurs

Therefore, we store all the records in a DynamoDB table with the following keys:

  • Partition key: machine id.
  • Sort key: date (timestamp).
  • Attributes: miscellaneous data of the task performed.

With this schema it is very easy to list all the records for a machine. We can filter by date range, fetch the latest records, and so on, but always for a single partition key (for context, we use the AWS SDK to display everything in a web application).
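A minimal sketch of that single-machine query, assuming the attributes are named `machineId` and `timestamp` (stored as epoch seconds) and the table is called `TaskLogs`; none of these names come from the original post. It only builds the request parameters, so it runs without AWS access.

```python
from datetime import datetime, timedelta, timezone

TABLE_NAME = "TaskLogs"  # hypothetical table name


def build_machine_query(machine_id: str, since: datetime, limit: int = 50) -> dict:
    """Build kwargs for a DynamoDB Query on one machine, newest first."""
    return {
        "TableName": TABLE_NAME,
        # Name placeholders avoid clashes with DynamoDB reserved words.
        "KeyConditionExpression": "#mid = :mid AND #ts >= :since",
        "ExpressionAttributeNames": {"#mid": "machineId", "#ts": "timestamp"},
        "ExpressionAttributeValues": {
            ":mid": {"S": machine_id},
            ":since": {"N": str(int(since.timestamp()))},
        },
        "ScanIndexForward": False,  # return items in descending sort-key order
        "Limit": limit,
    }


one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
params = build_machine_query("machine-42", one_hour_ago)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").query(**params)
```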

However, there is also a dashboard that should list the latest records stored for several machines (several partition keys), depending on the user (each user can view a group of machines), sorted by date (sort key) in descending order. For example: the records stored in the last hour for a given group of partition keys.

I still can't figure out the correct table configuration in DynamoDB for this. My understanding is that a secondary index is needed, because a Query always requires a partition key and a Scan does not return items sorted by date.

I have read the documentation and have seen some similar cases, but I can't figure out the correct schema for the secondary index (if that is even the right approach; maybe there is something I'm missing).

I hope you can guide me. Thank you very much.

asked 4 days ago · 23 views
3 Answers
3

Ultimately it depends on what determines a group. If it is a groupId or some other common attribute, you can create a Global Secondary Index with that attribute as the partition key and the timestamp as the sort key.

If the group is determined by the user knowing each of the MachineIds, then send one query per machine in parallel, combine the results on the client side, and sort them by timestamp (for example with Arr.sort()).
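The fan-out approach could be sketched roughly as follows. `query_machine` is a hypothetical callable standing in for a real per-machine DynamoDB Query, and the `timestamp` field name is an assumption; a small in-memory dictionary replaces the table so the sketch runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def latest_for_machines(
    machine_ids: Iterable[str],
    query_machine: Callable[[str], List[Record]],
    limit: int = 10,
    max_workers: int = 16,
) -> List[Record]:
    """Query every machine in parallel, then return the newest `limit` records."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_machine = list(pool.map(query_machine, machine_ids))
    # Flatten the per-machine results and sort newest first on the client.
    combined = [record for records in per_machine for record in records]
    combined.sort(key=lambda r: r["timestamp"], reverse=True)
    return combined[:limit]


# Stand-in data (hypothetical) instead of a real table.
FAKE_DATA = {
    "m1": [{"machineId": "m1", "timestamp": 300}, {"machineId": "m1", "timestamp": 100}],
    "m2": [{"machineId": "m2", "timestamp": 200}],
}

print(latest_for_machines(["m1", "m2"], lambda m: FAKE_DATA[m], limit=2))
```

This issues one request per machine, so the cost grows with the size of the group.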

AWS
EXPERT
answered 4 days ago
EXPERT
reviewed 3 days ago
1

To address your requirement of listing the latest records for multiple machines sorted by date, you can indeed use a Global Secondary Index (GSI) in DynamoDB. Here's a suggested approach:

  1. Keep your current table structure with machine ID as the partition key and date as the sort key.

  2. Create a GSI with the following structure:

    • Partition key: A new attribute, let's call it "UserGroup"
    • Sort key: Date (timestamp)
    • Project other necessary attributes

  3. When saving task logs, add the "UserGroup" attribute to each item. This attribute should represent the group of machines a user can view.

  4. To query the latest records for multiple machines:

    • Use the GSI
    • Specify the "UserGroup" as the partition key
    • Use the Date as the sort key with a condition for the desired time range
    • Sort in descending order (set ScanIndexForward to false on the Query)

This setup allows you to efficiently query across multiple machines (partition keys) that belong to a specific user group, sorted by date.
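As a rough sketch of this setup, the parameters for adding such a GSI and the shape of an item carrying the extra attribute might look like the following. The index name `UserGroupByTimestamp` and the attribute names `userGroup`, `machineId`, `timestamp`, and `details` are assumptions, and the sketch assumes an on-demand table (a provisioned table would also need `ProvisionedThroughput` inside `Create`).

```python
GSI_NAME = "UserGroupByTimestamp"  # hypothetical index name


def build_gsi_update(table_name: str) -> dict:
    """Build kwargs for UpdateTable that add a GSI keyed on (userGroup, timestamp)."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "userGroup", "AttributeType": "S"},
            {"AttributeName": "timestamp", "AttributeType": "N"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": GSI_NAME,
                    "KeySchema": [
                        {"AttributeName": "userGroup", "KeyType": "HASH"},
                        {"AttributeName": "timestamp", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                }
            }
        ],
    }


def build_task_log_item(machine_id: str, timestamp: int, user_group: str, details: str) -> dict:
    """Build a task-log item (low-level format) that also carries the GSI partition key."""
    return {
        "machineId": {"S": machine_id},
        "timestamp": {"N": str(timestamp)},
        "userGroup": {"S": user_group},
        "details": {"S": details},
    }


update_params = build_gsi_update("TaskLogs")
item = build_task_log_item("machine-42", 1700000000, "group-a", "disk replaced")
# With boto3: client.update_table(**update_params); client.put_item(TableName="TaskLogs", Item=item)
```

Each item holds a single `userGroup` value here, so this shape fits best when a machine belongs to exactly one group.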

For example, to get the latest records from the last hour for a specific user group:

  1. Use the GSI
  2. Set the partition key to the user's group
  3. Set a condition on the sort key (Date) to be greater than one hour ago
  4. Limit the results as needed and sort in descending order
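The four steps above could translate into Query parameters along these lines. This is a sketch under assumptions: the index is named `UserGroupByTimestamp`, the key attributes are `userGroup` and `timestamp`, and timestamps are stored as epoch seconds. None of these names are confirmed by the thread.

```python
from datetime import datetime, timedelta, timezone


def build_group_query(
    user_group: str,
    now: datetime,
    window: timedelta = timedelta(hours=1),
    limit: int = 100,
) -> dict:
    """Build kwargs for a Query on the GSI: one group, recent window, newest first."""
    since = int((now - window).timestamp())
    return {
        "TableName": "TaskLogs",              # hypothetical table name
        "IndexName": "UserGroupByTimestamp",  # hypothetical GSI name
        "KeyConditionExpression": "#grp = :grp AND #ts > :since",
        "ExpressionAttributeNames": {"#grp": "userGroup", "#ts": "timestamp"},
        "ExpressionAttributeValues": {
            ":grp": {"S": user_group},
            ":since": {"N": str(since)},
        },
        "ScanIndexForward": False,  # descending order on the sort key
        "Limit": limit,
    }


params = build_group_query("group-a", datetime.now(timezone.utc))
# With boto3: boto3.client("dynamodb").query(**params)
```

If more items match than `Limit` allows, the response includes `LastEvaluatedKey`, which can be passed back as `ExclusiveStartKey` to fetch the next page.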

This approach provides a flexible and efficient way to retrieve sorted data across multiple partition keys in DynamoDB, addressing your dashboard requirements without the need for expensive scan operations.

Remember to carefully consider the additional write capacity and storage costs associated with GSIs, and ensure that the "UserGroup" attribute provides a good distribution of data to avoid hot partitions.

answered 4 days ago
EXPERT
reviewed 3 days ago
0

Hi, I appreciate the answers; they are very close to what we have been doing.

As Leeroy mentions, groups are associated with users having access to certain machines. This user-to-machine grouping greatly restricts how the schema can be designed.

I understand that making parallel queries for each machine a particular user should see is a very simple approach that doesn't even need a GSI, but it can be very inefficient (say we want the last 10 records for 10,000 machines).

That's the main problem: designing a schema that lets us quickly obtain the results we need without making parallel queries and then filtering on the client side, which can consume a lot of memory depending on the number of records and machines involved.
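On the memory side only, one possible mitigation is to merge the per-machine results lazily instead of collecting everything and sorting. The sketch below assumes each per-machine stream is already sorted newest first (as a descending Query would return it) and that records carry a `timestamp` field; both are assumptions. It does not reduce the number of requests.

```python
import heapq
from itertools import islice
from typing import Dict, Iterable, List

Record = Dict[str, int]


def newest_across(streams: Iterable[Iterable[Record]], limit: int) -> List[Record]:
    """Lazily merge descending-sorted streams and keep only the newest `limit` records."""
    merged = heapq.merge(*streams, key=lambda r: r["timestamp"], reverse=True)
    return list(islice(merged, limit))


# Stand-in data (hypothetical): one descending-sorted list per machine.
streams = [
    [{"timestamp": 9}, {"timestamp": 4}],
    [{"timestamp": 7}],
    [{"timestamp": 8}, {"timestamp": 1}],
]

print([r["timestamp"] for r in newest_across(streams, 3)])  # [9, 8, 7]
```

When the streams are lazy iterators, `heapq.merge` holds only about one pending record per stream at a time, so memory grows with the number of machines and not with the total number of records.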

I thought about using PartiQL, but it seems to have the same limitation (single partition key).

answered 2 days ago
