DynamoDB: Duplicate a lot of data to save a second lookup?


There are two typical uses for my new database.

  1. A user will access their record/item via their userID, and they will manage their info and a number of devices, each with a unique devID.
  2. A device connects and using its devID, will find the owning userID, then takes action based on attributes in the user item.

Two options I could use, each with a single DynamoDB table.

A. The table has items that are users or devices, with a partition key of ID and sort key of itemType. User type items have associated attributes like addresses, account and profile info, etc, and a set of deviceIDs. Device type items have associated attributes like their preferences, their type, their capabilities, and an attribute for the userID that owns it.

You can access both users and devices really quickly. If you are doing (1) you lookup and find a user, then you will have to use a set attribute that lists the one or more deviceIDs it owns, and then make individual lookups for each device. That's 2 lookups for a user that owns one device and more for multiple devices. Or if you are doing (2), search and find a device, you grab its userID attribute and then lookup the userID item. That's 2 lookups.

B. I could reduce the multiple lookups this way: Still one table, but all entries in the table are more homogenous: Every item includes all user related attributes, and includes one device's attributes. The provisioning key is the userID and sort key is deviceID, another indexed attribute is just the deviceID. If you are doing (1) then you lookup the userID and you get one or more records depending on whether they own one device or more. If you are doing (2), then we quickly find the device, and that same item includes all the user info we need and we don't need to do another lookup.

The problem with B is that I am duplicating a lot of data about the user in each of the items. Keeping them all synced is going to be problematic too, but that's a lot rarer.

So, am I overthinking the lookup costs, and should just go with the multiple lookup as in A, or is the multiple lookups going to be expensive enough that I need to have a better data design?

demandé il y a 2 ans1742 vues
2 réponses

Your cost is proportional to the amount of data scanned by the query, so whatever scans less data, will be less costly. What I understand from your description, the case A won't have more data scanned than the case B - every bit of data is retrieved once by its ID, difficult to be more efficient than that.

Now, I don't know how much read activity you think there will be - if the difference is couple of dollars per month then maybe you shouldn't bother yourself with cost optimizations a this stage, just implement what you see as easier to maintain in the long run? :)

répondu il y a 2 ans
  • You're right. However, this is more of a learning question for me. There is so much emphasis on "reduce number of tables" and "reduce lookups", that I would like to figure out how they want me to do that. The use case isn't that strange either: Imagine 50 employees at a company. Each employee is an item in a table. Then you have Departments, also each as one item. Each department has an attribute called "department employees", which is a set of employees - an array. It also has the department's address, phone number, etc.. Now, if there were a way that I could lookup an employee id, within that set, with a query and not a scan, then I'd immediately be able to find an employee's department and grab the department phone number without a second lookup. Otherwise, I have to first query the employeeID, get the employee's item, find the departmentID, and then lookup the Department, to find out where the department's address is. So, is there a simple way to do this, because if so, I want to learn it. Otherwise, yes, it really isn't a big deal and I should just move on.


I know this is an old question but I'll answer it anyway, maybe it'll help somebody :)

With DynamoDB the schema* depends strongly on the access patterns. Given your description I personally would design the table in the following way:

  • Hash/Partition key - user Id
  • Range/Sort key - either a string "USER" or the device Id (it may make sense to prefix the device Id with some kind of disriminator like DEVICE#, but want to keep it simple)
  • The rest of the attributes - more or less like you've described in the first approach, i.e. user item has user's attributes while a device item has device attributes, no duplication.
  • DeviceIdToUserId GSI:
    • Hash key - device Id
    • Range key - user Id

To get a user and her devices you need only one query - just get all items using a query by the user Id. To get a user while only having a device - use the GSI to map device Id to user Id and the get the item that has that hash key and range key equal to "USER". This is achieved by using only a small amount of data duplication in the form of the GSI. One caveat - GSIs are eventually consistent with the main table data, there's no strongly consistent read support on GSIs. It can be solved by using the main table as the index and leveraging transactional writes, but as long as you can handle eventual consistency I would just use GSIs.

(* - I know that NoSQL DBs are called schemaless, but I've read in M. Kleppmann's book "Designing Data-Intensive Applications" that there definitely is a schema, it's just applied on read instead of being enforced on write, which means there's more flexibility. It makes 100% sense to me so I'll say that DynamoDB data has a schema)

répondu il y a un an

Vous n'êtes pas connecté. Se connecter pour publier une réponse.

Une bonne réponse répond clairement à la question, contient des commentaires constructifs et encourage le développement professionnel de la personne qui pose la question.

Instructions pour répondre aux questions