Below is an example demonstrating parallel and serial shard querying.
Note: I did run into a strange problem when I set the Limit to 1. This is probably caused by the fact that there are multiple records in the index with the same exact value; somehow the paging didn't work properly.
Note that doing this in parallel might be a "premature optimization". In my test the serial version took about 300 ms and the parallel version about 50 ms (excluding the cold-start invocation :-)).
In the example below I added the user name to the index. I think your use case does not require this.
Still I think the table schema is not really good for your use case.
Maybe a design like this would be better:
GSI - Partition Key - Non-Truncated-DateTime
Then to get the most recent items you can just start a reverse scan of this index.
```javascript
// Assumes ddbDocClient (a DynamoDBDocumentClient), QueryCommand, randomDate
// and users are defined elsewhere in the Lambda.
async function queryShard(user, date, shard, exclusiveStartKey, results) {
  const params = {
    TableName: "UserActions",
    IndexName: "UserDateShard-index",
    KeyConditionExpression: "UserDateShard = :s",
    ExpressionAttributeValues: {
      ":s": `${user}/${date.toISOString()}/${shard}`,
    },
    Limit: 5,
    ExclusiveStartKey: exclusiveStartKey, // correct parameter name (not ExclusiveKeyStart)
  };
  const res = await ddbDocClient.send(new QueryCommand(params));
  if (res.Items) {
    res.Items.forEach((item) => results.push(item));
  }
  // Keep paging until DynamoDB stops returning a LastEvaluatedKey.
  if (res.LastEvaluatedKey) {
    await queryShard(user, date, shard, res.LastEvaluatedKey, results);
  }
}

async function queryParallel(user, date) {
  const results = [];
  const queries = [];
  // Start all ten shard queries immediately, then wait for the whole batch.
  for (let shard = 0; shard < 10; shard++) {
    queries.push(queryShard(user, date, shard, undefined, results));
  }
  await Promise.all(queries);
  return results;
}

async function querySerial(user, date) {
  const results = [];
  // Await each shard query before starting the next one.
  for (let shard = 0; shard < 10; shard++) {
    await queryShard(user, date, shard, undefined, results);
  }
  return results;
}

export const handler = async (event) => {
  try {
    const date = randomDate("01-01-2022", "12-31-2022");
    const user = users[Math.floor(Math.random() * users.length)];

    console.time("parallel");
    const parallelResults = await queryParallel(user, date);
    console.timeEnd("parallel");

    console.time("serial");
    const serialResults = await querySerial(user, date);
    console.timeEnd("serial");

    return { body: "Successfully created item!" };
  } catch (err) {
    console.log(err);
    return { error: err };
  }
};
```
Sometimes the query takes more time than the 30 s limit for an HTTP Lambda call. In those cases I use an async approach: the Lambda returns the result to the client through a WebSocket connection.
OK, I am working on corrected code, but there are two things that currently seem a bit strange:
- First of all, it seems a bit strange to want the first items from multiple days. I don't understand why you would want those items, since you cannot compute any statistics from them. And if it were a paging thing, one day would suffice. To be clear, you want:
  - today: 100 items of x items
  - yesterday: 100 items of y items
  - the day before yesterday: 100 items of z items
Of course if that really is the data you want it can be arranged but it would make more sense to want:
the newest items from your table:
- max hundred
- max x days back
- (whichever comes first)
- Usually when you want fast code that exploits async I/O to execute parallel queries, you will end up with code that waits for an array of results using
const res = await Promise.all([itemPromises]);
but in your code this array will only contain one item. Also, the recursive calls are done within the `then` part, meaning a new query is only spawned when another returns. The code should spawn the ten shard queries immediately and wait for all of them to return.
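The fan-out described above can be sketched like this; `fetchShard` is a stand-in for the real DynamoDB query call, so the pattern is visible without any AWS setup:

```javascript
// Simulated async I/O call returning one shard's items.
async function fetchShard(shard) {
  return [`item-from-shard-${shard}`];
}

async function queryAllShards(shardCount) {
  // Kick off every query immediately; do NOT await inside the loop,
  // or the queries run one after another instead of in parallel.
  const pending = [];
  for (let shard = 0; shard < shardCount; shard++) {
    pending.push(fetchShard(shard));
  }
  // Wait for the whole batch, then merge the per-shard result arrays.
  const perShard = await Promise.all(pending);
  return perShard.flat();
}

queryAllShards(10).then((items) => console.log(items.length)); // logs 10
```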
If you clear up my concerns about the first point, I can try to make some working code for you :-)
Regards Jacco
Hi Jacco. For this use case, there will be two types of users:
- Users that will be creating items. There is no pattern to it: one day there may be no items created at all, other days there will be dozens, hundreds, or even thousands of items.
- Users that will log into a dashboard and see the most recently created items, sorted by creation date starting from the most recent. These users may choose how many recent items to show on the dashboard. I'm thinking a maximum of 100; more than that won't provide much value for the use case.
Hot-key issue: if the users create, let's say, five thousand items on one specific day and I add them to GSI1 with the item creation date as partition key, then all users' dashboards will be hitting that partition. Do I need to shard those items to avoid hot-key issues? And how can I fetch those items in an orderly manner in the fastest possible way?
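For reference, write-time sharding of a hot date partition could look like the hypothetical sketch below; `SHARD_COUNT` and the hash function are assumptions, not part of the original schema:

```javascript
// Spread one day's items over N partitions by appending a shard number
// to the GSI partition key at write time.
const SHARD_COUNT = 10;

function buildGsiPk(user, isoDate, itemId) {
  // Any stable spread works; a simple hash of the item id keeps the same
  // item in the same shard. (Random assignment would also be fine.)
  let shard = 0;
  for (const ch of itemId) {
    shard = (shard * 31 + ch.charCodeAt(0)) % SHARD_COUNT;
  }
  return `${user}/${isoDate}/${shard}`;
}

console.log(buildGsiPk("alice", "2022-06-01", "item-123"));
```

At read time you query all `SHARD_COUNT` key values (serially or in parallel, as in the example above) and merge the results.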
max hundred, max x days back, (whichever comes first) = makes sense
thanks
Adding the user id to the GSI1 partition key will already make the hot-key problem much less likely. And why did you not include a time in the record? It would be nice to sort most recent first :-)
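A possible item shape for that suggestion (attribute names are assumptions for illustration): the user id goes into the GSI partition key and a full ISO timestamp into the sort key, so a Query with `ScanIndexForward: false` returns the newest items first.

```javascript
// Build the GSI key attributes for one item.
function buildIndexKeys(userId, createdAt) {
  return {
    GSI1PK: userId,                  // spreads load across users
    GSI1SK: createdAt.toISOString(), // full timestamp, lexicographically sortable
  };
}

console.log(buildIndexKeys("user-42", new Date("2022-06-01T10:30:00Z")));
// { GSI1PK: 'user-42', GSI1SK: '2022-06-01T10:30:00.000Z' }
```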
Thanks Jacco for your detailed answer. Just one thing: I couldn't find in the documentation a way to do a reverse scan on an index. It seems it is not possible using the Scan operation.
Oops, indeed: `ScanIndexForward` is only supported by Query.
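So the Query-based alternative would look something like the sketch below; table, index, and attribute names are assumptions for illustration. `ScanIndexForward: false` makes Query return items in descending sort-key order, i.e. newest first.

```javascript
// Build the parameters for a newest-first Query against the GSI.
function buildNewestFirstQuery(userId, limit) {
  return {
    TableName: "UserActions",
    IndexName: "GSI1",
    KeyConditionExpression: "GSI1PK = :pk",
    ExpressionAttributeValues: { ":pk": userId },
    ScanIndexForward: false, // descending sort-key order: most recent first
    Limit: limit,
  };
}

console.log(buildNewestFirstQuery("user-42", 100).ScanIndexForward); // false
```

Pass the resulting object to `ddbDocClient.send(new QueryCommand(...))` as in the earlier example.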