How to bulk update edge properties with Gremlin in Neptune?


Context

I have a Neptune property graph that consists of the following:

  1. Vertex (User)
  2. Edge (likes)

We are currently using Gremlin for data access.

The data we are working with arrives batched per day, specifically a collection of users and their new likes. We want to load this batched data into our Neptune database.

Question

  1. Is there a way to update the edge properties using Neptune's bulk loader?
  2. What is the recommended way of updating these edges, given that we can have millions of updates and ideally don't want to iterate through each one?
asked 22 days ago
2 Answers

While Neptune's bulk loader is primarily designed for initial data loading and doesn't directly support updating existing edge properties, there are efficient ways to handle bulk updates of edge properties using Gremlin in Neptune.

  1. Neptune's bulk loader: The bulk loader is not designed for updating existing edge properties. It's mainly used for initial data loading or adding new vertices and edges.

  2. Recommended approach for bulk edge property updates:

For efficient bulk updates of edge properties, especially when dealing with millions of updates, you can use a combination of batching and Gremlin's mergeE() step (available in Neptune engine versions that support TinkerPop 3.6.x and later). Here's a recommended approach:

a. Batch your updates: Group your updates into batches of about 200 records per request. This helps reduce the transactional overhead and improves throughput.

b. Use mergeE() for each update in the batch: The mergeE() step is optimized for upserting edges. It will update the edge if it exists, or create it if it doesn't. Here's an example of how you might structure a batch update:

g.mergeE([(from): 'user1', (to): 'item1', (T.label): 'likes', 'timestamp': datetime('2023-06-15')])
 .mergeE([(from): 'user2', (to): 'item2', (T.label): 'likes', 'timestamp': datetime('2023-06-15')])
 // ... more mergeE() calls for other updates in the batch
 .id()  // return only the edge ids to keep the response payload small
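
If you are generating these batch requests programmatically, a minimal sketch in Python using gremlinpython might look like the following. The record field names (user_id, item_id, ts), the batch size, and the endpoint are placeholder assumptions, and the ids are interpolated directly into the query string, so this assumes trusted, pre-validated input:

from gremlin_python.driver import client

BATCH_SIZE = 200  # starting point per the guidance above; tune for your workload

def build_batch_query(records):
    # Chain one mergeE() per record into a single Gremlin request string.
    steps = [
        "mergeE([(from): '{u}', (to): '{i}', (T.label): 'likes', "
        "'timestamp': datetime('{t}')])".format(u=r['user_id'], i=r['item_id'], t=r['ts'])
        for r in records
    ]
    return 'g.' + '.'.join(steps) + '.id()'

# Placeholder endpoint; use your cluster's writer endpoint.
gremlin_client = client.Client('wss://your-neptune-endpoint:8182/gremlin', 'g')

def submit_batch(records):
    return gremlin_client.submit(build_batch_query(records)).all().result()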

c. Parallelize your requests: To further increase throughput, you can submit these batch requests in parallel using multiple clients. Ensure each client submits requests serially to avoid overwhelming the server-side request queue.
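
As a sketch of that pattern, assuming the build_batch_query() helper from the previous snippet: each worker thread owns its own client and submits its share of the batches serially.

from concurrent.futures import ThreadPoolExecutor
from gremlin_python.driver import client

NUM_WORKERS = 4  # tune to your instance size and write capacity
ENDPOINT = 'wss://your-neptune-endpoint:8182/gremlin'  # placeholder

def worker(batches):
    # One client (one connection) per thread, submitting its batches serially.
    c = client.Client(ENDPOINT, 'g')
    try:
        for batch in batches:
            c.submit(build_batch_query(batch)).all().result()
    finally:
        c.close()

def run_parallel(all_batches):
    # Round-robin the batches across the workers.
    shards = [all_batches[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        list(pool.map(worker, shards))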

d. Error handling: Implement error handling and retries in your client code. If a batch fails, you may want to retry it after a short delay.
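
One possible shape for that retry logic, hedged on how your driver version surfaces the error: Neptune reports write conflicts as a ConcurrentModificationException, which is generally safe to retry after a backoff.

import time
from gremlin_python.driver.protocol import GremlinServerError

def submit_with_retries(c, query, max_retries=5, base_delay=0.1):
    for attempt in range(max_retries):
        try:
            return c.submit(query).all().result()
        except GremlinServerError as e:
            # Retry only conflicts; re-raise anything else immediately.
            if 'ConcurrentModificationException' not in str(e):
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError('batch failed after {} retries'.format(max_retries))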

e. Avoid ID collisions: If possible, structure your data so that different threads or processes work on non-overlapping sets of edges. This reduces the chance of concurrent modification exceptions.
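
For example, assuming likes are keyed by user, one simple way to get non-overlapping work sets is to hash the user id so that all edges out of a given user land in the same shard, then hand each shard to one worker:

import hashlib

def partition_by_user(records, num_workers):
    # All likes from the same user go to the same shard, so no two
    # workers ever upsert the same (user, item) edge concurrently.
    shards = [[] for _ in range(num_workers)]
    for r in records:
        h = int(hashlib.md5(r['user_id'].encode()).hexdigest(), 16)
        shards[h % num_workers].append(r)
    return shards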

By following these guidelines, you can efficiently update millions of edge properties in Neptune without having to iterate through each one individually. This approach leverages Neptune's optimizations for high-concurrency scenarios and should provide good performance for bulk updates.

Sources
Making efficient upserts with Gremlin mergeV() and mergeE() steps - Amazon Neptune
Creating Efficient Multithreaded Gremlin Writes - Amazon Neptune
Using Gremlin to access graph data in Amazon Neptune - Amazon Neptune

answered 22 days ago

Neptune's bulk loader is mainly used for initial data loading or adding new vertices and edges. It is not designed for updating existing edge properties.

It's recommended to chain mergeV() and mergeE() operations together to optimize high-throughput write scenarios in Neptune. Using a starting batch size of approximately 200 records (which can be tuned based on your specific workload characteristics), you can significantly reduce transactional overhead when upserting large volumes of vertices and edges. To further maximize throughput, consider implementing parallel processing with multiple clients handling batch upserts simultaneously; this multi-threaded approach, combined with proper batch sizing, can dramatically improve your write performance.

For your daily user-likes data, you could split your millions of records into these optimally sized batches and process them concurrently across multiple threads, while maintaining proper error handling and monitoring to ensure data consistency. Chaining the vertex and edge upserts also means each batch can create any missing users or items on the fly, as in the sketch below.
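
If the liked items may not already exist as vertices, a hedged sketch of one such chained upsert, with illustrative ids and an assumed 'Item' vertex label, could look like this (submitted with the same client pattern as above):

query = """
g.mergeV([(T.id): 'user1', (T.label): 'User'])
 .mergeV([(T.id): 'item1', (T.label): 'Item'])
 .mergeE([(from): 'user1', (to): 'item1', (T.label): 'likes',
          'timestamp': datetime('2023-06-15')])
 .id()
"""
# Upserts both endpoint vertices, then the edge between them, in one request.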

Sources:

https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-efficient-upserts.html
https://docs.aws.amazon.com/neptune/latest/userguide/best-practices-gremlin-multithreaded-writes.html
https://docs.aws.amazon.com/neptune/latest/userguide/get-started-graph-gremlin.html
https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html
https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-optimize.html

AWS
answered 22 days ago
