Combine Count and Range Query for Efficient Data Retrieval in Amazon Neptune

I'm working with Amazon Neptune and using Gremlin to query my graph data. Currently, I have two separate queries: one to get the count of vertices and another to select a specific range of results. I'm looking for a way to consolidate these into a single query to improve efficiency, as making multiple queries has been causing performance issues.

Here's a simplified example of my current approach:

g.V()
.has('acco_id', 'someid')
.hasLabel('User')
.out('KEYS')
.groupCount("TotalCount")  // Side-effect for count
.by(constant("count"))     // Labels every result as "count" to sum them
.project("TotalCount", "Keys")
.by(select("TotalCount").select("count"))  // Fetches the side-effect count
.by(project("Id", "Name", "Type")
    .by("id")
    .by("name")
    .by(__.label()))
.range(0, 2)

The goal is to retrieve both the total count of matching vertices and a subset of data within a specified range in a single query, but my current attempt doesn’t seem to work as expected. Is there a way in Gremlin on Neptune to achieve both without separate queries, ideally preserving the count as a side-effect while applying a range on the actual result set?

Any insights or suggestions would be greatly appreciated. Thank you!

asked a month ago, 37 views
3 Answers

To combine the count and range query in Amazon Neptune using Gremlin, you can fold the results once and then project both the count and the ranged subset from the folded list. Here's an approach that should work:

g.V()
 .has('acco_id', 'someid')
 .hasLabel('User')
 .out('KEYS')
 .fold()
 .project('TotalCount', 'Keys')
   .by(count(local))
   .by(unfold()
       .range(0, 2)
       .project('Id', 'Name', 'Type')
         .by(id)
         .by('name')
         .by(label)
       .fold())

This query does the following:

  1. It starts with your initial vertex selection and traversal.
  2. The fold() step collects all results into a list.
  3. The project() step creates two columns: 'TotalCount' and 'Keys'.
  4. For 'TotalCount', it uses count(local) to count the total number of items in the folded list.
  5. For 'Keys', it unfolds the list, applies the range, projects the desired properties, and then folds the results back into a list.

This approach allows you to get both the total count and the ranged subset of results in a single query. The total count represents all matching vertices, while the 'Keys' list contains only the specified range of results with the projected properties.
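
To make the data flow concrete, here is a plain-Python model of the result shape this query should produce. It is only an illustration under stated assumptions: the `sample_keys` list and the `paged_result` helper are hypothetical stand-ins for the vertices reached by out('KEYS'), not Gremlin or Neptune APIs.

```python
def paged_result(keys, low, high):
    """Model of fold().project('TotalCount', 'Keys') over a list of vertices.

    `keys` is a hypothetical list of dicts standing in for the vertices
    reached by out('KEYS'); it is not a Gremlin or Neptune API.
    """
    folded = list(keys)  # fold(): collect every result into one list
    return {
        # count(local): size of the folded list
        "TotalCount": len(folded),
        # unfold().range(low, high).project(...).fold()
        "Keys": [
            {"Id": v["id"], "Name": v["name"], "Type": v["label"]}
            for v in folded[low:high]
        ],
    }


# Hypothetical sample data for illustration only.
sample_keys = [
    {"id": "k1", "name": "alpha", "label": "Key"},
    {"id": "k2", "name": "beta", "label": "Key"},
    {"id": "k3", "name": "gamma", "label": "Key"},
]
result = paged_result(sample_keys, 0, 2)
print(result["TotalCount"], [k["Name"] for k in result["Keys"]])
```

In this model the full list is built before the slice is taken, which mirrors how fold() collects every match before range() is applied.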

This method avoids a second round trip to the database because Neptune evaluates the count and the range in a single request. Note that fold() still collects every matching vertex before the range is applied, so the cost grows with the total number of matches.

Remember that while this approach is generally more efficient, the actual performance may vary depending on your specific graph structure and data volume. It's always a good practice to test and profile your queries in your specific Neptune environment to ensure optimal performance.

Sources
How Gremlin queries are processed in Neptune - Amazon Neptune
Using the Gremlin explain API in Neptune - Amazon Neptune
Managed Graph Database – Amazon Neptune Features – AWS

answered a month ago
  • I'm aware of this approach, but I would prefer to avoid fold() and unfold() because they're slow and not suitable for large amounts of data.

Applying range() this late in the query will likely make it slower than applying it earlier.

I'm not entirely certain whether you want the total count of all vertices returned by out() or only the count of those within the range(), so I'll cover both cases below.

If you only need the count and keys for the range, an approach similar to the one from the GenAI agent might work, though you may want to move range() earlier in the query. That reduces the number of vertices being folded and counted.

g.V()
 .has('acco_id', 'someid')
 .hasLabel('User')
 .out('KEYS')
 .range(0, 2)
 .fold()
 .project('TotalCount', 'Keys')
   .by(count(local))
   .by(unfold()
       .project('Id', 'Name', 'Type')
         .by('id')
         .by('name')
         .by(__.label())
       .fold())

You need the fold() inside the by() of the project(); otherwise the by() returns only the first result found by its subquery.

If you want the total count of all vertices found after the out() and just a range of those, then here's an option that doesn't need to do the early fold().

g.V()
 .has('acco_id', 'someid')
 .hasLabel('User')
 .project('TotalCount', 'Keys')
   .by(out('KEYS').count())
   .by(out('KEYS')
       .range(0, 2)
       .project('Id', 'Name', 'Type')
         .by('id')
         .by('name')
         .by(__.label())
       .fold())
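
The difference between the two options can be modeled in plain Python. This is a sketch only: `keys` is a hypothetical list standing in for the vertices returned by out('KEYS'), and both helper names are made up for illustration rather than taken from Gremlin or Neptune.

```python
def count_within_range(keys, low, high):
    """Model of out('KEYS').range(low, high).fold().project(...).

    The range is applied first, so the count only covers the slice.
    """
    page = keys[low:high]
    return {"TotalCount": len(page), "Keys": page}


def count_all_then_range(keys, low, high):
    """Model of project().by(out('KEYS').count()).by(out('KEYS').range(low, high)...).

    The count and the slice are computed independently from the same source.
    """
    return {"TotalCount": len(keys), "Keys": keys[low:high]}


# Hypothetical key identifiers for illustration only.
keys = ["k1", "k2", "k3", "k4", "k5"]
first_option = count_within_range(keys, 0, 2)
second_option = count_all_then_range(keys, 0, 2)
print(first_option["TotalCount"], second_option["TotalCount"])
```

Both variants return the same two keys; only the meaning of TotalCount changes.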

Going beyond either of these options, it may be helpful to see a Neptune Gremlin Profile output of the query to determine where the bulk of the computation is happening: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html
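
As a rough sketch of how a profile request could be prepared from Python, the snippet below builds (but does not send) a POST to the /gremlin/profile path described on that page. The endpoint name `your-neptune-endpoint`, the default port 8182, and the `build_profile_request` helper are assumptions for illustration; a cluster with IAM authentication enabled would also need signed requests, which this does not cover.

```python
import json
import urllib.request


def build_profile_request(endpoint, query, port=8182):
    """Build (but do not send) a POST request for the Gremlin profile endpoint.

    `endpoint` and `port` are placeholders; substitute your cluster's values.
    """
    url = f"https://{endpoint}:{port}/gremlin/profile"
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = "g.V().has('acco_id', 'someid').hasLabel('User').out('KEYS').range(0, 2)"
request = build_profile_request("your-neptune-endpoint", query)
print(request.full_url)

# Sending the request needs network access to the cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```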

AWS
answered a month ago

Hello,

I understand that you currently run two queries in Neptune, one to get the count of vertices and another to select a specific range of results, and that you want to consolidate them into a single query to improve performance. As Taylor rightly suggested, you can either count out('KEYS') in its own by() to get the total count of all vertices, or apply range() before fold() to count only the vertices within the range.

Please let us know whether this solves your issue or if you need any further help. If the problem persists, I would suggest raising a support case with AWS so that resource-based troubleshooting can be done and a possible workaround suggested.

Thank you!

References:

[1] https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html
AWS
answered 24 days ago
