I am wanting to use Neptune for an application with cypher as my query language. I have a pretty small dataset of around ~8500 nodes and ~8500 edges edges. I am trying to do what seem to be fairly straightforward queries, but the latency is very high (~6-8 seconds for around 1000 rows). I have tried with various instance types, enabling and disabling caches, enabling and disabling the OSGP index to no avail. I'm really at a loss as to why the query performance is so poor.
Does anyone have any experience with poor query query performance using Neptune? I feel I must be doing something incorrect to have such high query latency.
Here is some more detailed information on my graph structure and my query.
I have a graph with 2 node types A
and B
and a single edge type MAPS_TO
which always is directed from an A
node to a B
node. The relation MAPS_TO
is many to many, but with the current dataset
it is primarily one-to-one, i.e. the graph is mainly disconnected subgraphs of the form:
(A)-[MAPS_TO]-(B)
What I would like to do is for all A nodes to collect the distinct B nodes which they map to satisfying some conditions. I've experimented with my queries a bit and the fastest one I've been able to arrive at is:
MATCH (a:A)
WHERE a.Owner = $owner AND a.IsPublic = true
WITH a
MATCH (a)-[r:MAPS_TO]->(b:B)
WHERE (b)<-[:MAPS_TO {CreationReason: "origin"}]-(:A {Owner: $owner})
OR (b)<-[:MAPS_TO {CreationReason: "origin"}]-(:A {IsPublic: true})
WITH a, r, b ORDER BY a.AId SKIP 0 LIMIT 1000
RETURN a {
.AId
} AS A, collect(distinct b {
B: {BId: b.BId, Name: b.Name, other properties on B nodes...}
R: {CreationReason: r.CreationReason, other relation properties}
})
The above query takes ~6 seconds (looking at explain and http timing) on the t4g.medium
instance type. I tried upping to a r5d.2xlarge
instance type and this cut the query time in half to 3-4 seconds. However, using such a large instance type seems quite excessive for such a small amount of data.
Really I am just trying to figure out why my query seems to perform so poorly. It seems to me that with the amount of data I have it should not really be possible to have a Neptune configuration with such performance.
EDIT for more info:
We are using the t4g.medium
instance type with 3 reader instances and the query going to the reader instances. Again we have around ~8500 nodes split approximately equally into A
nodes and B
nodes. There are around ~8500 edges of a single type MAPS_TO
all going from A
to B
. The output of the status endpoint for OpenCypher is:
{'status': 'healthy',
'startTime': 'Mon Sep 19 18:56:50 UTC 2022',
'dbEngineVersion': '1.1.1.0.R5',
'role': 'reader',
'dfeQueryEngine': 'viaQueryHint',
'gremlin': {'version': 'tinkerpop-3.5.2'},
'sparql': {'version': 'sparql-1.1'},
'opencypher': {'version': 'Neptune-9.0.20190305-1.0'},
'labMode': {'ObjectIndex': 'enabled',
'ReadWriteConflictDetection': 'enabled'},
'features': {'ResultCache': {'status': 'enabled'},
'IAMAuthentication': 'disabled',
'Streams': 'disabled',
'AuditLog': 'enabled'},
'settings': {'clusterQueryTimeoutInMs': '120000'}}
I have tried this with the ObjectIndex
enabled and disabled and do not see much difference in performance. I have also tried the query on a larger instance type, the r5d.2xlarge
to see if performance was improved by the result cache. The response time roughly cut in half, but that still seems to be very slow and a larger instance type then should be necessary. The only thing being run against the database currently are the above queries so I do not see how it could be a concurrency issue.
We have looked at the output of explain (too long to post). It is not clear to me that there is a single place where the query is spending a large amount of time. The DFEPipelineJoin
taking the longest makes sense to me based on the description in the documentation. What is not clear to me would be how to eliminate all the DFEPipelineJoin
's from the query.
Hi Dave! Thanks for the response. I can provide more detailed information. Please see the edited question.