Neo4j Performance — Tips for beginners
Neo4j is the market-leading graph database, and it ranks #1 in the latest GraphDB ranking. I have been playing with it for a couple of days. It's an interesting product that balances ease of use with the performance you would hope for.
However, as a beginner, you will normally struggle to get the desired performance (at least, I spent quite some time figuring out how to build high-performing, optimized queries with Cypher). So I am going to share my tips for beginners on how to build your first high-performance query in Cypher.
Drop the Identifier
An identifier (called a variable in current Cypher documentation) is the name you put in front of a label to refer to the node you want to create or match. For example, in CREATE (p1:Person), “p1” is the identifier.
There is nothing wrong with a statement like that. However, when you create a high volume of :Person nodes this way, you are likely to see poor performance, which I attribute to Neo4j replanning the query every time. A simple trick is to drop the identifier. In my tests, the load ran more than 10 times faster.
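To make the comparison concrete, here is a minimal sketch of both forms. The name property and its value are assumptions for illustration, not taken from the original script.

```cypher
// With an identifier: "p1" is bound to the new node but never used again.
CREATE (p1:Person {name: 'Alice'});

// Without an identifier: same node, nothing bound.
CREATE (:Person {name: 'Alice'});
```

If replanning is the bottleneck, it may also be worth passing values as parameters (for example $name) instead of literals, since parameterized queries let Neo4j reuse a cached execution plan.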
Use UNWIND for large-volume queries
Another great tool to use is UNWIND. You can find more details at https://neo4j.com/docs/cypher-manual/current/clauses/unwind/.
With UNWIND, you can transform any list back into individual rows. These lists can be parameters that were passed in, previously collected results, or other list expressions.
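As a minimal sketch of this idea, assuming a list parameter named $persons whose entries are maps with a name key (both names are hypothetical):

```cypher
// $persons is assumed to look like [{name: 'Alice'}, {name: 'Bob'}, ...]
// UNWIND turns the list into one row per entry; CREATE then runs once per row.
UNWIND $persons AS person
CREATE (:Person {name: person.name})
```

The whole list travels in one parameterized statement, so the client sends one query rather than one per person.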
With this pattern, you pass in a list of the persons you want to create, then UNWIND the list back into individual rows for node creation. When it comes to large volumes, e.g. 100k and above, UNWIND can reduce the execution time to one quarter or less.
Use APOC for periodic commit
APOC (Awesome Procedures On Cypher) is an add-on library for Neo4j that provides hundreds of procedures and functions. They simplify Cypher queries and sometimes provide functionality that could not be written in Cypher at all. When it comes to performance, the most important ones are in apoc.periodic, which includes procedures such as commit, iterate, and repeat. Take apoc.periodic.commit as an example. This procedure instructs the system to commit periodically instead of processing all the data in one commit. This improves performance greatly and avoids out-of-memory issues.
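A minimal sketch of apoc.periodic.commit, assuming the APOC library is installed. The processed property and the batch size of 10000 are hypothetical choices for illustration.

```cypher
// The inner statement runs repeatedly, each time in its own transaction,
// until it returns 0. LIMIT $limit caps the work done per transaction.
CALL apoc.periodic.commit(
  "MATCH (p:Person) WHERE p.processed IS NULL
   WITH p LIMIT $limit
   SET p.processed = true
   RETURN count(*)",
  {limit: 10000}
)
```

The inner statement has to return a count and has to shrink the remaining work on each pass (here by setting processed); otherwise the loop would never reach 0.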
Create indexes
This is a no-brainer. Neo4j supports single-property and composite indexes. An appropriate index helps improve read performance. You can create one with a single CREATE INDEX statement.
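A minimal sketch of both index types, assuming the Neo4j 4.x or later syntax. The index names and the name and age properties are hypothetical.

```cypher
// Single-property index on :Person(name)
CREATE INDEX person_name FOR (p:Person) ON (p.name);

// Composite index on :Person(name, age)
CREATE INDEX person_name_age FOR (p:Person) ON (p.name, p.age);
```

Prefixing a query with EXPLAIN or PROFILE is one way to check whether the planner actually uses the index.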
Create a separate node for frequently searched attributes
Imagine a data model with :Student and :Course nodes and a simple relationship (:Student)-[:ATTEND]->(:Course). A :Course has properties such as course name, date, and duration. Depending on your use case, you may need to query the database for the students who attend courses on certain dates.
It may seem strange to people from a relational database (RDBMS) background, but dates can be modeled as separate nodes. The path becomes (:Student)-[:ATTEND]->(:Course)-[:CONDUCTED_ON]->(:CourseDates).
By doing this, the Neo4j query planner can start the scan from the dates, which lets the query zoom in on a single date. If you have lots of students and courses, partitioning your data by date is a good way to improve performance, especially if you use complex filters on students and courses, such as wildcard searches.
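A query against this model might look like the sketch below. The date and name properties and the $date parameter are assumptions for illustration.

```cypher
// Anchor on a single date node, then walk back to the courses and students.
MATCH (d:CourseDates {date: $date})<-[:CONDUCTED_ON]-(c:Course)<-[:ATTEND]-(s:Student)
RETURN s.name AS student, c.name AS course
```

With an index on the date property of :CourseDates, the lookup of the starting node should stay cheap even as the number of students and courses grows.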
There are many other ways to tune your queries, but the above are the most obvious ones to start with. I hope this is helpful!