Using Apache Tinkerpop Gremlin 3 graph API with OrientDB version 2

This post describes how to get started using the Gremlin 3 console with OrientDB version 2.2.

OrientDB is an open-source multi-model database that combines NoSQL documents with a graph database. Apache Tinkerpop is a vendor-neutral graph framework which includes the Gremlin graph API.

OrientDB version 2.2 includes the Gremlin 2 console. Gremlin 3 support is anticipated in OrientDB version 3, but if you want to use Gremlin 3 with OrientDB 2.2, you can do so using the orientdb-gremlin driver. This post takes you through the steps to do that, and touches on some limitations.

Initial Tinkerpop setup

Download and unzip version 3.2.3 of the Gremlin Console from the Tinkerpop archive. This is an older (back-level) release, but the orientdb-gremlin driver is currently pinned to it.

Start the Gremlin console:

cd gremlin
bin/gremlin.sh

If you’re not familiar with Gremlin, you may want to run through the Tinkerpop Getting Started tutorial.

Using built-in in-memory OrientDB graph

We can try out the orientdb-gremlin driver without installing OrientDB. First install the driver in the Gremlin console:

:install com.michaelpollmeier orientdb-gremlin 3.2.3.0
:import org.apache.tinkerpop.gremlin.orientdb.OrientGraphFactory

You should see a long list of imported packages.

The driver is not enabled as a Gremlin console plugin, so you will need to do the :import (but not the :install) whenever you start the console.

Create a connection to an in-memory database:

g = new OrientGraphFactory("memory:orienttest").getNoTx()

You should see something like:

INFO: OrientDB auto-config DISKCACHE=3,641MB (heap=3,641MB direct=3,641MB os=16,384MB), assuming maximum direct memory size equals to maximum JVM heap size
Apr 02, 2017 10:44:44 AM com.orientechnologies.common.log.OLogManager log
WARNING: MaxDirectMemorySize JVM option is not set or has invalid value, that may cause out of memory errors. Please set the -XX:MaxDirectMemorySize=16384m option when you start the JVM.
==>orientgraph[com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx@6438a7fe]

For now, you can ignore the warning. If you wish to avoid it, use something like the following when you start the Gremlin console:

JAVA_OPTIONS=-XX:MaxDirectMemorySize=2056m bin/gremlin.sh

Create a graph traverser for the database:

t = g.traversal()

and then try creating and querying some vertices and an edge:

v1 = g.addVertex(T.label,'person','name','fred')
v2 = g.addVertex(T.label,'dog','name','fluffy')
v1.addEdge('feeds',v2)

After each command, you should see the identity of the created vertex or edge, for example:

gremlin> v1.addEdge('feeds',v2)
Apr 02, 2017 10:47:50 AM com.orientechnologies.common.log.OLogManager log
INFO: $ANSI{green {db=orienttest}} created class 'E_feeds' as subclass of 'E'
==>e[#41:0][#25:0-feeds->#33:0]

Now run a query to return the dogs fed by Fred:

t.V().hasLabel('person').has('name', 'fred').out('feeds').values('name')

This should return ‘fluffy’ as the answer:

gremlin> t.V().hasLabel('person').has('name', 'fred').out('feeds').values('name')
Apr 02, 2017 10:50:35 AM com.orientechnologies.common.log.OLogManager log
WARNING: $ANSI{green {db=orienttest}} scanning through all elements without using an index for Traversal [OrientGraphStep(vertex,[~label.eq(person), name.eq(fred)]), VertexStep(OUT,[feeds],vertex), PropertiesStep([name],value), NoOpBarrierStep(2500)]
==>fluffy

We will come back to the warning later.

To quit out of the Gremlin console, type :exit. If you are in the Gremlin console and get into a confused state, type :clear to abort the current command.

Using OrientDB Server

Install OrientDB Server

Download and unzip the community edition of OrientDB. Version 2.2.17 was used for this example.

Start the server and assign it a root password when prompted:

cd orientdb
bin/server.sh

Leave the server running and, in a second terminal, start the OrientDB console:

cd orientdb
bin/console.sh

As I mentioned earlier, this version of OrientDB distributes Gremlin V2 in its bundle. If you are searching for information, note that the OrientDB Gremlin documentation is not directly applicable to our use of Gremlin V3.

Set up OrientDB Database Index

For the example dataset, we need to set up a database and index in advance. Although this may be possible through the Gremlin driver, the semantics differ between vendors, so I chose to do it directly in the OrientDB console.

In the OrientDB console, create a graph database called ‘votes’, using the root password you created earlier:

create database remote:localhost/votes root <rootpwd> plocal

plocal is a storage strategy using multiple local files. An alternative would be to specify memory storage.

You should see something like:

Creating database [remote:localhost/votes] using the storage type [plocal]...
Connecting to database [remote:localhost/votes] with user 'root'...OK
Database created successfully.

Current database is: remote:localhost/votes

We need to create a vertex class and define a property on it so that we can create an index on the ‘userId’ property of the Vote vertices. The class will correspond to a label that we use later in Gremlin queries.

create class V_Vote extends V
create property V_Vote.userId STRING
create index user_index ON V_Vote (userId) UNIQUE

The name of the class used for the vertices needs to correspond to the label that will be used in Gremlin, prefixed with ‘V_’, i.e. the label in Gremlin in this case will be “Vote”. The classes for edges need to be prefixed with ‘E_’.

Although it is possible to define a property and index on the base vertex class (V), the orientdb-gremlin driver will not use this in its traversal strategy.

To list the indexes created on the database:

indexes

You should see a table of indexes including the ‘user_index’ we just created.

Running Gremlin queries which use index

These instructions are based on the ‘Loading Data’ section of the Tinkerpop Getting Started tutorial, modified to work with the orientdb-gremlin driver.

Download the example data:

curl -L -O http://snap.stanford.edu/data/wiki-Vote.txt.gz
gunzip wiki-Vote.txt.gz
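
Before loading, it can help to confirm that the file has the shape the loading script later in this post expects: comment lines starting with '#', and data lines holding two tab-separated user IDs. The following is a minimal sketch that runs those checks against a small made-up sample; the file name sample-Vote.txt and its contents are assumptions for illustration only. Set FILE to wiki-Vote.txt (and drop the printf line) to check the real download.

```shell
#!/bin/sh
# Hypothetical sample that mimics the layout of wiki-Vote.txt.
FILE=sample-Vote.txt
printf '# FromNodeId\tToNodeId\n30\t1412\n30\t3352\n3\t28\n' > "$FILE"

# Count comment lines (starting with '#') and data lines (everything else).
comment_lines=$(grep -c '^#' "$FILE")
data_lines=$(grep -vc '^#' "$FILE")

# Count data lines that do NOT have exactly two tab-separated fields.
bad_lines=$(awk -F '\t' '!/^#/ && NF != 2 { n++ } END { print n + 0 }' "$FILE")

echo "comments=$comment_lines data=$data_lines malformed=$bad_lines"
```

A non-zero malformed count would suggest the file was truncated or re-encoded during download, in which case the split on tab in the loader would not yield a pair of IDs.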

Restart the Gremlin console and create a graph traverser for the votes graph:

:import org.apache.tinkerpop.gremlin.orientdb.OrientGraphFactory
g = new OrientGraphFactory("remote:localhost/votes").getNoTx()
t = g.traversal()

Create a helper function that we will use to load data:

getOrCreate = { id ->
  t.V().hasLabel('Vote').has('userId', id).tryNext().orElseGet{ 
    t.addV('userId', id, T.label, 'Vote').next() 
  }
}

In the above, the pattern t.V().hasLabel(<label>).has(<key>, <value>) is the only one that will cause the orientdb-gremlin driver to use an index. You can see this even with an empty graph. Try:

t.V().has('userId', "1234")

and you will see a warning message like:

Mar 02, 2017 9:37:01 AM com.orientechnologies.common.log.OLogManager log
WARNING: scanning through all elements without using an index for Traversal [OrientGraphStep(vertex,[userId.eq(id)])]

To actually load the data, run the following. You may need to modify the filepath to point to the one you downloaded:

new File('wiki-Vote.txt').eachLine {
  if (!it.startsWith("#")){
    (fromVertex, toVertex) = it.split('\t').collect(getOrCreate)
    fromVertex.addEdge('votesFor', toVertex)
  }
}

This should take 1 to 2 minutes. Without the index, it will take much longer (10x or more), and will output multiple index warnings.
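
Once the load finishes, the totals to expect in the graph can be derived from the input file itself: one 'votesFor' edge per data line, and one 'Vote' vertex per distinct user ID in either column. The sketch below shows the calculation on a made-up sample (sample-Vote.txt and its contents are hypothetical). Run against wiki-Vote.txt, it gives figures that the Gremlin counts should match, assuming the load completed without errors.

```shell
#!/bin/sh
# Hypothetical sample that mimics the layout of wiki-Vote.txt.
FILE=sample-Vote.txt
printf '# FromNodeId\tToNodeId\n30\t1412\n30\t3352\n3\t28\n1412\t3\n' > "$FILE"

# Each non-comment line is loaded as one 'votesFor' edge.
expected_edges=$(grep -vc '^#' "$FILE")

# Each distinct ID in either column is loaded as one 'Vote' vertex.
# Carriage returns are stripped in case the file has Windows line endings.
expected_vertices=$(grep -v '^#' "$FILE" | tr -d '\r' | tr '\t' '\n' | sort -u | wc -l | tr -d ' ')

echo "expected vertices=$expected_vertices edges=$expected_edges"
```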

Exploring query behavior

To query the resulting data:

t.V().hasLabel('Vote').count()
t.E().hasLabel('votesFor').count()
t.V().hasLabel('Vote').has('userId', '80').values()

Note that the edge count will take some time. To see why, ask Gremlin to explain the query:

t.E().hasLabel('votesFor').count().explain()

You will see that the first step of the optimized query still retrieves all of the ‘votesFor’ edges before counting them:

Original Traversal                 [GraphStep(edge,[]), HasStep([~label.eq(votes
                                      For)]), CountGlobalStep]
...
Final Traversal                    [OrientGraphStep(edge,[~label.eq(votesFor)]),
                                       CountGlobalStep]

Contrast this with the explain for the last query:

t.V().hasLabel('Vote').has('userId', '80').values().explain()

This changes the original traversal to two steps, the first returning a specific record using the index:

Original Traversal                 [GraphStep(vertex,[]), HasStep([~label.eq(Vot
                                      e)]), HasStep([userId.eq(80)]), PropertiesSte
                                      p(value)]
...
Final Traversal                    [OrientGraphStep(vertex,[~label.eq(Vote), use
                                      rId.eq(80)]), PropertiesStep(value)]

To perform the equivalent queries from the OrientDB console:

select count(*) from V_Vote
select count(*) from E_votesFor
select * from V_Vote where userId=80

These counts return immediately.

Note: The orientdb-gremlin driver automatically creates the ‘E_votesFor’ class from the label used when adding the edge.

You can also run similar queries with the built-in Gremlin 2 support in the OrientDB console:

gremlin g.V().count()
gremlin g.E().count()
gremlin g.V().has('userId', 80)

The performance is better with OrientDB’s built-in Gremlin, although the Edge count still takes much longer than the native query. Hopefully the Gremlin 3 support with OrientDB version 3 will be a significant improvement over Gremlin 3 with OrientDB version 2.
