Getting started with RDF SPARQL queries and inference using Apache Jena Fuseki

This post describes how to get started using the Apache Jena RDF Server (Fuseki) to perform basic SPARQL queries and inference.

Apache Jena is an open-source framework for building semantic web applications. Fuseki is a component of Jena that lets you query an RDF data model using the SPARQL query language.

This post takes you through the steps to set up a Fuseki 2 server, configure it to perform basic inference, and run some SPARQL queries.

This example was inspired by Jörg Baach’s Getting started with Jena, Fuseki and OWL Reasoning.

Set up Fuseki Server

Download and unzip the Fuseki server from the Jena download site. This version used in this post is Fuseki 2.5.0.

In the fuseki directory, run the server:

cd fuseki
./fuseki-server

Browse to http://localhost:3030. You should see the Fuseki server home page, with a green server status in the top right, and with no datasets on it.

Stop the server (CTRL-C).

Configure Datasets

We are going to create the configuration for three datasets, by creating files in fuseki/run/configuration. The three datasets will illustrate:

  • a persisted dataset without inference
  • an in-memory dataset with inference
  • a persisted dataset with inference

Persisted dataset without inference

Create the configuration for a persisted dataset using Jena TDB with no inference enabled, in a file called fuseki/run/configuration/animals_no_inf.ttl. Replace <path> in the dataset section with the path to your fuseki directory:

@prefix :      <http://base/#> .
@prefix tdb:   <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

:service1  a                          fuseki:Service ;
        fuseki:dataset                :dataset ;
        fuseki:name                   "animals_no_inf" ;
        fuseki:serviceQuery           "query" , "sparql" ;   # SPARQL query service
        fuseki:serviceReadGraphStore  "get" ;                # Graph store protocol (read only)
        fuseki:serviceReadWriteGraphStore "data" ;           # Graph store protocol (read/write)
        fuseki:serviceUpdate          "update" ;             # SPARQL update service
        fuseki:serviceUpload          "upload" .             # File upload

:dataset
        a             tdb:DatasetTDB ;
        tdb:location  "<path>/fuseki/run/databases/animals_no_inf" .

Lines 1-6 set up some prefixes for schemas used in the definition. Lines 8-15 set up a service to access the dataset supporting SPARQL query and update, as well as file upload. Lines 17-19 declare the dataset and where it should be stored.

Note that the run directory is only created when you start the server. If you skipped starting the server, you will need to create the directories manually.

In-memory dataset with inference

Create an in-memory dataset, with inference in the file fuseki/run/configuration/animals_in_mem.ttl:

@prefix :      <http://base/#> .
@prefix tdb:   <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

:service1  a                          fuseki:Service ;
        fuseki:dataset                :dataset ;
        fuseki:name                   "animals_in_mem" ;
        fuseki:serviceQuery           "query" , "sparql" ;
        fuseki:serviceReadGraphStore  "get" ;
        fuseki:serviceReadWriteGraphStore  "data" ;
        fuseki:serviceUpdate          "update" ;
        fuseki:serviceUpload          "upload" .

:dataset  a     ja:RDFDataset ;
  ja:defaultGraph <#model_inf_1> ;
  .

<#model_inf_1> rdfs:label "Inf-1" ;
 ja:reasoner
 [ ja:reasonerURL
 <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner>];
 

Lines 1-15 are as above. Lines 17-19 declare the dataset and a default graph using the inference model, and lines 21-24 set up an inference model using the Jena OWL FB reasoner.

Persisted dataset with inference

Create an TDB (persisted) dataset, with inference in the file fuseki/run/configuration/animals.ttl, replacing <path> with the path to your fuseki directory:

 @prefix :      <http://base/#> .
@prefix tdb:   <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

:service_tdb_all  a                   fuseki:Service ;
        fuseki:dataset                :dataset ;
        fuseki:name                   "animals" ;
        fuseki:serviceQuery           "query" , "sparql" ;
        fuseki:serviceReadGraphStore  "get" ;
        fuseki:serviceReadWriteGraphStore "data" ;
        fuseki:serviceUpdate          "update" ;
        fuseki:serviceUpload          "upload" .

:dataset a ja:RDFDataset ;
    ja:defaultGraph       <#model_inf> ;
     .

<#model_inf> a ja:InfModel ;
     ja:baseModel <#graph> ;
     ja:reasoner [
         ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner>
     ] .

<#graph> rdf:type tdb:GraphTDB ;
  tdb:dataset :tdb_dataset_readwrite .

:tdb_dataset_readwrite
        a             tdb:DatasetTDB ;
        tdb:location  "<path>/fuseki/run/databases/animals"
        .

Lines 1-15 are as above. Lines 17-19 declare the dataset and a default graph using the inference model. Lines 21-25 set up an inference model that references a TDB graph. The TDB graph and the dataset in which it is persisted are defined in lines 27-33.

Create datasets

Exit and restart the server. You should see it load the configuration and register the new datasets:

./fuseki-server
...
[2017-03-02 16:56:50] Config     INFO  Load configuration: file:///Users/cdraper/cortex/fuseki2/run/configuration/animals.ttl
[2017-03-02 16:56:51] Config     INFO  Load configuration: file:///Users/cdraper/cortex/fuseki2/run/configuration/animals_in_mem.ttl
[2017-03-02 16:56:51] Config     INFO  Load configuration: file:///Users/cdraper/cortex/fuseki2/run/configuration/animals_no_inf.ttl
[2017-03-02 16:56:51] Config     INFO  Register: /animals
[2017-03-02 16:56:51] Config     INFO  Register: /animals_in_mem
[2017-03-02 16:56:51] Config     INFO  Register: /animals_no_inf

If you get errors, make sure you replaced the <path> correctly in the two persisted dataset configurations.

Refresh the UI at http://localhost:3030. You should now see the three datasets. Click ‘add data’ for the ‘animals_no_inf’ dataset.

We could load data from a file, but we are going to use a query. Go to the query tab. Make sure the dataset dropdown is set to ‘animals_no_inf’.

Change the SPARQL endpoint to: http://localhost:3030/animals_no_inf/update

Paste the insert command into the command area, overwriting anything that is there:

PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>
PREFIX zoo:   <http://example.org/zoo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

INSERT DATA {
ex:dog1    rdf:type         ex:animal .
ex:cat1    rdf:type         ex:cat .
ex:cat     rdfs:subClassOf  ex:animal .
zoo:host   rdfs:range       ex:animal .
ex:zoo1    zoo:host         ex:cat2 .
ex:cat3    owl:sameAs       ex:cat2 .
}

Lines 1-5 set up some prefixes for schemas used in the insert command. Lines 8-13 are the RDF triples for this example.

Execute the command by clicking on the arrow. You should get a result in the lower part of the UI indicating a successful update.

Repeat for the other datasets by changing the dropdown and using the ‘update’ URL.

Query datasets

In the ‘animals_no_inf’ dataset, change the SPARQL endpoint to: http://localhost:3030/animals_no_inf/sparql and run the following:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/>
SELECT * WHERE {
  ?sub a ex:animal
}
LIMIT 10

You should get a single result ‘ex:dog1’. This is because ‘dog1’ is the only subject explicitly identified as being of type ‘ex:animal’.

Now repeat the query for the ‘animals’ dataset (make sure to use the query URL). You should get the full set of inferred results:

ex:dog1
ex:cat2
ex:cat3
ex:cat1

‘cat2’ is inferred to be an animal because it is the object of a ‘zoo:host’ predicate, and because the objects of this predicate are declared to be of type animal by the ‘rdfs:range’ predicate. ‘cat1’ is inferred to be an animal because it is of type ‘cat’, which is declared to be a subclass of animal. ‘cat3’ is an animal because it is the same as ‘cat1’.

Inference Rules

All of the inference in the above is encoded in tuples that are imported by the OWL reasoner that the ‘animals’ dataset is configured to use.

In the browser, go to the ‘Dataset > Edit’ tab. For the ‘animals_no_inf’ dataset, click on ‘List current graphs’ then click on the default graph. You will see the 6 tuples that we inserted.

Then go to one of the other datasets and repeat. You will see there are over 600 tuples, consisting of the ontology rules being used for the inference.

Using Apache Tinkerpop Gremlin 3 graph API with OrientDB version 2

This post describes how to get started using the Gremlin 3 console with OrientDB version 2.2.

OrientDB is an open-source multi-model database that combines NoSQL documents with a graph database. Apache Tinkerpop is a vendor-neutral graph framework which includes the Gremlin graph API.

OrientDB version 2.2 includes the Gremlin 2 console. Gremlin 3 support is anticipated in OrientDB version 3, but if you want to use Gremlin 3 with OrientDB 2.2, you can do so using the orientdb-gremlin driver. This post takes you through the steps to do that, and touches on some limitations.

Initial Tinkerpop setup

Download and unzip version 3.2.3 (BACKLEVEL) of the Gremlin Console from the Tinkerpop archive. The orientdb-gremlin driver is currently pinned to this backlevel version.

Start the Gremlin console:

cd gremlin
bin/gremlin.sh

If you’re not familiar with Gremlin, you may want to run through the Tinkerpop Getting Started tutorial.

Using built-in in-memory OrientDB graph

We can try out the orientdb-gremlin driver without installing OrientDB. First install the driver in the Gremlin console:

:install com.michaelpollmeier orientdb-gremlin 3.2.3.0
:import org.apache.tinkerpop.gremlin.orientdb.OrientGraphFactory

You should see a long list of imported packages.

The driver is not enabled as a Gremlin console plugin, so you will need to do the :import (but not the :install) whenever you start the console.

Create a connection to an in-memory database.:

g = new OrientGraphFactory("memory:orienttest").getNoTx()

You should see something like:

INFO: OrientDB auto-config DISKCACHE=3,641MB (heap=3,641MB direct=3,641MB os=16,384MB), assuming maximum direct memory size equals to maximum JVM heap size
Apr 02, 2017 10:44:44 AM com.orientechnologies.common.log.OLogManager log
WARNING: MaxDirectMemorySize JVM option is not set or has invalid value, that may cause out of memory errors. Please set the -XX:MaxDirectMemorySize=16384m option when you start the JVM.
==>orientgraph[com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx@6438a7fe]

For now, you can ignore the warning. If you wish to avoid it, use something like the following when you start the Gremlin console:

JAVA_OPTIONS=-XX:MaxDirectMemorySize=2056m bin/gremlin.sh

Create a graph traverser for the database:

t = g.traversal()

and then try creating and querying some vertices and an edge:

v1 = g.addVertex(T.label,'person','name','fred')
v2 = g.addVertex(T.label,'dog','name','fluffy')
v1.addEdge('feeds',v2)

After each command, you should see the identity of the created vertex or edge, for example:

gremlin> v1.addEdge('feeds',v2)
Apr 02, 2017 10:47:50 AM com.orientechnologies.common.log.OLogManager log
INFO: $ANSI{green {db=orienttest}} created class 'E_feeds' as subclass of 'E'
==>e[#41:0][#25:0-feeds->#33:0]

Now run a query to return the dogs fed by Fred:

t.V().hasLabel('person').has('name', 'fred').out('feeds').values('name')

This should return ‘fluffy’ as the answer:

gremlin> t.V().hasLabel('person').has('name', 'fred').out('feeds').values('name')
Apr 02, 2017 10:50:35 AM com.orientechnologies.common.log.OLogManager log
WARNING: $ANSI{green {db=orienttest}} scanning through all elements without using an index for Traversal [OrientGraphStep(vertex,[~label.eq(person), name.eq(fred)]), VertexStep(OUT,[feeds],vertex), PropertiesStep([name],value), NoOpBarrierStep(2500)]
==>fluffy

We will come back to the warning later.

To quit out of the Gremlin console, type :exit. If you are in the Gremlin console and get into a confused state, type :clear to abort the current command.

Using OrientDB Server

Install OrientDB Server

Download and unzip the community edition of OrientDB. Version 2.2.17 was used for this example.

Start the server and assign it a root password:

cd orientdb
bin/server.sh

Start the console:

cd orientdb
bin/console.sh

As I mentioned earlier, this version of OrientDB distributes Gremlin V2 in its bundle. If you are searching for information, note that the OrientDB Gremlin documentation is not directly applicable to our use of Gremlin V3.

Setup OrientDB Database Index

For the example dataset, we need to setup a database and index in advance. Although this may be possible through the Gremlin driver, its semantics is different for different vendors, so I chose to do this directly in the OrientDB console.

In the orientDB console, create a graph database called ‘votes’, using the root password you created earlier:

create database remote:localhost/votes root <rootpwd> plocal

plocal is a storage strategy using multiple local files. An alternative would be to specify memory storage.

You should see something like:

Creating database [remote:localhost/votes] using the storage type [plocal]...
Connecting to database [remote:localhost/votes] with user 'root'...OK
Database created successfully.

Current database is: remote:localhost/votes

We need to create a vertex class and define a property on it so that we can create an index on the ‘userId’ property of the Vote vertices. The class will correspond to a label that we use later in Gremlin queries.

create class V_Vote extends V
create property V_Vote.userId STRING
create index user_index ON V_Vote (userId) UNIQUE

The name of the class used for the vertices needs to correspond to the label that will be used in Gremlin, prefixed with ‘V_’, i.e. the label in Gremlin in this case will be “Vote”. The classes for edges need to be prefixed with ‘E_’.

Although it is possible to define a property and index on the base vertex class (V), the orientdb-gremlin driver will not use this in its traversal strategy.

To list the indexes created on the database:

indexes

You should see a table of indexes including the ‘user_index’ we just created.

Running Gremlin queries which use index

These instructions are based on the ‘Loading Data’ section of the Tinkerpop Getting Started tutorial, modified to work with the orientdb-gremlin driver.

Download the example data:

curl -L -O http://snap.stanford.edu/data/wiki-Vote.txt.gz
gunzip wiki-Vote.txt.gz

Restart the Gremlin console and create a graph traverser for the votes graph:

:import org.apache.tinkerpop.gremlin.orientdb.OrientGraphFactory
g = new OrientGraphFactory("remote:localhost/votes").getNoTx()
t = g.traversal()

Create a helper function that we will use to load data:

getOrCreate = { id ->
  t.V().hasLabel('Vote').has('userId', id).tryNext().orElseGet{ 
    t.addV('userId', id, T.label, 'Vote').next() 
  }
}

In the above, the pattern: g.V().hasLabel().has(, ) is the only one that will cause the orientdb-gremlin driver to use an index. You can see this even with an empty graph. Try:

t.V().has('userId', "1234")

and you will see a warning message like:

Mar 02, 2017 9:37:01 AM com.orientechnologies.common.log.OLogManager log
WARNING: scanning through all elements without using an index for Traversal [OrientGraphStep(vertex,[userId.eq(id)])]

To actually load the data, run the following. You may need to modify the filepath to point to the one you downloaded:

new File('wiki-Vote.txt').eachLine {
  if (!it.startsWith("#")){
    (fromVertex, toVertex) = it.split('\t').collect(getOrCreate)
    fromVertex.addEdge('votesFor', toVertex)
  }
}

This should take 1 to 2 minutes. Without the index, it will take much longer (10x or more), and will output multiple index warnings.

Exploring query behavior

To query the resulting data:

t.V().hasLabel('Vote').count()
t.E().hasLabel('votesFor').count()
t.V().hasLabel('Vote').has('userId', '80').values()

Note that the edge count will take some time. If you ask Gremlin to explain the query:

t.E().hasLabel('votesFor').count().explain()

You will see that the first step of the optimized query still retrieves all of the ‘votesFor’ edges before counting them:

Original Traversal                 [GraphStep(edge,[]), HasStep([~label.eq(votes
                                      For)]), CountGlobalStep]
...
Final Traversal                    [OrientGraphStep(edge,[~label.eq(votesFor)]),
                                       CountGlobalStep]

Contrast this with the explain for the last query:

t.V().hasLabel('Vote').has('userId', '80').values().explain()

This changes the original traversal to two steps, the first returning a specific record using the index:

Original Traversal                 [GraphStep(vertex,[]), HasStep([~label.eq(Vot
                                      e)]), HasStep([userId.eq(80)]), PropertiesSte
                                      p(value)]
...
Final Traversal                    [OrientGraphStep(vertex,[~label.eq(Vote), use
                                      rId.eq(80)]), PropertiesStep(value)]

To perform the equivalent queries from the orientdb console:

select count(*) from V_Vote
select count(*) from E_votesFor
select * from V_Vote where userId=80

These counts return immediately.

Note: Gremlin automatically creates the ‘E_votesFor’ class from the edge name based on the label used when adding the edge.

Using the built-in Gremlin 2 queries in the OrientDB console:

gremlin g.V().count()
gremlin g.E().count()
gremlin g.V().has('userId', 80)

The performance is better with OrientDB’s built-in Gremlin, although the Edge count still takes much longer than the native query. Hopefully the Gremlin 3 support with OrientDB version 3 will be a significant improvement over Gremlin 3 with OrientDB version 2.