12 February 2007

Jena SDB

SDB was released for the first time last week. While this is the first alpha release, it's actually at least the second internal query architecture and second loader architecture (Damian did the work to get the loader working fast).

SPARQL now has a formal algebra. ARQ is used to turn the SPARQL query syntax into a algebra expression; SDB takes over and compiles it first to a relational algebra(ish) structure then generates SQL syntax. Now there is a SPARQL algebra, this is all quite a bit simpler for SDB which is why this is the second design for query generation; much of the work now comes naturally from ARQ.

Patterns

At the moment, SDB only translates basic graph patterns, join and leftjoin expressions to SQL and leaves anything else to ARQ to stitch back together. For the current patterns, there are two tricky cases to consider:

  • SPARQL join relationships aren't exactly SQL's because they may involve unbound variables.
  • Multiple OPTIONALs may need to be COALESCEd.

For the first, sometimes the join needs to involve more than equality relationships like "if col1 = null or col1 = col2", which is a bit of scope tracking, and for the second, if a variable can be bound in two or more OPTIONALs , you have to take the first binding. The scope tracking is needed anyway.

Over time, more and more of the SPARQL algebra expression will be translated to SQL.

Layouts

SDB supports a number of databases layouts in two main classes:
  • Type 1 layouts have a single triple table, with all information encoded into the columns of the table.
  • Type 2 layouts use a triple table with a separate RDF terms table

Type 1 layouts are good at supporting fine-grained API calls where the need to join to get the actual RDF terms is removed because they are encoded into the triple tables columns. Jena's existing database layer, RDB, is an example of this. When the move was made from Jena1 to Jena2, the DB layout changed to a type 1 layout and it went faster. The two type 1 layouts supported are Jena's existing RDB layout and a debug version which encodes RDF terms in SPARQL-syntax directly into the triples table so you can simple read the table with an SQL query browser.

Type 2 layouts, where the triples table has pointers to a nodes table, are better as SPARQL queries gets larger. Databases prefer small, fixed width columns to variable string comparisons. SDB has two variations, one using 4 bytes integer indexes and one using 8 byte hashes. The hash form means that hash of query constants can be calculated and don't have to be looked up in the SQL.

It seemed that the hash form would be better all round.  But it isn't - loading was 2x slower (sometimes worse) despite the fact that RDF terms don't have to be inserted into the nodes table first to get their auto-allocated sequence id. Databases we have tried are significant slower indexing 8 byte quantities than 4 byte quantities and this dominates the load performance.

Next

There are three directions to go:

  1. Inference support
  2. Application-specific layout control
  3. Filters support

(1) and (2) are linked by the fact it is looking at a query and deciding, for certain known predicates and part-patterns, that different database tables should be used instead.  See Kevin's work on property tables which uses the approach to put some higher level understanding of the data back into the RDF store. (3) is "just" a matter of doing it.

And if you have some feature request, do suggest it.

No comments: