09 December 2010

Performance benchmarks for the TDB loader (version 2)

CAVEAT

There are "Lies, damned lies, and statistics" but worse are probably performance measurements done by someone else. The real test is what does it mean for any given application and is performance "fit for purpose". Database-related performance measurements are particular murky. The shape of the data matters, the usage made of the data matters, all in ways that can wildly affect whether a system is for for purpose.

Treat these figures with care - they are given to compare the current TDB bulk loader (up to version 0.8.7) and the new one (version 0.8.8 and later). Even then, the new bulk loader is new, so it is subject to tweaking and tuning, but hopefully only to improve performance, not worsen it.

See also

http://esw.w3.org/RdfStoreBenchmarking.

Summary

The new bulk loader is faster by x2 or more depending on the characteristics of the data. As loads can take hours, this saving is very useful. It produces smaller databases, and those databases perform as well as or better than the ones produced by the current bulk loader.

Setup

The tests were run on a small local server, not tuned or provisioned for database work, just a machine that happened to be easily accessible.

  • 8GB RAM
  • 4 core Intel i5 760 @ 2.8GHz
  • Ubuntu 10.10 - ext4 file system
  • Disk: WD 2TB SATA-300, 7200rpm, 64MB buffer
  • Java version Sun/Oracle JDK 1.6.0_22 64-Bit Server VM

BSBM

The BSBM published results are from Nov 2009.

The figures here are produced using a modified version of the BSBM tool set used for version 2 of BSBM. The modifications are to run the tests on a local database, not over HTTP. The code is available from github. See also this article.

BSBM - Loader performance

BSBM dataset   Triples       Loader 1 time   Rate         Loader 2 time   Rate
50k            50,057        3s              18,011 TPS   7s               7,151 TPS
250k           250,030       8s              31,702 TPS   11s             22,730 TPS
1m             1,000,313     26s             38,956 TPS   27s             37,049 TPS
5m             5,000,339     121s            41,298 TPS   112s            44,646 TPS
25m            25,000,250    666s            37,561 TPS   586s            42,663 TPS
100m           100,000,112   8,584s          11,650 TPS   3,141s          31,837 TPS
200m           200,031,413   30,348s         6,591 TPS    8,309s          24,074 TPS
350m           350,550,000   83,232s         4,212 TPS    21,146s         16,578 TPS

BSBM - Database sizes

Database Size/loader1 Size/loader2
50k 10MB 7.2MB
250k 49MB 35MB
1m 198MB 137MB
5m 996MB 680MB
25m 4.9GB 3.3GB
100m 20GB 13GB
200m 39GB 26GB
350m 67GB 45GB

BSBM - Query Performance

Numbers are "query mix per hour"; larger numbers are better. The BSBM performance engine was run with 100 warmups and 100 timing runs over local databases.

Loader used   50k        250k      1m        5m        25m      100m    200m    350m
Loader 1      102389.1   87527.4   58441.6   5854.7    1798.4   673.0   410.7   250.0
Loader 2      106920.1   86726.1   62240.7   11384.5   3477.9   797.1   425.8   259.2

What this does show is that for a range of database sizes around 5m to 25m, the databases produced by loader2 are faster. This happens because the working set of the databases produced by loader1 did not fit mostly in memory, whereas for those produced by loader2 it does.

COINS

COINS is the Combined Online Information System from the UK Treasury. It's a real-world database that has been converted to RDF by my colleague, Ian - see Description of the conversion to RDF done by Ian for data.gov.uk.

General information about COINS.

COINS is all named graphs.

COINS - Loader Performance

Quads         Loader 1 time   Rate         Loader 2 time   Rate
417,792,897   26,425s         15,811 TPS   17,057s         24,494 TPS

COINS - Database sizes

Size/loader1 Size/loader2
152GB 77GB

LUBM

LUBM information

LUBM isn't a very representative benchmark for RDF and linked data applications - it is designed more for testing inference. But there are details of various systems published using this benchmark. To check the new loader on this data, I ran loads for a couple of the larger generated datasets. These are the 1000 and 5000 datasets, with inference applied during data creation. The 5000 dataset, just under a billion triples, was only run through the new loader.

LUBM - Loader Performance

LUBM dataset   Triples       Loader 1 time   Rate         Loader 2 time   Rate
1000-inf       190,792,744   7,106s          26,849 TPS   3,965s          48,119 TPS
5000-inf       953,287,749   N/A             N/A          86,644s         11,002 TPS

LUBM - Database sizes

Dataset Size/loader1 Size/loader2
1000-inf 25GB 16GB
5000-inf N/A 80GB

TDB bulk loader - version 2

This article could be subtitled "Good I/O and Bad I/O". By arranging to use good I/O, the new TDB loader achieves faster loading rates despite writing more data to disk. "Good I/O" is file operations that occur in a buffered and streaming fashion. "Bad I/O" is file operations that cause the disk heads to jump about randomly or that work in small units of disk transfer.

The new TDB loader "loader2" is a standalone program that bulk loads data into a new TDB database. It does not support incremental loading, and may destroy existing data. It has only been tested on Linux; it should run on Windows with Cygwin but what the performance will be is hard to tell.

Figures demonstrating the loader in action for various large datasets are in a separate blog entry. It is faster than the current loader for datasets over about 1 million triples and comes into its own above 100 million triples.

Like the current bulk loader ("loader1"), loader2 can load triple and quad RDF formats, and from gzipped files. It runs fastest from N-triples or N-Quads because the parser is fastest, and low overhead, for these formats.

The loader is a shell script that coordinates the various phases. It's available in the TDB development code repository in bin/tdbloader2 and the current 0.8.8 snapshot build.

Loader2 is based on the observation that the speed of loader1 can drop sharply as the memory mapped files fill up RAM (the "can" is because it does not always happen; slightly weird). This fall-off is more than one would expect simply from having to use some disk, and sometimes the rate of loader1 becomes erratic. This could be due to the OS and its management of memory mapped files, but the effect is that the secondary index creation can become rather slow. Loader1 tends to do "bad I/O" - as the caches fill up, blocks are written back in what looks to the disk like a random order, causing the disk heads to jump around.

Copying from the primary index to a secondary index involves a sort because TDB uses B+Trees for its triple and quad indexing. A B+Tree keeps its records in sorted order and each index uses a different order.
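
As a small illustration (the class and method names are mine, not TDB's), the same triple of NodeIds is keyed in a different order in each index, so populating a secondary index from the primary one means re-sorting the whole tuple set:

    // Illustrative only: the same triple of NodeIds (plain longs here) in the key
    // order each index needs. Building POS or OSP from SPO means re-sorting every
    // tuple by its reordered key.
    final class IndexOrder {
        static long[] spoKey(long s, long p, long o) { return new long[] { s, p, o }; }
        static long[] posKey(long s, long p, long o) { return new long[] { p, o, s }; }
        static long[] ospKey(long s, long p, long o) { return new long[] { o, s, p }; }
    }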

Loader1's approach is much faster than simply loading all indexes at once because, when loading all indexes at once, there is only so much RAM to share between caches for parts of all the indexes. Better is to do one index at a time, using the RAM for caching one index at a time.

Loader2 similarly has a data loading phase and an index creation phase.

The first phase is to build the node table and write out the data for index building. Loader2 takes the stream of triples and quads from the parser and writes the RDF terms (IRI, literal, blank node) into the internal node table. It also writes out text files of tuples of NodeIds (the internal 64 bit numbers used to identify each RDF term). This is "good I/O" - the writes of the tuple files are buffered up and the files are written append-only. This phase is a Java program, which exits after the node table and working files have been written.
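
A minimal sketch of that first phase, assuming a node table with a get-or-allocate style lookup; the class and method names are illustrative, not the actual TDB loader code:

    // A minimal sketch of the data loading phase. NodeTable is a hypothetical
    // stand-in for the real node table.
    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    class DataPhase {
        interface NodeTable { long getOrAllocate(String term); }   // hypothetical

        // Open the tuple file append-only; all writes go through one buffer.
        static BufferedWriter openTupleFile(String filename) throws IOException {
            return new BufferedWriter(new FileWriter(filename, true));
        }

        // Write one triple as a line of fixed width hex NodeIds ("good I/O":
        // buffered, append-only, never re-read during this phase).
        static void writeTuple(NodeTable nodes, BufferedWriter out,
                               String s, String p, String o) throws IOException {
            out.write(String.format("%016X %016X %016X%n",
                    nodes.getOrAllocate(s),
                    nodes.getOrAllocate(p),
                    nodes.getOrAllocate(o)));
        }
    }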

The next phase is to produce the indexes, including the primary index. Unlike loader1, loader2 does not write the primary index during node loading. Experimentation showed it was quicker to do it separately despite needing more I/O. This is slightly strange.

To build the indexes, loader2 uses the B+Tree rebuilder, and that requires the data in index-sorted order. Index rebuilding is a sort followed by B+Tree building. The sort is done by Unix sort. Unix sort is very easy to use and it smoothly scales from a few lines to gigabytes of data. Having written the tuple data out as text files in the first phase (and fixed width hex numbers at that - quite wasteful), Unix sort can do a text sort on the files. Despite that meaning lots of I/O, it's good I/O, and the sort program really knows how to best manage temporary files.

For each index, a Unix sort is done to get a temporary file of tuple data in the right sort order. The B+Tree rebuilder is called with this file as the stream of sorted data it needs to create an index.
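
In outline, one index build step looks something like the sketch below, which assumes GNU sort is on the path and uses a hypothetical Rebuilder interface in place of the real TDB classes (because the hex fields are fixed width, a plain text sort gives the right numeric order):

    // A sketch of one index build step: sort the tuple file with Unix sort into
    // a temporary file, then stream the sorted lines into the rebuilder.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    class IndexPhase {
        interface Rebuilder { void add(String record); void finish(); }   // hypothetical

        // e.g. for the POS index, sortKeys would be "2,2", "3,3", "1,1".
        static void buildIndex(String tupleFile, String sortedFile,
                               Rebuilder rebuilder, String... sortKeys) throws Exception {
            List<String> cmd = new ArrayList<String>();
            cmd.add("sort");
            for (String k : sortKeys)
                cmd.add("-k" + k);
            cmd.add("-o");
            cmd.add(sortedFile);
            cmd.add(tupleFile);
            new ProcessBuilder(cmd).start().waitFor();   // external, streaming sort

            BufferedReader in = new BufferedReader(new FileReader(sortedFile));
            String line;
            while ((line = in.readLine()) != null)
                rebuilder.add(line);                     // records arrive in index order
            in.close();
            rebuilder.finish();
        }
    }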

There are still opportunities to tune the new loader, and to see whether piping the output of the sorts directly into the rebuilder is better or worse than the current two-step approach using a temporary file. Using different disks for different temporary files should also help.

The index building phase is parallelisable. Because I/O and memory usage are the bottlenecks, not CPU cycles, the crossover point for this to become effective might be quite high.

To find out whether loader2 is better than loader1, I've run a number of tests. Load and query tests with the Berlin SPARQL Benchmark (2009 version), a load test on the RDF version of COINS (UK Treasury Combined Online Information System - about 420 million quads and it's real data) and a load test using the Lehigh University Benchmark with some inferencing. Details, figures and tables in the next article.

03 December 2010

Repacking B+Trees

TDB uses B+Trees for its triple and quad indexing.

The indexes hold 3 or 4 NodeIds, where a NodeId is a fixed length 64 bit unique number for each RDF term in the database. Numbers, dates and times are encoded directly into the 64 bits where possible; otherwise the NodeId refers to the location of the term in a separate NodeId-to-RDF-term table, as for all other types, including IRIs.
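
The split between inlined values and table references can be pictured roughly as below; this is an illustrative encoding only, not TDB's actual bit layout:

    // Illustrative only: one way a 64 bit NodeId can either inline a small value
    // or point at a location in the node table. Not TDB's actual bit layout.
    final class NodeIds {
        private static final long INLINE_BIT = 1L << 63;   // hypothetical tag bit

        // Small integers encoded directly into the NodeId itself...
        static long inlineInteger(long value) {
            return INLINE_BIT | (value & 0x7FFFFFFFFFFFFFFFL);
        }

        // ...while IRIs, strings and other terms get the location of the term
        // in the separate NodeId-to-RDF-term table.
        static long indirect(long locationInNodeTable) {
            return locationInNodeTable;
        }

        static boolean isInline(long id) { return (id & INLINE_BIT) != 0; }
    }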

The B+Trees have a number of leaf blocks, each of which holds only records (key/value pairs - except there is no "value" part in a triple index, just the key of S, P and O in some order). TDB threads these blocks together so that a scan does not need to touch the rest of the tree - scans happen when you look up, say, S?? for a known subject and unknown property and object. The scan returns all the triples with a particular S. Counting all the triples only touches the leaves of the B+Tree, not the branches.
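
A sketch of why the threading helps, with Leaf and Record as simplified stand-ins for the real structures: after one descent of the tree to the first candidate leaf (the findLeafFor step, not shown), the scan just follows the next pointers.

    // A sketch of a leaf-level scan for all triples with a known subject.
    import java.util.ArrayList;
    import java.util.List;

    class LeafScan {
        static class Record { long s, p, o; }
        static class Leaf { List<Record> records; Leaf next; }   // "next" is the threading pointer

        static List<Record> scanSubject(Leaf start, long subject) {
            List<Record> results = new ArrayList<Record>();
            for (Leaf leaf = start; leaf != null; leaf = leaf.next) {   // walk leaves only
                for (Record r : leaf.records) {
                    if (r.s < subject) continue;            // before the subject in SPO order
                    if (r.s > subject) return results;      // past it - nothing more can match
                    results.add(r);
                }
            }
            return results;
        }
    }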

B+Trees provide performant indexing over a wide range of memory situations, ranging from very little caching of disk structures in memory, through to being able to cache substantial portions of the tree.

The TDB B+Trees have a number of block storage layers: an in-JVM block cache for use on 32 bit JVMs, memory mapped files for 64 bit JVMs, and an in-memory RAM-disk for testing. The in-memory RAM disk is not efficient but it is a very good simulation of a disk - it really does copy the blocks used by a client when written to another area, so there is no possibility of updating blocks through references held by the client after the block has been written to "disk".

However, one disadvantage can be that the index isn't very well packed. The B+Trees guarantee that each block is at least 50% full. In practice, the blocks are 60-70% full for the POS and OSP indexes. But a worse case can arise when inserting into the SPO index, because data typically arrives with all the triples for one subject, then all the triples for another subject, meaning the data is nearly sorted. While this makes the processing faster, it leaves the resulting B+Tree only about 50%-60% packed.

Packing density matters because it influences how much of the tree is cached in a fixed amount of computer memory. If it's 50% packed, then it's only 50% efficient in the cache.

There are various ways to improve on this (compress blocks, B#Trees, and many more besides - B-Tree variations are very extensively studied data structures).

I have been working on a B+Tree repacking program that takes an existing B+Tree and produces a maximally packed B+Tree. The database is then smaller on disk and the in-memory caches are more efficiently used. The trees produced are legal B+Trees, and have a packing density of close to 100%. Rebuilding indexes is fast and scales linearly.

The Algorithm

Normally, B+Trees grow at the root. A B+Tree is the same depth everywhere in the tree and the tree only gets deeper if the root node of the tree is split and a new root is created pointing down to the two blocks formed by splitting the old root. This algorithm, building a tree from a stream of records, grows the tree from the leaves towards the root. While the algorithm is running there isn't a legal tree - it's only when the algorithm finishes that a legal B+Tree emerges.

All the data of a B+tree resides in the leaves - the branches above tell you which leaf block to look in (this is the key difference between B-Trees and B+Trees). The first stage of repacking takes a stream of records (key and value) from the initial tree. This stream will be in sorted order because it's being read out of a B+Tree and for a TDB B+tree, it's a scan tracing the threading of the leaf blocks together. In other words, it's not memory intensive.

In the first stage, new leaf blocks are produced one at a time. A block is filled completely, a new block allocated, the threading pointers completed and the full block written out. In addition, the block number and the highest key in the block are emitted. The leaf block is not touched again.

The exception is the last two blocks of the leaf layer. A B+Tree must have blocks at least 50% full to be a legal tree. Although the TDB B+Tree code can cope with blocks that are smaller than the B+Tree guarantee, it's neater to rebalance the last two blocks in the case where the last block is below the minimum size. Because the second-to-last block is completely full, it's always possible to rebalance across just two blocks.
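
A minimal sketch of this leaf-packing stage, assuming records arrive already sorted; BlockWriter, SplitKeys and the record layout are simplified stand-ins, and the final rebalance of the last two blocks is omitted for brevity:

    // Fill each leaf block completely, write it once, and emit its block number
    // and highest key for the level above.
    import java.util.Arrays;
    import java.util.Iterator;

    class LeafPacker {
        interface BlockWriter { int allocate(); void write(int blockId, long[][] records, int prevBlockId); }
        interface SplitKeys  { void emit(int blockId, long[] highestKey); }

        static void packLeaves(Iterator<long[]> sorted, int recordsPerBlock,
                               BlockWriter blocks, SplitKeys out) {
            long[][] current = new long[recordsPerBlock][];
            int n = 0, blockId = blocks.allocate(), prevId = -1;
            while (sorted.hasNext()) {
                current[n++] = sorted.next();
                if (n == recordsPerBlock) {                  // block completely full
                    blocks.write(blockId, current, prevId);  // written once, never touched again
                    out.emit(blockId, current[n - 1]);       // highest key goes to the level above
                    prevId = blockId;
                    blockId = blocks.allocate();
                    current = new long[recordsPerBlock][];
                    n = 0;
                }
            }
            if (n > 0) {                                     // trailing partial block
                blocks.write(blockId, Arrays.copyOf(current, n), prevId);
                out.emit(blockId, current[n - 1]);
            }
        }
    }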

Phase two takes as input the stream of block numbers and highest keys from the level below and builds branch nodes for the B+Tree pointing, by block number, to the blocks produced in the phase before. When a block is finished, it can be written out and a block number and split key emitted. This split key isn't the highest key in the block - it's the highest key of the entire sub-tree at that point, and this is the key passed up. A B+Tree branch node has N block pointers and N-1 keys; the split key is the last key seen when filling the block, the Nth key from below, which is passed up rather than stored.

Once again, the last two blocks are rebalanced to maintain the B+Tree invariant of all blocks being at least half full. For large trees, there are quite a few blocks, so rebalancing just two of them is insignificant. For small trees, it's not really worth repacking the tree - block caching at runtime hides any advantage there might be.

The second phase is repeatedly applied to the block number and split key stream from the layer below until a layer in the tree is only one block (it cannot be zero blocks). This single block is the new root block. The third phase is to write out the B+Tree details to disk and put the root block somewhere where it can be found when the B+Tree is reopened.
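
One pass of that branch-building phase might look like the sketch below, again with hypothetical stand-ins; the pass is reapplied to its own output until a single block - the new root - remains, and the rebalance of a trailing under-full block is omitted:

    // Build one branch level from (block number, split key) pairs, then repeat
    // on the output until only the root remains.
    import java.util.ArrayList;
    import java.util.List;

    class BranchBuilder {
        static class Entry {                       // (block number, split key) from the level below
            final int blockId; final long[] key;
            Entry(int blockId, long[] key) { this.blockId = blockId; this.key = key; }
        }
        interface BranchWriter { int write(List<Integer> childBlockIds, List<long[]> keys); }  // hypothetical

        static List<Entry> buildLevel(List<Entry> below, int pointersPerBlock, BranchWriter writer) {
            List<Entry> above = new ArrayList<Entry>();
            List<Integer> children = new ArrayList<Integer>();
            List<long[]> keys = new ArrayList<long[]>();
            for (Entry e : below) {
                children.add(e.blockId);
                if (children.size() == pointersPerBlock) {
                    // N pointers, N-1 keys: the Nth key is not stored in this block,
                    // it becomes the block's split key for the level above.
                    above.add(new Entry(writer.write(children, keys), e.key));
                    children = new ArrayList<Integer>();
                    keys = new ArrayList<long[]>();
                } else {
                    keys.add(e.key);
                }
            }
            if (!children.isEmpty()) {                           // trailing partial block
                long[] split = keys.remove(keys.size() - 1);     // last key is passed up, not stored
                above.add(new Entry(writer.write(children, keys), split));
            }
            return above;
        }

        static int buildToRoot(List<Entry> leafEntries, int pointersPerBlock, BranchWriter writer) {
            List<Entry> level = leafEntries;
            while (level.size() > 1)
                level = buildLevel(level, pointersPerBlock, writer);
            return level.get(0).blockId;                         // the new root block
        }
    }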

Consequences

The repacking algorithm produces B+Trees that are approaching half the size of the original trees. For a large dataset, that's a saving of several gigabytes.

The repacked trees perform much the same as trees formed by normal use, except in one case where they are faster. If the tree is small, so the majority fits in the RAM caches, then repacking means less RAM is used but the speed is much the same (in fact a few percent slower, hard to measure but less than 5%, presumably because there is a different ratio of tree descent to in-block binary search being done by the CPU. This may be no more than a RAM cache hierarchy effect).

However, if the tree was large and, once repacked, now fits mostly in memory, the repacked trees are faster. As the indexes for an RDF dataset grow much larger than the cacheable space, this effect slowly declines. Some figures to show this are in preparation.

The biggest benefit, however, is not directly the speed of access or the reduced disk space. It's the fact that there is now a fast, linearly scaling way to build a B+Tree from a stream of sorted records. It's much faster than simply using regular insertion into the B+Tree.

This is part of the new bulk loader for TDB. It uses external sorting to produce the input to index creation using this B+Tree repacking algorithm. The new bulk loader can save hours on large data loads.