Rationale

Initially, PyTables users want a simple way to keep their data structured, and they normally let PyTables find sensible values for the most critical I/O parameters. However, as users become more experienced, they may want to fine-tune some parameters in order to optimize I/O throughput. The most useful way to do this is to find the optimal chunkshape for their needs. This brief study tries to help users with that goal.

The setup

To give an idea of what users can expect from changing the chunkshape of a dataset (using the chunkshape parameter in the tables.create* constructors), I've created a script... that creates a 512x524288 CArray of Float64 elements, for a total size of 2 GB. The script creates this CArray with different chunkshapes, using the zlib compression library at different compression levels (0 for no compression, and 1, 3, 6 and 9). The experiments have also been run with the shuffle filter both activated and deactivated, to see its effect on performance.
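As a rough illustration of this setup (not the script used for the study), the sketch below creates a small Float64 CArray with an explicit chunkshape, a zlib compression level, and the shuffle filter switched on or off. It assumes the PyTables 3.x names (``open_file``, ``create_carray``); the helper ``create_carray``, the node name ``data``, the file name, and the small shape and chunkshape in the demo are all hypothetical choices made so the example runs quickly.

```python
import os
import tempfile

import numpy as np
import tables


def create_carray(path, shape, chunkshape, complevel, shuffle):
    """Create a Float64 CArray with the given chunkshape and zlib settings.

    Hypothetical helper for illustration; returns the chunkshape actually
    stored with the dataset.
    """
    # complevel=0 means no compression; 1-9 select the zlib level.
    filters = tables.Filters(complevel=complevel, complib="zlib",
                             shuffle=shuffle)
    with tables.open_file(path, mode="w") as h5file:
        carray = h5file.create_carray(
            h5file.root, "data",
            atom=tables.Float64Atom(),
            shape=shape,
            chunkshape=chunkshape,
            filters=filters,
        )
        # Fill row by row so the whole array never has to sit in memory.
        row = np.arange(shape[1], dtype="float64")
        for i in range(shape[0]):
            carray[i, :] = row
        return tuple(carray.chunkshape)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "chunk-demo.h5")
        # Small demo shape; the study itself used 512x524288.
        for shuffle in (True, False):
            for complevel in (0, 1, 3, 6, 9):
                create_carray(path, (8, 1024), (1, 256), complevel, shuffle)
                print(shuffle, complevel, os.path.getsize(path))
```

Timing each call to the helper, and repeating the loop over several chunkshapes, would give creation-time measurements along the lines of those described here.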

Then, a couple of measurements have been taken (after emptying the OS page cache). First, the CArray has been read completely in sequential mode. Then, the CArray has been accessed randomly in approximately 33000 places. The output of these measurements can be seen in ... (with shuffle active) and in ... (with shuffle inactive).

The experiments have been conducted on a machine with an AMD Opteron @ 2 GHz processor and a couple of SATA disks @ 7200 RPM in a RAID 0 configuration.

The results

First of all, in the figures that follow, the chunksize that appears on the X axis is the product of the elements of the chunkshape multiplied by the size of the datatype (in this case, 8 bytes).
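The relationship between chunkshape and chunksize can be written down as a tiny calculation. The helper name ``chunksize_bytes`` and the example chunkshape (1, 1024) are hypothetical, chosen only for illustration; the 8-byte item size corresponds to Float64.

```python
import math


def chunksize_bytes(chunkshape, itemsize=8):
    """Chunk size in bytes: product of the chunkshape times the item size."""
    return math.prod(chunkshape) * itemsize


# A hypothetical chunkshape of (1, 1024) Float64 elements:
print(chunksize_bytes((1, 1024)))      # 8192 bytes, i.e. 8 KB

# The same formula applied to the full 512x524288 array gives its total size:
print(chunksize_bytes((512, 524288)))  # 2147483648 bytes, i.e. 2 GB
```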

Let's have a look at the time to create the datasets::


inline:create-noshuffle-small.png

inline:create-shuffle-small.png


As you can see...

ChoosingChunksize (last edited 2008-04-21 11:12:45 by localhost)