Release notes for PyTables 2.1rc1 series

Author: Francesc Alted i Abad
Contact: faltet@pytables.com
Author: Ivan Vilata i Balaguer
Contact: ivan@selidor.net

Changes from 2.0.4 to 2.1

Main improvements

  • Nodes are now opened directly, i.e. without first populating all their parent groups. As a result, opening groups and leaves at already-known locations is much faster.
  • The creation of different nodes has been optimized too. For example, creating a new EArray/CArray can be 2x faster and creating a Table object up to 5x faster.
  • Added a PYTABLES_SYS_ATTRS parameter that switches the creation of PyTables system attributes in datasets on or off. With the attributes switched off, the resulting files are less PyTables-specific and datasets are created faster. Closes #190.
  • The LRU node cache can now be disabled by setting NODE_CACHE_SLOTS (in parameters.py) to 0; the same can be achieved through the NODE_CACHE_SLOTS parameter of the openFile() function. The value can also be negative, meaning that all the touched nodes are kept in an internal dictionary (which can take a large amount of memory for large hierarchies). See the updated "Getting the most from the node LRU cache" section in chapter 5 of the User's Guide for more information.
  • Any tunable parameter in tables/parameters.py (limits for warnings, cache sizes, buffer sizes, etc.) can now be passed as an argument to openFile(), so you can select a different parametrization for every file you open. A new appendix in the User's Guide lists the tunable parameters and explains the purpose of each.
  • The EArray.truncate() method has been generalized and implemented as Leaf.truncate(). Now, it is possible to truncate all enlargeable datasets (i.e. all except Array and CArray objects). Closes #174.
  • The restriction to scalar atoms in CArray and EArray objects has been removed. Table, CArray, EArray and VLArray objects now all fully support multidimensional atoms. This also expands the range of native HDF5 files supported. Closes #133.
  • After some exhaustive benchmarks, I've decided to reduce the number of slots in the node LRU cache to 64. The experiments show that this leads to better overall performance and to lower resource consumption.
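The three regimes described above for NODE_CACHE_SLOTS (zero, positive, negative) can be illustrated with a small pure-Python model. This is only a sketch of the described behaviour, not the PyTables implementation: the class NodeCacheSketch, its methods and the load() helper are hypothetical names made up for this example.

```python
from collections import OrderedDict


class NodeCacheSketch(object):
    """Toy model of a node cache (hypothetical, not PyTables code).

    slots == 0 : caching disabled, every access reloads the node
    slots  > 0 : LRU cache holding at most `slots` nodes
    slots  < 0 : every touched node is kept in a plain dictionary
    """

    def __init__(self, slots):
        self.slots = slots
        self._store = OrderedDict()

    def get(self, path, loader):
        if self.slots == 0:
            return loader(path)            # nothing is ever cached
        if path in self._store:
            node = self._store.pop(path)   # re-insert to mark as most recent
            self._store[path] = node
            return node
        node = loader(path)
        self._store[path] = node
        if self.slots > 0 and len(self._store) > self.slots:
            self._store.popitem(last=False)  # evict the least recently used
        return node

    def cached_paths(self):
        return list(self._store)


def load(path):
    # Stand-in for the (expensive) operation of opening a node.
    return "node at " + path


lru = NodeCacheSketch(2)
for p in ["/a", "/b", "/c"]:
    lru.get(p, load)
# Only the two most recently used paths remain cached.
```

With a negative slot count nothing is ever evicted, which is why memory use can grow with the size of the hierarchy.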

Main improvements (Pro edition)

  • New light indexes that can take up to 4x less space than 2.0 indexes, and more than 15x less space than indexes in traditional databases. Four levels of index "lightness" are available, namely ultralight, light, medium and full (the latter being the kind implemented in version 2.0), so that the user can choose the most appropriate one for her needs.

  • The index query code has been completely revamped and is now based on the concept of chunkmaps. This allows for a much more effective way to retrieve table data in queries that have low selectivity, while retaining good performance for high selectivity ones.

  • A new query optimizer that can use several indexes simultaneously in a broad range of complex queries. For example, in the query:

    (((c_int32 == 3) | (c_bool == True)) & (c_int32 == 5)) & (c_extra > 0)

    if the c_int32 and c_bool columns are indexed but c_extra is not, both the c_int32 and c_bool indexes will be used. This can greatly improve the response times of potentially complicated queries.

  • An additional optimization in the index creation process makes it possible to build completely sorted indexes (CSI). These not only give better query performance, but also allow creating tables that are completely sorted by a specific field.
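To make the query example above more concrete, the following pure-Python sketch evaluates the same condition on a few made-up rows. It splits the condition into a part that touches only the indexed columns (c_int32, c_bool) and a residual part on c_extra. That two-phase split is an assumption made for illustration only; these notes do not describe how the optimizer works internally, and no PyTables calls are used here. The Row type and the sample data are hypothetical.

```python
from collections import namedtuple

# Hypothetical rows with the three columns named in the query.
Row = namedtuple("Row", "c_int32 c_bool c_extra")


def indexed_part(row):
    # Sub-expression involving only the indexed columns.
    return ((row.c_int32 == 3) | (row.c_bool == True)) & (row.c_int32 == 5)


def residual_part(row):
    # Sub-expression on the non-indexed column.
    return row.c_extra > 0


def full_condition(row):
    # The condition exactly as written in the release note.
    return (((row.c_int32 == 3) | (row.c_bool == True))
            & (row.c_int32 == 5)) & (row.c_extra > 0)


rows = [Row(3, True, 1), Row(5, True, 1), Row(5, False, 1), Row(5, True, 0)]

# Phase 1 (assumed): narrow the candidates using indexed columns only.
candidates = [r for r in rows if indexed_part(r)]
# Phase 2 (assumed): check the remaining condition on the survivors.
selected = [r for r in candidates if residual_part(r)]
```

Both phases together select the same rows as the full condition; the point is simply that the indexed columns can reduce the work before c_extra is looked at.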

API additions from 2.0.4 to 2.1

  • The AttributeSet class has gained the following dictionary-like methods: __getitem__(), __setitem__() and __delitem__(), so that you can do things like:

    for name in node._v_attrs._f_list():
        print("name: %s, value: %s" % (name, node._v_attrs[name]))

  • New File.fileno() added. This returns the underlying OS file descriptor for the file. This is meant to allow File objects to better interact with the fcntl module.

  • A new chunkshape argument has been added to Leaf.copy() to set the chunkshape of the copy. It can also take the special values 'auto' (compute a sensible value) and 'keep' (keep the original value, which is the default).

  • Added a new '--chunkshape' flag to the ptrepack console command that corresponds to the new chunkshape argument of Leaf.copy().
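As a rough illustration of the kind of interaction with fcntl that File.fileno() is meant to enable, the sketch below takes an advisory lock on a file descriptor. It uses an ordinary temporary file as a stand-in for a PyTables File object, on the assumption that any object exposing fileno() can be used the same way. The helper with_exclusive_lock() is a hypothetical name, and the example is Unix-only because the fcntl module is.

```python
import fcntl
import tempfile


def with_exclusive_lock(fileobj, action):
    """Run action() while holding an exclusive advisory lock on fileobj.

    fileobj only needs a fileno() method returning an OS file descriptor.
    """
    fd = fileobj.fileno()
    fcntl.flock(fd, fcntl.LOCK_EX)       # block until the lock is acquired
    try:
        return action()
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)   # always release the lock


# An ordinary temporary file stands in for an open PyTables file here.
with tempfile.TemporaryFile() as stand_in:
    result = with_exclusive_lock(stand_in, lambda: "done")
```

Advisory locks only protect against other processes that also take the lock, so all cooperating programs would need to follow the same convention.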

API additions from 2.0.4 to 2.1 (Pro edition)

  • A new Table.itersorted() iterator iterates through a table following the order of a certain index. It supports iteration over ranges, including negative steps (i.e. reverse sorted order).
  • New Table.readSorted() method that reads a table following the order of a certain index. It supports reads over ranges, including negative steps (i.e. reverse sorted order).
  • New Table.colindexes property that returns a dictionary with the indexes of the indexed columns in the table.
  • A new sortby argument has been added to Table.copy(), allowing a table to be sorted during the copy operation.
  • Added a new propindexes argument in Table.copy(). If true, the indexes in the source table are propagated (created) to the new table. If false (the default), the indexes are not propagated.
  • New public Index.readSorted() and Index.readIndices() methods that allow direct access to the index data.
  • Added a new Column.createCSIndex() as a handy way to create a completely sorted index (CSI).
  • Added new '--sortby' (sort a table by a column key), '--forceCSI' (force the creation of a CSI index) and '--propindexes' (propagate the indexes in original tables) flags to the ptrepack utility.
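The idea behind iterating a table in index order, with ranges and negative steps, can be sketched in plain Python. This is not the Table.itersorted() implementation and does not use its signature: iter_sorted_sketch() is a hypothetical function, the rows are made-up dictionaries, and the range arguments are assumed to follow Python slice semantics.

```python
def iter_sorted_sketch(rows, column, start=None, stop=None, step=None):
    """Yield rows in the sorted order of `column` (illustrative only).

    The sorted list of row numbers plays the role of the index; the
    range is applied to that order, not to the on-disk row order.
    """
    order = sorted(range(len(rows)), key=lambda i: rows[i][column])
    for i in order[start:stop:step]:
        yield rows[i]


rows = [
    {"id": 0, "temp": 21.5},
    {"id": 1, "temp": 19.0},
    {"id": 2, "temp": 25.1},
    {"id": 3, "temp": 20.2},
]

ascending = [r["id"] for r in iter_sorted_sketch(rows, "temp")]
descending = [r["id"] for r in iter_sorted_sketch(rows, "temp", step=-1)]
two_coldest = [r["id"] for r in iter_sorted_sketch(rows, "temp", stop=2)]
```

Here the whole order is computed in memory on every call, whereas a real index stores it once on disk.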

Bug fixes and other small enhancements

  • In order to avoid a long-standing bug, all the possible 64-bit class attributes of leaf objects (like nrows, shape or nrow) have been converted into a new SizeType type (actually an alias for numpy.int64). This change should be backward compatible with existing programs, so no action should be needed to adapt to it. Fixes #118.
  • When no range is specified in ptrepack, all the elements of leaves are now copied. Previously, only the first row was copied, which was clearly wrong.
  • The Atom default value (Atom.dflt) is honored now when creating CArrays. Fixes #176.
  • When values are modified through iterators in tables with indexed columns, the columns are no longer re-indexed each time the I/O buffer fills up, but once the iterator has completed. This allows for far better performance in this scenario. Closes #141.
  • File.copyNode() can now copy complete hierarchies directly from the root. This can be useful when one wants to create a new file by merging the contents of others.

Backward incompatible API changes from 2.0.4 to 2.1

  • The semantics of Leaf.copy() have changed: previously the chunkshape of the destination was computed automatically ('auto'), whereas the default is now to keep the original value ('keep'). This behaviour is thought to better follow the principle of least surprise.
  • The trMap argument has been removed from the tables.openFile() function. The Node._v_hdf5name attribute has been removed as well. Fixes #117.
  • The sort parameter of Table.itersequence() has been removed because it could not sort sequences larger than memory. Moreover, it is not clear that the sorting operation would be an advantage in every situation.
  • Now, in multidimensional atoms, the Atom.dtype variable contains the shape of the type. This is found to be more consistent than the previous behaviour, where Atom.dtype was equivalent to current Atom.dtype.base.
  • The parameter nodeCacheSize in openFile() has been deprecated. Use NODE_CACHE_SLOTS instead (see above).

Backward incompatible API changes from 2.0.4 to 2.1 (Pro edition)

  • Column.createIndex() has gained a new parameter named kind, which is now the second one in the argument list. This is intentional and incompatible with the previous argument list, so code that passes more than one positional parameter in existing Column.createIndex() calls should be updated.
  • The Table.indexFilters property has been removed (after a period of DeprecationWarnings). If you want to change filters in indexes, please use the filters parameter of the Column.createIndex() method (and the like).
  • Table.willQueryUseIndexing() has changed its return value from a list to a frozen set of usable indexed columns.
  • The 'AUTO_INDEX' system attribute of the Index class is now copied only if the copyuserattrs argument of Table.copy() is true (the default).
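The change of return type in Table.willQueryUseIndexing() matters mostly for code that indexed into the old list. The sketch below uses a hand-built frozenset as a stand-in for the returned value (no PyTables call is made, and the column names are borrowed from the query example earlier in these notes) to show which idioms keep working. The helper names are hypothetical.

```python
# Stand-in for the value now returned by Table.willQueryUseIndexing().
usable = frozenset(["c_int32", "c_bool"])


def uses_index(cols, name):
    # Membership tests work on both lists and frozensets.
    return name in cols


def usable_sorted(cols):
    # A frozenset has no defined order; sort it when a stable order matters.
    return sorted(cols)


def first_usable_old_style(cols):
    # Positional access worked on the old list return value,
    # but raises TypeError on a frozenset.
    return cols[0]
```

Code that only iterates over the result or tests membership should need no change; code that relied on positions can convert the result with sorted() or list() first.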

Enjoy data!

—The PyTables Team

ReleaseNotes/Release 2.1rc1 (last edited 2008-10-31 09:12:58 by FrancescAlted)