Announcing PyTables 1.2
The PyTables development team is happy to announce the availability of a new major version of the PyTables package.
This version sports a completely new in-memory tree implementation based on a node cache system. This system loads nodes only when they are needed and unloads them when they are rarely used. The new feature allows HDF5 files with large hierarchies to be opened and created very quickly and with low memory consumption (the object tree is no longer completely loaded in memory), while retaining all the powerful browsing capabilities of the previous implementation of the object tree.
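To make the idea of loading nodes on demand and unloading rarely used ones more concrete, here is a minimal sketch in plain Python. It is not the PyTables implementation: the NodeCache class, the loader callable, the max_nodes limit and the least-recently-used eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class NodeCache:
    """Toy least-recently-used node cache (illustrative, not PyTables code)."""

    def __init__(self, loader, max_nodes=3):
        self._loader = loader        # hypothetical callable: path -> node object
        self._max_nodes = max_nodes  # hypothetical limit on cached nodes
        self._nodes = OrderedDict()  # path -> node, least recently used first

    def get(self, path):
        if path in self._nodes:
            # Cache hit: mark the node as most recently used.
            self._nodes.move_to_end(path)
            return self._nodes[path]
        # Cache miss: the node is loaded only now, when it is needed.
        node = self._loader(path)
        self._nodes[path] = node
        if len(self._nodes) > self._max_nodes:
            # Unload the least recently used node to bound memory use.
            self._nodes.popitem(last=False)
        return node

    def cached_paths(self):
        return list(self._nodes)


loaded = []  # records every real load, for illustration only


def load_node(path):
    loaded.append(path)
    return {"path": path}


cache = NodeCache(load_node, max_nodes=2)
cache.get("/a")
cache.get("/b")
cache.get("/a")  # hit: "/a" is not loaded again
cache.get("/c")  # miss: "/b", the least recently used node, is unloaded
```

With a bounded cache like this, memory use depends on the cache size rather than on the number of nodes in the file, which is the property described above.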
Improvements
The new node cache system lets you work with files containing a practically unlimited number of nodes while keeping memory consumption very low.
Added a new Row.update() call that lets you modify rows in the middle of iterator loops:

    for row in table.where(table.cols.col1 > 3):
        row['col1'] = row.nrow()
        row['col2'] = 'b'
        row['col3'] = 0.0
        row.update()

This is a much more efficient (and, we think, more convenient) way of updating tables than the Table.modifyRows() or Table.modifyColumns() methods.
Based on some experiments, the chunksize for Table and EArray objects has been tuned to allow faster retrieval of small data regions, while keeping pretty good speed for reading large data regions.
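The trade-off behind this tuning can be illustrated with some simple arithmetic. The sketch below assumes that whole chunks are the unit of I/O, so reading any row costs reading its entire chunk; the function names and the chunk sizes used (4096 and 256 rows) are hypothetical and are not the values chosen by PyTables.

```python
def chunks_touched(start, stop, chunk_rows):
    """Number of chunks overlapping the half-open row range [start, stop)."""
    if stop <= start:
        return 0
    first = start // chunk_rows
    last = (stop - 1) // chunk_rows
    return last - first + 1


def rows_read(start, stop, chunk_rows):
    """Rows actually fetched, assuming whole chunks are always read."""
    return chunks_touched(start, stop, chunk_rows) * chunk_rows


# Small region: 10 rows starting at row 1000.
small_big_chunks = rows_read(1000, 1010, 4096)
small_small_chunks = rows_read(1000, 1010, 256)

# Large region: the first million rows.
large_big_chunks = chunks_touched(0, 1000000, 4096)
large_small_chunks = chunks_touched(0, 1000000, 256)
```

Under these assumptions, smaller chunks fetch far fewer unneeded rows for a small region, while a large read has to visit many more chunks. A tuned chunksize is a compromise between the two.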
Better support for Numeric objects when reading Table columns (homogeneous data) in both Table.read() and Table.readCoordinates().
Added support for converting objects made of 64-bit integers to 64-bit Numeric (only available for 64-bit platforms).
Leaves are now kept open for their entire lifetime. This can lead to general speed-ups in the range of 10% when dealing with nodes, and up to 2x in special cases such as accessing datasets element by element, i.e. in situations like:

    tarr = tables.openFile("array.h5").root.tarr
    l = [tarr[idx[i]] for i in xrange(tarr.nrows)]

Appends to tables of the form:

    for i in xrange(1000000):
        row['field1'] = ...
        ...
        row['fieldN'] = ...
        row.append()

have been optimized and, as a result, output is now between 4x and 8x faster (depending on your tables) than in PyTables 1.1.1. With this, in some scenarios, you can even reach the total throughput of your system.
General overhaul of C functions in order to remove the unused ones.
Bug fixes
- The default values for tables are persistent now and kept even between different PyTables sessions.
- Fixed a bug in the Array iterator that made it crash after opening a file with an existing Array.
- When using Row.append() to enlarge an existing table, any columns that were not set before the call were filled with junk data. Now, until the defaults are saved permanently on disk in a future release, they are initialized to reasonable values (0 for numerical columns and empty strings for character columns).
- When creating a log for actions, the first mark name was corrupted (it contained dirty data instead of the empty string) due to a bug in numarray. A workaround has been added to prevent this from happening.
- In indexing scenarios where no values satisfied the conditions in the indexed region but some values satisfied them in the non-indexed region, the latter values were not returned by Table.where(). This has been fixed.
- Enumerated types are now properly converted between big-endian and little-endian platforms.
- Fixed duplicated column names in Description._v_pathnames.
- Tables can be moved and removed without problems now. More unit tests have been added to check this.
- Conversions from empty numarray.CharArray strings to lists failed because of a problem in numarray. This has been worked around in PyTables until the patch reaches public versions of numarray.
- Fixed a serious memory leak in table I/O buffers that was relevant if you filled or read many tables in the same PyTables session. Also, please note that to help PyTables keep memory requirements low, you should call Table.flush() after your writing loops have finished.
- Fixed a couple of memory leaks during the opening of nodes. You should not have noticed them unless you had to deal with tons of nodes in the same file or re-opened the file many times.
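The Table.where() fix in the list above concerns queries on a table whose index covers only part of its rows. The following plain-Python sketch shows the intended behaviour under simplifying assumptions: the where_greater function, the (value, row) sorted index and the indexed_rows boundary are hypothetical and do not reflect how PyTables indexes are actually stored or searched.

```python
from bisect import bisect_right


def where_greater(values, sorted_index, indexed_rows, threshold):
    """Return the row numbers whose value is greater than threshold.

    sorted_index holds (value, row) pairs for rows [0, indexed_rows),
    sorted by value; rows from indexed_rows onwards are not indexed.
    """
    # Indexed region: binary search for the first value above the threshold.
    keys = [value for value, _ in sorted_index]
    pos = bisect_right(keys, threshold)
    hits = [row for _, row in sorted_index[pos:]]

    # Non-indexed region: always scanned linearly, even when the indexed
    # region produced no hits at all.
    for row in range(indexed_rows, len(values)):
        if values[row] > threshold:
            hits.append(row)
    return sorted(hits)


values = [1, 2, 3, 10, 20]  # rows 0-2 are indexed, rows 3-4 are not
indexed_rows = 3
sorted_index = sorted((v, r) for r, v in enumerate(values[:indexed_rows]))

only_tail = where_greater(values, sorted_index, indexed_rows, 5)
both_regions = where_greater(values, sorted_index, indexed_rows, 1)
```

The first query matches nothing in the indexed region, yet it must still return the matching rows from the non-indexed tail, which is the case the bug fix addresses.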
Known issues
- Time datatypes are non-portable between big-endian and little-endian architectures. This is ultimately a consequence of an HDF5 limitation. See SF bug #1234709 for more info.
Enjoy!
