Contents
- General questions
  - What is PyTables?
  - Which are PyTables' licensing terms?
  - I'm having problems. How can I get support?
  - Is there commercial support available?
  - Why HDF5?
  - Why Python?
  - Why NumPy?
  - Where can PyTables be applied?
  - Is PyTables safe?
  - Can PyTables be used in concurrent access scenarios?
  - What kind of containers does PyTables implement?
  - Cool! I'd like to see some examples of use...
  - Can you show me some screenshots?
  - Is PyTables a replacement for a relational database?
  - How can PyTables be fast if it is written in an interpreted language like Python?
  - If it is designed to deal with very large datasets, then PyTables should consume a lot of memory, shouldn't it?
  - Why was PyTables born?
  - What is PyTables Pro and how is it related with PyTables?
  - Why have you split PyTables 2.x in Std and Pro versions?
  - Does PyTables have a client-server interface?
  - I've found a bug. What do I do?
  - Is it possible to get involved in PyTables development?
  - How can I cite PyTables?
- PyTables 2.0 issues
  - I'm having problems migrating my apps from PyTables 1.0 into PyTables 2.0. Please, help!
  - For combined searches like `table.where('(x<5) & (x>3)')`, why was a `&` operator chosen instead of an `and`?
  - I can not select rows using in-kernel queries with a condition that involves an UInt64Col. Why?
  - I'm already using PyTables 2.0 but I'm still getting numarray objects instead of NumPy ones!
- Installation issues
- Testing issues
General questions
What is PyTables?
PyTables is a package for managing hierarchical datasets designed to efficiently cope with extremely large amounts of data.
It is built on top of the HDF5 library, the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code, makes it a fast, yet extremely easy to use tool for interactively saving and retrieving very large amounts of data.
Which are PyTables' licensing terms?
PyTables is free for both commercial and non-commercial use, under the terms of the BSD license. There is also a commercial version of PyTables, dubbed PyTables Pro, which is subject to its own license (see a quick review of it).
I'm having problems. How can I get support?
The most common and efficient way is to subscribe to the PyTables users' mailing list (remember that you need to subscribe before sending messages) and send a brief description of your issue there, together with, if possible, a short script that reproduces it. Hopefully, someone on the list will be able to help you. It is also a good idea to check the archives of the users' list (you may want to check the Gmane archives instead) to see whether your question has already been dealt with.
Is there commercial support available?
Yes. Francesc Alted offers commercial support for PyTables. Please contact him at faltet@pytables.com for more information.
Why HDF5?
HDF5 is the underlying C library and file format that enables PyTables to deal with the data efficiently. It was chosen for the following reasons:
- Thought out for managing very large datasets in an efficient way.
- Lets you organize datasets hierarchically.
- Very flexible and well tested in scientific environments.
- Good maintenance and improvement rate.
- Technical excellence (R&D 100 Award).
- It's Open Source software.
Why Python?
- Python is interactive.
People familiar with data processing understand how powerful command line interfaces are for exploring mathematical relationships and scientific data sets. Python provides an interactive environment with the added benefit of a full-featured programming language behind it.
- Python is productive for beginners and experts alike.
PyTables is targeted at engineers, scientists, system analysts, financial analysts, and others who consider programming a necessary evil. Any time spent learning a language or tracking down bugs is time spent not solving their real problem. Python has a short learning curve and most people can do real and useful work with it in a day of learning. Its clean syntax and interactive nature facilitate this.
- Python is data-handling friendly.
Python comes with nice idioms that make access to data much easier: general slicing (i.e. data[start:stop:step]), list comprehensions, iterators, generators... are constructs that make the interaction with your data very easy.
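As a minimal sketch of the idioms just mentioned, the snippet below uses only plain Python lists, not PyTables objects. The names `data` and `running_total` are made up for illustration.

```python
from itertools import islice

data = list(range(10))

# General slicing: data[start:stop:step]
every_other = data[1:8:2]

# List comprehension: build a new list from filtered, transformed items
squares_of_even = [x * x for x in data if x % 2 == 0]

# Generator: values are produced lazily, one at a time
def running_total(values):
    total = 0
    for value in values:
        total += value
        yield total

# Iterators can be consumed partially; only three totals are computed here
first_three_totals = list(islice(running_total(data), 3))
```

Because the generator is lazy, only as many items as are actually requested get computed, which is what makes this style attractive for large datasets.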
Why NumPy?
NumPy is a Python package for dealing efficiently with large in-memory datasets, providing containers for homogeneous data, heterogeneous data and string arrays. PyTables uses these NumPy containers as in-memory buffers so as to push the I/O bandwidth towards the platform limits.
Where can PyTables be applied?
In all the scenarios where one needs to deal with large datasets:
- Industrial applications
  - Data acquisition in real time
  - Quality control
  - Fast data processing
- Scientific applications
  - Meteorology, oceanography
  - Numerical simulations
  - Medicine (biological sensors, general data gathering & processing)
- Information systems
  - System log monitoring & consolidation
  - Tracing of routing data
  - Alert systems in security
Is PyTables safe?
Well, first of all, let me state that PyTables does not support transactional features yet (we don't even know if we will ever be motivated to implement this!), so there is always the risk that you can lose your data if an unexpected event occurs while writing (a power outage, a system shutdown...). Having said that, if your typical scenario is write once, read many, then the use of PyTables is perfectly safe, even when dealing with extremely large amounts of data. Check out the StressTests experiments that have been conducted in order to ensure this.
Can PyTables be used in concurrent access scenarios?
It depends. If your concurrent access is read-only, then there is no problem at all. However, as soon as a process (or thread) tries to write, problems will start to appear. First, PyTables doesn't support locking at any level, so several processes writing concurrently to the same PyTables file will probably end up corrupting it, so don't do this! On the other hand, having only one process writing and the others reading is fine, except that the reading processes might not be aware of the latest updates to the data, because of the several levels of cache implemented in both HDF5 and PyTables itself.
What kind of containers does PyTables implement?
PyTables supports a series of data containers that address specific needs of the user. Below is a brief description of them:
- Table
  Lets you deal with heterogeneous datasets. Allows compression. Enlargeable. Supports nested types. Good performance for reading and writing data.
- Array
  Provides quick and dirty array handling. No compression allowed. Not enlargeable. Can be used only with relatively small datasets (i.e. those that fit in memory). It provides the fastest I/O speed.
- CArray
  Provides compressed array support. Not enlargeable. Good speed at reading, but rather slow at writing.
- EArray
  Most general array support. Compressible and enlargeable. It is quite fast at extending, and pretty good at reading.
- VLArray
  Supports collections of homogeneous data with a variable number of entries. Compressible and enlargeable. I/O is not very fast.
- Group
  The structural component.
Please refer to the documentation for more specific information.
Cool! I'd like to see some examples of use...
Sure. Go to the HowToUse section to find simple examples that will help you get started.
Can you show me some screenshots?
Well, PyTables is not a graphical library by itself. However, you may want to check out ViTables, a GUI tool to browse and edit PyTables & HDF5 files.
Is PyTables a replacement for a relational database?
No, by no means. PyTables lacks many features that are standard in most relational databases. In particular, it does not support relationships between datasets (beyond the hierarchical one, of course) and it does not have transactional features. PyTables is more focused on speed and on dealing with really large datasets than on implementing the above features. In that sense, PyTables is best viewed as a teammate of a relational database.
For example, if you have very large tables in your existing relational database, they will take lots of space on disk, potentially reducing the performance of the relational engine. In such a case, you can move those huge tables out of your existing relational database to PyTables, and let your relational engine do what it does best (i.e. manage relatively small or medium datasets with potentially complex relationships), and use PyTables for what it has been designed for (i.e. manage large amounts of data which are loosely related).
How can PyTables be fast if it is written in an interpreted language like Python?
Actually, all of the critical I/O code in PyTables is a thin layer of code on top of HDF5, which is a very efficient C library. Pyrex is used as the glue language to generate "wrappers" around HDF5 calls so that they can be used in Python. Also, the use of an efficient numerical package such as NumPy makes the most costly operations effectively run at C speed. Finally, time-critical loops are usually implemented in Pyrex (which, if used properly, can generate code that runs at almost pure C speed).
If it is designed to deal with very large datasets, then PyTables should consume a lot of memory, shouldn't it?
Well, you already know that PyTables sits on top of HDF5, Python and NumPy, and if we add its own logic (~7500 lines of code in Python, ~3000 in Pyrex and ~4000 in C), then we have to conclude that PyTables is not exactly a paradigm of lightness.
Having said that, PyTables (like HDF5 itself) tries very hard to optimize memory consumption by implementing a series of features such as dynamic determination of buffer sizes, a Least Recently Used (LRU) cache for keeping unused nodes out of memory, and extensive use of compact NumPy data containers. Moreover, PyTables is in a relatively mature state and most memory leaks have already been addressed and fixed.
Just to give you an idea of what you can expect, a PyTables program can deal with a table with around 30 columns and 1 million entries using as little as 13 MB of memory (on a 32-bit platform). All in all, it is not that much, is it?
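To illustrate the idea behind a Least Recently Used cache, here is a minimal sketch built on the standard library's `OrderedDict`. It is not the PyTables implementation; the `LRUCache` class, its `capacity` parameter and the node paths are hypothetical names chosen for the example.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: keeps at most `capacity` items in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def __contains__(self, key):
        return key in self._items

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Accessing an item marks it as the most recently used one
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Drop the least recently used item once the cache is over capacity
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)

cache = LRUCache(capacity=2)
cache.put("/group/table1", "node1")
cache.put("/group/table2", "node2")
cache.get("/group/table1")            # table1 becomes the most recently used
cache.put("/group/array1", "node3")   # table2 is evicted to make room
```

The point of such a cache is that memory use is bounded by `capacity` rather than by the number of nodes in the file.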
Why was PyTables born?
Because, back in August 2002, one of its authors (Francesc Alted) needed to save lots of hierarchical data efficiently for later post-processing. After trying out several approaches, he found that each of them had its own drawbacks. For example, working with file sizes larger than, say, 100 MB was rather painful with ZODB (it took lots of memory).
The netCDF3 interface provided by Scientific Python was great, but it did not allow the data to be structured hierarchically; besides, netCDF3 only supports homogeneous datasets, not heterogeneous ones (i.e. tables). (As an aside, netCDF4, due in the second half of 2007, will overcome many of the limitations of netCDF3, although curiously enough, it will be based on top of HDF5, the library chosen as the base for PyTables from the very beginning.)
So, he decided to give HDF5 a try and started writing his own wrappers for it, and voilà, this is how the first public release of PyTables (0.1) saw the light in October 2002, three months after his itch started to eat him ;-).
What is PyTables Pro and how is it related with PyTables?
PyTables Pro is an enhanced version of PyTables that implements, among other things:
- Extremely fast selection speed through an implementation of OPSI, an innovative indexing technology. With it, it is possible to complete a query on a table on the order of ten thousand million (10,000,000,000) rows in typically a few hundredths of a second (i.e. a little more than the time that the underlying disk takes to do one single seek).
- Improved cache implementation for both metadata (nodes) and data that lets you achieve maximum speed for intensive object tree browsing and data reads or selections.
- All-in-one installers for Windows, for quick deployment in a Windows-based company or institution. Although all-in-one installers for other platforms are not available, PyTables Pro can still be installed quickly by using distutils.
Although PyTables Pro is based on plain PyTables, it is a commercial product. Nevertheless, PyTables will continue to be free and Francesc Alted is committed to continue developing it in the same way as always.
Find more information about PyTables Pro on its web page.
Why have you split PyTables 2.x in Std and Pro versions?
Plainly said, because free software needs resources to subsist. Here is a more detailed rationale:
1. During the struggle to improve PyTables, I, Francesc Alted (with some help from Ivan Vilata, most especially in the development of the query interpreter), have developed new and more powerful features in order to optimize its behaviour in especially demanding situations. The license-based model of PyTables Pro has the ultimate goal of providing resources to keep the development of PyTables (as well as other possible parallel projects) going.
2. It has not been easy for me to determine which features should go to the free version and which ones to the commercial version. Finally, I have decided that the commercial version will receive mainly optimizations (offering indexation and an optimized LRU cache for data and metadata) that only affect users who make intensive use of data lookups. Besides, the commercial version will be delivered with professional installers (for some operating systems, see above). Apart from that, the functionality is exactly the same in both versions.
It is important to stress that, in the Pro version, users will continue to have access to the source code; this may be mostly useful when doing in-house installations based on Distutils, Setuptools or other tools, but also for modifying the code for their own purposes (provided that they don't redistribute the code).
Does PyTables have a client-server interface?
Not by itself, but you may be interested in using PyTables through pydap, a Python implementation of the OPeNDAP protocol. Have a look at the PyTables plugin of pydap. PyTables also comes with its own plug-in for the DAP protocol (see the NetCDF section in the manual).
I've found a bug. What do I do?
The PyTables development team works hard to make this eventuality as rare as possible, but, as in any software made by human beings, bugs do occur. If you find any bug, please tell us by creating a ticket in the tracker.
Is it possible to get involved in PyTables development?
Indeed. We are keen for more people to help out by contributing code, unit tests and documentation, and by helping to maintain this wiki. Drop us a mail on the users' mailing list and tell us which area you want to work in.
How can I cite PyTables?
The recommended way to cite PyTables in a paper or a presentation is as follows:
- Author: Francesc Alted, Ivan Vilata, Scott Prater, Vicent Mas, Tom Hedley, Antonio Valentino, Jeffrey Whitaker and others
- Title: PyTables: Hierarchical Datasets in Python
- Year: 2002-
Here's an example of a BibTeX entry:
@Misc{,
author = {Francesc Alted and Ivan Vilata and Scott Prater and Vicent Mas and Tom Hedley and Antonio Valentino and Jeffrey Whitaker and others},
title = {{PyTables}: Hierarchical Datasets in {Python}},
year = {2002--},
url = "http://www.pytables.org/"
}
PyTables 2.0 issues
I'm having problems migrating my apps from PyTables 1.0 into PyTables 2.0. Please, help!
Sure. However, you should first check out the Migrating from PyTables 1.x to 2.x document. It should provide answers to the most frequently asked questions in this regard.
For combined searches like `table.where('(x<5) & (x>3)')`, why was a `&` operator chosen instead of an `and`?
Search expressions are in fact Python expressions written as strings, and they are evaluated as such. This has the advantage that there is no new syntax to learn, but it also implies some limitations with the logical 'and' and 'or' operators, namely that they cannot be overloaded in Python. Thus, it is currently impossible to get an element-wise operation out of an expression like 'array1 and array2'. That is why some other operator had to be chosen, and '&' and '|' are the most similar to their C counterparts '&&' and '||', which aren't available in Python either.
You should be careful about expressions like 'x<5 & x>3' and others like '3 < x < 5', which won't work as expected because of the different operator precedence and the absence of an overloaded logical 'and' operator. In Python, '&' binds more tightly than the comparison operators, so 'x<5 & x>3' is parsed as 'x < (5 & x) > 3'; always put parentheses around each comparison. More on this in the appendix about condition syntax of the manual.
Quite a few packages are affected by these limitations, including NumPy itself and SQLObject, and there have been rather long discussions about adding the possibility of overloading logical operators to Python (see PEP 335 and this thread for more details).
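The precedence pitfall can be reproduced in plain Python, without PyTables or Numexpr. In the sketch below, `Column` is a hypothetical toy class that overloads the comparison operators and '&' element-wise; it only mimics the general idea and is not how PyTables evaluates conditions.

```python
class Column:
    """Toy array-like column with element-wise operators (illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def __lt__(self, other):
        return Column(v < other for v in self.values)

    def __gt__(self, other):
        return Column(v > other for v in self.values)

    def __and__(self, other):
        return Column(a and b for a, b in zip(self.values, other.values))

x = Column([2, 4, 6])

# Parenthesized comparisons combined with '&' give the element-wise result
mask = (x < 5) & (x > 3)

# '3 < x < 5' means '(3 < x) and (x < 5)'. Since 'and' cannot be overloaded,
# Python just tests the truth value of the first operand and returns the
# second one, so the '3 < x' part is silently lost.
chained = 3 < x < 5

# With plain integers, missing parentheses change the meaning too
n = 4
with_parens = (n < 5) & (n > 3)    # True & True
without_parens = n < 5 & n > 3     # parsed as n < (5 & n) > 3, i.e. 4 < 4 > 3
```

In this toy, `mask.values` selects only the middle element, whereas `chained.values` also selects the first one because only the 'x < 5' half of the condition survives.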
I can not select rows using in-kernel queries with a condition that involves an UInt64Col. Why?
This turns out to be a limitation of the Numexpr package. Internally, Numexpr uses a limited set of types for doing calculations, and unsigned integers are always upcast to the next larger signed integer type that can hold their values. The problem here is that there is no (standard) signed integer type that can hold every value of a 64-bit unsigned integer.
So, your best bet right now is to avoid uint64 types if you can. If you absolutely need uint64, the only way to do selections on such a column is to use regular Python selections. For example, if your table has a colM column which is declared as a UInt64Col, then you can still filter on its values with:
[row['colN'] for row in table if row['colM'] < X]
However, this approach will generally be slow (especially on Win32 platforms, where the values will be converted to Python long values).
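The arithmetic behind this limitation can be checked in plain Python. The sketch below does not use Numexpr or PyTables: `smallest_signed_bits` is a hypothetical helper, and `rows` is a list of dictionaries standing in for the rows of a table.

```python
def smallest_signed_bits(unsigned_bits):
    """Width of the smallest standard signed integer that can hold every
    value of an unsigned integer of the given width, or None if none can."""
    largest_unsigned = 2 ** unsigned_bits - 1
    for signed_bits in (8, 16, 32, 64):
        largest_signed = 2 ** (signed_bits - 1) - 1
        if largest_unsigned <= largest_signed:
            return signed_bits
    return None

# uint8 fits in int16, uint16 in int32, uint32 in int64, uint64 in nothing
upcasts = {bits: smallest_signed_bits(bits) for bits in (8, 16, 32, 64)}

# Python integers have arbitrary precision, so a Python-side filter works
# even for values above the int64 range
rows = [
    {"colM": 2 ** 64 - 1, "colN": "a"},
    {"colM": 10, "colN": "b"},
]
X = 100
selected = [row["colN"] for row in rows if row["colM"] < X]
```

The last line has the same shape as the list comprehension shown above, only with a plain list in place of the table.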
I'm already using PyTables 2.0 but I'm still getting numarray objects instead of NumPy ones!
This is most probably because you are using a file created with the PyTables 1.x series. By default, PyTables 1.x set an HDF5 attribute FLAVOR with the value 'numarray' on all leaves. Now, PyTables 2.x sees this attribute and obediently converts the internal object (truly a NumPy object) into a numarray one. For PyTables 2.x files the FLAVOR attribute will only be saved when explicitly set via the leaf.flavor property (or when passing data to an Array or Table at creation time), so you will be able to distinguish default flavors from user-set ones by checking the existence of the FLAVOR attribute.
Meanwhile, if you don't want to receive numarray objects when reading old files, you have several possibilities:
- Remove the flavor for your datasets by hand:
  for leaf in h5file.walkNodes(classname='Leaf'):
      del leaf.flavor
- Use the ptrepack utility with the flag --upgrade-flavors so as to convert all flavors in old files to the default (effectively by removing the FLAVOR attribute).
- Remove the numarray (and/or Numeric) package from your system. Then PyTables 2.0 will return you pure NumPy objects (it can't be otherwise!).
Installation issues
[Windows] Error when importing tables
You have used the binary installer for Windows and, when importing the tables package, you are getting an error like:
The command in "0x6714a822" refers to memory in "0x012011a0". The procedure "written" could not be executed. Click to ok to terminate. Click to abort to debug the program.
This problem can have several causes, but the most probable one is that a DLL needed by PyTables is installed on your system in the wrong version. Please double-check the versions of the libraries required by PyTables and install newer versions if needed. In most cases, this solves the issue.
If the problem persists, note that other programs sometimes install libraries in the PATH that are optional for PyTables (for example BZIP2 or LZO1) but that will be used if they are found on your system (i.e. anywhere in your PATH). So, if you find any of these libraries in your PATH, upgrade it to the latest version available (you don't need to re-install PyTables).
Testing issues
Tests fail when running from IPython
You may be getting errors related to Doctest when running the test suite from IPython. This is a known limitation in IPython (see http://lists.ipython.scipy.org/pipermail/ipython-dev/2007-April/002859.html). Try running the test suite from the vanilla Python interpreter.
Tests fail when running from Python 2.5 and Numeric is installed
Numeric doesn't get along well with Python 2.5, even on 32-bit platforms. This is a consequence of Numeric no longer being maintained, and you should consider migrating to NumPy as soon as possible. To get rid of these errors, just uninstall Numeric.
