##master-page:PyTables
#acl ScottPrater:read,write All:read
#format wiki
#language en
#pragma section-numbers off

{{{#!sidebar
=== News ===
 * '''!PyTables 2.2 (final) released''' (2010-07-01) See info in the [[ReleaseNotes/Release_2.2|Release Notes]]. [[http://www.pytables.org/download/stable|Download it]].
 * '''[[http://pytables.org/EuroSciPy2010/|Advanced tutorial on PyTables]]''' (2010-06-25) At the next [[http://www.euroscipy.org/conference/euroscipy2010|EuroSciPy 2010 Conference]].
 * '''!PyTables 2.2rc2 released''' (2010-06-17) See info in the [[ReleaseNotes/Release_2.2rc2|Release Notes]]. [[http://www.pytables.org/download/preliminary/|Download it]].
}}}

{{{#!figure
#class right
[[http://www.pytables.org/images/objecttree.png|{{attachment:objecttree-small.png|Example of hierarchically structured datasets.}}]]
}}}

= What is PyTables? =

!PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. You can [[Downloads|download]] !PyTables and use it for [[FAQ#head-ea73d26da96cdf8ee837d8eda1517693daaf177b| free]]. You can access documentation, some online examples and presentations in the HowToUse section.

!PyTables is built on top of the [[FAQ#head-b32537aba805dac2a1bf9cd6606c4fddcd964f96| HDF5 library]], using the [[FAQ#head-58d78c6c730651727e1a30f113f9509465fed1e3| Python]] language and the [[http://www.pytables.org/moin/FAQ#head-94e41dd2d37c1c484590a6c8bf662aa32feb2f2b|NumPy]] package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using [[http://www.cython.org/|Cython]]), makes it a fast yet extremely easy-to-use tool for interactively browsing, processing and searching very large amounts of data.
One important feature of !PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than with other solutions such as relational or object-oriented databases.

You can have a look at the MainFeatures of !PyTables. Also, find more info by reading the [[FAQ|PyTables FAQ]].

!PyTables is developed, maintained and supported by [[FrancescAlted|Francesc Alted]], with contributions from [[IvanVilata|Ivan Vilata]] and the community.

{{{#!figure
#class right
[[http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks|{{attachment:decompr-6t-wingw64-mini.png|Decompressing faster than memcpy() speed}}]]
}}}

= Strong foundations for solid performance =

Besides making use of de facto standard packages for handling large datasets (!NumPy for in-memory and HDF5 for on-disk ones), !PyTables leverages additional libraries for performing internal computations.

The first one is [[http://code.google.com/p/numexpr/|Numexpr]], a just-in-time compiler that is able to evaluate expressions in a way that both optimizes CPU usage and avoids in-memory temporaries. Optionally, Numexpr can make use of the highly optimized [[http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm|Intel's Vector Mathematical Library]] to accelerate the evaluation of transcendental functions (such as `sin()`, `cos()`, `sinh()`, `exp()`, `log()`...) as much as possible.

The other pillar for improving performance in !PyTables is [[http://blosc.pytables.org|Blosc]], a compressor designed to transmit data from memory to cache (and back) at very high speeds. It does so by using the full capabilities of modern CPUs, including their SIMD instruction sets (SSE2 or higher), on any number of available cores.

As you may have noticed, !PyTables takes every measure to reduce memory and disk usage during its operation.
This not only allows larger datasets to be handled on the same hardware, but also actually accelerates I/O operations, the most frequent source of bottlenecks in today's systems (see this [[http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf|article]] on why this is so).

{{{#!figure
#class right
[[attachment:non-indexed.png|{{attachment:non-indexed-mini.png|Query speed comparison (non-indexed)}}]]
}}}

= Querying your data in many different ways, fast =

One characteristic that sets !PyTables apart from similar tools is its capability to perform extremely fast queries on your tables, in order to facilitate as much as possible your main goal: getting important information ''out'' of your datasets.

!PyTables achieves this via a very flexible and efficient '''query iterator''', named [[http://www.pytables.org/docs/manual/ch04.html#TableMethods_querying|Table.where()]]. This iterator, in combination with the efficiency of underlying tools like !NumPy, HDF5, Numexpr and Blosc, makes it one of the fastest and most powerful query engines available.

Just to whet your appetite, you can quickly sum a field over the rows of a table that fulfill a given set of conditions:

{{{
# Sum the 'energy' field over the rows that satisfy both conditions
total_energy = sum(row['energy'] for row in mytable.where('(pressure > 10) & (ADCcount < 1e7)'))
}}}

where a generator expression has been used. Or perform queries on several tables at once (table joining):

{{{
# Join table1 and table2 on matching 'temperature' values
for r1 in table1:
    for r2 in table2.where("temperature == %d" % r1['temperature']):
        results[r1['energy']] += r1['ADCcount1'] + r2['ADCcount2']
}}}

{{{#!figure
#class right
[[PyTablesPro|{{attachment:1g-Q7-zlib1-lzo1-blosc5-indexed-mini.png|OPSI/Numexpr/Blosc combined performance}}]]
}}}

Of course, this only scratches the surface of the possibilities that the [[http://www.pytables.org/docs/manual/ch04.html#TableMethods_querying|Table.where()]] iterator enables.
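
For readers who want to try such a query end to end, the following is a minimal sketch rather than an official recipe. It assumes the !PyTables 3.x function names (`open_file`, `create_table`), which differ from the 2.x spellings current when this page was written. The `Reading` description, the file name and the sample values are all invented for illustration.

```python
import os
import tempfile

import tables


# Hypothetical table layout, invented for this example
class Reading(tables.IsDescription):
    energy = tables.Float64Col()
    pressure = tables.Float64Col()
    ADCcount = tables.Int32Col()


# Write the demo file into a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "where_demo.h5")

with tables.open_file(path, mode="w") as h5:
    table = h5.create_table(h5.root, "readings", Reading)

    # Fill the table with 100 synthetic rows
    row = table.row
    for i in range(100):
        row["energy"] = float(i)
        row["pressure"] = float(i % 20)
        row["ADCcount"] = i * 1000
        row.append()
    table.flush()

    # The condition string is evaluated inside the query iterator;
    # only matching rows are handed back to Python.
    condition = "(pressure > 10) & (ADCcount < 50000)"
    total_energy = sum(r["energy"] for r in table.where(condition))

print(total_energy)
```

The condition is passed as a string so that it can be evaluated in compiled code over whole chunks of rows, instead of testing each row in the Python interpreter.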
Given the high degree of flexibility that Python allows for working with iterators and generators, the only limit for composing new and useful queries is your imagination.

In case you are a seasoned user of SQL in relational databases, you may be interested in reading [[HintsForSQLUsers]], a gentle introduction and cookbook to !PyTables based on the concepts of SQL and RDBMS.

Last but not least, if you require optimal query speeds no matter how large your tables are, you should have a look at [[http://www.pytables.org/docs/OPSI-indexes.pdf|OPSI]], the powerful indexing engine that comes with PyTablesPro. OPSI, in combination with HDF5, !NumPy, Numexpr and Blosc, makes '''PyTables Pro''' one of the fastest solutions around for dealing with read-only or append-only datasets.

= Using PyTables as a Computing Kernel =

After looking at all the weaponry implemented with the main goal of allowing very fast queries, the !PyTables developers realized that the same techniques could be used to accelerate algebraic operations on potentially large vectors and arrays. The [[http://www.pytables.org/docs/manual/ch04.html#ExprClassDescr|tables.Expr]] class, integrated in !PyTables, implements all this machinery in order to allow efficient vector/array operations, not only for disk-based operations but also for memory-based ones.

{{{#!figure
#class right
[[ComputingKernel|{{attachment:poly-mini.png|PyTables as a computing kernel}}]]
}}}

`tables.Expr` typically outperforms the `memmap` class available in !NumPy, which is another solution for out-of-core computations. What's more, even when evaluating complex expressions for in-memory datasets, the `tables.Expr` class can be faster than !NumPy itself. This is a great achievement because, unlike `tables.Expr`, !NumPy uses an in-core paradigm for performing computations.
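
As a rough sketch of the out-of-core usage described above (not a benchmark, and not necessarily how the plots on this page were produced), the snippet below evaluates a small expression over two disk-based arrays and writes the result to a third one. It assumes the !PyTables 3.x function names (`open_file`, `create_array`, `create_carray`); the expression, array names and sizes are invented for illustration.

```python
import os
import tempfile

import numpy as np
import tables

# Small in-memory inputs, used only to populate the file and check the result
a_mem = np.arange(1000, dtype="float64")
b_mem = np.linspace(0.0, 1.0, 1000)

path = os.path.join(tempfile.mkdtemp(), "expr_demo.h5")

with tables.open_file(path, mode="w") as h5:
    # Operands live on disk as HDF5 arrays
    a = h5.create_array(h5.root, "a", a_mem)
    b = h5.create_array(h5.root, "b", b_mem)

    # Disk-based container that receives the outcome
    out = h5.create_carray(
        h5.root, "out", atom=tables.Float64Atom(), shape=a_mem.shape
    )

    # Map the names used in the expression to the disk-based operands
    expr = tables.Expr("3*a + b**2", uservars={"a": a, "b": b})
    expr.set_output(out)
    expr.eval()

    # Read the result back into memory for inspection
    result = out[:]

print(result[:5])
```

Because both the operands and the output are disk-based, the whole computation can proceed block by block without holding full-size temporaries in memory.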
For example, when evaluating the following polynomial in Python space:

{{{
y = .25*x**3 + .75*x**2 - 1.5*x - 2
}}}

where `x` is a vector with, say, 10 million elements, the plot shows how `tables.Expr` beats both `numpy.memmap` and pure `numpy` in speed and in disk/memory consumption, most notably when Blosc is used.

Also, if you are going to use transcendental (trigonometric, exponential, logarithmic...) functions, you can optionally make use of [[http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm|Intel's Vector Mathematical Library]] to accelerate their evaluation.

= Where can PyTables be used? =

{{{#!figure
#class right
{{attachment:pytables-powered.png}}
}}}

!PyTables can be used in any scenario where you need to save and retrieve large amounts of data and provide metadata (that is, data about the actual data) for it. Whether you want to work with large datasets of (potentially multidimensional) data, save and structure your [[http://numpy.scipy.org/|NumPy]] datasets, or just provide a categorized structure for some portions of your cluttered RDBMS, give !PyTables a try. It works well for storing data from data acquisition systems, sensors in geosciences, simulation software, network data monitoring systems or as a centralized repository for system logs, to name only a few possible uses.

However, it's important to emphasize that !PyTables is not designed to work as a relational database competitor, but rather as a ''teammate''. For example, if you have very large tables in your existing relational database, you can move those tables to !PyTables to reduce the burden on your existing database while efficiently keeping those huge tables on disk.

Finally, remember that !PyTables is ''Open Source'' software, so you are free to adapt it to your own needs, and due to its liberal [[License| BSD license]], you can include it in any software you like (even if it is commercial).
If you require extreme speed and optimal usage of resources, please consider getting a license of [[PyTablesPro|PyTables Professional Edition]], its commercial counterpart. By doing this you will be contributing to a longer and more productive life for the project.

= Design goals =

{{{#!figure
#class right
[[attachment:hierarchy-example.png|{{attachment:hierarchy-example-small.png|A view of some table objects.}}]]
}}}

!PyTables has been designed to fulfill the following requirements:

 1. Allow you to structure your data in a '''hierarchical''' way.
 2. '''Easy to use'''. It implements the NaturalNaming scheme for allowing convenient access to the data.
 3. All the '''cells''' in datasets can be '''multidimensional''' entities.
 4. Most of the '''I/O operations speed''' should be '''only limited by the underlying I/O subsystem''', be it '''disk or memory'''.
 5. Enable the end user to save and deal with large datasets in an efficient way, i.e. '''each single byte''' of data on disk has to be '''represented by one byte plus a small fraction''' when loaded into memory.