PyTables has a powerful capability to deal with native HDF5 files created with other tools. However, there are situations where you may want to create truly native PyTables files with those tools while retaining full compatibility with the PyTables format. That is perfectly possible, and this appendix presents the format that your own generated files should follow in order to be fully compatible with PyTables.
We are going to describe version 1.2 of the PyTables file format (introduced in PyTables version 0.8). At this stage, the file format is considered stable enough that no significant changes should be introduced for a reasonable amount of time. As time goes by, some changes will be introduced (and documented here) in order to cope with new needs. However, the changes will be carefully analyzed so as to ensure backward compatibility whenever possible.
A PyTables file is composed of an arbitrarily large number of HDF5 groups (Groups in the PyTables naming scheme) and datasets (Leaves in the PyTables naming scheme). For groups, the only requirement is that they must have some system attributes available. By convention, system attributes in PyTables are written in upper case, and user attributes in lower case, but this is not enforced by the software. In the case of datasets, besides the mandatory system attributes, some further conditions are required of their storage layout, as well as of the datatypes used in them, as we will see shortly.
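For instance, this attribute convention can be honored from any HDF5-capable tool. Below is a minimal sketch using h5py; the group name, attribute names and attribute values are illustrative only, and the actual mandatory system attributes and their values are the ones listed later in this appendix:

    import h5py
    import numpy as np

    with h5py.File("example.h5", "w") as f:
        group = f.create_group("/detector")
        # System attribute: upper case by convention (illustrative value).
        group.attrs["TITLE"] = np.bytes_("Detector data")
        # User attribute: lower case by convention.
        group.attrs["experiment_id"] = np.bytes_("run_42")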
As a final remark, you can use any filter you want to create a PyTables file, provided that the filter is a standard one in HDF5, like zlib, shuffle or szip (although the latter cannot be used from within PyTables to create a new file, datasets compressed with szip can be read, because it is the HDF5 library that performs the decompression transparently).
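For example, a dataset compressed with the standard zlib and shuffle filters can be created with another tool and still be read by PyTables. The following is a sketch using h5py, with assumed file and dataset names:

    import h5py
    import numpy as np

    with h5py.File("compressed.h5", "w") as f:
        data = np.arange(1_000_000, dtype=np.int32)
        f.create_dataset("values", data=data, chunks=(65536,),
                         shuffle=True,          # shuffle filter
                         compression="gzip",    # zlib (deflate) filter
                         compression_opts=5)    # compression level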
The File object is, in fact, a special HDF5 group structure that is the root for the rest of the objects in the object tree. The following attributes are mandatory for the HDF5 root group structure in PyTables files:
The following attributes are mandatory for group structures:
There exists a special Group, called the root, that, in addition to the attributes listed above, requires the following one:
This depends on the kind of Leaf. The format for each type follows.
The following attributes are mandatory for table structures:
A Table has a dataspace with a 1-dimensional chunked layout.
The datatype of the elements (rows) of a Table must be the H5T_COMPOUND compound datatype, and each of these compound components must be built with only the following HDF5 datatype classes:
You should note that nested compound datatypes are not allowed in Table objects.
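As an illustration, the following sketch (using h5py, with assumed names and an assumed row description) creates a 1-dimensional, chunked, enlargeable dataset with a flat H5T_COMPOUND row datatype, matching the layout described above; it does not set the mandatory system attributes, which would still be needed for a fully compatible Table:

    import h5py
    import numpy as np

    # Flat (non-nested) compound row type built from allowed classes.
    row_dtype = np.dtype([("name", "S16"),       # H5T_STRING
                          ("pressure", "<f8"),   # H5T_FLOAT
                          ("count", "<i4")])     # H5T_INTEGER

    with h5py.File("table_like.h5", "w") as f:
        dset = f.create_dataset("readings", shape=(0,), maxshape=(None,),
                                dtype=row_dtype, chunks=(1024,))
        # Append rows by enlarging the dataset and writing into it.
        rows = np.array([(b"probe-1", 1.5, 10), (b"probe-2", 2.5, 20)],
                        dtype=row_dtype)
        dset.resize((len(rows),))
        dset[:] = rows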
The following attributes are mandatory for array structures:
An Array has a dataspace with an N-dimensional contiguous layout (if you prefer a chunked layout, see EArray below).
The elements of an Array must have HDF5 atomic datatypes, and can currently be of one of the following HDF5 datatype classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT and H5T_STRING. See the Table format description in section B.3.1 for more information about these types.
You should note that H5T_ARRAY class datatypes are not allowed in Array objects.
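A sketch of such a layout, again using h5py with assumed names (and without the mandatory system attributes), could look like this:

    import h5py
    import numpy as np

    with h5py.File("array_like.h5", "w") as f:
        # No chunks/compression requested, so h5py stores the dataset
        # with a contiguous layout and an atomic (float64) datatype.
        f.create_dataset("matrix",
                         data=np.arange(5000, dtype=np.float64).reshape(100, 50))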
The following attributes are mandatory for earray structures:
An EArray has a dataspace with an N-dimensional chunked layout.
The elements of an EArray must have HDF5 atomic datatypes, and can currently be of one of the following HDF5 datatype classes: H5T_BITFIELD, H5T_INTEGER, H5T_FLOAT and H5T_STRING. See the Table format description in section B.3.1 for more information about these types.
You should note that H5T_ARRAY class datatypes are not allowed in EArray objects.
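The layout can be sketched with h5py as follows (assumed names, no system attributes); one dimension is left unlimited so that the dataset can be enlarged later:

    import h5py
    import numpy as np

    with h5py.File("earray_like.h5", "w") as f:
        dset = f.create_dataset("frames", shape=(0, 64, 64),
                                maxshape=(None, 64, 64),   # unlimited axis 0
                                chunks=(16, 64, 64), dtype=np.uint16)
        dset.resize((8, 64, 64))                           # extend along axis 0
        dset[:] = np.zeros((8, 64, 64), dtype=np.uint16)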
The following attributes are mandatory for vlarray structures:
A VLArray has a dataspace with a 1-dimensional chunked layout.
The datatype of the elements (rows) of VLArray objects must be the H5T_VLEN variable-length (or VL for short) datatype, and the base datatype specified for the VL datatype can be any atomic HDF5 datatype listed in the Table format description in section B.3.1. That includes the following classes:
You should note that this does not include other VL datatypes or compound datatypes. Note as well that, for the Object and VLString special flavors, the base type of the VL datatype is always H5T_NATIVE_UCHAR. That means that the complete row entry in the dataset has to be used in order to fully serialize the object or the variable-length string.
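A sketch of the storage layout with h5py (assumed names, no system attributes, H5T_INTEGER base type) could be:

    import h5py
    import numpy as np

    # Rows are variable-length sequences of 32-bit little-endian integers.
    vl_int32 = h5py.vlen_dtype(np.dtype("<i4"))

    with h5py.File("vlarray_like.h5", "w") as f:
        dset = f.create_dataset("ragged", shape=(3,), maxshape=(None,),
                                dtype=vl_int32, chunks=(128,))
        dset[0] = np.array([1], dtype=np.int32)
        dset[1] = np.array([1, 2], dtype=np.int32)
        dset[2] = np.array([1, 2, 3], dtype=np.int32)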
In addition, if you plan to use the VLString flavor for your text data and your strings use the ascii-7 (7-bit ASCII) encoding, but you don't know how (or just don't want) to convert them to the required UTF-8 encoding, you should not worry too much: the ASCII characters with values in the range [0x00, 0x7f] map directly to the Unicode characters in the range [U+0000, U+007F], and the UTF-8 encoding has the useful property that a UTF-8 encoded ascii-7 string is indistinguishable from a traditional ascii-7 string. So you will not need any further conversion in order to save your ascii-7 strings with the VLString flavor.
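This property is easy to verify in Python:

    # For ASCII-only text, the UTF-8 encoding produces exactly the same
    # bytes as the ASCII encoding, so no conversion step is needed.
    text = "plain ascii-7 text"
    assert text.encode("ascii") == text.encode("utf-8")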