NetCDF 4.6.1
The HDF5 library (1.8.11 and later) supports a general filter mechanism to apply various kinds of filters to datasets before reading or writing. The netCDF enhanced (aka netCDF-4) library inherits this capability since it depends on the HDF5 library.
Filters assume that a variable has chunking defined. Each chunk is filtered before it is written and "unfiltered" after it is read, before the data is passed to the user.
The most common kind of filter is a compression-decompression filter, and that is the focus of this document.
HDF5 supports dynamic loading of compression filters using the following process for reading of compressed data.
In order to compress a variable, the netcdf-c library must be given three pieces of information: (1) some unique identifier for the filter to be used, (2) a vector of parameters for controlling the action of the compression filter, and (3) a shared library implementation of the filter.
The meaning of the parameters is, of course, completely filter dependent and the filter description [3] needs to be consulted. For bzip2, for example, a single parameter is provided representing the compression level. It is legal to provide a zero-length set of parameters. Defaults are not provided, so this assumes that the filter can operate with zero parameters.
Filter ids are assigned by the HDF group. See [4] for a current list of assigned filter ids. Note that ids above 32767 can be used for testing without registration.
The first two pieces of information can be provided in one of three ways: using ncgen, via an API call, or via command line parameters to nccopy. In any case, remember that filtering also requires setting chunking, so the variable must also be marked with chunking information.
The necessary API methods are included in netcdf.h by default. One API method is for setting the filter to be used when writing a variable. The relevant signature is as follows.
This must be invoked after the variable has been created and before nc_enddef is invoked.
A second API method makes it possible to query a variable to obtain information about any associated filter using this signature.
The filter id will be returned in the idp argument (if non-NULL), the number of parameters in nparamsp, and the actual parameters in params. As is usual with the netcdf API, one is expected to call this function twice: the first time to get nparams and the second time to get the parameters in client-allocated memory.
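The two-call pattern described above can be sketched as follows. To keep the sketch compilable without the library, it uses a hypothetical stand-in, `inq_filter_stub`, which only mimics the shape of the query call (output pointers that may be NULL) and always reports bzip2 (id 307) with the single parameter 9. The helper name `get_filter_params` is also an assumption made for illustration; in real code the stand-in would be replaced by the library's query function, which additionally takes the file and variable ids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the library query call: reports filter id 307
 * with one parameter (9). Any output pointer may be NULL and is then skipped. */
static int inq_filter_stub(unsigned int *idp, size_t *nparamsp,
                           unsigned int *params)
{
    static const unsigned int stored[] = {9u};
    const size_t n = sizeof(stored) / sizeof(stored[0]);
    size_t i;

    if (idp != NULL) *idp = 307u;
    if (nparamsp != NULL) *nparamsp = n;
    if (params != NULL)
        for (i = 0; i < n; i++) params[i] = stored[i];
    return 0; /* 0 plays the role of "no error" */
}

/* Two-call pattern: the first call obtains the id and the parameter count,
 * the second fills client-allocated memory. Returns a malloc'ed vector that
 * the caller must free, or NULL if there are no parameters or a call fails. */
static unsigned int *get_filter_params(unsigned int *idp, size_t *nparamsp)
{
    unsigned int *params;

    assert(nparamsp != NULL);
    *nparamsp = 0;
    if (inq_filter_stub(idp, nparamsp, NULL) != 0) return NULL;
    if (*nparamsp == 0) return NULL;

    params = malloc(*nparamsp * sizeof(*params));
    if (params == NULL) return NULL;

    if (inq_filter_stub(NULL, NULL, params) != 0) {
        free(params);
        return NULL;
    }
    return params;
}
```

Allocating only after the first call means the caller never has to guess an upper bound on the number of parameters.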
In a CDL file, compression of a variable can be specified by annotating it with the following attribute:
This is a "special" attribute, which means that it will normally be invisible when using ncdump unless the -s flag is specified.
When copying a netcdf file using nccopy it is possible to specify filter information for any output variable by using the "-F" option on the command line; for example:
nccopy -F "var,307,9" unfiltered.nc filtered.nc
Assume that unfiltered.nc has a chunked but not bzip2-compressed variable named "var". This command will create that variable in the filtered.nc output file, but using the filter with id 307 (i.e. bzip2) and with parameter(s) 9 indicating the compression level. See the section on the parameter encoding syntax for the details on the allowable kinds of constants.
The "-F" option can be used repeatedly as long as the variable name part is different. A different filter id and parameters can be specified for each occurrence.
As a rule, any input filter on an input variable will be applied to the equivalent output variable, assuming the output file type is netcdf-4. It is, however, sometimes convenient to suppress output compression either totally or on a per-variable basis. Total suppression of output filters can be accomplished by specifying a special case of "-F", namely "-F none".
Suppression of output filtering for a specific variable can be accomplished using the format "-Fvar,none", where "var" is the fully qualified name of the variable.
The rules for all possible cases of the "-F" flag are defined by this table.
-F none | -Fvar,... | Input Filter | Applied Output Filter |
---|---|---|---|
true | unspecified | NA | unfiltered |
true | -Fvar,none | NA | unfiltered |
true | -Fvar,... | NA | use output filter |
false | unspecified | defined | use input filter |
false | -Fvar,none | NA | unfiltered |
false | -Fvar,... | NA | use output filter |
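The table above can be read as a small decision function. The following is a sketch only, not the nccopy implementation: the names `out_filter_t` and `choose_output_filter` are hypothetical, and the case the table does not list (no "-F" option at all and no input filter) is assumed to leave the variable unfiltered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    OUT_UNFILTERED,
    OUT_USE_OUTPUT_FILTER,
    OUT_USE_INPUT_FILTER
} out_filter_t;

/* f_none:           nonzero if "-F none" was given.
 * var_spec:         the text after "var," in a per-variable -F option,
 *                   or NULL if no -F option names this variable.
 * input_has_filter: nonzero if the input variable has a filter defined. */
static out_filter_t choose_output_filter(int f_none, const char *var_spec,
                                         int input_has_filter)
{
    assert(var_spec == NULL || var_spec[0] != '\0');

    /* A per-variable option always decides the outcome for that variable. */
    if (var_spec != NULL) {
        if (strcmp(var_spec, "none") == 0) return OUT_UNFILTERED;
        return OUT_USE_OUTPUT_FILTER;
    }
    /* No per-variable option: "-F none" suppresses everything. */
    if (f_none) return OUT_UNFILTERED;
    /* Otherwise carry the input filter over, if there is one
     * (the "no input filter" branch is an assumption, see above). */
    return input_has_filter ? OUT_USE_INPUT_FILTER : OUT_UNFILTERED;
}
```

Checking the per-variable option first is what makes rows 2, 3, 5 and 6 of the table independent of the "-F none" column.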
The parameters passed to a filter are encoded internally as a vector of 32-bit unsigned integers. It may be that the parameters required by a filter can naturally be encoded as unsigned integers. The bzip2 compression filter, for example, expects a single integer value from zero through nine. This encodes naturally as a single unsigned integer.
Note that signed integers and single-precision (32-bit) float values also can easily be represented as 32 bit unsigned integers by proper casting to an unsigned integer so that the bit pattern is preserved. Simple integer values of type short or char (or the unsigned versions) can also be mapped to an unsigned integer by truncating to 16 or 8 bits respectively and then zero extending.
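As an illustration of these casting rules, here is a minimal sketch of encoding helpers. The function names are hypothetical (they are not part of the netcdf API), and the code assumes that unsigned int and float are both 32 bits wide and that floats use the IEEE 754 format.

```c
#include <assert.h>
#include <string.h>

/* Copy the 32 bits of a float into an unsigned int without any numeric
 * conversion, so the bit pattern is preserved. */
static unsigned int encode_float(float f)
{
    unsigned int u;
    assert(sizeof(u) == sizeof(f)); /* assumption: both are 32 bits */
    memcpy(&u, &f, sizeof(u));
    return u;
}

/* Inverse of encode_float: recover the float from its bit pattern. */
static float decode_float(unsigned int u)
{
    float f;
    assert(sizeof(u) == sizeof(f));
    memcpy(&f, &u, sizeof(f));
    return f;
}

/* Signed 32-bit integer: conversion to unsigned keeps the two's
 * complement bit pattern. */
static unsigned int encode_int(int i)
{
    return (unsigned int)i;
}

/* Short: truncate to 16 bits, then zero extend to 32 bits. */
static unsigned int encode_short(short s)
{
    return (unsigned int)(unsigned short)s;
}

/* Byte: truncate to 8 bits, then zero extend to 32 bits. */
static unsigned int encode_byte(signed char c)
{
    return (unsigned int)(unsigned char)c;
}
```

With these helpers, -77 encodes as 0xFFFFFFB3, the short -25 as 0x0000FFE7, and the byte -17 as 0x000000EF. Using memcpy for the float avoids the aliasing problems of casting a float pointer to an integer pointer.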
Machine byte order (aka endian-ness) is an issue for passing some kinds of parameters. You might define the parameters when compressing on a little endian machine, but later do the decompression on a big endian machine. Byte order is not an issue for 32-bit values because HDF5 takes care of converting them between the local machine byte order and network byte order.
Parameters whose size is larger than 32-bits present a byte order problem. This typically includes double precision floats and (signed or unsigned) 64-bit integers. For these cases, the machine byte order must be handled by the compression code. This is because HDF5 will treat, for example, an unsigned long long as two 32-bit unsigned integers and will convert each to network order separately. This means that on a machine whose byte order is different than the machine in which the parameters were initially created, the two integers are out of order and must be swapped to get the correct unsigned long long value. Consider this example. Suppose we have this little endian unsigned long long.
1000000230000004
In network byte order, it will be stored as two 32-bit integers.
20000001 40000003
On a big endian machine, this will be given to the filter in that form.
2000000140000003
But note that the proper big endian unsigned long long form is this.
4000000320000001
So, the two words need to be swapped.
But consider the case when both original and final machines are big endian.
1. 4000000320000001
2. 40000003 20000001
3. 40000003 20000001
where #1 is the original number, #2 is the network order, and #3 is what is given to the filter. In this case we do not want to swap words.
The solution is to forcibly encode the original number using some specified endianness so that the filter always assumes it is getting its parameters in that order and will always do swapping as needed. This is irritating, but one needs to be aware of it. Since most machines are little-endian, we choose to use that as the endianness for handling 64-bit entities.
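One way to realize such a fixed convention is to split the 64-bit value arithmetically, so that the result does not depend on the byte order of the machine doing the encoding or the decoding. The sketch below stores the least significant 32-bit word first, which matches the word order of a little-endian 64-bit value. The function names are hypothetical, and this is not necessarily the exact scheme used inside netcdf-c; it relies only on the statement above that each 32-bit value survives the trip unchanged.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split a 64-bit value into two 32-bit parameters, low word first. */
static void encode_u64(uint64_t v, unsigned int params[2])
{
    assert(params != NULL);
    params[0] = (unsigned int)(v & 0xFFFFFFFFu); /* least significant word */
    params[1] = (unsigned int)(v >> 32);         /* most significant word */
}

/* Reassemble the 64-bit value; works the same on any byte order because
 * it uses shifts on values rather than copying raw bytes. */
static uint64_t decode_u64(const unsigned int params[2])
{
    assert(params != NULL);
    return ((uint64_t)params[1] << 32) | (uint64_t)params[0];
}

/* A double is handled by moving its bit pattern through a uint64_t. */
static void encode_double(double d, unsigned int params[2])
{
    uint64_t bits;
    assert(sizeof(bits) == sizeof(d)); /* assumption: 64-bit double */
    memcpy(&bits, &d, sizeof(bits));
    encode_u64(bits, params);
}

static double decode_double(const unsigned int params[2])
{
    uint64_t bits = decode_u64(params);
    double d;
    memcpy(&d, &bits, sizeof(d));
    return d;
}
```

A filter that decodes its parameters this way never has to ask which kind of machine wrote them, which is the point of fixing the convention in advance.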
Both of the utilities __ncgen__ and __nccopy__ allow the specification of filter parameters. These specifications consist of a sequence of comma separated constants. The constants are converted within the utility to a proper set of unsigned int constants (see the parameter encoding section).
To simplify things, various kinds of constants can be specified rather than just simple unsigned integers. The utilities will encode them properly using the rules specified in the parameter encoding section.
The currently supported constants are as follows.
Example | Type | Format Tag | Notes |
---|---|---|---|
-17b | signed 8-bit byte | b\|B | Truncated to 8 bits and zero extended to 32 bits |
23ub | unsigned 8-bit byte | u\|U b\|B | Truncated to 8 bits and zero extended to 32 bits |
-25S | signed 16-bit short | s\|S | Truncated to 16 bits and zero extended to 32 bits |
27US | unsigned 16-bit short | u\|U s\|S | Truncated to 16 bits and zero extended to 32 bits |
-77 | implicit signed 32-bit integer | Leading minus sign and no tag | |
77 | implicit unsigned 32-bit integer | No tag | |
93U | explicit unsigned 32-bit integer | u\|U | |
789f | 32-bit float | f\|F | |
12345678.12345678d | 64-bit double | d\|D | Network byte order |
-9223372036854775807L | 64-bit signed long long | l\|L | Network byte order |
18446744073709551615UL | 64-bit unsigned long long | u\|U l\|L | Network byte order |
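The tag rules in the table above can be sketched as a classifier that looks only at the trailing characters of a constant. This is an illustration, not the parser used by ncgen or nccopy: the names `const_kind_t` and `classify_constant` are hypothetical, the numeric part is not validated, and only decimal forms like those in the table are assumed (a hexadecimal constant ending in "b" or "d" would be misread).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    K_BYTE, K_UBYTE, K_SHORT, K_USHORT, K_INT, K_UINT,
    K_FLOAT, K_DOUBLE, K_LONGLONG, K_ULONGLONG
} const_kind_t;

/* Decide the kind of a constant from its trailing format tag.
 * Tags are case-insensitive; "u" may precede b, s, or l. */
static const_kind_t classify_constant(const char *text)
{
    size_t len;
    int last, prev;

    assert(text != NULL && text[0] != '\0');
    len = strlen(text);
    last = tolower((unsigned char)text[len - 1]);
    prev = (len >= 2) ? tolower((unsigned char)text[len - 2]) : 0;

    switch (last) {
    case 'b': return (prev == 'u') ? K_UBYTE : K_BYTE;
    case 's': return (prev == 'u') ? K_USHORT : K_SHORT;
    case 'l': return (prev == 'u') ? K_ULONGLONG : K_LONGLONG;
    case 'u': return K_UINT;
    case 'f': return K_FLOAT;
    case 'd': return K_DOUBLE;
    default:
        /* No tag: a leading minus sign selects the signed 32-bit form. */
        return (text[0] == '-') ? K_INT : K_UINT;
    }
}
```

Once the kind is known, the value can be converted and packed into one or two unsigned integers according to the parameter encoding rules.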
Some things to note.
The documentation [1,2] for the HDF5 dynamic loading was (at the time this was written) out-of-date with respect to the actual HDF5 code (see H5PL.c). So, the following discussion is largely derived from looking at the actual code. This means that it is subject to change.
The HDF5 loader expects plugins to be in a specified plugin directory. The default directory is:
The default may be overridden using the environment variable HDF5_PLUGIN_PATH.
Given a plugin directory, HDF5 examines every file in that directory that conforms to a specified name pattern as determined by the platform on which the library is being executed.
Platform | Basename | Extension |
---|---|---|
Linux | lib* | .so* |
OSX | lib* | .so* |
Cygwin | cyg* | .dll* |
Windows | * | .dll |
For each dynamic library located using the previous patterns, HDF5 attempts to load the library and attempts to obtain information from it. Specifically, it looks for two functions with the following signatures.
If plugin verification fails, then that plugin is ignored and the search continues for another matching plugin.
Debugging plugins can be very difficult. You will probably need to use the old printf approach for debugging the filter itself.
One case worth mentioning is when you have a dataset that is using an unknown filter. For this situation, you need to identify what filter(s) are used in the dataset. This can be accomplished by running ncdump with the -h and -s flags on the dataset file (that is, "ncdump -hs" followed by the file name).
Since ncdump is not being asked to access the data (the -h flag), it can obtain the filter information without failures. Then it can print out the filter id and the parameters (the -s flag).
Within the netcdf-c source tree, the directory netcdf-c/nc_test4 contains a test case (test_filter.c) for testing dynamic filter writing and reading using bzip2. Another test (test_filter_misc.c) validates parameter passing. These tests are disabled if --enable-shared is not set or if --enable-netcdf-4 is not set.
A slightly simplified version of the filter test case is also available as an example within the netcdf-c source tree directory netcdf-c/examples/C. The test is called __filter_example.c__ and it is executed as part of the run_examples4.sh shell script. The test case demonstrates dynamic filter writing and reading.
The files example/C/hdf5plugins/Makefile.am and example/C/hdf5plugins/CMakeLists.txt demonstrate how to build the hdf5 plugin for bzip2.
The current matrix of operating systems and build systems known to work is as follows.
Build System | Supported OS |
---|---|
Automake | Linux, Cygwin |
CMake | Linux, Cygwin, Visual Studio |
If you do not want to use Automake or CMake, the following has been known to work.
Since in some cases it is necessary for a filter to byte swap from little-endian to big-endian, this appendix provides sample code for doing this. It also provides a code snippet for testing the endianness of a machine.
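The following is a minimal sketch of such code, written for illustration; it is not taken from the netcdf-c sources, and the names `byteswap` and `is_little_endian` are hypothetical. The swap routine reverses an arbitrary number of bytes in place, so it covers both the 4-byte and the 8-byte case.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the order of n bytes in place. Applying it to the 8 bytes of a
 * 64-bit value converts between little-endian and big-endian layouts. */
static void byteswap(unsigned char *mem, size_t n)
{
    size_t i;

    assert(mem != NULL);
    for (i = 0; i < n / 2; i++) {
        unsigned char tmp = mem[i];
        mem[i] = mem[n - 1 - i];
        mem[n - 1 - i] = tmp;
    }
}

/* Return 1 on a little-endian machine and 0 on a big-endian one, by
 * checking which byte of the integer 1 is stored first in memory. */
static int is_little_endian(void)
{
    const unsigned int probe = 1u;
    return *(const unsigned char *)&probe == 1;
}
```

A filter that expects its 64-bit parameters in little-endian form would call `byteswap` on the 8 bytes only when `is_little_endian()` returns 0.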