NetCDF  4.6.1
filters.md
Filter Support in netCDF-4 (Enhanced)
============================
<!-- double header is needed to workaround doxygen bug -->

Filter Support in netCDF-4 (Enhanced) {#compress}
=================================

[TOC]

Introduction {#compress_intro}
==================

The HDF5 library (1.8.11 and later)
supports a general filter mechanism for applying various
kinds of filters to datasets before reading or writing.
The netCDF enhanced (aka netCDF-4) library inherits this
capability since it depends on the HDF5 library.

Filters assume that a variable has chunking defined.
Each chunk is filtered before writing and "unfiltered"
after reading, before the data is passed to the user.

The most common kind of filter is a compression-decompression
filter, and that is the focus of this document.

HDF5 supports dynamic loading of compression filters using the following
process for reading compressed data.

1. Assume that we have a dataset with one or more variables that
were compressed using some algorithm. How the dataset was compressed
is discussed in later sections.

2. Shared libraries or DLLs exist that implement the compress/decompress
algorithm. These libraries have a specific API so that the HDF5 library
can locate, load, and utilize the compressor.
These libraries are expected to be installed in a specific
directory.

Enabling A Compression Filter {#Enable}
=============================

In order to compress a variable, the netcdf-c library
must be given three pieces of information:
(1) a unique identifier for the filter to be used,
(2) a vector of parameters for
controlling the action of the compression filter, and
(3) a shared library implementation of the filter.

The meaning of the parameters is, of course,
completely filter dependent, so the filter
description [3] needs to be consulted. For
bzip2, for example, a single parameter is provided
representing the compression level.
It is legal to provide a zero-length set of parameters.
Defaults are not provided, so this assumes that
the filter can operate with zero parameters.

Filter ids are assigned by the HDF Group. See [4]
for a current list of assigned filter ids.
Note that ids above 32767 can be used for testing without
registration.

The first two pieces of information can be provided in one of three ways:
using __ncgen__, via an API call, or via command line parameters to __nccopy__.
In any case, remember that filtering also requires chunking, so the
variable must also be marked with chunking information.

Using The API {#API}
-------------
The necessary API methods are included in __netcdf.h__ by default.
One API method sets the filter to be used
when writing a variable. The relevant signature is
as follows.
````
int nc_def_var_filter(int ncid, int varid, unsigned int id, size_t nparams, const unsigned int* parms);
````
This must be invoked after the variable has been created and before
__nc_enddef__ is invoked.

A second API method makes it possible to query a variable to
obtain information about any associated filter using this signature.
````
int nc_inq_var_filter(int ncid, int varid, unsigned int* idp, size_t* nparams, unsigned int* params);
````
The filter id will be returned in the __idp__ argument (if non-NULL),
the number of parameters in __nparams__ and the actual parameters in
__params__. As is usual with the netcdf API, one is expected to call
this function twice: the first time to get __nparams__ and the
second to get the parameters in client-allocated memory.
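
The two-call pattern described above can be sketched as follows. So that the
sketch runs without the netCDF library, it calls a stand-in named
__fake_inq_var_filter__ that has the same parameter list as
__nc_inq_var_filter__ and always reports filter id 307 with the single
parameter 9. The stand-in, its canned values, and the helper
__read_filter_params__ are hypothetical. Real code would call
__nc_inq_var_filter__ instead and compare the result against __NC_NOERR__.

```c
#include <stdlib.h>

/* Hypothetical stand-in with the same parameter list as nc_inq_var_filter.
   It always reports filter id 307 with one parameter whose value is 9. */
static int
fake_inq_var_filter(int ncid, int varid, unsigned int* idp,
                    size_t* nparamsp, unsigned int* params)
{
    (void)ncid; (void)varid;
    if(idp) *idp = 307;
    if(nparamsp) *nparamsp = 1;
    if(params) params[0] = 9;
    return 0; /* 0 plays the role of NC_NOERR here */
}

/* Hypothetical helper: returns a malloc'd parameter vector that the
   caller must free, or NULL on error or when there are no parameters. */
static unsigned int*
read_filter_params(int ncid, int varid, unsigned int* idp, size_t* nparamsp)
{
    unsigned int* params;

    /* First call: pass NULL for params to learn only the count. */
    if(fake_inq_var_filter(ncid, varid, idp, nparamsp, NULL) != 0)
        return NULL;
    if(*nparamsp == 0)
        return NULL;

    /* Second call: fetch the values into client-allocated memory. */
    params = (unsigned int*)malloc(*nparamsp * sizeof(unsigned int));
    if(params == NULL)
        return NULL;
    if(fake_inq_var_filter(ncid, varid, idp, nparamsp, params) != 0) {
        free(params);
        return NULL;
    }
    return params;
}
```

The caller owns the returned vector and releases it with __free__.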

Using ncgen {#NCGEN}
-------------

In a CDL file, compression of a variable can be specified
by annotating it with the following attribute:

* ''_Filter'' -- a string containing a comma-separated list of
constants specifying (1) the filter id to apply, and (2)
a vector of constants representing the
parameters for controlling the operation of the specified filter.
See the section on the <a href="#Syntax">parameter encoding syntax</a>
for the details on the allowable kinds of constants.

This is a "special" attribute, which means that
it will normally be invisible when using
__ncdump__ unless the -s flag is specified.

Example CDL File (Data elided)
------------------------------
````
netcdf bzip2 {
dimensions:
  dim0 = 4 ; dim1 = 4 ; dim2 = 4 ; dim3 = 4 ;
variables:
  float var(dim0, dim1, dim2, dim3) ;
  var:_Filter = "307,9" ;
  var:_Storage = "chunked" ;
  var:_ChunkSizes = 4, 4, 4, 4 ;
data:
...
}
````

Using nccopy {#NCCOPY}
-------------
When copying a netcdf file using __nccopy__ it is possible
to specify filter information for any output variable by
using the "-F" option on the command line; for example:
````
nccopy -F "var,307,9" unfiltered.nc filtered.nc
````
Assume that __unfiltered.nc__ has a chunked but not bzip2-compressed
variable named "var". This command will create that variable in
the __filtered.nc__ output file, but using the filter with id 307
(i.e. bzip2) and with parameter(s) 9 indicating the compression level.
See the section on the <a href="#Syntax">parameter encoding syntax</a>
for the details on the allowable kinds of constants.

The "-F" option can be used repeatedly as long as the variable name
part is different. A different filter id and parameters can be
specified for each occurrence.

As a rule, any input filter on an input variable will be applied
to the equivalent output variable -- assuming the output file type
is netcdf-4. It is, however, sometimes convenient to suppress
output compression either totally or on a per-variable basis.
Total suppression of output filters can be accomplished by specifying
a special case of "-F", namely this.
````
nccopy -F "none" input.nc output.nc
````
Suppression of output filtering for a specific variable can be accomplished
using this format.
````
nccopy -F "var,none" input.nc output.nc
````
where "var" is the fully qualified name of the variable.

The rules for all possible cases of the "-F" flag are defined
by this table.

<table>
<tr><th>-F none<th>-Fvar,...<th>Input Filter<th>Applied Output Filter
<tr><td>true<td>unspecified<td>NA<td>unfiltered
<tr><td>true<td>-Fvar,none<td>NA<td>unfiltered
<tr><td>true<td>-Fvar,...<td>NA<td>use output filter
<tr><td>false<td>unspecified<td>defined<td>use input filter
<tr><td>false<td>-Fvar,none<td>NA<td>unfiltered
<tr><td>false<td>-Fvar,...<td>NA<td>use output filter
</table>
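
The table can be read as a small decision function. The following is only a
sketch of that reading, not the code used inside __nccopy__. All names in it
are hypothetical. The table does not list the case where neither "-F" form is
given and the input variable has no filter, so the sketch assumes the output
is left unfiltered in that case.

```c
/* Hypothetical names: how a per-variable -F option was given. */
typedef enum { PERVAR_UNSPECIFIED, PERVAR_NONE, PERVAR_FILTER } pervar_opt;

/* Hypothetical names: what ends up applied to the output variable. */
typedef enum { OUT_UNFILTERED, OUT_USE_INPUT_FILTER, OUT_USE_OUTPUT_FILTER } out_action;

/* global_none:      nonzero if -F "none" was given
   pervar:           which per-variable form, if any, names this variable
   input_has_filter: nonzero if the input variable has a filter defined */
static out_action
choose_output_filter(int global_none, pervar_opt pervar, int input_has_filter)
{
    /* A per-variable option overrides everything else (rows 2, 3, 5, 6). */
    if(pervar == PERVAR_FILTER) return OUT_USE_OUTPUT_FILTER;
    if(pervar == PERVAR_NONE) return OUT_UNFILTERED;
    /* No per-variable option: -F "none" suppresses filtering (row 1). */
    if(global_none) return OUT_UNFILTERED;
    /* Otherwise carry the input filter over (row 4).
       Assumption: an unfiltered input stays unfiltered. */
    return input_has_filter ? OUT_USE_INPUT_FILTER : OUT_UNFILTERED;
}
```

For example, `choose_output_filter(1, PERVAR_FILTER, 0)` corresponds to the
third row of the table and yields `OUT_USE_OUTPUT_FILTER`.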

Parameter Encoding {#ParamEncode}
==========

The parameters passed to a filter are encoded internally as a vector
of 32-bit unsigned integers. It may be that the parameters
required by a filter can naturally be encoded as unsigned integers.
The bzip2 compression filter, for example, expects a single
integer value from zero through nine. This encodes naturally as a
single unsigned integer.

Note that signed integers and single-precision (32-bit) float values
can also easily be represented as 32-bit unsigned integers by
proper casting to an unsigned integer so that the bit pattern
is preserved. Simple integer values of type short or char
(or the unsigned versions) can also be mapped to an unsigned
integer by truncating to 16 or 8 bits respectively and then
zero extending.
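
As a minimal sketch of these mappings, the helpers below convert single values
into one 32-bit parameter each. The function names are hypothetical and are
not part of the netCDF API. The sketch assumes that __unsigned int__ and
__float__ are both 32 bits wide, and it uses __memcpy__ rather than a pointer
cast to carry the float's bit pattern over unchanged.

```c
#include <string.h>

/* Hypothetical helpers; assumes unsigned int and float are both 32 bits. */

/* Preserve the bit pattern of a float in an unsigned int. */
static unsigned int
encode_float(float f)
{
    unsigned int u;
    memcpy(&u, &f, sizeof(u));
    return u;
}

/* A signed 32-bit integer keeps its two's complement bit pattern. */
static unsigned int
encode_int(int i)
{
    return (unsigned int)i;
}

/* Truncate to 16 bits, then zero extend to 32 bits. */
static unsigned int
encode_short(short s)
{
    return (unsigned int)s & 0xFFFFu;
}

/* Truncate to 8 bits, then zero extend to 32 bits. */
static unsigned int
encode_char(signed char c)
{
    return (unsigned int)c & 0xFFu;
}
```

On a machine with IEEE 754 floats, `encode_float(1.0f)` gives 0x3F800000, and
`encode_short(-25)` gives 0xFFE7.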

Machine byte order (aka endian-ness) is an issue for passing
some kinds of parameters. You might define the parameters when
compressing on a little endian machine, but later do the
decompression on a big endian machine. Byte order is not an
issue for 32-bit values because HDF5 takes care of converting
them between the local machine byte order and network byte
order.

Parameters whose size is larger than 32 bits present a byte order problem.
This typically includes double precision floats and (signed or unsigned)
64-bit integers. For these cases, the machine byte order must be
handled by the compression code. This is because HDF5 will treat,
for example, an unsigned long long as two 32-bit unsigned integers
and will convert each to network order separately. This means that
on a machine whose byte order differs from that of the machine on which
the parameters were initially created, the two integers are out of order
and must be swapped to get the correct unsigned long long value.
Consider this example. Suppose we have this unsigned long long,
shown byte by byte as it is stored in memory on a little endian machine.

    0100002003000040

In network byte order, it will be stored as two 32-bit integers.

    20000001 40000003

On a big endian machine, this will be given to the filter in that form.

    2000000140000003

But note that the proper big endian unsigned long long form is this.

    4000000320000001

So, the two words need to be swapped.

But consider the case when both original and final machines are big endian.

1. 4000000320000001
2. 40000003 20000001
3. 40000003 20000001

where #1 is the original number, #2 is the network order and
#3 is what is given to the filter. In this case we do not
want to swap words.

The solution is to forcibly encode the original number using some
specified endianness so that the filter always assumes it is getting
its parameters in that order and will always do swapping as needed.
This is irritating, but one needs to be aware of it. Since most
machines are little-endian, we choose to use that as the endianness
for handling 64-bit entities.
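
One way to get a fixed word order that does not depend on the host is to split
and rejoin the 64-bit value arithmetically, with the low 32-bit word first, as
it would sit in memory on a little-endian machine. This is a sketch under that
assumption and is not necessarily what __ncgen__ or __nccopy__ do internally.
The names __split64_le__ and __join64_le__ are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: split a 64-bit value into two 32-bit parameters,
   low word first. Shifts and masks act on values, not on bytes in memory,
   so the result is the same on big- and little-endian hosts. */
static void
split64_le(uint64_t v, unsigned int out[2])
{
    out[0] = (unsigned int)(v & 0xFFFFFFFFu); /* low 32 bits */
    out[1] = (unsigned int)(v >> 32);         /* high 32 bits */
}

/* Hypothetical helper: what a filter would do to recover the value from
   two parameters that were stored low word first. */
static uint64_t
join64_le(const unsigned int in[2])
{
    return ((uint64_t)in[1] << 32) | (uint64_t)in[0];
}
```

Because HDF5 preserves the numeric value of each 32-bit parameter, a filter
that rejoins the words this way needs no explicit word swap on either kind of
machine.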

Filter Specification Syntax {#Syntax}
==========

Both of the utilities
<a href="#NCGEN">__ncgen__</a>
and
<a href="#NCCOPY">__nccopy__</a>
allow the specification of filter parameters.
These specifications consist of a sequence of comma-separated
constants. The constants are converted
within the utility to a proper set of unsigned int
constants (see the <a href="#ParamEncode">parameter encoding section</a>).

To simplify things, various kinds of constants can be specified
rather than just simple unsigned integers. The utilities will encode
them properly using the rules specified in
the <a href="#ParamEncode">parameter encoding section</a>.

The currently supported constants are as follows.
<table>
<tr halign="center"><th>Example<th>Type<th>Format Tag<th>Notes
<tr><td>-17b<td>signed 8-bit byte<td>b|B<td>Truncated to 8 bits and zero extended to 32 bits
<tr><td>23ub<td>unsigned 8-bit byte<td>u|U b|B<td>Truncated to 8 bits and zero extended to 32 bits
<tr><td>-25S<td>signed 16-bit short<td>s|S<td>Truncated to 16 bits and zero extended to 32 bits
<tr><td>27US<td>unsigned 16-bit short<td>u|U s|S<td>Truncated to 16 bits and zero extended to 32 bits
<tr><td>-77<td>implicit signed 32-bit integer<td>Leading minus sign and no tag<td>
<tr><td>77<td>implicit unsigned 32-bit integer<td>No tag<td>
<tr><td>93U<td>explicit unsigned 32-bit integer<td>u|U<td>
<tr><td>789f<td>32-bit float<td>f|F<td>
<tr><td>12345678.12345678d<td>64-bit double<td>d|D<td>Network byte order
<tr><td>-9223372036854775807L<td>64-bit signed long long<td>l|L<td>Network byte order
<tr><td>18446744073709551615UL<td>64-bit unsigned long long<td>u|U l|L<td>Network byte order
</table>

Some things to note.

1. In all cases, except for an untagged positive integer,
   the format tag is required and determines how the constant
   is converted to one or two unsigned int values.
   The positive integer case is for backward compatibility.
2. For signed byte and short, the value is sign extended to 32 bits
   and then treated as an unsigned int value.
3. Double and signed|unsigned long long values are converted
   to network byte order and then treated as two unsigned int values.
   This is consistent with the <a href="#ParamEncode">parameter encoding</a>.
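
The format tags in the table can be recognized by looking at the last one or
two characters of a constant. The sketch below only classifies a constant by
its tag. It does not validate the digits or perform the conversion, and it is
not the parser used by __ncgen__ or __nccopy__. The enum and function names
are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical classification of the constant kinds listed in the table. */
typedef enum {
    K_BYTE, K_UBYTE, K_SHORT, K_USHORT, K_INT, K_UINT,
    K_FLOAT, K_DOUBLE, K_LONGLONG, K_ULONGLONG
} const_kind;

/* Classify a constant such as "-17b" or "27US" by its trailing tag.
   Tags are case-insensitive. The digits are not checked. */
static const_kind
classify_constant(const char* text)
{
    size_t len = strlen(text);
    int last = (len > 0) ? tolower((unsigned char)text[len - 1]) : 0;
    int is_unsigned = 0;

    /* b, s and l may be preceded by a u|U marker. */
    if(len > 1 && (last == 'b' || last == 's' || last == 'l'))
        is_unsigned = (tolower((unsigned char)text[len - 2]) == 'u');

    switch(last) {
    case 'b': return is_unsigned ? K_UBYTE : K_BYTE;
    case 's': return is_unsigned ? K_USHORT : K_SHORT;
    case 'l': return is_unsigned ? K_ULONGLONG : K_LONGLONG;
    case 'u': return K_UINT;
    case 'f': return K_FLOAT;
    case 'd': return K_DOUBLE;
    default:
        /* No tag: a leading minus sign means implicit signed 32-bit. */
        return (text[0] == '-') ? K_INT : K_UINT;
    }
}
```

With this sketch, every example in the table maps to the type listed beside
it; for instance, "-25S" is classified as `K_SHORT` and "77" as `K_UINT`.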

Dynamic Loading Process {#Process}
==========

The documentation [1,2] for HDF5 dynamic loading was (at the time
this was written) out-of-date with respect to the actual HDF5 code
(see H5PL.c). So, the following discussion is largely derived
from looking at the actual code. This means that it is subject to change.

Plugin directory {#Plugindir}
----------------

The HDF5 loader expects plugins to be in a specified plugin directory.
The default directory is:

 * "/usr/local/hdf5/lib/plugin" for linux/unix operating systems (including Cygwin)
 * "%ALLUSERSPROFILE%\\hdf5\\lib\\plugin" for Windows systems, although the code
   does not appear to explicitly use this path.

The default may be overridden using the environment variable
__HDF5_PLUGIN_PATH__.
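
The override rule can be sketched as a small helper that prefers the value of
__HDF5_PLUGIN_PATH__ and otherwise falls back to the linux/unix default given
above. This is an illustration of the rule, not HDF5's own lookup code. The
name __resolve_plugin_dir__ is hypothetical, and treating an empty value the
same as an unset variable is an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: envval is the result of getenv("HDF5_PLUGIN_PATH"),
   which is NULL when the variable is not set. */
static const char*
resolve_plugin_dir(const char* envval)
{
    /* Assumption: an empty value is treated like an unset variable. */
    if(envval != NULL && envval[0] != '\0')
        return envval;
    return "/usr/local/hdf5/lib/plugin"; /* linux/unix default */
}
```

A caller would pass the environment value directly, for example
`resolve_plugin_dir(getenv("HDF5_PLUGIN_PATH"))` after including
__stdlib.h__.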

Plugin Library Naming {#Pluginlib}
---------------------

Given a plugin directory, HDF5 examines every file in that
directory that conforms to a specified name pattern
as determined by the platform on which the library is being executed.
<table>
<tr halign="center"><th>Platform<th>Basename<th>Extension
<tr halign="left"><td>Linux<td>lib*<td>.so*
<tr halign="left"><td>OSX<td>lib*<td>.so*
<tr halign="left"><td>Cygwin<td>cyg*<td>.dll*
<tr halign="left"><td>Windows<td>*<td>.dll
</table>
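
As an illustration of the Linux row of this table, the following sketch checks
whether a file name has the form lib\*.so\*. It is a simplified reading of the
pattern, not the matching code in HDF5, and the function name is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if name looks like lib*.so*, else 0.
   Examples that match: "libbzip2.so" and "libbzip2.so.1". */
static int
looks_like_linux_plugin(const char* name)
{
    /* Basename must start with "lib". */
    if(strncmp(name, "lib", 3) != 0)
        return 0;
    /* ".so" must appear somewhere after the "lib" prefix. */
    return strstr(name + 3, ".so") != NULL;
}
```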

Plugin Verification {#Pluginverify}
-------------------
For each dynamic library located using the previous patterns,
HDF5 attempts to load the library and to obtain information
from it. Specifically, it looks for two functions with the following
signatures.

1. __H5PL_type_t H5PLget_plugin_type(void)__ --
This function is expected to return the constant value
__H5PL_TYPE_FILTER__ to indicate that this is a filter library.
2. __const void* H5PLget_plugin_info(void)__ --
This function returns a pointer to a table of type __H5Z_class2_t__.
This table contains the information needed to utilize the
filter both for reading and for writing. In particular, it specifies
the filter id implemented by the library, and it must match the id
specified for the variable in __nc_def_var_filter__ in order to be used.

If plugin verification fails, then that plugin is ignored and
the search continues for another matching plugin.

Debugging {#Debug}
-------
Debugging plugins can be very difficult. You will probably
need to use the old printf approach for debugging the filter itself.

One case worth mentioning is when you have a dataset that is
using an unknown filter. For this situation, you need to
identify what filter(s) are used in the dataset. This can
be accomplished using this command.
````
ncdump -s -h <dataset filename>
````
Since ncdump is not being asked to access the data (the -h flag), it
can obtain the filter information without failures. Then it can print
out the filter id and the parameters (the -s flag).

Test Case {#TestCase}
-------
Within the netcdf-c source tree, the directory
__netcdf-c/nc_test4__ contains a test case (__test_filter.c__) for
testing dynamic filter writing and reading using
bzip2. Another test (__test_filter_misc.c__) validates
parameter passing. These tests are disabled if __--enable-shared__
is not set or if __--enable-netcdf-4__ is not set.

Example {#Example}
-------
A slightly simplified version of the filter test case is also
available as an example within the netcdf-c source tree
directory __netcdf-c/examples/C__. The test is called __filter_example.c__
and it is executed as part of the __run_examples4.sh__ shell script.
The test case demonstrates dynamic filter writing and reading.

The files __examples/C/hdf5plugins/Makefile.am__
and __examples/C/hdf5plugins/CMakeLists.txt__
demonstrate how to build the hdf5 plugin for bzip2.

Notes
==========

Supported Systems
-----------------
The current matrix of operating systems and build systems known to work is as follows.
<table>
<tr><th>Build System<th>Supported OS
<tr><td>Automake<td>Linux, Cygwin
<tr><td>Cmake<td>Linux, Cygwin, Visual Studio
</table>

Generic Plugin Build
--------------------
If you do not want to use Automake or Cmake, the following
has been known to work.
````
gcc -g -O0 -shared -o libbzip2.so <plugin source files> -L${HDF5LIBDIR} -lhdf5_hl -lhdf5 -L${ZLIBDIR} -lz
````

Appendix A. Byte Swap Code {#AppendixA}
==========
Since in some cases it is necessary for a filter to
byte swap from little-endian to big-endian, this appendix
provides sample code for doing this. It also provides
a code snippet for testing the endianness of a machine.

Byte swap an 8-byte chunk of memory
-------
````
/* Reverse the order of the 8 bytes starting at mem, in place. */
static void
byteswap8(unsigned char* mem)
{
    unsigned char c;
    c = mem[0];
    mem[0] = mem[7];
    mem[7] = c;
    c = mem[1];
    mem[1] = mem[6];
    mem[6] = c;
    c = mem[2];
    mem[2] = mem[5];
    mem[5] = c;
    c = mem[3];
    mem[3] = mem[4];
    mem[4] = c;
}
````

Test for Machine Endianness
-------
````
static const unsigned char b[4] = {0x0,0x0,0x0,0x1}; /* value 1 in big-endian */
int endianness = (1 == *(unsigned int*)b); /* 1=>big 0=>little endian */
````
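
The two snippets can be combined so that a filter reads an 8-byte value stored
in little-endian order on either kind of machine. The following is a sketch
with hypothetical function names. It uses __memcpy__ instead of a pointer cast
for the endianness test, which avoids alignment and aliasing concerns but is
otherwise the same check.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a big-endian host and 0 on a little-endian host. */
static int
host_is_big_endian(void)
{
    const unsigned char b[4] = {0x0, 0x0, 0x0, 0x1}; /* value 1 in big-endian */
    uint32_t v;
    memcpy(&v, b, sizeof(v));
    return v == 1;
}

/* Reverse 8 bytes in place; same effect as byteswap8 above. */
static void
swap8(unsigned char* mem)
{
    int i;
    for(i = 0; i < 4; i++) {
        unsigned char c = mem[i];
        mem[i] = mem[7 - i];
        mem[7 - i] = c;
    }
}

/* Interpret 8 bytes stored in little-endian order as a host uint64_t. */
static uint64_t
from_little_endian8(const unsigned char bytes[8])
{
    unsigned char tmp[8];
    uint64_t v;
    memcpy(tmp, bytes, sizeof(tmp));
    if(host_is_big_endian())
        swap8(tmp); /* a little-endian host already has the right order */
    memcpy(&v, tmp, sizeof(v));
    return v;
}
```

For the byte sequence 01 00 00 20 03 00 00 40, the function returns
0x4000000320000001 on both big- and little-endian hosts.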

References {#References}
==========

1. https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf
2. https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-CompressionTroubleshooting.pdf
3. https://portal.hdfgroup.org/display/support/Contributions#Contributions-filters
4. https://support.hdfgroup.org/services/contributions.html#filters

Generated on Wed Aug 1 2018 05:36:48 for NetCDF. NetCDF is a Unidata library.