Numcodecs¶
Numcodecs is a Python package providing buffer compression and transformation codecs for use in data storage and communication applications. These include:
- Compression codecs, e.g., Zlib, BZ2, LZMA and Blosc.
- Pre-compression filters, e.g., Delta, Quantize, FixedScaleOffset, PackBits, Categorize.
- Integrity checks, e.g., CRC32, Adler32.
All codecs implement the same API, allowing codecs to be organized into pipelines in a variety of ways.
If you have a question, find a bug, would like to make a suggestion or contribute code, please raise an issue on GitHub.
Installation¶
Numcodecs depends on NumPy. It is generally best to install NumPy first using whatever method is most appropriate for your operating system and Python distribution.
Install from PyPI:
$ pip install numcodecs
Alternatively, install via conda:
$ conda install -c conda-forge numcodecs
Numcodecs includes a C extension providing integration with the Blosc library. Installing via conda will install a pre-compiled binary distribution. However, if you have a newer CPU that supports the AVX2 instruction set (e.g., Intel Haswell, Broadwell or Skylake) then installing via pip is preferable, because this will compile the Blosc library from source with optimisations for AVX2.
Note that if you compile the C extensions on a machine with AVX2 support you probably then cannot use the same binaries on a machine without AVX2. To disable compilation with AVX2 support regardless of the machine architecture:
$ export DISABLE_NUMCODECS_AVX2=
$ pip install numcodecs
To work with Numcodecs source code in development, install from GitHub:
$ git clone --recursive https://github.com/alimanfoo/numcodecs.git
$ cd numcodecs
$ python setup.py install
To verify that Numcodecs has been fully installed (including the Blosc extension) run the test suite:
$ pip install nose
$ python -m nose -v numcodecs
Contents¶
Codec API¶
This module defines the Codec base class, a common interface for all codec classes.
Codec classes must implement Codec.encode() and Codec.decode() methods. Inputs to and outputs from these methods may be any Python object exporting a contiguous buffer via the new-style Python buffer protocol, or array.array under Python 2.
Codec classes must implement a Codec.get_config()
method,
which must return a dictionary holding all configuration parameters
required to enable encoding and decoding of data. The expectation is that
these configuration parameters will be stored or communicated separately
from encoded data, and thus the codecs do not need to store all encoding
parameters within the encoded data. For broad compatibility,
the configuration object must contain only JSON-serializable values. The
configuration object must also contain an ‘id’ field storing the codec
identifier (see below).
Codec classes must implement a Codec.from_config() class method, which will return an instance of the class initialized from a configuration object.
Finally, codec classes must set a codec_id class-level attribute. This must be a string. Two different codec classes may set the same value for the codec_id attribute if and only if they are fully compatible, meaning that (1) configuration parameters are the same, and (2) given the same configuration, one class could correctly decode data encoded by the other and vice versa.
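Putting the contract together, a minimal codec might look like the following. This is a standalone sketch that does not subclass numcodecs.abc.Codec, and the codec_id used here is hypothetical:

```python
import zlib

class ZlibLikeCodec:
    """Sketch of the Codec contract using zlib from the standard library."""

    codec_id = 'zlib-sketch'  # hypothetical identifier, not a registered codec

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        # accept anything convertible to bytes via the buffer protocol
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf, out=None):
        dec = zlib.decompress(bytes(buf))
        if out is not None:
            # fill the caller-supplied writeable buffer
            out[:len(dec)] = dec
            return out
        return dec

    def get_config(self):
        # JSON-serializable, and must include the 'id' field
        return {'id': self.codec_id, 'level': self.level}

    @classmethod
    def from_config(cls, config):
        return cls(level=config['level'])

codec = ZlibLikeCodec(level=5)
data = b'hello world' * 100
assert codec.decode(codec.encode(data)) == data
# round-trip through the configuration object
clone = ZlibLikeCodec.from_config(codec.get_config())
assert clone.level == 5
```

Because get_config() carries every parameter needed to reconstruct the codec, the configuration can be stored alongside (but separately from) the encoded data.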
-
class
numcodecs.abc.
Codec
¶ Codec abstract base class.
-
codec_id
= None¶
-
encode
(buf)¶ Encode data in buf.
Parameters: buf : buffer-like
Data to be encoded. May be any object supporting the new-style buffer protocol or array.array under Python 2.
Returns: enc : buffer-like
Encoded data. May be any object supporting the new-style buffer protocol or array.array under Python 2.
-
decode
(buf, out=None)¶ Decode data in buf.
Parameters: buf : buffer-like
Encoded data. May be any object supporting the new-style buffer protocol or array.array under Python 2.
out : buffer-like, optional
Writeable buffer to store decoded data.
Returns: dec : buffer-like
Decoded data. May be any object supporting the new-style buffer protocol or array.array under Python 2.
-
get_config
()¶ Return a dictionary holding configuration parameters for this codec. Must include an ‘id’ field with the codec identifier. All values must be compatible with JSON encoding.
-
classmethod
from_config
(config)¶ Instantiate codec from a configuration object.
-
Codec registry¶
The registry module provides simple convenience functions that enable applications to dynamically register and look up codec classes.
-
numcodecs.registry.
get_codec
(config)¶ Obtain a codec for the given configuration.
Parameters: config : dict-like
Configuration object.
Returns: codec : Codec
Examples
>>> import numcodecs as codecs
>>> codec = codecs.get_codec(dict(id='zlib', level=1))
>>> codec
Zlib(level=1)
-
numcodecs.registry.
register_codec
(cls)¶ Register a codec class.
Parameters: cls : Codec class
Notes
This function maintains a mapping from codec identifiers to codec classes. When a codec class is registered, it will replace any class previously registered under the same codec identifier, if present.
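The mechanics described above can be sketched as a simple mapping. This is a standalone illustration, not the actual numcodecs.registry implementation, and the Identity codec used here is hypothetical:

```python
# sketch of a codec registry: a mapping from codec identifiers to classes
registry = {}

def register_codec(cls):
    # later registrations replace earlier ones under the same identifier
    registry[cls.codec_id] = cls

def get_codec(config):
    config = dict(config)        # copy so we can pop without side effects
    codec_id = config.pop('id')  # the 'id' field selects the class
    return registry[codec_id].from_config(config)

class Identity:
    """Hypothetical codec that passes data through unchanged."""
    codec_id = 'identity'

    def encode(self, buf):
        return buf

    def decode(self, buf, out=None):
        return buf

    def get_config(self):
        return {'id': self.codec_id}

    @classmethod
    def from_config(cls, config):
        return cls()

register_codec(Identity)
codec = get_codec({'id': 'identity'})
assert isinstance(codec, Identity)
```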
Blosc¶
-
class
numcodecs.blosc.
Blosc
¶ Codec providing compression using the Blosc meta-compressor.
Parameters: cname : string, optional
A string naming one of the compression algorithms available within blosc, e.g., ‘zstd’, ‘blosclz’, ‘lz4’, ‘lz4hc’, ‘zlib’ or ‘snappy’.
clevel : integer, optional
An integer between 0 and 9 specifying the compression level.
shuffle : integer, optional
Either NOSHUFFLE (0), SHUFFLE (1) or BITSHUFFLE (2).
blocksize : int
The requested size of the compressed blocks. If 0 (default), an automatic blocksize will be used.
See also
-
codec_id
= 'blosc'¶
-
NOSHUFFLE
= 0¶
-
SHUFFLE
= 1¶
-
BITSHUFFLE
= 2¶
-
-
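The shuffle parameter above selects a byte-rearrangement filter that Blosc applies before compressing. As a rough illustration of what byte shuffling does (a pure-Python sketch, not Blosc's actual implementation):

```python
def byte_shuffle(buf, itemsize):
    # regroup bytes so the k-th byte of every item is stored together;
    # similar bytes cluster, which helps the subsequent compressor
    return bytes(buf[i] for k in range(itemsize)
                 for i in range(k, len(buf), itemsize))

def byte_unshuffle(buf, itemsize):
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for k in range(itemsize):
        for j in range(n):
            out[j * itemsize + k] = buf[k * n + j]
    return bytes(out)

# four identical 16-bit little-endian values
data = (1000).to_bytes(2, 'little') * 4
shuffled = byte_shuffle(data, 2)
assert byte_unshuffle(shuffled, 2) == data
```

After shuffling, all low-order bytes sit next to each other followed by all high-order bytes, producing long runs that compress well. BITSHUFFLE applies the same idea at the bit level.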
numcodecs.blosc.
init
()¶ Initialize the Blosc library environment.
-
numcodecs.blosc.
destroy
()¶ Destroy the Blosc library environment.
-
numcodecs.blosc.
compname_to_compcode
(cname)¶ Return the compressor code associated with the compressor name. If the compressor name is not recognized, or there is no support for it in this build, -1 is returned instead.
-
numcodecs.blosc.
list_compressors
()¶ Get a list of compressors supported in the current build.
-
numcodecs.blosc.
get_nthreads
()¶ Get the number of threads that Blosc uses internally for compression and decompression.
-
numcodecs.blosc.
set_nthreads
(int nthreads)¶ Set the number of threads that Blosc uses internally for compression and decompression.
-
numcodecs.blosc.
cbuffer_sizes
(source)¶ Return information about a compressed buffer, namely the number of uncompressed bytes (nbytes) and compressed bytes (cbytes). It also returns the blocksize (which is used internally for compressing by blocks).
Returns: nbytes : int
cbytes : int
blocksize : int
-
numcodecs.blosc.
compress
(source, char *cname, int clevel, int shuffle, int blocksize=0)¶ Compress data.
Parameters: source : bytes-like
Data to be compressed. Can be any object supporting the buffer protocol.
cname : bytes
Name of compression library to use.
clevel : int
Compression level.
shuffle : int
Shuffle filter.
blocksize : int
The requested size of the compressed blocks. If 0, an automatic blocksize will be used.
Returns: dest : bytes
Compressed data.
-
numcodecs.blosc.
decompress
(source, dest=None)¶ Decompress data.
Parameters: source : bytes-like
Compressed data, including blosc header. Can be any object supporting the buffer protocol.
dest : array-like, optional
Object to decompress into.
Returns: dest : bytes
Object containing decompressed data.
LZ4¶
-
class
numcodecs.lz4.
LZ4
¶ Codec providing compression using LZ4.
Parameters: acceleration : int
Acceleration level. The larger the acceleration value, the faster the algorithm, but the lower the compression ratio.
See also
-
codec_id
= 'lz4'¶
-
-
numcodecs.lz4.
compress
(source, int acceleration=DEFAULT_ACCELERATION)¶ Compress data.
Parameters: source : bytes-like
Data to be compressed. Can be any object supporting the buffer protocol.
acceleration : int
Acceleration level. The larger the acceleration value, the faster the algorithm, but the lower the compression ratio.
Returns: dest : bytes
Compressed data.
Notes
The compressed output includes a 4-byte header storing the original size of the decompressed data as a little-endian 32-bit integer.
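The framing described in the note above can be illustrated with the standard library. In this sketch zlib stands in for LZ4 (which is not in the stdlib); only the header layout matches what the note describes:

```python
import struct
import zlib

def compress_with_header(data, level=1):
    # prepend the uncompressed size as a little-endian 32-bit integer,
    # mirroring the 4-byte header described above (zlib stands in for LZ4)
    return struct.pack('<I', len(data)) + zlib.compress(data, level)

def decompress_with_header(blob):
    (size,) = struct.unpack('<I', blob[:4])
    out = zlib.decompress(blob[4:])
    assert len(out) == size  # header lets the decoder pre-allocate exactly
    return out

data = b'abc' * 50
assert decompress_with_header(compress_with_header(data)) == data
```

Storing the original size up front lets the decompressor allocate the output buffer in one step instead of growing it incrementally.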
-
numcodecs.lz4.
decompress
(source, dest=None)¶ Decompress data.
Parameters: source : bytes-like
Compressed data. Can be any object supporting the buffer protocol.
dest : array-like, optional
Object to decompress into.
Returns: dest : bytes
Object containing decompressed data.
Zstd¶
-
class
numcodecs.zstd.
Zstd
¶ Codec providing compression using Zstandard.
Parameters: level : int
Compression level (1-22).
See also
-
codec_id
= 'zstd'¶
-
-
numcodecs.zstd.
compress
(source, int level=DEFAULT_CLEVEL)¶ Compress data.
Parameters: source : bytes-like
Data to be compressed. Can be any object supporting the buffer protocol.
level : int
Compression level (1-22).
Returns: dest : bytes
Compressed data.
-
numcodecs.zstd.
decompress
(source, dest=None)¶ Decompress data.
Parameters: source : bytes-like
Compressed data. Can be any object supporting the buffer protocol.
dest : array-like, optional
Object to decompress into.
Returns: dest : bytes
Object containing decompressed data.
Zlib¶
BZ2¶
LZMA¶
-
class
numcodecs.lzma.
LZMA
(format=1, check=-1, preset=None, filters=None)¶ Codec providing compression using lzma via the Python standard library (only available under Python 3).
Parameters: format : integer, optional
One of the lzma format codes, e.g., lzma.FORMAT_XZ.
check : integer, optional
One of the lzma check codes, e.g., lzma.CHECK_NONE.
preset : integer, optional
An integer between 0 and 9 inclusive, specifying the compression level.
filters : list, optional
A list of dictionaries specifying compression filters. If filters are provided, ‘preset’ must be None.
-
codec_id
= 'lzma'¶
-
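Since this codec wraps the standard library's lzma module, the underlying operation with the parameters above can be sketched directly:

```python
import lzma

# a direct stdlib equivalent of what the LZMA codec does with
# format=lzma.FORMAT_XZ and a given preset (compression level)
data = b'numcodecs' * 100
enc = lzma.compress(data, format=lzma.FORMAT_XZ, preset=1)
dec = lzma.decompress(enc)
assert dec == data
```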
Delta¶
-
class
numcodecs.delta.
Delta
(dtype, astype=None)¶ Codec to encode data as the difference between adjacent values.
Parameters: dtype : dtype
Data type to use for decoded data.
astype : dtype, optional
Data type to use for encoded data.
Notes
If astype is an integer data type, please ensure that it is sufficiently large to store encoded values. No checks are made and data may become corrupted due to integer overflow if astype is too small. Note also that the encoded data for each chunk includes the absolute value of the first element in the chunk, and so the encoded data type in general needs to be large enough to store absolute values from the array.
Examples
>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.arange(100, 120, 2, dtype='i8')
>>> f = codecs.Delta(dtype='i8', astype='i1')
>>> y = f.encode(x)
>>> y
array([100,   2,   2,   2,   2,   2,   2,   2,   2,   2], dtype=int8)
>>> z = f.decode(y)
>>> z
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118])
-
codec_id
= 'delta'¶
-
FixedScaleOffset¶
-
class
numcodecs.fixedscaleoffset.
FixedScaleOffset
(offset, scale, dtype, astype=None)¶ Simplified version of the scale-offset filter available in HDF5. Applies the transformation (x - offset) * scale to all chunks. Results are rounded to the nearest integer but are not packed according to the minimum number of bits.
Parameters: offset : float
Value to subtract from data.
scale : int
Value to multiply data by.
dtype : dtype
Data type to use for decoded data.
astype : dtype, optional
Data type to use for encoded data.
Notes
If astype is an integer data type, please ensure that it is sufficiently large to store encoded values. No checks are made and data may become corrupted due to integer overflow if astype is too small.
Examples
>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.linspace(1000, 1001, 10, dtype='f8')
>>> x
array([ 1000.        ,  1000.11111111,  1000.22222222,  1000.33333333,
        1000.44444444,  1000.55555556,  1000.66666667,  1000.77777778,
        1000.88888889,  1001.        ])
>>> f1 = codecs.FixedScaleOffset(offset=1000, scale=10, dtype='f8', astype='u1')
>>> y1 = f1.encode(x)
>>> y1
array([ 0,  1,  2,  3,  4,  6,  7,  8,  9, 10], dtype=uint8)
>>> z1 = f1.decode(y1)
>>> z1
array([ 1000. ,  1000.1,  1000.2,  1000.3,  1000.4,  1000.6,  1000.7,
        1000.8,  1000.9,  1001. ])
>>> f2 = codecs.FixedScaleOffset(offset=1000, scale=10**2, dtype='f8', astype='u1')
>>> y2 = f2.encode(x)
>>> y2
array([  0,  11,  22,  33,  44,  56,  67,  78,  89, 100], dtype=uint8)
>>> z2 = f2.decode(y2)
>>> z2
array([ 1000.  ,  1000.11,  1000.22,  1000.33,  1000.44,  1000.56,
        1000.67,  1000.78,  1000.89,  1001.  ])
>>> f3 = codecs.FixedScaleOffset(offset=1000, scale=10**3, dtype='f8', astype='u2')
>>> y3 = f3.encode(x)
>>> y3
array([   0,  111,  222,  333,  444,  556,  667,  778,  889, 1000], dtype=uint16)
>>> z3 = f3.decode(y3)
>>> z3
array([ 1000.   ,  1000.111,  1000.222,  1000.333,  1000.444,  1000.556,
        1000.667,  1000.778,  1000.889,  1001.   ])
-
codec_id
= 'fixedscaleoffset'¶
-
PackBits¶
-
class
numcodecs.packbits.
PackBits
¶ Codec to pack elements of a boolean array into bits in a uint8 array.
Notes
The first element of the encoded array stores the number of bits that were padded to complete the final byte.
Examples
>>> import numcodecs as codecs
>>> import numpy as np
>>> codec = codecs.PackBits()
>>> x = np.array([True, False, False, True], dtype=bool)
>>> y = codec.encode(x)
>>> y
array([  4, 144], dtype=uint8)
>>> z = codec.decode(y)
>>> z
array([ True, False, False,  True], dtype=bool)
-
codec_id
= 'packbits'¶
-
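The packing scheme described in the notes above can be sketched in pure Python. This is a simplified illustration, not the codec's actual implementation:

```python
def packbits(bools):
    # pad with False up to a multiple of 8; record the pad count
    # in the first byte, as described in the notes above
    pad = (-len(bools)) % 8
    bits = list(bools) + [False] * pad
    out = [pad]
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | int(b)
        out.append(byte)
    return bytes(out)

def unpackbits(packed):
    pad = packed[0]
    bits = []
    for byte in packed[1:]:
        for shift in range(7, -1, -1):
            bits.append(bool((byte >> shift) & 1))
    return bits[:len(bits) - pad] if pad else bits

# [True, False, False, True] packs to 0b10010000 = 144, with 4 padded bits
assert packbits([True, False, False, True]) == bytes([4, 144])
assert unpackbits(bytes([4, 144])) == [True, False, False, True]
```

This shows why the doctest above yields the bytes 4 and 144: four bits of padding, then the four data bits left-aligned in a single byte.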
Categorize¶
-
class
numcodecs.categorize.
Categorize
(labels, dtype, astype='u1')¶ Filter encoding categorical string data as integers.
Parameters: labels : sequence of strings
Category labels.
dtype : dtype
Data type to use for decoded data.
astype : dtype, optional
Data type to use for encoded data.
Examples
>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.array([b'male', b'female', b'female', b'male', b'unexpected'])
>>> x
array([b'male', b'female', b'female', b'male', b'unexpected'], dtype='|S10')
>>> f = codecs.Categorize(labels=[b'female', b'male'], dtype=x.dtype)
>>> y = f.encode(x)
>>> y
array([2, 1, 1, 2, 0], dtype=uint8)
>>> z = f.decode(y)
>>> z
array([b'male', b'female', b'female', b'male', b''], dtype='|S10')
-
codec_id
= 'categorize'¶
-
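The label-to-integer mapping used by this filter can be sketched in pure Python. This is a simplified illustration, not the codec's actual implementation:

```python
def categorize_encode(values, labels):
    # indices start at 1; 0 marks values not found in the label list
    index = {label: i + 1 for i, label in enumerate(labels)}
    return [index.get(v, 0) for v in values]

def categorize_decode(codes, labels, fill=b''):
    # code 0 decodes to the fill value, matching the b'' seen above
    return [labels[c - 1] if c else fill for c in codes]

x = [b'male', b'female', b'female', b'male', b'unexpected']
labels = [b'female', b'male']
y = categorize_encode(x, labels)
assert y == [2, 1, 1, 2, 0]
assert categorize_decode(y, labels) == [b'male', b'female', b'female',
                                        b'male', b'']
```

Note that values outside the label list are lossy: as in the doctest above, b'unexpected' decodes to an empty value.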
32-bit checksums¶
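The CRC32 and Adler32 codecs are based on checksums that the standard library also exposes; a quick illustration of the underlying functions (this sketch does not use the codecs themselves):

```python
import zlib

data = b'numcodecs'
# mask to an unsigned 32-bit value for a portable representation
crc = zlib.crc32(data) & 0xffffffff
adler = zlib.adler32(data) & 0xffffffff
# a flipped bit changes the checksum, which is what makes
# integrity checking possible
assert zlib.crc32(b'numcodecs!') & 0xffffffff != crc
```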
AsType¶
-
class
numcodecs.astype.
AsType
(encode_dtype, decode_dtype)¶ Filter to convert data between different types.
Parameters: encode_dtype : dtype
Data type to use for encoded data.
decode_dtype : dtype, optional
Data type to use for decoded data.
Notes
If encode_dtype is of lower precision than decode_dtype, please be aware that data loss can occur when writing data to disk through this filter. No checks are made to ensure the cast will work in that direction, and silent data corruption may occur.
Examples
>>> import numcodecs
>>> import numpy as np
>>> x = np.arange(100, 120, 2, dtype=np.int8)
>>> x
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118], dtype=int8)
>>> f = numcodecs.AsType(encode_dtype=x.dtype, decode_dtype=np.int64)
>>> y = f.decode(x)
>>> y
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118])
>>> z = f.encode(y)
>>> z
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118], dtype=int8)
-
codec_id
= 'astype'¶
-
Pickle¶
-
class
numcodecs.pickles.
Pickle
(protocol=2)¶ Codec to encode data as pickled bytes. Useful for encoding an array of Python string objects.
Parameters: protocol : int, defaults to pickle.HIGHEST_PROTOCOL
The protocol used to pickle data.
See also
Examples
>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.array(['foo', 'bar', 'baz'], dtype='object')
>>> f = codecs.Pickle()
>>> f.decode(f.encode(x))
array(['foo', 'bar', 'baz'], dtype=object)
-
codec_id
= 'pickle'¶
-
MsgPack¶
-
class
numcodecs.msgpacks.
MsgPack
(encoding='utf-8')¶ Codec to encode data as msgpacked bytes. Useful for encoding an array of Python string objects.
See also
Notes
Requires msgpack-python to be installed.
Examples
>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.array(['foo', 'bar', 'baz'], dtype='object')
>>> f = codecs.MsgPack()
>>> f.decode(f.encode(x))
array(['foo', 'bar', 'baz'], dtype=object)
-
codec_id
= 'msgpack'¶
-
Release notes¶
0.1.0¶
New codecs:
- Two new compressor codecs numcodecs.zstd.Zstd and numcodecs.lz4.LZ4 have been added (#3, #22). These provide direct support for compression/decompression using Zstandard and LZ4 respectively.
- A new numcodecs.msgpacks.MsgPack codec has been added which uses msgpack-python to perform encoding/decoding, including support for arrays of Python objects (Jeff Reback; #5, #6, #8, #21).
- A new numcodecs.pickles.Pickle codec has been added which uses the Python pickle protocol to perform encoding/decoding, including support for arrays of Python objects (Jeff Reback; #5, #6, #21).
- A new numcodecs.astype.AsType codec has been added which uses NumPy to perform type conversion (John Kirkham; #7, #12, #14).
Other new features:
- The numcodecs.lzma.LZMA codec is now supported on Python 2.7 if backports.lzma is installed (John Kirkham; #11, #13).
- The bundled c-blosc library has been upgraded to version 1.11.2 (#10, #18).
- An option has been added to the numcodecs.blosc.Blosc codec to allow the block size to be manually configured (#9, #19).
- The representation string for the numcodecs.blosc.Blosc codec has been tweaked to help with understanding the shuffle option (#4, #19).
- Options have been added to manually control how the C extensions are built regardless of the architecture of the system on which the build is run. To disable support for AVX2 set the environment variable “DISABLE_NUMCODECS_AVX2”. To disable support for SSE2 set the environment variable “DISABLE_NUMCODECS_SSE2”. To disable C extensions altogether set the environment variable “DISABLE_NUMCODECS_CEXT” (#24, #26).
Maintenance work:
0.0.1¶
Fixed project description in setup.py.
0.0.0¶
First release. This version is a port of the codecs
module from Zarr 2.1.0. The following changes have been made from
the original Zarr module:
- Codec classes have been re-organized into separate modules, mostly one per codec class, for ease of maintenance.
- Two new codec classes have been added based on 32-bit checksums: numcodecs.checksum32.CRC32 and numcodecs.checksum32.Adler32.
- The Blosc extension has been refactored to remove code duplications related to handling of buffer compatibility.
Acknowledgments¶
Numcodecs bundles the c-blosc library.
Development of this package is supported by the MRC Centre for Genomics and Global Health.