Numcodecs

Numcodecs is a Python package providing buffer compression and transformation codecs for use in data storage and communication applications. These include:

  • Compression codecs, e.g., Zlib, BZ2, LZMA and Blosc.
  • Pre-compression filters, e.g., Delta, Quantize, FixedScaleOffset, PackBits, Categorize.
  • Integrity checks, e.g., CRC32, Adler32.

All codecs implement the same API, allowing codecs to be organized into pipelines in a variety of ways.
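
As a rough illustration of the pipeline idea, the sketch below chains two stand-in codecs that use only the Python standard library. The names ByteDelta, ZlibStandIn, encode_pipeline and decode_pipeline are hypothetical and are not part of Numcodecs; only the encode/decode shape follows the API described here.

```python
import zlib


class ByteDelta:
    """Hypothetical stand-in filter: store each byte as the difference
    from the previous byte, modulo 256."""

    def encode(self, buf):
        prev = 0
        out = bytearray()
        for b in bytes(buf):
            out.append((b - prev) % 256)
            prev = b
        return bytes(out)

    def decode(self, buf):
        prev = 0
        out = bytearray()
        for d in bytes(buf):
            prev = (prev + d) % 256
            out.append(prev)
        return bytes(out)


class ZlibStandIn:
    """Hypothetical stand-in compressor wrapping the standard zlib module."""

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf):
        return zlib.decompress(bytes(buf))


def encode_pipeline(codecs, buf):
    # Apply each codec in order: filters first, compressor last.
    for codec in codecs:
        buf = codec.encode(buf)
    return buf


def decode_pipeline(codecs, buf):
    # Undo the codecs in reverse order.
    for codec in reversed(codecs):
        buf = codec.decode(buf)
    return buf


pipeline = [ByteDelta(), ZlibStandIn(level=1)]
raw = bytes(range(0, 200, 2)) * 10
encoded = encode_pipeline(pipeline, raw)
decoded = decode_pipeline(pipeline, encoded)
```

Because every stage exposes the same two methods, the pipeline helpers do not need to know which codecs they are driving.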

If you have a question, find a bug, would like to make a suggestion or contribute code, please raise an issue on GitHub.

Installation

Numcodecs depends on NumPy. It is generally best to install NumPy first using whatever method is most appropriate for your operating system and Python distribution.

Install from PyPI:

$ pip install numcodecs

Alternatively, install via conda:

$ conda install -c conda-forge numcodecs

Numcodecs includes a C extension providing integration with the Blosc library. Installing via conda will install a pre-compiled binary distribution. However, if you have a newer CPU that supports the AVX2 instruction set (e.g., Intel Haswell, Broadwell or Skylake) then installing via pip is preferable, because this will compile the Blosc library from source with optimisations for AVX2.

Note that if you compile the C extensions on a machine with AVX2 support, you probably cannot then use the same binaries on a machine without AVX2. To disable compilation with AVX2 support regardless of the machine architecture:

$ export DISABLE_NUMCODECS_AVX2=
$ pip install numcodecs

To work with Numcodecs source code in development, install from GitHub:

$ git clone --recursive https://github.com/alimanfoo/numcodecs.git
$ cd numcodecs
$ python setup.py install

To verify that Numcodecs has been fully installed (including the Blosc extension) run the test suite:

$ pip install nose
$ python -m nose -v numcodecs

Contents

Codec API

This module defines the Codec base class, a common interface for all codec classes.

Codec classes must implement Codec.encode() and Codec.decode() methods. Inputs to and outputs from these methods may be any Python object exporting a contiguous buffer via the new-style Python buffer protocol, or array.array under Python 2.

Codec classes must implement a Codec.get_config() method, which must return a dictionary holding all configuration parameters required to enable encoding and decoding of data. The expectation is that these configuration parameters will be stored or communicated separately from encoded data, and thus the codecs do not need to store all encoding parameters within the encoded data. For broad compatibility, the configuration object must contain only JSON-serializable values. The configuration object must also contain an ‘id’ field storing the codec identifier (see below).

Codec classes must implement a Codec.from_config() class method, which will return an instance of the class initialized from a configuration object.

Finally, codec classes must set a codec_id class-level attribute. This must be a string. Two different codec classes may set the same value for the codec_id attribute if and only if they are fully compatible, meaning that (1) configuration parameters are the same, and (2) given the same configuration, one class could correctly decode data encoded by the other and vice versa.
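
The requirements above can be sketched with a small stand-alone class. This is a minimal illustration, not the Numcodecs implementation: the class ToyZlib and its identifier 'toy_zlib' are made up for the example, and it deliberately does not inherit from numcodecs.abc.Codec so that it runs with the standard library alone.

```python
import json
import zlib


class ToyZlib:
    """Hypothetical codec following the described API (not a Numcodecs class)."""

    codec_id = "toy_zlib"  # made-up identifier for this sketch

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf, out=None):
        dec = zlib.decompress(bytes(buf))
        if out is not None:
            # Copy into the caller's writeable buffer and hand it back.
            memoryview(out)[:len(dec)] = dec
            return out
        return dec

    def get_config(self):
        # Only JSON-serializable values, plus the mandatory 'id' field.
        return {"id": self.codec_id, "level": self.level}

    @classmethod
    def from_config(cls, config):
        # Everything except 'id' is treated as a constructor argument.
        params = {k: v for k, v in config.items() if k != "id"}
        return cls(**params)


codec = ToyZlib(level=3)
stored = json.dumps(codec.get_config())   # the config travels separately from the data
restored = ToyZlib.from_config(json.loads(stored))
payload = restored.encode(b"abc" * 100)
```

Round-tripping the configuration through JSON is the point of the exercise: a reader that only has the stored configuration can rebuild an equivalent codec and decode the payload.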

class numcodecs.abc.Codec

Codec abstract base class.

codec_id = None
encode(buf)

Encode data in buf.

Parameters:

buf : buffer-like

Data to be encoded. May be any object supporting the new-style buffer protocol or array.array under Python 2.

Returns:

enc : buffer-like

Encoded data. May be any object supporting the new-style buffer protocol or array.array under Python 2.

decode(buf, out=None)

Decode data in buf.

Parameters:

buf : buffer-like

Encoded data. May be any object supporting the new-style buffer protocol or array.array under Python 2.

out : buffer-like, optional

Writeable buffer to store decoded data.

Returns:

dec : buffer-like

Decoded data. May be any object supporting the new-style buffer protocol or array.array under Python 2.

get_config()

Return a dictionary holding configuration parameters for this codec. Must include an ‘id’ field with the codec identifier. All values must be compatible with JSON encoding.

classmethod from_config(config)

Instantiate codec from a configuration object.

Codec registry

The registry module provides some simple convenience functions to enable applications to dynamically register and look-up codec classes.

numcodecs.registry.get_codec(config)

Obtain a codec for the given configuration.

Parameters:

config : dict-like

Configuration object.

Returns:

codec : Codec

Examples

>>> import numcodecs as codecs
>>> codec = codecs.get_codec(dict(id='zlib', level=1))
>>> codec
Zlib(level=1)

numcodecs.registry.register_codec(cls)

Register a codec class.

Parameters:

cls : Codec class

Notes

This function maintains a mapping from codec identifiers to codec classes. When a codec class is registered, it will replace any class previously registered under the same codec identifier, if present.

Blosc

class numcodecs.blosc.Blosc

Codec providing compression using the Blosc meta-compressor.

Parameters:

cname : string, optional

A string naming one of the compression algorithms available within blosc, e.g., ‘zstd’, ‘blosclz’, ‘lz4’, ‘lz4hc’, ‘zlib’ or ‘snappy’.

clevel : integer, optional

An integer between 0 and 9 specifying the compression level.

shuffle : integer, optional

Either NOSHUFFLE (0), SHUFFLE (1) or BITSHUFFLE (2).

blocksize : int

The requested size of the compressed blocks. If 0 (default), an automatic blocksize will be used.

codec_id = 'blosc'
NOSHUFFLE = 0
SHUFFLE = 1
BITSHUFFLE = 2

numcodecs.blosc.init()

Initialize the Blosc library environment.

numcodecs.blosc.destroy()

Destroy the Blosc library environment.

numcodecs.blosc.compname_to_compcode(cname)

Return the compressor code associated with the compressor name. If the compressor name is not recognized, or there is not support for it in this build, -1 is returned instead.

numcodecs.blosc.list_compressors()

Get a list of compressors supported in the current build.

numcodecs.blosc.get_nthreads()

Get the number of threads that Blosc uses internally for compression and decompression.

numcodecs.blosc.set_nthreads(int nthreads)

Set the number of threads that Blosc uses internally for compression and decompression.

numcodecs.blosc.cbuffer_sizes(source)

Return information about a compressed buffer, namely the number of uncompressed bytes (nbytes) and compressed bytes (cbytes). It also returns the blocksize (which is used internally for doing the compression by blocks).

Returns:

nbytes : int

cbytes : int

blocksize : int

numcodecs.blosc.compress(source, char *cname, int clevel, int shuffle, int blocksize=0)

Compress data.

Parameters:

source : bytes-like

Data to be compressed. Can be any object supporting the buffer protocol.

cname : bytes

Name of compression library to use.

clevel : int

Compression level.

shuffle : int

Shuffle filter.

blocksize : int

The requested size of the compressed blocks. If 0, an automatic blocksize will be used.

Returns:

dest : bytes

Compressed data.

numcodecs.blosc.decompress(source, dest=None)

Decompress data.

Parameters:

source : bytes-like

Compressed data, including blosc header. Can be any object supporting the buffer protocol.

dest : array-like, optional

Object to decompress into.

Returns:

dest : bytes

Object containing decompressed data.

LZ4

class numcodecs.lz4.LZ4

Codec providing compression using LZ4.

Parameters:

acceleration : int

Acceleration level. The larger the acceleration value, the faster the algorithm, but the lower the compression ratio.

codec_id = 'lz4'

numcodecs.lz4.compress(source, int acceleration=DEFAULT_ACCELERATION)

Compress data.

Parameters:

source : bytes-like

Data to be compressed. Can be any object supporting the buffer protocol.

acceleration : int

Acceleration level. The larger the acceleration value, the faster the algorithm, but the lower the compression ratio.

Returns:

dest : bytes

Compressed data.

Notes

The compressed output includes a 4-byte header storing the original size of the decompressed data as a little-endian 32-bit integer.
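
The header layout described in this note can be sketched with the standard struct module. The helpers add_size_header and split_size_header are hypothetical, the payload is treated as opaque bytes rather than real LZ4 output, and treating the integer as unsigned is an assumption (the note only says little-endian 32-bit).

```python
import struct

# Assumption: the size is stored as an unsigned little-endian 32-bit integer.
HEADER = struct.Struct("<I")


def add_size_header(original_size, compressed_payload):
    # Prefix the (opaque) compressed bytes with the original data size.
    return HEADER.pack(original_size) + bytes(compressed_payload)


def split_size_header(buf):
    # Read the size back and return it together with the remaining payload.
    buf = bytes(buf)
    (original_size,) = HEADER.unpack_from(buf, 0)
    return original_size, buf[HEADER.size:]


framed = add_size_header(1000, b"\x00\x01\x02")
size, payload = split_size_header(framed)
```

A decoder that reads the size first can allocate an output buffer of the right length before decompressing the payload.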

numcodecs.lz4.decompress(source, dest=None)

Decompress data.

Parameters:

source : bytes-like

Compressed data. Can be any object supporting the buffer protocol.

dest : array-like, optional

Object to decompress into.

Returns:

dest : bytes

Object containing decompressed data.

Zstd

class numcodecs.zstd.Zstd

Codec providing compression using Zstandard.

Parameters:

level : int

Compression level (1-22).

codec_id = 'zstd'

numcodecs.zstd.compress(source, int level=DEFAULT_CLEVEL)

Compress data.

Parameters:

source : bytes-like

Data to be compressed. Can be any object supporting the buffer protocol.

level : int

Compression level (1-22).

Returns:

dest : bytes

Compressed data.

numcodecs.zstd.decompress(source, dest=None)

Decompress data.

Parameters:

source : bytes-like

Compressed data. Can be any object supporting the buffer protocol.

dest : array-like, optional

Object to decompress into.

Returns:

dest : bytes

Object containing decompressed data.

Zlib

class numcodecs.zlib.Zlib(level=1)

Codec providing compression using zlib via the Python standard library.

Parameters:

level : int

Compression level.

codec_id = 'zlib'

BZ2

class numcodecs.bz2.BZ2(level=1)

Codec providing compression using bzip2 via the Python standard library.

Parameters:

level : int

Compression level.

codec_id = 'bz2'

LZMA

class numcodecs.lzma.LZMA(format=1, check=-1, preset=None, filters=None)

Codec providing compression using lzma via the Python standard library (only available under Python 3).

Parameters:

format : integer, optional

One of the lzma format codes, e.g., lzma.FORMAT_XZ.

check : integer, optional

One of the lzma check codes, e.g., lzma.CHECK_NONE.

preset : integer, optional

An integer between 0 and 9 inclusive, specifying the compression level.

filters : list, optional

A list of dictionaries specifying compression filters. If filters are provided, ‘preset’ must be None.

codec_id = 'lzma'
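
Since this codec is described as a wrapper around the standard library, the parameters can be explored with the lzma module directly. The sketch below calls lzma.compress and lzma.decompress rather than the Numcodecs class, and the particular filter chain is only an example.

```python
import lzma

data = b"numcodecs " * 1000

# Preset-based compression: 'preset' selects the level, 'filters' stays None.
enc_preset = lzma.compress(
    data, format=lzma.FORMAT_XZ, check=lzma.CHECK_NONE, preset=1
)

# Filter-based compression: a custom chain is supplied, so 'preset' is left as None.
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 1},
    {"id": lzma.FILTER_LZMA2, "preset": 1},
]
enc_filters = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

# The XZ container records the filter chain, so no extra arguments are needed here.
dec_preset = lzma.decompress(enc_preset)
dec_filters = lzma.decompress(enc_filters)
```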

Delta

class numcodecs.delta.Delta(dtype, astype=None)

Codec to encode data as the difference between adjacent values.

Parameters:

dtype : dtype

Data type to use for decoded data.

astype : dtype, optional

Data type to use for encoded data.

Notes

If astype is an integer data type, please ensure that it is sufficiently large to store encoded values. No checks are made and data may become corrupted due to integer overflow if astype is too small. Note also that the encoded data for each chunk includes the absolute value of the first element in the chunk, and so the encoded data type in general needs to be large enough to store absolute values from the array.

Examples

>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.arange(100, 120, 2, dtype='i8')
>>> f = codecs.Delta(dtype='i8', astype='i1')
>>> y = f.encode(x)
>>> y
array([100,   2,   2,   2,   2,   2,   2,   2,   2,   2], dtype=int8)
>>> z = f.decode(y)
>>> z
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118])
codec_id = 'delta'
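
The encoding itself is simple enough to sketch in pure Python. The functions below are hypothetical helpers written for illustration, not the Numcodecs implementation; fits_int8 mirrors the warning in the notes by checking whether every encoded value, including the first absolute value, would fit in astype='i1'.

```python
from itertools import accumulate


def delta_encode(values):
    values = list(values)
    if not values:
        return []
    # The first element is kept as-is; the rest are differences between neighbours.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    # A running sum restores the original values.
    return list(accumulate(encoded))


def fits_int8(encoded):
    # Hypothetical check: would every encoded value fit in a signed 8-bit integer?
    return all(-128 <= v <= 127 for v in encoded)


x = list(range(100, 120, 2))
y = delta_encode(x)
z = delta_decode(y)
```

With these inputs the encoded values match the example above. An array starting at 1000 would fail the fits_int8 check even though its differences are small, because the first encoded value is the absolute value 1000.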

FixedScaleOffset

class numcodecs.fixedscaleoffset.FixedScaleOffset(offset, scale, dtype, astype=None)

Simplified version of the scale-offset filter available in HDF5. Applies the transformation (x - offset) * scale to all chunks. Results are rounded to the nearest integer but are not packed according to the minimum number of bits.

Parameters:

offset : float

Value to subtract from data.

scale : int

Value by which to multiply the data.

dtype : dtype

Data type to use for decoded data.

astype : dtype, optional

Data type to use for encoded data.

Notes

If astype is an integer data type, please ensure that it is sufficiently large to store encoded values. No checks are made and data may become corrupted due to integer overflow if astype is too small.

Examples

>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.linspace(1000, 1001, 10, dtype='f8')
>>> x
array([ 1000.        ,  1000.11111111,  1000.22222222,  1000.33333333,
        1000.44444444,  1000.55555556,  1000.66666667,  1000.77777778,
        1000.88888889,  1001.        ])
>>> f1 = codecs.FixedScaleOffset(offset=1000, scale=10, dtype='f8', astype='u1')
>>> y1 = f1.encode(x)
>>> y1
array([ 0,  1,  2,  3,  4,  6,  7,  8,  9, 10], dtype=uint8)
>>> z1 = f1.decode(y1)
>>> z1
array([ 1000. ,  1000.1,  1000.2,  1000.3,  1000.4,  1000.6,  1000.7,
        1000.8,  1000.9,  1001. ])
>>> f2 = codecs.FixedScaleOffset(offset=1000, scale=10**2, dtype='f8', astype='u1')
>>> y2 = f2.encode(x)
>>> y2
array([  0,  11,  22,  33,  44,  56,  67,  78,  89, 100], dtype=uint8)
>>> z2 = f2.decode(y2)
>>> z2
array([ 1000.  ,  1000.11,  1000.22,  1000.33,  1000.44,  1000.56,
        1000.67,  1000.78,  1000.89,  1001.  ])
>>> f3 = codecs.FixedScaleOffset(offset=1000, scale=10**3, dtype='f8', astype='u2')
>>> y3 = f3.encode(x)
>>> y3
array([   0,  111,  222,  333,  444,  556,  667,  778,  889, 1000], dtype=uint16)
>>> z3 = f3.decode(y3)
>>> z3
array([ 1000.   ,  1000.111,  1000.222,  1000.333,  1000.444,  1000.556,
        1000.667,  1000.778,  1000.889,  1001.   ])
codec_id = 'fixedscaleoffset'
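
The transformation can be sketched in pure Python as follows. The functions fso_encode and fso_decode are hypothetical helpers, not the Numcodecs implementation, and they work on plain lists rather than NumPy arrays; the input approximates the np.linspace(1000, 1001, 10) values used in the example above.

```python
def fso_encode(values, offset, scale):
    # (x - offset) * scale, rounded to the nearest integer.
    return [int(round((x - offset) * scale)) for x in values]


def fso_decode(encoded, offset, scale):
    # Inverse transformation: divide by the scale, then add the offset back.
    return [v / scale + offset for v in encoded]


x = [1000 + i / 9 for i in range(10)]
y1 = fso_encode(x, offset=1000, scale=10)
y2 = fso_encode(x, offset=1000, scale=100)
z1 = fso_decode(y1, offset=1000, scale=10)
```

The decoded values differ from the originals by at most half a unit of 1/scale, which is the precision retained by the chosen scale.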

PackBits

class numcodecs.packbits.PackBits

Codec to pack elements of a boolean array into bits in a uint8 array.

Notes

The first element of the encoded array stores the number of bits that were padded to complete the final byte.

Examples

>>> import numcodecs as codecs
>>> import numpy as np
>>> codec = codecs.PackBits()
>>> x = np.array([True, False, False, True], dtype=bool)
>>> y = codec.encode(x)
>>> y
array([  4, 144], dtype=uint8)
>>> z = codec.decode(y)
>>> z
array([ True, False, False,  True], dtype=bool)
codec_id = 'packbits'
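
The layout can be sketched in pure Python. The helpers pack_bits and unpack_bits are hypothetical, not the Numcodecs implementation; packing the most significant bit first is inferred from the example above, where four flags encode to the byte 144 (binary 10010000).

```python
def pack_bits(flags):
    flags = [bool(f) for f in flags]
    n_pad = (-len(flags)) % 8  # bits needed to complete the final byte
    bits = flags + [False] * n_pad
    packed = [n_pad]  # the first element records the amount of padding
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | int(bit)  # most significant bit first
        packed.append(byte)
    return packed


def unpack_bits(packed):
    n_pad, body = packed[0], packed[1:]
    bits = []
    for byte in body:
        bits.extend(bool((byte >> shift) & 1) for shift in range(7, -1, -1))
    # Drop the padding bits that were appended during packing.
    return bits[:len(bits) - n_pad]


y = pack_bits([True, False, False, True])
z = unpack_bits(y)
```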

Categorize

class numcodecs.categorize.Categorize(labels, dtype, astype='u1')

Filter encoding categorical string data as integers.

Parameters:

labels : sequence of strings

Category labels.

dtype : dtype

Data type to use for decoded data.

astype : dtype, optional

Data type to use for encoded data.

Examples

>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.array([b'male', b'female', b'female', b'male', b'unexpected'])
>>> x
array([b'male', b'female', b'female', b'male', b'unexpected'],
      dtype='|S10')
>>> f = codecs.Categorize(labels=[b'female', b'male'], dtype=x.dtype)
>>> y = f.encode(x)
>>> y
array([2, 1, 1, 2, 0], dtype=uint8)
>>> z = f.decode(y)
>>> z
array([b'male', b'female', b'female', b'male', b''],
      dtype='|S10')
codec_id = 'categorize'
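
The mapping shown in the example can be sketched in pure Python. The helpers categorize_encode and categorize_decode are hypothetical and operate on plain lists; the convention that labels are numbered from 1 and that 0 stands for an unrecognized value is read off the example output above.

```python
def categorize_encode(values, labels):
    # Labels are numbered from 1; 0 is reserved for values not in the list.
    codes = {label: i + 1 for i, label in enumerate(labels)}
    return [codes.get(v, 0) for v in values]


def categorize_decode(encoded, labels, missing=b""):
    # Code 0 decodes to an empty value because the original is not recoverable.
    return [labels[c - 1] if c > 0 else missing for c in encoded]


labels = [b"female", b"male"]
x = [b"male", b"female", b"female", b"male", b"unexpected"]
y = categorize_encode(x, labels)
z = categorize_decode(y, labels)
```

Values outside the label list are not preserved by the round trip, which is why the last element decodes to an empty byte string.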

32-bit checksums

CRC32

class numcodecs.checksum32.CRC32
codec_id = 'crc32'

Adler32

class numcodecs.checksum32.Adler32
codec_id = 'adler32'
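
As a rough sketch of how a checksum codec can provide an integrity check, the example below prepends a 32-bit checksum on encode and verifies it on decode, using zlib.crc32 and zlib.adler32 from the standard library. The helper names, the 4-byte little-endian prefix layout and the RuntimeError are assumptions made for illustration; the layout actually used by CRC32 and Adler32 is not described here.

```python
import struct
import zlib


def checksum_encode(buf, checksum=zlib.crc32):
    data = bytes(buf)
    # Mask to 32 bits so the value is always an unsigned integer.
    value = checksum(data) & 0xFFFFFFFF
    # Assumed layout: checksum first, then the unmodified data.
    return struct.pack("<I", value) + data


def checksum_decode(buf, checksum=zlib.crc32):
    buf = bytes(buf)
    (expected,) = struct.unpack_from("<I", buf, 0)
    data = buf[4:]
    if checksum(data) & 0xFFFFFFFF != expected:
        raise RuntimeError("checksum mismatch")
    return data


enc = checksum_encode(b"hello")
dec = checksum_decode(enc)
enc_adler = checksum_encode(b"hello", checksum=zlib.adler32)
dec_adler = checksum_decode(enc_adler, checksum=zlib.adler32)
```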

AsType

class numcodecs.astype.AsType(encode_dtype, decode_dtype)

Filter to convert data between different types.

Parameters:

encode_dtype : dtype

Data type to use for encoded data.

decode_dtype : dtype, optional

Data type to use for decoded data.

Notes

If encode_dtype is of lower precision than decode_dtype, please be aware that data loss can occur when writing data to disk using this filter. No checks are made to ensure that the cast is safe in that direction, so data may be silently corrupted.

Examples

>>> import numcodecs
>>> import numpy as np
>>> x = np.arange(100, 120, 2, dtype=np.int8)
>>> x
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118], dtype=int8)
>>> f = numcodecs.AsType(encode_dtype=x.dtype, decode_dtype=np.int64)
>>> y = f.decode(x)
>>> y
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118])
>>> z = f.encode(y)
>>> z
array([100, 102, 104, 106, 108, 110, 112, 114, 116, 118], dtype=int8)
codec_id = 'astype'

Pickle

class numcodecs.pickles.Pickle(protocol=2)

Codec to encode data as pickled bytes. Useful for encoding an array of Python string objects.

Parameters:

protocol : int, defaults to pickle.HIGHEST_PROTOCOL

The protocol used to pickle data.

Examples

>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.array(['foo', 'bar', 'baz'], dtype='object')
>>> f = codecs.Pickle()
>>> f.decode(f.encode(x))
array(['foo', 'bar', 'baz'], dtype=object)
codec_id = 'pickle'
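
The underlying round trip can be sketched with the standard pickle module alone. The helpers pickle_encode and pickle_decode are hypothetical and work on a plain list rather than a NumPy object array; the default of protocol 2 simply follows the class signature shown above.

```python
import pickle


def pickle_encode(obj, protocol=2):
    # Serialize any picklable object to bytes using the chosen protocol.
    return pickle.dumps(obj, protocol=protocol)


def pickle_decode(buf):
    # Accept any bytes-like input and rebuild the original object.
    return pickle.loads(bytes(buf))


x = ["foo", "bar", "baz"]
enc = pickle_encode(x)
dec = pickle_decode(enc)
```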

MsgPack

class numcodecs.msgpacks.MsgPack(encoding='utf-8')

Codec to encode data as msgpacked bytes. Useful for encoding an array of Python string objects.

Notes

Requires msgpack-python to be installed.

Examples

>>> import numcodecs as codecs
>>> import numpy as np
>>> x = np.array(['foo', 'bar', 'baz'], dtype='object')
>>> f = codecs.MsgPack()
>>> f.decode(f.encode(x))
array(['foo', 'bar', 'baz'], dtype=object)
codec_id = 'msgpack'

Release notes

0.1.0

New codecs:

Other new features:

  • The numcodecs.lzma.LZMA codec is now supported on Python 2.7 if backports.lzma is installed (John Kirkham; #11, #13).
  • The bundled c-blosc library has been upgraded to version 1.11.2 (#10, #18).
  • An option has been added to the numcodecs.blosc.Blosc codec to allow the block size to be manually configured (#9, #19).
  • The representation string for the numcodecs.blosc.Blosc codec has been tweaked to help with understanding the shuffle option (#4, #19).
  • Options have been added to manually control how the C extensions are built regardless of the architecture of the system on which the build is run. To disable support for AVX2 set the environment variable “DISABLE_NUMCODECS_AVX2”. To disable support for SSE2 set the environment variable “DISABLE_NUMCODECS_SSE2”. To disable C extensions altogether set the environment variable “DISABLE_NUMCODECS_CEXT” (#24, #26).

Maintenance work:

  • CI tests now run under Python 3.6 as well as 2.7, 3.4, 3.5 (#16, #17).
  • Test coverage is now monitored via coveralls (#15, #20).

0.0.1

Fixed project description in setup.py.

0.0.0

First release. This version is a port of the codecs module from Zarr 2.1.0. The following changes have been made from the original Zarr module:

  • Codec classes have been re-organized into separate modules, mostly one per codec class, for ease of maintenance.
  • Two new codec classes have been added based on 32-bit checksums: numcodecs.checksum32.CRC32 and numcodecs.checksum32.Adler32.
  • The Blosc extension has been refactored to remove code duplications related to handling of buffer compatibility.

Acknowledgments

Numcodecs bundles the c-blosc library.

Development of this package is supported by the MRC Centre for Genomics and Global Health.
