Ragged

Introduction

Ragged is a library for manipulating ragged arrays as though they were NumPy or CuPy arrays, following the Array API specification.

For example, this is a ragged/jagged array:

>>> import ragged
>>> a = ragged.array([[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]])
>>> a
ragged.array([
    [[1.1, 2.2, 3.3], []],
    [[4.4]],
    [],
    [[5.5, 6.6, 7.7, 8.8], [9.9]]
])

The values are all floating-point numbers, so a.dtype is float64,

>>> a.dtype
dtype('float64')

but a.shape has non-integer dimensions to account for the fact that some of its list lengths are non-uniform:

>>> a.shape
(4, None, None)

In general, a ragged.array can have any mixture of regular and irregular dimensions, though shape[0] (the length) is always an integer. This convention follows the Array API's specification for array.shape, which must be a tuple of int or None:

array.shape: Tuple[Optional[int], ...]

(Our use of None to indicate a dimension without a single-valued size differs from the Array API's intention of specifying dimensions of unknown size, but it follows the technical specification. Array API-consuming libraries can try using Ragged to find out if they are ragged-ready.)
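To make the shape convention concrete, here is a minimal pure-Python sketch that derives such a shape from nested lists. The helper `ragged_shape` is hypothetical and is not part of Ragged. It decides regularity by inspecting the values, which is an assumption made for illustration; Ragged itself may report `None` based on the array's type rather than on the lengths it happens to contain.

```python
from typing import Optional, Tuple


def ragged_shape(obj) -> Tuple[Optional[int], ...]:
    """Return an Array API-style shape for nested Python lists.

    A dimension is an int when every list at that depth has the same
    length, and None otherwise. (Hypothetical helper, not Ragged's code.)
    """
    shape = []
    level = [obj]  # every object found at the current depth
    while level and all(isinstance(x, list) for x in level):
        lengths = {len(x) for x in level}
        shape.append(lengths.pop() if len(lengths) == 1 else None)
        level = [item for x in level for item in x]  # descend one level
    return tuple(shape)


a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
print(ragged_shape(a))  # (4, None, None)
```

A library that consumes the Array API would need to tolerate `None` entries like these wherever it reads `array.shape`.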

All of the normal elementwise and reducing functions apply, as well as slices:

>>> ragged.sqrt(a)
ragged.array([
    [[1.05, 1.48, 1.82], []],
    [[2.1]],
    [],
    [[2.35, 2.57, 2.77, 2.97], [3.15]]
])

>>> ragged.sum(a, axis=0)
ragged.array([
    [11, 8.8, 11, 8.8],
    [9.9]
])

>>> ragged.sum(a, axis=-1)
ragged.array([
    [6.6, 0],
    [4.4],
    [],
    [28.6, 9.9]
])

>>> a[-1, 0, 2]
ragged.array(7.7)

>>> a[a * 10 % 2 == 0]
ragged.array([
    [[2.2], []],
    [[4.4]],
    [],
    [[6.6, 8.8], []]
])
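The two reductions above can be read as follows: `axis=-1` sums each innermost list, while `axis=0` lines up the outermost entries by position and sums whatever is present at each position. The sketch below reproduces that behavior on plain nested lists. The functions `sum_last_axis` and `sum_axis0` and their `ndim` argument are hypothetical names for illustration, not Ragged's API or implementation.

```python
from itertools import zip_longest

_MISSING = object()  # marks positions that a shorter list does not have


def sum_last_axis(nested, ndim):
    """Sum the innermost lists of an ndim-deep nested list."""
    if ndim == 1:
        return sum(nested)
    return [sum_last_axis(x, ndim - 1) for x in nested]


def sum_axis0(rows, ndim):
    """Sum an ndim-deep nested list position by position along axis 0."""
    if ndim == 1:
        return sum(rows)
    # Align entries by position; shorter lists contribute nothing.
    columns = zip_longest(*rows, fillvalue=_MISSING)
    return [
        sum_axis0([x for x in col if x is not _MISSING], ndim - 1)
        for col in columns
    ]


a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
print(sum_last_axis(a, 3))
print(sum_axis0(a, 3))
```

Up to floating-point rounding, the printed values agree with the `ragged.sum` outputs shown above. The explicit `ndim` is needed because an empty list on its own does not say how deeply nested it is.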

All of the methods, attributes, and functions in the Array API will be implemented for Ragged, as well as conveniences that are not required by the Array API. See open issues marked "todo" for Array API functions that still need to be written (out of 120 in total).

Ragged has two device values, "cpu" (backed by NumPy) and "cuda" (backed by CuPy). Eventually, all operations will be identical for CPU and GPU.

Implementation

Ragged is implemented using Awkward Array (code, docs), which is an array library for arbitrary tree-like (JSON-like) data. Because of its generality, Awkward Array cannot follow the Array API—in fact, its array objects can't have separate dtype and shape attributes (the array type can't be factorized). Ragged is therefore

  • a specialization of Awkward Array for numeric data in fixed-length and variable-length lists, and
  • a formalization to adhere to the Array API and its fully typed protocols.

See Why does this library exist? under the Discussions tab for more details.

Ragged is a thin wrapper around Awkward Array, restricting it to ragged arrays and transforming its function arguments and return values to fit the specification.

Awkward Array, in turn, is time- and memory-efficient, ready for big datasets. Consider the following:

import gc      # control for garbage collection
import psutil  # measure process memory
import time    # measure time

import math
import ragged

this_process = psutil.Process()

def measure_memory(task):
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    out = task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    print(f"memory: {(stop_memory - start_memory) * 1e-9:.3f} GB")
    return out

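def measure_time(task):
    gc.disable()  # keep garbage-collection pauses out of the timing
    try:
        start_time = time.perf_counter()
        out = task()
        stop_time = time.perf_counter()
    finally:
        gc.enable()  # re-enable collection even if the task raises
    print(f"time: {stop_time - start_time:.3f} sec")
    return out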

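def make_big_python_object():
    # 10 million lists whose lengths cycle through 0, 1, ..., 9
    out = []
    for i in range(10_000_000):
        out.append([j * 1.1 for j in range(i % 10)])
    return out

def make_ragged_array():
    # uses the global pyobj created in the session below
    return ragged.array(pyobj)

def compute_on_python_object():
    # square root of every value, one Python-level call per number
    out = []
    for row in pyobj:
        out.append([math.sqrt(x) for x in row])
    return out

def compute_on_ragged_array():
    # uses the global arr created in the session below
    return ragged.sqrt(arr)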

In this measurement, the ragged.array is about 3 times smaller than the equivalent Python object:

>>> pyobj = measure_memory(make_big_python_object)
memory: 2.687 GB

>>> arr = measure_memory(make_ragged_array)
memory: 0.877 GB

and a sample calculation on it (square root of each value) is about 50 times faster:

>>> result = measure_time(compute_on_python_object)
time: 4.180 sec

>>> result = measure_time(compute_on_ragged_array)
time: 0.082 sec

Awkward Array and Ragged are generally smaller and faster than their Python equivalents for the same reasons that NumPy is smaller and faster than Python lists. See Awkward Array papers and presentations for more.
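As a rough illustration of where the savings come from, variable-length lists can be stored as one flat buffer of values plus a buffer of offsets, instead of one Python object per number and per list. The sketch below uses only the standard library; `to_offsets` and `get_row` are hypothetical helpers that show the general idea of an offsets-plus-content layout, not Awkward Array's actual code.

```python
from array import array


def to_offsets(lists):
    """Flatten a list of float lists into (offsets, content) buffers."""
    offsets = array("q", [0])  # 64-bit integers: where each row ends
    content = array("d")       # 64-bit floats: all values, back to back
    for row in lists:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content


def get_row(offsets, content, i):
    """Recover row i as a Python list by slicing the flat buffer."""
    return content[offsets[i]:offsets[i + 1]].tolist()


offsets, content = to_offsets([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(offsets.tolist())              # [0, 3, 3, 5]
print(get_row(offsets, content, 2))  # [4.4, 5.5]
```

With this layout each value costs 8 bytes and each list costs one 8-byte offset, and an elementwise operation such as a square root can run over `content` in a single pass without touching `offsets`.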

Installation

Ragged is on PyPI:

pip install ragged

and will someday be on conda-forge.

ragged is a pure-Python library that only depends on awkward (which, in turn, only depends on numpy and a compiled extension). In principle (i.e. eventually), ragged can be loaded into Pyodide and JupyterLite.

Acknowledgements

Support for this work was provided by NSF grant OAC-2103945 and the gracious help of Awkward Array contributors.