Create EmptyArray with unknown type. (#21)
* Start work.

* EmptyArray compiles.

* Fixed linker errors.

* Make Windows compiler happy.

* Add tests file.

* Added RegularType to give all arrays a high-level type.

* Added awkward1.typeof to test EmptyArray type.

* EmptyArray::getitem works.

* EmptyArray::getitem works in Numba; done with EmptyArray.
jpivarski authored Nov 12, 2019

1 parent d523c31 commit e909889
Showing 26 changed files with 803 additions and 23 deletions.
27 changes: 17 additions & 10 deletions README.md
@@ -60,31 +60,37 @@ Completed items are ☑check-marked. See [closed PRs](https://github.com/scikit-
* [X] Fully implement `__getitem__` for int/slice/intarray/boolarray/tuple (placeholders for newaxis/ellipsis), with perfect agreement with [Numpy basic/advanced indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html), to all levels of depth.
* [ ] Appendable arrays (a distinct phase from readable arrays, when the type is still in flux) to implement `awkward.fromiter` in C++.
* [X] Implemented all types but records; tested all primitives and lists.
* [ ] Expose appendable arrays to Numba.
* [ ] Implement appendable records.
* [ ] Test all (requires array types for all).
* [X] JSON → Awkward via header-only [RapidJSON](https://rapidjson.org) and `awkward.fromiter`.
* [ ] Explicit broadcasting functions for jagged and non-jagged arrays and scalars.
* [ ] Structure-preserving ufunc-like operation on the C++ side that applies a lambda function to inner data. The Python `__array_ufunc__` implementation will _call_ this to preserve structure.
* [ ] Extend `__getitem__` to take jagged arrays of integers and booleans (same behavior as old).
* [ ] Full suite of array types:
* [ ] `EmptyArray`: 1-dimensional array with length 0 and unknown type (result of `UnknownFillable`, compatible with all types of arrays).
* [X] `EmptyArray`: 1-dimensional array with length 0 and unknown type (result of `UnknownFillable`, compatible with all types of arrays).
* [X] `RawArray`: flat, 1-dimensional array type for pure C++ (header-only).
* [X] `NumpyArray`: rectilinear, N-dimensional array type without Python/pybind11 dependencies, but intended for Numpy.
* [X] `ListArray`: the new `JaggedArray`, based on `starts` and `stops` (i.e. fully general).
* [X] `ListOffsetArray`: the `JaggedArray` case with no unreachable data between reachable data (gaps).
* [ ] `RecordArray`: the new `Table` _without_ lazy-slicing.
* [ ] `RegularArray`: rectilinear, N-dimensional array of arbitrary contents, for putting jagged dimensions inside fixed dimensions (for example).
* [ ] `ChunkedArray`: same as the old version, except that the type is a union if chunks conflict, not an error, and knowledge of all chunk sizes is always required. (Maybe `AmorphousChunkedArray` would fill that role.)
* [ ] `RegularChunkedArray`: like a `ChunkedArray`, but all chunks are known to have the same size.
* [ ] `RecordArray`: the new `Table` _without_ lazy-slicing.
* [ ] `MaskedArray`, `BitMaskedArray`, `IndexedMaskedArray`: same as the old versions.
* [ ] `UnionArray`: same as the old version; `SparseUnionArray`: the additional case found in Apache Arrow.
* [ ] `SlicedArray`: lazy-slicing (from old `Table`) that can be applied to any type.
* [ ] `IndexedArray`: same as the old version.
* [ ] `RedirectArray`: an explicit weak-reference to another part of the structure (no hard-linked cycles). Often used with an `IndexedArray`.
* [ ] `SlicedArray`: lazy-slicing (from old `Table`) that can be applied to any type.
* [ ] `SparseArray`: same as the old version.
* [ ] `ChunkedArray`: same as the old version, except that the type is a union if chunks conflict, not an error, and knowledge of all chunk sizes is always required. (Maybe `AmorphousChunkedArray` would fill that role.)
* [ ] `RegularChunkedArray`: like a `ChunkedArray`, but all chunks are known to have the same size.
* [ ] `VirtualArray`: same as the old version, including caching, but taking C++11 lambda functions for materialization, get-cache, and put-cache. The pybind11 layer will connect this to Python callables.
* [ ] `ObjectArray`: same as the old version, but taking a C++11 lambda function to produce its output. The pybind11 layer will connect this to Python callables.
* [ ] Describe high-level types using [datashape](https://datashape.readthedocs.io/en/latest/) and possibly also an in-house schema. (Emit datashape _strings_ from C++.)
* [ ] Derived classes with ufunc-defined `Methods` and Numba extensions:
* [ ] `StringArray`: a `ListArray`/`ListOffsetArray` of characters with special methods and an optional encoding.
* [ ] `PyVirtualArray`: takes a Python lambda (which gets carried into `VirtualArray`).
* [ ] `PyObjectArray`: same as the old version.
* [X] Describe high-level types using [datashape](https://datashape.readthedocs.io/en/latest/) and possibly also an in-house schema. (Emit datashape _strings_ from C++.)
* [ ] Describe mid-level "persistence types" with no lengths, somewhat minimal JSON, optional dtypes/compression.
* [ ] Describe low-level layouts independently of filled arrays?
* [ ] Describe low-level layouts independently of filled arrays (JSON or something)?
* [ ] Layer 1 interface `Array`:
* [ ] Pass through to the layout classes in Python and Numba.
* [ ] Pass through Numpy ufuncs using [NEP 13](https://www.numpy.org/neps/nep-0013-ufunc-overrides.html) (as before).
@@ -94,7 +100,9 @@ Completed items are ☑check-marked. See [closed PRs](https://github.com/scikit-
* [ ] Mechanism for adding user-defined `Methods` like `LorentzVector`, as before, but only on Layer 1.
* [ ] Inherit from Pandas so that all Layer 1 arrays can be DataFrame columns.
* [ ] Full suite of operations:
* [X] `awkward.tolist`: invokes iterators to convert arrays to lists and dicts.
* [X] `awkward.tolist`: same as before.
* [X] `awkward.fromiter`: same as before.
* [X] `awkward.typeof`: reports the high-level type (accepting some non-awkward objects).
* [ ] `awkward.tonumpy`: to force conversion to Numpy, if possible. Neither Layer 1 nor Layer 2 will have an `__array__` method; in the Numpy sense, they are not "array-like" or "array-compatible."
* [ ] `awkward.topandas`: flattening jaggedness into `MultiIndex` rows and nested records into `MultiIndex` columns. This is distinct from the arrays' inheritance from Pandas, distinct from the natural ability to use any one of them as DataFrame columns.
* [ ] `awkward.flatten`: same as old with an `axis` parameter.
@@ -110,7 +118,6 @@ Completed items are ☑check-marked. See [closed PRs](https://github.com/scikit-
* [ ] `awkward.choose` (and `awkward.argchoose`): to make combinations by choosing a fixed number from a single array; option to use `Identity` index and an option to include same-object combinations.
* [ ] `awkward.join`: performs an inner join of multiple arrays; requires `Identity`. Because the `Identity` is a surrogate index, this is effectively a per-event intersection, zipping all fields.
* [ ] `awkward.union`: performs an outer join of multiple arrays; requires `Identity`. Because the `Identity` is a surrogate index, this is effectively a per-event union, zipping fields where possible.
* [ ] Derived classes section with `StringArray` as its first member. Derived classes have ufunc-defined `Methods` and Numba extensions.

### Soon after (possibly within) the six-month timeframe

2 changes: 1 addition & 1 deletion VERSION_INFO
@@ -1 +1 @@
0.1.20
0.1.21
1 change: 1 addition & 0 deletions awkward1/__init__.py
@@ -4,5 +4,6 @@
import awkward1._numba

from awkward1.operations.convert import *
from awkward1.operations.describe import *

__version__ = awkward1.layout.__version__
1 change: 1 addition & 0 deletions awkward1/_numba/__init__.py
@@ -14,3 +14,4 @@
import awkward1._numba.array.numpyarray
import awkward1._numba.array.listarray
import awkward1._numba.array.listoffsetarray
import awkward1._numba.array.empty
143 changes: 143 additions & 0 deletions awkward1/_numba/array/empty.py
@@ -0,0 +1,143 @@
# BSD 3-Clause License; see https://github.com/jpivarski/awkward-1.0/blob/master/LICENSE

import operator

import numpy
import numba

import awkward1.layout
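from ..._numba import cpu, util, content, identity  # identity is assumed to be the sibling module providing IdentityType, which the "id" attribute code below references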

@numba.extending.typeof_impl.register(awkward1.layout.EmptyArray)
def typeof(val, c):
    return EmptyArrayType(numba.typeof(val.id))

class EmptyArrayType(content.ContentType):
    def __init__(self, idtpe):
        super(EmptyArrayType, self).__init__(name="EmptyArrayType(id={})".format(idtpe.name))
        self.idtpe = idtpe

    @property
    def ndim(self):
        return 1

    def getitem_int(self):
        raise ValueError("cannot compile getitem for EmptyArray, which has unknown element type")

    def getitem_range(self):
        return self

    def getitem_tuple(self, wheretpe):
        if len(wheretpe.types) == 0:
            return self
        elif len(wheretpe.types) == 1 and isinstance(wheretpe.types[0], numba.types.SliceType):
            return self
        else:
            raise ValueError("cannot compile getitem for EmptyArray, which has unknown element type")

    def getitem_next(self, wheretpe, isadvanced):
        if len(wheretpe.types) == 0:
            return self
        else:
            raise ValueError("cannot compile getitem for EmptyArray, which has unknown element type")

    def carry(self):
        return self

    @property
    def lower_len(self):
        return lower_len

    @property
    def lower_getitem_range(self):
        return lower_getitem_range

    @property
    def lower_getitem_next(self):
        return lower_getitem_next

    @property
    def lower_carry(self):
        return lower_carry

@numba.extending.register_model(EmptyArrayType)
class EmptyArrayModel(numba.datamodel.models.StructModel):
    def __init__(self, dmm, fe_type):
        members = []
        if fe_type.idtpe != numba.none:
            members.append(("id", fe_type.idtpe))
        super(EmptyArrayModel, self).__init__(dmm, fe_type, members)

@numba.extending.unbox(EmptyArrayType)
def unbox(tpe, obj, c):
    proxyout = numba.cgutils.create_struct_proxy(tpe)(c.context, c.builder)
    if tpe.idtpe != numba.none:
        id_obj = c.pyapi.obj_getattr_string(obj, "id")
        proxyout.id = c.pyapi.to_native_value(tpe.idtpe, id_obj).value
        c.pyapi.decref(id_obj)
    is_error = numba.cgutils.is_not_null(c.builder, c.pyapi.err_occurred())
    return numba.extending.NativeValue(proxyout._getvalue(), is_error)

@numba.extending.box(EmptyArrayType)
def box(tpe, val, c):
    EmptyArray_obj = c.pyapi.unserialize(c.pyapi.serialize_object(awkward1.layout.EmptyArray))
    proxyin = numba.cgutils.create_struct_proxy(tpe)(c.context, c.builder, value=val)
    if tpe.idtpe != numba.none:
        id_obj = c.pyapi.from_native_value(tpe.idtpe, proxyin.id, c.env_manager)
        out = c.pyapi.call_function_objargs(EmptyArray_obj, (id_obj,))
        c.pyapi.decref(id_obj)
    else:
        out = c.pyapi.call_function_objargs(EmptyArray_obj, ())
    c.pyapi.decref(EmptyArray_obj)
    return out

@numba.extending.lower_builtin(len, EmptyArrayType)
def lower_len(context, builder, sig, args):
    return context.get_constant(numba.intp, 0)

@numba.extending.lower_builtin(operator.getitem, EmptyArrayType, numba.types.slice2_type)
def lower_getitem_range(context, builder, sig, args):
    rettpe, (tpe, wheretpe) = sig.return_type, sig.args
    val, whereval = args
    if context.enable_nrt:
        context.nrt.incref(builder, rettpe, val)
    return val

@numba.extending.lower_builtin(operator.getitem, EmptyArrayType, numba.types.BaseTuple)
def lower_getitem_tuple(context, builder, sig, args):
    rettpe, (tpe, wheretpe) = sig.return_type, sig.args
    val, whereval = args
    if context.enable_nrt:
        context.nrt.incref(builder, rettpe, val)
    return val

def lower_getitem_next(context, builder, arraytpe, wheretpe, arrayval, whereval, advanced):
    if context.enable_nrt:
        context.nrt.incref(builder, arraytpe, arrayval)
    return arrayval

def lower_carry(context, builder, arraytpe, carrytpe, arrayval, carryval):
    if context.enable_nrt:
        context.nrt.incref(builder, arraytpe, arrayval)
    return arrayval

@numba.typing.templates.infer_getattr
class type_methods(numba.typing.templates.AttributeTemplate):
    key = EmptyArrayType

    def generic_resolve(self, tpe, attr):
        if attr == "id":
            if tpe.idtpe == numba.none:
                return numba.optional(identity.IdentityType(numba.int32[:, :]))
            else:
                return tpe.idtpe

@numba.extending.lower_getattr(EmptyArrayType, "id")
def lower_id(context, builder, tpe, val):
    proxyin = numba.cgutils.create_struct_proxy(tpe)(context, builder, value=val)
    if tpe.idtpe == numba.none:
        return context.make_optional_none(builder, identity.IdentityType(numba.int32[:, :]))
    else:
        if context.enable_nrt:
            context.nrt.incref(builder, tpe.idtpe, proxyin.id)
        return proxyin.id
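
As a reading aid, the typing rules that `EmptyArrayType` encodes can be mimicked at runtime in plain Python. This is a minimal sketch, not the Numba extension itself: `EmptyArraySketch` is a hypothetical class, and raising `ValueError` at call time stands in for the compile-time error raised by the typing methods above.

```python
class EmptyArraySketch:
    """Hypothetical stand-in mirroring EmptyArrayType's getitem rules."""

    def __len__(self):
        # lower_len always produces the constant 0.
        return 0

    def __getitem__(self, where):
        if isinstance(where, slice):
            # getitem_range: any slice of an empty array is the array itself.
            return self
        if isinstance(where, tuple):
            # getitem_tuple: only () or a 1-tuple holding a slice is accepted.
            if len(where) == 0 or (len(where) == 1 and isinstance(where[0], slice)):
                return self
        # getitem_int and every other case: the element type is unknown.
        raise ValueError("cannot getitem for EmptyArray, which has unknown element type")


empty = EmptyArraySketch()
print(len(empty))            # 0
print(empty[1:3] is empty)   # True
print(empty[()] is empty)    # True
```

Returning the same object for every slice is safe here only because the array has no elements and no element type to narrow.
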
56 changes: 56 additions & 0 deletions awkward1/operations/describe.py
@@ -0,0 +1,56 @@
# BSD 3-Clause License; see https://github.com/jpivarski/awkward-1.0/blob/master/LICENSE

import numbers

import numpy

import awkward1.layout

def typeof(array):
    if array is None:
        return awkward1.layout.UnknownType()

    # numpy.bool was only an alias and is absent from some NumPy releases; numpy.bool_ is the scalar type.
    elif isinstance(array, (bool, numpy.bool_)):
        return awkward1.layout.PrimitiveType("bool")

    # Fixed-width NumPy scalars must be checked before numbers.Integral and numbers.Real:
    # NumPy registers its integer and floating scalars with those abstract classes,
    # so the generic branches would otherwise report every NumPy scalar as int64 or float64.
    elif isinstance(array, (numpy.int8, numpy.int16, numpy.int32, numpy.int64, numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64, numpy.float32, numpy.float64)):
        return awkward1.layout.PrimitiveType(typeof.dtype2primitive[array.dtype.type])

    elif isinstance(array, numbers.Integral):
        return awkward1.layout.PrimitiveType("int64")

    elif isinstance(array, numbers.Real):
        return awkward1.layout.PrimitiveType("float64")

    elif isinstance(array, numpy.generic):
        raise ValueError("cannot describe {0} as a PrimitiveType".format(type(array)))

    elif isinstance(array, numpy.ndarray):
        if len(array.shape) == 0:
            return typeof(array.reshape((1,))[0])
        elif len(array.shape) == 1:
            return awkward1.layout.ArrayType(array.shape[0], awkward1.layout.PrimitiveType(typeof.dtype2primitive[array.dtype.type]))
        else:
            return awkward1.layout.ArrayType(array.shape[0], awkward1.layout.RegularType(array.shape[1:], awkward1.layout.PrimitiveType(typeof.dtype2primitive[array.dtype.type])))

    elif isinstance(array, awkward1.layout.FillableArray):
        return array.type

    elif isinstance(array, awkward1.layout.Content):
        return array.type

    else:
        raise TypeError("unrecognized array type: {0}".format(repr(array)))

typeof.dtype2primitive = {
    numpy.int8: "int8",
    numpy.int16: "int16",
    numpy.int32: "int32",
    numpy.int64: "int64",
    numpy.uint8: "uint8",
    numpy.uint16: "uint16",
    numpy.uint32: "uint32",
    numpy.uint64: "uint64",
    numpy.float32: "float32",
    numpy.float64: "float64",
}
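
The order of the `isinstance` checks in `typeof` matters because Python's `bool` is a subclass of `int`, so `True` also satisfies `numbers.Integral`. The standard-library-only sketch below shows the scalar branches in isolation. `describe_scalar` is a hypothetical helper that returns type names as strings instead of `awkward1.layout` objects.

```python
import numbers


def describe_scalar(value):
    """Hypothetical helper: name the primitive type of a Python scalar."""
    if value is None:
        return "unknown"
    # bool must be tested first: isinstance(True, numbers.Integral) is True.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, numbers.Integral):
        return "int64"
    if isinstance(value, numbers.Real):
        return "float64"
    raise TypeError("unrecognized scalar type: {0}".format(type(value).__name__))


print(describe_scalar(True))   # bool
print(describe_scalar(3))      # int64
print(describe_scalar(2.5))    # float64
print(describe_scalar(None))   # unknown
```

Swapping the `bool` and `numbers.Integral` branches would make `describe_scalar(True)` report `int64`, which is why the most specific type is tested first.
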
3 changes: 3 additions & 0 deletions include/awkward/Content.h
@@ -9,6 +9,7 @@
#include "awkward/Identity.h"
#include "awkward/Slice.h"
#include "awkward/io/json.h"
#include "awkward/type/ArrayType.h"

namespace awkward {
class Content {
@@ -21,6 +22,7 @@ namespace awkward {
virtual void setid(const std::shared_ptr<Identity> id) = 0;
virtual const std::string tostring_part(const std::string indent, const std::string pre, const std::string post) const = 0;
virtual void tojson_part(ToJson& builder) const = 0;
virtual std::shared_ptr<Type> type_part() const = 0;
virtual int64_t length() const = 0;
virtual const std::shared_ptr<Content> shallow_copy() const = 0;
virtual void checksafe() const = 0;
@@ -33,6 +35,7 @@ namespace awkward {
virtual const std::shared_ptr<Content> carry(const Index64& carry) const = 0;
virtual const std::pair<int64_t, int64_t> minmax_depth() const = 0;

const ArrayType type() const;
const std::string tostring() const;
const std::string tojson(bool pretty, int64_t maxdecimals) const;
void tojson(FILE* destination, bool pretty, int64_t maxdecimals, int64_t buffersize) const;
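
One way to read the `type_part()` / `type()` split in `Content`: each array class describes only its own element type, and the shared, non-virtual `type()` wraps that description with the array's length. The Python sketch below illustrates that division of labor under stated assumptions. All class names are hypothetical, and the `3 * float64` string format follows datashape's convention rather than anything this commit confirms.

```python
from abc import ABC, abstractmethod


class ContentSketch(ABC):
    """Hypothetical analogue of awkward::Content, for illustration only."""

    @abstractmethod
    def __len__(self):
        ...

    @abstractmethod
    def type_part(self):
        """Describe the element type, without the outer length."""

    def type(self):
        # Shared by all subclasses: prepend the length to the element type.
        return "{0} * {1}".format(len(self), self.type_part())


class FlatSketch(ContentSketch):
    """Hypothetical flat array of one primitive type."""

    def __init__(self, values, primitive):
        self.values = list(values)
        self.primitive = primitive

    def __len__(self):
        return len(self.values)

    def type_part(self):
        return self.primitive


class EmptySketch(ContentSketch):
    """Hypothetical empty array: length 0, element type unknown."""

    def __len__(self):
        return 0

    def type_part(self):
        return "unknown"


print(FlatSketch([1.1, 2.2, 3.3], "float64").type())  # 3 * float64
print(EmptySketch().type())                           # 0 * unknown
```

Keeping the length in one non-virtual method means a new array class only has to describe its contents.
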
43 changes: 43 additions & 0 deletions include/awkward/array/EmptyArray.h
@@ -0,0 +1,43 @@
// BSD 3-Clause License; see https://github.com/jpivarski/awkward-1.0/blob/master/LICENSE

#ifndef AWKWARD_EMPTYARRAY_H_
#define AWKWARD_EMPTYARRAY_H_

#include <cassert>
#include <string>
#include <memory>
#include <vector>

#include "awkward/cpu-kernels/util.h"
#include "awkward/Slice.h"
#include "awkward/Content.h"

namespace awkward {
  class EmptyArray: public Content {
  public:
    EmptyArray(const std::shared_ptr<Identity> id): id_(id) { }

    virtual const std::string classname() const;
    virtual const std::shared_ptr<Identity> id() const { return id_; }
    virtual void setid();
    virtual void setid(const std::shared_ptr<Identity> id);
    virtual const std::string tostring_part(const std::string indent, const std::string pre, const std::string post) const;
    virtual void tojson_part(ToJson& builder) const;
    virtual std::shared_ptr<Type> type_part() const;
    virtual int64_t length() const;
    virtual const std::shared_ptr<Content> shallow_copy() const;
    virtual void checksafe() const;
    virtual const std::shared_ptr<Content> getitem_at(int64_t at) const;
    virtual const std::shared_ptr<Content> getitem_at_unsafe(int64_t at) const;
    virtual const std::shared_ptr<Content> getitem_range(int64_t start, int64_t stop) const;
    virtual const std::shared_ptr<Content> getitem_range_unsafe(int64_t start, int64_t stop) const;
    virtual const std::shared_ptr<Content> getitem_next(const std::shared_ptr<SliceItem> head, const Slice& tail, const Index64& advanced) const;
    virtual const std::shared_ptr<Content> carry(const Index64& carry) const;
    virtual const std::pair<int64_t, int64_t> minmax_depth() const;

  private:
    std::shared_ptr<Identity> id_;
  };
}

#endif // AWKWARD_EMPTYARRAY_H_
1 change: 1 addition & 0 deletions include/awkward/array/ListArray.h
@@ -30,6 +30,7 @@ namespace awkward {
virtual void setid(const std::shared_ptr<Identity> id);
virtual const std::string tostring_part(const std::string indent, const std::string pre, const std::string post) const;
virtual void tojson_part(ToJson& builder) const;
virtual std::shared_ptr<Type> type_part() const;
virtual int64_t length() const;
virtual const std::shared_ptr<Content> shallow_copy() const;
virtual void checksafe() const;
1 change: 1 addition & 0 deletions include/awkward/array/ListOffsetArray.h
@@ -28,6 +28,7 @@ namespace awkward {
virtual void setid(const std::shared_ptr<Identity> id);
virtual const std::string tostring_part(const std::string indent, const std::string pre, const std::string post) const;
virtual void tojson_part(ToJson& builder) const;
virtual std::shared_ptr<Type> type_part() const;
virtual int64_t length() const;
virtual const std::shared_ptr<Content> shallow_copy() const;
virtual void checksafe() const;
1 change: 1 addition & 0 deletions include/awkward/array/NumpyArray.h
@@ -47,6 +47,7 @@ namespace awkward {
virtual void setid(const std::shared_ptr<Identity> id);
virtual const std::string tostring_part(const std::string indent, const std::string pre, const std::string post) const;
virtual void tojson_part(ToJson& builder) const;
virtual std::shared_ptr<Type> type_part() const;
virtual int64_t length() const;
virtual const std::shared_ptr<Content> shallow_copy() const;
virtual void checksafe() const;