-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pqarrow/arrowutils: Add SortRecord and ReorderRecord #628
Conversation
This is extract from a previous PR #461.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start. I think what is missing is:
- Order by a set of columns (e.g. change the
SortRecord
parameter fromcol int
tocols int
) - Support ordering directions (ascending vs descending)
- Support NULLs.
I 100% agree with your list. For now, I'll make the parameter a |
This isn't implemented yet, just the function signature is future proof.
pqarrow/arrowutils/sort.go
Outdated
) | ||
|
||
// SortRecord sorts the given record's rows by the given column. Currently only supports int64, string and binary columns. | ||
func SortRecord(r arrow.Record, cols []int) ([]int, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add direction []int
which will behave similar bytes.Compare
does
-1
for ascending0
no direction1
for descending
This will simply sorting , in the less function you can return bytes.Compare(...)==direction
// ReorderRecord reorders the given record's rows by the given indices. | ||
// This is a wrapper around compute.Take which handles the type castings. | ||
func ReorderRecord(r arrow.Record, indices arrow.Array) (arrow.Record, error) { | ||
res, err := compute.Take( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, I also wanted to note, this will work majority of the time. We support dictionaries and there is no dictionary kernel for Take
yet.
We need to optimise
- If record has no dictionary column. Use
compute.Take
on hte record - If there is a dictionary column. Use
compute.Take
on individual non dictionary columns and fall back to manual taking for the dictionary and then assembling the record (preferably concurrently).
This can be done on a separate PR. Unless I'm mistaken, the PR just adds these functions but there is no expected call site for them in yet( There will be a need for a lot of changes for this to be used in my logictest PR).
I'm on mobile now , so I cant write what is needed for the logictest to use these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's nasty and a really good comment!
We should make sure this is supported in follow-up PRs!
NULL always gets sorted to the back. This seems to be the default for other language implementations. It can be made configurable in the future.
I've updated to additionally add support for sorting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I left a few suggestions.
Thinking about it, do we really need to handle *array.Binary
? we are not consistent with how we handle string
schema column. I see a lot of variations ,some cases *array.String
some cases *array.Binary
.
I was wondering maybe we can consolidate this, have *array.String
for string
column and when we add binary
columns then they can be *array.Binary
.
This is enough foundation. I will do folloup patches to cover
- Multi column sort
- Support direction (ascending and descending)
- NullFirst (for ascending and descending)
- Pooling/ Reusing the sorting objects ( indices and the indices array builder should be reused)
Co-authored-by: Geofrey Ernest <[email protected]>
Co-authored-by: Geofrey Ernest <[email protected]>
Co-authored-by: Geofrey Ernest <[email protected]>
Co-authored-by: Geofrey Ernest <[email protected]>
Co-authored-by: Geofrey Ernest <[email protected]>
You're correct with the |
This isn't properly unit tested and was more of an experiment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, we can follow up with the remaining work
Thanks for all the valuable comments and feedback! |
This is extract from a previous PR #461.