pqarrow/arrowutils: Add SortRecord and ReorderRecord #628

metalmatze · 2023-12-14T11:06:05Z

This is extracted from a previous PR, #461.

asubiotto

This is a good start. I think what is missing is:

Order by a set of columns (e.g. change the SortRecord parameter from col int to cols []int)
Support ordering directions (ascending vs descending)
Support NULLs.

metalmatze · 2023-12-14T11:19:51Z

I 100% agree with your list.
Would you like us to add those before merging this PR?
Even with this small feature set some use cases benefit from this already.

For now, I'll change the parameter to cols []int so the signature is ready for that.

Multi-column sorting isn't implemented yet; only the function signature is future-proof.

gernest · 2023-12-14T12:24:32Z

pqarrow/arrowutils/sort.go

+)
+
+// SortRecord sorts the given record's rows by the given column. Currently only supports int64, string and binary columns.
+func SortRecord(r arrow.Record, cols []int) ([]int, error) {


Can we also add direction []int, which would behave the way bytes.Compare does?

-1 for ascending

0 no direction

1 for descending

This will simplify sorting: in the less function you can return bytes.Compare(...) == direction.
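
A minimal sketch of that suggestion, assuming plain [][]byte values rather than an Arrow column. The names Ascending, Descending and sortIndices are hypothetical and not part of arrowutils. A direction of 0 is deliberately not handled here, since comparing the result to 0 would not give a valid less function.

```go
package main

import (
	"bytes"
	"sort"
)

// Hypothetical direction constants following the proposed convention:
// the direction is compared against the result of bytes.Compare.
const (
	Ascending  = -1
	Descending = 1
)

// sortIndices returns the row indices of values ordered in the given
// direction. The input slice itself is left untouched.
func sortIndices(values [][]byte, direction int) []int {
	indices := make([]int, len(values))
	for i := range indices {
		indices[i] = i
	}
	sort.SliceStable(indices, func(i, j int) bool {
		// bytes.Compare returns -1, 0 or 1, so equality with the direction
		// holds only when the pair is strictly ordered the requested way.
		return bytes.Compare(values[indices[i]], values[indices[j]]) == direction
	})
	return indices
}
```

Calling sortIndices(values, Ascending) yields indices for an ascending order, and sortIndices(values, Descending) reverses the comparison without a second code path.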

gernest · 2023-12-14T16:20:35Z

pqarrow/arrowutils/sort.go

+// ReorderRecord reorders the given record's rows by the given indices.
+// This is a wrapper around compute.Take which handles the type castings.
+func ReorderRecord(r arrow.Record, indices arrow.Array) (arrow.Record, error) {
+	res, err := compute.Take(


Awesome. I also wanted to note that this will work the majority of the time. We support dictionaries, and there is no dictionary kernel for Take yet.

We need to optimise:

If the record has no dictionary column, use compute.Take on the record.

If there is a dictionary column, use compute.Take on the individual non-dictionary columns, fall back to manual taking for the dictionary columns, and then assemble the record (preferably concurrently).

This can be done in a separate PR. Unless I'm mistaken, this PR just adds these functions and there is no call site for them yet. (A lot of changes will be needed before they can be used in my logictest PR.)

I'm on mobile now, so I can't write up what is needed for the logictest to use these functions.
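
To illustrate the two cases, here is a rough model in plain Go, not the Arrow API. dictColumn, takeInt64 and takeDict are hypothetical stand-ins: takeInt64 mimics what a Take kernel does on a plain column, and takeDict shows the manual fallback, which only has to reorder the per-row dictionary indices.

```go
package main

// dictColumn is a hypothetical stand-in for a dictionary-encoded column:
// each row stores an index into a shared dictionary of distinct values.
type dictColumn struct {
	dict    []string
	indices []int
}

// takeInt64 models a Take kernel on a plain column:
// out[i] = values[indices[i]].
func takeInt64(values []int64, indices []int) []int64 {
	out := make([]int64, len(indices))
	for i, idx := range indices {
		out[i] = values[idx]
	}
	return out
}

// takeDict is the manual fallback for a dictionary column. Only the
// per-row indices are reordered; the dictionary itself is shared as-is.
func takeDict(col dictColumn, indices []int) dictColumn {
	out := dictColumn{dict: col.dict, indices: make([]int, len(indices))}
	for i, idx := range indices {
		out.indices[i] = col.indices[idx]
	}
	return out
}
```

In this model the reordered record would be assembled from the results of both functions, column by column. Each column is independent of the others, which is what makes the concurrent assembly mentioned above possible.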

Oh, that's nasty and a really good comment!
We should make sure this is supported in follow-up PRs!

NULL always gets sorted to the back. This seems to be the default for other language implementations. It can be made configurable in the future.

metalmatze · 2023-12-14T17:59:58Z

I've updated the PR to also support sorting NULL values. They always get sorted to the back.
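
A small sketch of the NULLs-last behaviour, assuming an int64 column and a []bool validity slice as a stand-in for an Arrow validity bitmap. sortNullsLast is a hypothetical helper, not the code in this PR.

```go
package main

import "sort"

// sortNullsLast returns row indices that order values ascending, with
// NULL rows at the end. valid[i] == false marks row i as NULL.
func sortNullsLast(values []int64, valid []bool) []int {
	indices := make([]int, len(values))
	for i := range indices {
		indices[i] = i
	}
	sort.SliceStable(indices, func(i, j int) bool {
		a, b := indices[i], indices[j]
		switch {
		case !valid[a]:
			return false // NULL never sorts before anything
		case !valid[b]:
			return true // any non-NULL value sorts before NULL
		default:
			return values[a] < values[b]
		}
	})
	return indices
}
```

Two NULL rows compare as equal, so the stable sort keeps them in their original relative order at the back.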

gernest

Looks good, I left a few suggestions.

Thinking about it, do we really need to handle *array.Binary? We are not consistent in how we handle string schema columns. I see a lot of variation: in some cases *array.String, in others *array.Binary.

I was wondering whether we can consolidate this: use *array.String for string columns and, when we add binary columns, use *array.Binary for those.

This is enough of a foundation. I will do follow-up patches to cover:

Multi column sort
Support direction (ascending and descending)
NullFirst (for ascending and descending)
Pooling/reusing the sorting objects (the indices and the indices array builder should be reused)
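
As a rough idea of what the first two follow-ups could look like, here is a sketch in plain Go over int64 columns. sortKey and sortByKeys are hypothetical names, and NULL handling and pooling are left out to keep it short.

```go
package main

import "sort"

// sortKey is a hypothetical description of one sort column.
type sortKey struct {
	col        int  // index into the columns slice
	descending bool // false sorts ascending
}

// sortByKeys returns row indices ordered by the given keys.
// Earlier keys take precedence; later keys only break ties.
func sortByKeys(columns [][]int64, numRows int, keys []sortKey) []int {
	indices := make([]int, numRows)
	for i := range indices {
		indices[i] = i
	}
	sort.SliceStable(indices, func(i, j int) bool {
		a, b := indices[i], indices[j]
		for _, k := range keys {
			va, vb := columns[k.col][a], columns[k.col][b]
			if va == vb {
				continue // tie on this key, consult the next one
			}
			if k.descending {
				return va > vb
			}
			return va < vb
		}
		return false // rows are equal on every key
	})
	return indices
}
```

The comparator walks the keys in order and stops at the first column where the two rows differ, so adding a key never changes the order established by the keys before it.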

pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

metalmatze · 2023-12-15T12:10:42Z

You're right about the *array.Binary column. It's not covered in the unit tests (yet).
This was added for an internal experiment. I'm fine removing it from the sort code in this PR for now.

This isn't properly unit tested and was more of an experiment.

pqarrow/arrowutils/sort.go

asubiotto

LGTM, we can follow up with the remaining work

metalmatze · 2023-12-15T15:52:12Z

Thanks for all the valuable comments and feedback!

pqarrow/arrowutils: Add SortRecord and ReorderRecord

672bc29

This is extracted from a previous PR, #461.

asubiotto reviewed Dec 14, 2023

pqarrow/arrowutils: Update SortRecord to allow for multiple sort columns

e61df89

Multi-column sorting isn't implemented yet; only the function signature is future-proof.

gernest reviewed Dec 14, 2023

gernest reviewed Dec 14, 2023

gernest mentioned this pull request Dec 14, 2023

Write directly to arrow.Record in logictest #629

Closed

pqarrow/arrowutils: Use compute.Take for ReorderRecord

382716b

gernest reviewed Dec 14, 2023

pqarrow/arrowutils: Add support for sorting NULL

1c6f0f7

NULL always gets sorted to the back. This seems to be the default for other language implementations. It can be made configurable in the future.

gernest suggested changes Dec 14, 2023

metalmatze and others added 6 commits December 15, 2023 13:03

Update pqarrow/arrowutils/sort.go

50fbae2

Co-authored-by: Geofrey Ernest <[email protected]>

Update pqarrow/arrowutils/sort.go

016693d

Co-authored-by: Geofrey Ernest <[email protected]>

Update pqarrow/arrowutils/sort.go

4388a71

Co-authored-by: Geofrey Ernest <[email protected]>

Update pqarrow/arrowutils/sort.go

9330d14

Co-authored-by: Geofrey Ernest <[email protected]>

Update pqarrow/arrowutils/sort.go

a8d72be

Co-authored-by: Geofrey Ernest <[email protected]>

Merge branch 'main' into arrowutils-sort

45656d8

pqarrow/arrowutils: Remove sorting *array.Binary

e715049

This isn't properly unit tested and was more of an experiment.

asubiotto reviewed Dec 15, 2023

pqarrow/arrowutils: Add context and reserve indices length

ae01333

asubiotto approved these changes Dec 15, 2023

gernest approved these changes Dec 15, 2023

metalmatze merged commit ee6970e into main Dec 15, 2023
4 checks passed

metalmatze deleted the arrowutils-sort branch December 15, 2023 15:51

gernest mentioned this pull request Dec 16, 2023

Upgrade arrowutils.SortRecord #635

Merged
