Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pqarrow/arrowutils: Add SortRecord and ReorderRecord #628

Merged
merged 12 commits into from
Dec 15, 2023
Merged

Conversation

metalmatze
Copy link
Member

This is extract from a previous PR #461.

This is extract from a previous PR #461.
Copy link
Member

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start. I think what is missing is:

  1. Order by a set of columns (e.g. change the SortRecord parameter from col int to cols int)
  2. Support ordering directions (ascending vs descending)
  3. Support NULLs.

@metalmatze
Copy link
Member Author

I 100% agree with your list.
Would you like us to add those before merging this PR?
Even with this small feature set some use cases benefit from this already.

For now, I'll make the parameter a cols []int for the future.

This isn't implemented yet, just the function signature is future proof.
pqarrow/arrowutils/sort.go Outdated Show resolved Hide resolved
)

// SortRecord sorts the given record's rows by the given column. Currently only supports int64, string and binary columns.
func SortRecord(r arrow.Record, cols []int) ([]int, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add direction []int which will behave similar bytes.Compare does

  • -1 for ascending
  • 0 no direction
  • 1 for descending

This will simply sorting , in the less function you can return bytes.Compare(...)==direction

// ReorderRecord reorders the given record's rows by the given indices.
// This is a wrapper around compute.Take which handles the type castings.
func ReorderRecord(r arrow.Record, indices arrow.Array) (arrow.Record, error) {
res, err := compute.Take(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, I also wanted to note, this will work majority of the time. We support dictionaries and there is no dictionary kernel for Take yet.

We need to optimise

  • If record has no dictionary column. Use compute.Take on hte record
  • If there is a dictionary column. Use compute.Take on individual non dictionary columns and fall back to manual taking for the dictionary and then assembling the record (preferably concurrently).

This can be done on a separate PR. Unless I'm mistaken, the PR just adds these functions but there is no expected call site for them in yet( There will be a need for a lot of changes for this to be used in my logictest PR).

I'm on mobile now , so I cant write what is needed for the logictest to use these functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's nasty and a really good comment!
We should make sure this is supported in follow-up PRs!

NULL always gets sorted to the back. This seems to be the default for other language implementations. It can be made configurable in the future.
@metalmatze
Copy link
Member Author

I've updated to additionally add support for sorting NULL. It always gets sorted to the back.

Copy link
Contributor

@gernest gernest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I left a few suggestions.

Thinking about it, do we really need to handle *array.Binary? we are not consistent with how we handle string schema column. I see a lot of variations ,some cases *array.String some cases *array.Binary.

I was wondering maybe we can consolidate this, have *array.String for string column and when we add binary columns then they can be *array.Binary.

This is enough foundation. I will do folloup patches to cover

  • Multi column sort
  • Support direction (ascending and descending)
  • NullFirst (for ascending and descending)
  • Pooling/ Reusing the sorting objects ( indices and the indices array builder should be reused)

pqarrow/arrowutils/sort.go Outdated Show resolved Hide resolved
pqarrow/arrowutils/sort.go Outdated Show resolved Hide resolved
pqarrow/arrowutils/sort.go Outdated Show resolved Hide resolved
pqarrow/arrowutils/sort.go Outdated Show resolved Hide resolved
pqarrow/arrowutils/sort.go Outdated Show resolved Hide resolved
@metalmatze
Copy link
Member Author

You're correct with the *array.Binary column. It's not covered in the unit tests (yet).
This was added for an internal experiment. I'm fine removing it from the sort code in this PR for now.

This isn't properly unit tested and was more of an experiment.
pqarrow/arrowutils/sort.go Show resolved Hide resolved
pqarrow/arrowutils/sort.go Outdated Show resolved Hide resolved
Copy link
Member

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, we can follow up with the remaining work

@metalmatze metalmatze merged commit ee6970e into main Dec 15, 2023
4 checks passed
@metalmatze metalmatze deleted the arrowutils-sort branch December 15, 2023 15:51
@metalmatze
Copy link
Member Author

Thanks for all the valuable comments and feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants