[Enh]: Spark Expr missing methods #1714

Open
14 of 52 tasks
FBruzzesi opened this issue Jan 3, 2025 · 6 comments
Labels
enhancement New feature or request good first issue Good for newcomers, but anyone is welcome to submit a pull request! help wanted Extra attention is needed

Comments

@FBruzzesi
Member

FBruzzesi commented Jan 3, 2025

Methods marked with one asterisk (*) are row-order dependent and should be deprioritized for now, until a decision on the lazy API is reached (see the stable v2 discussion).
Two asterisks (**) denote a namespace; the methods inside each namespace are not listed individually, and all of them are missing as of now.

High priority:

  • abs
  • all
  • any
  • arg_true
  • clip
  • drop_nulls
  • fill_null (* if strategy is provided)
  • filter
  • is_between
  • is_duplicated
  • is_finite
  • is_in
  • is_nan
  • is_unique
  • len
  • map_batches
  • median
  • mode
  • n_unique
  • null_count
  • over
  • quantile
  • replace_strict
  • round
  • sample
  • skew
  • sort
  • unique

Deprioritized:

  • arg_max (*)
  • arg_min (*)
  • cum_count (*)
  • cum_max (*)
  • cum_min (*)
  • cum_prod (*)
  • cum_sum (*)
  • diff (*)
  • ewm_mean (*)
  • gather_every (*)
  • head (*)
  • is_first_distinct (*)
  • is_last_distinct (*)
  • rolling_mean (*)
  • rolling_std (*)
  • rolling_sum (*)
  • rolling_var (*)
  • shift (*)
  • tail (*)

Namespaces:

  • cat (**)
  • dt (**)
  • list (**)
  • name (**)
  • str (**)
@FBruzzesi FBruzzesi added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers, but anyone is welcome to submit a pull request! labels Jan 3, 2025
@lucas-nelson-uiuc
Contributor

lucas-nelson-uiuc commented Jan 4, 2025

Hey @FBruzzesi,

Working on implementing scalar methods like any and all - should be ready to push later today.

Planning on working on the following methods - want to first check if my thought process is "correct".

  • arg_true
  • drop_nulls
  • filter
  • gather_every
  • sort
  • unique

Thinking of implementing two patterns for these methods:

# if predicate-based (e.g. drop_nulls, which uses predicate function `F.isnull`)
def method(self) -> Self:
    def _method(_input: Column) -> Column:
        from pyspark.sql import functions as F  # noqa: N812

        return F.explode(F.filter(F.array(_input), <predicate_func>))

    return self._from_call(_method, "method", returns_scalar=False)


# if not predicate-based (e.g. unique, which uses array function `F.array_distinct`)
def method(self) -> Self:
    def _method(_input: Column) -> Column:
        from pyspark.sql import functions as F  # noqa: N812

        return F.explode(<array_func>(F.array(_input)))

    return self._from_call(_method, "method", returns_scalar=False)

Not sure how expensive doing this is or if it collides with future API developments. Lmk what you think
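
To make the predicate-based pattern above easier to reason about, here is a minimal pure-Python sketch of what wrapping each value in an array, filtering it, and exploding the result would do to a column. It is only a model, not PySpark code: `array`, `filter_array`, `explode` and `drop_nulls_model` are hypothetical stand-ins for `F.array`, `F.filter`, `F.explode` and the proposed method, and the "keep non-null values" predicate is an assumption about how `drop_nulls` would use it.

```python
from typing import Callable, Iterable, List, Optional

Value = Optional[int]


def array(value: Value) -> List[Value]:
    # Stand-in for F.array(col): each row becomes a one-element array.
    return [value]


def filter_array(arr: List[Value], keep: Callable[[Value], bool]) -> List[Value]:
    # Stand-in for F.filter(array, predicate): keep elements where the predicate holds.
    return [x for x in arr if keep(x)]


def explode(arrays: Iterable[List[Value]]) -> List[Value]:
    # Stand-in for F.explode: one output row per array element,
    # so a row whose array ended up empty produces no output row.
    return [x for arr in arrays for x in arr]


def drop_nulls_model(column: List[Value]) -> List[Value]:
    # Assumed predicate for drop_nulls: keep values that are not null.
    per_row = (filter_array(array(v), lambda x: x is not None) for v in column)
    return explode(per_row)


print(drop_nulls_model([1, None, 3]))  # prints [1, 3]
```

In this model each array holds a single row's value, so an array function applied the same way only ever sees one element at a time.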

@MarcoGorelli
Member

thanks @lucas-nelson-uiuc for your efforts here

can we leave the row-order dependent ones out for now, make sure we've got everything done from the others first? there's some broader api decisions we need to make for those

@lucas-nelson-uiuc
Contributor

lucas-nelson-uiuc commented Jan 10, 2025

got a working version for the following - all support the Polars examples and expr_and_series tests:

  • filter
  • drop_nulls
  • replace_strict
  • fill_null (only strategy='zero' and strategy='one' seem like v1 additions)

@FBruzzesi
Member Author

Amazing stuff @lucas-nelson-uiuc! Looking forward to those as well!
Note that we have now merged the pyspark tests into the main test suite, so to run a test against pyspark you just need to remove the following snippet from the dedicated feature test:

    if "pyspark" in str(constructor):
        request.applymarker(pytest.mark.xfail)

@FBruzzesi
Member Author

FYI I am working on SparkLikeNamespace methods

@lucas-nelson-uiuc
Contributor

tried adding is_nan into #1802 but noticed two things:

  • nw._spark_like.expr.cast is not yet fully developed - this causes the tests to fail
  • Spark handles zero division by returning null instead of nan - this also causes the test to fail
    • should the Spark implementation of is_nan check for NaN and NULL?

lmk if I'm missing something
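
To make the NaN vs NULL question concrete, here is a small pure-Python sketch of the two behaviours being weighed. It does not use Spark or narwhals: `None` stands in for a Spark NULL, `float("nan")` for NaN, and both function names are hypothetical. Neither option is presented as the decided behaviour.

```python
import math
from typing import Optional


def is_nan_keep_null(value: Optional[float]) -> Optional[bool]:
    # Option 1 (assumption): a null input stays null, only real NaN gives True.
    if value is None:
        return None
    return math.isnan(value)


def is_nan_null_as_nan(value: Optional[float]) -> bool:
    # Option 2 (assumption): treat null the same as NaN.
    return value is None or math.isnan(value)


# A value that would be NaN elsewhere but arrives as NULL from Spark's
# zero division is modelled here as None.
column = [1.0, float("nan"), None]
print([is_nan_keep_null(v) for v in column])    # prints [False, True, None]
print([is_nan_null_as_nan(v) for v in column])  # prints [False, True, True]
```

The two options only differ on the null row, which is exactly the row the failing test would exercise.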


3 participants