feat: Add minimal PySpark support #908

Merged
merged 98 commits into from
Dec 5, 2024
72c1b49
first pyspark draft
EdAbati Sep 3, 2024
e67140a
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Sep 3, 2024
3316460
added schema
EdAbati Sep 4, 2024
12f62c1
add methods needed for compliant types
EdAbati Sep 4, 2024
2b114eb
fix all_horizontal
EdAbati Sep 7, 2024
378b421
add xfail to some tests
EdAbati Sep 8, 2024
b5957dc
draft with sql
EdAbati Sep 8, 2024
9f8f944
merge upstream
EdAbati Sep 10, 2024
b2aee0e
making all frame tests pass
EdAbati Sep 11, 2024
0e4b2f2
group by
EdAbati Sep 12, 2024
741cdde
skipping tests
EdAbati Sep 12, 2024
2bdfe31
restore type
EdAbati Sep 12, 2024
c0b1a18
smaller diff + mypy fix
EdAbati Sep 12, 2024
ec0b26f
remove print
EdAbati Sep 12, 2024
32b87a3
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Sep 12, 2024
a053b07
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
a415bd0
smaller diff
EdAbati Sep 12, 2024
6065eb2
reenable pyspark
EdAbati Sep 12, 2024
1688f7d
count without window
EdAbati Oct 6, 2024
191dcb7
revert expr series tests
EdAbati Oct 6, 2024
41368ef
revert rest of tests
EdAbati Oct 6, 2024
b0dffad
placeholder pyspark test
EdAbati Oct 6, 2024
37ecc70
merge main
EdAbati Oct 6, 2024
1c76b0b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 6, 2024
9802fdc
moved test_column
EdAbati Oct 6, 2024
267f2ff
moved select filter and with_columns
EdAbati Oct 6, 2024
8adee30
add schema head sort tests
EdAbati Oct 6, 2024
9186687
add test add
EdAbati Oct 6, 2024
38d326d
fix rename
EdAbati Oct 8, 2024
223ea88
added more tests
EdAbati Oct 8, 2024
3337014
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Oct 13, 2024
d7b2752
fix all_horizontal
EdAbati Oct 13, 2024
95b8395
fixing all tests 🎉🎉
EdAbati Oct 13, 2024
734c140
rename test
EdAbati Oct 13, 2024
1b9a7e7
add backend_version
EdAbati Oct 13, 2024
9d326a4
added group by tests
EdAbati Oct 13, 2024
1a2e804
add pyspark in requirement dev
EdAbati Oct 13, 2024
411f67d
use pyspark.sql to create empty df
EdAbati Oct 13, 2024
3a59240
stddev for older pyspark
EdAbati Oct 13, 2024
08120da
coverage up
EdAbati Oct 13, 2024
177ec5e
min pyspark version test
EdAbati Oct 14, 2024
77e6687
fix for pyspark 3.2
EdAbati Oct 14, 2024
9ccab80
pyspark 3.3 as minimum
EdAbati Oct 14, 2024
ef1944c
trying debugging windows
EdAbati Oct 14, 2024
a8b228f
no test pyspark with pandas <1.0.5
EdAbati Oct 14, 2024
c74772d
removing debug windows
EdAbati Oct 14, 2024
dd0dd39
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Oct 14, 2024
d00a2da
testing 3.3.0
EdAbati Oct 14, 2024
6b25971
trying with repartition 2
EdAbati Oct 14, 2024
3713a6d
remove unused data
EdAbati Oct 15, 2024
eb0a2ce
trying to fix sorting problems in tests
EdAbati Oct 15, 2024
df1a37f
no pyspark in minimum_versions
EdAbati Oct 15, 2024
ce503fa
trying to make windows happy
EdAbati Oct 15, 2024
94656b3
fix repartition
EdAbati Oct 15, 2024
33739de
exclude pyspark for python 3.12
EdAbati Oct 15, 2024
5808d71
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Oct 27, 2024
5d4b02f
use assert_equal_data
EdAbati Oct 27, 2024
92617f1
only use self._native_frame.sparkSession
EdAbati Oct 27, 2024
5733069
add drop_null_keys in groupby
EdAbati Oct 27, 2024
9b6c4e0
rename _spark
EdAbati Oct 27, 2024
e2344c7
rename spark_test
EdAbati Oct 27, 2024
bb1de48
use PYSPARK_VERSION
EdAbati Oct 28, 2024
36d0886
rename PySpark... classes to Spark...
EdAbati Oct 28, 2024
24676d0
_ in func signature
EdAbati Oct 28, 2024
3defa39
make coverage happy
EdAbati Oct 28, 2024
a8946f2
exception public
EdAbati Nov 17, 2024
86c459d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 17, 2024
dc7fb71
fix docs
EdAbati Nov 17, 2024
d720b58
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Nov 17, 2024
a1141f7
rename to _spark_like
EdAbati Nov 17, 2024
94b6777
rename exceptions
EdAbati Nov 17, 2024
10c1b11
update coverage to ignore `_spark_like`
EdAbati Nov 17, 2024
5193bca
better comment
EdAbati Nov 18, 2024
7b513f4
invalidintoexpr error
EdAbati Nov 18, 2024
f25969b
fix pytest warning error
EdAbati Nov 21, 2024
08987cf
small comment
EdAbati Nov 21, 2024
0fb0478
fix F.std for ddof more than 1
EdAbati Nov 21, 2024
50e2e4d
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Nov 21, 2024
e86fc5c
fix stddev imports for py <3.5
EdAbati Nov 21, 2024
522a1aa
use F
EdAbati Nov 21, 2024
84d5b6a
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Dec 3, 2024
b9c21df
update to latest changes
EdAbati Dec 3, 2024
bb82020
add implementation to expr
EdAbati Dec 3, 2024
010a362
rename SparkLike...
EdAbati Dec 3, 2024
dac5901
rename native_to_narwhals_dtype
EdAbati Dec 3, 2024
6d67b0c
dtype unknown for decimal
EdAbati Dec 3, 2024
15ca58e
simplify return unknown
EdAbati Dec 3, 2024
ce4e2fb
update no_imports_tests
EdAbati Dec 4, 2024
d841ec5
level lazy for spark
EdAbati Dec 4, 2024
ac68a7e
add _change_dtypes
EdAbati Dec 4, 2024
9a1f741
Merge remote-tracking branch 'upstream/main' into pyspark
EdAbati Dec 4, 2024
2121c40
_change_version is back
EdAbati Dec 4, 2024
c0f44b6
fix no imports tests
EdAbati Dec 4, 2024
4b7895f
rename spark_like tests
EdAbati Dec 4, 2024
638c402
same error message as dask
EdAbati Dec 4, 2024
b46f1b5
remove extra expr._call
EdAbati Dec 5, 2024
a3e3dba
update coverage
EdAbati Dec 5, 2024
d8e6064
extract _columns_from_expr
EdAbati Dec 5, 2024
stddev for older pyspark
EdAbati committed Oct 13, 2024
commit 3a59240d62975761016dbc0032974b363f500b57
8 changes: 8 additions & 0 deletions narwhals/_pyspark/expr.py
@@ -7,6 +7,7 @@

 from narwhals._pyspark.utils import get_column_name
 from narwhals._pyspark.utils import maybe_evaluate
+from narwhals.utils import parse_version

 if TYPE_CHECKING:
     from pyspark.sql import Column
@@ -272,7 +273,14 @@ def _min(_input: Column) -> Column:
         return self._from_call(_min, "min", returns_scalar=True)

     def std(self, ddof: int = 1) -> Self:
+        import numpy as np  # ignore-banned-import
+
         def _std(_input: Column) -> Column:
+            if self._backend_version < (3, 4) or parse_version(np.__version__) > (2, 0):
+                from pyspark.sql.functions import stddev
+
+                _ = ddof
+                return stddev(_input)
             from pyspark.pandas.spark.functions import stddev

             return stddev(_input, ddof=ddof)
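
The fallback branch above ignores `ddof` and returns the plain sample standard deviation. As a minimal sketch of what `ddof` changes, the pure-Python snippet below computes the standard deviation for an arbitrary `ddof` and shows how a `ddof=1` result could be rescaled to another `ddof`. The helper names `std_with_ddof` and `rescale_sample_std` and the data in `a` and `b` are hypothetical and chosen for illustration only; none of them come from the PR.

```python
import math
import statistics


def std_with_ddof(values, ddof=1):
    """Standard deviation using `len(values) - ddof` as the divisor."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more observations than ddof")
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))


def rescale_sample_std(sample_std, n, ddof):
    """Convert a ddof=1 standard deviation into one for another ddof."""
    if n - ddof <= 0:
        raise ValueError("need more observations than ddof")
    return sample_std * math.sqrt((n - 1) / (n - ddof))


# Hypothetical data, not taken from the test suite.
a = [1, 2, 3]
b = [4, 4, 6]

# Direct computation for several ddof values.
a_ddof_1 = std_with_ddof(a, ddof=1)
a_ddof_0 = std_with_ddof(a, ddof=0)
b_ddof_2 = std_with_ddof(b, ddof=2)

# The same values recovered from the sample (ddof=1) standard deviation.
a_ddof_0_rescaled = rescale_sample_std(statistics.stdev(a), len(a), ddof=0)
b_ddof_2_rescaled = rescale_sample_std(statistics.stdev(b), len(b), ddof=2)
```

With this data the sample standard deviation of `a` is 1.0 for `ddof=1` and about 0.816497 for `ddof=0`, which is the kind of gap the two `expected` dictionaries in the test change encode.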
29 changes: 22 additions & 7 deletions tests/pyspark_test.py
@@ -12,11 +12,14 @@
 from typing import TYPE_CHECKING
 from typing import Any

+import numpy as np
 import pandas as pd
+import pyspark
 import pytest

 import narwhals.stable.v1 as nw
 from narwhals._exceptions import ColumnNotFoundError
+from narwhals.utils import parse_version
 from tests.utils import compare_dicts

 if TYPE_CHECKING:
@@ -360,13 +363,25 @@ def test_std(pyspark_constructor: Constructor) -> None:
         nw.col("b").std(ddof=2).alias("b_ddof_2"),
         nw.col("z").std(ddof=0).alias("z_ddof_0"),
     )
-    expected = {
-        "a_ddof_default": [1.0],
-        "a_ddof_1": [1.0],
-        "a_ddof_0": [0.816497],
-        "b_ddof_2": [1.632993],
-        "z_ddof_0": [0.816497],
-    }
+    if parse_version(pyspark.__version__) < (3, 4) or parse_version(np.__version__) > (
+        2,
+        0,
+    ):
+        expected = {
+            "a_ddof_default": [1.0],
+            "a_ddof_1": [1.0],
+            "a_ddof_0": [1.0],
+            "b_ddof_2": [1.154701],
+            "z_ddof_0": [1.0],
+        }
+    else:
+        expected = {
+            "a_ddof_default": [1.0],
+            "a_ddof_1": [1.0],
+            "a_ddof_0": [0.816497],
+            "b_ddof_2": [1.632993],
+            "z_ddof_0": [0.816497],
+        }
     compare_dicts(result, expected)
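
The version gate used in both files relies on Python's tuple comparison. The sketch below shows how such a check behaves for a few version strings. `parse_version_sketch` is a hypothetical stand-in that assumes the parsed version is a tuple of integers; it is not the `narwhals.utils.parse_version` implementation, and `uses_sample_std_fallback` is likewise an illustrative name.

```python
from __future__ import annotations

import re


def parse_version_sketch(version: str) -> tuple[int, ...]:
    """Hypothetical parser: keep the leading integer of each dotted part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def uses_sample_std_fallback(pyspark_version: str, numpy_version: str) -> bool:
    """Mirror the condition from the diff: old PySpark or NumPy above (2, 0)."""
    return parse_version_sketch(pyspark_version) < (3, 4) or parse_version_sketch(
        numpy_version
    ) > (2, 0)


old_pyspark = uses_sample_std_fallback("3.3.0", "1.26.4")
new_pyspark_old_numpy = uses_sample_std_fallback("3.5.0", "1.26.4")
new_pyspark_numpy_2 = uses_sample_std_fallback("3.5.0", "2.0.0")
two_component_numpy = uses_sample_std_fallback("3.5.0", "2.0")
```

Under this assumption a three-component version such as `(2, 0, 0)` compares greater than `(2, 0)`, because a longer tuple with an equal prefix sorts after the shorter one, so a `> (2, 0)` check also matches NumPy 2.0.0. A two-component `(2, 0)` would not match.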