dataframe-session-2022-05-19-Converting-DataFrame-to-RDD.txt
This demo shows how to convert
1. a DataFrame to an RDD
2. an RDD to a DataFrame
~ % /Users/mparsian/spark-3.2.1/bin/pyspark
Python 3.8.9 (default, Jul 19 2021, 09:37:32)
Welcome to Spark version 3.2.1
Spark context Web UI available at http://10.0.0.234:4041
Spark context available as 'sc' (master = local[*], app id = local-1653016254174).
SparkSession available as 'spark'.
>>> data = [('alex', 'sales', 23000), ('jane', 'HR', 29000), ('bob', 'sales', 43000),('mary', 'HR', 93000)]
>>> data
[('alex', 'sales', 23000), ('jane', 'HR', 29000), ('bob', 'sales', 43000), ('mary', 'HR', 93000)]
>>> df = spark.createDataFrame(data, ['name', 'dept', 'salary'])
>>> df.show()
+----+-----+------+
|name| dept|salary|
+----+-----+------+
|alex|sales| 23000|
|jane| HR| 29000|
| bob|sales| 43000|
|mary| HR| 93000|
+----+-----+------+
>>> df.printSchema()
root
|-- name: string (nullable = true)
|-- dept: string (nullable = true)
|-- salary: long (nullable = true)
>>> rdd5 = df.rdd
>>> rdd5.collect()
[
Row(name='alex', dept='sales', salary=23000),
Row(name='jane', dept='HR', salary=29000),
Row(name='bob', dept='sales', salary=43000),
Row(name='mary', dept='HR', salary=93000)
]
>>>
>>> df2 = rdd5.toDF()
>>> df2.show()
+----+-----+------+
|name| dept|salary|
+----+-----+------+
|alex|sales| 23000|
|jane| HR| 29000|
| bob|sales| 43000|
|mary| HR| 93000|
+----+-----+------+
>>> from pyspark.sql import Row
>>> # NOTE: to convert an RDD into a DataFrame,
>>> # each Row() must have the same column names:
>>> rows = [
...     Row(name='alex', dept='sales', salary=23000),
...     Row(name='jane', dept='HR', salary=29000, address='123 main street')
... ]
>>> rdd = sc.parallelize(rows)
>>> rdd.collect()
[Row(name='alex', dept='sales', salary=23000), Row(name='jane', dept='HR', salary=29000, address='123 main street')]
>>> df44 = rdd.toDF()
>>> df44.show()
22/05/19 20:21:51 ERROR Executor: Exception in task 10.0 in stage 15.0 (TID 100)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 3 fields are required while 4 values are provided.
...
>>> # create Row()'s which have the same columns
>>> rows = [
...     Row(name='alex', dept='sales', salary=23000, address=None),
...     Row(name='jane', dept='HR', salary=29000, address='123 main street')
... ]
>>> rdd = sc.parallelize(rows)
>>> df44 = rdd.toDF()
>>> df44.show()
+----+-----+------+---------------+
|name| dept|salary| address|
+----+-----+------+---------------+
|alex|sales| 23000| null|
|jane| HR| 29000|123 main street|
+----+-----+------+---------------+
>>>
>>> some_data = [('alex', 10), ('jane', 20)]
>>> rdd3 = sc.parallelize(some_data)
>>> rdd3.collect()
[('alex', 10), ('jane', 20)]
>>> rdd3_with_rows = rdd3.map(lambda x: Row(name=x[0], age=x[1]))
>>> rdd3_with_rows.collect()
[Row(name='alex', age=10), Row(name='jane', age=20)]
>>> df3 = rdd3_with_rows.toDF()
>>> df3.show()
+----+---+
|name|age|
+----+---+
|alex| 10|
|jane| 20|
+----+---+