PySpark and the UDF types problem
Published 2020-11-02 12:20:05
Hello!
Here is a quick note on something that might not be obvious: be careful with UDF return types in PySpark.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, FloatType

# Returns a Python int
def very_fun(idk):
    return 22

# Returns a Python float
def floating_fun(idk):
    return 22.0

# sqlContext is assumed to exist already (e.g. in the PySpark shell)
df = sqlContext.createDataFrame(
    [
        (1, 'foo'),
        (2, 'bar'),
    ],
    ['id', 'txt']
)

# Register each function twice: once per declared return type
funfun_int = udf(very_fun, IntegerType())
funfun_float = udf(very_fun, FloatType())
floatingfun_int = udf(floating_fun, IntegerType())
floatingfun_float = udf(floating_fun, FloatType())

df = df.withColumn('funfun_int', funfun_int(df['id']))
df = df.withColumn('funfun_float', funfun_float(df['id']))
df = df.withColumn('floatingfun_int', floatingfun_int(df['id']))
df = df.withColumn('floatingfun_float', floatingfun_float(df['id']))
df.show()
And the result is not very amusing:
+---+---+----------+------------+---------------+-----------------+
| id|txt|funfun_int|funfun_float|floatingfun_int|floatingfun_float|
+---+---+----------+------------+---------------+-----------------+
|  1|foo|        22|        null|           null|             22.0|
|  2|bar|        22|        null|           null|             22.0|
+---+---+----------+------------+---------------+-----------------+
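Conclusion: know your types. A PySpark UDF is not going to do a cast for you: when the value the Python function returns does not match the declared return type, you silently get null instead of an error.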
Note: I haven't tested pandas UDFs in this case, but I suppose they are going to be a
bit more creative.
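To make the conclusion concrete, here is a minimal sketch of one way to do the cast yourself, on the Python side, before the value ever reaches Spark. The helper name `coerce_result` is my own invention for illustration and is not part of PySpark. It is plain Python, so it runs without a Spark session.

```python
from functools import wraps


def coerce_result(fn, caster):
    """Wrap fn so its return value is passed through caster (e.g. int or float).

    coerce_result is a hypothetical helper, not a PySpark API.
    None is passed through unchanged so missing values stay missing.
    """
    @wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        return None if result is None else caster(result)
    return wrapper


def very_fun(idk):
    return 22      # Python int


def floating_fun(idk):
    return 22.0    # Python float


# Force the Python return type to match the type you intend to declare
very_fun_as_float = coerce_result(very_fun, float)
floating_fun_as_int = coerce_result(floating_fun, int)

print(type(very_fun_as_float(1)).__name__)
print(type(floating_fun_as_int(1)).__name__)
```

With the DataFrame from the example, the wrapped functions would presumably be registered as `udf(very_fun_as_float, FloatType())` and `udf(floating_fun_as_int, IntegerType())`, so that the declared type and the actual Python type agree. Note that `int(22.9)` truncates to 22, so pick the caster deliberately.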