Assign value to a specific cell in a PySpark DataFrame
I want to change a value in a specific cell of my Spark DataFrame using PySpark.
Trivial example - I create a mock Spark DataFrame:
df = spark.createDataFrame(
    [
        (1, 1.87, 'new_york'),
        (4, 2.76, 'la'),
        (6, 3.3, 'boston'),
        (8, 4.1, 'detroit'),
        (2, 5.70, 'miami'),
        (3, 6.320, 'atlanta'),
        (1, 6.1, 'houston')
    ],
    ('variable_1', 'variable_2', 'variable_3')
)
Running display(df) I get this table:
| variable_1 | variable_2 | variable_3 |
|---|---|---|
| 1 | 1.87 | new_york |
| 4 | 2.76 | la |
| 6 | 3.3 | boston |
| 8 | 4.1 | detroit |
| 2 | 5.7 | miami |
| 3 | 6.32 | atlanta |
| 1 | 6.1 | houston |
Let's say, for example, that I would like to assign a new value to the cell in the 4th row and 3rd column, i.e. changing detroit to new_orleans. I know that assignments such as df.iloc[4, 3] = 'new_orleans' or df.loc[4, 'variable_3'] = 'new_orleans' are not valid in Spark.
A valid answer to my question using when would be:
from pyspark.sql.functions import when
targetDf = df.withColumn(
    "variable_3",
    when((df["variable_1"] == 8) & (df["variable_2"] == 4.1), 'new_orleans').otherwise(df["variable_3"])
)
My question is: could this be done in a more practical way in PySpark without the necessity of entering all the values and column names of the row where I want to change just 1 single cell (maybe achieving the same without using the when function)?
Thanks in advance for your help, and thanks to @useruser9806664 for the feedback.
Solution 1:[1]
Spark DataFrames are immutable, don't provide random access and are, strictly speaking, unordered. As a result:
- You cannot assign anything (because of immutability).
- You cannot access a specific row (because there is no random access).
- Row "indices" are not well defined (because the data is unordered).
What you can do is create a new DataFrame with a new column that replaces the existing one, using a conditional expression, which is already covered by the answers you found.
Also, monotonically_increasing_id doesn't add indices (row numbers). It adds monotonically increasing numbers that are not necessarily consecutive or that start from any particular value (in the case of empty partitions).
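A minimal sketch illustrating that behavior (the column name id is just illustrative):

```python
from pyspark.sql.functions import monotonically_increasing_id

# The generated values are unique and monotonically increasing,
# but typically not consecutive: the upper bits encode the
# partition id, so they cannot be read as row numbers.
df.withColumn("id", monotonically_increasing_id()).show()
```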
Solution 2:[2]
You can create a row number using the underlying RDD:
from pyspark.sql import Row

# Function to attach a row number to each DataFrame row
def create_rownum(ziprow):
    row, index = ziprow
    row = row.asDict()
    row['rownum'] = index
    return Row(**row)

# First create a row number, then add it to the DataFrame
df.rdd.zipWithIndex().map(create_rownum).toDF().show()
Now you can filter the DataFrame on that row number to reach the row you want.
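A minimal sketch of how the row number could then be used to change a single cell; the names df_indexed and target_df are illustrative, and the row of interest is assumed to be the 4th row (rownum == 3):

```python
from pyspark.sql.functions import when, col

# Attach the row numbers produced above (illustrative name)
df_indexed = df.rdd.zipWithIndex().map(create_rownum).toDF()

# Replace variable_3 only on the row whose rownum is 3 (the 4th row),
# leaving every other row untouched
target_df = df_indexed.withColumn(
    "variable_3",
    when(col("rownum") == 3, "new_orleans").otherwise(col("variable_3"))
).drop("rownum")

target_df.show()
```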
Solution 3:[3]
I had the same problem, but I used SQL expressions:
expr = """CASE WHEN variable_1 = 8 AND variable_2 = 4.1 THEN 'new_orleans' ELSE variable_3 END AS variable_4"""
df = df.selectExpr("*", expr)
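If the goal is to replace variable_3 in place rather than add a fourth column, a variant of the same idea (a sketch, assuming the column names of the mock DataFrame above) aliases the CASE expression back to variable_3:

```python
# Keep the other columns as-is and overwrite variable_3 with the CASE result
expr = "CASE WHEN variable_1 = 8 AND variable_2 = 4.1 THEN 'new_orleans' ELSE variable_3 END AS variable_3"
df = df.selectExpr("variable_1", "variable_2", expr)
```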
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | ags29 |
| Solution 3 | Francy Camacho |
