Assign value to specific cell in PySpark DataFrame

I want to change a value in a specific cell of my Spark DataFrame using PySpark.

Trivial example - I create a mock Spark DataFrame:

df = spark.createDataFrame(
    [
     (1, 1.87, 'new_york'), 
     (4, 2.76, 'la'), 
     (6, 3.3, 'boston'), 
     (8, 4.1, 'detroit'), 
     (2, 5.70, 'miami'), 
     (3, 6.320, 'atlanta'), 
     (1, 6.1, 'houston')
    ],
    ('variable_1', "variable_2", "variable_3")
)

Running display(df) I get this table:

variable_1   variable_2   variable_3
    1           1.87    new_york
    4           2.76    la
    6           3.3     boston
    8           4.1     detroit
    2           5.7     miami
    3           6.32    atlanta
    1           6.1     houston

Let's say, for example, I would like to assign a new value to the cell in the 4th row and 3rd column, i.e. changing detroit to new_orleans. I know assignments such as df.iloc[4, 3] = 'new_orleans' or df.loc[4, 'detroit'] = 'new_orleans' are not valid in Spark.

A valid answer to my question using when would be:

from pyspark.sql.functions import when

targetDf = df.withColumn(
    "variable_3",
    when((df["variable_1"] == 8) & (df["variable_2"] == 4.1), 'new_orleans').otherwise(df["variable_3"])
)

My question is: could this be done in a more practical way in PySpark without the necessity of entering all the values and column names of the row where I want to change just 1 single cell (maybe achieving the same without using the when function)?

Thanks in advance for your help, and thanks to @useruser9806664 for his feedback.



Solution 1:[1]

Spark DataFrames are immutable, don't provide random access and are, strictly speaking, unordered. As a result:

  • You cannot assign anything (because of immutability).
  • You cannot access a specific row (because there is no random access).
  • Row "indices" are not well defined (because rows are unordered).

What you can do is create a new DataFrame with a new column that replaces the existing one, using some conditional expression, which is already covered by the answers you found.

Also, monotonically_increasing_id doesn't add indices (row numbers). It adds monotonically increasing numbers, which are not necessarily consecutive and don't necessarily start from any particular value (for example, when there are empty partitions).
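A quick way to see the difference, reusing the example df from the question (the exact IDs depend on how the rows happen to be partitioned):

from pyspark.sql.functions import monotonically_increasing_id

# With more than one partition the IDs jump between partitions: partition 0
# yields 0, 1, 2, ... while partition 1 starts at 8589934592 (2^33), so the
# values are increasing but neither consecutive nor usable as row numbers.
df.repartition(2).withColumn("id", monotonically_increasing_id()).show()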

Solution 2:[2]

You can create a row number using the underlying RDD:

from pyspark.sql import Row

# Function to update dataframe row with a rownumber
def create_rownum(ziprow):
    row, index=ziprow
    row=row.asDict()
    row['rownum']= index
    return(Row(**row))

# First create a rownumber then add to dataframe
df.rdd.zipWithIndex().map(create_rownum).toDF().show()

Now you can filter the DataFrame on the row number you want.
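For example, a sketch that combines the row number with the when approach from the question (zipWithIndex starts at 0, so the 4th displayed row has rownum 3; keep Solution 1's caveat in mind that row order is not guaranteed):

from pyspark.sql.functions import when, col

# Attach the row number, key the update on it, then drop the helper column
indexed = df.rdd.zipWithIndex().map(create_rownum).toDF()

targetDf = indexed.withColumn(
    "variable_3",
    when(col("rownum") == 3, "new_orleans").otherwise(col("variable_3"))
).drop("rownum")

targetDf.show()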

Solution 3:[3]

I had the same problem but I used SQL expressions:

expr = """CASE WHEN variable1==8 AND variable==4.1 THEN 'new_orleans' ELSE variable3 END AS variable4"""

df = df.selectExpr(*,expr)
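If the goal is to overwrite variable_3 rather than add a fourth column, a minimal variation of the same idea (still a sketch, using the column names from the question's example):

# Alias the CASE expression back to the original column name and list the
# other columns explicitly so variable_3 is replaced in place.
expr = """CASE WHEN variable_1 = 8 AND variable_2 = 4.1 THEN 'new_orleans' ELSE variable_3 END AS variable_3"""

df = df.selectExpr("variable_1", "variable_2", expr)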

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1
Solution 2    ags29
Solution 3    Francy Camacho