Calculate z-score in PySpark

I am trying to remove outliers using the modified z-score in pyspark. Here's the code that I have written:

import pyspark.sql.functions as pf

def remove_outliers(df):
    df.sort(df.total_amount.asc()).show(truncate=False)

    # trying to get the median of total_amount
    median_amount = df.groupBy(df.total_amount).agg(
        pf.percentile_approx("total_amount", 0.5, pf.lit(1000000)).alias("median")
    )

    # deviation of each amount from the median
    deviation_from_median_amount = df.total_amount - median_amount

    # 1.483 * MAD (the scaled median absolute deviation)
    mad_amount = 1.483 * abs(
        pf.percentile_approx(deviation_from_median_amount, 0.5, pf.lit(1000000)).alias("median")
    )

    # modified z-score and the outlier condition
    zscore_amount = deviation_from_median_amount / mad_amount
    amount_outliers = (df.total_amount > zscore_amount)

    df = df.drop(amount_outliers)
    return df
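
For reference, the modified z-score I am trying to implement is M = 0.6745 * (x - median) / MAD, where MAD = median(|x - median|); dividing by 1.483 * MAD should be the same thing, since 1 / 1.4826 ≈ 0.6745. Below is a rough, untested sketch of what I expected the computation to look like, with the two medians pulled out as plain Python floats via approxQuantile (the function name, the 0.001 relative error, and the 3.5 cutoff are just my own placeholders):

import pyspark.sql.functions as pf

def remove_outliers_sketch(df, threshold=3.5):
    # overall median of total_amount as a plain Python float
    median = df.approxQuantile("total_amount", [0.5], 0.001)[0]

    # MAD = median of the absolute deviations from that median
    mad = (
        df.withColumn("abs_dev", pf.abs(pf.col("total_amount") - pf.lit(median)))
          .approxQuantile("abs_dev", [0.5], 0.001)[0]
    )

    # modified z-score: M = 0.6745 * (x - median) / MAD
    zscore = 0.6745 * (pf.col("total_amount") - pf.lit(median)) / pf.lit(mad)

    # keep rows whose score is within the cutoff
    return df.filter(pf.abs(zscore) < threshold)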

Here's the dataset on which I am working:

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv

And here's the error that I get after running the above function:

AttributeError: 'DataFrame' object has no attribute '_get_object_id' 
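
My guess is that the error comes from mixing a whole DataFrame into a column expression, since median_amount is the DataFrame returned by groupBy(...).agg(...) rather than a single value. I would expect a stripped-down line like this to fail the same way (my assumption, not verified):

# median_df is a DataFrame, not a Column or a scalar, so subtracting
# it from a Column should trigger the same AttributeError
median_df = df.agg(pf.percentile_approx("total_amount", 0.5).alias("median"))
deviation = df.total_amount - median_df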

Can you please guide me on what I am doing wrong here? And is this the correct way to compute the modified z-score in PySpark?


