Calculate z-score in pyspark
I am trying to remove outliers using the modified z-score in pyspark. Here's the code that I have written:
```python
from pyspark.sql import functions as pf

def remove_outliers(df):
    df.sort(df.total_amount.asc()).show(truncate=False)
    # Approximate median of total_amount
    median_amount = df.groupBy(df.total_amount).agg(
        pf.percentile_approx("total_amount", 0.5, pf.lit(1000000)).alias("median")
    )
    # Deviation of each fare from the median
    deviation_from_median_amount = df.total_amount - median_amount
    # Median absolute deviation, scaled by the 1.483 consistency constant
    mad_amount = 1.483 * abs(
        pf.percentile_approx(deviation_from_median_amount, 0.5, pf.lit(1000000)).alias("median")
    )
    # Modified z-score and outlier filter
    zscore_amount = deviation_from_median_amount / mad_amount
    amount_outliers = (df.total_amount > zscore_amount)
    df = df.drop(amount_outliers)
    return df
```
Here's the dataset on which I am working:
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
And here's the error that I get after running the above function:
```
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
```
Can you please guide me on what I am doing wrong here, and is this the correct way to find the z-score in pyspark?
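
For context, the `AttributeError` comes from `df.total_amount - median_amount`: `median_amount` is a DataFrame (the result of `groupBy().agg()`), not a scalar or Column, so Spark cannot subtract it from a column. Two other issues: grouping by `total_amount` computes a "median" per distinct value rather than one median for the whole column, and `drop()` removes columns, not rows; row removal is done with `filter()`. Below is a minimal sketch of one way the modified z-score filter could be written along those lines. It assumes Spark 3.1+ (where `pyspark.sql.functions.percentile_approx` is available), and the `threshold` parameter with its 3.5 cutoff is an illustrative choice, not something from the question:

```python
from pyspark.sql import functions as F

def remove_outliers(df, threshold=3.5):
    # Median of total_amount as a Python scalar; agg() over the whole frame,
    # not groupBy(), which would emit one row per distinct value.
    median = df.agg(
        F.percentile_approx("total_amount", 0.5, 1000000).alias("med")
    ).first()["med"]

    # MAD: median of the absolute deviations from that median.
    mad = df.agg(
        F.percentile_approx(
            F.abs(F.col("total_amount") - F.lit(median)), 0.5, 1000000
        ).alias("mad")
    ).first()["mad"]

    # Modified z-score, using the same 1.483 consistency constant as above.
    zscore = (F.col("total_amount") - F.lit(median)) / (1.483 * mad)

    # Keep rows whose modified z-score is within the threshold;
    # filter() removes rows, drop() only removes columns.
    return df.filter(F.abs(zscore) <= threshold)
```

Collecting the two medians to the driver is cheap here because each aggregate returns a single row; for per-group outlier removal you would instead join the per-group medians back onto the original DataFrame.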
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
