Convert into a pandas dataframe after finding missing values in a spark dataframe
I am utilizing the following to find missing values in my spark df:
from pyspark.sql.functions import col, sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
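For reference, a pure-pandas analogue of that per-column count (my own illustration, not part of the question; note that pandas' isna() flags NaN as missing, whereas Spark's isNull() does not treat NaN as null):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; pdf.isna().sum() mirrors the Spark
# sum(col(c).isNull().cast("int")) count per column.
pdf = pd.DataFrame({"name": ["James", "Ram"],
                    "state": ["CA", None],
                    "number": [np.nan, 200.0]})
missing = pdf.isna().sum()
print(missing)
```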
from my sample spark df below:
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [
("James", "CA", np.nan), ("Julia", "", None),
("Ram", None, 200.0), ("Ramya", "NULL", np.nan)
]
df = spark.createDataFrame(data, ["name", "state", "number"])
df.show()
How can I convert the result of the missing-count code above into a pandas dataframe? My real df has 26 columns, and displaying the counts in a spark df is messy and misaligned.
Solution 1:[1]
This might not be as clean as an actual rendered pandas table, but hopefully it works for you.
From your first code, remove the .show() call:
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns))
You can assign the result to a variable and then chain the toPandas() call:
sdf = df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns))
new_df = sdf.toPandas().T
print(new_df)
The .T call transposes the dataframe, turning each original column into a row. With many columns, the untransposed output gets truncated and you cannot see all of them.
Again, this does not render an actual table, but it is at least more readable than a spark df.
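To see why the transpose helps, here is a Spark-free sketch: a one-row frame stands in for what sdf.toPandas() returns (the counts below are made up for illustration), and transposing gives one row per original column. The label missing_count is my own choice, purely cosmetic:

```python
import pandas as pd

# Stand-in for sdf.toPandas(): one row, one column per original column.
wide = pd.DataFrame({"name": [0], "state": [1], "number": [2]})

tall = wide.T                      # one row per original column
tall.columns = ["missing_count"]   # optional cosmetic label
print(tall)
```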
UPDATE: Note that toPandas() already returns a pandas dataframe, so new_df above is one; wrapping it again with pd.DataFrame(new_df) only makes a copy. If you prefer the rendered table look, display new_df as the last expression of a notebook cell (rather than print-ing it) and Jupyter will render it as an HTML table.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
