Dropping columns based on some value in PySpark
I have the following df, and I want to drop all the columns that contain the string https:
df = spark.createDataFrame([
    ('https:john', 'john', 1.1, 'httpsasd'),
    ('https:john', 'john', 1.2, 'httpsasd')
], ['website', 'name', 'value', 'other'])
I have found an answer, but it filters based on the column names rather than on a string contained in the column values: PySpark drop columns based on column names / String condition
What I am looking for is an output as follows:
name | value
-----|------
john | 1.1
john | 1.2
Solution 1:[1]
Use a conditional count for each column to check whether it contains at least one value matching https, then drop the columns with count > 0:
from pyspark.sql import functions as F

# Count, per column, how many rows have a value matching "https".
cols_https_count = df.select(*[
    F.count(F.when(F.col(c).rlike("https"), 1)).alias(c)
    for c in df.columns
]).collect()[0].asDict()

# Drop every column that matched at least once.
cols_to_drop = [k for k, v in cols_https_count.items() if v > 0]
df = df.drop(*cols_to_drop)
df.show()
#+----+-----+
#|name|value|
#+----+-----+
#|john| 1.1|
#|john| 1.2|
#+----+-----+
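For reference, a minimal sketch of the same idea using a boolean aggregate instead of a count: flag each column with 1 if any row matches, then drop the flagged columns. The explicit cast to string is an assumption added here so rlike behaves consistently on the numeric value column; df_clean is a hypothetical name for the result.

from pyspark.sql import functions as F

# Flag each column: 1 if any row's (stringified) value matches "https", else 0.
flags = df.select(*[
    F.max(F.col(c).cast("string").rlike("https").cast("int")).alias(c)
    for c in df.columns
]).collect()[0].asDict()

# Drop the flagged columns.
df_clean = df.drop(*[c for c, flagged in flags.items() if flagged == 1])
df_clean.show()

This produces the same two-column name/value output shown above.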
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | blackbishop |
