Run a SQL query on a DataFrame via PySpark
I would like to run a SQL query on a DataFrame. Do I have to create a view on the DataFrame first, or is there an easier way?
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1),
]).toDF('id', 'foo', 'bar')
I want to run some complex queries against this DataFrame. For example, I can do:
df.createOrReplaceTempView("temp_view")
df_new = spark.sql("select id, max(foo) from temp_view group by id")
but do I have to convert it to a view first before querying it? I know there is a DataFrame-equivalent operation; the query above is only an example.
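On recent Spark versions there is in fact an easier way: since Spark 3.4, spark.sql accepts a DataFrame as a format-style keyword argument, so no explicitly named view is needed. A minimal sketch, assuming Spark 3.4 or later and the df defined above:

df_new = spark.sql("select id, max(foo) from {df} group by id", df=df)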
Solution 1:[1]
You can just do
df.select('id', 'foo')
This returns a new Spark DataFrame with the columns id and foo, the equivalent of select id, foo in SQL, with no view required.
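For the grouped query in the question specifically, the DataFrame-API equivalent is a groupBy followed by an aggregate. A minimal sketch, using pyspark.sql.functions and the df from the question (the alias max_foo is just an illustrative name):

from pyspark.sql import functions as F

# Same result as: select id, max(foo) from temp_view group by id
df_new = df.groupBy('id').agg(F.max('foo').alias('max_foo'))

This produces the same result as the SQL above, again without creating a view.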
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | fuzzy-memory |
