'How to pass schema to create a new Dataframe from existing Dataframe?
To pass schema to a json file we do this:
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields = data_schema)
df =spark.read.json('people.json', schema=final_struc)
The above code works as expected. However now, I have data in table which I display by:
df = sqlContext.sql("SELECT * FROM people_json")
But if I try to pass a new schema to it by using following command it does not work.
df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)
It gives the following error:
sql() got an unexpected keyword argument 'schema'
NOTE: I am using Databrics Community Edition
- What am I missing?
- How do I pass the new schema if I have data in the table instead of some JSON file?
Solution 1:[1]
There is already one answer available but still I want to add something.
- Create DF from RDD
using toDF
newDf = rdd.toDF(schema, column_name_list)
using createDataFrame
newDF = spark.createDataFrame(rdd ,schema, [list_of_column_name])
- Create DF from other DF
suppose I have DataFrame with columns|data type - name|string, marks|string, gender|string.
if I want to get only marks as integer.
newDF = oldDF.select("marks")
newDF_with_int = newDF.withColumn("marks", df['marks'].cast('Integer'))
This will convert marks to integer.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | bhargav3vedi |
