RDD vs DataFrame in PySpark
I just read that a dataframe has 2-dimensional, array-like storage, while an rdd does not impose any such constraints on storage.
Because of this, queries can be run in a more optimized way on dataframes.
Does this mean that creating a dataframe consumes more memory than creating an rdd over the same input dataset?
Also, if I have an rdd defined as rdd1 and I use the toDF method to convert rdd1 into a dataframe, am I consuming more memory on the node?
Similarly, if I have a dataframe and I convert it to an rdd using df.rdd, am I freeing some space on the node?
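For concreteness, a minimal sketch of the two conversions being asked about, assuming a local SparkSession (column names and sample data are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

# rdd1 holds plain Python tuples; the column names below are hypothetical.
rdd1 = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])

# RDD -> DataFrame: does this allocate extra memory on the nodes?
df = rdd1.toDF(["id", "letter"])

# DataFrame -> RDD: does this free any space?
rdd2 = df.rdd
```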
Solution 1:[1]
RDD:
Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel; it is the fundamental data structure of Spark.

- Through RDDs we can process structured as well as unstructured data, but the user must handle any schema manually; an RDD cannot infer one on its own.
- An RDD is a distributed collection of data elements spread across many machines in the cluster: a set of Scala, Java, or Python objects representing data.
- RDDs support an object-oriented programming style with compile-time type safety (in Scala and Java).
- RDDs are immutable: once created, an RDD cannot be changed.
- If an RDD is in tabular form, we can move from RDD to dataframe with the toDF() method, and back again via df.rdd.
- There is no built-in optimization engine for RDDs; developers optimize each RDD by hand, based on its attributes.
- Spark does not compute results right away; it evaluates RDDs lazily.
- Because the RDD API uses schema projection explicitly, the user needs to define any schema manually.
- For simple grouping and aggregation operations, the RDD API is slower than the DataFrame API.
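A minimal sketch of the RDD side, assuming a local SparkSession (the sample strings are hypothetical): a schema-less collection of raw text, lazy transformations, and an action that finally triggers computation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD of raw strings: no schema, just a distributed collection of objects.
lines = sc.parallelize([
    "spark makes rdds",
    "rdds are lazy",
    "rdds are immutable",
])

# Transformations only build a lineage; nothing is computed yet (lazy evaluation).
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

# Only this action triggers the actual computation.
print(word_counts.collect())
```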
DataFrame:
- Dataframe data is organized into named columns; essentially, it is the same as a table in a relational database.
- If we try to access a column that is not present in the table, an attribute error occurs only at runtime; dataframes do not offer compile-time type safety in such cases.
- One cannot regenerate a typed domain object after transforming it into a dataframe: for example, if we create a dataframe from an RDD of a domain class, we cannot recover the original RDD of that class again.
- Optimization happens through the Catalyst Optimizer; dataframes use the Catalyst tree transformation framework in four phases.
- Off-heap memory is used for serialization, which reduces overhead, and bytecode is generated so that many operations can be performed directly on serialized data.
- Like RDDs, dataframes are evaluated lazily: computation happens only when an action is called.
- There is no need to specify a schema; generally, it is discovered automatically.
- For exploratory analysis and for computing aggregated statistics on data, dataframes are faster.
- Use dataframes when you need a high level of abstraction over structured or semi-structured data; for unstructured data, such as media streams or streams of text, RDDs are the better fit.
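A minimal sketch of these points, assuming a local SparkSession (column names and data are hypothetical): schema inference from the data, a grouping/aggregation that runs through Catalyst, and explain() to inspect the plans the optimizer produced.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Column types are inferred from the data; no manual schema definition needed.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("alice", 30)],
    ["name", "age"],  # hypothetical column names
)

# Grouping and aggregation go through the Catalyst optimizer.
agg = df.groupBy("name").agg(F.avg("age").alias("avg_age"))

# explain() prints the optimized logical and physical plans; show() is the
# action that actually triggers computation (dataframes are lazy too).
agg.explain()
agg.show()
```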
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | narasimha |
