Error converting an RDD to a DataFrame in PySpark
I am trying to convert an RDD into a DataFrame. The conversion itself appears to succeed, and df.show() displays the first rows without problems, but I get an error when I call df.count() or df.collect(). Here is my code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
sc = SparkContext(appName = 'ANALYSIS', master = 'local')
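rdd = sc.textFile('file.csv')
# Assumption: `header` was not defined in the original snippet; here it is taken to be the first line of the file.
header = rdd.first()
rdd = rdd.filter(lambda line: line != header)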
rdd = rdd.map(lambda line: line.rsplit(',', 6))
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ANALYSIS") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
feature = ['to_drop','watched','watching','wantwatch','dropped','rating','votes']
df = spark.createDataFrame(rdd, schema = feature)
rdd.collect()  # works
df.show()      # works
df.count()     # does not work
Can someone point out what I am doing wrong? Thanks.
The error I get during execution is the following:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-15-3c9a60fd698f> in <module>
----> 1 df.count()
/opt/conda/lib/python3.8/site-packages/pyspark/sql/dataframe.py in count(self)
662 2
663 """
--> 664 return int(self._jdf.count())
665
666 def collect(self):
/opt/conda/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/opt/conda/lib/python3.8/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
/opt/conda/lib/python3.8/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
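The traceback above is cut off before the underlying Java exception, so the actual cause cannot be read from it. One thing worth checking, as a guess rather than a confirmed diagnosis: df.show() only needs the first 20 rows by default, while df.count() and df.collect() have to process every row, so a malformed line further down the file would only surface in the latter. The sketch below uses plain Python and made-up sample lines (not taken from file.csv) to show how line.rsplit(',', 6) returns fewer than 7 fields when a line has fewer than 6 commas, which would not match the 7 names in feature.

```python
# Hypothetical sample lines standing in for file.csv (not from the original post).
lines = [
    "title,watched,watching,wantwatch,dropped,rating,votes",
    "Some, Title,10,2,3,1,7.5,100",   # extra comma in the title, still 7 fields
    "Broken row,10,2",                # too few commas, only 3 fields
]

EXPECTED_FIELDS = 7  # length of the `feature` list in the question


def split_line(line):
    # Same split as in the question: at most 6 splits, counted from the right.
    return line.rsplit(',', 6)


header = lines[0]
rows = [split_line(line) for line in lines if line != header]

# Rows whose field count does not match the schema.
bad_rows = [row for row in rows if len(row) != EXPECTED_FIELDS]
print(bad_rows)
```

On the real data, the same check could be run on the RDD before calling createDataFrame, for example with rdd.filter(lambda row: len(row) != 7).take(5), to see whether any such rows exist.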
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow