Not Able to Run PySpark in Google Colab
Hi, I am trying to run PySpark on Google Colab using the following code:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
I am getting the following error:
/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
106
107 if not os.path.isfile(conn_info_file):
--> 108 raise Exception("Java gateway process exited before sending its port number")
109
110 with open(conn_info_file, "rb") as info:
Exception: Java gateway process exited before sending its port number
NOTE: I was able to run this code until this afternoon; the error suddenly started appearing in the evening.
Solution 1:[1]
Please check whether wget is actually working. The download URL points at a regular Apache mirror, which typically hosts only current releases, so older Spark versions can disappear from it. When the download fails, the tarball is never extracted, findspark cannot locate the Spark folder, and the session fails to start with this error. If wget is not working, upload the latest Apache Spark archive to Google Drive, unpack it in Colab, and set SPARK_HOME to that path as shown in the question.
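A minimal sketch of that check, assuming the release has moved to archive.apache.org (the exact URL and version are assumptions; adjust them to whatever you need):
import os
import urllib.request

SPARK_TGZ = "spark-2.4.5-bin-hadoop2.7.tgz"
# archive.apache.org keeps old releases after they drop off the regular mirrors (assumed URL)
ARCHIVE_URL = "https://archive.apache.org/dist/spark/spark-2.4.5/" + SPARK_TGZ

if not os.path.isfile(SPARK_TGZ) or os.path.getsize(SPARK_TGZ) == 0:
    print("Tarball missing or empty, downloading from the Apache archive instead...")
    urllib.request.urlretrieve(ARCHIVE_URL, SPARK_TGZ)

print(SPARK_TGZ, os.path.getsize(SPARK_TGZ), "bytes")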
Solution 2:[2]
Here are the steps I always start with. First, remove a PPA that can cause unnecessary Ubuntu errors or Java port errors:
!sudo add-apt-repository --remove ppa:vikoadi/ppa
!sudo apt update
Second, install PySpark for a fresh start:
!pip install pyspark
Third, install Java and download the latest Spark release from the website (you can change the link if it gives an error and pick any version you like):
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark
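If you want the session to use the Spark build you just extracted rather than the pip-installed pyspark, you still need to point the environment variables at it and call findspark.init(), just as in the question. A minimal sketch, assuming the paths created by the commands above:
import os
import findspark

# Paths assume the openjdk-8 install and the spark-3.1.1 tarball extracted above
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

findspark.init()  # makes the SPARK_HOME installation importable as pyspark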
Fourth, create a session and configure the memory you need (4 GB here, for example):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("sol")
         .config("spark.driver.memory", "4g")  # set driver memory before the session is created
         .getOrCreate())
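To confirm the configuration actually took effect, you can inspect the running session, for example:
print(spark.sparkContext.master)                                 # e.g. local[*]
print(spark.sparkContext.getConf().get("spark.driver.memory"))   # e.g. 4g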
Fifth, check the session on some data:
from google.colab import files
files.upload()  # to upload the test file, for example mydata.csv
dataset = spark.read.csv('mydata.csv', inferSchema=True, header=True)
dataset.printSchema()
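As a quick sanity check on the loaded DataFrame (column names depend on whatever mydata.csv contains), for example:
dataset.show(5)                 # first 5 rows
print(dataset.count(), "rows")  # total row count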
Then I hope it's all good. Leave a comment if it doesn't work.
Solution 3:[3]
Just run the following commands.
Cell 1
!pip install pyspark
Cell 2
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local")\
    .appName("Colab")\
    .getOrCreate()

# 'url' is the path or URL of your CSV file
df = spark.read.option("header", True).format("csv").load(url)
df.show()
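One possible way to fill in the url placeholder in Colab is to upload a CSV with the files widget and load it by local path; a sketch, with mydata.csv as an assumed example filename:
from google.colab import files

files.upload()          # pick a CSV from your machine, e.g. mydata.csv
url = "mydata.csv"      # local path of the uploaded file (assumed name)
df = spark.read.option("header", True).format("csv").load(url)
df.show()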
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Patrick Klein |
| Solution 2 | |
| Solution 3 | Alan |
