Not Able to Run PySpark in Google Colab
Hi, I am trying to run PySpark on Google Colab using the following code:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
I am getting the following error:
/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
106
107 if not os.path.isfile(conn_info_file):
--> 108 raise Exception("Java gateway process exited before sending its port number")
109
110 with open(conn_info_file, "rb") as info:
Exception: Java gateway process exited before sending its port number
NOTE: I was able to run this code until this afternoon; the error suddenly started appearing in the evening.
Solution 1:[1]
Please check whether wget is actually working. The download URL points at a regular Apache mirror, which typically hosts only current releases, so older Spark versions can disappear from it. When the download fails, the tarball is never extracted, findspark cannot locate the Spark folder, and the session fails to start with this error. If wget is not working, upload the latest Apache Spark archive to Google Drive, unpack it in Colab, and set SPARK_HOME to that path as shown in the question.
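A minimal sketch of that check, assuming the release has moved to archive.apache.org (the exact URL and version are assumptions; adjust them to whatever you need):
import os
import urllib.request

SPARK_TGZ = "spark-2.4.5-bin-hadoop2.7.tgz"
# archive.apache.org keeps old releases after they drop off the regular mirrors (assumed URL)
ARCHIVE_URL = "https://archive.apache.org/dist/spark/spark-2.4.5/" + SPARK_TGZ

if not os.path.isfile(SPARK_TGZ) or os.path.getsize(SPARK_TGZ) == 0:
    print("Tarball missing or empty, downloading from the Apache archive instead...")
    urllib.request.urlretrieve(ARCHIVE_URL, SPARK_TGZ)

print(SPARK_TGZ, os.path.getsize(SPARK_TGZ), "bytes")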
Solution 2:[2]
Here are the steps I always start with. First, remove a PPA that can cause unnecessary Ubuntu errors or Java port errors:
!sudo add-apt-repository --remove ppa:vikoadi/ppa
!sudo apt update
Second, install PySpark for a fresh start:
!pip install pyspark
Third, install Java and download the latest Spark release from the website (you can change the link if it gives an error and pick any version you like):
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark
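If you want the session to use the Spark build you just extracted rather than the pip-installed pyspark, you still need to point the environment variables at it and call findspark.init(), just as in the question. A minimal sketch, assuming the paths created by the commands above:
import os
import findspark

# Paths assume the openjdk-8 install and the spark-3.1.1 tarball extracted above
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

findspark.init()  # makes the SPARK_HOME installation importable as pyspark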
Fourth, create a session and configure the memory you need (4 GB here, for example):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("sol")
         .config("spark.driver.memory", "4g")  # set driver memory before the session is created
         .getOrCreate())
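To confirm the configuration actually took effect, you can inspect the running session, for example:
print(spark.sparkContext.master)                                 # e.g. local[*]
print(spark.sparkContext.getConf().get("spark.driver.memory"))   # e.g. 4g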
Fifth, check the session on some data:
from google.colab import files
files.upload()  # to upload the test file, for example mydata.csv
dataset = spark.read.csv('mydata.csv', inferSchema=True, header=True)
dataset.printSchema()
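As a quick sanity check on the loaded DataFrame (column names depend on whatever mydata.csv contains), for example:
dataset.show(5)                 # first 5 rows
print(dataset.count(), "rows")  # total row count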
Then I hope it's all good. Leave a comment if it doesn't work.
Solution 3:[3]
Just run the following commands.
Cell 1
!pip install pyspark
Cell 2
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local")\
    .appName("Colab")\
    .getOrCreate()

# 'url' is the path or URL of your CSV file
df = spark.read.option("header", True).format("csv").load(url)
df.show()
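One possible way to fill in the url placeholder in Colab is to upload a CSV with the files widget and load it by local path; a sketch, with mydata.csv as an assumed example filename:
from google.colab import files

files.upload()          # pick a CSV from your machine, e.g. mydata.csv
url = "mydata.csv"      # local path of the uploaded file (assumed name)
df = spark.read.option("header", True).format("csv").load(url)
df.show()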
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Patrick Klein |
| Solution 2 | |
| Solution 3 | Alan |
