Using pyspark in Google Colab

This is my first question here after using Stack Overflow a lot, so correct me if I give inaccurate or incomplete info.

Up until this week I had a Colab notebook set up to run PySpark, following one of the many guides I found around the internet, but this week it started throwing a few different errors.

The code used is pretty much this one:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()

I have tried changing the Java version from 8 to 11, using all of the available Spark builds on https://downloads.apache.org/spark/, and changing the HOME paths accordingly. I used pip freeze, as one guide suggested, to check the PySpark version used in Colab; it reported pyspark 3.0.0, so I tried every 3.0.0 build, and all I keep getting is this error:

Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly

I don't understand much about why Java is needed for this, but I also tried installing py4j through !pip install py4j, and it says it is already installed when I do. I have tried every different guide on the internet, but I can't run my Spark code anymore. Does anyone know how to fix this? I only use Colab for college work because my PC is quite outdated, and I don't know much about any of this, but I really need to get this notebook running reliably. Also, how do I know when it's time to update the builds I import?
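
For reference, a quick way to check whether the extracted Spark build actually contains the py4j library that findspark looks for is to list its python/lib folder. This is only a diagnostic sketch, reusing the paths from the snippet above:

import os

spark_home = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
# does the extracted folder exist at all?
print(os.path.exists(spark_home))
# a healthy build lists a py4j-*.zip next to pyspark.zip here
print(os.listdir(os.path.join(spark_home, "python", "lib")))

If the first check prints False, the wget/tar step did not actually produce that folder, which would explain the findspark error.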



Solution 1:[1]

Following this Colab notebook, which worked for me:

First cell:

!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

and that pretty much installs pyspark.
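
If you want to confirm the install before going any further, a quick sanity check (not part of the original cell, just a sketch) is:

import pyspark
# prints the version that pip installed
print(pyspark.__version__)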

But do follow these steps to also launch the Spark UI, which is super helpful for understanding physical plans, storage usage, and much more. Also: it has nice graphs ;)

Second cell:

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# configure the context so the UI listens on port 4050
conf = SparkConf().set("spark.ui.port", "4050")

# create the context
sc = SparkContext(conf=conf)

# create the session on top of the existing context
spark = SparkSession.builder.getOrCreate()
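
Once the session is up, it helps to have something for the Spark UI to display; a tiny throwaway job like the one below (a toy example, not from the original notebook) will populate the Jobs tab:

# run a trivial job so the UI has jobs and stages to show
df = spark.range(0, 100000)
print(df.count())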

Third cell:

!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!sleep 10
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

after which you'll see a URL where you'll find the Spark UI; my example output was:

--2020-10-03 11:30:58--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 52.203.78.32, 52.73.16.193, 34.205.238.171, ...
Connecting to bin.equinox.io (bin.equinox.io)|52.203.78.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.1’

ngrok-stable-linux- 100%[===================>]  13.13M  13.9MB/s    in 0.9s    

2020-10-03 11:31:00 (13.9 MB/s) - ‘ngrok-stable-linux-amd64.zip.1’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ngrok                   
http://989c77d52223.ngrok.io

and that last element, http://989c77d52223.ngrok.io, was where my Spark UI lived.

Solution 2:[2]

@Victor, I also had a similar problem. This is what I did:

  1. Download your existing Jupyter notebook from Colab to your computer.

  2. Create a new notebook in Colab.

  3. Execute the following:

    !pip3 install pyspark

  4. Upload your notebook to the same Colab session.

  5. Run a Spark session and check (see the sketch below).
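
A minimal version of step 5 could look like this (just a sketch; the app name is arbitrary):

from pyspark.sql import SparkSession

# start (or reuse) a session and confirm it responds
spark = SparkSession.builder.appName("colab-check").getOrCreate()
print(spark.version)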

Solution 3:[3]

Spark version 2.3.2 works very well in Google Colab. Just follow these steps:

!pip install pyspark==2.3.2
import pyspark 

Check the version we have installed:

pyspark.__version__

Try to create a SparkSession:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sparkify").getOrCreate()

And you can now use Spark in Colab. Enjoy!
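
As a quick smoke test that the session really works (a toy example; the data and column names are made up):

# build a tiny DataFrame and display it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()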

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dharman
Solution 2 Nihad TP
Solution 3 Fahd Zaghdoudi