Error while installing Spark on Google Colab

I am getting an error while installing Spark on Google Colab. It says:

tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now

These were my steps:

(screenshot of the commands omitted)



Solution 1:[1]

The problem is the download link you are using to fetch Spark:

http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz

To download Spark without any problems, use the Apache archive website (https://archive.apache.org/dist/spark), where older releases remain available after they disappear from the regular mirrors.

For example, the following download link from their archive website works fine:

https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

Here is the complete code to install and set up Java, Spark, and PySpark:

# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# extract the Spark archive into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# point JAVA_HOME and SPARK_HOME at the installed locations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"


# install findspark using pip
!pip install -q findspark

Python users should also install PySpark with the following command:

!pip install pyspark
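
With everything installed, a short sanity check (a minimal sketch along the lines of Solutions 2 and 5 below) locates Spark through findspark and starts a local session:

import findspark
findspark.init()  # uses the SPARK_HOME set above to locate the installation

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)  # should print 3.0.0 for the download above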

Solution 2:[2]

This error comes from the link used in the second line of the code. The following snippet worked for me on Google Colab. Do not forget to change the Spark version to the latest one and update the SPARK_HOME path accordingly. You can find the latest versions here: https://downloads.apache.org/spark/

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()

Solution 3:[3]

This is the correct code. I just tested it.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://mirrors.viethosting.com/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

Solution 4:[4]

# for the most recent update as of 02/29/2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop3.2

Solution 5:[5]

Just go to https://downloads.apache.org/spark/, choose the version you need from the folders, and follow the instructions in https://colab.research.google.com/github/asifahmed90/pyspark-ML-in-Colab/blob/master/PySpark_Regression_Analysis.ipynb#scrollTo=m606eNuQgA82

Steps:

  1. Go to https://downloads.apache.org/spark/
  2. Select a folder, for example "spark-3.0.1/"
  3. Copy the file name you want, for example "spark-3.0.1-bin-hadoop3.2.tgz" (it ends with .tgz)
  4. Paste it into the script below


!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/FOLDER_YOU_CHOSE/FILE_YOU_CHOSE
!tar -xvf FILE_YOU_CHOSE
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/FILE_YOU_CHOSE"

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
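
Once the session exists, a quick smoke test (not part of the original answer) confirms that jobs actually execute:

# build a tiny DataFrame and run an action to force a real Spark job
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
print(spark.version)  # should match the version you chose above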

Solution 6:[6]

I have tried the following commands and they seem to work.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark

I got the latest version, changed the download URL, and added the v flag to the tar command for verbose output.

Solution 7:[7]

!pip install pyspark

It worked with just !pip install pyspark.
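
That is enough because the pyspark package on PyPI bundles its own Spark distribution, so no download, SPARK_HOME, or findspark step is needed (this assumes Java is already available in the Colab runtime). A minimal sketch:

from pyspark.sql import SparkSession

# the pip package ships its own Spark, so no further environment setup is required
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.range(3).show()  # quick check that jobs execute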


Solution 8:[8]

You are using a link to an old version; the following commands, which point at a newer version, will work:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark

Solution 9:[9]

To run Spark in Colab, we first need to install the dependencies in the Colab environment: Apache Spark 2.4.3 with Hadoop 2.7, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out inside the Colab notebook itself.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

If you get this error again (tar: ... Cannot open: No such file or directory), visit the Apache Spark download sites and find the latest build version:

  1. https://www-us.apache.org/dist/spark/
  2. http://apache.osuosl.org/spark/

Then replace spark-2.4.3 in the commands above with that latest version.
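
To avoid editing several lines every time a release rotates off the mirrors, the version can live in a single Python variable; IPython expands {expression} inside ! commands, so the shell lines pick it up automatically. A minimal sketch using the archive site recommended in Solution 1 (the variable names are arbitrary):

# change only this line when a new Spark version comes out
spark_version = "2.4.3"
spark_pkg = f"spark-{spark_version}-bin-hadoop2.7"

# IPython substitutes the {...} expressions before running each shell command
!wget -q https://archive.apache.org/dist/spark/spark-{spark_version}/{spark_pkg}.tgz
!tar xf {spark_pkg}.tgz

import os
os.environ["SPARK_HOME"] = f"/content/{spark_pkg}"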

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: (not attributed)
Solution 2: (not attributed)
Solution 3: Matteo
Solution 4: (not attributed)
Solution 5: eemilk
Solution 6: zonksoft
Solution 7: Deepa Vasanthkumar
Solution 8: Vipul Sanjay Charthal
Solution 9: Roshan Bagdiya