ClassNotFoundException: Failed to find data source: bigquery
I'm trying to load data from Google BigQuery into Spark running on Google Dataproc (I'm using Java). I tried to follow the instructions here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
I get the error: "ClassNotFoundException: Failed to find data source: bigquery."
My pom.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.virtualpairprogrammers</groupId>
<artifactId>learningSpark</artifactId>
<version>0.0.3-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>com.google.cloud.spark</groupId>
<artifactId>spark-bigquery_2.11</artifactId>
<version>0.9.1-beta</version>
<classifier>shaded</classifier>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<archive>
<manifest>
<mainClass>com.virtualpairprogrammers.Main</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
</project>
After I added the dependency to my pom.xml, the build downloaded a lot of artifacts while producing the .jar, so I think I have the correct dependency. However, Eclipse also warns me that "The import com.google.cloud.spark.bigquery is never used".
This is the part of my code where I get the error:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.google.cloud.spark.bigquery.*;

public class Main {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("testingSql")
                .getOrCreate();

        Dataset<Row> data = spark.read().format("bigquery")
                .option("table", "project.dataset.tablename")
                .load()
                .cache();
    }
}
```
Solution 1:[1]
I think you added the BigQuery connector only as a compile-time dependency, and it is missing at runtime. You need to either build an uber jar that includes the connector in your job jar (the doc needs to be updated), or include it when you submit the job: `gcloud dataproc jobs submit spark --properties spark.jars.packages=com.google.cloud.spark:spark-bigquery_2.11:0.9.1-beta`.
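For the uber jar route, here is a minimal sketch of a maven-shade-plugin configuration (the plugin version is an assumption; adjust to your build). The ServicesResourceTransformer merges the META-INF/services registration files that Spark's data source lookup relies on, which matters when bundling connectors:

```xml
<!-- Sketch only: bundle the BigQuery connector into an uber jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version> <!-- assumed version -->
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Merge META-INF/services files so Spark can still discover the data source. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```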
Solution 2:[2]
I faced the same issue; updating the format from "bigquery" to "com.google.cloud.spark.bigquery" worked for me.
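For illustration, the read call from the question with the fully qualified format name (the table name is the question's placeholder):

```java
Dataset<Row> data = spark.read()
        .format("com.google.cloud.spark.bigquery") // fully qualified data source name
        .option("table", "project.dataset.tablename")
        .load()
        .cache();
```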
Solution 3:[3]
Specifying the dependency in build.sbt and using "com.google.cloud.spark.bigquery" as the format, as suggested by Peter, resolved the issue for me.
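A minimal build.sbt sketch under the same assumptions as the question (Scala 2.11 and connector version 0.9.1-beta):

```scala
// build.sbt -- sketch; Scala and connector versions assumed from the question
scalaVersion := "2.11.12"
libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery" % "0.9.1-beta"
```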
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dagang |
| Solution 2 | Peter |
| Solution 3 | Jas Kaur |
