How to bulk load data into Apache Phoenix 5.1.2 using Apache Spark 3.2.1 (PySpark)?
I have a problem: I am trying to bulk load CSV files (30-300 GB each) into Apache Phoenix tables using the Apache Spark plugin (https://phoenix.apache.org/phoenix_spark.html). However, when I spark-submit my code:
import sys
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName('From CSV to Phoenix Loader').getOrCreate()
    csv_name = sys.argv[1]
    table_name = sys.argv[2]
    csv_file = spark.read \
        .option("header", True) \
        .option("delimiter", ",") \
        .csv(f"hdfs://open1:9000/csv_files/{csv_name}")
    csv_file.printSchema()
    csv_file.write \
        .format("phoenix") \
        .mode("overwrite") \
        .option("table", table_name) \
        .option("zkUrl", "open1,open2,open3,open4,open5,open6,open7,open8,open9,open10,open11,open12:2181") \
        .save()
    spark.stop()

if __name__ == "__main__":
    main()
I get the following error:
Traceback (most recent call last):
  File "load_from_csv_to_table.py", line 30, in <module>
    main()
  File "load_from_csv_to_table.py", line 19, in main
    csv_file.write \
  File "/home/hadoopuser/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 738, in save
    self._jwrite.save()
  File "/home/hadoopuser/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/hadoopuser/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/hadoopuser/.local/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.ClassNotFoundException:
Failed to find data source: phoenix. Please find packages at
http://spark.apache.org/third-party-projects.html
My spark-submit:
spark-submit --master yarn --deploy-mode cluster --jars /usr/local/phoenix/phoenix-spark-5.0.0-HBase-2.0.jar,/usr/local/phoenix/phoenix-client-hbase-2.4-5.1.2.jar hdfs://open1:9000/apps/python/load_from_csv_to_table.py data.csv TABLE.TABLE
The problem is that I do not know which jars I should attach to spark-submit. When I look at https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark, I do not see a jar version that matches Apache Phoenix 5.1.2. The latest version is 5.0.0 for HBase 2.0.0, from 2018. How can I bulk load data into Apache Phoenix 5.1.2 using PySpark 3.2.1? Which jars do I need?
PS: I have also defined the following in spark-defaults.conf:
spark.executor.extraClassPath=/usr/local/phoenix/phoenix-client-hbase-2.4-5.1.2.jar:/usr/local/phoenix/phoenix-spark-5.0.0-HBase-2.0.jar
spark.driver.extraClassPath=/usr/local/phoenix/phoenix-client-hbase-2.4-5.1.2.jar:/usr/local/phoenix/phoenix-spark-5.0.0-HBase-2.0.jar
but I believe these are not the right jars.
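A diagnostic sketch that may help narrow down the error above. Spark resolves a short format name such as "phoenix" by looking for implementations of org.apache.spark.sql.sources.DataSourceRegister through Java's service-loader mechanism, so a jar that is meant to provide a short name normally ships a META-INF/services entry for that interface. The script below lists the provider classes each jar declares. The function name list_registered_data_sources is hypothetical, and whether the 5.0.0 jar lacks such an entry is an assumption to verify, not a confirmed cause.

```python
import sys
import zipfile

# Service file that Spark's data source lookup reads via java.util.ServiceLoader.
SERVICE_FILE = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"


def list_registered_data_sources(jar_path):
    """Return the provider class names a jar declares for DataSourceRegister.

    An empty list means the jar declares no providers in this service file.
    """
    with zipfile.ZipFile(jar_path) as jar:
        if SERVICE_FILE not in jar.namelist():
            return []
        text = jar.read(SERVICE_FILE).decode("utf-8")

    providers = []
    for line in text.splitlines():
        # Service files allow '#' comments and blank lines; skip both.
        line = line.split("#", 1)[0].strip()
        if line:
            providers.append(line)
    return providers


if __name__ == "__main__":
    # Usage: python check_jars.py /path/to/first.jar /path/to/second.jar
    for path in sys.argv[1:]:
        print(path, list_registered_data_sources(path))
```

The script only shows which classes are declared; the short name itself is returned by each class's shortName() method, so an empty result for every jar on the classpath would be consistent with the "Failed to find data source: phoenix" message but does not prove it.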
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
