Running sparknlp DocumentAssembler on EMR
I am trying to run sparknlp on EMR. I logged into my Zeppelin notebook and ran the following commands:
import sparknlp
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("BBC Text Categorization")\
.config("spark.driver.memory","8G")\
.config("spark.memory.offHeap.enabled",True)\
.config("spark.memory.offHeap.size","8G") \
.config("spark.driver.maxResultSize", "2G") \
.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.4.5")\
.config("spark.kryoserializer.buffer.max", "1000M")\
.config("spark.network.timeout","3600s")\
.getOrCreate()
from sparknlp.base import DocumentAssembler
documentAssembler = DocumentAssembler()\
.setInputCol("description") \
.setOutputCol('document')
This led to the following error:
Fail to execute line 1: documentAssembler = DocumentAssembler()\
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4581426413302524147.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
return func(self, **kwargs)
File "/usr/local/lib/python3.6/site-packages/sparknlp/base.py", line 148, in __init__
super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
return func(self, **kwargs)
File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 72, in __init__
self._java_obj = self._new_java_obj(classname, self.uid)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
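The TypeError itself is informative: when the JVM cannot find the requested class (here, because the spark-nlp jar is not on the classpath), py4j hands pyspark back a JavaPackage placeholder instead of a class constructor, and calling that placeholder fails. A minimal sketch of why that produces this exact message, using a hypothetical stand-in class rather than py4j itself:

```python
# Hypothetical stand-in for py4j's JavaPackage placeholder: a plain object
# with no __call__ method, which is what pyspark receives when the
# com.johnsnowlabs.nlp.DocumentAssembler class is missing from the JVM.
class JavaPackage:
    pass

pkg = JavaPackage()
try:
    # pyspark's _new_java_obj effectively does java_obj(*java_args) here
    pkg()
except TypeError as err:
    print(err)  # 'JavaPackage' object is not callable
```

So the error is not a bug in sparknlp's Python code; it is the symptom of the jar never having been loaded into the Spark JVM.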
To understand the issue, I tried to log into the master and run the above command in pyspark console.
Everything runs fine and I don't get the above error if I start pyspark console using the command:
pyspark --packages JohnSnowLabs:spark-nlp:2.4.5
But I get the same error as before when I start the console with just pyspark.
How can I make this work on my zeppelin notebook?
Setup Details:
EMR 5.27.0
spark 2.4.4
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)
Here is my bootstrap script:
#!/bin/bash
sudo yum install -y python36-devel python36-pip python36-setuptools python36-virtualenv
sudo python36 -m pip install --upgrade pip
sudo python36 -m pip install pandas
sudo python36 -m pip install boto3
sudo python36 -m pip install re
sudo python36 -m pip install spark-nlp==2.7.2
Solution 1:[1]
Make sure you use a supported EMR version; see the spark-nlp documentation for the list of supported versions.
Your bootstrap script should contain:
#!/bin/bash
set -x -e
echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_JARS_DIR=/usr/lib/spark/jars
export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc
sudo python3 -m pip install awscli boto spark-nlp
set +x
exit 0
- Provide a configuration file; you can store it in S3 and pass it to the cluster:
[{
"Classification": "spark-env",
"Configurations": [{
"Classification": "export",
"Properties": {
"PYSPARK_PYTHON": "/usr/bin/python3"
}
}]
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.yarn.stagingDir": "hdfs:///tmp",
"spark.yarn.preserve.staging.files": "true",
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4"
}
}
]
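As a quick sanity check before uploading, the file must parse as a JSON array of classification objects, since EMR rejects malformed configurations at cluster creation. A minimal sketch using only Python's standard library (the string below is an abridged mirror of the config above):

```python
import json

# Abridged copy of sparknlp-config.json, keeping only the keys that matter
config = json.loads("""
[{"Classification": "spark-env",
  "Configurations": [{"Classification": "export",
                      "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"}}]},
 {"Classification": "spark-defaults",
  "Properties": {"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4"}}]
""")

# Every top-level entry needs a Classification key
for entry in config:
    assert "Classification" in entry

print(config[1]["Properties"]["spark.jars.packages"])
# com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
```

The `spark.jars.packages` entry is the piece that fixes the original error: it makes every Spark session on the cluster (including Zeppelin's) pull the spark-nlp jar, so `pyspark --packages` is no longer needed.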
- Finally, start the EMR cluster, e.g. from the CLI:
aws emr create-cluster \
--name "Spark NLP 3.4.4" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
--instance-count 3 \
--use-default-roles \
--log-uri "s3://<S3_BUCKET>/" \
--bootstrap-actions Path=s3://<S3_BUCKET>/emr-bootstrap.sh,Name=custom \
--configurations "https://<public_access>/sparknlp-config.json" \
--ec2-attributes KeyName=<your_ssh_key>,EmrManagedMasterSecurityGroup=<security_group_with_ssh>,EmrManagedSlaveSecurityGroup=<security_group_with_ssh> \
--profile <aws_profile_credentials>
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ckloan |
