An error occurred while calling o137.partitions. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip

I am trying to run this GitHub project on an AWS EMR Spark cluster: https://github.com/pran4ajith/spark-twitter-streaming.git

I have successfully run the first two scripts:

  • tweet_stream_producer.py
  • sparkml_train_model.py

But when I run the consumer part with the command

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,io.delta:delta-core_2.12:0.7.0 tweet_stream_consumer.py

I get a file path error:

Py4JJavaError: An error occurred while calling o137.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ip-10-0-0-61.ec2.internal:8020/home/hadoop/spark-twitter-streaming/TwitterStreaming/src/app/models/metadata

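The problem seems to lie in the mapping between the local file system path and the Hadoop file system path: the model directory is given as a local path, but the error shows it being looked up on HDFS. The model is loaded like this: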

model_path = str(SRC_DIR / 'models')

pipeline_model = PipelineModel.load(model_path)
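A likely explanation is that a path without a scheme is resolved against the cluster's default file system, which on EMR is HDFS, so `/home/hadoop/...` is looked up on HDFS rather than on local disk. The sketch below shows one way to make the scheme explicit. The helper names `to_local_uri` and `load_local_pipeline` are hypothetical and not part of the project. It also assumes the model directory exists on every node that reads it. If that is not the case, copying the directory to HDFS (for example with `hdfs dfs -put`) and loading it from there may be the safer option.

```python
from pathlib import Path


def to_local_uri(path):
    """Return a file:// URI for a local path (hypothetical helper).

    An explicit scheme keeps Hadoop from resolving the path against
    the default file system (HDFS on an EMR cluster).
    """
    return Path(path).resolve().as_uri()


def load_local_pipeline(model_dir):
    """Load a saved PipelineModel from the local file system (hypothetical helper).

    Assumption: model_dir is present on every node that reads it.
    """
    # Imported here so to_local_uri can be used without pyspark installed.
    from pyspark.ml import PipelineModel

    return PipelineModel.load(to_local_uri(model_dir))
```

With this in place, `PipelineModel.load(model_path)` would become `load_local_pipeline(model_path)`.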


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow