'slow performance of spark
I am new to Spark. I have a requirement where I need to integrate Spark with Web Service. Any request to a Web service has to be processed using Spark and send the response back to the client.
I have created a small dummy service in Vertx, which accepts request and processes it using Spark. I am using Spark in cluster mode (1 master, 2 slaves, 8 core, 32 Gb each, running on top of Yarn and Hdfs)
public class WebServer {
private static SparkSession spark;
private static void createSparkSession(String masterUrl) {
SparkContext context = new SparkContext(new SparkConf().setAppName("Web Load Test App").setMaster(masterUrl)
.set("spark.hadoop.fs.default.name", "hdfs://x.x.x.x:9000")
.set("spark.hadoop.fs.defaultFS", "hdfs://x.x.x.x:9000")
.set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
.set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
.set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName())
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "hdfs://x.x.x.x:9000/spark-logs")
.set("spark.history.provider", "org.apache.spark.deploy.history.FsHistoryProvider")
.set("spark.history.fs.logDirectory", "hdfs://x.x.x.x:9000/spark-logs")
.set("spark.history.fs.update.interval", "10s")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//.set("spark.kryoserializer.buffer.max", "1g")
//.set("spark.kryoserializer.buffer", "512m")
);
spark = SparkSession.builder().sparkContext(context).getOrCreate();
}
public static void main(String[] args) {
Vertx vertx = Vertx.vertx();
String masterUrl = "local[2]";
if(args.length == 1) {
masterUrl = args[0];
}
System.out.println("Master url: " + masterUrl);
createSparkSession(masterUrl);
WebServer webServer = new WebServer();
webServer.start(vertx);
}
private void start(Vertx vertx) {
int port = 19090;
Router router = Router.router(vertx);
router.route("/word_count*").handler(BodyHandler.create());
router.post("/word_count").handler(this::calWordFrequency);
vertx.createHttpServer()
.requestHandler(router::accept)
.listen(port,
result -> {
if (result.succeeded()) {
System.out.println("Server started @ " + port);
} else {
System.out.println("Server failed to start @ " + port);
}
});
}
private void calWordFrequency(RoutingContext routingContext) {
WordCountRequest wordCountRequest = routingContext.getBodyAsJson().mapTo(WordCountRequest.class);
List<String> words = wordCountRequest.getWords();
Dataset<String> wordsDataset = spark.createDataset(words, Encoders.STRING());
Dataset<Row> wordCounts = wordsDataset.groupBy("value").count();
List<String> result = wordCounts.toJSON().collectAsList();
routingContext.response().setStatusCode(300).putHeader("content-type", "application/json; charset=utf-8")
.end(Json.encodePrettily(result));
}
}
When I make a post request with payload size of about 5kb, then it is taking around 6 seconds to complete the request and the response back. I feel it is very slow.
However, if I carry a simple example of reading file from Hbase and performing transformation and displaying result, it is very fast. I am able to processes a file of 8Gb file in 2mins. Eg:
logFile="/spark-logs/single-word-line.less.txt"
master_node = 'spark://x.x.x.x:7077'
spark = SparkSession.builder.master(master_node).appName('Avi-load-test').getOrCreate()
log_data = spark.read.text(logFile)
word_count = (log_data.groupBy('value').count())
print(word_count.show())
What is the reason for my application to run so slow? Any pointers would be really helpful. Thank you in advance.
Solution 1:[1]
Spark processing is asynchronous, you are using it as part of a synchronous flow. You can do that but can't expect the processing to be finished. We have implemented similar use case - we have a REST service which triggers a spark job. The implementation does return spark job id(takes some time). And we have another end point to get job status using the Job Id, but we didn't implement a flow to return data from spark job through REST srvc and it is not recommended.
Our flow
REST API <-> Spark Job <-> HBase/Kafka
The REST endpoint triggers a Spark Job and the Spark job reads data from HBase and does the processing and write data back to HBase and Kafka. The REST API is called by a different and they receive the data that is processed through Kafka.
I think you need to rethink your architecture.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | nanic |
