'slow performance of spark

I am new to Spark. I have a requirement where I need to integrate Spark with Web Service. Any request to a Web service has to be processed using Spark and send the response back to the client.

I have created a small dummy service in Vertx, which accepts request and processes it using Spark. I am using Spark in cluster mode (1 master, 2 slaves, 8 core, 32 Gb each, running on top of Yarn and Hdfs)

    public class WebServer {
    
    private static SparkSession spark;
    
    private static void createSparkSession(String masterUrl) {
        SparkContext context = new SparkContext(new SparkConf().setAppName("Web Load Test App").setMaster(masterUrl)
                .set("spark.hadoop.fs.default.name", "hdfs://x.x.x.x:9000")
                .set("spark.hadoop.fs.defaultFS", "hdfs://x.x.x.x:9000")
                .set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
                .set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
                .set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName())
                .set("spark.eventLog.enabled", "true")
                .set("spark.eventLog.dir", "hdfs://x.x.x.x:9000/spark-logs")
                .set("spark.history.provider", "org.apache.spark.deploy.history.FsHistoryProvider")
                .set("spark.history.fs.logDirectory", "hdfs://x.x.x.x:9000/spark-logs")
                .set("spark.history.fs.update.interval", "10s")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                //.set("spark.kryoserializer.buffer.max", "1g")
                //.set("spark.kryoserializer.buffer", "512m")
        );

        spark = SparkSession.builder().sparkContext(context).getOrCreate();
    }

    public static void main(String[] args) {

        Vertx vertx = Vertx.vertx();
        String masterUrl = "local[2]";
        if(args.length == 1) {
            masterUrl = args[0];
        }
        System.out.println("Master url: " + masterUrl);
        createSparkSession(masterUrl);
        WebServer webServer = new WebServer();
        webServer.start(vertx);
    }

    private void start(Vertx vertx) {
        int port = 19090;
        Router router = Router.router(vertx);
        router.route("/word_count*").handler(BodyHandler.create());
        router.post("/word_count").handler(this::calWordFrequency);
        vertx.createHttpServer()
                .requestHandler(router::accept)
                .listen(port,
                        result -> {
                            if (result.succeeded()) {
                                System.out.println("Server started @ " + port);
                            } else {
                                System.out.println("Server failed to start @ " + port);
                            }
                        });
    }
    
    private void calWordFrequency(RoutingContext routingContext) {
        WordCountRequest wordCountRequest = routingContext.getBodyAsJson().mapTo(WordCountRequest.class);
        List<String> words = wordCountRequest.getWords();

        Dataset<String> wordsDataset = spark.createDataset(words, Encoders.STRING());
        Dataset<Row> wordCounts = wordsDataset.groupBy("value").count();
        List<String> result = wordCounts.toJSON().collectAsList();
        routingContext.response().setStatusCode(300).putHeader("content-type", "application/json; charset=utf-8")
                .end(Json.encodePrettily(result));

    }
}

When I make a post request with payload size of about 5kb, then it is taking around 6 seconds to complete the request and the response back. I feel it is very slow.

However, if I carry a simple example of reading file from Hbase and performing transformation and displaying result, it is very fast. I am able to processes a file of 8Gb file in 2mins. Eg:

logFile="/spark-logs/single-word-line.less.txt"
master_node = 'spark://x.x.x.x:7077'
spark = SparkSession.builder.master(master_node).appName('Avi-load-test').getOrCreate()
log_data = spark.read.text(logFile)
word_count = (log_data.groupBy('value').count())
print(word_count.show())

What is the reason for my application to run so slow? Any pointers would be really helpful. Thank you in advance.

Solution 1:^[1]

Spark processing is asynchronous, you are using it as part of a synchronous flow. You can do that but can't expect the processing to be finished. We have implemented similar use case - we have a REST service which triggers a spark job. The implementation does return spark job id(takes some time). And we have another end point to get job status using the Job Id, but we didn't implement a flow to return data from spark job through REST srvc and it is not recommended.

Our flow

REST API <-> Spark Job <-> HBase/Kafka

The REST endpoint triggers a Spark Job and the Spark job reads data from HBase and does the processing and write data back to HBase and Kafka. The REST API is called by a different and they receive the data that is processed through Kafka.

I think you need to rethink your architecture.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	nanic

'slow performance of spark

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]