Java: Implementing a sliding window of data over an RDD flatMap

I need to do the following: using Apache Spark streaming, for each word in a given string read from a file, I want a sliding window of a few words that I can use/print to standard output. So far I have the following code:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
                           .setAppName("SparkApp")
                           .setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> distFile = sc.textFile("file.txt");
JavaRDD<String> words = distFile
                           .flatMap(x -> Arrays.asList(
                               x.replaceAll(",", " ")
                                   .replaceAll("\"", " ")
                                   .split(" ")).iterator());

words.foreach(x -> {
                 System.out.println(x);
             });

The file file.txt looks like this:

The quick brown fox jumps over the lazy dog

The output of the program is the following:

The
quick
brown
fox
jumps
over
the
lazy
dog

What I want it to do is, for example, if I pass 3 to some method, print three words at a time in the following way:

The quick brown
quick brown fox
brown fox jumps
fox jumps over
... et cetera

What I've tried so far:

  1. Implemented a serializable linked list as a queue, adding the current word and removing the oldest one while passing over every word with .foreach. It didn't work out because apparently you can't modify local variables from inside an RDD foreach.
  2. Found something useful and similar to what I'm trying to do involving DStreams and sliding windows over data, but it proved quite difficult, and I'd rather hear your opinions before I delve back into it: is that really the only way, or are there better approaches?


Solution 1:[1]

You want a function that takes a string and returns a list of strings, each containing a window of N consecutive tokens:

def splitNum(line: String, count: Int): List[String] = {
  line
    .split(" ")                          // break into tokens
    .sliding(count)                      // every overlapping window of COUNT consecutive tokens
    .map(tokens => tokens.mkString(" ")) // glue each window back together
    .toList                              // materialize the iterator to match the declared List return type
}
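
Note that sliding(count) yields every overlapping window of count consecutive tokens, which matches the desired output above; grouped(count), by contrast, would split the tokens into disjoint chunks of count and produce no overlap.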

You can then apply the function to the text that you read in:

 distFile.flatMap(line => splitNum(line, 3))

Depending on how clean your text file is, you might still have work to do: handling other punctuation, smarter tokenization, collapsing multiple spaces, and so on.
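
Since the question itself is in Java, here is a minimal sketch of the same idea with the Java RDD API. The helper name slidingWindows and the cleanup regex are illustrative choices, not part of the original answer:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: returns every overlapping window of `count`
// consecutive tokens in `line`.
static List<String> slidingWindows(String line, int count) {
    // Replace commas and quotes with spaces, then split on runs of whitespace.
    String[] tokens = line.replaceAll("[,\"]", " ").trim().split("\\s+");
    List<String> windows = new ArrayList<>();
    for (int i = 0; i + count <= tokens.length; i++) {
        // Join tokens[i .. i+count) back into one space-separated string.
        windows.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + count)));
    }
    return windows;
}

Applied to the RDD from the question:

distFile.flatMap(line -> slidingWindows(line, 3).iterator())
        .foreach(x -> System.out.println(x));

For the sample file this should print the overlapping three-word windows shown in the question, one per line.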

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1