Spark RDD: find the single row with the highest count and, for that row, report the month, count, and hashtag name, printing the output using println

[Spark RDD] Find the single row that has the highest count and, for that row, report the month, count, and hashtag name. Print the result to the terminal using println.

Here is a small example of the Twitter data that we will use to illustrate the subtasks below:

[The sample Twitter data was shown here as an image in the original post; the image is not available in this copy.]

Expected output for this small example: month: 200907, count: 1000, hashtagName: abc
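Since the image is missing, a tab-separated input row that could produce this output might look like the line below. This assumes the columns are hashtag name, month, and count, in that order, which matches the count index used in the code further down; adjust to the actual file if it differs.

abc	200907	1000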

My code so far:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext

object Main {
  def solution(sc: SparkContext) {
    // Load each line of the input data
    val twitterLines = sc.textFile("Assignment_Data/twitter-small.tsv")
    // Split each line of the input data into an array of strings
    val twitterdata = twitterLines.map(_.split("\t"))

    // TODO: *** Put your solution here ***

    // Reduce the rows pairwise, keeping whichever row has the larger
    // count (the third column, index 2), leaving the single row with the maximum count
    val find_max = twitterdata.reduce { (max: Array[String], current: Array[String]) =>
      val curCount = current(2).toInt
      val maxCount = max(2).toInt
      if (curCount > maxCount) current else max
    }
  }
}

Please help: what should I do next to print the month, count, and hashtag name?

Thanks.
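For reference, here is a minimal sketch of what the remaining step could look like, assuming the reduce above is kept as-is and that the columns are hashtag name (index 0), month (index 1), and count (index 2). The column order is an assumption based on the expected output, not confirmed by the post, so adjust the indices to match the actual file:

// Hypothetical final step: pull the fields out of the winning row
// and print them in the required format with println.
// NOTE: the column indices below are an assumption.
val month = find_max(1)
val count = find_max(2)
val hashtagName = find_max(0)
println(s"month: $month, count: $count, hashtagName: $hashtagName")

With the example data, this would print: month: 200907, count: 1000, hashtagName: abc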



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
