Why does RDD work when I don't have the implicit encoder? Why does providing an implicit encoder also fix the issue?

The Spark quickstart example provides some code to count the occurrences of each word that appears in the README.md document:

val textFile = spark.read.textFile("README.md")
textFile.flatMap(line => line.split(" ")).groupByKey(identity).count().collect()
res0: Array[(String, Long)] = Array(([![PySpark,1), (online,1), (graphs,1), (["Building,1), (documentation,3), (command,,2), (abbreviated,1), (overview,1), (rich,1), (set,2), (-DskipTests,1), (1,000,000,000:,2), (name,1), (["Specifying,1), (stream,1), (run:,1), (not,1), (programs,2), (tests,2), (./dev/run-tests,1), (will,1), ([run,1), (particular,2), (Alternatively,,1), (must,1), (using,3), (./build/mvn,1), (you,4), (MLlib,1), (DataFrames,,1), (variable,1), (Note,1), (core,1), (protocols,1), (Guide](https://spark.apache.org/docs/latest/configuration.html),1), (guidance,2), (shell:,2), (can,6), (site,,1), (*,4), (systems.,1), ([building,1), (configure,1), (for,12), (README,1), (Interactive,2), (how,3), ([Configuration,1), (Hive,2), (provides,1), (Hadoop-supporte...

I thought it would be a good exercise to figure out how to modify the code to count characters instead of words. I incorrectly assumed I could replace line.split(" ") with line.toCharArray(). The original produces an Array[String]; the replacement produces an Array[Char]. This didn't work:

textFile.flatMap(line => line.toCharArray()).groupByKey(identity).count().collect()
<console>:24: error: Unable to find encoder for type Char. An implicit Encoder[Char] is needed to store Char instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
       textFile.flatMap(line => line.toCharArray()).groupByKey(identity).count().collect()
                       ^
<console>:24: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
       textFile.flatMap(line => line.toCharArray()).groupByKey(identity).count().collect()
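The two errors appear to be related: the element type changed from String to Char, Spark cannot find an Encoder[Char], and type inference then also trips over eta-expanding identity. The type difference itself is plain Scala, no Spark required:

```scala
val line = "ab cd"

// split(" ") yields Array[String] — String has a built-in encoder
val words: Array[String] = line.split(" ")
println(words.toList) // List(ab, cd)

// toCharArray yields Array[Char] — Char has no built-in encoder
val chars: Array[Char] = line.toCharArray
println(chars.toList) // List(a, b,  , c, d)
```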

After searching Stack Overflow for the error message I found "Unable to find Encode[Char] while using flatMap with toCharArray in spark", which says you need to use rdd. As a first workaround, though, I avoided Char entirely by splitting each line into single-character Strings:

textFile.flatMap(line => line.split("")).groupByKey(identity).count().collect()
res6: Array[(String, Long)] = Array((K,1), (7,3), (l,156), (x,12), (=,5), (<,2), (],19), (g,96), (3,3), (F,7), (Q,2), (*,4), (0,37), (m,75), (!,4), (E,9), (T,14), (f,47), (B,6), ((,22), (n,225), (k,58), (.,81), (_,4), (Y,6), (L,4), (M,8), (V,3), (U,2), (v,43), (e,316), (D,8), (O,2), (o,246), (h,126), (z,1), (C,5), (p,160), (d,108), (J,2), (-,35), (A,18), (/,109), (N,5), (X,1), (y,38), (w,23), (),22), (c,116), (S,35), (u,119), (:,29), (i,215), (R,9), (G,3), (",12), (1,8), (q,2), (j,8), (#,22), (%,1), (`,6), (b,59), (I,6), (&,1), (P,14), (,,32), (a,299), (r,230), ("",41), (" ",462), (?,3), (t,264), (>,6), (2,2), (H,10), (s,228), ([,19))
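The reason this compiles: split("") breaks the line into one-character Strings, and String is one of the primitive types Spark can encode out of the box. A quick plain-Scala check of the split behavior (on Java 8 and later, no leading empty string is produced):

```scala
val parts: Array[String] = "ab c".split("")
println(parts.toList) // List(a, b,  , c) — one-character Strings, not Chars
```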

The issue is summarized by a commenter on the linked Stack Overflow question:

char is not a default Spark datatype and so it cannot be encoded.

An alternative solution, suggested in one of the answers, does use rdd:

textFile.rdd.flatMap(line => line.toCharArray().map(c=>(c,1))).reduceByKey(_+_).collect()
res7: Array[(Char, Int)] = Array((w,23), (",12), (`,6), (Q,2), (e,316), (G,3), (7,3), (R,9), (B,6), (P,14), (O,2), (b,59), (y,38), (A,18), (#,22), (2,2), (h,126), (o,246), (i,215), (K,1), (3,3), (%,1), (k,58), (n,225), (-,35), (j,8), (J,2), (?,3), (H,10), (S,35), (F,7), (Y,6), (&,1), (1,8), (g,96), (N,5), (l,156), (m,75), (c,116), (T,14), (d,108), (),22), (=,5), (z,1), (s,228), (/,109), (L,4), (x,12), (p,160), (M,8), (a,299), (_,4), (t,264), (.,81), (0,37), (u,119), (I,6), ( ,462), (>,6), (],19), (!,4), (*,4), (f,47), (q,2), (v,43), ((,22), (C,5), (E,9), (U,2), (:,29), (,,32), (V,3), (<,2), ([,19), (X,1), (r,230), (D,8))
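The rdd version follows the classic pattern: map every character to a (char, 1) pair, then sum the counts per key. The same aggregation can be sketched with plain Scala collections, where groupBy plus a per-group sum plays the role of reduceByKey:

```scala
object CharCount {
  // Equivalent shape to rdd.flatMap(...(c, 1)...).reduceByKey(_ + _)
  def countChars(lines: Seq[String]): Map[Char, Int] =
    lines
      .flatMap(_.toCharArray.map(c => (c, 1))) // one (char, 1) pair per character
      .groupBy(_._1)                           // group the pairs by character
      .map { case (c, pairs) => c -> pairs.map(_._2).sum } // sum the 1s
}

println(CharCount.countChars(Seq("ab", "ba"))) // a -> 2, b -> 2
```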

Given my experience above, I clearly have a fundamental misunderstanding. These two questions would help me understand Spark better:

  1. Why does RDD work with a non-primitive type?
  2. Why does providing an implicit encoder also fix the issue? How can I provide an implicit Encoder[Char], so that I don't need RDD?
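On question 2, one approach often suggested is to bring a Kryo-based encoder into implicit scope. This is a sketch only, not verified here; it assumes a live SparkSession with textFile in scope, and note that Kryo-encoded values are stored as opaque binary rather than as a typed column:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Kryo can serialize Char even though Char is not a built-in Spark SQL type.
implicit val charEncoder: Encoder[Char] = Encoders.kryo[Char]

// With the encoder in scope, the Dataset version should type-check again.
// An explicit lambda sidesteps the `identity` eta-expansion error.
textFile.flatMap(line => line.toCharArray()).groupByKey(c => c).count().collect()
```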


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source