RDD saveToCassandra() updates a column value inconsistently when spark.cores.max is more than 1
Scenario 1
I'm using the saveToCassandra API to update a few columns of a row several times. I have a table with job_name as the primary key and status as a column. The Spark job updates the row twice: first, when it starts, it sets the status to INPROGRESS using rdd.saveToCassandra(); finally, depending on whether the job succeeded or failed, the status is updated again.
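For reference, the RDD update path looks roughly like this (a minimal sketch, not my exact production code; I'm assuming the same monitor.job_status table as in Scenario 2, and that sc is an existing SparkContext):

```scala
import com.datastax.spark.connector._   // brings saveToCassandra into scope
import org.apache.spark.SparkContext

// Hypothetical helper: write one (job_name, job_status) row.
def setStatus(sc: SparkContext, jobName: String, status: String): Unit =
  sc.parallelize(Seq((jobName, status)))
    .saveToCassandra("monitor", "job_status",
                     SomeColumns("job_name", "job_status"))

setStatus(sc, jobName, "INPROGRESS")  // at job start
// ... run the job ...
setStatus(sc, jobName, "SUCCESS")     // or "FAILED" on error
```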
At times the final update does not happen and the status is left as INPROGRESS, with no errors at all. After a lot of trial and error, I was able to pinpoint that this inconsistency happens when spark.cores.max is more than 1. When I set spark.cores.max to 1, the updates happened consistently.
Question: can anyone help me understand why? Is there a way to overcome this without compromising on parallelism?
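My working theory (which I'd like confirmed or corrected) is that Cassandra resolves conflicting writes per cell by write timestamp, last write wins, and breaks a timestamp tie by comparing the values themselves, with the lexically greater value winning. With more than one core, the two status writes could land with the same or skewed timestamps. A plain-Scala sketch of that resolution rule, no Spark or Cassandra needed; the rule itself is my assumption here:

```scala
// Toy model of Cassandra-style last-write-wins (LWW) cell resolution.
// ASSUMPTION for illustration: higher write timestamp wins; a timestamp
// tie is broken by the lexically greater value.
case class Cell(value: String, timestampMicros: Long)

def resolve(a: Cell, b: Cell): Cell =
  if (a.timestampMicros != b.timestampMicros)
    if (a.timestampMicros > b.timestampMicros) a else b
  else if (a.value > b.value) a else b

// Distinct timestamps: the later write wins, as expected.
val ok = resolve(Cell("INPROGRESS", 1000L), Cell("FAILED", 2000L))
// ok.value == "FAILED"

// Timestamp tie (e.g. two tasks writing within the same microsecond):
// "INPROGRESS" > "FAILED" lexically, so the row can look "stuck".
val stuck = resolve(Cell("INPROGRESS", 1000L), Cell("FAILED", 1000L))
// stuck.value == "INPROGRESS"
```

If this theory is right, it would explain why the symptom disappears with a single core, since the two writes are then strictly ordered.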
Scenario 2
I also tried updating the column several times using a DataFrame instead of an RDD. Sample code:
val df1 = Seq((job_name, "", "INPROGRESS"))
  .toDF("job_name", "completion_date", "job_status").cache()
df1.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("table" -> "job_status", "keyspace" -> "monitor"))
  .save()
Can inconsistencies happen when a DataFrame is used as well?
Why isn't the RDD write working as expected? I'd really appreciate any light thrown on this.
Spark version: 1.5.1
Scala version: 2.10.4
libraryDependencies += ("com.datastax.spark" %% "spark-cassandra-connector-java" % "1.5.0").exclude("io.netty", "netty-handler")
libraryDependencies += ("com.datastax.spark" %% "spark-cassandra-connector-embedded" % "1.5.0" % "provided" ).exclude("io.netty", "netty-handler")
libraryDependencies += ("com.datastax.cassandra" % "cassandra-driver-core" % "2.1.9" % "provided").exclude("io.netty", "netty-handler")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow