RDD saveToCassandra() updates a column value inconsistently when spark.cores.max is more than 1
Scenario 1
I'm using the saveToCassandra API to update a few columns of a row several times. I have a table with job_name as the primary key and status as a column. The Spark job updates the row twice: first, when it starts, it sets the status to INPROGRESS using rdd.saveToCassandra(); finally, depending on whether the job succeeded or failed, the status is updated again.
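For reference, the RDD update path looks roughly like this (a minimal sketch, not my exact production code; I'm assuming the same monitor.job_status table as in Scenario 2, and that sc is an existing SparkContext):

```scala
import com.datastax.spark.connector._   // brings saveToCassandra into scope
import org.apache.spark.SparkContext

// Hypothetical helper: write one (job_name, job_status) row.
def setStatus(sc: SparkContext, jobName: String, status: String): Unit =
  sc.parallelize(Seq((jobName, status)))
    .saveToCassandra("monitor", "job_status",
                     SomeColumns("job_name", "job_status"))

setStatus(sc, jobName, "INPROGRESS")  // at job start
// ... run the job ...
setStatus(sc, jobName, "SUCCESS")     // or "FAILED" on error
```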
At times the final update does not happen and the status is left as INPROGRESS, with no errors at all. After a lot of trial and error, I was able to pinpoint that this inconsistency happens when spark.cores.max is more than 1. When I set spark.cores.max to 1, the updates happened consistently.
Question: can anyone help me understand why? Is there a way to overcome this without compromising on parallelism?
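My working theory (which I'd like confirmed or corrected) is that Cassandra resolves conflicting writes per cell by write timestamp, last write wins, and breaks a timestamp tie by comparing the values themselves, with the lexically greater value winning. With more than one core, the two status writes could land with the same or skewed timestamps. A plain-Scala sketch of that resolution rule, no Spark or Cassandra needed; the rule itself is my assumption here:

```scala
// Toy model of Cassandra-style last-write-wins (LWW) cell resolution.
// ASSUMPTION for illustration: higher write timestamp wins; a timestamp
// tie is broken by the lexically greater value.
case class Cell(value: String, timestampMicros: Long)

def resolve(a: Cell, b: Cell): Cell =
  if (a.timestampMicros != b.timestampMicros)
    if (a.timestampMicros > b.timestampMicros) a else b
  else if (a.value > b.value) a else b

// Distinct timestamps: the later write wins, as expected.
val ok = resolve(Cell("INPROGRESS", 1000L), Cell("FAILED", 2000L))
// ok.value == "FAILED"

// Timestamp tie (e.g. two tasks writing within the same microsecond):
// "INPROGRESS" > "FAILED" lexically, so the row can look "stuck".
val stuck = resolve(Cell("INPROGRESS", 1000L), Cell("FAILED", 1000L))
// stuck.value == "INPROGRESS"
```

If this theory is right, it would explain why the symptom disappears with a single core, since the two writes are then strictly ordered.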
Scenario 2
I also tried updating the column several times using a DataFrame instead of an RDD. Sample code:
val df1 = Seq((job_name, "", "INPROGRESS"))
  .toDF("job_name", "completion_date", "job_status").cache()
df1.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("table" -> "job_status", "keyspace" -> "monitor"))
  .save()
Can inconsistencies happen when a DataFrame is used as well?
Why isn't the RDD write working as expected? I'd really appreciate any light thrown on this.
Spark version: 1.5.1
Scala version: 2.10.4
libraryDependencies += ("com.datastax.spark" %% "spark-cassandra-connector-java" % "1.5.0").exclude("io.netty", "netty-handler")
libraryDependencies += ("com.datastax.spark" %% "spark-cassandra-connector-embedded" % "1.5.0" % "provided" ).exclude("io.netty", "netty-handler")
libraryDependencies += ("com.datastax.cassandra" % "cassandra-driver-core" % "2.1.9" % "provided").exclude("io.netty", "netty-handler")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow