Reading from Cassandra in Spark does not return all the data when using JoinWithCassandraTable
I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, partition_key_2). For the sake of this example, let's say both range from 0 to 9.
My code looks like this:
```python
import pyspark.sql.functions as f

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="my_table", keyspace="my_keyspace")
      .load()
      .filter(f.col('partition_key_1').isin(list(range(0, 10))))
      .filter(f.col('partition_key_2').isin(list(range(0, 10))))
      .select('partition_key_1', 'partition_key_2', 'some_value')
      )
```
This works perfectly fine and I get all the data I'm expecting.
However, if I lower
spark.cassandra.sql.inClauseToJoinConversionThreshold
(see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md) to something like 20, the cross product of my IN values (10*10 = 100) now exceeds the threshold and JoinWithCassandraTable is used. Suddenly I no longer get all the data, and on top of that some rows come back duplicated. It looks like I'm completely missing some of the partition keys, while other partition keys return duplicated rows (this quick analysis might be wrong, though).
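For completeness, this is roughly how I set things up to trigger the conversion, plus the ad-hoc check I use to spot missing and duplicated keys (a sketch: the connection host is a placeholder, and the per-key aggregation is just my own sanity check, not anything the connector requires):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Sketch: lower the threshold so the 10 * 10 = 100 cross product exceeds it
# and the IN clauses get converted to a JoinWithCassandraTable operation.
spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder
         .config("spark.cassandra.sql.inClauseToJoinConversionThreshold", "20")
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="my_table", keyspace="my_keyspace")
      .load()
      .filter(f.col('partition_key_1').isin(list(range(0, 10))))
      .filter(f.col('partition_key_2').isin(list(range(0, 10))))
      .select('partition_key_1', 'partition_key_2', 'some_value'))

# Ad-hoc sanity check: row count per partition key. Comparing this against the
# same aggregation taken with the default threshold shows which keys are
# missing entirely and which come back with more rows than before.
per_key = df.groupBy('partition_key_1', 'partition_key_2').count()
per_key.orderBy('partition_key_1', 'partition_key_2').show(100)
```

Running the same aggregation with the default threshold gives the baseline I compare against.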
Does anybody have any idea what's going on? Can I "inspect" the query that is sent to Cassandra (other than the physical plan, in which I did not find anything interesting)?
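The closest I have to inspection so far is printing all plan stages rather than just the physical one; this still only shows Spark's side, not the raw CQL, so pointers to something better are welcome (sketch, assuming df is the DataFrame from above):

```python
# Prints the parsed, analyzed, optimized and physical plans; diffing the output
# of the default-threshold read against the lowered-threshold read should make
# the JoinWithCassandraTable conversion visible.
df.explain(True)
```

I assume raising the log level for the com.datastax.spark.connector package in log4j.properties would also surface more detail about the generated requests, but I have not verified what it actually logs.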
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
