How to generate a DStream from RateStreamSource in Spark
I have a case class in Scala like this:

import scala.util.Random

case class RemoteCopyGroup(
    ts: Long,
    systemId: String,
    name: String,
    id: Int,
    role: String,
    mode: String,
    remoteGroupName: String)

object RemoteCopyGroup {
  // to be removed
  val arrayOfIds = Array("CZ210507H1", "CZ20030W4H", "CZ29400JBJ")

  def randomSerialNumber: String = Random.shuffle(arrayOfIds.toList).head

  def get(x: Rate): RemoteCopyGroup = {
    RemoteCopyGroup(
      x.timestamp.getTime / 1000,
      randomSerialNumber,
      Random.nextString(2),
      Random.nextInt(3),
      Random.nextString(2),
      Random.nextString(2),
      Random.nextString(2))
  }
}
I am generating a stream of data using RateStreamSource like this:

val remoteCopyGroupDS: Dataset[(String, RemoteCopyGroup)] = sparkSession
  .readStream
  .format("rate") // <-- use RateStreamSource
  .option("rowsPerSecond", rate)
  .load()
  .as[Rate] // Rate mirrors the rate source schema: (timestamp: java.sql.Timestamp, value: Long)
  .filter(_.value % 10 == 0)
  .map(RemoteCopyGroup.get)
  .map(rcg => rcg.systemId -> rcg)
I want to do stateful operations on remoteCopyGroupDS, but I cannot use methods like mapWithState because remoteCopyGroupDS is a Dataset, not a DStream.
Is there a way to generate a DStream that continuously emits data, or to convert the current Dataset, i.e. remoteCopyGroupDS, to a DStream?
Solution 1:[1]
DStreams belong to the older Spark Streaming API: a DStream represents a continuous series of RDDs and is created from a StreamingContext, not from a SparkSession. DStreams can be built from sources such as Kafka, sockets, and files on HDFS, or from a custom Receiver registered via StreamingContext.receiverStream. The rate source, however, is a Structured Streaming source, and there is no built-in bridge from a Structured Streaming Dataset to a DStream. In practice you do not need one: Structured Streaming has its own stateful operators, mapGroupsWithState and flatMapGroupsWithState, which play the role of mapWithState on a keyed Dataset.
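A minimal sketch of the stateful route on the question's remoteCopyGroupDS, assuming the definitions shown above; the state type and the per-key counting logic (GroupCount, updateState) are illustrative, not part of the original code:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import sparkSession.implicits._

// Hypothetical state: number of events seen so far per systemId.
case class GroupCount(systemId: String, count: Long)

def updateState(
    systemId: String,
    events: Iterator[(String, RemoteCopyGroup)],
    state: GroupState[Long]): GroupCount = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)
  GroupCount(systemId, newCount)
}

val counts = remoteCopyGroupDS
  .groupByKey { case (systemId, _) => systemId }
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateState)

counts.writeStream
  .format("console")
  .outputMode(OutputMode.Update()) // mapGroupsWithState requires Update output mode
  .start()
```

The state is kept per key by the engine across micro-batches, which is the same role mapWithState plays in the DStream API.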
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | George Harry |
