Apache Spark: is it possible to use Google Guice as the dependency injection technique?

Is it possible to use Google Guice as the dependency injection provider for an Apache Spark Java application?

I am able to achieve this when execution happens at the driver, but I have no control over it when execution happens at the executors.

Is it even possible to use the injected objects at the executors? It's hard to manage the code without dependency injection in Spark applications.



Solution 1:[1]

I think the neutrino framework is exactly what you need.

Disclaimer: I am the author of the neutrino framework.

This framework provides the capability to use dependency injection (DI) to generate the objects and control their scope at both the driver and executors.

How does it do that?

As we know, to adopt a DI framework, we first need to build a dependency graph, which describes the dependency relationships between the various types and can be used to generate instances along with their dependencies. Guice uses its Module API to build the graph, while the Spring framework uses XML files or annotations. Neutrino is built on top of the Guice framework and, naturally, builds the dependency graph with the Guice module API. It doesn't just keep the graph in the driver; it also has the same graph running on every executor.
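To make the Module API point concrete, here is a minimal plain-Guice module in Scala. It has nothing to do with Spark or neutrino; Greeter, SimpleGreeter, GreeterModule and GreeterDemo are made-up names used only to illustrate how a module describes the graph and how the injector generates instances from it.

import com.google.inject.{AbstractModule, Guice}

// Hypothetical types, purely for illustration.
trait Greeter { def greet(name: String): String }
class SimpleGreeter extends Greeter {
  override def greet(name: String): String = s"Hello, $name"
}

// The module declares the edges of the dependency graph:
// "whenever a Greeter is needed, build a SimpleGreeter".
class GreeterModule extends AbstractModule {
  override def configure(): Unit =
    bind(classOf[Greeter]).to(classOf[SimpleGreeter])
}

object GreeterDemo extends App {
  // The injector generates instances (and their dependencies) from the graph.
  val injector = Guice.createInjector(new GreeterModule)
  println(injector.getInstance(classOf[Greeter]).greet("Spark"))
}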

Serialize the creation method

In the dependency graph, some nodes may generate objects that get passed to the executors, and the neutrino framework assigns a unique id to each of these nodes. Since every JVM has the same graph, the graph on each JVM has the same set of node ids. When an instance that needs to be transferred is requested from the graph at the driver, instead of creating the actual instance, the framework just returns a stub object which holds the object creation method (including the node id). When the stub object is passed to an executor, the framework finds the corresponding node in that executor's graph by the id and recreates the same object and its dependencies there, as sketched below.
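The following is a rough conceptual sketch of that stub idea, not the actual neutrino API: LocalGraph and SerializableStub are invented names, and the real framework hooks into Guice bindings rather than a hand-rolled registry like this.

import scala.collection.concurrent.TrieMap

// Invented stand-in for the per-JVM dependency graph: every JVM registers
// the same node ids with the same creation functions.
object LocalGraph {
  private val creators = TrieMap.empty[String, () => AnyRef]
  def register(nodeId: String)(create: () => AnyRef): Unit =
    creators.update(nodeId, create)
  def create(nodeId: String): AnyRef = creators(nodeId)()
}

// The stub only carries the node id, so it serializes trivially. On the
// executor it looks the id up in the local graph and rebuilds the real
// (possibly non-serializable) instance together with its dependencies.
class SerializableStub[T <: AnyRef](nodeId: String) extends Serializable {
  @transient private lazy val instance: T =
    LocalGraph.create(nodeId).asInstanceOf[T]
  def get: T = instance
}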

Example

Here is a simple example (it just filters an event stream based on Redis data):

trait EventFilter[T] {
    def filter(t: T): Boolean
}

// Assumed imports for this snippet:
import javax.inject.Inject
import redis.clients.jedis.JedisCommands

// The RedisEventFilter class depends on JedisCommands directly
// and does not extend the `java.io.Serializable` interface.
class RedisEventFilter @Inject()(jedis: JedisCommands)
    extends EventFilter[ClickEvent] {
  override def filter(e: ClickEvent): Boolean = {
    // filter logic based on Redis
  }
}

/* create the injector (neutrino-specific setup elided) */
val injector = ...

// What comes back here is the generated serializable proxy, so it is
// safe to capture it in the closure passed to `filter` below.
val eventFilter = injector.instance[EventFilter[ClickEvent]]
val eventStream: DStream[ClickEvent] = ...
eventStream.filter(e => eventFilter.filter(e))

Here is how to configure the bindings:

class FilterModule(redisConfig: RedisConfig) extends SparkModule {
  override def configure(): Unit = {
    // The magic is here:
    // `withSerializableProxy` generates a proxy that extends both the
    // `EventFilter` and `java.io.Serializable` interfaces via a Scala macro.
    // The module must extend `SparkModule` or `SparkPrivateModule` for
    // this method to be available.
    bind[EventFilter[ClickEvent]].withSerializableProxy
      .to[RedisEventFilter].in[SingletonScope]
  }
}

With neutrino, RedisEventFilter doesn't even need to care about the serialization problem. Everything just works as if it were in a single JVM.
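For contrast, here is a rough sketch of what the naive approach would look like with plain Guice and no proxy. SomeRedisModule is a hypothetical module; the point is only that the real filter instance gets captured by the closure.

import com.google.inject.{Guice, Key, TypeLiteral}

// Hypothetical naive setup: plain Guice on the driver, no serializable proxy.
val injector = Guice.createInjector(new SomeRedisModule(redisConfig))
val key = Key.get(new TypeLiteral[EventFilter[ClickEvent]]() {})
val eventFilter = injector.getInstance(key)   // the actual RedisEventFilter

// Spark has to serialize this closure, which drags the non-serializable
// RedisEventFilter (and its Jedis handle) along with it, so this would
// typically fail with a "Task not serializable" error.
eventStream.filter(e => eventFilter.filter(e))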

For details, please refer to the neutrino readme file.

Limitation

Since this framework uses a Scala macro to generate the proxy class, the Guice modules and the logic that wires these modules together need to be written in Scala. Other classes, such as EventFilter and its implementations, can be written in Java.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

[1] Source: Stack Overflow