Python SDK Code Example for Splittable DoFns in Apache Beam

I am creating a Dataflow pipeline in Python in which I need to use FileIO, because I want to access and keep track of the filenames being processed.

Everything works fine as long as the files are small. When running on large files (GBs of data), the Dataflow job is neither performant nor scalable.

One solution is to use splittable ParDos (DoFns), which I have previously implemented in Java, but now the preferred language is Python.

The problem is that I am unable to find any decent code snippet (example) that explains how to implement a splittable ParDo (DoFn) in the Python SDK.

I found a code example in the Beam documentation (https://beam.apache.org/blog/splittable-do-fn/), but as far as I have tried it, it is not correct:

class CountFn(DoFn):
  def process(element, tracker=DoFn.RestrictionTrackerParam)
    for i in xrange(*tracker.current_restriction()):
      if not tracker.try_claim(i):
        return
      yield element[0], i

  def get_initial_restriction(element):
    return (0, element[1])

In this example, we pass tracker=DoFn.RestrictionTrackerParam to the process method, but as far as I can see, the DoFn class does not have any attribute named RestrictionTrackerParam.

As far as I have tested, this example is incomplete.

Can I get some help finding a decent example of splittable DoFns in the Python SDK?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
