Python SDK Code Example for Splittable DoFns in Apache Beam
I am creating a Dataflow pipeline in Python in which I need to use FileIO, because I want to access and keep track of the filenames being processed.
Everything works fine as long as my files are small. When running on large files (GBs of data), the Dataflow job is neither performant nor scalable.
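For context, here is a stripped-down sketch of the pipeline; the bucket path and the way each file is turned into records are placeholders, not my real code:

    # Stripped-down sketch of the FileIO pipeline; the file pattern and
    # per-file parsing below are placeholders.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as p:
        keyed_contents = (
            p
            | 'Match' >> fileio.MatchFiles('gs://my-bucket/input/*.csv')
            | 'Read' >> fileio.ReadMatches()
            # Each matched file is read inside a single DoFn call, so one
            # multi-GB file cannot be split across workers; this is the
            # bottleneck a splittable DoFn should remove.
            | 'KeyByName' >> beam.Map(
                lambda f: (f.metadata.path, f.read_utf8())))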
One solution is to use splittable ParDos (DoFns), which I have previously implemented in Java, but now the preferred language is Python.
The problem is that I am unable to find any decent code snippets (examples) that explain how to implement a splittable ParDo (DoFn) in the Python SDK.
I found a code example on the Beam blog (https://beam.apache.org/blog/splittable-do-fn/), but it does not work as written, as far as I have tried it:
    class CountFn(DoFn):
      def process(element, tracker=DoFn.RestrictionTrackerParam):
        for i in xrange(*tracker.current_restriction()):
          if not tracker.try_claim(i):
            return
          yield element[0], i

      def get_initial_restriction(element):
        return (0, element[1])
In this example, we pass tracker=DoFn.RestrictionTrackerParam to the process method, but as far as I can see, the DoFn class does not have any attribute named RestrictionTrackerParam.
As far as I have tested, this example is incomplete.
Can I get some help finding a decent example of splittable DoFns in the Python SDK?
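For completeness, below is my best attempt at rewriting the blog's counting example against the RestrictionProvider / RestrictionParam API that the current Python SDK exposes; I am not sure this is the intended usage, so treat it as a sketch rather than a verified answer:

    import apache_beam as beam
    from apache_beam.io.restriction_trackers import (
        OffsetRange, OffsetRestrictionTracker)
    from apache_beam.transforms.core import RestrictionProvider

    class CountRestrictionProvider(RestrictionProvider):
        """Tells the runner how to create, track, and size restrictions."""

        def initial_restriction(self, element):
            # element is a (key, count) pair; the restriction is [0, count).
            return OffsetRange(0, element[1])

        def create_tracker(self, restriction):
            return OffsetRestrictionTracker(restriction)

        def restriction_size(self, element, restriction):
            return restriction.size()

    class CountFn(beam.DoFn):
        # Declaring the tracker argument with beam.DoFn.RestrictionParam is
        # what marks this DoFn as splittable.
        def process(
            self,
            element,
            tracker=beam.DoFn.RestrictionParam(CountRestrictionProvider())):
            restriction = tracker.current_restriction()
            for i in range(restriction.start, restriction.stop):
                # try_claim returns False once the runner has split off the
                # remainder of the restriction; stop work at that point.
                if not tracker.try_claim(i):
                    return
                yield element[0], i

If this is roughly right, the runner should be able to split the OffsetRange restriction at try_claim boundaries and hand sub-ranges to other workers, which is exactly the scaling behaviour I am missing on large files.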
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow