Is it possible to read files with uncommon extensions (not .txt or .csv) in Apache Beam using the Python SDK? For example, a file with a .set extension

Is it possible to read files with uncommon extensions (not .txt, .csv, or .json) in Apache Beam using the Python SDK? For example, I want to read a local file with a .set extension (a special file containing an EEG recording). I couldn't find any information on how to do this in the official documentation.

If I understand correctly, beam.Create creates a PCollection from an iterable, but what if my data is not iterable (like the data in a .set file)? How do I read it?



Solution 1:[1]

If you have only one file to process, you can pass beam.Create a list that contains the path to your .set file.

p | 'Initialise pipeline' >> beam.Create(['your_file.set'])

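As for reading the .set file: even though it is not officially supported by Beam's I/O connectors, you can write your own reader in Python as a DoFn:

import apache_beam as beam

class ReadSetContent(beam.DoFn):

    def process(self, file):
        # `file` is the path emitted by beam.Create, so the file can be opened here.
        # Reading raw bytes is only a placeholder; swap in whatever parser your .set files need.
        with open(file, 'rb') as f:
            # Yield the content of the .set file; the next transformation will receive it.
            yield f.read()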

So the start of your pipeline would look like this:

(p | 'Initialise pipeline' >> beam.Create(['your_file.set'])
  | 'Reading content' >> beam.ParDo(ReadSetContent())
  | 'Next transformation' >> ...)

Solution 2:[2]

You can use Beam's fileio module (apache_beam.io.fileio) to read arbitrary files. Any custom processing should be done by subsequent ParDo transforms.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Idhem
Solution 2 chamikara