How to read and write BSON files with Spark?

I have many MongoDB dumps stored as gzip-compressed BSON files, each containing multiple documents. I would like to read them directly into Spark, ideally partitioning at the level of individual documents.

Previous discussions (1, 2) are old and use the deprecated Hadoop Mongo connector. The new, actively maintained Spark Mongo connector seems to implement a DefaultSource interface, a couple of custom partitioners, and a connection layer.

I would like to extract (or contribute) a way to read a multi-document BSON file from disk into a DataFrame, such that different documents can be loaded into different partitions. Writing would also be nice to have for completeness, but I'm not sure how robust writing to a single file from multiple writers can be. I am new to Spark and unsure where to start.
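
To make the per-document goal concrete, here is a minimal sketch of the splitting step only, not a connector-based solution. It relies on the BSON layout, in which every document starts with a little-endian int32 giving its total length (including those four bytes) and ends with a zero byte. The function names iter_bson_documents and split_gzipped_dump are hypothetical, chosen for illustration, and the sketch assumes a whole decompressed dump fits in memory.

```python
import gzip
import struct
from typing import Iterator, List


def iter_bson_documents(data: bytes) -> Iterator[bytes]:
    """Yield the raw bytes of each BSON document in a concatenated dump.

    Hypothetical helper: it only walks the length prefixes and does not
    decode the documents themselves.
    """
    offset = 0
    total = len(data)
    while offset < total:
        if total - offset < 4:
            raise ValueError("truncated length prefix at offset %d" % offset)
        # First four bytes: total document length, little-endian int32.
        (length,) = struct.unpack_from("<i", data, offset)
        # The smallest valid document is 5 bytes (length + terminator).
        if length < 5 or offset + length > total:
            raise ValueError("invalid document length at offset %d" % offset)
        doc = data[offset:offset + length]
        if doc[-1] != 0:
            raise ValueError("missing terminator at offset %d" % offset)
        yield doc
        offset += length


def split_gzipped_dump(blob: bytes) -> List[bytes]:
    """Decompress one gzip-compressed dump and split it into documents."""
    return list(iter_bson_documents(gzip.decompress(blob)))
```

One possible way to use this from PySpark, assuming the bson package that ships with pymongo is installed on the executors, is to load whole files with sc.binaryFiles(path), apply flatMap(lambda kv: split_gzipped_dump(kv[1])), call repartition(n), and then decode each element with bson.decode. A gzip file is not splittable, so each file is still decompressed by a single task; the documents only spread across partitions after the repartition step.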



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source