Handling symlinks in pyspark textFileStream
I have a textFileStream that reads files arriving in the folder 'hl/' every 3 seconds.
This works fine when I put actual .csv files in the folder. However, when I put symlinks to .csv files located elsewhere on the drive, it does not work.
Is it possible to make it work? What am I missing?
Here's my code for the textFileStream:
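import time

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

def testprint(rdd):
    # Called once per batch with the lines of the files picked up in that batch.
    if rdd.isEmpty():
        print('Empty')
    else:
        # Each element is one raw line of text, so print the lines directly.
        for line in rdd.collect():
            print(line)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 3)  # 3-second batch interval
files = ssc.textFileStream('hl/')  # watch 'hl/' for new files
files.foreachRDD(testprint)
ssc.start()
time.sleep(30)  # let the stream run for 30 seconds
ssc.stop()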
While it runs, I just copy a CSV or create a symlink inside the folder. Here's the output when copying:
Empty
----------------------------------
Time: 2022-04-25 12:12:03
----------------------------------
A, B
0, 1
1, 0
0, 0
----------------------------------
Time: 2022-04-25 12:12:06
----------------------------------
And here's the output when using symlinks:
Empty
----------------------------------
Time: 2022-04-25 12:14:03
----------------------------------
Empty
----------------------------------
Time: 2022-04-25 12:14:06
----------------------------------
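The two ways of dropping a file into the folder can be reproduced with a small script, which also makes it easy to compare what the filesystem reports for each. This is only a diagnostic sketch: the helper `drop_files` and the file names `copied.csv` and `linked.csv` are hypothetical, and it assumes a POSIX system where creating symlinks is permitted. The Spark Streaming guide states that a file is assigned to a batch by its modification time, so the timestamps are worth checking, but whether that explains the behaviour here is an assumption, not something confirmed above.

```python
import os
import shutil
import tempfile
import time


def drop_files(watch_dir, source_csv):
    """Put a copy of and a symlink to source_csv into watch_dir.

    Returns the modification times (seconds since the epoch) that the
    filesystem reports for each. Names used here are illustrative only.
    """
    copy_path = os.path.join(watch_dir, "copied.csv")
    link_path = os.path.join(watch_dir, "linked.csv")

    # shutil.copy writes a new file, so its mtime is the time of the copy.
    shutil.copy(source_csv, copy_path)
    # The symlink only points at the original file.
    os.symlink(os.path.abspath(source_csv), link_path)

    return {
        "copy": os.stat(copy_path).st_mtime,
        # os.stat follows the link and reports the target's mtime.
        "link_target": os.stat(link_path).st_mtime,
        # os.lstat reports the link entry itself.
        "link_itself": os.lstat(link_path).st_mtime,
    }


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    watch_dir = os.path.join(root, "hl")
    os.makedirs(watch_dir)

    source_csv = os.path.join(root, "data.csv")
    with open(source_csv, "w") as f:
        f.write("A,B\n0,1\n1,0\n0,0\n")

    # Pretend the original file was written an hour ago.
    an_hour_ago = time.time() - 3600
    os.utime(source_csv, (an_hour_ago, an_hour_ago))

    for name, mtime in drop_files(watch_dir, source_csv).items():
        print(name, time.ctime(mtime))
```

If the symlink's target reports a much older modification time than the copy, that difference is one candidate explanation to investigate.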
Edit: reading symlinks works when using the usual non-streaming way, i.e.:
> df = spark.read.option('multiline', 'true').option('header', 'true').csv('hl/')
> df.show()
+-----+
| A| B|
| 0| 1|
+-----+
Any help appreciated!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
