PySpark input_file_name() into a variable NOT df

I want to store the value from input_file_name() into a variable instead of a dataframe. This variable will then be used for logging, troubleshooting, etc.



Solution 1:[1]

You can create a new column on the dataframe using withColumn and input_file_name(), and then retrieve its value with a collect() operation, something like below:

df = spark.read.csv("/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv")
df.show()

+-----+
|  _c0|
+-----+
|43368|
+-----+

from pyspark.sql.functions import input_file_name

df1 = df.withColumn("file_name", input_file_name())
df1.show(truncate=False)

+-----+---------------------------------------------------------------------------------------------------------+
|_c0  |file_name                                                                                                |
+-----+---------------------------------------------------------------------------------------------------------+
|43368|dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv|
+-----+---------------------------------------------------------------------------------------------------------+

Now, create a variable holding the file name by using collect() and then splitting the path on /:

file_name = df1.collect()[0][1].split("/")[3]

print(file_name)

Output

part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv

Please note that in your case the index passed to collect() as well as the index used after the split might differ.
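To avoid guessing the split index entirely, you can take the last path component with a negative index. This is a minimal sketch in plain Python operating on the string returned by collect() above (the path used here is the example value from the output shown earlier):

```python
# Example path string as returned by input_file_name() / collect() above
path = "dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv"

# [-1] always picks the last path component, so it keeps working
# no matter how many directories precede the file name
file_name = path.split("/")[-1]
print(file_name)
```

The same `[-1]` index can be used directly in the answer's one-liner, i.e. `df1.collect()[0][1].split("/")[-1]`, which removes the dependency on the directory depth.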

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DKNY