PySpark input_file_name() into a variable NOT df

I want to store the value from input_file_name() into a variable instead of a dataframe. This variable will then be used for logging, troubleshooting, etc.



Solution 1:[1]

You can create a new column on the dataframe using withColumn and input_file_name(), and then retrieve its value with a collect() operation, something like below:

df = spark.read.csv("/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv")
df.show()

+-----+
|  _c0|
+-----+
|43368|
+-----+

from pyspark.sql.functions import input_file_name

df1 = df.withColumn("file_name", input_file_name())
df1.show(truncate=False)

+-----+---------------------------------------------------------------------------------------------------------+
|_c0  |file_name                                                                                                |
+-----+---------------------------------------------------------------------------------------------------------+
|43368|dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv|
+-----+---------------------------------------------------------------------------------------------------------+

Now, create a variable holding the file name by using collect() and then splitting the path on /:

file_name = df1.collect()[0][1].split("/")[3]

print(file_name)

Output

part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv

Please note that in your case the index passed to collect() as well as the index used after the split might differ.
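To avoid guessing the split index entirely, you can take the last path component with a negative index. This is a minimal sketch in plain Python operating on the string returned by collect() above (the path used here is the example value from the output shown earlier):

```python
# Example path string as returned by input_file_name() / collect() above
path = "dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv"

# [-1] always picks the last path component, so it keeps working
# no matter how many directories precede the file name
file_name = path.split("/")[-1]
print(file_name)
```

The same `[-1]` index can be used directly in the answer's one-liner, i.e. `df1.collect()[0][1].split("/")[-1]`, which removes the dependency on the directory depth.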

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DKNY