PySpark input_file_name() into a variable NOT df
I want to store the value from input_file_name() in a variable instead of a DataFrame. This variable will then be used for logging, troubleshooting, etc.
Solution 1:[1]
You can add a new column to the DataFrame using withColumn and input_file_name(), and then use collect() to bring the value back to the driver, as shown below:
df = spark.read.csv("/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv")
df.show()
+-----+
| _c0|
+-----+
|43368|
+-----+
from pyspark.sql.functions import input_file_name
df1 = df.withColumn("file_name", input_file_name())
df1.show(truncate=False)
+-----+---------------------------------------------------------------------------------------------------------+
|_c0 |file_name |
+-----+---------------------------------------------------------------------------------------------------------+
|43368|dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv|
+-----+---------------------------------------------------------------------------------------------------------+
Now create a variable holding the file name by calling collect() and splitting the result on /
file_name = df1.collect()[0][1].split("/")[3]
print(file_name)
Output
part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv
Please note that in your case the indexes used after collect() and after split() may differ.
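Because the index after split() depends on how deep the path is, a variant that always takes the last path component may be less fragile. The following is only a minimal sketch: base_name is a hypothetical helper, not part of PySpark, and the path string is copied from the file_name column shown above. In a real job the string would presumably come from something like df1.first()["file_name"].

```python
import logging

logger = logging.getLogger(__name__)


def base_name(path: str) -> str:
    """Return the last '/'-separated component of a path or URI string.

    Hypothetical helper for illustration; it does not depend on Spark.
    """
    # rstrip guards against a trailing slash; rsplit keeps only the final part.
    return path.rstrip("/").rsplit("/", 1)[-1]


# Example value copied from the file_name column in the output above.
full_path = (
    "dbfs:/FileStore/tmp/"
    "part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv"
)

file_name = base_name(full_path)

# The plain Python string can now be used for logging and troubleshooting.
logger.info("Processing file %s", file_name)
```

Taking the last component avoids hard-coding an index such as [3], so the same line should work whether the file sits two or ten directories deep.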
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | DKNY |
