Returning a Pandas dataframe to the caller of a function (return vs. assign variable to function call)
Let's assume we have the following Pandas dataframe df:
import pandas as pd
df = pd.DataFrame({'food': ['spam', 'ham', 'eggs', 'ham', 'ham', 'eggs', 'milk'],
                   'sales': [10, 15, 12, 5, 14, 3, 8]})
Let's further assume that we have the following function that squares the value of the sales column in df:
def square_sales(df):
    df['sales'] = df['sales']**2
    return df
Now, let's assume we have a requirement to: "return df to the caller"
Does this mean that we pass a df to the square_sales function, then return the processed df (i.e. the df with the squared sales column)?
Or, does this mean that we pass df to square_sales, then assign that function call to a variable named df? For example:
df = square_sales(df)
Thanks!
Solution 1:[1]
The function changes the df itself (an in-place operation). Even if you don't return the df, the change will be visible in the calling scope as well.
The way it is written will work the same for both cases:
df = square_sales(df)
and
square_sales(df)
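As a quick check of the point above, here is a minimal sketch (with a shortened version of the question's data) showing that the returned object is the very same DataFrame the caller passed in, so the reassignment makes no difference:

```python
import pandas as pd

def square_sales(df):
    # Mutates the DataFrame it receives, then returns that same object
    df['sales'] = df['sales'] ** 2
    return df

df = pd.DataFrame({'food': ['spam', 'ham', 'eggs'], 'sales': [10, 15, 12]})

result = square_sales(df)  # called without reassigning df

print(result is df)          # True: same object, not a copy
print(df['sales'].tolist())  # [100, 225, 144]: the caller's df was changed
```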
If you need to return a new df without altering the original, you'll have to first make a copy and only then assign the new column. In this case you will also have to assign the returned df to a new variable:
def square_sales(df):
    df2 = df.copy(deep=True)
    df2['sales'] = df2['sales']**2
    return df2

new_df = square_sales(df)
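A small sketch to confirm the behaviour of the copying version, again using a shortened version of the question's data: the original df keeps its values and the returned DataFrame is a separate object.

```python
import pandas as pd

def square_sales(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df2 = df.copy(deep=True)
    df2['sales'] = df2['sales'] ** 2
    return df2

df = pd.DataFrame({'food': ['spam', 'ham', 'eggs'], 'sales': [10, 15, 12]})
new_df = square_sales(df)

print(df['sales'].tolist())      # [10, 15, 12]: original unchanged
print(new_df['sales'].tolist())  # [100, 225, 144]
print(new_df is df)              # False: a distinct object
```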
Solution 2:[2]
I think there's some aspect of functions and variable scope that you're confused about, but I'm not sure precisely what. If the function returns a DataFrame, then outside of the function you can assign that returned DataFrame to whatever variable you want. Whether or not the variable name outside the function is the same as the variable name inside the function doesn't matter, as far as the function is concerned.
SiP's answer already points out that your function modifies the original input DataFrame in place and returns the updated version. I would caution that this is a misleading antipattern. Functions that operate on a mutable value (like a DataFrame) are usually expected to only do one or the other. And Pandas' own methods, by default, return the new value without modifying in place, as it appears you've been asked to do.
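To illustrate that default, here is a brief sketch using DataFrame.assign. That method is my own choice of example and is not mentioned in the question; it returns a new DataFrame and leaves the one it was called on alone.

```python
import pandas as pd

df = pd.DataFrame({'food': ['spam', 'ham', 'eggs'], 'sales': [10, 15, 12]})

# assign() builds and returns a new DataFrame; df itself is not modified
squared = df.assign(sales=df['sales'] ** 2)

print(df['sales'].tolist())       # [10, 15, 12]
print(squared['sales'].tolist())  # [100, 225, 144]
```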
So I would advise that you use the modified function suggested by SiP, that copies the supplied DataFrame before making changes. As for using it, all of these do basically the same thing:
# Define df
df = square_sales(df)
# Define df
new_df = square_sales(df)
# Define df
some_other_variable_name = square_sales(df)
The only real difference is that in the first case, you no longer have access to the previous, unmodified DataFrame. But if you don't need that, and from then on you only plan to use the squared version, then it can make perfect sense.
(Also, if you wanted to, you could alter the function definition to use a different parameter name, say my_internal_df. This would not in any way affect how any of those three examples work.)
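Here is a rough sketch of that last point, assuming the copying version of the function. The name my_internal_df comes from the paragraph above; the local name result and the shortened data are just for illustration.

```python
import pandas as pd

def square_sales(my_internal_df):
    # The parameter name is local to the function; callers never see it
    result = my_internal_df.copy(deep=True)
    result['sales'] = result['sales'] ** 2
    return result

df = pd.DataFrame({'food': ['spam', 'ham', 'eggs'], 'sales': [10, 15, 12]})

new_df = square_sales(df)
some_other_variable_name = square_sales(df)
df = square_sales(df)  # rebinding df: the unsquared version is no longer reachable by name

# All three names now refer to DataFrames with identical contents
print(df.equals(new_df) and df.equals(some_other_variable_name))  # True
```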
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SiP |
| Solution 2 | |
