Pandas: Unexpected behavior for apply function with torch.tensor()

I am confused by the behavior of the pandas apply() function. I want to convert a column containing lists of ints to torch.tensor objects. Here is some sample code showing the behavior:

import pandas as pd
import torch

df_test = pd.DataFrame([3,3,3], columns=['value'])
df_test.value = df_test.value.apply(lambda x: [y for y in range(x)])
print(df_test)
# Output:
#        value
# 0  [0, 1, 2]
# 1  [0, 1, 2]
# 2  [0, 1, 2]


print(df_test.value.apply(lambda x: torch.tensor(x)))
# Output:
#                                value
# 0  [tensor(0), tensor(1), tensor(2)]
# 1  [tensor(0), tensor(1), tensor(2)]
# 2  [tensor(0), tensor(1), tensor(2)]

print(df_test.value.apply(lambda x: x + [12]))
# Output:
# 0    [0, 1, 2, 12]
# 1    [0, 1, 2, 12]
# 2    [0, 1, 2, 12]

print(torch.tensor([1,2,3]))
# Output:
# tensor([1, 2, 3])

I would have expected one tensor with three elements per row, but instead apply seems to create a list of single-element tensors. To make sure that x really is the list itself, I added an example that appends an element to the list; as you can see, that behaves as expected. Can anyone explain the behavior?

Is there a workaround? I don't want to use torch.tensor(df.values), since I need to apply the tensor transformation to multiple columns and want to keep them in the dataframe. Thanks!



Solution 1:[1]

The reason is that the tensor returned for each row is treated as a generic sequence, just like the list that df_test.value[0] held before, so it shows up as a list of its elements. Converting a tensor to a list gives the same result:

print(df_test.value[0])  # list
x = torch.tensor([1,2,3])
print(list(x))    # convert a tensor to a list

[tensor(1), tensor(2), tensor(3)]
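To see what a cell actually holds after apply, as opposed to how pandas prints it, you can inspect the element type directly. The following is a minimal sketch, assuming a reasonably recent pandas and PyTorch; the names converted and first are only illustrative. If the type is torch.Tensor, the list-like output above is only how the cell is rendered, and the stored object is still one tensor per row.

```python
import pandas as pd
import torch

# Same shape of data as in the question: one list of ints per row.
df_test = pd.DataFrame({"value": [[0, 1, 2], [0, 1, 2], [0, 1, 2]]})

# Convert every list to a tensor, exactly as in the question.
converted = df_test["value"].apply(lambda x: torch.tensor(x))

# Look at the stored object itself rather than at the printed Series.
first = converted.iloc[0]
print(type(first))      # type of the object stored in the cell
print(converted.dtype)  # dtype pandas uses for the column
print(first)            # the cell printed on its own, outside pandas
```

Printing a single cell this way bypasses the pandas formatter, which makes it easier to tell a display effect from a real conversion.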

You expected tensor([1, 2, 3]) to replace each list in df_test["value"]. Keep in mind, though, that tensor is not a valid column dtype in pandas, so such a column can only be held as generic Python objects (dtype object).

One way to solve this problem is to convert the dataframe to a NumPy array and then to a tensor. You can then do all your transformations on the tensor and afterwards convert the result back to NumPy and then to pandas.
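A minimal sketch of that round trip for a single list column might look like this. It assumes every list in the column has the same length, so they can be stacked into one 2-D array. The multiplication is only a placeholder for whatever transformation you actually need, and the names arr and t are illustrative.

```python
import numpy as np
import pandas as pd
import torch

df_test = pd.DataFrame({"value": [[0, 1, 2], [0, 1, 2], [0, 1, 2]]})

# Stack the equal-length lists into one 2-D NumPy array, then wrap it as a tensor.
arr = np.array(df_test["value"].tolist())  # shape (3, 3)
t = torch.from_numpy(arr)

# Placeholder transformation (an assumption, substitute your own).
t = t * 2

# Back to NumPy, then to plain lists, one list per row of the dataframe.
df_test["value"] = pd.Series(t.numpy().tolist(), index=df_test.index)
print(df_test)
```

The same steps can be repeated per column, so the results stay in the dataframe while the tensor work happens outside it.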

You can see the implicit conversion if you try this code:

df_test["new"] = torch.tensor([1,2,3])
type(df_test.new.dtype)  # not a tensor dtype: pandas implicitly converted the tensor to NumPy

numpy.dtype[int64]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Phoenix