How to keep np.array properties in pandas dataframe where elements are arrays?

I want to use a dataframe as a sort of database where elements of it are numpy arrays and I want to keep their properties for later use in dataframe operations.

import numpy as np
import pandas as pd

names=["randomname"]*4
identifiers=["a","b","b","c"]
arrays=[np.arange(0,10),np.arange(20,30),np.arange(22,32),np.arange(40,50)]
alltogether=np.array([names,identifiers,arrays])
df=pd.DataFrame(data=alltogether.T,columns=["names","id","arrays"])

This gives me more or less the DataFrame I want.

However, I want to be able to use DataFrame indexing logic together with plotting.

For example

df[df["id"]=="b"].plot()

this currently gives

TypeError: no numeric data to plot

Can anybody help me keep each element as an np.array and still plot it?

Ideally, my indexing logic would let me plot several of the arrays that match certain criteria (here id == "b").

I am kinda lost



Solution 1:[1]

Your code:

In [38]: names=["randomname"]*4
    ...: identifiers=["a","b","b","c"]
    ...: arrays=[np.arange(0,10),np.arange(20,30),np.arange(22,32),np.arange(40,50)]
    ...: alltogether=np.array([names,identifiers,arrays])
    ...: df=pd.DataFrame(data=alltogether.T,columns=["names","id","arrays"])
<ipython-input-38-f2bc5f6c3a15>:4: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  alltogether=np.array([names,identifiers,arrays])

Did you get this warning? Does it bother you? It's produced by the alltogether line: you are mixing strings and arrays, so the result has to be an object-dtype array.
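One way to sidestep the warning (which, as far as I know, newer NumPy releases turn into an error unless dtype=object is given) is to skip the intermediate np.array and hand pandas the columns directly. This is only a sketch of that idea, not what the question's code does; the column names are the ones from the question.

```python
import numpy as np
import pandas as pd

names = ["randomname"] * 4
identifiers = ["a", "b", "b", "c"]
arrays = [np.arange(0, 10), np.arange(20, 30), np.arange(22, 32), np.arange(40, 50)]

# Pass each column separately; pandas stores the list of arrays as an
# object column, so no ragged np.array is built along the way.
df = pd.DataFrame({"names": names, "id": identifiers, "arrays": arrays})

print(df.dtypes)                  # all three columns are object dtype
print(type(df.loc[0, "arrays"]))  # each cell is still a numpy array
```

The cells of the "arrays" column remain ordinary numpy arrays, so their methods and attributes are available after selecting rows.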

Anyways, the result (which you should have shown :( ):

In [39]: df
Out[39]: 
        names id                                    arrays
0  randomname  a            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1  randomname  b  [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2  randomname  b  [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
3  randomname  c  [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
In [40]: df.dtypes
Out[40]: 
names     object
id        object
arrays    object
dtype: object
In [41]: alltogether
Out[41]: 
array([['randomname', 'randomname', 'randomname', 'randomname'],
       ['a', 'b', 'b', 'c'],
       [array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
        array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),
        array([22, 23, 24, 25, 26, 27, 28, 29, 30, 31]),
        array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])]], dtype=object)
In [42]: df['arrays'].to_numpy()
Out[42]: 
array([array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
       array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),
       array([22, 23, 24, 25, 26, 27, 28, 29, 30, 31]),
       array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])], dtype=object)

Selecting a couple of rows:

In [46]: df[df['id']=='b']
Out[46]: 
        names id                                    arrays
1  randomname  b  [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
2  randomname  b  [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

What is the plot method supposed to do? I haven't used it.

I can make a simple line plot from a frame like this:

In [58]: adf = pd.DataFrame(np.arange(5)**2)
In [59]: adf
Out[59]: 
    0
0   0
1   1
2   4
3   9
4  16
In [60]: adf.plot()

But that has one number per cell, not strings (your id and names columns) or arrays.
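If the goal is to get back to one number per cell, the selected arrays can be expanded into a separate numeric frame. The sketch below assumes all selected arrays have the same length; the names selected and wide are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "names": ["randomname"] * 4,
    "id": ["a", "b", "b", "c"],
    "arrays": [np.arange(0, 10), np.arange(20, 30),
               np.arange(22, 32), np.arange(40, 50)],
})

selected = df[df["id"] == "b"]

# Stack the selected arrays into a 2-D numeric array (assumes equal lengths),
# then transpose so each selected row becomes one numeric column.
wide = pd.DataFrame(np.stack(selected["arrays"].tolist()).T,
                    columns=selected.index)

print(wide.dtypes)  # integer columns, one per selected row
# wide.plot()       # would draw one line per selected array (needs matplotlib)
```

The original frame keeps its array cells; wide is a throwaway view built only for plotting or other numeric work.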

I could call matplotlib's plot function on the individual array elements of your frame, but a simple call to the DataFrame plot method won't do it.

In [68]: df[df['id']=='b']['arrays'].to_numpy()
Out[68]: 
array([array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),
       array([22, 23, 24, 25, 26, 27, 28, 29, 30, 31])], dtype=object)
In [69]: plt.plot(_[0],_[1])
Out[69]: [<matplotlib.lines.Line2D at 0x7f3915e6bfd0>]

Here plt is matplotlib.pyplot, and _ is IPython's shorthand for the previous output. This uses the 20:30 range as the x axis, and 22:32 as y.
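If you would rather have each selected array drawn as its own line (against its position 0..9) instead of one array against the other, a loop over the selected rows is probably the simplest route. This is a sketch under that assumption; the Agg backend is chosen only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "names": ["randomname"] * 4,
    "id": ["a", "b", "b", "c"],
    "arrays": [np.arange(0, 10), np.arange(20, 30),
               np.arange(22, 32), np.arange(40, 50)],
})

fig, ax = plt.subplots()

# Use ordinary boolean indexing to pick rows, then plot each array cell.
for idx, row in df[df["id"] == "b"].iterrows():
    ax.plot(row["arrays"], label=f"id={row['id']} (row {idx})")

ax.legend()
```

The selection logic stays in pandas, and matplotlib only ever sees plain numpy arrays.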

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 hpaulj