shape vs len for numpy array
Is there a difference (in performance for example) when comparing shape and len? Consider the following example:
In [1]: import numpy as np
In [2]: a = np.array([1,2,3,4])
In [3]: a.shape
Out[3]: (4,)
In [4]: len(a)
Out[4]: 4
A quick runtime comparison suggests that there is no difference:
In [17]: a = np.random.randint(0,10000, size=1000000)
In [18]: %time a.shape
CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 13.1 µs
Out[18]: (1000000,)
In [19]: %time len(a)
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.06 µs
Out[19]: 1000000
So, what is the difference and which one is more pythonic? (I guess using shape).
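A single %time call at the microsecond scale is mostly noise, so a repeated measurement is more reliable. The following is a minimal sketch using the standard-library timeit module; the repetition count n is an arbitrary choice for illustration, and the absolute numbers will vary by machine and NumPy version.

```python
import timeit

import numpy as np

a = np.random.randint(0, 10000, size=1_000_000)

# n is an arbitrary repetition count chosen for this sketch.
n = 100_000

# Each call is wrapped in a lambda, so the lambda call overhead is
# included equally in all three measurements.
t_shape = timeit.timeit(lambda: a.shape, number=n)
t_shape0 = timeit.timeit(lambda: a.shape[0], number=n)
t_len = timeit.timeit(lambda: len(a), number=n)

print(f"a.shape    : {t_shape / n * 1e9:.0f} ns per call")
print(f"a.shape[0] : {t_shape0 / n * 1e9:.0f} ns per call")
print(f"len(a)     : {t_len / n * 1e9:.0f} ns per call")
```

Neither call walks the array data, so the timings should not depend on the array size.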
Solution 1:[1]
From the pandas source code, it looks like DataFrame.shape basically uses len():
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py
@property
def shape(self) -> Tuple[int, int]:
    return len(self.index), len(self.columns)

def __len__(self) -> int:
    return len(self.index)
Accessing shape evaluates both dimensions, so df.shape[0] + df.shape[1] may be slightly slower than len(df.index) + len(df.columns). Still, performance-wise, the difference should be negligible, except perhaps for a very large 2D dataframe.
So, in line with the previous answers, df.shape is good if you need both dimensions; for a single dimension, len() seems more appropriate conceptually.
Looking at the property vs method answers, it all comes down to usability and readability of the code. So again, in your case, I would say: if you want information about the whole dataframe, just to check it or, for example, to pass the shape tuple to a function, use shape. For a single column, including the index (i.e. the rows of a df), use len().
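The relationship between the two can be checked directly. This is a small sketch on a hypothetical three-row, two-column frame (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the relationship.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows = len(df)          # delegates to len(df.index)
n_cols = len(df.columns)  # number of columns
shape = df.shape          # (rows, columns) tuple

print(n_rows, n_cols, shape)
```

Here len(df) matches df.shape[0], and the shape tuple is just the two lengths packed together.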
Solution 2:[2]
There really is a (very small) difference. If you work on time-series data and know that the data is a vector (1D), use len, as it is faster, and make it a habit, even if the gain is very marginal. Bish's answer already explained what happens behind the scenes.
A proper benchmark using %%timeit (I ran it several times) shows len as the winner:
# tested on pandas DataFrame
%%timeit
len(yhat.values)
# 576 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
yhat.values.shape[0]
# 607 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Furthermore, in 1D, len, read as "length", is more informative (when you read the code) than .shape[0].
Solution 3:[3]
For the 1D case, both len and shape give the same information. In other cases, shape provides more information. Which one performs better depends on the program. I suggest not worrying much about performance.
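A short sketch of how the two differ once the array is not 1D. The arrays vec, mat and scalar are hypothetical examples; the 0-dimensional case is included because len() is not defined for it, while shape still works.

```python
import numpy as np

vec = np.arange(4)                 # 1D array
mat = np.arange(12).reshape(3, 4)  # 2D array
scalar = np.array(5)               # 0-dimensional array

print(len(vec), vec.shape)  # length and shape carry the same information
print(len(mat), mat.shape)  # len only reports the first axis
print(scalar.shape)         # empty tuple for a 0-d array

# len() raises TypeError on a 0-d array, so record whether it worked.
try:
    len(scalar)
    zero_d_len_ok = True
except TypeError:
    zero_d_len_ok = False
print(zero_d_len_ok)
```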
Solution 4:[4]
import numpy as np
x = np.linspace(1, 10, 10).reshape((5, 2))
print(x)
print(x.size)
print(len(x))
gives the following output:
[[ 1.  2.]
 [ 3.  4.]
 [ 5.  6.]
 [ 7.  8.]
 [ 9. 10.]]
10
5
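The output above shows that size counts every element while len counts only the first axis. As a small sketch, shape ties the two together: its first entry is len(x), and the product of its entries is x.size. The variable names below are chosen for illustration.

```python
import math

import numpy as np

x = np.linspace(1, 10, 10).reshape((5, 2))

n_first_axis = len(x)  # number of rows
shape = x.shape        # (rows, columns)
n_elements = x.size    # total number of elements

print(n_first_axis == shape[0])         # True
print(n_elements == math.prod(shape))   # True
```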
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bish |
| Solution 2 | Muhammad Yasirroni |
| Solution 3 | Ashiq Imran |
| Solution 4 | |
