Create a dataframe for stock analysis using a DatetimeIndex time-series data source
I have a data source which gives me the following dataframe, pricehistory:
+---------------------+------------+------------+------------+------------+----------+------+
| time | close | high | low | open | volume | red |
+---------------------+------------+------------+------------+------------+----------+------+
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601.0 | True |
| 2020-01-02 11:16:00 | 321.430623 | 321.465419 | 321.395827 | 321.465419 | 42678.0 | True |
| 2020-01-02 11:17:00 | 321.425652 | 321.445536 | 321.375944 | 321.440565 | 39827.0 | True |
| 2020-01-02 11:33:00 | 321.137343 | 321.261614 | 321.137343 | 321.261614 | 102805.0 | True |
| 2020-01-02 12:11:00 | 321.256643 | 321.266585 | 321.241731 | 321.266585 | 25629.0 | True |
| 2020-01-02 12:12:00 | 321.246701 | 321.266585 | 321.231789 | 321.266585 | 40869.0 | True |
| 2020-01-02 13:26:00 | 321.226818 | 321.266585 | 321.226818 | 321.261614 | 44011.0 | True |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351.0 | True |
| 2020-01-03 10:49:00 | 320.988217 | 321.077692 | 320.988217 | 321.057809 | 84492.0 | True |
| etc... | etc... | etc... | etc... | etc... | etc... | etc. |
+---------------------+------------+------------+------------+------------+----------+------+
Output of pricehistory.dtypes:
close float64
high float64
low float64
open float64
volume float64
red bool
dtype: object
Output of pricehistory.index.dtype:
dtype('<M8[ns]')
Note: this dataframe is large; each row is 1 minute of data and it spans several months, so there are many time frames to iterate over.
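For anyone who wants to reproduce this, here is a minimal sketch of a frame with the same shape; the prices are made up (a random walk), and only the column names, dtypes and 1-min DatetimeIndex match the real data:
import numpy as np
import pandas as pd

# hypothetical 1-minute OHLCV data shaped like pricehistory
idx = pd.date_range("2020-01-02 09:30", "2020-01-02 16:00", freq="1T")
base = 321 + np.cumsum(0.01 * np.random.randn(len(idx)))
pricehistory = pd.DataFrame({
    "close": base,
    "high": base + 0.05,
    "low": base - 0.05,
    "open": base + 0.01 * np.random.randn(len(idx)),
    "volume": np.random.randint(1_000, 300_000, len(idx)).astype(float),
}, index=pd.Index(idx, name="time"))
pricehistory["red"] = pricehistory["close"].lt(pricehistory["open"])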
Question:
I have some specific criteria I'd like to use that will become columns in a new dataframe:
1. The high price and time (minute) of each day across the entire dataframe
2. The first occurrence of 4 downward-trending minutes (close < open) during the day, with their respective times
So far, I'm not exactly sure how to pull the time (DatetimeIndex value) and high price from pricehistory.
For (1) above, I'm using pd.DataFrame(pricehistory.high.groupby(pd.Grouper(freq='D')).max()) which gives me:
+------------+------------+
| time | high |
+------------+------------+
| 2020-01-02 | 322.956677 |
| 2020-01-03 | 321.753729 |
| 2020-01-04 | NaN |
| 2020-01-05 | NaN |
| 2020-01-06 | 321.843204 |
| etc... | etc... |
+------------+------------+
But this doesn't work because it only gives me the day, not the minute. Using min as the Grouper freq doesn't work either, because then it's just the max value within each minute, which is simply the high column again.
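You can see the problem directly (a quick sketch against the pricehistory frame above): the grouped result is indexed by day, so the minute is gone by construction.
daily_max = pricehistory["high"].groupby(pd.Grouper(freq="D")).max()
print(daily_max.index[:3])  # days only: timestamps normalized to midnight, minutes lost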
Desired outcome (note: minutes included):
+---------------------+------------+
| time | high |
+---------------------+------------+
| 2020-01-02 9:31:00 | 322.956677 |
| 2020-01-03 10:13:11 | 321.753729 |
| 2020-01-04 15:33:12 | 320.991231 |
| 2020-01-06 12:01:23 | 321.843204 |
| etc... | etc... |
+---------------------+------------+
For (2) above, I'm using the following:
pricehistory['red'] = pricehistory['close'].lt(pricehistory['open'])
To make a new column in pricehistory which flags each red minute (close below open).
Then, using new_pricehistory = pricehistory.loc[pricehistory[::-1].rolling(4)['red'].sum().eq(4)], I get a new dataframe of only the rows where a run of 4 red minutes begins. Preferably I'd like to keep only the very first occurrence, not all of them.
Current output:
+---------------------+------------+------------+------------+------------+--------+------+
| time | close | high | low | open | volume | red |
+---------------------+------------+------------+------------+------------+--------+------+
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601 | TRUE |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351 | TRUE |
| 2020-01-06 10:49:00 | 320.520956 | 320.570665 | 320.501073 | 320.550781 | 71901 | TRUE |
+---------------------+------------+------------+------------+------------+--------+------+
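As an aside, a tiny sketch (toy values, not the real data) of what that reversed rolling sum does: reversing makes the 4-row window look forward in time, so eq(4) becomes True on the first minute of each 4-red run.
import pandas as pd

s = pd.Series([False, True, True, True, True, False],
              index=pd.date_range("2020-01-02 10:00", periods=6, freq="1T"),
              name="red")
# after re-reversing, each row counts the reds in [t, t+3min]
mask = s[::-1].rolling(4).sum().eq(4)[::-1]
print(s[mask].index)  # -> ['2020-01-02 10:01:00'], the start of the 4-red run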
Solution 1 [1]
Since you didn't provide data, I created some dummy data. By SO policy you should ask a separate question per problem.
Generate data
import pandas as pd
import numpy as np
times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time": times,
                   "high": np.random.randn(len(times))})
Question 1
Here I just look for the index where each day's maximum occurs and filter df accordingly:
idx = df.groupby(df["time"].dt.date)["high"].idxmax().values
df[df.index.isin(idx)]
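Equivalent and a bit more direct, since .loc accepts the idxmax labels as-is (same dummy df as above):
daily_highs = df.loc[df.groupby(df["time"].dt.date)["high"].idxmax()]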
Update: in case you have time as the index of your df, the solution is:
df = df.set_index("time")
idx = df.groupby(pd.Grouper(freq='D'))["high"].idxmax().values
df[df.index.isin(idx)]
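One caveat to hedge: pd.Grouper(freq='D') also creates empty bins for days with no rows at all (the weekend NaNs in the question), and depending on your pandas version idxmax can choke on those. Grouping on the index's calendar date sidesteps this, because only days that actually occur become groups:
idx = df.groupby(df.index.date)["high"].idxmax()
daily_highs = df.loc[idx]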
Question 2
import pandas as pd
import numpy as np
# generate data
times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time": times,
                   "open": np.random.randn(len(times))})
df["open"] = np.where(df["open"]<0, -1 * df["open"], df["open"])
df["close"] = df["open"] + 0.01 *np.random.randn(len(times))
df = df.set_index("time")
df["red"] = df['close'].lt(df['open'])
# this function returns the first timestamp
# at which there are 4 consecutive reds
def get_first(ts):
    idx = ts.loc[ts[::-1].rolling(4)['red'].sum().ge(4)].index
    if idx.empty:
        return pd.NaT
    else:
        return idx[0]
# get first time within group and drop nan
grp = (df.groupby(pd.Grouper(freq='D'))
         .apply(get_first)
         .dropna())
df[df.index.isin(grp.values)]
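An alternative sketch without the Python-level apply, using the classic cumsum run-labelling trick. Assumptions: same df as above, strictly contiguous 1-minute rows (true for the dummy data, not necessarily for real market data), and, unlike the daily groupby, the run counter is not reset at midnight.
red = df['red'].astype(int)
run = red.groupby((red == 0).cumsum()).cumsum()  # running count of consecutive reds
fourth = df.index[(run == 4).values]             # the 4th minute of every red run
starts = fourth - pd.Timedelta(minutes=3)        # back up to each run's first minute
first_per_day = pd.Series(starts, index=starts).groupby(starts.date).first()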
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
