Create a dataframe for stock analysis using a DatetimeIndex time series data source

I have a datasource which gives me the following dataframe, pricehistory:

+---------------------+------------+------------+------------+------------+----------+------+
|         time        |   close    |    high    |    low     |    open    |  volume  | red  |
+---------------------+------------+------------+------------+------------+----------+------+
|                     |            |            |            |            |          |      |
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601.0 | True |
| 2020-01-02 11:16:00 | 321.430623 | 321.465419 | 321.395827 | 321.465419 | 42678.0  | True |
| 2020-01-02 11:17:00 | 321.425652 | 321.445536 | 321.375944 | 321.440565 | 39827.0  | True |
| 2020-01-02 11:33:00 | 321.137343 | 321.261614 | 321.137343 | 321.261614 | 102805.0 | True |
| 2020-01-02 12:11:00 | 321.256643 | 321.266585 | 321.241731 | 321.266585 | 25629.0  | True |
| 2020-01-02 12:12:00 | 321.246701 | 321.266585 | 321.231789 | 321.266585 | 40869.0  | True |
| 2020-01-02 13:26:00 | 321.226818 | 321.266585 | 321.226818 | 321.261614 | 44011.0  | True |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351.0 | True |
| 2020-01-03 10:49:00 | 320.988217 | 321.077692 | 320.988217 | 321.057809 | 84492.0  | True |
| etc...              | etc...     | etc...     | etc...     | etc...     | etc...   | etc. |
+---------------------+------------+------------+------------+------------+----------+------+

Output of pricehistory.dtypes:

close     float64
high      float64
low       float64
open      float64
volume    float64
red          bool
dtype: object

Output of pricehistory.index.dtype: dtype('<M8[ns]')

Note: This dataframe is large; each row is one minute of data and the frame spans several months, so there are many time frames to iterate over.

Question:

I have some specific criteria I'd like to use that will become columns in a new dataframe:

  1. High price and time (minute) of each day for the entire dataframe
  2. The first occurrence each day of 4 downward-trending ("red") minutes in a row (close < open), with their respective times

So far, I'm not exactly sure how to pull the time (DatetimeIndex value) and the high price from pricehistory.

For (1) above, I'm using pd.DataFrame(pricehistory.high.groupby(pd.Grouper(freq='D')).max()) which gives me:

+------------+------------+
|    time    |    high    |
+------------+------------+
|            |            |
| 2020-01-02 | 322.956677 |
| 2020-01-03 | 321.753729 |
| 2020-01-04 | NaN        |
| 2020-01-05 | NaN        |
| 2020-01-06 | 321.843204 |
| etc...     | etc...     |
+------------+------------+

But this doesn't work because it only gives me the day, not the minute, and using min as the Grouper freq doesn't help either, because then each group's max is just that single minute's high.

Desired outcome (note: minutes included):

+---------------------+------------+
|    time             |    high    |
+---------------------+------------+
|                     |            |
| 2020-01-02 9:31:00  | 322.956677 |
| 2020-01-03 10:13:11 | 321.753729 |
| 2020-01-04 15:33:12 | 320.991231 |
| 2020-01-06 12:01:23 | 321.843204 |
| etc...              | etc...     |
+---------------------+------------+
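
I imagine something built around idxmax per day could produce this shape, e.g. the rough sketch below (untested against the full dataframe), but I'm not sure it's the idiomatic way:

# Rough sketch: idxmax gives the timestamp label of each day's highest high;
# grouping on the calendar date means days with no rows never form a group.
daily_idx = pricehistory.groupby(pricehistory.index.date)['high'].idxmax()
daily_highs = pricehistory.loc[daily_idx, ['high']]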

For (2) above, I'm using the following:

pricehistory['red'] = pricehistory['close'].lt(pricehistory['open'])

To make a new column in pricehistory that flags each red minute (close < open), which I then use to find 4 red minutes in a row.

Then, using new_pricehistory = pricehistory.loc[pricehistory[::-1].rolling(4)['red'].sum().eq(4)], I get a new dataframe containing only the rows where 4 red minutes in a row occur. Preferably, I'd like to keep only the very first occurrence each day, not all of them.

Current output:

+---------------------+------------+------------+------------+------------+--------+------+
|        time         |   close    |    high    |    low     |    open    | volume | red  |
+---------------------+------------+------------+------------+------------+--------+------+
|                     |            |            |            |            |        |      |
| 2020-01-02 10:14:00 | 321.336177 | 321.505186 | 321.286468 | 321.505186 | 311601 | TRUE |
| 2020-01-03 10:18:00 | 320.839091 | 320.958392 | 320.828155 | 320.958392 | 103351 | TRUE |
| 2020-01-06 10:49:00 | 320.520956 | 320.570665 | 320.501073 | 320.550781 |  71901 | TRUE |
+---------------------+------------+------------+------------+------------+--------+------+
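
The closest I can sketch for keeping just the first occurrence per day (untested against the full dataframe) is to group the filtered rows by calendar date and take the first row of each group:

mask = pricehistory[::-1].rolling(4)['red'].sum().eq(4)   # flags each minute that starts 4 consecutive reds
runs = pricehistory.loc[mask]
first_per_day = runs.groupby(runs.index.date).head(1)     # keep only the earliest such minute per day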


Solution 1:

Since you didn't provide data, I created some dummy data. By SO policy you should ask a separate question per problem; for now I'm answering the first one.

Generate data

import pandas as pd
import numpy as np

times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time":times,
                  "high":np.random.randn(len(times))})

Question 1

Here I just look for the index where the maximum of each day occurs and filter df accordingly:

idx = df.groupby(df["time"].dt.date)["high"].idxmax().values

df[df.index.isin(idx)]

Update: in case you have time as the index of your df, the solution will be:

df = df.set_index("time")

idx = df.groupby(pd.Grouper(freq='D'))["high"].idxmax().values
df[df.index.isin(idx)]
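
A small aside: since idxmax returns index labels, on this gap-free dummy data the same rows could also be selected directly with loc (a minor variant, not required):

idx = df.groupby(pd.Grouper(freq='D'))["high"].idxmax()
df.loc[idx]   # same rows as df[df.index.isin(idx)] here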

Question 2

import pandas as pd
import numpy as np

# generate data
times = pd.date_range(start="2020-06-01", end="2020-06-10", freq="1T")
df = pd.DataFrame({"time":times,
                   "open":np.random.randn(len(times))})

df["open"] = np.where(df["open"]<0, -1 * df["open"], df["open"])
df["close"] = df["open"] + 0.01 *np.random.randn(len(times))
df = df.set_index("time")
df["red"] = df['close'].lt(df['open'])

# this function returns the first timestamp at which
# a run of 4 consecutive red minutes starts

def get_first(ts):
    # reversed rolling sum: each row counts the reds over that minute and the next 3,
    # so the mask flags the minute where a 4-red run begins
    idx = ts.loc[ts[::-1].rolling(4)['red'].sum().ge(4)].index
    if idx.empty:
        return pd.NaT
    else:
        return idx[0]

# get the first such time within each daily group and drop days without a run
grp = df.groupby(pd.Grouper(freq='D'))\
        .apply(get_first).dropna()

df[df.index.isin(grp.values)]
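
Applied to the pricehistory frame from the question, a sketch would look like this (it assumes the red column has already been added there as shown above):

grp = pricehistory.groupby(pd.Grouper(freq='D')).apply(get_first).dropna()
new_pricehistory = pricehistory[pricehistory.index.isin(grp.values)]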

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
