How to create and assign indexes for each group in a dataframe

[This DataFrame is loaded with data from an Excel file.]

   STUDY              Teacher       UPDATE_DATE
0   math                    A        2022-02-25
1   math                 A, C        2022-02-25
2   science                 D        2022-01-03
3   science                 A        2022-01-03
4   science              A, B        2022-02-20
5   entertainment           C        2022-01-01
6   entertainment        B, C        2022-03-02
7   technology              E        2021-09-01

Q1. I want to create an INDEX_NUMBER field that numbers the rows within each STUDY group, starting from 0. If a group contains only one row, the field should be empty or null.

   STUDY              Teacher       UPDATE_DATE   INDEX_NUMBER
0   math                    A        2022-02-25           0
1   math                 A, C        2022-02-25           1
2   science                 D        2022-01-03           0
3   science                 A        2022-01-03           1
4   science              A, B        2022-02-20           2
5   entertainment           C        2022-01-01           0
6   entertainment        B, C        2022-03-02           1
7   technology              E        2021-09-01        null  

---------------------------- (NEW) -----------------------------------------

I solved the original problem using the answer below.

def compare_date(group):
    index = range(len(group))
    if len(group) > 2:
        col = ['2022-01-01'] + group.iloc[2:len(group)]['UPDATE_DATE'].tolist() + ['2022-04-10']
    elif len(group) == 2:
        col = ['2022-01-01'] + ['2022-04-10']
    else:
        col = group.iloc[0]['UPDATE_DATE']
    group = group.assign(COMPARE_DATE=col)
    return group

output = df.groupby('STUDY').apply(compare_date)

   STUDY              Teacher       UPDATE_DATE  COMPARE_DATE  INTERVAL_DAYS
0   entertainment           C        2022-03-02    2022-01-01    0.61
1   entertainment        B, C        2022-03-02    2022-04-10    0.39
2   math                    A        2022-02-25    2022-01-01    0.56
3   math                 A, C        2022-02-25    2022-04-10    0.44
4   science                 D        2022-01-03    2022-01-01    0.02
5   science                 A        2022-01-03    2022-02-20    0.48
6   science              A, B        2022-02-20    2022-04-10    0.49
7   technology              E        2021-09-01    2022-09-01    0.00

In addition, when a row lists two or more teachers, I want to split the Teacher value on "," into one row per teacher and divide that row's INTERVAL_DAYS by the number of teachers. I thought this had a similar structure to the compare_date function you helped with and tried to adapt it, but I did not get the desired output, so I am raising an additional question.

   STUDY              Teacher       UPDATE_DATE  COMPARE_DATE  INTERVAL_DAYS
0   entertainment           C        2022-03-02    2022-01-01    0.61
1   entertainment           B        2022-03-02    2022-04-10    0.2
1   entertainment           C        2022-03-02    2022-04-10    0.2
2   math                    A        2022-02-25    2022-01-01    0.56
3   math                    A        2022-02-25    2022-04-10    0.22
3   math                    C        2022-02-25    2022-04-10    0.22
4   science                 D        2022-01-03    2022-01-01    0.02
5   science                 A        2022-01-03    2022-02-20    0.48
6   science                 A        2022-02-20    2022-04-10    0.25
6   science                 B        2022-02-20    2022-04-10    0.25
7   technology              E        2021-09-01    2022-09-01    0.00


Solution 1:[1]

You can add INDEX_NUMBER and COMPARE_DATE by applying a custom function to each group:

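import pandas as pd

def compare_date(group):
    # Default: number the rows of the group 0, 1, 2, ...
    index = range(len(group))
    if len(group) > 2:
        # First row gets the start date, last row gets the end date,
        # and the rows in between keep their own UPDATE_DATE.
        col = ['2022-01-01'] + group.iloc[1:len(group)-1]['UPDATE_DATE'].tolist() + ['2022-04-10']
    elif len(group) == 2:
        col = ['2022-01-01'] + ['2022-04-10']
    else:
        # A single-row group gets missing values in both new columns.
        col = [pd.NA]
        index = [pd.NA]
    group = group.assign(INDEX_NUMBER=index, COMPARE_DATE=col)
    return group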


out = df.groupby('STUDY').apply(compare_date)
print(out)

           STUDY Teacher UPDATE_DATE INDEX_NUMBER COMPARE_DATE
0           math       A  2022-02-25            0   2022-01-01
1           math    A, C  2022-02-25            1   2022-04-10
2        science       D  2022-01-03            0   2022-01-01
3        science       A  2022-01-03            1   2022-01-03
4        science    A, B  2022-02-20            2   2022-04-10
5  entertainment       C  2022-01-01            0   2022-01-01
6  entertainment    B, C  2022-03-02            1   2022-04-10
7     technology       E  2021-09-01         <NA>         <NA>
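
Depending on the pandas version, groupby(...).apply may also place the group key in the index, so the printed frame can look different from the one above. As an alternative, the same rules can be sketched without apply. This is only a sketch: it assumes the rows of each group are already in the desired order, and the names pos, size and compare are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "STUDY": ["math", "math", "science", "science", "science",
              "entertainment", "entertainment", "technology"],
    "UPDATE_DATE": ["2022-02-25", "2022-02-25", "2022-01-03", "2022-01-03",
                    "2022-02-20", "2022-01-01", "2022-03-02", "2021-09-01"],
})

grouped = df.groupby("STUDY")
pos = grouped.cumcount()                    # 0-based position inside each group
size = grouped["STUDY"].transform("size")   # number of rows in the row's group

compare = (
    df["UPDATE_DATE"]
    .mask(pos == 0, "2022-01-01")           # first row of a group: start date
    .mask(pos == size - 1, "2022-04-10")    # last row of a group: end date
    .mask(size == 1, pd.NA)                 # single-row group: missing value
)
df["COMPARE_DATE"] = compare
print(df)
```

Rows that are neither first nor last keep their own UPDATE_DATE, which matches the middle science row in the output above.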

If single-row groups do not need to be NA, there is an easier method using cumcount:

df['INDEX_NUMBER'] = df.groupby('STUDY').cumcount()
print(df)

           STUDY Teacher UPDATE_DATE  INDEX_NUMBER
0           math       A  2022-02-25             0
1           math    A, C  2022-02-25             1
2        science       D  2022-01-03             0
3        science       A  2022-01-03             1
4        science    A, B  2022-02-20             2
5  entertainment       C  2022-01-01             0
6  entertainment    B, C  2022-03-02             1
7     technology       E  2021-09-01             0
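
If the single-row groups should still end up null, as in the question, cumcount can be combined with the group size. The following is a minimal sketch under that assumption; position and group_size are illustrative names, and the nullable Int64 dtype is one possible choice for keeping integers next to a missing value.

```python
import pandas as pd

df = pd.DataFrame({
    "STUDY": ["math", "math", "science", "science", "science",
              "entertainment", "entertainment", "technology"],
    "Teacher": ["A", "A, C", "D", "A", "A, B", "C", "B, C", "E"],
})

grouped = df.groupby("STUDY")
position = grouped.cumcount()                     # 0-based position within each group
group_size = grouped["STUDY"].transform("size")   # rows per group, aligned to df

# Keep the position only where the group has more than one row;
# the remaining rows become missing values.
df["INDEX_NUMBER"] = position.where(group_size > 1).astype("Int64")
print(df)
```

Without the astype call the column would fall back to floats, because a plain integer column cannot hold a missing value.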

Solution 2:[2]

First, to test it, create the dataframe from the question.

import pandas as pd

df = pd.DataFrame(
    {
        'STUDY': ['math', 'math', 'science', 'science', 'science',
                  'entertainment', 'entertainment', 'technology'],
        'Teacher': ['A', 'A, C', 'D', 'A', 'A, B', 'C', 'B, C', 'E'],
        'UPDATE_DATE': ['2022-02-25', '2022-02-25', '2022-01-03', '2022-01-03',
                        '2022-02-20', '2022-01-01', '2022-03-02', '2021-09-01'],
    }
)

Now, for the first part of the challenge, creating the INDEX_NUMBER column, the following will do the work:

df['INDEX_NUMBER'] = df.groupby('STUDY')['STUDY'].transform(lambda x: x.rank(method='first'))

Printing the dataframe shows its current state:

print(df)

[Out]: 
           STUDY Teacher UPDATE_DATE  INDEX_NUMBER
0           math       A  2022-02-25           1.0
1           math    A, C  2022-02-25           2.0
2        science       D  2022-01-03           1.0
3        science       A  2022-01-03           2.0
4        science    A, B  2022-02-20           3.0
5  entertainment       C  2022-01-01           1.0
6  entertainment    B, C  2022-03-02           2.0
7     technology       E  2021-09-01           1.0

Note that INDEX_NUMBER starts at 1 here instead of 0. The next part takes this into account.
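
If integer values are preferred over the floats that rank returns, a 1-based position can also be sketched with cumcount, the method Solution 1 uses. This assumes the goal is simply the row's position within its STUDY group.

```python
import pandas as pd

df = pd.DataFrame({
    "STUDY": ["math", "math", "science", "science", "science",
              "entertainment", "entertainment", "technology"],
})

# cumcount() numbers the rows of each group from 0, so adding 1
# gives a 1-based integer position.
df["INDEX_NUMBER"] = df.groupby("STUDY").cumcount() + 1
print(df)
```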

Now, for the last part of the question, my suggestion is to think more carefully about the requirements and, if needed, post a new question. There are a few things you might want to consider (see this comment).

As things stand, one way to solve your particular issue is the following function (not the most elegant, but it does the job):

def get_compare_date(x):
    if x['STUDY'] == 'math':
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        else:
            return '2022-04-10'
    elif x['STUDY'] == 'science':
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        elif x['INDEX_NUMBER'] == 2:
            return x['UPDATE_DATE']
        else:
            return '2022-04-10'
    elif x['STUDY'] == 'entertainment':
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        else:
            return '2022-04-10'
    else:
        if x['INDEX_NUMBER'] == 1:
            return '2022-01-01'
        else:
            return '2022-04-10'

From that function, one can create the COMPARE_DATE column:

df['COMPARE_DATE'] = df.apply(get_compare_date, axis=1)

This outputs the following:

print(df)

[Out]:
           STUDY Teacher UPDATE_DATE  INDEX_NUMBER COMPARE_DATE
0           math       A  2022-02-25             1   2022-01-01
1           math    A, C  2022-02-25             2   2022-04-10
2        science       D  2022-01-03             1   2022-01-01
3        science       A  2022-01-03             2   2022-01-03
4        science    A, B  2022-02-20             3   2022-04-10
5  entertainment       C  2022-01-01             1   2022-01-01
6  entertainment    B, C  2022-03-02             2   2022-04-10
7     technology       E  2021-09-01             1   2022-01-01

This may or may not be enough to solve what you are looking for.
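
The follow-up in the question (one row per teacher, with INTERVAL_DAYS divided by the number of teachers) is not covered above. A minimal sketch of one possible approach follows, assuming teachers are always separated by commas and that INTERVAL_DAYS already exists as a numeric column. The small frame here reuses a few values from the question for illustration, and the variable teachers is a made-up name.

```python
import pandas as pd

out = pd.DataFrame({
    "STUDY": ["entertainment", "entertainment", "technology"],
    "Teacher": ["C", "B, C", "E"],
    "INTERVAL_DAYS": [0.61, 0.39, 0.00],
})

# Turn "B, C" into the list ["B", " C"]; single teachers become one-item lists.
teachers = out["Teacher"].str.split(",")

# Divide each row's value by its number of teachers before splitting the row.
out["INTERVAL_DAYS"] = out["INTERVAL_DAYS"] / teachers.str.len()

# One row per teacher; explode repeats the other columns and the index label.
out["Teacher"] = teachers
out = out.explode("Teacher")
out["Teacher"] = out["Teacher"].str.strip()   # drop the space left after the comma
print(out)
```

The exploded rows keep the index label of the row they came from, as in the desired output in the question. Call reset_index(drop=True) afterwards if a fresh 0..n-1 index is needed instead.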

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Gonçalo Peres