'How to Find the consecutive values in Pandas DataFrame column and replace the value

I have a below Dataframe

Col1    Col2    Col3
A       ABC     100
B       BCD     200
C       CDA     300
D       CDA     400
E       CDA     500
F       EFG     600
G       XYZ     700
H       XYZ     800
I       PQR     900

We can see that in Col2 CDA is repeated 3 times and XYZ is repeated 2 times.

Now I need create the new column NewCol4 and check for the consecutive values, if there are consecutive values just add underscore at the last and add the number after underscore, if it is not consecutive value leave as it is.

I need the output in the below format.

Col1    Col2    Col3   NewCol4
A       ABC     100    ABC
B       BCD     200    BCD
C       CDA     300    CDA_1
D       CDA     400    CDA_2
E       CDA     500    CDA_3
F       EFG     600    EFG
G       XYZ     700    XYZ_1
H       XYZ     800    XYZ_2
I       PQR     900    PQR

I am very new to Pandas, is there any way to achieve the above output that it would be really helpful.



Solution 1:[1]

Pretty sure that my answer is not the best idea and there should be some single-line approaches as well, but you can use:

checkNumber = [0]
checkValue = [df["Col2"].values[0]]
def toApply(value):
  if sum(df["Col2"] == value) > 1:
    if checkValue[0] == value:
      checkNumber[0] += 1
      return value + "_" + str(checkNumber[0])
    else:
      checkNumber[0] = 0
      checkValue[0] = value
      print(value)
      return toApply(value)
  else:
    return value
df["NewCol4"] = df["Col2"].apply(toApply)
df

Output

Col1 Col2 Col3 NewCol4
0 A ABC 100 ABC
1 B BCD 200 BCD
2 C CDA 300 CDA_1
3 D CDA 400 CDA_2
4 E CDA 500 CDA_3
5 F EFG 600 EFG
6 G XYZ 700 XYZ_1
7 H XYZ 800 XYZ_2
8 I PQR 900 PQR

Solution 2:[2]

A couple of "one-liners" together (?) can bring you all the way, but surely this could be optimized:

# Assign a number to each consecutive group of equal "Col2" values
df["g_rank"] = (df["Col2"] != df["Col2"].shift()).cumsum()

# Add a column with the size of each group
df = df.join(df.groupby("g_rank")["Col2"].size().rename("size"), on="g_rank")

# Now add a different number to each member of a group
df["l_rank"] = df.groupby("g_rank")["size"].rank("first")

# Finally, calculate all the "NewCol4" and remove the unnecesary ones
df["NewCol4"] = df["Col2"] + "_" + df["l_rank"].astype(int).astype(str)
df.loc[df["size"] <= 1, "NewCol4"] = None

I left some columns on purpose for make it easier to understand each step. This should be the output:

  Col1 Col2  Col3  g_rank  size  l_rank NewCol4
0    A  ABC   100       1     1     1.0    None
1    B  BCD   200       2     1     1.0    None
2    C  CDA   300       3     3     1.0   CDA_1
3    D  CDA   400       3     3     2.0   CDA_2
4    E  CDA   500       3     3     3.0   CDA_3
5    F  EFG   600       4     1     1.0    None
6    G  XYZ   700       5     2     1.0   XYZ_1
7    H  XYZ   800       5     2     2.0   XYZ_2
8    I  PQR   900       6     1     1.0    None

Solution 3:[3]

Maybe something like this could help you

# get repeating values
counts = df['Col2'].value_counts()
multi = counts.index[counts>1]
# Get a counter for each Col2 value
new = df.groupby('Col2').apply(pd.DataFrame.reset_index)
new = new.reset_index(1)
new['level_1'] += 1
# Build a NewCol4 with a suffix for all
new['NewCol4'] = new['Col2'] + "_" + new['level_1'].astype(str)
# Undo those not needed
stay = ~new['Col2'].isin(multi)
new.loc[stay, "NewCol4"] = new.loc[stay, "Col2"]
# remove unneeded columns
new.set_index('index', drop=True, inplace=True)
new.pop('level_1')
new

This is not yet very optimized, but could help as a starting point.

Solution 4:[4]

use a groupby and the df.loc to create a NewCol4 . The groupby results can have an apply with function used to return dataframes index hierarchy and list of list values of the groupby resultset.

data="""Col1    Col2    Col3   NewCol4
A       ABC     100    ABC
B       BCD     200    BCD
C       CDA     300    CDA_1
D       CDA     400    CDA_2
E       CDA     500    CDA_3
F       EFG     600    EFG
G       XYZ     700    XYZ_1
H       XYZ     800    XYZ_2
I       PQR     900    PQR"""

df = pd.read_csv(StringIO(data), sep="\s+")

grouped=df.groupby('Col2')['Col3']

index=[]
values=[]
def count_consecutive(df):
    index.append(df.index)
    values.append(df.values)

grouped.apply(count_consecutive)        
#[print(x) for x in index]
#[print(x) for x in values]

for x in index:
    count=0
    old_value=0
    for i in x:
        field=df.loc[i,'Col2']
        value=df.loc[i,'Col3']
        #print(value)
        if value>old_value:
            count+=1
        df.loc[i,'NewCol4']=field+"_"+str(count)
        old_value=value
    
print(df)

output:

Col1 Col2  Col3 NewCol4
0    A  ABC   100   ABC_1
1    B  BCD   200   BCD_1
2    C  CDA   300   CDA_1
3    D  CDA   400   CDA_2
4    E  CDA   500   CDA_3
5    F  EFG   600   EFG_1
6    G  XYZ   700   XYZ_1
7    H  XYZ   800   XYZ_2
8    I  PQR   900   PQR_1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Amirhossein Kiani
Solution 2 aaossa
Solution 3 Carlos Horn
Solution 4