How to increment/decrement a column based on the value of another column in PySpark

I have audit log data in PySpark, and I am trying to figure out how many people are logged on to the system at any given time.

For example, the dataframe I am working with has the following structure:

time    src_comp    auth_orient
1       C1          LogOn
2       C1          LogOn
2       C1          LogOn
3       C2          LogOn
3       C1          LogOff
4       C1          LogOn
4       C2          LogOff
5       C1          LogOn
6       C2          LogOn
6       C3          LogOff

I want to create another column that shows how many people are logged on, as below. I also don't want the LoggedOn column to ever go negative: if someone logs off of a system that I wasn't tracking, the count should not be decremented.

Expected Output:

time    src_comp    LoggedOn
1       C1          1
2       C1          2
3       C2          1
3       C1          1
3       C1          2
4       C2          0
5       C1          3
6       C2          1
6       C3          0

I tried the following code, but it fails: `number += 1` is a statement in Python, not an expression, so it cannot be passed as an argument to F.when() and raises a SyntaxError.

auth_dataset.groupby(auth_dataset.time,auth_dataset.src_comp).agg(F.when(auth_dataset.strt=='LogOn',number+=1).otherwise(number-=1))


Solution 1:[1]

Try the following code. It sums up the +1/-1 values per group, and the greatest(0, col) function makes sure the result never goes negative.

from pyspark.sql import functions as F

auth_dataset.groupby('time', 'src_comp')\
    .agg(F.sum(F.when(auth_dataset.auth_orient == 'LogOn', F.lit(1)).otherwise(F.lit(-1))).alias('LoggedOn'))\
    .withColumn('LoggedOn', F.greatest(F.lit(0), 'LoggedOn'))\
    .sort('time', 'src_comp').show()
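Note that this produces a per-(time, src_comp) net count rather than a running total across the whole log. If you want a cumulative "currently logged on" count over time, a windowed running sum is one option. The sketch below is an illustration I am adding under assumptions, not part of the original answer: it reuses the same column names, orders only by time, and only clamps the displayed value at zero (a true stateful clamp at every step would need different machinery, e.g. a pandas UDF).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# +1 for each LogOn, -1 for each LogOff
delta = F.when(auth_dataset.auth_orient == 'LogOn', 1).otherwise(-1)

# Cumulative frame ordered by time; no partitionBy, so Spark pulls all
# rows into a single partition (fine for a sketch, slow at scale)
w = Window.orderBy('time').rowsBetween(Window.unboundedPreceding, Window.currentRow)

running = auth_dataset.withColumn('LoggedOn', F.sum(delta).over(w))

# greatest(0, col) keeps the column from showing negatives, but it does
# not skip untracked LogOffs the way a stateful pass would
running = running.withColumn('LoggedOn', F.greatest(F.lit(0), F.col('LoggedOn')))
running.sort('time', 'src_comp').show()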

Solution 2:[2]

This is a question, not an answer:

time    src_comp    auth_orient
1       C1          LogOn
2       C2          LogOn
2       C3          LogOn
3       C1          LogOn
3       C2          LogOff
4       C1          LogOn
4       C1          LogOff
5       C2          LogOn
6       C3          LogOn
6       C1          LogOff

In the src_comp column, the values keep resetting to the initial value (C1). How can I assign a group number for each such run? For example: C1, C2, C3 as group 1; C1, C2 as group 2; C1 as group 3; C1, C2, C3 as group 4; C1 as group 5. Whenever the value resets to C1, I want to start a new group number in another column in PySpark. How can I do that?
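One way to approach this (a sketch I am adding, not from the original thread) is to count how many times src_comp has returned to 'C1' so far; that running count is exactly the group number. The dataframe name df and the row_id ordering column are hypothetical, and monotonically_increasing_id() only reflects ingestion order, which is an assumption here.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 'time' alone has ties, so add an explicit ordering column
# (assumes the rows arrive in the order shown above)
df = df.withColumn('row_id', F.monotonically_increasing_id())

w = Window.orderBy('row_id')

# Each return to 'C1' starts a new group, so the running count of
# 'C1' rows seen so far is the group number
df = df.withColumn(
    'group_no',
    F.sum(F.when(F.col('src_comp') == 'C1', 1).otherwise(0)).over(w)
)
df.show()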

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
1           Manoj Singh
2           Manohar Nookala