Compare rows of pandas dataframe

Continuing my earlier question: I now have one dataframe, to which I added a new column 'New' with values 1 through 150, one for each new file, which can be used as an index if required. I figured it would be easier to build the loop for every file separately, but now I am not sure how to proceed. Let me provide the code, an explanation of the task at hand, and some thoughts on how to move on.

New FileName Transcriber Transcription
1 612_000002.wav 100% (80/80) Are we starting off?
612_000002.wav 100% (50/50) shall we starting on
612_000002.wav 100% (2/2) fast mode
612_000002.wav 100% (258/259) Go and start it up
612_000002.wav 100% (20/20) Are we starting off?
612_000002.wav Quartznet there was not inl
612_000002.wav Transducer_M don't start again
612_000002.wav Transducer_L do we start again
2 612_000003.wav 100% (258/259) here we go, hey well woah woah woah
612_000003.wav 100% (23/23) evening gulf air
612_000003.wav 100% (32/32) And as the 1st group reached the bottom of the...
612_000003.wav 100% (80/80) Happy to go off here, woah woah woah
612_000003.wav 100% (10/10) Go boom we'll just
612_000003.wav Transducer_L anything off yeah i'm willing we'll just
612_000003.wav Transducer_S having gone here on will and wolf is
generator = iter(range(1, 151))  # one number per file, 1 through 150

for i in range(len(df_all)):
    if i == 0:
        # the very first row starts the first file
        df_all.iloc[[i], [3]] = next(generator)
    elif df_all.iloc[i]['FileName'] != df_all.iloc[i - 1]['FileName']:
        # the FileName changed, so a new file starts on this row
        df_all.iloc[[i], [3]] = next(generator)
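
For completeness, the same per-file numbering could probably also be produced without an explicit loop, using pandas.factorize. This is just a sketch, assuming 'New' is the fourth column and FileName is a regular (non-index) column; unlike the loop above, it fills every row of a file with its number rather than only the first row.

import pandas as pd

# factorize assigns a 0-based group id per distinct FileName (in order of first
# appearance), so adding 1 gives 1..150 on every row of each file
df_all['New'] = pd.factorize(df_all['FileName'])[0] + 1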

Now, what needs to be done is to create some sort of a matrix (a few columns) where, for every wav file and for every transcription of that wav file, we calculate wer(ground_truth, hypothesis) (from the jiwer library).
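
For concreteness, the call I mean looks like this: jiwer's wer takes a reference string and a hypothesis string and returns the word error rate. The two strings here are transcriptions from the table above.

from jiwer import wer

# "are" -> "shall" and "off" -> "on" are two substitutions out of four
# reference words, so the word error rate is 2/4 = 0.5
wer("are we starting off", "shall we starting on")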

For the first column (to be created), for instance, we take the first transcription as the ground truth and compare it against itself as well as against all other transcriptions of the same wav file. Then we'd move to the next wav file and start again. The second column would take the next (second) transcription as the ground truth and compare all the other transcriptions against it. Note that the number of transcriptions is not fixed for every file.

That's the gist of it. Is it possible at all?

Thank you <3



Solution 1:[1]

If anyone is interested, here is my solution. Since I had several transcriptions for the same filename, I decided to group them into small dataframes with groupby on 'FileName' and iterate through every transcript within each group.

    from jiwer import wer  # word error rate metric

    wer_0 = []
    for idx, df_select in df_mturk.groupby(level=['FileName']):
        for i in range(len(df_select)):
            if i == 0:
                # the first transcription of the file serves as the ground truth
                ground_truth = df_select.iloc[i][1]
            wer_0.append(wer(ground_truth, df_select.iloc[i][1]))

In this code the ground truth was the first row. I then tweaked the code a little to calculate the WER with another transcript as the ground truth (here, the third one).

wer_2 = []
for idx, df_select in df_mturk.groupby(level=['FileName']):
    if len(df_select) > 2:
        for i in range(len(df_select)):
            if i == 0:
                # the third transcription of the file serves as the ground truth
                ground_truth = df_select.iloc[i + 2][1]
            wer_2.append(wer(ground_truth, df_select.iloc[i][1]))
    else:
        # fewer than three transcriptions: there is no third ground truth
        for i in range(len(df_select)):
            wer_2.append(float('nan'))

In the end, I obtained the final dataframe by adding the lists as new columns to the original one, df_mturk.
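
Roughly like this (a sketch: only wer_0 and wer_2 are shown above, the remaining lists are built analogously, and it assumes the groupby iteration order matches the row order of df_mturk):

    # attach the per-row WER lists as new columns of the original dataframe
    df_mturk['WER_0'] = wer_0
    df_mturk['WER_2'] = wer_2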

FileName Transcriber Transcription WER_0 WER_1 WER_2 WER_3 WER_4 WER_5
612_000002.wav 100% (80/80) are we starting off 0.000000 0.500000 2.0 1.0 0.0 NaN
612_000002.wav 100% (50/50) shall we starting on 0.500000 0.000000 2.0 1.0 0.5 NaN
612_000002.wav 100% (2/2) fast mode 1.000000 1.000000 0.0 1.0 1.0 NaN
612_000002.wav 100% (258/259) go and start it up 1.250000 1.250000 2.5 0.0 1.25 NaN
612_000002.wav 100% (20/20) are we starting off 0.000000 0.500000 2.0 1.0 0.0 NaN
612_000003.wav 100% (258/259) here we go hey well woah woah woah 0.000000 2.666667 1.0 0.5 1.5 NaN
612_000003.wav 100% (23/23) evening gulf air 1.000000 0.000000 1.0 1.0 1.0 NaN
612_000003.wav 100% (32/32) and as the 1st group reached the bottom of the... 4.250000 11.333333 0.0 4.125 8.5 NaN
612_000003.wav 100% (80/80) happy to go off here woah woah woah 0.500000 2.666667 0.970588 0.0 1.75 NaN
612_000003.wav 100% (10/10) go boom well just 0.750000 1.333333 1.0 0.875 0.0 NaN
612_000008.wav 100% (10/10) yes obviously trying to get around that with b... 0.000000 1.000000 0.941176 0.545455 NaN NaN
612_000008.wav 100% (258/259) its there you will see tank on random and make... 0.900000 0.000000 0.970588 0.818182 NaN NaN
612_000008.wav 100% (32/32) and as the 1st group reached the bottom of the... 1.600000 1.833333 0.0 1.363636 NaN NaN
612_000008.wav 100% (65/65) yes and obviously try and get round the mark o... 0.600000 1.000000 0.882353 0.0 NaN NaN
612_000010.wav 100% (20/20) its about to pop a hole 0.000000 1.200000 1.25 1.0 NaN NaN
612_000010.wav 100% (258/259) stand up of the hull 1.000000 0.000000 1.0 0.8 NaN NaN
612_000010.wav 100% (50/50) sven to proper hull 0.833333 0.800000 0.0 0.6 NaN NaN
612_000010.wav 100% (80/80) just about the proper hull 0.833333 0.800000 0.75 0.0 NaN NaN
612_000011.wav 100% (80/80) we better go full speed here 0.000000 0.166667 0.857143 NaN NaN NaN
612_000011.wav 100% (65/65) we gona go full speed here 0.166667 0.000000 0.857143 NaN NaN NaN

Solution 2:[2]

I assumed you still wanted to know the transcriber of both the ground truth and the hypothesis. In my approach, I disregarded the new column, as I don't really see what its purpose is.

First, we join the dataframe with itself, merging only on FileName (thereby creating a cross join within each file; 113 rows for this example).

import pandas as pd

# self-merge on FileName: within each file, every transcription (ground truth)
# is paired with every transcription (hypothesis)
df_combo = pd.merge(df,
                    df,
                    on='FileName',
                    suffixes=('_gt', '_hyp'))

df_combo looks like this:

FileName Transcriber_gt Transcription_gt Transcriber_hyp Transcription_hyp
0 612_000002.wav 100% (80/80) Are we starting off? 100% (80/80) Are we starting off?
1 612_000002.wav 100% (80/80) Are we starting off? 100% (50/50) shall we starting on
2 612_000002.wav 100% (80/80) Are we starting off? 100% (2/2) fast mode
3 612_000002.wav 100% (80/80) Are we starting off? 100% (258/259) Go and start it up
4 612_000002.wav 100% (80/80) Are we starting off? 100% (20/20) Are we starting off?
... ... ... ... ... ...
108 612_000003.wav Transducer_S having gone here on will and wolf is 100% (32/32) And as the 1st group reached the bottom of the...
109 612_000003.wav Transducer_S having gone here on will and wolf is 100% (80/80) Happy to go off here, woah woah woah
110 612_000003.wav Transducer_S having gone here on will and wolf is 100% (10/10) Go boom we'll just
111 612_000003.wav Transducer_S having gone here on will and wolf is Transducer_L anything off yeah i'm willing we'll just
112 612_000003.wav Transducer_S having gone here on will and wolf is Transducer_S having gone here on will and wolf is

Next, we use .apply() to perform the calculation on each row and assign the result to a new column wer.

df_combo['wer'] = df_combo.apply(lambda x: wer(x.Transcription_gt, x.Transcription_hyp), 
                                 axis=1)
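
If you want the result in matrix form per file, as described in the question, the long df_combo can also be pivoted. A sketch, assuming every Transcriber value appears only once per FileName:

# pairwise WER matrix for one file: rows are hypotheses, columns are ground truths
wer_matrix = (df_combo[df_combo['FileName'] == '612_000002.wav']
              .pivot(index='Transcriber_hyp', columns='Transcriber_gt', values='wer'))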

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
[1] Solution 1
[2] Solution 2: DocZerø