Compare rows of pandas dataframe

Continuing my earlier question: I now have one dataframe, to which I added a new column 'New' with values 1 through 150, one for each new file, which can be used as an index if required. I figured it would be easier to build the loop for every file separately, but now I am not sure how to proceed. Let me provide the code, an explanation of the task at hand, and some thoughts on how to move on.

New FileName Transcriber Transcription
1 612_000002.wav 100% (80/80) Are we starting off?
612_000002.wav 100% (50/50) shall we starting on
612_000002.wav 100% (2/2) fast mode
612_000002.wav 100% (258/259) Go and start it up
612_000002.wav 100% (20/20) Are we starting off?
612_000002.wav Quartznet there was not inl
612_000002.wav Transducer_M don't start again
612_000002.wav Transducer_L do we start again
2 612_000003.wav 100% (258/259) here we go, hey well woah woah woah
612_000003.wav 100% (23/23) evening gulf air
612_000003.wav 100% (32/32) And as the 1st group reached the bottom of the...
612_000003.wav 100% (80/80) Happy to go off here, woah woah woah
612_000003.wav 100% (10/10) Go boom we'll just
612_000003.wav Transducer_L anything off yeah i'm willing we'll just
612_000003.wav Transducer_S having gone here on will and wolf is
generator = iter(range(1, 151))  # one number per file, 1 through 150

for i in range(len(df_all)):
    if i == 0:
        # the very first row starts the first file
        df_all.iloc[[i], [3]] = next(generator)
    elif df_all.iloc[i]['FileName'] != df_all.iloc[i - 1]['FileName']:
        # the FileName changed, so a new file starts on this row
        df_all.iloc[[i], [3]] = next(generator)
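
For completeness, the same per-file numbering could probably also be produced without an explicit loop, using pandas.factorize. This is just a sketch, assuming 'New' is the fourth column and FileName is a regular (non-index) column; unlike the loop above, it fills every row of a file with its number rather than only the first row.

import pandas as pd

# factorize assigns a 0-based group id per distinct FileName (in order of first
# appearance), so adding 1 gives 1..150 on every row of each file
df_all['New'] = pd.factorize(df_all['FileName'])[0] + 1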

Now, what needs to be done is to create some sort of a matrix (a few columns) where, for every wav file and for every transcription of that wav file, we calculate wer(ground_truth, hypothesis) (from the jiwer library).
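
For concreteness, the call I mean looks like this: jiwer's wer takes a reference string and a hypothesis string and returns the word error rate. The two strings here are transcriptions from the table above.

from jiwer import wer

# "are" -> "shall" and "off" -> "on" are two substitutions out of four
# reference words, so the word error rate is 2/4 = 0.5
wer("are we starting off", "shall we starting on")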

For the first column (to be created), for instance, we take the first transcription as the ground truth and compare it against itself as well as against all other transcriptions of the same wav file. Then we'd move to the next wav file and start again. The second column would take the next (second) transcription as the ground truth and compare all the other transcriptions against it. Note that the number of transcriptions is not fixed for every file.

That's the gist of it. Is it possible at all?

Thank you <3



Solution 1:[1]

If anyone is interested, here is my solution. Since I had several transcriptions for the same filename, I decided to group them into small dataframes with groupby on 'FileName' and iterate through every transcript within each group.

    from jiwer import wer  # word error rate metric

    wer_0 = []
    for idx, df_select in df_mturk.groupby(level=['FileName']):
        for i in range(len(df_select)):
            if i == 0:
                # the first transcription of the file serves as the ground truth
                ground_truth = df_select.iloc[i][1]
            wer_0.append(wer(ground_truth, df_select.iloc[i][1]))

In this code the ground truth was the first row. I then tweaked the code a little to calculate the WER with another transcript as the ground truth (here, the third one).

wer_2 = []
for idx, df_select in df_mturk.groupby(level=['FileName']):
    if len(df_select) > 2:
        for i in range(len(df_select)):
            if i == 0:
                # the third transcription of the file serves as the ground truth
                ground_truth = df_select.iloc[i + 2][1]
            wer_2.append(wer(ground_truth, df_select.iloc[i][1]))
    else:
        # fewer than three transcriptions: there is no third ground truth
        for i in range(len(df_select)):
            wer_2.append(float('nan'))

In the end, I obtained the final dataframe by adding the lists as new columns to the original one, df_mturk.
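
Roughly like this (a sketch: only wer_0 and wer_2 are shown above, the remaining lists are built analogously, and it assumes the groupby iteration order matches the row order of df_mturk):

    # attach the per-row WER lists as new columns of the original dataframe
    df_mturk['WER_0'] = wer_0
    df_mturk['WER_2'] = wer_2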

FileName Transcriber Transcription WER_0 WER_1 WER_2 WER_3 WER_4 WER_5
612_000002.wav 100% (80/80) are we starting off 0.000000 0.500000 2.0 1.0 0.0 NaN
612_000002.wav 100% (50/50) shall we starting on 0.500000 0.000000 2.0 1.0 0.5 NaN
612_000002.wav 100% (2/2) fast mode 1.000000 1.000000 0.0 1.0 1.0 NaN
612_000002.wav 100% (258/259) go and start it up 1.250000 1.250000 2.5 0.0 1.25 NaN
612_000002.wav 100% (20/20) are we starting off 0.000000 0.500000 2.0 1.0 0.0 NaN
612_000003.wav 100% (258/259) here we go hey well woah woah woah 0.000000 2.666667 1.0 0.5 1.5 NaN
612_000003.wav 100% (23/23) evening gulf air 1.000000 0.000000 1.0 1.0 1.0 NaN
612_000003.wav 100% (32/32) and as the 1st group reached the bottom of the... 4.250000 11.333333 0.0 4.125 8.5 NaN
612_000003.wav 100% (80/80) happy to go off here woah woah woah 0.500000 2.666667 0.970588 0.0 1.75 NaN
612_000003.wav 100% (10/10) go boom well just 0.750000 1.333333 1.0 0.875 0.0 NaN
612_000008.wav 100% (10/10) yes obviously trying to get around that with b... 0.000000 1.000000 0.941176 0.545455 NaN NaN
612_000008.wav 100% (258/259) its there you will see tank on random and make... 0.900000 0.000000 0.970588 0.818182 NaN NaN
612_000008.wav 100% (32/32) and as the 1st group reached the bottom of the... 1.600000 1.833333 0.0 1.363636 NaN NaN
612_000008.wav 100% (65/65) yes and obviously try and get round the mark o... 0.600000 1.000000 0.882353 0.0 NaN NaN
612_000010.wav 100% (20/20) its about to pop a hole 0.000000 1.200000 1.25 1.0 NaN NaN
612_000010.wav 100% (258/259) stand up of the hull 1.000000 0.000000 1.0 0.8 NaN NaN
612_000010.wav 100% (50/50) sven to proper hull 0.833333 0.800000 0.0 0.6 NaN NaN
612_000010.wav 100% (80/80) just about the proper hull 0.833333 0.800000 0.75 0.0 NaN NaN
612_000011.wav 100% (80/80) we better go full speed here 0.000000 0.166667 0.857143 NaN NaN NaN
612_000011.wav 100% (65/65) we gona go full speed here 0.166667 0.000000 0.857143 NaN NaN NaN

Solution 2:[2]

I assumed you still wanted to know the transcriber of both the ground truth and the hypothesis. In my approach, I disregarded the new column, as I don't really see what its purpose is.

First, we join the dataframe with itself, merging only on FileName (thereby creating a cross join within each file; 113 rows for this example).

import pandas as pd

# self-merge on FileName: within each file, every transcription (ground truth)
# is paired with every transcription (hypothesis)
df_combo = pd.merge(df,
                    df,
                    on='FileName',
                    suffixes=('_gt', '_hyp'))

df_combo looks like this:

FileName Transcriber_gt Transcription_gt Transcriber_hyp Transcription_hyp
0 612_000002.wav 100% (80/80) Are we starting off? 100% (80/80) Are we starting off?
1 612_000002.wav 100% (80/80) Are we starting off? 100% (50/50) shall we starting on
2 612_000002.wav 100% (80/80) Are we starting off? 100% (2/2) fast mode
3 612_000002.wav 100% (80/80) Are we starting off? 100% (258/259) Go and start it up
4 612_000002.wav 100% (80/80) Are we starting off? 100% (20/20) Are we starting off?
... ... ... ... ... ...
108 612_000003.wav Transducer_S having gone here on will and wolf is 100% (32/32) And as the 1st group reached the bottom of the...
109 612_000003.wav Transducer_S having gone here on will and wolf is 100% (80/80) Happy to go off here, woah woah woah
110 612_000003.wav Transducer_S having gone here on will and wolf is 100% (10/10) Go boom we'll just
111 612_000003.wav Transducer_S having gone here on will and wolf is Transducer_L anything off yeah i'm willing we'll just
112 612_000003.wav Transducer_S having gone here on will and wolf is Transducer_S having gone here on will and wolf is

Next, we use .apply() to perform the calculation on each row and assign the result to a new column wer.

df_combo['wer'] = df_combo.apply(lambda x: wer(x.Transcription_gt, x.Transcription_hyp), 
                                 axis=1)
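
If you want the result in matrix form per file, as described in the question, the long df_combo can also be pivoted. A sketch, assuming every Transcriber value appears only once per FileName:

# pairwise WER matrix for one file: rows are hypotheses, columns are ground truths
wer_matrix = (df_combo[df_combo['FileName'] == '612_000002.wav']
              .pivot(index='Transcriber_hyp', columns='Transcriber_gt', values='wer'))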

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
[1] Solution 1
[2] Solution 2: DocZerø