'Compare rows of pandas dataframe
Continuing my earlier question. Now I have one dataframe, where I added a new column 'New' with values 1 through 150 for each new file, that can be used as an index if required. I figured it would be easier to make the loop for every file separately. And now I am not sure how to proceed. Let me provide the code, the explanation of the task at hand, and some thoughts how to move on.
| New | FileName | Transcriber | Transcription |
|---|---|---|---|
| 1 | 612_000002.wav | 100% (80/80) | Are we starting off? |
| 612_000002.wav | 100% (50/50) | shall we starting on | |
| 612_000002.wav | 100% (2/2) | fast mode | |
| 612_000002.wav | 100% (258/259) | Go and start it up | |
| 612_000002.wav | 100% (20/20) | Are we starting off? | |
| 612_000002.wav | Quartznet | there was not inl | |
| 612_000002.wav | Transducer_M | don't start again | |
| 612_000002.wav | Transducer_L | do we start again | |
| 2 | 612_000003.wav | 100% (258/259) | here we go, hey well woah woah woah |
| 612_000003.wav | 100% (23/23) | evening gulf air | |
| 612_000003.wav | 100% (32/32) | And as the 1st group reached the bottom of the... | |
| 612_000003.wav | 100% (80/80) | Happy to go off here, woah woah woah | |
| 612_000003.wav | 100% (10/10) | Go boom we'll just | |
| 612_000003.wav | Transducer_L | anything off yeah i'm willing we'll just | |
| 612_000003.wav | Transducer_S | having gone here on will and wolf is |
generator = iter(range(151))
for i in range(0, len(df_all)):
if i == 0:
df_all.iloc[[i],[3]] = next(generator)
if df_all.iloc[i]['FileName'] == df_all.iloc[i-1]['FileName']:
pass
if df_all.iloc[i]['FileName'] != df_all.iloc[i-1]['FileName']:
df_all.iloc[[i],[3]] = next(generator)
Now, what is needed to do is create some sort of a matrix (few columns), where -- for every wav file -- for every transcription of a wav file ...we calculate - wer(ground_truth, hyposesis) (from jiwer library)
First column (to be created), for instance, we take the first transcription as the ground truth and we compare it against itself as well as all other transcriptions of the same wav file. Then we'd move to the next wav file and start again. Second column would take the next (second) transcription as a ground truth and compare all the other transcriptions against it. Note, the number of transcriptions is not fixed for every file.
This is a gist. Is it possible at all?
Thank you <3
Solution 1:[1]
If anyone is interested, here is my solution. Since I had a few transcriptions for the same filename, I decided to group them together in a small dataframe with groupby by 'FileName' and iterate through every transcript within every df.
wer_0 = []
for idx, df_select in df_mturk.groupby(level=['FileName']):
for i in range(len(df_select)):
if i == 0:
ground_truth = df_select.iloc[i][1]
wer_0.append(wer(ground_truth, df_select.iloc[i][1]))
if i > 0:
wer_0.append(wer(ground_truth, df_select.iloc[i][1]))
In this code the ground truth was the first row. Then I twisted the code a little to calculate wer with other transcripts as ground truth.
wer_2 = []
for idx, df_select in df_mturk.groupby(level=['FileName']):
if len(df_select) > 2:
for i in range(len(df_select)):
if i == 0:
ground_truth = df_select.iloc[i+2][1]
wer_2.append(wer(ground_truth, df_select.iloc[i][1]))
if i > 0:
wer_2.append(wer(ground_truth, df_select.iloc[i][1]))
else:
for i in range(len(df_select)):
wer_2.append('NaN')
At the end I obtained the dataframe by adding lists as columns to original one, df_mturk.
| FileName | Transcriber | Transcription | WER_0 | WER_1 | WER_2 | WER_3 | WER_4 | WER_5 |
|---|---|---|---|---|---|---|---|---|
| 612_000002.wav | 100% (80/80) | are we starting off | 0.000000 | 0.500000 | 2.0 | 1.0 | 0.0 | NaN |
| 612_000002.wav | 100% (50/50) | shall we starting on | 0.500000 | 0.000000 | 2.0 | 1.0 | 0.5 | NaN |
| 612_000002.wav | 100% (2/2) | fast mode | 1.000000 | 1.000000 | 0.0 | 1.0 | 1.0 | NaN |
| 612_000002.wav | 100% (258/259) | go and start it up | 1.250000 | 1.250000 | 2.5 | 0.0 | 1.25 | NaN |
| 612_000002.wav | 100% (20/20) | are we starting off | 0.000000 | 0.500000 | 2.0 | 1.0 | 0.0 | NaN |
| 612_000003.wav | 100% (258/259) | here we go hey well woah woah woah | 0.000000 | 2.666667 | 1.0 | 0.5 | 1.5 | NaN |
| 612_000003.wav | 100% (23/23) | evening gulf air | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | NaN |
| 612_000003.wav | 100% (32/32) | and as the 1st group reached the bottom of the... | 4.250000 | 11.333333 | 0.0 | 4.125 | 8.5 | NaN |
| 612_000003.wav | 100% (80/80) | happy to go off here woah woah woah | 0.500000 | 2.666667 | 0.970588 | 0.0 | 1.75 | NaN |
| 612_000003.wav | 100% (10/10) | go boom well just | 0.750000 | 1.333333 | 1.0 | 0.875 | 0.0 | NaN |
| 612_000008.wav | 100% (10/10) | yes obviously trying to get around that with b... | 0.000000 | 1.000000 | 0.941176 | 0.545455 | NaN | NaN |
| 612_000008.wav | 100% (258/259) | its there you will see tank on random and make... | 0.900000 | 0.000000 | 0.970588 | 0.818182 | NaN | NaN |
| 612_000008.wav | 100% (32/32) | and as the 1st group reached the bottom of the... | 1.600000 | 1.833333 | 0.0 | 1.363636 | NaN | NaN |
| 612_000008.wav | 100% (65/65) | yes and obviously try and get round the mark o... | 0.600000 | 1.000000 | 0.882353 | 0.0 | NaN | NaN |
| 612_000010.wav | 100% (20/20) | its about to pop a hole | 0.000000 | 1.200000 | 1.25 | 1.0 | NaN | NaN |
| 612_000010.wav | 100% (258/259) | stand up of the hull | 1.000000 | 0.000000 | 1.0 | 0.8 | NaN | NaN |
| 612_000010.wav | 100% (50/50) | sven to proper hull | 0.833333 | 0.800000 | 0.0 | 0.6 | NaN | NaN |
| 612_000010.wav | 100% (80/80) | just about the proper hull | 0.833333 | 0.800000 | 0.75 | 0.0 | NaN | NaN |
| 612_000011.wav | 100% (80/80) | we better go full speed here | 0.000000 | 0.166667 | 0.857143 | NaN | NaN | NaN |
| 612_000011.wav | 100% (65/65) | we gona go full speed here | 0.166667 | 0.000000 | 0.857143 | NaN | NaN | NaN |
Solution 2:[2]
I assumed you still wanted to know the transcriber of both ground truth and hypothesis. In my approach, I disregarded the new column, as I don't really see what the purpose is.
First, we join the dataframe with itself, merging only on filename (thereby creating a cross-join of 113 rows, for this example).
df_combo = pd.merge(df,
df,
on='FileName',
suffixes=('_gt', '_hyp'))
df_combo looks like this:
| FileName | Transcriber_gt | Transcription_gt | Transcriber_hyp | Transcription_hyp | |
|---|---|---|---|---|---|
| 0 | 612_000002.wav | 100% (80/80) | Are we starting off? | 100% (80/80) | Are we starting off? |
| 1 | 612_000002.wav | 100% (80/80) | Are we starting off? | 100% (50/50) | shall we starting on |
| 2 | 612_000002.wav | 100% (80/80) | Are we starting off? | 100% (2/2) | fast mode |
| 3 | 612_000002.wav | 100% (80/80) | Are we starting off? | 100% (258/259) | Go and start it up |
| 4 | 612_000002.wav | 100% (80/80) | Are we starting off? | 100% (20/20) | Are we starting off? |
| ... | ... | ... | ... | ... | ... |
| 108 | 612_000003.wav | Transducer_S | having gone here on will and wolf is | 100% (32/32) | And as the 1st group reached the bottom of the... |
| 109 | 612_000003.wav | Transducer_S | having gone here on will and wolf is | 100% (80/80) | Happy to go off here, woah woah woah |
| 110 | 612_000003.wav | Transducer_S | having gone here on will and wolf is | 100% (10/10) | Go boom we'll just |
| 111 | 612_000003.wav | Transducer_S | having gone here on will and wolf is | Transducer_L | anything off yeah i'm willing we'll just |
| 112 | 612_000003.wav | Transducer_S | having gone here on will and wolf is | Transducer_S | having gone here on will and wolf is |
Next, we use .apply() to perform the calculation on each row and assign it to a new column wer.
df_combo['wer'] = df_combo.apply(lambda x: wer(x.Transcription_gt, x.Transcription_hyp),
axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | DocZerø |
