Compare strings and assign values
df1
| Parc(east) | fontaine(dlf) | . | Mont Bazrougais 05 | Monrochet 2 |
|---|---|---|---|---|
| 24 | 35 | . | . | 85 |
| 29 | 50 | . | . | 50 |
| . | . | . | . | . |
| . | . | . | . | . |
| 10 | 20 | . | . | 35 |
df2
| names | id |
|---|---|
| Parc(east) | LED |
| fontanus (dlf) | MON |
| Monrochet 1 | M1 |
| Monrochet 3 | M2 |
| Monrochet 2 | LVER |
| Mont Bazrougais 05 | BR |
| Mont de est | MTE |
I want to write code that iterates over the names in df2, compares the first n characters of each df2 name with each column name of df1, and, where there is a match, replaces the matching df1 column name with the corresponding df2 id. The final DataFrame should look like df3.
df3
| LED | MON | . | BR | LVER |
|---|---|---|---|---|
| Parc(east) | fontaine(dlf) | . | Mont Bazrougais 05 | Monrochet 2 |
| 24 | 35 | . | 00 | 85 |
| 29 | 50 | . | 09 | 50 |
| . | . | . | 88 | . |
| . | . | . | 100 | . |
| 10 | 20 | . | 200 | 35 |
Solution 1:[1]
You can use:
```python
df3 = df1.rename(columns=df2.set_index('names')['id'])
```
output:
```
   LED  MAN  LVER
0   24   35    85
1   29   50    50
```
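As a minimal, self-contained sketch of the `rename` idea above (the toy data here stands in for the original files and is not the asker's actual dataset): a Series indexed by the old names maps each column label to its id.

```python
import pandas as pd

# Illustrative stand-ins for df1 and df2
df1 = pd.DataFrame({'Parc(east)': [24, 29], 'Monrochet 2': [85, 50]})
df2 = pd.DataFrame({'names': ['Parc(east)', 'Monrochet 2'],
                    'id': ['LED', 'LVER']})

# A Series indexed by old names acts as a dict-like mapper for rename
mapping = df2.set_index('names')['id']
df3 = df1.rename(columns=mapping)
print(df3.columns.tolist())  # ['LED', 'LVER']
```

Columns with no entry in the mapping are simply left unchanged, which is `rename`'s default behavior.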
If you really want the weird "header-as-data" format:
```python
df3 = (pd
   .concat([df1.columns.to_frame().T, df1])
   .rename(columns=df2.set_index('names')['id'])
)
```
output:
```
        LED         MAN       LVER
0     Leeds  Manchester  Liverpool
0        24          35         85
1        29          50         50
```
To use the first n characters only:
```python
n = 3
s = (df2.assign(names=df2['names'].str[:n])
        .drop_duplicates('names')
        .set_index('names')['id'])

df3 = (pd
   .concat([df1.columns.to_frame().T, df1])
   .rename(columns=lambda x: s.loc[x[:n]])
)
```
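A small self-contained sketch of the prefix-matching idea, again on toy stand-in data. One assumption not in the answer above: `s.get(prefix, original)` is used instead of `s.loc[prefix]`, so columns with no match (such as the `.` column in df1) keep their original name rather than raising a `KeyError`.

```python
import pandas as pd

# 'fontaine(dlf)' and 'fontanus (dlf)' differ, but share the prefix 'font'
df1 = pd.DataFrame({'fontaine(dlf)': [35, 50], 'Monrochet 2': [85, 50]})
df2 = pd.DataFrame({'names': ['fontanus (dlf)', 'Monrochet 2'],
                    'id': ['MON', 'LVER']})

n = 4
# Index the mapping by the first n characters of each df2 name
s = (df2.assign(names=df2['names'].str[:n])
        .drop_duplicates('names')
        .set_index('names')['id'])

# s.get(prefix, default) keeps the original label when no prefix matches
df3 = df1.rename(columns=lambda c: s.get(c[:n], c))
print(df3.columns.tolist())  # ['MON', 'LVER']
```

Note that `drop_duplicates('names')` keeps only the first id for any ambiguous prefix (e.g. `Monrochet 1` vs `Monrochet 3` with a small n), so choose n large enough to keep the prefixes distinct.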
Solution 2:[2]
I suspect what's happening is that you have duplicate index values before the concat(). (They may be duplicated only between the train and test sets, not necessarily within each set separately.) That can throw off concat(), since index values are assumed to be unique, and pandas may compensate by setting some values to NaN. The calls to reset_index() give each of them, separately, index values starting from 0.
To fix this: Set ignore_index=True in pd.concat(). From the docs:
> **ignore_index** : bool, default False
> If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
If that doesn't work, check whether test and train have NaNs in their indexes before concatenation and after reset_index(). They shouldn't, but check; if they do, those will carry over into the concat.
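A minimal sketch of the failure mode and the fix (the frames here are illustrative, not the asker's data): two frames whose indexes both start at 0 produce duplicate labels under a plain concat, while `ignore_index=True` relabels the result 0..n-1 so a later `axis=1` concat aligns cleanly.

```python
import pandas as pd

# Two frames with overlapping index values, as after separate reset_index() calls
train = pd.DataFrame({'x': [1, 2]})                  # index 0, 1
test = pd.DataFrame({'x': [3, 4]})                   # index 0, 1 again
labels = pd.DataFrame({'y': ['a', 'b', 'c', 'd']})   # index 0..3

d_bad = pd.concat([train, test])
print(d_bad.index.tolist())   # [0, 1, 0, 1] -- duplicate labels

d_good = pd.concat([train, test], ignore_index=True)
print(d_good.index.tolist())  # [0, 1, 2, 3]

# With unique labels, the column-wise concat aligns row for row, no NaNs
dataset = pd.concat([labels, d_good], axis=1)
print(dataset.shape)          # (4, 2)
```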
Solution 3:[3]
I just did the concats in a different order and it worked. The NaNs were the result of not merging the labels correctly: instead of creating one single column with labels, I had created two half-empty ones, one with train_labels and one with test_labels.
```python
import pandas as pd
from sklearn.utils import shuffle

# Remove the first column from the training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
data = data.drop(columns=data.columns[0])
print(data.shape)

# Remove the first column from the testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
data2 = data2.drop(columns=data2.columns[0])
print(data2.shape)

# Read the training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
print(data_labels.shape)

# Read the testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
print(data2_labels.shape)

# Concatenate data without labels; ignore_index avoids duplicate index values
d = pd.concat([data, data2], ignore_index=True)

# Concatenate labels (DataFrame.append was removed in pandas 2.0)
l = pd.concat([data_labels, data2_labels], ignore_index=True)

# Recreate the original dataset
print(d.shape, l.shape)
dataset = pd.concat([l, d], axis=1)
dataset = shuffle(dataset)
dataset
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Sarah Messer |
| Solution 3 | KateB |
