Compare strings and assign values
df1
| Parc(east) | fontaine(dlf) | . | Mont Bazrougais 05 | Monrochet 2 |
|---|---|---|---|---|
| 24 | 35 | . | . | 85 |
| 29 | 50 | . | . | 50 |
| . | . | . | . | . |
| . | . | . | . | . |
| 10 | 20 | . | . | 35 |
df2
| names | id |
|---|---|
| Parc(east) | LED |
| fontanus (dlf) | MON |
| Monrochet 1 | M1 |
| Monrochet 3 | M2 |
| Monrochet 2 | LVER |
| Mont Bazrougais 05 | BR |
| Mont de est | MTE |
I want to write code that iterates over the names in df2, compares the first n characters of each df2 name with each column name of df1, and, where there is a match, replaces the matching df1 column name with the corresponding df2 id. The final DataFrame should look like df3.
df3
| LED | MON | . | BR | LVER |
|---|---|---|---|---|
| Parc(east) | fontaine(dlf) | . | Mont Bazrougais 05 | Monrochet 2 |
| 24 | 35 | . | 00 | 85 |
| 29 | 50 | . | 09 | 50 |
| . | . | . | 88 | . |
| . | . | . | 100 | . |
| 10 | 20 | . | 200 | 35 |
Solution 1:[1]
You can use:
```python
df3 = df1.rename(columns=df2.set_index('names')['id'])
```
output:
```
   LED  MAN  LVER
0   24   35    85
1   29   50    50
```
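As a minimal, self-contained sketch of the `rename` idea above (the toy data here stands in for the original files and is not the asker's actual dataset): a Series indexed by the old names maps each column label to its id.

```python
import pandas as pd

# Illustrative stand-ins for df1 and df2
df1 = pd.DataFrame({'Parc(east)': [24, 29], 'Monrochet 2': [85, 50]})
df2 = pd.DataFrame({'names': ['Parc(east)', 'Monrochet 2'],
                    'id': ['LED', 'LVER']})

# A Series indexed by old names acts as a dict-like mapper for rename
mapping = df2.set_index('names')['id']
df3 = df1.rename(columns=mapping)
print(df3.columns.tolist())  # ['LED', 'LVER']
```

Columns with no entry in the mapping are simply left unchanged, which is `rename`'s default behavior.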
If you really want the weird "header-as-data" format:
```python
df3 = (pd
   .concat([df1.columns.to_frame().T, df1])
   .rename(columns=df2.set_index('names')['id'])
)
```
output:
```
        LED         MAN       LVER
0     Leeds  Manchester  Liverpool
0        24          35         85
1        29          50         50
```
To use the first n characters only:
```python
n = 3
s = (df2.assign(names=df2['names'].str[:n])
        .drop_duplicates('names')
        .set_index('names')['id'])

df3 = (pd
   .concat([df1.columns.to_frame().T, df1])
   .rename(columns=lambda x: s.loc[x[:n]])
)
```
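A small self-contained sketch of the prefix-matching idea, again on toy stand-in data. One assumption not in the answer above: `s.get(prefix, original)` is used instead of `s.loc[prefix]`, so columns with no match (such as the `.` column in df1) keep their original name rather than raising a `KeyError`.

```python
import pandas as pd

# 'fontaine(dlf)' and 'fontanus (dlf)' differ, but share the prefix 'font'
df1 = pd.DataFrame({'fontaine(dlf)': [35, 50], 'Monrochet 2': [85, 50]})
df2 = pd.DataFrame({'names': ['fontanus (dlf)', 'Monrochet 2'],
                    'id': ['MON', 'LVER']})

n = 4
# Index the mapping by the first n characters of each df2 name
s = (df2.assign(names=df2['names'].str[:n])
        .drop_duplicates('names')
        .set_index('names')['id'])

# s.get(prefix, default) keeps the original label when no prefix matches
df3 = df1.rename(columns=lambda c: s.get(c[:n], c))
print(df3.columns.tolist())  # ['MON', 'LVER']
```

Note that `drop_duplicates('names')` keeps only the first id for any ambiguous prefix (e.g. `Monrochet 1` vs `Monrochet 3` with a small n), so choose n large enough to keep the prefixes distinct.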
Solution 2:[2]
I suspect what's happening is that you have duplicate index values before the concat(). (They may be duplicated only between the train and test sets, not necessarily within each set separately.) That can throw off concat(), since index values are assumed to be unique, and pandas may compensate by setting some values to NaN. The calls to reset_index() give each of them, separately, index values starting from 0.
To fix this: Set ignore_index=True in pd.concat(). From the docs:
> **ignore_index** : bool, default False
> If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
If that doesn't work, check whether test and train have NaNs in their indexes before concatenation and after reset_index(). They shouldn't, but check; if they do, those will carry over into the concat.
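A minimal sketch of the failure mode and the fix (the frames here are illustrative, not the asker's data): two frames whose indexes both start at 0 produce duplicate labels under a plain concat, while `ignore_index=True` relabels the result 0..n-1 so a later `axis=1` concat aligns cleanly.

```python
import pandas as pd

# Two frames with overlapping index values, as after separate reset_index() calls
train = pd.DataFrame({'x': [1, 2]})                  # index 0, 1
test = pd.DataFrame({'x': [3, 4]})                   # index 0, 1 again
labels = pd.DataFrame({'y': ['a', 'b', 'c', 'd']})   # index 0..3

d_bad = pd.concat([train, test])
print(d_bad.index.tolist())   # [0, 1, 0, 1] -- duplicate labels

d_good = pd.concat([train, test], ignore_index=True)
print(d_good.index.tolist())  # [0, 1, 2, 3]

# With unique labels, the column-wise concat aligns row for row, no NaNs
dataset = pd.concat([labels, d_good], axis=1)
print(dataset.shape)          # (4, 2)
```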
Solution 3:[3]
I just did the concats in a different order and it worked. The NaNs were the result of not merging the labels correctly: instead of creating one single column with labels, I had created two half-empty ones, one with train_labels and one with test_labels.
```python
import pandas as pd
from sklearn.utils import shuffle

# Remove the first column from the training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
data = data.drop(columns=data.columns[0])
print(data.shape)

# Remove the first column from the testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
data2 = data2.drop(columns=data2.columns[0])
print(data2.shape)

# Read the training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
print(data_labels.shape)

# Read the testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
print(data2_labels.shape)

# Concatenate data without labels; ignore_index avoids duplicate index values
d = pd.concat([data, data2], ignore_index=True)

# Concatenate labels (DataFrame.append was removed in pandas 2.0)
l = pd.concat([data_labels, data2_labels], ignore_index=True)

# Recreate the original dataset
print(d.shape, l.shape)
dataset = pd.concat([l, d], axis=1)
dataset = shuffle(dataset)
dataset
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Sarah Messer |
| Solution 3 | KateB |
