How to create a dictionary to look for dropped zeros?
I ran into a problem where I have a dataframe of ID numbers, and some of these IDs have lost their leading zeros. The dataframe is df:
ID
345
345
543
000922
000345
000345
000543
What I'm trying to do is find a general way to check whether leading zeros have been dropped. My real dataset has millions of rows, so I want to use a pandas method: if an ID matches the non-zero part of a zero-padded ID, put those rows into another dataframe so I can examine them further.
I do that like this:
new_df = df.loc[df['ID'].isin(df['ID'])]
My reasoning is that I want to filter the dataset to find whether any of the short IDs are contained in the full IDs.
Now I have
ID
345
345
543
000345
000345
000543
I can then use .unique() to get each distinct value:
ID
345
543
000345
000543
This is fine for a small dataset, but with millions of rows I am wondering how to make this check easier.
I am trying to find a way to create a dictionary where the keys are the short IDs (for example, the 3-digit form) and the values are the full IDs, or vice versa. Any tips on that would be appreciated. Ideas for checking for dropped zeros other than the dictionary approach would be helpful too.
Note: It is not always 3 digits. It could be 4567, for example, where the real value would be 004567.
Solution 1:[1]
Use:
print (df)
ID
0 345
1 345
2 540
3 2922
4 002922
5 000344
6 000345
7 000543
# select the IDs that start with 0 as a Series
d = df.loc[df['ID'].str.startswith('0'), 'ID']
# set the index to the same IDs with the leading zeros stripped
d.index = d.str.lstrip('0')
print (d)
ID
2922 002922
344 000344
345 000345
543 000543
Name: ID, dtype: object
# dict of all candidate values (stripped ID -> full ID)
print (d.to_dict())
{'2922': '002922', '344': '000344', '345': '000345', '543': '000543'}
# keep only the stripped IDs that also exist in the original ID column, then create the dict
d = d[d.index.isin(df['ID'])].to_dict()
print (d)
{'2922': '002922', '345': '000345'}
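Applied to the question's own data, the same steps might look like the sketch below. It assumes the IDs are stored as strings; the `drop_duplicates()` call and the names `padded` and `dropped_zero_map` are additions for illustration, not part of the answer above. Deduplicating first keeps the stripped index small, which should matter with millions of rows.

```python
import pandas as pd

# The question's sample data, kept as strings so leading zeros survive
df = pd.DataFrame({"ID": ["345", "345", "543", "000922", "000345", "000345", "000543"]})

# Unique IDs that start with a zero
padded = df.loc[df["ID"].str.startswith("0"), "ID"].drop_duplicates()

# Index each full ID by its zero-stripped form
padded.index = padded.str.lstrip("0")

# Keep only the stripped forms that also occur verbatim in the column
dropped_zero_map = padded[padded.index.isin(df["ID"])].to_dict()
print(dropped_zero_map)  # {'345': '000345', '543': '000543'}
```

Here 000922 is left out because no bare 922 appears in the column. If two different padded forms strip to the same value (say 00345 and 000345), the dictionary keeps only one of them, so a list-valued mapping would be needed in that case.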
Solution 2:[2]
One option is to strip leading "0"s:
out = df['ID'].str.lstrip('0').unique()
Output:
array(['345', '543', '922'], dtype=object)
or prepend "0"s:
out = df['ID'].str.zfill(df['ID'].str.len().max()).unique()
Output:
array(['000345', '000543', '000922'], dtype=object)
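To send the suspect rows to a separate dataframe, as the question asks, the padding idea can be turned into a mask. This is only a sketch: it assumes the longest ID in the column has the full, correct width, and the column names `normalized` and `was_short` and the variable `suspects` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["345", "345", "543", "000922", "000345", "000345", "000543"]})

# Assumption: the longest ID in the column has the correct full width
width = int(df["ID"].str.len().max())

# Left-pad every ID with zeros to that width
df["normalized"] = df["ID"].str.zfill(width)

# Rows whose ID changed when padded were shorter than the full width
df["was_short"] = df["ID"] != df["normalized"]
suspects = df[df["was_short"]]
print(suspects["ID"].tolist())  # ['345', '345', '543']
```

If genuine IDs can have different lengths, the width assumption does not hold and this mask would flag too many rows.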
Solution 3:[3]
Create a dictionary for finding potentially affected records.
import pandas as pd
# Creates a dummy dataframe.
df = pd.DataFrame(['00456', '0000456', '567', '00567'], columns=['ID'])
df['stripped'] = pd.to_numeric(df['ID'])
df['affected_id'] = df.ID.str.len() == df.stripped.astype(str).str.len()
df
ID stripped affected_id
0 00456 456 False
1 0000456 456 False
2 567 567 True
3 00567 567 False
# Creates a dictionary of potentially affected records.
d = dict()
for i in df[df.affected_id].stripped.unique():
    d[i] = df[(df.stripped == i) & (df.ID != str(i))].ID.unique().tolist()
d
{567: ['00567']}
If you want to include the stripped records into the list, then:
for i in df[df.affected_id].stripped.unique():
    d[i] = df[df.stripped == i].ID.unique().tolist()
d
{567: ['567', '00567']}
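The loop above filters the whole dataframe once per affected ID, which may get slow with millions of rows. A possible alternative is a single `groupby`, sketched below. It swaps `pd.to_numeric` for `str.lstrip('0')`, so the dictionary keys are strings rather than integers; the names `stripped`, `variants` and `has_bare` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["00456", "0000456", "567", "00567"]})

# Zero-stripped form of every ID, used as the grouping key
stripped = df["ID"].str.lstrip("0").rename("stripped")

# For each stripped form, collect the distinct spellings seen in the data
variants = df.groupby(stripped)["ID"].unique()

# Keep only groups whose bare (zero-free) form itself occurs in the column
has_bare = variants.index.isin(df["ID"])
d = {key: values.tolist() for key, values in variants[has_bare].items()}
print(d)  # {'567': ['567', '00567']}
```

As in the loop version, 456 is skipped because the bare form never appears in the data. An ID made only of zeros would strip to an empty string, so such values may need separate handling.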
Solution 4:[4]
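You can convert the column to int and back to str, then compare the result with the original; the rows that differ are the ones with leading zeros:
m = df['ID'].ne(df['ID'].astype(int).astype(str))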
print(m)
0 False
1 False
2 False
3 True
4 True
5 True
Name: ID, dtype: bool
print(df[m])
ID
3 000345
4 000345
5 000543
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | Flavia Giammarino |
Solution 4 | Ynjxsjmh |