How to create a dictionary to look for dropped zeros?

I have a dataframe, df, of ID numbers. Some of these account numbers have lost their leading zeros.

I'm trying to create a general way to check whether leading zeros have been dropped. My real dataset has millions of rows, so I want to use a pandas method: if an ID matches the trailing part of an ID that still has its zeros, put it into another dataframe so I can examine it further.

I do that like this:

new_df = df.loc[df['ID'].isin(df['ID'])]

My reasoning for this is that I want to filter that dataset to find if any of the IDs are inside the full IDs.

Now I have

I can use a .unique() to get a series of each unique combo.

This is fine for a small dataset. But with millions of rows, I am wondering how I can make this check easier.

I am trying to find a way to create a dictionary where the keys are the 3-digit IDs and the values are the corresponding full IDs, or vice versa. Any tips on that would be appreciated. Tips on a different way of checking for dropped zeros, other than the dictionary approach, would be helpful too.

Note: It is not always 3 digits. Could be 4567 for example, where the real value would be 004567.

python pandas

Solution 1:^[1]

Use:

print (df)
       ID
0     345
1     345
2     540
3    2922
4  002922
5  000344
6  000345
7  000543

#select the IDs that start with '0' as a Series
d = df.loc[df['ID'].str.startswith('0'), 'ID']
#use the IDs with leading zeros stripped as the index of that Series
d.index = d.str.lstrip('0')
print (d)
ID
2922    002922
344     000344
345     000345
543     000543
Name: ID, dtype: object

#dict of all stripped -> full ID pairs
print (d.to_dict())
{'2922': '002922', '344': '000344', '345': '000345', '543': '000543'}

#keep only the stripped forms that actually occur in the original ID column, then build the dict
d = d[d.index.isin(df['ID'])].to_dict()
print (d)
{'2922': '002922', '345': '000345', '543': '000543'}
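
Once such a dictionary exists, it can also be used to repair the truncated IDs. The following is only a sketch under assumptions: the sample data mirrors the frame printed above, and the names `lookup` and `ID_fixed` are made up for illustration. If two different full IDs strip to the same key (for example `00456` and `0000456`), the dictionary keeps only the last one, so those cases would need a manual check.

```python
import pandas as pd

# Sample data mirroring the frame printed in this answer.
df = pd.DataFrame({"ID": ["345", "345", "540", "2922",
                          "002922", "000344", "000345", "000543"]})

# Build stripped -> full ID pairs from the IDs that still carry leading zeros.
full = df.loc[df["ID"].str.startswith("0"), "ID"].drop_duplicates()
lookup = dict(zip(full.str.lstrip("0"), full))

# Map truncated IDs to their full form; IDs without a match stay unchanged.
df["ID_fixed"] = df["ID"].map(lookup).fillna(df["ID"])
print(df)
```

Both `map` and `fillna` are vectorized, so this should scale to millions of rows much better than a Python loop.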

Solution 2:^[2]

One option is to strip leading "0"s:

out = df['ID'].str.lstrip('0').unique()

Output:

array(['345', '543', '922'], dtype=object)

or prepend "0"s:

out = df['ID'].str.zfill(df['ID'].str.len().max()).unique()

Output:

array(['000345', '000543', '000922'], dtype=object)
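
If the full width of an ID is known in advance, padding can also be used to flag the rows that were truncated. This is a minimal sketch that assumes a fixed width of 6 (suggested by the `004567` example in the question, but not confirmed); the sample data and the column names `dropped_zeros` and `ID_padded` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"ID": ["345", "000345", "543", "000543", "922"]})

WIDTH = 6  # assumed fixed ID width; adjust to the real data

# Pad every ID on the left with zeros up to WIDTH characters.
padded = df["ID"].str.zfill(WIDTH)

# A row lost zeros if padding changed its value.
df["dropped_zeros"] = df["ID"].ne(padded)
df["ID_padded"] = padded
print(df)
```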

Solution 3:^[3]

Create a dictionary for finding potentially affected records.

import pandas as pd

# Creates a dummy dataframe.
df = pd.DataFrame(['00456', '0000456', '567', '00567'], columns=['ID'])
df['stripped'] = pd.to_numeric(df['ID'])
df['affected_id'] = df.ID.str.len() == df.stripped.astype(str).str.len()
df

        ID  stripped  affected_id
0    00456       456        False
1  0000456       456        False
2      567       567         True
3    00567       567        False

# Creates a dictionary of potentially affected records.
d = dict()
for i in df[df.affected_id].stripped.unique():
    d[i] = df[(df.stripped == i) & (df.ID != str(i))].ID.unique().tolist()
d
{567: ['00567']}

If you want to include the stripped records into the list, then:

for i in df[df.affected_id].stripped.unique():
    d[i] = df[df.stripped == i].ID.unique().tolist()
d
{567: ['567', '00567']}
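
With millions of rows, the Python loop above may become slow, because it filters the whole frame once per affected value. A possible alternative is a single `groupby` on the zero-stripped form. This is a sketch, not the answer's original method: it strips zeros with `str.lstrip` instead of `pd.to_numeric`, so the dictionary keys are strings rather than integers, and the names `key`, `variants` and `affected` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(["00456", "0000456", "567", "00567"], columns=["ID"])

# Group the distinct IDs by their zero-stripped form.
key = df["ID"].str.lstrip("0")
variants = df["ID"].groupby(key).unique()

# Keep only the keys whose bare (stripped) form actually occurs in the data.
affected = variants[variants.index.isin(df["ID"])]

d = {k: v.tolist() for k, v in affected.items()}
print(d)
```

For this sample the result contains only the key `'567'`, since `456` never appears without its zeros.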

Solution 4:^[4]

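You can convert the column to int and back to str, then compare the result with the original strings; rows that differ have leading zeros.

s = df['ID'].astype(str)
m = s.ne(s.astype(int).astype(str))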

print(m)

0    False
1    False
2    False
3     True
4     True
5     True
Name: ID, dtype: bool

print(df[m])

       ID
3  000345
4  000345
5  000543
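
Converting to int raises an error as soon as one ID contains a non-digit character or is empty. If that can happen in the real data (an assumption, since the question only shows numeric IDs), a purely string-based mask avoids the conversion. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical data with a non-numeric and an empty ID.
df = pd.DataFrame({"ID": ["345", "000345", "A0012", "000543", ""]})

ids = df["ID"].astype(str)

# True where the ID starts with "0" and has at least one more character,
# so a lone "0" is not reported as having a leading zero.
m = ids.str.startswith("0") & ids.str.len().gt(1)
print(df[m])
```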

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
Solution 3	Flavia Giammarino
Solution 4	Ynjxsjmh
