'Parse CSV data to get a count on rows with duplicate values

I have a csv file (data.csv):

data
cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Peter Parker,ou=News,ou=news,ou=employee,dc=company,dc=com
cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com
cn=Oliver Twist,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com
cn=David Tenant,ou=userMger,ou=ice cream,ou=employee,dc=company,dc=com
cn=Pepper Jack,ou=users,ou=store,ou=employee,dc=company,dc=com
cn=Eren Jaeger,ou=Store,ou=store,ou=employee,dc=company,dc=com
cn=Monty Python,ou=users,ou=store,dc=company,dc=com
cn=John Smith,ou=userMger,ou=store,ou=employee,dc=company,dc=com
cn=Anne Potts,ou=Sprinkles_HQ,ou=sprinkles,dc=company,dc=com
cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com
cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com
cn=Harry Potter,ou=users,ou=sprinkles,ou=employee,dc=company,dc=com

I need to parse the data to the point where I can get a count of how many rows have the same name in ou. So for example, if there is Sprinkles_HQ, Sprinkles, or sprinkles, they should count as the same. And if a row has Sprinkles_HQ and sprinkles (two of the same name), the row should still count as one (not two).

My desired output is something similar to this:

News, 4
Ice Cream, 2
Store, 4
Sprinkles, 4

The first steps I took was to read my csv file, the convert to my csv into a dataframe. I did this using Pandas:

#open file
file = open(directory)

#read csv and the column I want
df = pd.read_csv(file, usecols=['data'])
#make into a dataframe
rowData = pd.DataFrame(df)

Then in order to make parsing my data easier for me, I separated each row into comma separated values. Then converted those values into a list of list (each row is a list). Then remove any None values. Then I need to move all the data that starts with 'OU=' into its own list, and if any of the data has 'user' or 'userMger' or 'employee', I will remove those values from the list. This is my code as of now:

#splits the dataframe into comma separate values
lines =rowData['data'].str.split(",", expand=True)

#makes dataframe into a list of lists
a = lines.values.tolist()

#make my list of lists into a single list
employeeList = []
for i in range(len(a)):
    for j in range(len(a[0])):
        #there are some None values once converted to a list
        if a[i][j] != None: 
           employeeList.append(a[i][j])

#list for storing only OUs
ouList = []

#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
    if employeeList[i].startswith('OU='):
        ouList.append(employeeList[i])

#need to iterate in reverse as I am removing items from the list
#here I remove the other items
for i in reversed(range(len(ouList))):
     if ouList[i].endswith('users') or ouList[i].endswith('userMger') or ouList[i].endswith('employee'):
        ouList.remove(ouList[i])
        
#my list now only contains specific OUs        
print(ouList)

I believe I'm on the right track, my code is not removing any duplicates yet in each list within the list, such as Sprinkles_HQ, Sprinkles, or sprinkles. Before I make my employeelist list, I need to find a way to remove duplicates, and append them to a new list. This will make it easier for me to count.

I've researched how to remove duplicates in a list of lists. I attempted using something along the lines as:

new_list = []
for elem in a:
    if a not in new_list:
        new_list.append(elem)

But this doesn't take into consideration the words that start the same. I attempted using startswith and .lower() as there are upper and lower cases, but does not work for me yet:

new_list=[]
for i in range(len(a)):
    for j in range(len(a[0])):
        if a[i][j].lower().startswith(a[i][j].lower()) not in new_list:
           new_list.append(a[i][j])

Any suggestions, will be greatly appreciated.

Solution 1:^[1]

The solution I came up with is in parts. My first issue was the casing, I need everything to be in lowercase. So after I appended items to employeeList, I added this code:

for i in range(len(employeeList)):
    for j in range(len(employeeList[i])):
        employeeList[i][j] = employeeList[i][j].lower()

This makes everything in my employeeList lowercase.

Now once I fixed that, I then needed to change the output of my ouList from a single list, and keep it as a list of lists. So all rows with only ou= are going to be in ouList.

#list for storing only OUs
ouList = []

#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
    ouList.append([])
    for j in range(len(employeeList[i])):
        if employeeList[i][j].startswith('ou='):
           ouList[i].append(employeeList[i][j])

Then I needed to remove any items that end with users, userMger, or employee. I reverse iterated and used .endswith() to achieve this without any errors.

#need to iterate in reverse as I am removing items from the list
for i in reversed(range(len(ouList))):
     for j in reversed(range(len(ouList[i]))):
        if (ouList[i][j].endswith('users')
        or ouList[i][j].endswith('usermger')
        or ouList[i][j].endswith('employee')):
           ouList[i].remove(ouList[i][j])

Then to strip ou= or unnecessary strings, I used re (aka regular expressions or regex). Then I appended these new values to another list called ouListStrip

#stripping ou= and other strings
ouListStrip = []
for i in range(len(ouList)):
     ouListStrip.append([])
     for j in range(len(ouList[i])):
       ou = re.sub("ou=|_hq", "", ouList[i][j])
       ouListStrip[i].append(ou)

This list outputs this:

[['news'], ['news', 'news'], ['news', 'news'], ['news'], ['ice cream', 'ice cream'], ['ice cream'], ['store'], ['store', 'store'], ['store'], ['store'], ['sprinkles', 'sprinkles'], ['sprinkles', 'sprinkles'], ['sprinkles'], ['sprinkles']]

Now that I have only a list of lists, I can now work on removing duplicates in the sublists. I achieve this through using not in and appending them still as a list of lists.

no_repeats = []
for i in range(len(ouListStrip)):
     no_repeats.append([])
     for j in range(len(ouListStrip[i])):
       if ouListStrip[i][j] not in no_repeats[i]:
          no_repeats[i].append(ouListStrip[i][j])

no_repeats outputs this:

[['news'], ['news'], ['news'], ['news'], ['ice cream'], ['ice cream'], ['store'], ['store'], ['store'], ['store'], ['sprinkles'], ['sprinkles'], ['sprinkles'], ['sprinkles']]

Finally, I combine my list of list items into one single list:

allOUs = []
for i in range(len(no_repeats)):
    for j in range(len(no_repeats[i])):
        allOUs.append(no_repeats[i][j])

allOUs outputs:

['news', 'news', 'news', 'news', 'ice cream', 'ice cream', 'store', 'store', 'store', 'store', 'sprinkles', 'sprinkles', 'sprinkles', 'sprinkles']

Then I make this list into a dictionary and count the items within it using .count():

dict_of_counts = {item:allOUs.count(item) for item in allOUs}

Outputs:

{'news': 4, 'ice cream': 2, 'store': 4, 'sprinkles': 4}

To make it visually similar to what I want:

for key, value in dict_of_counts.items():
    print(key,',',value)

Outputs:

news , 4
ice cream , 2
store , 4
sprinkles , 4

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'Parse CSV data to get a count on rows with duplicate values

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]