'Parse CSV data to get a count on rows with duplicate values
I have a csv file (data.csv):
data
cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Peter Parker,ou=News,ou=news,ou=employee,dc=company,dc=com
cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com
cn=Oliver Twist,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com
cn=David Tenant,ou=userMger,ou=ice cream,ou=employee,dc=company,dc=com
cn=Pepper Jack,ou=users,ou=store,ou=employee,dc=company,dc=com
cn=Eren Jaeger,ou=Store,ou=store,ou=employee,dc=company,dc=com
cn=Monty Python,ou=users,ou=store,dc=company,dc=com
cn=John Smith,ou=userMger,ou=store,ou=employee,dc=company,dc=com
cn=Anne Potts,ou=Sprinkles_HQ,ou=sprinkles,dc=company,dc=com
cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com
cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com
cn=Harry Potter,ou=users,ou=sprinkles,ou=employee,dc=company,dc=com
I need to parse the data to the point where I can get a count of how many rows have the same name in ou. So for example, if there is Sprinkles_HQ, Sprinkles, or sprinkles, they should count as the same. And if a row has Sprinkles_HQ and sprinkles (two of the same name), the row should still count as one (not two).
My desired output is something similar to this:
News, 4
Ice Cream, 2
Store, 4
Sprinkles, 4
The first steps I took was to read my csv file, the convert to my csv into a dataframe. I did this using Pandas:
#open file
file = open(directory)
#read csv and the column I want
df = pd.read_csv(file, usecols=['data'])
#make into a dataframe
rowData = pd.DataFrame(df)
Then in order to make parsing my data easier for me, I separated each row into comma separated values. Then converted those values into a list of list (each row is a list). Then remove any None values. Then I need to move all the data that starts with 'OU=' into its own list, and if any of the data has 'user' or 'userMger' or 'employee', I will remove those values from the list. This is my code as of now:
#splits the dataframe into comma separate values
lines =rowData['data'].str.split(",", expand=True)
#makes dataframe into a list of lists
a = lines.values.tolist()
#make my list of lists into a single list
employeeList = []
for i in range(len(a)):
for j in range(len(a[0])):
#there are some None values once converted to a list
if a[i][j] != None:
employeeList.append(a[i][j])
#list for storing only OUs
ouList = []
#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
if employeeList[i].startswith('OU='):
ouList.append(employeeList[i])
#need to iterate in reverse as I am removing items from the list
#here I remove the other items
for i in reversed(range(len(ouList))):
if ouList[i].endswith('users') or ouList[i].endswith('userMger') or ouList[i].endswith('employee'):
ouList.remove(ouList[i])
#my list now only contains specific OUs
print(ouList)
I believe I'm on the right track, my code is not removing any duplicates yet in each list within the list, such as Sprinkles_HQ, Sprinkles, or sprinkles. Before I make my employeelist list, I need to find a way to remove duplicates, and append them to a new list. This will make it easier for me to count.
I've researched how to remove duplicates in a list of lists. I attempted using something along the lines as:
new_list = []
for elem in a:
if a not in new_list:
new_list.append(elem)
But this doesn't take into consideration the words that start the same.
I attempted using startswith and .lower() as there are upper and lower cases, but does not work for me yet:
new_list=[]
for i in range(len(a)):
for j in range(len(a[0])):
if a[i][j].lower().startswith(a[i][j].lower()) not in new_list:
new_list.append(a[i][j])
Any suggestions, will be greatly appreciated.
Solution 1:[1]
The solution I came up with is in parts. My first issue was the casing, I need everything to be in lowercase. So after I appended items to employeeList, I added this code:
for i in range(len(employeeList)):
for j in range(len(employeeList[i])):
employeeList[i][j] = employeeList[i][j].lower()
This makes everything in my employeeList lowercase.
Now once I fixed that, I then needed to change the output of my ouList from a single list, and keep it as a list of lists. So all rows with only ou= are going to be in ouList.
#list for storing only OUs
ouList = []
#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
ouList.append([])
for j in range(len(employeeList[i])):
if employeeList[i][j].startswith('ou='):
ouList[i].append(employeeList[i][j])
Then I needed to remove any items that end with users, userMger, or employee. I reverse iterated and used .endswith() to achieve this without any errors.
#need to iterate in reverse as I am removing items from the list
for i in reversed(range(len(ouList))):
for j in reversed(range(len(ouList[i]))):
if (ouList[i][j].endswith('users')
or ouList[i][j].endswith('usermger')
or ouList[i][j].endswith('employee')):
ouList[i].remove(ouList[i][j])
Then to strip ou= or unnecessary strings, I used re (aka regular expressions or regex). Then I appended these new values to another list called ouListStrip
#stripping ou= and other strings
ouListStrip = []
for i in range(len(ouList)):
ouListStrip.append([])
for j in range(len(ouList[i])):
ou = re.sub("ou=|_hq", "", ouList[i][j])
ouListStrip[i].append(ou)
This list outputs this:
[['news'], ['news', 'news'], ['news', 'news'], ['news'], ['ice cream', 'ice cream'], ['ice cream'], ['store'], ['store', 'store'], ['store'], ['store'], ['sprinkles', 'sprinkles'], ['sprinkles', 'sprinkles'], ['sprinkles'], ['sprinkles']]
Now that I have only a list of lists, I can now work on removing duplicates in the sublists. I achieve this through using not in and appending them still as a list of lists.
no_repeats = []
for i in range(len(ouListStrip)):
no_repeats.append([])
for j in range(len(ouListStrip[i])):
if ouListStrip[i][j] not in no_repeats[i]:
no_repeats[i].append(ouListStrip[i][j])
no_repeats outputs this:
[['news'], ['news'], ['news'], ['news'], ['ice cream'], ['ice cream'], ['store'], ['store'], ['store'], ['store'], ['sprinkles'], ['sprinkles'], ['sprinkles'], ['sprinkles']]
Finally, I combine my list of list items into one single list:
allOUs = []
for i in range(len(no_repeats)):
for j in range(len(no_repeats[i])):
allOUs.append(no_repeats[i][j])
allOUs outputs:
['news', 'news', 'news', 'news', 'ice cream', 'ice cream', 'store', 'store', 'store', 'store', 'sprinkles', 'sprinkles', 'sprinkles', 'sprinkles']
Then I make this list into a dictionary and count the items within it using .count():
dict_of_counts = {item:allOUs.count(item) for item in allOUs}
Outputs:
{'news': 4, 'ice cream': 2, 'store': 4, 'sprinkles': 4}
To make it visually similar to what I want:
for key, value in dict_of_counts.items():
print(key,',',value)
Outputs:
news , 4
ice cream , 2
store , 4
sprinkles , 4
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
