Trying to remove commas and dollar signs with Pandas in Python
I am trying to remove the commas and dollar signs from the columns, but when I do, the table still prints them. Is there a different way to remove the commas and dollar signs using a pandas function? I was unable to find anything in the API docs, or maybe I was looking in the wrong place.
import pandas as pd
import pandas_datareader.data as web
players = pd.read_html('http://www.usatoday.com/sports/mlb/salaries/2013/player/p/')
df1 = pd.DataFrame(players[0])
df1.drop(df1.columns[[0,3,4, 5, 6]], axis=1, inplace=True)
df1.columns = ['Player', 'Team', 'Avg_Annual']
df1['Avg_Annual'] = df1['Avg_Annual'].replace(',', '')
print (df1.head(10))
Solution 1:[1]
Shamelessly stolen from this answer... but, that answer is only about changing one character and doesn't complete the coolness: since it takes a dictionary, you can replace any number of characters at once, as well as in any number of columns.
# if you want to operate on multiple columns, put them in a list like so:
cols = ['col1', 'col2', ..., 'colN']
# pass them to df.replace(), specifying each char and its replacement:
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)
@shivsn caught that you need to use regex=True; you already knew about replace (but also didn't show trying to use it on multiple columns or both the dollar sign and comma simultaneously).
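To see why the attempt in the question left the column unchanged, here is a minimal sketch on a made-up two-row frame (the sample values are assumptions, not the scraped salary data). Without `regex=True`, `replace` only matches whole cell values, so a cell like `'$1,000'` is never equal to `','` and nothing is replaced.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped salary column.
df = pd.DataFrame({'Avg_Annual': ['$1,000', '$25,500,000']})

# Whole-value matching: no cell is exactly ',', so nothing changes.
unchanged = df['Avg_Annual'].replace(',', '')

# Substring matching via regex: strips every '$' and ','.
cleaned = df['Avg_Annual'].replace({r'\$': '', ',': ''}, regex=True)

print(unchanged.tolist())  # ['$1,000', '$25,500,000']
print(cleaned.tolist())    # ['1000', '25500000']
```

The cleaned values are still strings; cast them with `astype(float)` or `pd.to_numeric` if you need numbers.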
This answer simply spells out, in one place, the details I found from others, for those like me (e.g. noobs to Python and pandas). Hope it's helpful.
Solution 2:[2]
@bernie's answer is spot on for your problem. Here's my take on the general problem of loading numerical data in pandas.
Often the source of the data is a report generated for direct consumption, hence the presence of extra formatting like %, thousands separators, currency symbols, etc. All of these are useful for reading but cause problems for the default parser. My solution is to typecast the column to string, replace these symbols one by one, then cast it back to the appropriate numerical format. Having a boilerplate function which retains only [0-9.] is tempting, but it causes problems where the thousands separator and decimal separator are swapped, and also in the case of scientific notation. Here's my code, which I wrap into a function and apply as needed.
df[col] = df[col].astype(str)  # cast to string
# all the string surgery goes in here; .str.replace works on substrings,
# and regex=False keeps '$' from being read as a regex anchor
df[col] = df[col].str.replace('$', '', regex=False)
df[col] = df[col].str.replace(',', '', regex=False)  # assuming ',' is the thousands separator in your locale
df[col] = df[col].str.replace('%', '', regex=False)
df[col] = df[col].astype(float)  # cast back to appropriate type
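The answer mentions wrapping these steps into a function. A minimal sketch of what that could look like follows; the name `clean_numeric`, its `symbols` parameter, and the sample frame are assumptions for illustration, not the author's actual code.

```python
import pandas as pd

def clean_numeric(series, symbols=('$', ',', '%')):
    """Hypothetical helper: strip formatting symbols, then cast to float."""
    out = series.astype(str)
    for sym in symbols:
        # regex=False treats each symbol as a literal character
        out = out.str.replace(sym, '', regex=False)
    return out.astype(float)

# Made-up sample data
df = pd.DataFrame({'salary': ['$1,250.50', '$3,000'], 'share': ['12%', '7.5%']})
for col in ['salary', 'share']:
    df[col] = clean_numeric(df[col])

print(df['salary'].tolist())  # [1250.5, 3000.0]
print(df['share'].tolist())   # [12.0, 7.5]
```

Passing the symbols as a parameter keeps the locale assumption visible: for data where `.` is the thousands separator, the caller would need different handling rather than the defaults.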
Solution 3:[3]
I used this logic
df.col = df.col.apply(lambda x:x.replace('$','').replace(',',''))
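One caveat: the lambda calls `x.replace` on every cell, so it assumes each value is a string and would fail on a missing value. A possible vectorized variant, sketched here on made-up data, uses the `.str` accessor (which passes missing values through) and `pd.to_numeric`:

```python
import pandas as pd

# Made-up sample data, including one missing value
df = pd.DataFrame({'col': ['$1,000', '$2,500', None]})

# Remove '$' and ',' in one pass using a regex character class
stripped = df['col'].str.replace(r'[$,]', '', regex=True)

# Convert to numbers; anything unparseable becomes NaN
df['col'] = pd.to_numeric(stripped, errors='coerce')

print(df)
```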
Solution 4:[4]
import datetime
import re
from dataclasses import dataclass, field
raw_data = ['ARTS 111 A', 'M', '09:00 - 12:00', 'W', '09:00 - 12:00', 'F', '02:00 - 12:00',
'COMP 111 A', 'M', '09:00 - 12:00', 'W', '09:00 - 12:00',
'COMP 200 A', 'M', '09:30 - 11:30', 'W', '09:00 - 12:00']
# the data is not structured, so let's parse it !
days_letters = ('M', 'T', 'W', 'H', 'F') # 'H' used for tHursday
timerange_pattern = re.compile(r"(\d\d):(\d\d) - (\d\d):(\d\d)")
#                        group(1) ^^^^  2^^^^    3^^^^  4^^^^
coursename_pattern = re.compile(r"(\w+\s+\d+\s+\w)")
#                         group(1) ^^^^^^^^^^^^^^
@dataclass
class Course:
    name: str
    slots: list = field(default_factory=list)

@dataclass
class CourseSlot:
    day: str
    time_start: datetime.time
    time_end: datetime.time
tokens = list(raw_data)
courses = []
while tokens:
    token = tokens.pop(0)  # get the first string
    course_name_match = coursename_pattern.fullmatch(token)
    assert course_name_match is not None
    course_name = course_name_match.group(1)
    course = Course(name=course_name)
    # then read the days and hours
    while tokens:
        if tokens[0] in days_letters:
            day = tokens.pop(0)
            timerange_match = timerange_pattern.fullmatch(tokens.pop(0))
            assert timerange_match is not None
            start_hour = int(timerange_match.group(1))
            start_minute = int(timerange_match.group(2))
            end_hour = int(timerange_match.group(3))
            end_minute = int(timerange_match.group(4))
            course.slots.append(CourseSlot(
                day=day,
                time_start=datetime.time(hour=start_hour, minute=start_minute),
                time_end=datetime.time(hour=end_hour, minute=end_minute),
            ))
        else:
            break
    courses.append(course)
print(courses)
# [Course(name='ARTS 111 A', slots=[CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# CourseSlot(day='F', time_start=datetime.time(2, 0), time_end=datetime.time(12, 0))]),
# Course(name='COMP 111 A', slots=[CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))]),
# Course(name='COMP 200 A', slots=[CourseSlot(day='M', time_start=datetime.time(9, 30), time_end=datetime.time(11, 30)),
# CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))])]
@dataclass
class Clash:
    course1_name: str
    course2_name: str
    course1_slot: CourseSlot
    course2_slot: CourseSlot
# now search for overlap :
clashes = []
for i, first_course in enumerate(courses[:-1], start=1):  # from first to last-1
    for second_course in courses[i:]:  # from after the first to the last
        for first_slot in first_course.slots:
            for second_slot in second_course.slots:
                if first_slot.day == second_slot.day:
                    if first_slot.time_start <= second_slot.time_end and \
                            second_slot.time_start <= first_slot.time_end:  # see https://stackoverflow.com/a/3269471/11384184
                        # we have a match !
                        clashes.append(Clash(course1_name=first_course.name,
                                             course2_name=second_course.name,
                                             course1_slot=first_slot,
                                             course2_slot=second_slot))
print(clashes)
# [
# Clash(course1_name='ARTS 111 A',
# course2_name='COMP 111 A',
# course1_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# course2_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
# ),
# Clash(course1_name='ARTS 111 A',
# course2_name='COMP 111 A',
# course1_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# course2_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
# ),
# Clash(course1_name='ARTS 111 A',
# course2_name='COMP 200 A',
# course1_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# course2_slot=CourseSlot(day='M', time_start=datetime.time(9, 30), time_end=datetime.time(11, 30))
# ),
# Clash(course1_name='ARTS 111 A',
# course2_name='COMP 200 A',
# course1_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# course2_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
# ),
# Clash(course1_name='COMP 111 A',
# course2_name='COMP 200 A',
# course1_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# course2_slot=CourseSlot(day='M', time_start=datetime.time(9, 30), time_end=datetime.time(11, 30))
# ),
# Clash(course1_name='COMP 111 A',
# course2_name='COMP 200 A',
# course1_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
# course2_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
# )
# ]
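The clash condition used in the loop is the usual closed-interval overlap test. A minimal sketch isolating it as a standalone function follows; the name `overlaps` is hypothetical and not part of the answer's code. Because the comparison uses `<=`, two slots that merely touch (one ends exactly when the other starts) also count as a clash.

```python
import datetime

def overlaps(start1, end1, start2, end2):
    """Closed-interval overlap: True if the two ranges share at least one instant."""
    return start1 <= end2 and start2 <= end1

t = datetime.time
print(overlaps(t(9, 0), t(12, 0), t(9, 30), t(11, 30)))  # True: second slot lies inside the first
print(overlaps(t(9, 0), t(10, 0), t(10, 0), t(11, 0)))   # True: slots touch at 10:00
print(overlaps(t(9, 0), t(10, 0), t(10, 30), t(11, 0)))  # False: gap between the slots
```

If back-to-back slots should not count as a clash, replacing `<=` with `<` would be the natural change.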
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Hendy |
| Solution 2 | BiGYaN |
| Solution 3 | demokritos |
| Solution 4 | Lenormju |
