Trying to remove commas and dollar signs with Pandas in Python

I'm trying to remove the commas and dollar signs from the columns, but when I do, the table still prints with them in place. Is there a different way to remove the commas and dollar signs using a pandas function? I was unable to find anything in the API docs, or maybe I was looking in the wrong place.

import pandas as pd
import pandas_datareader.data as web

players = pd.read_html('http://www.usatoday.com/sports/mlb/salaries/2013/player/p/')


df1 = pd.DataFrame(players[0])


df1.drop(df1.columns[[0,3,4, 5, 6]], axis=1, inplace=True)
df1.columns = ['Player', 'Team', 'Avg_Annual']
df1['Avg_Annual'] = df1['Avg_Annual'].replace(',', '')

print (df1.head(10))


Solution 1:[1]

Shamelessly stolen from this answer... but, that answer is only about changing one character and doesn't complete the coolness: since it takes a dictionary, you can replace any number of characters at once, as well as in any number of columns.

# if you want to operate on multiple columns, put them in a list like so:
cols = ['col1', 'col2', ..., 'colN']

# pass them to df.replace(), specifying each char and its replacement
# (raw string so that the backslash escaping '$' reaches the regex engine intact):
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True)

@shivsn caught that you need to use regex=True; you already knew about replace (but also didn't show trying to use it on multiple columns or both the dollar sign and comma simultaneously).
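To make the difference concrete, here is a minimal sketch of the call above with and without regex=True. The sample DataFrame and its column names are made up for illustration and only stand in for the scraped salary table.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped salary table.
df = pd.DataFrame({
    "Player": ["A", "B"],
    "Salary": ["$1,000,000", "$250,500"],
    "Avg_Annual": ["$500,000", "$125,250"],
})

cols = ["Salary", "Avg_Annual"]

# Without regex=True, replace() only matches whole cell values,
# so a cell like "$1,000,000" is left untouched.
unchanged = df[cols].replace({",": ""})

# With regex=True, each key is a pattern applied inside every string.
# r"\$" escapes the dollar sign, which otherwise means "end of string".
df[cols] = df[cols].replace({r"\$": "", ",": ""}, regex=True)

# The cleaned values are still strings; convert them to numbers.
df[cols] = df[cols].apply(pd.to_numeric)

print(df)
```

The final pd.to_numeric step is an addition of mine, not part of the original answer; drop it if you want to keep the cleaned values as strings.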

This answer simply spells out, in one place, the details I found from others, for those like me (e.g. noobs to Python and pandas). Hope it's helpful.

Solution 2:[2]

@bernie's answer is spot on for your problem. Here's my take on the general problem of loading numerical data in pandas.

Often the source of the data is a report generated for direct consumption, hence the presence of extra formatting like %, thousands separators, currency symbols etc. All of these are useful for reading but cause problems for the default parser. My solution is to typecast the column to string, replace these symbols one by one, then cast it back to the appropriate numerical format. Having a boilerplate function which retains only [0-9.] is tempting, but it causes problems where the thousands separator and the decimal mark are swapped, and also in the case of scientific notation. Here's my code, which I wrap into a function and apply as needed.

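df[col] = df[col].astype(str)  # cast to string

# all the string surgery goes in here
# .str.replace works on substrings; plain Series.replace only matches whole values.
# regex=False makes '$' a literal character instead of an end-of-string anchor.
df[col] = df[col].str.replace('$', '', regex=False)
df[col] = df[col].str.replace(',', '', regex=False)  # assuming ',' is the thousands separator in your locale
df[col] = df[col].str.replace('%', '', regex=False)

df[col] = df[col].astype(float)  # cast back to appropriate type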
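One way the steps above might be wrapped into a function is sketched below. The name to_number and its parameters thousands and symbols are my own choices for illustration, not pandas API, and the sample values are invented.

```python
import pandas as pd


def to_number(series, thousands=",", symbols=("$", "%")):
    """Strip formatting characters from a column and cast it to float.

    `to_number`, `thousands`, and `symbols` are illustrative names only.
    """
    cleaned = series.astype(str)
    for char in (thousands, *symbols):
        # regex=False treats "$" as a literal character, not a regex anchor.
        cleaned = cleaned.str.replace(char, "", regex=False)
    return cleaned.astype(float)


# Hypothetical values: currency, a percentage, scientific notation, a plain number.
prices = pd.Series(["$1,234.50", "12%", "1e3", "7"])
numbers = to_number(prices)
print(numbers.tolist())
```

Because only the listed characters are removed, a value in scientific notation such as "1e3" still parses, which a keep-only-[0-9.] filter would break. For a locale that swaps the separators, you would pass a different thousands character and handle the decimal mark yourself.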

Solution 3:[3]

I used this logic

df.col = df.col.apply(lambda x:x.replace('$','').replace(',',''))
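A caveat with this approach: the lambda calls the built-in str.replace on every cell, so it raises an error if the column contains missing or non-string values. Below is a minimal sketch of a guarded variant; the helper name strip_currency and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"col": ["$1,200", "$35", None]})


def strip_currency(value):
    # Python's str.replace is literal, so "$" needs no escaping here.
    if isinstance(value, str):
        return value.replace("$", "").replace(",", "")
    # Leave missing or non-string values untouched.
    return value


# Clean each cell, then convert the column to a numeric dtype.
df["col"] = pd.to_numeric(df["col"].apply(strip_currency))
print(df)
```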

Solution 4:[4]

import datetime
import re
from dataclasses import dataclass, field

raw_data = ['ARTS  111  A', 'M', '09:00 - 12:00', 'W', '09:00 - 12:00', 'F', '02:00 - 12:00',
            'COMP 111  A', 'M', '09:00 - 12:00', 'W', '09:00 - 12:00',
            'COMP 200 A', 'M', '09:30 - 11:30', 'W', '09:00 - 12:00']

# the data is not structured, so let's parse it !

days_letters = ('M', 'T', 'W', 'H', 'F')  # 'H' used for tHursday
timerange_pattern = re.compile(r"(\d\d):(\d\d) - (\d\d):(\d\d)")
#                        group(1) ^^^^  2^^^^    3^^^^  4^^^^
coursename_pattern = re.compile(r"(\w+\s+\d+\s+\w)")
#                         group(1) ^^^^^^^^^^^^^^


@dataclass
class Course:
    name: str
    slots: list = field(default_factory=list)


@dataclass
class CourseSlot:
    day: str
    time_start: datetime.time
    time_end: datetime.time


tokens = list(raw_data)
courses = []
while tokens:
    token = tokens.pop(0)  # get the first string

    course_name_match = coursename_pattern.fullmatch(token)
    assert course_name_match is not None
    course_name = course_name_match.group(1)

    course = Course(name=course_name)

    # then read the days and hours
    while tokens:
        if tokens[0] in days_letters:
            day = tokens.pop(0)
            timerange_match = timerange_pattern.fullmatch(tokens.pop(0))
            assert timerange_match is not None

            start_hour = int(timerange_match.group(1))
            start_minute = int(timerange_match.group(2))
            end_hour = int(timerange_match.group(3))
            end_minute = int(timerange_match.group(4))

            course.slots.append(CourseSlot(
                day=day,
                time_start=datetime.time(hour=start_hour, minute=start_minute),
                time_end=datetime.time(hour=end_hour, minute=end_minute),
            ))

        else:
            break
    courses.append(course)

print(courses)
# [Course(name='ARTS  111  A', slots=[CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#                                     CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#                                     CourseSlot(day='F', time_start=datetime.time(2, 0), time_end=datetime.time(12, 0))]),
#  Course(name='COMP 111  A', slots=[CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#                                    CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))]),
#  Course(name='COMP 200 A', slots=[CourseSlot(day='M', time_start=datetime.time(9, 30), time_end=datetime.time(11, 30)),
#                                   CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))])]

@dataclass
class Clash:
    course1_name: str
    course2_name: str
    course1_slot: CourseSlot
    course2_slot: CourseSlot

# now search for overlap :
clashes = []
for i, first_course in enumerate(courses[:-1], start=1):  # from first to last-1
    for second_course in courses[i:]:  # from after the first to the last
        for first_slot in first_course.slots:
            for second_slot in second_course.slots:
                if first_slot.day == second_slot.day:
                    if first_slot.time_start <= second_slot.time_end and \
                            second_slot.time_start <= first_slot.time_end:  # see https://stackoverflow.com/a/3269471/11384184
                        # we have a match !
                        clashes.append(Clash(course1_name=first_course.name,
                                             course2_name=second_course.name,
                                             course1_slot=first_slot,
                                             course2_slot=second_slot))

print(clashes)
# [
#   Clash(course1_name='ARTS  111  A',
#         course2_name='COMP 111  A',
#         course1_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#         course2_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
#   ),
#   Clash(course1_name='ARTS  111  A',
#         course2_name='COMP 111  A',
#         course1_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#         course2_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
#   ),
#   Clash(course1_name='ARTS  111  A',
#         course2_name='COMP 200 A',
#         course1_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#         course2_slot=CourseSlot(day='M', time_start=datetime.time(9, 30), time_end=datetime.time(11, 30))
#   ),
#   Clash(course1_name='ARTS  111  A',
#         course2_name='COMP 200 A',
#         course1_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#         course2_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
#   ),
#   Clash(course1_name='COMP 111  A',
#         course2_name='COMP 200 A',
#         course1_slot=CourseSlot(day='M', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#         course2_slot=CourseSlot(day='M', time_start=datetime.time(9, 30), time_end=datetime.time(11, 30))
#   ),
#   Clash(course1_name='COMP 111  A',
#         course2_name='COMP 200 A',
#         course1_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0)),
#         course2_slot=CourseSlot(day='W', time_start=datetime.time(9, 0), time_end=datetime.time(12, 0))
#   )
# ]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Hendy
Solution 2 BiGYaN
Solution 3 demokritos
Solution 4 Lenormju