'How to filter out data based on date in python of a csv file
I have a data set as of below & I want to filter data from 2021-07-30 to 2021-08-03 Below is the dataset
input.csv
created_at,text,label
2021-07-24,Newzeland Wins the worldcup,Sport
2021-07-25,ABC Wins the worldcup,Sport
2021-07-26,Hello the worldcup,Sport
2021-07-27,Cricket worldcup,Sport
2021-07-28,Rugby worldcup,Sport
2021-07-29,LLL Wins,Sport
2021-07-30,MMM Wins the worldcup,Sport
2021-07-31,RRR Wins the worldcup,Sport
2021-08-01,OOO Wins the worldcup,Sport
2021-08-02,JJJ Wins the worldcup,Sport
2021-08-03,YYY Wins the worldcup,Sport
2021-08-04,KKK Wins the worldcup,Sport
2021-08-05,YYY Wins the worldcup,Sport
2021-08-06,GGG Wins the worldcup,Sport
2021-08-07,FFF Wins the worldcup,Sport
2021-08-08,SSS Wins the worldcup,Sport
2021-08-09,XYZ Wins the worldcup,Sport
2021-08-10,PQR Wins the worldcup,Sport
output.csv
created_at,text,label
2021-07-30,MMM Wins the worldcup,Sport
2021-07-31,RRR Wins the worldcup,Sport
2021-08-01,OOO Wins the worldcup,Sport
2021-08-02,JJJ Wins the worldcup,Sport
2021-08-03,YYY Wins the worldcup,Sport
import pandas as pd
def save():
tweets = pd.read_csv(r'input.csv.csv')
df = pd.DataFrame(tweets, columns=['created_at', 'text','label'])
if __name__ == '__main__':
save()
Solution 1:[1]
df[(df.created_at >= '2021-07-30') & (df.created_at <= '2021-08-03')]
Output:
created_at text label
6 2021-07-30 MMM Wins the worldcup Sport
7 2021-07-31 RRR Wins the worldcup Sport
8 2021-08-01 OOO Wins the worldcup Sport
9 2021-08-02 JJJ Wins the worldcup Sport
10 2021-08-03 YYY Wins the worldcup Sport
Solution 2:[2]
Try:
df = pd.read_csv('input.csv', parse_dates=['created_at'])
out = df[df['created_at'].between('2021-07-30', '2021-08-03')]
out.to_csv('output.csv', index=False)
Content of output.csv:
created_at,text,label
2021-07-30,MMM Wins the worldcup,Sport
2021-07-31,RRR Wins the worldcup,Sport
2021-08-01,OOO Wins the worldcup,Sport
2021-08-02,JJJ Wins the worldcup,Sport
2021-08-03,YYY Wins the worldcup,Sport
Solution 3:[3]
Another way to achieve this, perhaps even faster:
df = pd.DataFrame({'data':['SPY', 'SPY','SPY', 'SPY', 'SPY', 'SPY','SPY'],
'created_at': ['2021-07-30', '2021-07-31', '2021-08-01', '2021-08-02', '2010-05-06', '2021-08-03', '2021-08-04']
})
df = df.assign(created_at=pd.to_datetime(df['created_at']))
df.query("created_at > '2021-07-30' & created_at < '2021-08-03'")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BeRT2me |
| Solution 2 | Corralien |
| Solution 3 | Vishnu |
