'Removing dublicate on PostgreSQL based on 2 columns
I want to drop entries based on 2 columns in Postgres. The code for to get the data is:
import pandas as pd
from sqlalchemy import create_engine
from pandas_datareader import data as web
import yfinance as yf
import datetime
from dateutil.relativedelta import relativedelta
engine = create_engine('postgresql://postgres:{}@localhost:5433/stockdata'.format(123))
tickers = ['amzn', 'fb']
for ticker in tickers:
df = yf.download(tickers=ticker, start=datetime.datetime.now()-relativedelta(days=7), end= datetime.datetime.now(), interval="5m", progress = False)
df['symbol'] = ticker
df.to_sql('stockdata', con=engine, if_exists='append', index=True)
This will produce a table which looks like this but running the code multiple times will create duplicates:
Datetime Open High Low Close Adj Close Volume symbol
0 2022-03-28 13:30:00+00:00 3299.500000 3318.000000 3298.000000 3311.023438 3311.023438 133898 amzn
1 2022-03-28 13:35:00+00:00 3313.000000 3344.479980 3313.000000 3339.060059 3339.060059 104565 amzn
2 2022-03-28 13:40:00+00:00 3348.500000 3365.989990 3342.344482 3354.570068 3354.570068 126489 amzn
3 2022-03-28 13:45:00+00:00 3356.000000 3356.000000 3340.000000 3340.000000 3340.000000 70492 amzn
4 2022-03-28 13:50:00+00:00 3338.500000 3348.669922 3337.385010 3347.584961 3347.584961 64041 amzn
... ... ... ... ... ... ... ... ...
1559 2022-04-01 19:40:00+00:00 223.380005 223.660004 223.039993 223.110001 223.110001 258564 fb
1560 2022-04-01 19:45:00+00:00 223.080002 223.589996 223.059998 223.570007 223.570007 246820 fb
1561 2022-04-01 19:50:00+00:00 223.600006 224.610001 223.600006 224.485001 224.485001 469905 fb
1562 2022-04-01 19:55:00+00:00 224.460007 224.940002 224.460007 224.869995 224.869995 651629 fb
1563 2022-04-01 20:00:00+00:00 224.850006 224.850006 224.850006 224.850006 224.850006 0 fb
1564 rows × 8 columns
There is no primary key in the database(I don't have any control over that). Thus the objective is to make sure that Datetime is unique to the Symbol. For example FB can only have 1 instance of unique Datetime entry. The data is on a 5m resolution. How can I group by on Datetime and Symbol and remove duplicate entry based on column Datetime
I was trying to do this earlier:
query = """delete from stockdata s
using
(SELECT "Datetime" , max(ctid) as max_ctid from stockdata group by "Datetime") t
where s.ctid<>t.max_ctid
and s."Datetime"=t."Datetime";"""
But this was looking at just Datetime duplicates and not based on grouping Symbol symbol column as well.
To get the a list of duplicate:
SELECT "symbol", "Datetime", COUNT(*) AS CNT
FROM stockdata
GROUP BY "symbol", "Datetime"
HAVING COUNT(*) > 1;
How can I remove all duplicates and keep 1. Any help would be appreciated!
Solution 1:[1]
Revised query. Test before using in production:
query = """delete from stockdata s
using
(SELECT "symbol", "Datetime" , max(ctid) as max_ctid from stockdata group by "symbol","Datetime") t
where s.ctid <> t.max_ctid
and s."Datetime" = t."Datetime"
and s."Symbol" = t."Symbol";"""
Also from my comments on previous question, it would be better to add a UNIQUE constraint on the receiving table to prevent duplicates from be entered in the first place.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Adrian Klaver |
