Upsert in Scrapy + PostgreSQL Pipeline?
I have a Scrapy spider that inserts the scraped data into a PostgreSQL database. It works fine on the first run, but if the primary key already exists, I get the corresponding duplicate key errors.
I tried to implement the upsert DO NOTHING or DO UPDATE commands I read about. I even read the blog post from Depesz about why upsert is so complicated; while it was interesting, parts of it were a bit over my head, and I won't run into any race conditions with this project anyway.
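For reference, the two general forms I read about look roughly like this; the table, columns and variables here are just placeholders, not my actual schema:

# DO NOTHING: silently skip the row if the conflict target already exists
self.cur.execute(
    "INSERT INTO some_table (id, name) VALUES (%s, %s) "
    "ON CONFLICT (id) DO NOTHING",
    (some_id, some_name))

# DO UPDATE: overwrite the non-key column of the existing row instead
self.cur.execute(
    "INSERT INTO some_table (id, name) VALUES (%s, %s) "
    "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
    (some_id, some_name))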
I tried to use this code in my pipeline:
def store_db(self, item):
    if "pos" in item:
        self.cur.execute(
            "insert into horses(pos, draw, dwinner, dnext, startnumber, horsename, horsecountry, odds, jockey, trainer, weight, age, headgear, ofr, rp, ts, rprc, date, track, racename) "
            "values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) "
            "ON CONFLICT (pos, draw, dwinner, dnext, startnumber, horsename, horsecountry, odds, jockey, trainer, weight, age, headgear, ofr, rp, ts, rprc, date, track, racename) DO NOTHING",
            (
                item['pos'],
                item['draw'],
                item['dwinner'],
                item['dnext'],
                item['startnumber'],
                item['pferde'],
                item['horsecountry'],
                item['odd'],
                item['jockey'],
                item['trainer'],
                item['weightkg'],
                item['alter'],
                item['headgear'],
                item['officalrating'],
                item['rp'],
                item['ts'],
                item['rprc'],
                item['date'],
                item['track'],
                item['racename']
            ))
        self.connection.commit()
    return item
But I guess I messed it up. How do I implement the ON CONFLICT clause in this pipeline correctly?
***** EDIT *****
In another table, I set date, track and racename as the primary key. If I reduce the code to only these three columns,
self.cur.execute("insert into(date, track, racename) values(%s, %s, %s) ON CONFLICT (date, track, racename) DO NOTHING
it works as intended. But if I add another, non-primary-key field like "racetype", I get this error:
File "C:\Anaconda3\envs\virtual_workspace\lib\site-packages\twisted\internet\defer.py", line 858, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "C:\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 150, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "C:\Users\Schalumpa\projects\jsontest\jsontest\pipelines.py", line 30, in process_item
self.store_db(item)
File "C:\Users\Schalumpa\projects\jsontest\jsontest\pipelines.py", line 42, in store_db
item['racetype']
KeyError: 'racetype'
I read here that "The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or exclusion constraint violation error". Does that mean that I can only use ON CONFLICT on primary keys, and that the rest still gets processed? I thought that with "DO NOTHING" the operation on this column would simply be aborted. I don't want to update anything in the tables, as there is no "new" data once a row has been inserted once.
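To make the second case concrete, this is roughly what I am attempting; the table name races here is only a placeholder for illustration:

def store_db(self, item):
    # date, track and racename form the primary key of this table;
    # racetype is an extra, non-key column I also want to store.
    # "races" is a placeholder table name for this example.
    self.cur.execute(
        "insert into races(date, track, racename, racetype) "
        "values(%s, %s, %s, %s) "
        "ON CONFLICT (date, track, racename) DO NOTHING",
        (
            item['date'],
            item['track'],
            item['racename'],
            item['racetype'],  # this lookup is where the KeyError in the traceback above is raised
        ))
    self.connection.commit()
    return item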
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow