PostgreSQL / psycopg2 issue while scraping
I ran into an issue with my scraper and I am at a loss; I need your help here.
I scrape data from www.racingpost.com and store it in a PostgreSQL database (managed with pgAdmin 4), using psycopg2 for the connection.
This worked perfectly fine until yesterday, when I wanted to scrape some newer data.
I collect the links to the pages with one spider, save them in a JSON file, read that file with a second spider, and crawl those links. I log errors so I can see where a problem might be.
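The JSON handoff between the two spiders can be sketched as follows (the file name `links.json` and the `link` key are assumptions, not taken from the actual project; in the second spider each loaded URL would then be turned into a `scrapy.Request`):

```python
import json

def load_links(path):
    """Read the link list exported by the first spider.
    Each entry is assumed to be an object with a 'link' key."""
    with open(path, encoding="utf-8") as f:
        return [entry["link"] for entry in json.load(f)]

# Write a tiny example file and read it back, standing in for
# the first spider's output.
sample = [{"link": "https://www.racingpost.com/results/1"},
          {"link": "https://www.racingpost.com/results/2"}]
with open("links.json", "w", encoding="utf-8") as f:
    json.dump(sample, f)

print(load_links("links.json"))
```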
Now, when I run the spiders, the second spider no longer crawls all the links in the JSON file properly. The code is exactly the same as before, so the code itself is not the issue; it seems the behavior of psycopg2 has changed.
After the first error:
ERROR:scrapy.core.scraper:Error processing {'date': ('16.03.2022',), 'track': ('Chantilly (FR)',), 'racename': ('Prix des Ecuries Cantiliennes (Handicap) (4yo+) (All-Weather Track) (Polytrack)',), 'racetype': ('Flat',), 'distance': (1911.1,), 'group': ('',), 'raceclass': (0,), 'classrating': (0,), 'alterteilnehmer': ('4yo+',), 'starterzahl': (13,), 'minalter': (4,), 'maxalter': (99,), 'winningtime': (0,), 'going': ('Standard',), 'finalhurdle': (0,), 'omitted': (0,), 'pricemoney1': (9319.0,), 'pricemoney2': (3727.0,), 'pricemoney3': (2796.0,), 'pricemoney4': (1863.0,), 'pricemoney5': (932.0,), 'pricemoney6': (0.0,), 'pricemoney7': (0.0,), 'pricemoney8': (0.0,), 'racetime': ('3:45',)}
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\virtual_workspace\envs\py39\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "C:\ProgramData\Anaconda3\envs\virtual_workspace\envs\py39\lib\site-packages\scrapy\utils\defer.py", line 162, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "C:\Users\****\projects\jsontest\jsontest\pipelines.py", line 30, in process_item
self.store_db(item)
File "C:\Users\****\projects\jsontest\jsontest\pipelines.py", line 64, in store_db
self.cur.execute("insert into races(racedate, track, racename, racetype, distancefinal, gruppe, raceclass, classrating, alterteilnehmer, starterzahl, minalter, maxalter, winningtime, going, finalhurdle, omitted, pricemoney1, pricemoney2, pricemoney3, pricemoney4, pricemoney5, pricemoney6, pricemoney7, pricemoney8, racetime) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
psycopg2.errors.InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block
Every crawled result from this point on produces an error message like this.
This was not the case before, when the next transaction (the next result) was processed normally. And if I crawl one of these "error" results with the same spider, but without reading the JSON file (so only that single result), everything works fine. And as I said, this code read 165,000 pages without a single error three days ago.
So I guess the problem is not the code per se. What could cause such problems in psycopg2?
- Are there any known issues I missed while googling?
- How can I narrow it down?
Thanks so much!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow