'Scrapy INSERT into "new_table" only if records do not exist in "current table"
I tried some website scrapping. I success scraped datas in my current db table. But I would like to INSERT into "new_table" only if records do not exist in "current table"
My code is (pipelines)
table = 'products'
table2 = 'new_products'`
def save(self, row):
cursor = self.cnx.cursor()
cursor.execute("SELECT DISTINCT product_id FROM products;")
old_ids = [row[0] for row in cursor.fetchall()]
create_query = ("INSERT INTO " + self.table +
"(rowid, date, listing_id, product_id, product_name, price, url) "
"VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)")
cursor.execute(create_query, row)
lastRecordId = cursor.lastrowid
self.cnx.commit()
cursor.close()
print("Item saved with ID: {}" . format(lastRecordId))
if not product_id in old_ids:
create_query = ("INSERT INTO " + self.table2 +
"(rowid, date, listing_id, product_id, product_name, price, url) "
"VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)")
it's not working and I got errors.
2022-05-06 12:26:57 [scrapy.core.scraper] ERROR: Error processing {'date': '2022-05-06 12:26:57.575507',
'listing_id': '0190199600119',
'price': '4199.00',
'product_id': '1209298',
'product_name': 'APPLE 11" Magic Türkçe Q Klavye Siyah',
'rowid': 456274953331128512,
'url': 'https://www.mediamarkt.com.tr/tr/product/APPLE%2011%22%20Magic%20T%C3%BCrk%C3%A7e%20Q%20Klavye%20Siyah-1209298.html'}
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python3.8/dist-packages/scrapy/utils/defer.py", line 162, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/root/teknosa/teknosa/pipelines.py", line 28, in process_item
self.save(dict(item))
File "/root/teknosa/teknosa/pipelines.py", line 62, in save
if not product_id in old_ids:
NameError: name 'product_id' is not defined
Saving item into db ...
I have unique product_id.
if not product_id in current_table INSERT this product_id to the "new_products"
How to make this?
Thank you.
last edit: I got this error.
2022-05-07 18:17:11 [scrapy.core.scraper] ERROR: Error processing {'date': '2022-05-07 18:17:11.902622',
'listing_id': '8713439219357',
'price': '99.00',
'product_id': '1175529',
'product_name': 'TRUST 21935 NANGA USB 3.1 Kart Okuyucu',
'rowid': -411152717288573423,
'url': 'https://www.mediamarkt.com.tr/tr/product/TRUST%2021935%20NANGA%20USB%203.1%20Kart%20Okuyucu-1175529.html'}
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/mysql/connector/connection_cext.py", line 523, in cmd_query
self._cmysql.query(query,
_mysql_connector.MySQLInterfaceError: Duplicate entry '-411152717288573423' for key 'products.rowid'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python3.8/dist-packages/scrapy/utils/defer.py", line 162, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/root/teknosa/teknosa/pipelines.py", line 28, in process_item
self.save(dict(item))
File "/root/teknosa/teknosa/pipelines.py", line 69, in save
cursor.execute(create_query, row)
File "/usr/local/lib/python3.8/dist-packages/mysql/connector/cursor_cext.py", line 269, in execute
result = self._cnx.cmd_query(stmt, raw=self._raw,
File "/usr/local/lib/python3.8/dist-packages/mysql/connector/connection_cext.py", line 528, in cmd_query
raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.IntegrityError: 1062 (23000): Duplicate entry '-411152717288573423' for key 'products.rowid'
Solution 1:[1]
If you want to INSERT only if it does not exist, you don´t need to do what you are doing.There is no need to select all and then look if the one you are looking exists.
What you need is makinq a unique index for produc_id in table2
Then change your code to:
table = 'products'
table2 = 'new_products'`
def save(self, row):
create_query = ("INSERT INTO " + self.table +
"(rowid, date, listing_id, product_id, product_name, price, url) "
"VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)")
cursor.execute(create_query, row)
lastRecordId = cursor.lastrowid
self.cnx.commit()
print("Item saved with ID: {}" . format(lastRecordId))
create_query = ("INSERT INTO " + self.table2 +
"(rowid, date, listing_id, product_id, product_name, price, url) "
"VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s) ON DUPLICATE KEY UPDATE product_id=product_id")
cursor.execute(create_query, row)
self.cnx.commit()
If you use ON DUPLICATE KEY, when it finds a duplicate row (a product_id that already exists) the system tryes to update the product_id to the same product_id, so it takes no effect.
And you can remove those commit if you set autocommit= True.
EDIT
If as you say in your comment, you need to insert in the new table only if it does not exist in your table, you can change your code like this:
You need to change the name of the variable in the line old_ids = [row[0] for row in cursor.fetchall()] because you are changing the value of your row
parameter
2. Your problem is in the if statement, the product_id variable doesn´t exist, you need to change it
table = 'products'
table2 = 'new_products'`
def save(self, row):
cursor = self.cnx.cursor()
cursor.execute("SELECT DISTINCT product_id FROM products;")
old_ids = [element[0] for element in cursor.fetchall()]
create_query = ("INSERT INTO " + self.table +
"(rowid, date, listing_id, product_id, product_name, price, url) "
"VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)")
cursor.execute(create_query, row)
lastRecordId = cursor.lastrowid
self.cnx.commit()
cursor.close()
print("Item saved with ID: {}" . format(lastRecordId))
if not row['product_id'] in old_ids:
create_query = ("INSERT INTO " + self.table2 +
"(rowid, date, listing_id, product_id, product_name, price, url) "
"VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |