Checking to see if record exists in MongoDB before Scrapy inserts
As the title implies, I'm running a Scrapy spider and storing results in MongoDB. Everything is running smoothly, except when I re-run the spider, it adds everything again, and I don't want the duplicates. My pipelines.py file looks like this:
import logging
import pymongo
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy import log

class MongoPipeline(object):
    collection_name = 'openings'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        ## pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        ## initializing spider
        ## opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        ## clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        ## how to handle each post
        if self.db.openings.find({' quote_text': item['quote_text']}) == True:
            pass
        else:
            self.db[self.collection_name].insert(dict(item))
            logging.debug("Post added to MongoDB")
        return item
My spider looks like this:
import scrapy
from ..items import QuotesItem

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        items = QuotesItem()
        quotes = response.xpath('//*[@class="quote"]')
        for quote in quotes:
            author = quote.xpath('.//*[@class="author"]//text()').extract_first()
            quote_text = quote.xpath('.//span[@class="text"]//text()').extract_first()
            items['author'] = author
            items['quote_text'] = quote_text
            yield items
The current syntax is obviously wrong, but is there a slight fix to the for loop that would make it work? Should I be running this loop in the spider instead? I was also looking at upsert but was having trouble understanding how to use it effectively. Any help would be great.
Solution 1:[1]
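- Looks like you have a leading space in the key here: `self.db.openings.find({' quote_text': item['quote_text']})`. I suppose it should just be `'quote_text'`?
- Comparing the result with `== True` does not work: `find()` returns a cursor object, which never equals `True` (and `is True` would fail the same way), so the `else` branch always runs. This is the reason it adds everything again.
- I would suggest using `find_one` (`findOne` in the mongo shell) instead of `find`: it returns the matching document or `None`, which is easy to test, and it is more efficient.
- Using upsert instead is indeed a good idea, but the logic will be slightly different: you will update the data if the item already exists, and insert it when it doesn't exist (instead of doing nothing if the item already exists). With PyMongo 3 or later the syntax should look something like this (`replace_one` supersedes the deprecated `update`):
self.db[self.collection_name].replace_one({'quote_text': item['quote_text']}, dict(item), upsert=True)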
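To make the difference between the two approaches in this answer concrete, here is a minimal sketch, not a definitive implementation. It assumes `collection` behaves like a PyMongo `Collection` (only `find_one`, `insert_one` and `replace_one` are used). The helper names `insert_if_new` and `upsert_item`, the `InMemoryCollection` class and the sample quote are hypothetical; the stand-in class exists only so the example runs without a MongoDB server.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a PyMongo Collection (illustration only)."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first document whose fields all match the query, else None.
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(dict(doc))

    def replace_one(self, query, replacement, upsert=False):
        existing = self.find_one(query)
        if existing is not None:
            # Overwrite the stored document in place.
            existing.clear()
            existing.update(replacement)
        elif upsert:
            self.docs.append(dict(replacement))


def insert_if_new(collection, item, key="quote_text"):
    """Insert the item only if no stored document has the same value for `key`."""
    doc = dict(item)
    if collection.find_one({key: doc[key]}) is not None:
        return False  # duplicate: leave the stored document untouched
    collection.insert_one(doc)
    return True


def upsert_item(collection, item, key="quote_text"):
    """Replace the matching document, or insert the item if nothing matches."""
    doc = dict(item)
    collection.replace_one({key: doc[key]}, doc, upsert=True)


if __name__ == "__main__":
    coll = InMemoryCollection()
    quote = {"author": "Example Author", "quote_text": "An example quote."}
    print(insert_if_new(coll, quote))  # first run: inserted
    print(insert_if_new(coll, quote))  # re-run: skipped
    upsert_item(coll, quote)           # overwrites instead of duplicating
    print(len(coll.docs))
```

In a Scrapy pipeline, either helper could be called from `process_item` with `self.db[self.collection_name]` as the collection. The first variant keeps whatever was stored on the first crawl; the second refreshes stored data on every crawl.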
Solution 2:[2]
Steps:
- If the collection is empty: write the item to the collection.
- If the collection is not empty and the item already exists: do nothing.
- Otherwise (collection not empty and the item doesn't exist): write the item to the collection.
Code:
def process_item(self, item, spider):
    ## how to handle each post
    # empty
    if len(list(self.db[self.collection_name].find({}))) == 0 :
        self.db[self.collection_name].insert_one(dict(item))
    # not empty
    elif item in list(self.db[self.collection_name].find(item,{"_id":0})) :
        print("item exist")
        pass
    else:
        print("new item")
        #print("here is item",item)
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
    return item
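The three steps above can probably be collapsed into a single existence check, since a query against an empty collection simply matches nothing. The sketch below assumes `collection` behaves like a PyMongo `Collection` and uses `count_documents` with `limit=1`, so the server can stop at the first match instead of loading every document into a list. `write_if_absent` and `InMemoryCollection` are hypothetical names; the stand-in class is there only so the example runs without a MongoDB server.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a PyMongo Collection (illustration only)."""

    def __init__(self):
        self.docs = []

    def count_documents(self, query, limit=0):
        # Count documents whose fields all match the query;
        # stop early once `limit` matches are found (0 means no limit).
        count = 0
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                count += 1
                if limit and count >= limit:
                    break
        return count

    def insert_one(self, doc):
        self.docs.append(dict(doc))


def write_if_absent(collection, item):
    """Insert the item unless a document with the same field values exists."""
    doc = dict(item)
    # Works for an empty collection too: the count is simply 0.
    if collection.count_documents(doc, limit=1) > 0:
        return False
    collection.insert_one(doc)
    return True
```

As in the answer's code, the whole item is used as the query, so an item counts as a duplicate only when every field matches a stored document.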
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wim Hermans |
| Solution 2 | med kalil ben ahmed |
