Checking to see if a record exists in MongoDB before Scrapy inserts

As the title implies, I'm running a Scrapy spider and storing results in MongoDB. Everything is running smoothly, except when I re-run the spider, it adds everything again, and I don't want the duplicates. My pipelines.py file looks like this:

import logging
import pymongo

class MongoPipeline(object):

    collection_name = 'openings'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        ## pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        ## initializing spider
        ## opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        ## clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        ## how to handle each post
        if self.db.openings.find({' quote_text': item['quote_text']}) == True:
            pass
        else:
            self.db[self.collection_name].insert(dict(item))
        logging.debug("Post added to MongoDB")
        return item

My spider looks like this:

import scrapy
from ..items import QuotesItem

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//*[@class="quote"]')

        for quote in quotes:
            ## build a fresh item per quote so each yield is independent
            items = QuotesItem()

            author = quote.xpath('.//*[@class="author"]//text()').extract_first()
            quote_text = quote.xpath('.//span[@class="text"]//text()').extract_first()

            items['author'] = author
            items['quote_text'] = quote_text

            yield items

The current syntax is obviously wrong, but is there a small fix that would make this check work? Should I be running this check in the spider's loop instead? I was also looking at upsert but was having trouble understanding how to use it effectively. Any help would be great.



Solution 1:[1]

  • Looks like you have a leading space in the field name here: self.db.openings.find({' quote_text': item['quote_text']}). It should presumably just be 'quote_text'.
  • find never returns True: it returns a cursor, so the == True comparison is always False and every item falls through to the insert branch. This is the reason it adds everything again.
  • I would suggest using find_one instead of find: it returns a single matching document, or None when there is no match, which is more efficient and gives you something you can actually test (find_one(...) is not None).
  • Using upsert instead is indeed a good idea, but the logic will be slightly different: you update the data if the item already exists, and insert it when it doesn't (instead of doing nothing when the item already exists). The syntax should look something like self.db[self.collection_name].update_one({'quote_text': item['quote_text']}, {'$set': dict(item)}, upsert=True) (see the sketch after this list).
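
Putting those points together, here is a minimal sketch of a corrected process_item, assuming PyMongo 3.x or later (where find_one, insert_one, and update_one are available):

def process_item(self, item, spider):
    ## find_one returns the matching document, or None when no match exists
    if self.db[self.collection_name].find_one({'quote_text': item['quote_text']}) is None:
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
    return item

If you prefer the upsert route, the existence check disappears entirely:

def process_item(self, item, spider):
    ## update the stored quote if it already exists, insert it if it doesn't
    self.db[self.collection_name].update_one(
        {'quote_text': item['quote_text']},
        {'$set': dict(item)},
        upsert=True
    )
    return item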

Solution 2:[2]

Steps:

  1. If the collection is empty: write the item to the collection.
  2. If the collection is not empty and the item already exists: skip it.
  3. Otherwise (collection not empty and the item doesn't exist): write the item to the collection.

Code:

def process_item(self, item, spider):
    ## how to handle each post
    # empty collection: nothing to compare against, insert directly
    if len(list(self.db[self.collection_name].find({}))) == 0:
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
    # not empty: check whether an identical document is already stored
    elif dict(item) in list(self.db[self.collection_name].find(dict(item), {"_id": 0})):
        print("item exists")
    else:
        print("new item")
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
    return item
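
Note that the initial empty-collection check is only a guard: a find on an empty collection simply returns no documents, so the elif/else pair on its own would behave the same way. Also keep in mind that find(dict(item), {"_id": 0}) matches on every field of the item, so only exact full duplicates are skipped; a partial match (for example, the same quote with a differently spelled author) would still be inserted.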

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Wim Hermans
Solution 2: med kalil ben ahmed