Pass a list of start_urls as parameter from Django to Scrapyd

I'm working on a small scraping platform using Django and Scrapy (with Scrapyd as the API). The default spider works as expected: using python-scrapyd-api I pass a URL from Django, scrape the data, and even save the results as JSON to a Postgres instance. All of this works for a SINGLE URL passed as a parameter.

When I try to pass a list of URLs, Scrapy only takes the first URL from the list. I don't know whether this is something about how Python or python-scrapyd-api treats or processes these arguments.

# views.py
# This is how I pass parameters from Django
task = scrapyd.schedule(
        project=scrapy_project,
        spider=scrapy_spider,
        settings=scrapy_settings,
        url=urls
    )
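Since each scheduled argument travels as a single form value, one common workaround (a sketch, not the only fix; `schedule_urls` is a hypothetical helper) is to join the list into one delimited string on the Django side and split it again in the spider:

```python
# views.py -- sketch: send the whole list as ONE string so it survives the POST.
# `scrapyd`, `project`, `spider`, and `settings` come from the view as above.

def schedule_urls(scrapyd, project, spider, settings, urls):
    """Join the URL list with a comma (never part of a URL) and schedule once."""
    return scrapyd.schedule(
        project=project,
        spider=spider,
        settings=settings,
        url=",".join(urls),  # a single string reaches the spider intact
    )
```

The spider would then split `kwargs.get('url')` on the same delimiter to rebuild `start_urls`.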


# default_spider.py
def __init__(self, *args, **kwargs):
    super(SpiderMercadoLibre, self).__init__(*args, **kwargs)
    self.domain = kwargs.get('domain')
    # list(kwargs.get('url')) doesn't work: only one URL arrives here,
    # and list() on a string just splits it into individual characters
    self.start_urls = [self.url]
    self.allowed_domains = [self.domain]

# Setup to tell Scrapy to make calls from the same URLs
def start_requests(self):
    ...
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta={'original_url': url}, dont_filter=True)
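On the spider side, if the URLs arrive as one comma-joined string (that convention is an assumption; `parse_url_arg` is a hypothetical helper), the argument can be decoded back into a list like this:

```python
# default_spider.py -- sketch: decode a comma-joined `url` spider argument.

def parse_url_arg(raw):
    """Split the joined string back into a clean list of start URLs."""
    return [u.strip() for u in raw.split(",") if u.strip()]

# In SpiderMercadoLibre.__init__ this would replace `self.start_urls = [self.url]`:
#     self.start_urls = parse_url_arg(kwargs.get("url", ""))
```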

Of course I could change my model so that I save each result by iterating over the list of URLs and scheduling each URL separately with ScrapydAPI, but I'm wondering whether this is a limitation of Scrapyd itself or whether I'm missing something about Python mechanics.
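That iteration approach can be sketched as follows (one Scrapyd job per URL; `schedule_each` is a hypothetical helper around the same python-scrapyd-api client used in views.py):

```python
# views.py -- sketch: fall back to scheduling one Scrapyd job per URL.

def schedule_each(scrapyd, project, spider, settings, urls):
    """Schedule a separate job for every URL; returns the list of job ids."""
    return [
        scrapyd.schedule(project=project, spider=spider,
                         settings=settings, url=url)
        for url in urls
    ]
```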

This is how ScrapydAPI is processing the schedule method:

def schedule(self, project, spider, settings=None, **kwargs):
    """
    Schedules a spider from a specific project to run. First class, maps
    to Scrapyd's scheduling endpoint.
    """
    url = self._build_url(constants.SCHEDULE_ENDPOINT)
    data = {
        'project': project,
        'spider': spider
    }
    data.update(kwargs)
    if settings:
        setting_params = []
        for setting_name, value in iteritems(settings):
            setting_params.append('{0}={1}'.format(setting_name, value))
        data['setting'] = setting_params
    json = self.client.post(url, data=data, timeout=self.timeout)
    return json['jobid']
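As far as I can tell, the list is not lost in this method: `data.update(kwargs)` keeps it intact. It gets flattened at the HTTP layer, because `requests` (which `self.client.post` wraps) form-encodes a list value as repeated fields with the same key, and Scrapyd's schedule endpoint reads a single value per key. The repetition can be reproduced with the standard library alone:

```python
# Sketch: how a list value is form-encoded on the wire -- the same encoding
# `requests` applies to data={'url': [...]}.
from urllib.parse import urlencode

data = {"project": "p", "spider": "s", "url": ["http://a.com", "http://b.com"]}
body = urlencode(data, doseq=True)
# The body now carries two separate `url=` fields; a server that keeps only
# one value per key ends up seeing just one of the URLs.
```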

I think I'm implementing everything as expected, but every time, no matter which approach I use, only the first URL from the list is scraped.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow