Python - Improve urllib performance

I have a huge dataframe with 1M+ rows and I want to collect some information from Nominatim via its HTTP API. I have used the geopy library before, but I had some problems, so I decided to use the API directly instead.

But my code is too slow at making the requests.

URLs sample:

  urls = ['https://nominatim.openstreetmap.org/?addressdetails=1&q=Suape&format=json&limit=1&accept-language=en',
 'https://nominatim.openstreetmap.org/?addressdetails=1&q=Kwangyang&format=json&limit=1&accept-language=en',
 'https://nominatim.openstreetmap.org/?addressdetails=1&q=Luanda&format=json&limit=1&accept-language=en',
 'https://nominatim.openstreetmap.org/?addressdetails=1&q=Ancona&format=json&limit=1&accept-language=en',
 'https://nominatim.openstreetmap.org/?addressdetails=1&q=Xiamen&format=json&limit=1&accept-language=en',
 'https://nominatim.openstreetmap.org/?addressdetails=1&q=Jebel%20Dhanna/Ruwais&format=json&limit=1&accept-language=en',
 'https://nominatim.openstreetmap.org/?addressdetails=1&q=Nemrut%20Bay%20&format=json&limit=1&accept-language=en']

Sample code below, for a single test URL:

import pandas as pd
import urllib.request
import json

url = urls[0]  # e.g. the Suape query from the list above
req = urllib.request.Request(url, None)
with urllib.request.urlopen(req) as f:
    page = f.read()
nominatim = json.loads(page.decode())
result = nominatim[0]['address']['country']

So, I created a function to apply to my dataframe column that holds location addresses (e.g. Suape, Kwangyang, Luanda...):

import urllib.parse

def country(address):
    try:
        # Percent-encode the whole query value instead of replacing
        # spaces by hand (also handles other special characters)
        q = urllib.parse.quote(address)
        url = f'https://nominatim.openstreetmap.org/?addressdetails=1&q={q}&format=json&limit=1&accept-language=en'
        req = urllib.request.Request(url, None)
        with urllib.request.urlopen(req) as f:
            page = f.read()
        nominatim = json.loads(page.decode())
        return nominatim[0]['address']['country']
    except Exception:
        return None

It takes too long to run. I have tried to optimize my function and tried different approaches to applying it to the column, and now I'm trying to improve the requests, because that is the part that takes the most time. Any suggestions? I have tried threading, but it did not work as expected (maybe I did not do it properly). I also tested asyncio, but that code did not work either.
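For context, a minimal threaded version of what I was attempting could look roughly like the sketch below, using `concurrent.futures.ThreadPoolExecutor` from the standard library. Note that Nominatim's public usage policy asks for at most one request per second and a valid `User-Agent` header, so the worker count and header here are assumptions for illustration, not my exact attempt:

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def build_url(address):
    # urlencode percent-encodes the query parameters safely
    # (spaces become '+', which Nominatim accepts)
    query = urllib.parse.urlencode({
        'addressdetails': 1,
        'q': address,
        'format': 'json',
        'limit': 1,
        'accept-language': 'en',
    })
    return f'https://nominatim.openstreetmap.org/?{query}'

def fetch_country(address):
    # A descriptive User-Agent is required by Nominatim's usage policy;
    # 'my-geocoding-script' is a placeholder
    req = urllib.request.Request(
        build_url(address), headers={'User-Agent': 'my-geocoding-script'})
    try:
        with urllib.request.urlopen(req, timeout=10) as f:
            data = json.loads(f.read().decode())
        return data[0]['address']['country'] if data else None
    except Exception:
        return None

# Example (performs live requests, so left commented here):
# addresses = ['Suape', 'Luanda', 'Ancona']
# with ThreadPoolExecutor(max_workers=2) as pool:
#     countries = list(pool.map(fetch_country, addresses))
```
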

Thank you!

Edit: I'll only make requests for the unique values of this column, which correspond to approx. 4K rows. But even with 4K rows, the code takes too much time to run.
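The deduplication step itself can be sketched like this: geocode each unique value once, then map the results back onto the full column. The column name `'port'` and the stand-in lookup dict are hypothetical; in practice the lookup would be the `country` function above:

```python
import pandas as pd

def add_country_column(df, col, lookup):
    # Geocode each unique address exactly once, then map the results
    # back, so 1M+ rows cost only as many requests as unique values
    unique_values = df[col].dropna().unique()
    mapping = {addr: lookup(addr) for addr in unique_values}
    return df[col].map(mapping)

# Demo with a stand-in lookup instead of a live Nominatim call:
df = pd.DataFrame({'port': ['Suape', 'Luanda', 'Suape', 'Ancona']})
fake = {'Suape': 'Brazil', 'Luanda': 'Angola', 'Ancona': 'Italy'}
df['country'] = add_country_column(df, 'port', fake.get)
```
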



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
