How can I optimize Python code to calculate the distance between two GPS points?

I'm looking for a way to make my Python code that calculates the distance between two GPS points (latitude and longitude) run faster. Here is the code I want to optimize:

from math import radians, sin, cos, atan2, sqrt

def CalcDistanceKM(lat1, lon1, lat2, lon2):
    # convert degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # haversine formula
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = 6371 * c  # mean Earth radius in km

    return distance

This code calculates the distance between pairs of latitude/longitude coordinates read from two CSV files and returns the distance between them.

Some more code to explain the behavior:

for i in range(len(File1)):
    for j in range(len(File2)):
        if File1['AA'][i] == File2['BB'][j]:
            distance = CalcDistanceKM(File2['LATITUDE'][j], File2['LONGITUDE'][j],
                                      File1['Latitude'][i], File1['Longitude'][i])
            File3 = File3.append({'DistanceBetweenTwoPoints': distance}, ignore_index=True)

Thanks.



Solution 1:[1]

Prepare your points as NumPy arrays and then call this haversine function once with the prepared arrays, to take advantage of C performance and vectorisation optimisations - both freebies from the brilliant NumPy library:


import numpy as np

def haversine(x1: np.ndarray,
              x2: np.ndarray,
              y1: np.ndarray,
              y2: np.ndarray
              ) -> np.ndarray:
    """
    Compute the haversine distance between coords (x1, y1) and (x2, y2).

    Input in degrees, arrays or numbers.

    Parameters
    ----------
    x1 : np.ndarray
        X/longitude in degrees for coords pair 1.
    x2 : np.ndarray
        X/longitude in degrees for coords pair 2.
    y1 : np.ndarray
        Y/latitude in degrees for coords pair 1.
    y2 : np.ndarray
        Y/latitude in degrees for coords pair 2.

    Returns
    -------
    np.ndarray or float
        Haversine distance (metres) between the two given points.
    """
    x1 = np.deg2rad(x1)
    x2 = np.deg2rad(x2)
    y1 = np.deg2rad(y1)
    y2 = np.deg2rad(y2)
    # 12_742_000 = 2 * mean Earth radius (6_371_000 m), matching the 6371 km used above
    return 12742000 * np.arcsin((np.sin((y2 - y1) * 0.5) ** 2
                                 + np.cos(y1) * np.cos(y2) * np.sin((x2 - x1) * 0.5) ** 2) ** 0.5)
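As a quick sanity check of the vectorised call (with made-up coordinate arrays): one degree of latitude or longitude along a meridian or the equator is roughly 111.2 km, and one call computes every pairwise distance at once with no Python loop:

```python
import numpy as np

def haversine(x1, x2, y1, y2):
    # vectorised haversine: x = longitude, y = latitude, degrees in, metres out
    x1, x2, y1, y2 = map(np.deg2rad, (x1, x2, y1, y2))
    # 12_742_000 = 2 * mean Earth radius in metres
    return 12742000 * np.arcsin((np.sin((y2 - y1) * 0.5) ** 2
                                 + np.cos(y1) * np.cos(y2)
                                 * np.sin((x2 - x1) * 0.5) ** 2) ** 0.5)

# two pairs of points, evaluated in a single vectorised call
lon1 = np.array([0.0, 10.0])
lon2 = np.array([1.0, 10.0])
lat1 = np.array([0.0, 45.0])
lat2 = np.array([0.0, 46.0])
d = haversine(lon1, lon2, lat1, lat2)
print(d)  # both distances are roughly 111 km, in metres
```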

I see you are iterating File1 and File2 repeatedly - are you searching for matches there? Python for loops are very slow, so that will be a big bottleneck, but without a bit more information on the CSVs being used and how records in File1 are matched with File2 I can't help with that. Maybe add the first couple of records from both files to the question to give it a bit of context?

Update: thanks for including the Colab link.

You start with two dataframes, drive_test and Cells. One of your "if" conditions:

if drive_test['Serving Cell Identity'][i] == Cells['CI'][j] \
  or drive_test['Serving Cell Identity'][i] == Cells['PCIG'][j] \
  and drive_test['E_ARFCN'][i] == Cells['EARFCN_DL'][j]:
# btw this is ambiguous - use brackets; since "and" binds tighter than "or",
# Python reads this as a or (b and c), but that may not be the intention.

can be written as a pandas merge and filter, based on the cross-merge method described in "Create combination of two pandas dataframes in two dimensions":

new_df = drive_test.assign(merge_key=1).merge(Cells.assign(merge_key=1), on='merge_key').drop('merge_key', axis=1)
# note: you will need to pass suffixes=("_x", "_y") (or similar) to merge() if your
# dataframes share column names; empty suffixes raise an error on overlapping columns

cond1_df = new_df[((new_df['Serving Cell Identity'] == new_df.CI)
                   | (new_df['Serving Cell Identity'] == new_df.PCIG))
                  & (new_df.E_ARFCN == new_df.EARFCN_DL)]
cond1_df = cond1_df.assign(distance_between=haversine(cond1_df.Longitude.to_numpy(),
                                                      cond1_df.LONGITUDE.to_numpy(),
                                                      cond1_df.Latitude.to_numpy(),
                                                      cond1_df.LATITUDE.to_numpy()))
# note that my haversine input args are ordered differently from yours

and then you should have all the results for the first condition, and this can be repeated for the remaining conditions. I'm not able to test this on your csvs so it might need a little bit of debugging but the idea should be fine.
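Here is a runnable sketch of the cross-merge-and-filter idea with toy data; the column names (cell_id, CI, etc.) are hypothetical stand-ins for the real ones:

```python
import pandas as pd

# toy stand-ins for drive_test and Cells
drive_test = pd.DataFrame({'cell_id': [1, 2, 3],
                           'Longitude': [10.0, 11.0, 12.0],
                           'Latitude': [50.0, 51.0, 52.0]})
cells = pd.DataFrame({'CI': [2, 3, 4],
                      'LONGITUDE': [11.5, 12.5, 13.5],
                      'LATITUDE': [51.5, 52.5, 53.5]})

# cross merge: every row of drive_test paired with every row of cells (3 x 3 = 9 rows)
new_df = (drive_test.assign(merge_key=1)
          .merge(cells.assign(merge_key=1), on='merge_key')
          .drop('merge_key', axis=1))

# filter down to just the rows satisfying the matching condition
matched = new_df[new_df['cell_id'] == new_df['CI']]
print(matched[['cell_id', 'Longitude', 'LONGITUDE']])
```

The filter then runs once over the merged frame instead of inside a nested Python loop, which is where the speedup comes from.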

Note: depending on how big your CSVs are, this cross merge could explode into an extremely large dataframe and max out your RAM. In that case you are pretty much stuck with iterating, but rather than iterating both frames row by row you could use a piecewise method: iterate the rows of one dataframe and, for each row, match all rows in the other subject to the conditions. That will still be much faster than the nested double loop, though slower than doing it all at once.

Update - trying the second idea, since the new dataframe seems to crash the kernel.

In your loop, you can do something like this for the first condition (and similar for all the next matching conditions)

for i in range(len(drive_test)):
    matching_records = Cells[((Cells.CI == drive_test['Serving Cell Identity'][i])
                              | (Cells.PCIG == drive_test['Serving Cell Identity'][i]))
                             & (Cells.EARFCN_DL == drive_test['E_ARFCN'][i])]
    if len(matching_records) > 0:
        # matching_records only has the Cells columns, so the drive_test coordinates
        # come from row i as scalars - NumPy broadcasts them against the arrays
        matching_records = matching_records.assign(
            distance_between=haversine(drive_test['Longitude'][i],
                                       matching_records.LONGITUDE.to_numpy(),
                                       drive_test['Latitude'][i],
                                       matching_records.LATITUDE.to_numpy()))

which should be considerably faster anyway, since you'll be using just one Python "for" loop and letting the superfast NumPy/pandas query do the rest. This template should also be applicable to your remaining conditions.
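A runnable sketch of this one-loop pattern, again with toy data and hypothetical column names:

```python
import numpy as np
import pandas as pd

def haversine(x1, x2, y1, y2):
    # vectorised haversine (degrees in, metres out); 12_742_000 = 2 * Earth radius
    x1, x2, y1, y2 = map(np.deg2rad, (x1, x2, y1, y2))
    return 12742000 * np.arcsin((np.sin((y2 - y1) * 0.5) ** 2
                                 + np.cos(y1) * np.cos(y2)
                                 * np.sin((x2 - x1) * 0.5) ** 2) ** 0.5)

drive_test = pd.DataFrame({'cell_id': [1, 2],
                           'Longitude': [10.0, 11.0],
                           'Latitude': [50.0, 51.0]})
cells = pd.DataFrame({'CI': [1, 1, 2],
                      'LONGITUDE': [10.0, 10.5, 11.0],
                      'LATITUDE': [50.0, 50.5, 52.0]})

results = []
for i in range(len(drive_test)):
    # one vectorised filter per drive_test row instead of an inner Python loop
    matching = cells[cells.CI == drive_test['cell_id'][i]]
    if len(matching) > 0:
        d = haversine(drive_test['Longitude'][i], matching.LONGITUDE.to_numpy(),
                      drive_test['Latitude'][i], matching.LATITUDE.to_numpy())
        results.append(matching.assign(distance_between=d))

out = pd.concat(results, ignore_index=True)
print(out)
```

Collecting the per-row results and concatenating once at the end also avoids the cost of growing a dataframe inside the loop.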

Solution 2:[2]

I'd suggest having a look at the Geod functionality from pyproj... Since pyproj is an interface to the C++ PROJ library, I'd expect a major speedup compared to pure Python...

https://pyproj4.github.io/pyproj/stable/examples.html#geodesic-line-length

from pyproj import CRS
geod = CRS.from_epsg(4326).get_geod()

lons, lats = [11, 12, 13, 14], [11, 12, 13, 14]

tot_distance = geod.line_length(lons, lats)
intermediate_distances = geod.line_lengths(lons, lats)

print("tot_distance =", tot_distance)
print("intermediate_distances =", intermediate_distances)
>>> tot_distance = 465249.2859017318
>>> intermediate_distances = [155366.4523864174, 155090.444205422, 154792.3893098924]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: raphael