'Apply formula on a column conditional to another column

import requests
from geopy.geocoders import Nominatim
import streamlit as st
import numpy as np
import pandas as pd
import random, string

print('ville :')
codePostal= str(input())

print('Adresse:')
adresse= str(input())


def randomword(length):
   letters = string.ascii_lowercase
   return ''.join(random.choice(letters) for i in range(length))

geolocator = Nominatim(user_agent=randomword(10))

location = geolocator.geocode(adresse+' ,France ,'+codePostal)

lat=str(location.latitude)
 
lon=str(location.longitude)

dist='1000'

url= 'http://api.cquest.org/dvf?lat='+lat+'&lon='+lon+'&dist='+dist

request= requests.get(url)
 
dataR = request.json()
 
datacomr1=list(dict.values(dataR))
 
area=datacomr1[4]

df = pd.DataFrame.from_dict(pd.json_normalize(area), orient='columns')

df.columns = df.columns.str.replace('properties.', '')
df.columns = df.columns.str.replace('.', '_')
df['nature_mutation'] = df['nature_mutation'].str.replace("'",'')

df = df[df['surface_relle_bati'] > 0]

df['surface_terrain'] = df['surface_terrain'].fillna(0)

At this point I want to calculate the price per square meter depending on wether or not the real estate transaction (valeur fonciere) contain a piece of land (surface terrain).

To be sure that my code is good I made a value count of the transactions that appear without land:

print(df.surface_terrain.value_counts())

out:

0.0       4703
1116.0     181
5751.0      90
395.0       75
1333.0      41
          ... 
998.0        1
717.0        1
1855.0       1
502.0        1
453.0        1
Name: surface_terrain, Length: 141, dtype: int64

So I have 4703 transaction with zero land.

Now I apply my formula in two columns :

-one without taking into account the land df['price_m22'] -another taking it into consideration df['price_m2']

df['price_m22'] = df['valeur_fonciere']/df['surface_relle_bati']

df.loc[df['surface_terrain'] > 0, 'price_m2'] = df['valeur_fonciere']/(df['surface_relle_bati']+ np.log(df.surface_terrain))

df.loc[df['surface_terrain'] == 0, 'price_m2'] = df['valeur_fonciere']/df['surface_relle_bati']

df['difference'] = df['price_m22'] - df['price_m2']

So at this point I should have a difference of 0.00 for 4703 rows but when I value_counts() the column difference I get :

0.000000e+00    4775
6.922453e+06     175
3.216753e+05      18
1.992464e+04       7
1.093293e+06       6
                ... 
4.414213e+04       1
6.632227e+04       1
7.221947e+05       1
5.838055e+04       1
3.467473e+01       1
Name: difference, Length: 1235, dtype: int64

Any idea how to fix it?



Solution 1:[1]

There are two ways in which price_m22 ends up being equal to price_m2:

  • surface_terrain is 1: in this case np.log(df.surface_terrain) == 0.
  • valeur_fonciere is 0, in this case price_m22 == price_m2 == 0.

So it is likely a combination of the two scenarios. Check the value_counts for both, they will give you a number matching the difference you are seeing.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1