Adding new dataframe columns using information extracted from the URL in the url column, when the URL may be missing information
Given: A pandas dataframe that contains a user_url column among other columns.
Expectation: Four new columns (car_make, model, year and user_id) added to the original dataframe, each filled with information extracted from the URL in the user_url column.
Some Extra info: We know that the car_make will only contain letters either with or without a '-'. The model can contain any characters. The year will only be 4 digits long. The user_id will consist of digits of any length.
I tried using a regex to split the URL, but it failed when there was missing or extra information. I also tried just splitting the data, but I had the same issue using split.
Given dataframe
mpg miles user_url
0 NaN NaN https://www.somewebsite.com/suzuki/swift/2015/674857
1 31.6 NaN https://www.somewebsite.com/bmw/x3/2009/461150
2 28.5 NaN https://www.somewebsite.com/mercedes-benz/e300/1998/13
3 46.8 NaN https://www.somewebsite.com/320d/2010/247233
4 21.0 244.4 https://www.somewebsite.com/honda/pass/2019/1038865
5 25.0 254.4 https://www.somewebsite.com/volkswagen/passat/11
Expected Dataframe
mpg miles user_url car_make model year \
0 NaN NaN https://www.somewebsite.com/suzuki/swift/2015/674857 suzuki swift 2015
1 31.6 NaN https://www.somewebsite.com/bmw/x3/2009/461150 bmw x3 2009
2 28.5 NaN https://www.somewebsite.com/mercedes-benz/e300/1998/13 mercedes-benz e300 1998
3 46.8 NaN https://www.somewebsite.com/320d/2010/247233 NaN 320d 2010
4 21.0 244.4 https://www.somewebsite.com/honda/pass/2019/1038865 honda pass 2019
5 25.0 254.4 https://www.somewebsite.com/volkswagen/passat/11 volkswagen passat NaN
user_id
0 674857
1 461150
2 13
3 247233
4 1038865
5 11
Solution 1:[1]
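For car_make, you just have to split the URL on "/" and take the first path segment, then set it to NaN when it contains a digit (a make only has letters and '-'). This covers car_make only:
import numpy as np
# With n=4, pieces 0-2 are "https:", "" and the host; piece 3 is the first path segment.
split = df['user_url'].str.split('/', n=4, expand=True)
df['car_make'] = split[3]
# A segment containing any digit cannot be a make, so mark it as missing.
df.loc[df['car_make'].str.contains(r'\d', na=False), 'car_make'] = np.nan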
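To fill all four columns at once, one possible approach (not part of the original answer) is a single regex with optional named groups passed to `Series.str.extract`. The sketch below rebuilds the question's dataframe and assumes that every URL has a host followed by one to four path segments, that the model and user_id are always present, and that only car_make and year can be missing. The name `URL_PATTERN` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "mpg": [np.nan, 31.6, 28.5, 46.8, 21.0, 25.0],
    "miles": [np.nan, np.nan, np.nan, np.nan, 244.4, 254.4],
    "user_url": [
        "https://www.somewebsite.com/suzuki/swift/2015/674857",
        "https://www.somewebsite.com/bmw/x3/2009/461150",
        "https://www.somewebsite.com/mercedes-benz/e300/1998/13",
        "https://www.somewebsite.com/320d/2010/247233",
        "https://www.somewebsite.com/honda/pass/2019/1038865",
        "https://www.somewebsite.com/volkswagen/passat/11",
    ],
})

# Hypothetical pattern: car_make and year are optional, model and user_id
# are required (an assumption based on the sample rows).
URL_PATTERN = (
    r"^https?://[^/]+/"                  # scheme and host
    r"(?:(?P<car_make>[A-Za-z-]+)/)?"    # letters and '-' only, optional
    r"(?P<model>[^/]+)/"                 # any characters except '/'
    r"(?:(?P<year>\d{4})/)?"             # exactly four digits, optional
    r"(?P<user_id>\d+)/?$"               # digits of any length
)

# Named groups become column names; unmatched optional groups become NaN.
parts = df["user_url"].str.extract(URL_PATTERN)
result = df.join(parts)
print(result[["car_make", "model", "year", "user_id"]])
```

The extracted values are strings, so year and user_id would still need a numeric conversion if numbers are wanted. Some URLs stay ambiguous under these rules: a letters-only model with no make in front of it is read as the make, and a four-digit user_id with no year is read as the user_id only because it is the last segment.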
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | DataSciRookie |
