Adding new dataframe columns using information extracted from the URL in the url column, when the URL may be missing information

Given: A pandas dataframe that contains a user_url column among other columns.

Expectation: New columns are added to the original dataframe, each composed of information extracted from the URL in the user_url column. The new columns are car_make, model, year and user_id.

Some Extra info: We know that the car_make will only contain letters either with or without a '-'. The model can contain any characters. The year will only be 4 digits long. The user_id will consist of digits of any length.
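The constraints above can be written down as one regular expression per field. This is only a sketch: the names FIELD_PATTERNS and matches are hypothetical, and model gets no pattern because it can contain any characters.

```python
import re

# Hypothetical helper names; each pattern mirrors one rule stated above.
FIELD_PATTERNS = {
    "car_make": re.compile(r"[A-Za-z-]+"),  # letters, with or without '-'
    "year": re.compile(r"\d{4}"),           # exactly four digits
    "user_id": re.compile(r"\d+"),          # digits of any length
}

def matches(field, segment):
    """Return True if the whole URL segment satisfies the rule for `field`."""
    return FIELD_PATTERNS[field].fullmatch(segment) is not None
```

For example, matches("car_make", "320d") is False because the segment contains digits, which is what makes row 3 distinguishable from the others.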

I tried using a regex to split the URL, but it failed when there was missing or extra information. I also tried just splitting the data, but I had the same issue using split.

Given dataframe

    mpg  miles                                           user_url  
0   NaN    NaN    https://www.somewebsite.com/suzuki/swift/2015/674857 
1  31.6    NaN      https://www.somewebsite.com/bmw/x3/2009/461150  
2  28.5    NaN  https://www.somewebsite.com/mercedes-benz/e300/1998/13  
3  46.8    NaN            https://www.somewebsite.com/320d/2010/247233  
4  21.0  244.4     https://www.somewebsite.com/honda/pass/2019/1038865
5  25.0  254.4        https://www.somewebsite.com/volkswagen/passat/11

Expected Dataframe

    mpg  miles                                           user_url        car_make     model   year \
0   NaN    NaN   https://www.somewebsite.com/suzuki/swift/2015/674857   suzuki         swift  2015
1  31.6    NaN         https://www.somewebsite.com/bmw/x3/2009/461150   bmw               x3  2009
2  28.5    NaN  https://www.somewebsite.com/mercedes-benz/e300/1998/13  mercedes-benz   e300  1998
3  46.8    NaN           https://www.somewebsite.com/320d/2010/247233   NaN             320d  2010
4  21.0  244.4    https://www.somewebsite.com/honda/pass/2019/1038865   honda           pass  2019
5  25.0  254.4       https://www.somewebsite.com/volkswagen/passat/11   volkswagen    passat   NaN

   user_id  
0   674857
1   461150
2       13
3   247233
4  1038865
5       11



Solution 1:[1]

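You just have to do:

import numpy as np
# Split on "/" at most 4 times; piece 3 is the first path segment after the domain.
split = df['user_url'].str.split("/", n=4, expand=True)
df['car_make'] = split[3]
# A make never contains digits, so any first segment with a digit is not a make.
df.loc[df['car_make'].str.contains(r'\d', na=False), 'car_make'] = np.nan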
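The snippet above fills only car_make. One possible way to get all four columns in a single pass is a regular expression with optional groups, passed to str.extract. This is a sketch, not the answerer's method: the PATTERN name and the sample frame are illustrative, and it assumes only car_make and year can be absent. A URL whose make is missing but whose model is purely alphabetic would still be read as a make, since the rules cannot tell the two apart.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "mpg": [np.nan, 31.6, 28.5, 46.8, 21.0, 25.0],
    "miles": [np.nan, np.nan, np.nan, np.nan, 244.4, 254.4],
    "user_url": [
        "https://www.somewebsite.com/suzuki/swift/2015/674857",
        "https://www.somewebsite.com/bmw/x3/2009/461150",
        "https://www.somewebsite.com/mercedes-benz/e300/1998/13",
        "https://www.somewebsite.com/320d/2010/247233",
        "https://www.somewebsite.com/honda/pass/2019/1038865",
        "https://www.somewebsite.com/volkswagen/passat/11",
    ],
})

# Named groups become column names; optional groups that do not match yield NaN.
PATTERN = (
    r"^https?://[^/]+/"               # scheme and domain
    r"(?:(?P<car_make>[A-Za-z-]+)/)?"  # optional make: letters and '-'
    r"(?P<model>[^/]+)/"              # model: any characters except '/'
    r"(?:(?P<year>\d{4})/)?"          # optional year: exactly four digits
    r"(?P<user_id>\d+)/?$"            # user id: digits of any length
)

# Strip stray whitespace, extract the groups, and attach them to the original frame.
parts = df["user_url"].str.strip().str.extract(PATTERN)
df = df.join(parts)
```

On the six sample rows this leaves car_make as NaN for row 3 and year as NaN for row 5. The extracted values are strings; pd.to_numeric can convert year and user_id if numeric columns are needed.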

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DataSciRookie