Removing HTTP and WWW from a URL in Python
url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtube.com/watch?v=6RB89BOxaYY'
How to strip http(s) and www from url in Python?
Solution 1:[1]
You can use the string method replace:
url = 'http://www.google.com/images'
url = url.replace("http://www.","")
or you can use regular expressions:
import re
pattern = re.compile(r"https?://(www\.)?")
url = pattern.sub('', 'http://www.google.com/images').strip().strip('/')
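One caveat with both snippets above: the `replace` call only handles the exact prefix `http://www.`, and the regex is not anchored, so it can also match later in the string. A minimal sketch of an anchored variant, run over the sample URLs from the question, might look like this (the name `strip_prefix` is made up for illustration and is not part of the original answer):

```python
import re

# Anchored at the start of the string; the scheme and "www." are each optional.
_PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)

def strip_prefix(url):
    """Remove a leading http(s):// and www. from url (hypothetical helper)."""
    return _PREFIX.sub("", url, count=1)

samples = [
    "www.google.com",
    "http://www.google.com",
    "http://google.com",
    "www.google",
    "http://www.google.com/images",
    "https://www.youtube.com/watch?v=6RB89BOxaYY",
]
for u in samples:
    print(strip_prefix(u))
```

Because the pattern is anchored with `^`, a `www.` or `http://` that appears later in the path or query string is left alone.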
Solution 2:[2]
A more elegant solution would be using urlparse:
from urllib.parse import urlparse
def get_hostname(url, uri_type='both'):
    """Get the host name from the url"""
    parsed_uri = urlparse(url)
    if uri_type == 'both':
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    elif uri_type == 'netloc_only':
        return '{uri.netloc}'.format(uri=parsed_uri)
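With uri_type='both' the result keeps the scheme (http or https, depending on the link); with uri_type='netloc_only' it returns only the netloc, which is the host part you were looking for. Note that netloc still includes a leading www. if the URL has one.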
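Two details are worth keeping in mind with `urlparse`: it only fills `netloc` when the URL has a scheme or starts with `//`, so an input like `www.google.com` ends up in `path` instead, and it does not remove `www.` by itself. A rough sketch that works around both, assuming only the host is wanted (the helper name `bare_host` is hypothetical):

```python
from urllib.parse import urlparse

def bare_host(url):
    """Return the host without a leading 'www.' (hypothetical helper)."""
    # Prepend "//" to scheme-less input so urlparse treats it as a network location.
    if not url.startswith(("http://", "https://", "//")):
        url = "//" + url
    host = urlparse(url).netloc
    # Drop the 4-character "www." prefix if present.
    return host[4:] if host.startswith("www.") else host

print(bare_host("http://www.google.com/images"))
print(bare_host("www.google.com"))
```

This drops the path and query string as well; if those should be kept, a prefix-based approach is a better fit than parsing.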
Solution 3:[3]
Could use regex, depending on how strict your data is. Are http and www always going to be there? Have you thought about https or w3 sites?
import re
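new_url = re.sub(r'.*?w\.', '', url, count=1)
The lazy .*? together with count=1 removes only the text up to the first "w.", so a domain that itself ends in w is left intact as long as the www. prefix is present.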
Edit after clarification:
I'd do two steps:
if url.startswith('http'):
    url = re.sub(r'https?://', '', url)
if url.startswith('www.'):
    url = re.sub(r'www\.', '', url, count=1)
Solution 4:[4]
This removes http:// or https:// if present, and then www. if present:
url=url.replace('http://','')
url=url.replace('https://','')
url=url.replace('www.','')
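Note that `str.replace` removes every occurrence, not just a leading one, so a `www.` inside the path or query string is removed too. A small sketch of a prefix-only alternative, assuming Python 3.9+ for `str.removeprefix` (the function name `strip_leading` is made up for illustration):

```python
def strip_leading(url):
    """Strip a leading scheme and 'www.' only (hypothetical helper, Python 3.9+)."""
    for scheme in ("https://", "http://"):
        url = url.removeprefix(scheme)
    return url.removeprefix("www.")

url = "http://www.google.com/images?ref=www.example.com"
# replace() also alters the "www." inside the query string
print(url.replace("http://", "").replace("www.", ""))
# removeprefix() only touches the start of the string
print(strip_leading(url))
```

On older Python versions, the same effect can be had with `startswith` checks and slicing.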
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tomerikoo |
| Solution 2 | JohnAndrews |
| Solution 3 | |
| Solution 4 | Limbail |
