How to Make this Regex Greedy?
I'm trying to extract the subdomain and domain from any URL, without the scheme (http/https), the www prefix, or anything that follows the host name.
I have the following inputs and the output I expect for each:
p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com
I'm using the following regex to extract domain + subdomain:
[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?
The issue is that it splits several domains into two matches, such as d.amazon.ca -> d.ama + zon.ca, and it matches some non-domain text, such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions.
How can I force the regex to be greedy in the sense that it matches the full domain as a single match?
I'm using Java.
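The split described above can be reproduced with a short program. This is only a sketch: the class name SplitDemo and the helper allMatches are made up for illustration, and the pattern is the one from the question, escaped for a Java string literal. One plausible reading of the result is that the quantifiers are already greedy, but each letter group after a dot is capped at {1,3} and nothing forces a match to end at a label boundary, so the first match stops after "ama" and a second match starts at "zon".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitDemo {
    // The pattern from the question, unchanged apart from Java string escaping.
    static final Pattern ORIGINAL = Pattern.compile(
        "[^w.\\:\\/]+[a-zA-Z\\.]?\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?(\\.[a-zA-Z]{0,3})?"
        + "|[w]{1,2}[^w.]+\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?");

    // Hypothetical helper: collects every successive match in the input.
    static List<String> allMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Two matches instead of one: the {1,3} cap ends the first match early.
        System.out.println(allMatches("d.amazon.ca"));
    }
}
```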
Solution 1:[1]
I'd use the standard URI class instead of a regular expression to parse out the domain:
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class Demo {
    private static Optional<String> getHostname(String domain) {
        try {
            // Add a scheme if missing
            if (domain.indexOf("://") == -1) {
                domain = "https://" + domain;
            }
            URI uri = new URI(domain);
            return Optional.ofNullable(uri.getHost()).map(s -> s.startsWith("www.") ? s.substring(4) : s);
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] domains = new String[] {
            "p.io",
            "amazon.com",
            "d.amazon.ca",
            "domain.amazon.co.uk",
            "https://regex101.com/",
            "www.regex101.comdddd", // .comdddd is (potentially) a valid TLD; not sure why your output removes the d's
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions",
            "smile.amazon.com"
        };
        for (String domain : domains) {
            System.out.println(getHostname(domain).orElse("hostname not found"));
        }
    }
}
This outputs:
p.io
amazon.com
d.amazon.ca
domain.amazon.co.uk
regex101.com
regex101.comdddd
wix.com.co
stackoverflow.com
smile.amazon.com
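If a regular expression is still preferred, one possible approach is to anchor the pattern at the start of the input, skip an optional scheme and an optional "www." prefix, and capture a run of dot-separated labels as a single group. The following is a minimal sketch under those assumptions, not a full URL parser: the names RegexHostDemo, HOST, and extractHost are hypothetical, labels are assumed to contain only ASCII letters, digits, and hyphens, and ports, user info, IPv6 literals, and internationalized names are not handled. Like the URI-based code, it keeps regex101.comdddd as is, because a pattern alone cannot tell which TLDs exist.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexHostDemo {
    // ^                             anchor at the start so only one match is possible
    // (?:scheme://)?                optional scheme such as http:// or https://
    // (?:www\.)?                    optional www prefix, not captured
    // (label(?:\.label)+)           group 1: two or more dot-separated labels
    static final Pattern HOST = Pattern.compile(
        "^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?(?:www\\.)?([a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)+)");

    // Hypothetical helper: returns the captured host, or empty if nothing matches.
    static Optional<String> extractHost(String input) {
        Matcher m = HOST.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "d.amazon.ca",
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions"
        };
        for (String input : inputs) {
            System.out.println(extractHost(input).orElse("hostname not found"));
        }
    }
}
```

Because the pattern is anchored with ^ and the label group repeats without an upper bound, each input yields at most one match, and the path after the first "/" is never considered.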
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Shawn |

