How to Make this Regex Greedy?

I'm trying to extract the domain + subdomain from any URL (without the full URL suffix or http and www prefix).

I have the following list of inputs, each with the output I expect:

p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com

I'm using the following regex to extract domain + subdomain:

[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?

The issue is that it splits several domains in two (for example, d.amazon.ca -> d.ama + zon.ca) and also matches some non-domain text, such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions.

How can I force the regex to be greedy in the sense that it matches the full domain as a single match?

I'm using Java.



Solution 1:[1]

I'd use the standard URI class instead of a regular expression to parse out the domain:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class Demo {
    private static Optional<String> getHostname(String domain) {
        try {
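            // Add a scheme if missing so that URI parses the input as a hierarchical URL
            if (!domain.contains("://")) {
                domain = "https://" + domain;
            }
            URI uri = new URI(domain);
            // getHost() returns null when no host can be parsed; strip a leading "www."
            return Optional.ofNullable(uri.getHost())
                           .map(s -> s.startsWith("www.") ? s.substring(4) : s);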
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] domains = new String[] {
            "p.io",
            "amazon.com",
            "d.amazon.ca",
            "domain.amazon.co.uk",
            "https://regex101.com/",
            "www.regex101.comdddd", // .comdddd is (potentially) a valid TLD; not sure why your output removes the d's
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions",
            "smile.amazon.com"
        };
        for (String domain : domains) {
            System.out.println(getHostname(domain).orElse("hostname not found"));
        }
    }
}

This outputs:

p.io
amazon.com
d.amazon.ca
domain.amazon.co.uk
regex101.com
regex101.comdddd
wix.com.co
stackoverflow.com
smile.amazon.com
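
If a regular expression is still preferred, note that greediness is probably not the real problem: the quantifiers in the original pattern are already greedy. The split in d.amazon.ca comes from the {1,3} bound on the label after the first dot (it stops after "ama") combined with an unanchored pattern that is free to start a second match at "zon.ca". Below is a minimal sketch of an anchored alternative. The class name RegexDemo, the method extractHost and the pattern itself are my own illustration, not something from the question. It assumes each input string holds exactly one URL, and like the URI version it keeps regex101.comdddd as is rather than trimming it to regex101.com.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {
    // Anchored at the start of the input: an optional scheme ("https://"),
    // an optional "www." prefix, then one capture group that greedily takes
    // everything up to the first '/', ':', '?', '#' or whitespace character.
    private static final Pattern HOST = Pattern.compile(
        "^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?(?:www\\.)?([^/:?#\\s]+)");

    // Hypothetical helper: returns the captured host, or empty if nothing matched.
    static Optional<String> extractHost(String input) {
        Matcher m = HOST.matcher(input.trim());
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "p.io",
            "d.amazon.ca",
            "https://regex101.com/",
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions"
        };
        for (String input : inputs) {
            System.out.println(extractHost(input).orElse("hostname not found"));
        }
    }
}
```

Because the pattern is anchored with ^ and captures the host in a single group, each input yields at most one match, so a domain cannot be split in two and path text is never matched. It does not validate the host or handle user-info (user@host), so the URI approach remains the more robust choice.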

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Shawn