How to extract slug from URL with regular expression in Python?

I'm struggling with Python's re. I don't know how to solve the following problem in a clean way.

I want to extract a part of a URL.

What I tried so far:

import re

url = 'http://www.example.com/this-2-me-4/123456-subj'
m = re.search('/[0-9]+-', url)
m = m.group(0).rstrip('-')
m = m.lstrip('/')

This leaves me with the desired output 123456, but I feel this is not the proper way to extract the slug.

How can I solve this quicker and cleaner?



Solution 1:[1]

Use a capturing group: put parentheses (...) around the part of the regex that you want to capture. You can then retrieve the contents of a group by passing its number as an argument to m.group():

>>> m = re.search('/([0-9]+)-', url)
>>> m.group(1)
'123456'

From the docs:

(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
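
One caveat: re.search returns None when nothing matches, so calling m.group(1) directly raises AttributeError for a URL without a numeric part. A minimal sketch of a guarded version follows; the function name extract_id and the group name id are illustrative choices, not part of the original answer.

```python
import re

# Compiled once; (?P<id>...) is a named capturing group (the name "id" is illustrative).
ID_PATTERN = re.compile(r'/(?P<id>[0-9]+)-')

def extract_id(url):
    """Return the digits found between '/' and '-' as a string, or None if absent."""
    m = ID_PATTERN.search(url)
    return m.group('id') if m else None

print(extract_id('http://www.example.com/this-2-me-4/123456-subj'))  # 123456
print(extract_id('http://www.example.com/no-number-here'))           # None
```

A named group reads the same as m.group(1) here; it mainly helps once a pattern has several groups.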

Solution 2:[2]

You may want to use urllib.parse combined with a capturing group for mildly cleaner code.

import urllib.parse, re

url = 'http://www.example.com/this-2-me-4/123456-subj'
parsed = urllib.parse.urlparse(url)
path = parsed.path
slug = re.search(r'/([\d]+)-', path).group(1)
print(slug)

Result:

123456

In Python 2, use urlparse instead of urllib.parse.
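
Once the path is isolated, the regex can also be dropped entirely. The following is a rough sketch of that variation, assuming the number always leads the last path segment; the helper name leading_id is hypothetical.

```python
from urllib.parse import urlparse

def leading_id(url):
    """Return the digits that start the last path segment, or None."""
    path = urlparse(url).path                            # query string is excluded
    last_segment = path.rstrip('/').rsplit('/', 1)[-1]   # e.g. '123456-subj'
    head, sep, _rest = last_segment.partition('-')
    # Require both a hyphen and an all-digit prefix.
    return head if sep and head.isdigit() else None

print(leading_id('http://www.example.com/this-2-me-4/123456-subj'))  # 123456
print(leading_id('http://www.example.com/page?next=/77-foo'))        # None
```

The second call shows what parsing first buys you: a regex run over the whole URL would pick up 77 from the query string, while the path-only version ignores it.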

Solution 3:[3]

If you want to find all the slugs in a URL, you can use this code.

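from slugify import slugify  # third-party; e.g. the python-slugify package

url = "https://www.allrecipes.com/recipe/79300/real-poutine?search=random/some-name/"

for part in url.split("/"):
    part = part.split("?")[0]  # drop anything after a "?"
    # Keep parts that contain a hyphen and are already in slug form.
    if "-" in part and slugify(part) == part:
        print(part)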

This prints:

real-poutine
some-name
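
If you would rather avoid the third-party dependency, a rough standard-library sketch is shown below. It assumes a slug is made of lowercase letters and digits joined by hyphens, which is narrower than what slugify accepts, and it only inspects the URL path, so some-name (which sits inside the query string) is not reported. The names SLUG_RE and path_slugs are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed slug shape: lowercase alphanumeric runs joined by at least one hyphen.
SLUG_RE = re.compile(r'[a-z0-9]+(?:-[a-z0-9]+)+')

def path_slugs(url):
    """Return the path segments of url that look like slugs."""
    segments = urlparse(url).path.split('/')
    return [s for s in segments if SLUG_RE.fullmatch(s)]

url = "https://www.allrecipes.com/recipe/79300/real-poutine?search=random/some-name/"
print(path_slugs(url))  # ['real-poutine']
```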

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 senshin
Solution 3