Regex to find date format

I have a dataframe with column Date. It contains values like 2022-01-03.

I want to create a new df with only those rows that fulfil the conditions:

  • starting with 20
  • hyphens separating the four-digit year, the two-digit month, and the two-digit day

The following doesn't work:

df2 = my_df.where('Date like "20\d\d-\d\d-\d\d"')

Any ideas?



Solution 1:[1]

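LIKE only understands the % and _ wildcards, not regular expressions, so \d is never treated as a digit class. You can use rlike() instead. The leading ^ enforces "starting with 20"; append $ if nothing may follow the day:

from pyspark.sql.functions import col
df2 = my_df.filter(col("Date").rlike("^20[0-9]{2}-[0-9]{2}-[0-9]{2}"))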
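
Since rlike() reports a match when the pattern is found anywhere in the value, it can help to check how anchoring changes the result. Below is a minimal sketch in plain Python (no Spark needed) that uses re.search as a stand-in for rlike(). The sample values and the helper name matching are made up for illustration, and Spark uses Java regular expressions, so treat this only as an approximation for simple patterns like this one.

```python
import re

# Pattern from the answer, without anchors: may match in the middle of a value.
UNANCHORED = re.compile(r"20[0-9]{2}-[0-9]{2}-[0-9]{2}")
# Anchored variant: the whole value must be 20YY-MM-DD.
ANCHORED = re.compile(r"^20[0-9]{2}-[0-9]{2}-[0-9]{2}$")

# Hypothetical sample values, not taken from the question's data.
samples = ["2022-01-03", "1999-12-31", "x2022-01-03", "2022-1-3", "2022-01-03 10:00"]

def matching(pattern, values):
    """Return the values in which the pattern is found (search, not full match)."""
    return [v for v in values if pattern.search(v)]

unanchored_hits = matching(UNANCHORED, samples)
anchored_hits = matching(ANCHORED, samples)

print(unanchored_hits)
print(anchored_hits)
```

The unanchored pattern also accepts values with extra text before or after the date, while the anchored one keeps only exact 20YY-MM-DD strings. Neither pattern checks that the month and day are in a valid range.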

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Tim Biegeleisen