R Remove url without http or www

I am trying to remove urls that may or may not start with www in a large corpus with R.

For example, I would like to remove

ftse.com

My idea was to remove anything that starts with a space and ends with .com, using gsub("\\s.*\\.com"," ",text)

But this removes everything from the first space in the text up to .com, not just the URL. For instance:

gsub("\\s(www.)?.*\\.com"," ","this famous url ftse.com is appreciated")

[1] "this  is appreciated"

Instead of "this famous url is appreciated"

Any idea?



Solution 1:[1]

How about this: "\\s(www\\.)?(\\w|\\-)+?\\.com"

This will also delete URLs containing dashes "-":

gsub(
  "\\s(www\\.)?(\\w|\\-)+?\\.com",
  "",
  c(
    "this famous url ftse.com www.test.com is appreciated",
    "this famous url www.test.com is appreciated",
    "this famous url www.test-test.com is appreciated"
  )
)
#> [1] "this famous url is appreciated" "this famous url is appreciated"
#> [3] "this famous url is appreciated"

Created on 2022-03-30 by the reprex package (v2.0.1)

Explanation

  • \s matches a single whitespace character
  • (www\.)? matches www. zero or one time
  • (\w|\-)+? matches word characters (letters, digits, underscore) or dashes one or more times, but as few times as possible
  • \.com matches a literal .com
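
To see the difference between the two patterns, it can help to look at what each one actually matches instead of what is left after gsub. This is only a small sketch using base R's regexpr and regmatches; the variable names (text, greedy, narrow) are mine, not from the question.

```r
text <- "this famous url ftse.com is appreciated"

# Pattern from the question: `.*` also matches spaces, so the match
# starts at the first whitespace and runs up to ".com".
greedy <- "\\s(www.)?.*\\.com"

# Pattern from this answer: (\w|-) cannot match a space or a dot,
# so the match cannot extend across several words.
narrow <- "\\s(www\\.)?(\\w|\\-)+?\\.com"

# regexpr() finds the first match, regmatches() extracts its text
greedy_match <- regmatches(text, regexpr(greedy, text))
narrow_match <- regmatches(text, regexpr(narrow, text))

print(greedy_match)
#> [1] " famous url ftse.com"
print(narrow_match)
#> [1] " ftse.com"
```

Note that both patterns require a whitespace character before the URL, so a URL at the very start of a string would not be matched by either.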

Solution 2:[2]

I think you need a better regex for matching the URL part. Here is an example:

urlpatt <- paste0(
   "\\s(https?:\\/\\/)?",                        # match protocol
   "[-a-zA-Z0-9@:%._\\+~#=]+",                   # match (sub-) domain
   "\\.[a-zA-Z0-9()]+",                          # match top-level domain
   "\\b([-a-zA-Z0-9()!@:%_\\+.~#?&\\/\\/=]*)"    # match port & subdirectory
)

urlstr = "this famous url ftse.com is appreciated"

gsub(urlpatt, "", urlstr)

I tried to break it down by component for you, but you will most probably need to improve it. I hope it is good enough to give you an idea, though.
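
As a rough check of where this pattern works and where it needs improving, here is a sketch that runs it over a few made-up strings. The inputs (including www.example.org and the "version 2.5" sentence) and the names samples and cleaned are assumptions for illustration only, and the pattern is repeated so the snippet runs on its own.

```r
# Same pattern as above, rebuilt so this snippet is self-contained
urlpatt <- paste0(
  "\\s(https?:\\/\\/)?",                        # optional protocol
  "[-a-zA-Z0-9@:%._\\+~#=]+",                   # (sub-) domain
  "\\.[a-zA-Z0-9()]+",                          # top-level domain
  "\\b([-a-zA-Z0-9()!@:%_\\+.~#?&\\/\\/=]*)"    # port & subdirectory
)

# Made-up test inputs (not from the question's corpus)
samples <- c(
  "this famous url ftse.com is appreciated",
  "see https://www.example.org/path?x=1 for details",
  "no links here",
  "version 2.5 released"   # not a URL, but it contains a dotted token
)

cleaned <- gsub(urlpatt, "", samples)
print(cleaned)
```

With R's default regex engine, the first two strings should lose their URLs and the third should stay unchanged. The last one should also lose " 2.5", because the pattern accepts any token of the form something.something after a space. Restricting the top-level domain part to letters, or to a list of known TLDs, would be one way to tighten it.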

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jpiversen
Solution 2 user51187286016