How to remove a date written as a string in R?
I am trying to pull data from a website: https://transtats.bts.gov/PREZIP/
I am interested in downloading the datasets named Origin_and_Destination_Survey_DB1BMarket_1993_1.zip to Origin_and_Destination_Survey_DB1BMarket_2021_3.zip
To do this, I am trying to automate the downloads by putting the URL in a loop:
# dates of all files
year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>%
  mutate(year_quarter_comb = str_c(year, "_", quarter)) %>%
  pull(year_quarter_comb)

# download all files
for (year_quarter in year_quarter_comb) {
  get_BTS_data(str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_", year_quarter, ".zip"))
}
What I was wondering is how I can exclude 2021 quarter 4, since the data for it is not available yet. Also, is there a better way to automate the task? I was thinking of matching on "DB1BMarket", but R is case-sensitive and the file names for certain dates change to "DB1BMARKET".
I can use year_quarter_comb[-c(116)] to remove 2021_4 from the output.
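For illustration, here is a sketch of how one might drop 2021_4 by name rather than by position, and how to match the file names case-insensitively so that both "DB1BMarket" and "DB1BMARKET" are caught (the second example file name is made up):

# drop the quarter that has not been published yet, by name rather than by index
year_quarter_comb <- year_quarter_comb[year_quarter_comb != "2021_4"]

# case-insensitive matching catches both spellings
files <- c("Origin_and_Destination_Survey_DB1BMarket_1993_1.zip",
           "Origin_and_Destination_Survey_DB1BMARKET_2020_4.zip")
grepl("DB1BMarket", files, ignore.case = TRUE)
#> [1] TRUE TRUE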

EDIT: I was actually trying to download the files into a specific folder with this code:
path_to_local <- "whatever location" # this is the folder where the raw data is stored.

# download data from BTS
get_BTS_data <- function(BTS_url) {
  # INPUT: URL for the zip file with the data
  # OUTPUT: NULL (this just downloads the data)

  # store the download in the path_to_local folder
  # down_file <- str_glue(path_to_local, "QCEW_Hawaii_", BLS_url %>% str_sub(34) %>% str_replace_all("/", "_"))
  down_file <- str_glue(path_to_local, fs::path_file(BTS_url))

  # download data to folder
  QCEW_files <- BTS_url %>%
    # download file
    curl::curl_download(down_file)
}
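For example, a single call would look like the sketch below. Note that str_glue() simply pastes its arguments together, so path_to_local needs to end with a path separator for this to produce a valid file path:

# assumes path_to_local ends with "/" (or "\\" on Windows)
get_BTS_data("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_1993_1.zip")
# writes Origin_and_Destination_Survey_DB1BMarket_1993_1.zip into path_to_local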
EDIT2:
I edited the code a little from the answer below and it runs:
library(tidyverse)
library(rvest)
library(curl)

url <- "http://transtats.bts.gov/PREZIP"
content <- read_html(url)

file_paths <- content %>%
  html_nodes("a") %>%
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"
origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file, overwrite = FALSE, exdir = "airfare data")
})
It takes a while to download these datasets as the files are quite large. It downloaded files up to 2007_2, but then I got an error because the curl connection dropped out.
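One way to make the loop more robust against dropped connections is to retry each download a few times before giving up. This is only a sketch, reusing the handle h and origin_destination_urls from the code above:

for (x in origin_destination_urls) {
  tmp_file <- tempfile()
  ok <- FALSE
  # retry each download up to 3 times before giving up
  for (attempt in 1:3) {
    ok <- tryCatch({
      curl_download(x, tmp_file, handle = h)
      TRUE
    }, error = function(e) FALSE)
    if (ok) break
  }
  if (ok) unzip(tmp_file, overwrite = FALSE, exdir = "airfare data")
}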
Solution 1:[1]
Instead of trying to generate the URLs yourself, you could scrape the file paths from the website. This avoids generating URLs for files that do not exist.
Below is a short script that downloads all of the zip files you are looking for and unzips them into your working directory.
The hardest part for me was that the server seems to have a misconfigured SSL certificate. I was able to find help here on SO for turning off SSL certificate verification for read_html() and curl_download(). Those solutions are integrated in the script below.
library(tidyverse)
library(rvest)
library(curl)

url <- "http://transtats.bts.gov/PREZIP"

# read the page, skipping SSL certificate verification
content <-
  httr::GET(url, config = httr::config(ssl_verifypeer = FALSE)) |>
  read_html()

# collect all link targets on the page
file_paths <-
  content |>
  html_nodes("a") |>
  html_attr("href")

# keep only the DB1BMarket / DB1BMARKET zip files
origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"

origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

# curl handle with SSL verification turned off
h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

# download each zip file and unzip it into the working directory
lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file)
})
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Till |
