Gather multiple data sets from a URL/FTP site and merge them into a single data frame for wrangling

Okay R community. I have a myriad of code pieces going on here from data.table, dplyr, base, etc. My goal is to download a group of files from NOAA into a single data frame for wrangling. Currently my code is ugly, to say the least, and of course not working. I should have all of the 1950 data set, then right below it the 1951 data, and so on.


library(data.table)
library(XML)
library(RCurl)
library(tidyverse)


#hard-code website addresses
noaa.url <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
noaa.ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

#set fixed name of files for download
details.str <- "StormEvents_details-ftp_*"
fatalities.str <- "StormEvents_fatalities-ftp_"
locations.str <- "StormEvents_locations-ftp_"

#test function to download file using manual operation
index.storm <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz"
storm.1950 <- fread(index.storm )
storm.1951 <- fread("https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1951_c20210803.csv.gz")

#test append
storm.append <- rbind(storm.1950, storm.1951)

#create a list of colnames
detail.colnames <- colnames(storm.1950)

#-------------------------------Begin Website Scrape-------------------------------------

#create a directory listing from the NOAA site. Must use the FTP directory; the http site returns a 404 error
dir_list <-
  read.table(
    textConnection(
      getURLContent(noaa.ftp)
    ),
    sep = "",
    strip.white = TRUE)

#subset the data we want
dir_list <- dir_list %>% 
  select("V9","V5")

#fix column names
colnames(dir_list) <- c("FileName", "FileSize")
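For context, the FTP server returns Unix `ls -l` style listing lines, which is why columns V5 (size) and V9 (file name) are the ones picked out above. A sketch on a single fabricated listing line (permissions, owner, and size below are made up):

```r
# A fabricated ls-l style line such as an FTP server returns;
# read.table() with sep = "" splits it on whitespace into V1..V9.
line <- "-rw-r--r--  1 ftp ftp  123456 Aug  3  2021 StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz"
one_row <- read.table(textConnection(line), sep = "", strip.white = TRUE)
one_row$V9  # file name
one_row$V5  # file size in bytes
```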

#build the list of full URLs to loop over. This keeps just the storm details files we want
details.dir <- dir_list %>% 
  select(1) %>% 
  filter(str_detect(FileName,"details")) %>% 
  mutate(FileName = paste0(noaa.url,FileName))

#how many rows to get. could use this in counter for loop if needed
total.count <- count(details.dir)
total.count

#subset just first 5 rows
details.dirsub <- head(details.dir,5)
details.dirsub

#very basic loop over the list. Note: files get larger as the years go on.
for (x in details.dirsub) {
  something = details.dirsub$FileName
  #print(something)
  storm.append = lapply(something, fread) #lapply is creating a join not an append
  #storm.append = rbindlist(fread(something)) #does not work
  return(storm.append)
}

#expand the list into a dataframe for wrangling 
storm.full <- as.data.frame(do.call(cbind, storm.append))
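As the comments above note, lapply() plus cbind() produces a side-by-side join rather than a stacked append; the difference can be reproduced on toy tables (the data below are made up for illustration):

```r
library(data.table)

# Two small tables standing in for two years of storm data.
a <- data.table(STATE = c("GEORGIA", "UTAH"),    YEAR = 1950L)
b <- data.table(STATE = c("GEORGIA", "FLORIDA"), YEAR = 1951L)
parts <- list(a, b)

stacked <- rbindlist(parts)       # appends: 4 rows, 2 columns
joined  <- do.call(cbind, parts)  # joins:   2 rows, 4 columns
```

rbindlist(lapply(urls, fread)) is therefore the usual data.table idiom for stacking many files into one table.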

# try to set colnames if use sapply instead of lapply
#colnames(storm.full)
#setnames(storm.full, detail.colnames)

#filter by GEORGIA -- cannot filter because lapply is creating joins instead of appends. tried rbindlist() but it errors.
storm.georgia <- storm.full %>% 
  filter(STATE == "GEORGIA")




Solution 1:[1]

If I understand correctly, the OP wants

  • to read all data files
  • whose file names include the string "details"
  • from a certain FTP directory,
  • combine them into one large data.frame
  • but keep only rows which are related to GEORGIA.

This is what I would do using my favourite tools:

library(data.table)
library(RCurl)
library(magrittr)

noaa_ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
state_to_select <- "GEORGIA"

storm.georgia <- 
  getURL(noaa_ftp, dirlistonly = TRUE) %>% 
  fread(header = FALSE) %>% 
  .[V1 %like% "details.*csv.gz", V1] %>% 
  lapply(function(x) {
    url <- file.path(noaa_ftp, x)
    cat(url, "\n")
    fread(url)[STATE == state_to_select]
  }) %>% 
  rbindlist()
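The `%like%` filter in this pipeline keeps only the details files; its behaviour can be checked offline on a made-up directory listing (only the first file name below is a real one, taken from the question):

```r
library(data.table)

# A stand-in for the parsed FTP directory listing.
listing <- data.table(V1 = c(
  "StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz",
  "StormEvents_fatalities-ftp_v1.0_d1950_c20210803.csv.gz",
  "StormEvents_locations-ftp_v1.0_d1950_c20210803.csv.gz"
))
details <- listing[V1 %like% "details.*csv.gz", V1]
details  # only the details file survives the filter
```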

This filters each file directly after reading in order to reduce memory allocation. The result consists of one data.table with nearly 50k rows and 51 columns:

tables()
            NAME   NROW NCOL MB                                                                    COLS KEY
1: storm.georgia 48,257   51 28 BEGIN_YEARMONTH,BEGIN_DAY,BEGIN_TIME,END_YEARMONTH,END_DAY,END_TIME,...

Total: 28MB

Please note that there are inconsistencies in the data files, as can be seen from the last lines of the output:

ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz'
Content type 'unknown' length 385707 bytes (376 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz'
Content type 'unknown' length 255646 bytes (249 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz'
Content type 'unknown' length 298130 bytes (291 KB)
Warning messages:
1: In fread(url) :
Stopped early on line 33957. Expected 51 fields but found 65. Consider fill=TRUE and comment.char=. First discarded non-empty line:
<<199712,7,400,199712,8,1800,2071900,5623952,"UTAH",49,1997,"December","Winter
Storm","Z",1,"NW CACHE","SLC","07-DEC-97 04:00:00","MST","08-DEC-97
18:00:00","20","0","0","0","200K",,,,,,,,,,,,,,,,,,,,,,,,"A strong
trough moved through northern Utah on the 7th.  The cold moist airmass
remained unstable into the 8th in a strong northwest flow. 
Lake-enhanced snowbands developed along the Wasatch Front on the 8th
as well.  Criteria snow fell across much of the state and locally
strong winds occurred >>
2: In fread(url) :
Found and resolved improper quoting out-of-sample. First healed line 23934:
<<199703,14,1543,199703,14,1543,2059347,5589290,"FLORIDA",12,1997,"March","Waterspout","C",87,"MONROE","MFL","14-MAR-97
15:43:00","EST","14-MAR-97
15:43:00","0","0","0","0",,,,,,,,,,,,,,,,,"S OF "7 MILE" BRIDGE",,,"S
OF "7 MILE" BRIDGE",,,,,"TWO WATERSPOUTS OBSERVED BY SKYWARN
SPOTTER...LOST IN RAIN SHAFT.",,"PDC">>. If the fields are not quoted
(e.g. field separator does not appear within any field), try quote=""
to avoid this warning.
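Both warnings point at fread arguments that can work around such malformed rows. A minimal offline sketch of fill = TRUE (the inline CSV is fabricated):

```r
library(data.table)

# A short row would normally make fread stop early;
# fill = TRUE pads the missing fields with NA instead.
# Similarly, quote = "" disables quote handling for fields
# containing stray double quotes.
txt <- "a,b,c\n1,2,3\n4,5\n"
dt <- fread(txt, fill = TRUE)  # second row becomes 4, 5, NA
```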

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Uwe