Gather multiple data sets from a URL/FTP site and merge them into a single dataframe for wrangling
Okay R community. I have a myriad of code pieces going on here from data.table, dplyr, base, etc. My goal is to download a group of files from NOAA into a single data frame for wrangling. Currently my code is ugly, to say the least, and of course not working. I should have all of the 1950 data set, then right below it the 1951 data, and so on.
library(data.table)
library(XML)
library(RCurl)
library(tidyverse)
#hard-code website addresses
noaa.url <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
noaa.ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
#set fixed name of files for download
details.str <- "StormEvents_details-ftp_*"
fatalities.str <- "StormEvents_fatalities-ftp_"
locations.str <- "StormEvents_locations-ftp_"
#test function to download file using manual operation
index.storm <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz"
storm.1950 <- fread(index.storm )
storm.1951 <- fread("https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1951_c20210803.csv.gz")
#test append
storm.append <- rbind(storm.1950, storm.1951)
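As a side note on appending: `rbind()` works here because both yearly tables have identical columns, but `data.table::rbindlist()` appends a whole list of tables in one call and can pad mismatched columns with `fill = TRUE`. A minimal sketch with made-up toy tables (not the real NOAA columns):

```r
library(data.table)

# Two toy "yearly" tables; the second has an extra column, as can happen
# when the file layout changes between years.
d1950 <- data.table(STATE = c("GEORGIA", "UTAH"), DEATHS = c(0L, 1L))
d1951 <- data.table(STATE = "FLORIDA", DEATHS = 2L, NOTES = "waterspout")

# rbindlist() stacks the list row-wise; fill = TRUE inserts NA for
# columns missing from some tables instead of erroring.
combined <- rbindlist(list(d1950, d1951), fill = TRUE)
```
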
#create a list of colnames
detail.colnames <- colnames(storm.1950)
#-------------------------------Begin Website Scrape-------------------------------------
#create a directory listing from the NOAA site. Must use the FTP directory; the HTTP site returns a 404 error
dir_list <-
read.table(
textConnection(
getURLContent(noaa.ftp)
),
sep = "",
strip.white = TRUE)
#subset the data we want
dir_list <- dir_list %>%
select("V9","V5")
#fix column names
colnames(dir_list) <- c("FileName", "FileSize")
#create new table for loop through list with complete website directory. This will get just the storm details we want
details.dir <- dir_list %>%
select(1) %>%
filter(str_detect(FileName,"details")) %>%
mutate(FileName = paste0(noaa.url,FileName))
#how many rows to get. could use this in counter for loop if needed
total.count <- count(details.dir)
total.count
#subset just first 5 rows
details.dirsub <- head(details.dir,5)
details.dirsub
#very basic loop and apply a list. Note: files get larger as years go on.
for (x in details.dirsub) {
something = details.dirsub$FileName
#print(something)
storm.append = lapply(something, fread) #lapply is creating a join not an append
#storm.append = rbindlist(fread(something)) #does not work
return(storm.append)
}
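For reference, the loop above can be collapsed into a single expression: `lapply()` returns a list of data.tables (one per file), not a join, and `rbindlist()` stacks that list row-wise, which is the append being attempted. A sketch, assuming `details.dirsub` from above:

```r
library(data.table)

# lapply() reads each file into one list element; rbindlist() then
# appends the elements row-wise. fill = TRUE pads columns that only
# appear in some years' files.
storm.append <- rbindlist(
  lapply(details.dirsub$FileName, fread),
  fill = TRUE
)
```
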
#expand the list into a dataframe for wrangling
storm.full <- as.data.frame(do.call(cbind, storm.append))
# try to set colnames if use sapply instead of lapply
#colnames(storm.full)
#setnames(storm.full, detail.colnames)
#filter by GEORGIA -- can not filter because lapply is creating joins instead of append. tried rbindlist() but errors.
storm.georgia <- storm.full %>%
filter(STATE == "GEORGIA")
Solution 1:[1]
If I understand correctly, the OP wants
- to read all data files
- whose file names include the string "details"
- from a certain FTP directory,
- combine them into one large data.frame
- but keep only rows which are related to GEORGIA.
This is what I would do using my favourite tools:
library(data.table)
library(RCurl)
library(magrittr)
noaa_ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
state_to_select <- "GEORGIA"
storm.georgia <-
getURL(noaa_ftp, dirlistonly = TRUE) %>%
fread(header = FALSE) %>%
.[V1 %like% "details.*csv.gz", V1] %>%
lapply(function(x) {
url <- file.path(noaa_ftp, x)
cat(url, "\n")
fread(url)[STATE == state_to_select]
}) %>%
rbindlist()
This filters each file directly after reading in order to reduce memory allocation. The result consists of one data.table with nearly 50k rows and 51 columns:
tables()
            NAME   NROW NCOL MB COLS                                                                     KEY
1: storm.georgia 48,257   51 28 BEGIN_YEARMONTH,BEGIN_DAY,BEGIN_TIME,END_YEARMONTH,END_DAY,END_TIME,...
Total: 28MB
Please note that there are inconsistencies in the data files, as can be seen from the last lines of output:
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz'
Content type 'unknown' length 385707 bytes (376 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz'
Content type 'unknown' length 255646 bytes (249 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz'
Content type 'unknown' length 298130 bytes (291 KB)
Warning messages:
1: In fread(url) : Stopped early on line 33957. Expected 51 fields but found 65. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<199712,7,400,199712,8,1800,2071900,5623952,"UTAH",49,1997,"December","Winter Storm","Z",1,"NW CACHE","SLC","07-DEC-97 04:00:00","MST","08-DEC-97 18:00:00","20","0","0","0","200K",,,,,,,,,,,,,,,,,,,,,,,,"A strong trough moved through northern Utah on the 7th. The cold moist airmass remained unstable into the 8th in a strong northwest flow. Lake-enhanced snowbands developed along the Wasatch Front on the 8th as well. Criteria snow fell across much of the state and locally strong winds occurred >>
2: In fread(url) : Found and resolved improper quoting out-of-sample. First healed line 23934: <<199703,14,1543,199703,14,1543,2059347,5589290,"FLORIDA",12,1997,"March","Waterspout","C",87,"MONROE","MFL","14-MAR-97 15:43:00","EST","14-MAR-97 15:43:00","0","0","0","0",,,,,,,,,,,,,,,,,"S OF "7 MILE" BRIDGE",,,"S OF "7 MILE" BRIDGE",,,,,"TWO WATERSPOUTS OBSERVED BY SKYWARN SPOTTER...LOST IN RAIN SHAFT.",,"PDC">>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
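A sketch of how those warnings might be addressed, following fread's own suggestions (`read_storm_file` is a hypothetical helper, and `u` stands for one of the csv.gz URLs above):

```r
library(data.table)

# fill = TRUE pads rows with a mismatched field count instead of
# stopping early; quote = "" turns off quote handling for files with
# unbalanced quotes (such as the "7 MILE" record above). Note that with
# quote = "", literal quote characters remain embedded in the fields and
# may need cleaning afterwards.
read_storm_file <- function(u) {
  fread(u, fill = TRUE, quote = "")
}
```
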
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Uwe |
