Running a loop on a data pull with Amazon S3

I'm trying to pull data from Eviction Lab, which uses Amazon S3. To pull the data from Amazon's servers, I am using the get_object function from the cloudyr project's aws.s3 package. I want to pull the census tract data for all 50 states. Each one is a large .csv file. If I run this code, I can successfully pull an individual state's data:

NY.tract<-get_object("NY/tracts.csv", bucket = "eviction-lab-data-downloads")

But, I want to run a loop that automates the process, in case I want to change what I pull later on down the road.

I'm running into two main problems with my loop:

  • (1) I have to figure out how to specify the "NY/tracts.csv" within the get_object function so that it changes with each pull. I'm not sure my loop is doing that

  • (2) I need to name each data pull by the state abbreviation. I could use another list or data.frame to specify the state abbreviation, but I have no idea where to begin with that.

My attempt at a loop is still missing quite a bit. The "file.paths" that I reference in the sequence of the loop is a data.frame I pulled into R with a single string variable holding the 50 file paths I want to pull, one per row. For example, the first row is "AL/tracts.csv", the second is "AK/tracts.csv", etc. Here is the loop that I've written:

for(i in 1:nrow(file.paths)){
   my.data<-get_object("i", bucket = "eviction-lab-data-downloads")
  }

View(my.data)

When I run this loop, it returns 272 observations for 1 variable. I want to get 50 different .csv datasets, named according to the state abbreviation, which I can bind together into 1 nationwide dataset.

Maybe it's not possible to do with the get_object function? I can certainly write 50 lines of code to get the individual objects I want, but I'd prefer a loop so I can edit it in the future.

Any help here would be awesome.

Thanks. Best, Kasey



Solution 1:[1]

Something like this should work...

Create an empty data frame, then pull the files in and append.

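library(aws.s3)  # get_object()
library(readr)   # read_csv()
library(dplyr)   # bind_rows()

# Start empty; bind_rows() picks up the columns from the first file appended
df <- data.frame()

# list_of_paths is a character vector of object keys, e.g. "NY/tracts.csv"
for (i in list_of_paths) {

  # get_object() returns the file contents as a raw vector
  object <- get_object(i, 'bucket_name')

  # read_csv() accepts a raw vector as literal data
  df_i <- read_csv(object)

  # Assign the result back: bind_rows() does not modify df in place
  df <- bind_rows(df, df_i)

}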
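
The loop above stacks everything into one data frame but does not keep track of which state each row came from. Below is a rough sketch of one way to keep each pull named by state abbreviation and then bind them together. It assumes every state's key follows the "XX/tracts.csv" pattern from the question and that all files share the same columns. The helper names (tract_key, pull_tracts, stack_tracts) and the fetch argument are made up for illustration and are not part of aws.s3. It uses base R for parsing and binding so that the only external dependency is aws.s3 itself.

```r
# Hypothetical helper: build the S3 key for one state, e.g. "NY/tracts.csv"
tract_key <- function(state) paste0(state, "/tracts.csv")

# Hypothetical helper: download each state's tract file and return a named
# list of data frames. `fetch` defaults to aws.s3::get_object; the default is
# evaluated lazily, so aws.s3 is only needed when no other fetch is supplied.
pull_tracts <- function(states,
                        bucket = "eviction-lab-data-downloads",
                        fetch = aws.s3::get_object) {
  tracts <- lapply(states, function(st) {
    raw_csv <- fetch(tract_key(st), bucket = bucket)  # raw vector of file bytes
    read.csv(text = rawToChar(raw_csv), stringsAsFactors = FALSE)
  })
  names(tracts) <- states  # name each element by its state abbreviation
  tracts
}

# Hypothetical helper: stack the per-state data frames into one, recording the
# abbreviation in a `state` column (assumes identical columns in every file).
stack_tracts <- function(tracts) {
  tagged <- Map(function(df, st) {
    df$state <- rep(st, nrow(df))
    df
  }, tracts, names(tracts))
  out <- do.call(rbind, unname(tagged))
  rownames(out) <- NULL
  out
}

# Usage (needs access to the bucket, so it is left commented out):
# tracts     <- pull_tracts(state.abb)  # state.abb is built into R: "AL", "AK", ...
# ny_tracts  <- tracts[["NY"]]          # one state's data
# nationwide <- stack_tracts(tracts)    # all states in one data frame
```

Keeping the pulls in a named list, rather than creating 50 separately named objects, makes it easy to change the set of states later by changing only the vector passed to pull_tracts.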

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pyll