'Download a large zipped CSV file, unzip and read into R on Linux

I wish to read into my environment a large CSV (~ 8Gb) but I am having issues.

My data is a publicly available dataset:

# CREATE A TEMP FILE TO STORE THE DOWNLOADED DATA
temp <- tempfile()

# DOWNLOAD THE FILE FROM THE CMS
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = temp)

This is where I'm running into difficulty, I am unfamiliar with linux working directories and where temp folders are created.

When I use list.dir() or list.files() I don't see any reference to this temp file.

I am working in an R project and my working director is as follows:

getwd()
[1] "/home/myName/myProjectName"

I'm able to read in the first part of the file but my system crashes after about 4Gb.

# UNZIP THE NPI FILE
npi <- unz(temp, "npidata_pfile_20050523-20220213.csv")

I then came across this post which has a function for decompressing large zip files using the system2 unzip functionality. However due to my limited R knowledge and Linux experience I couldn't get the function to point to the downloaded file in the temp folder

checking the path for temp above I get the following path:

temp
[1] "/tmp/Rtmpl6SHIJ/file7e5e6c1fc693"

Using the system2 function from the link above I tried the following:

x <- decompress_file(directory = temp,
                     file = "NPPES_Data_Dissemination_February_2022.zip")

But get the following error about setting the working directory:

enter image description here

Any pointers to how I can get this file unzipped given it's size and read it into memory would be much appreciated.



Solution 1:[1]

It might be a file permission issue. To get around it work in a directory you're already in, or know you have access to.


# DOWNLOAD THE FILE 
# to a directory you can access, and name the file. No need to overcomplicate this.

download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = "/home/myName/myProjectname/npi.csv")

# use the decompress function if you need to, though unzip might work
x <- decompress_file(directory = "/home/myName/myProjectname/",
                     file = "npi.zip")

# remove .zip file if you need the space back
file.remove("/home/myName/myProjectname/npi.zip")

Solution 2:[2]

temp is the path to the file, not just the directory. By default, tempfile does not add a file extension. It can be done by using tempfile(fileext = ".zip")

Consequently, decompress_file can not set the working directory to a file. Try this:

x <- decompress_file(directory = dirname(temp), file = basename(temp))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mrhellmann
Solution 2 Marcus