How to speed up collect() from a dbConnect database

I'm reading tables through an ODBC driver from a server on Amazon AWS. I am using R 4.0.0 on a 64-bit Windows computer. I'm working from home over my company's VPN, so my internet speed is about 17 Mbps download.

I connect to the database with this statement:

con <- dbConnect(odbc::odbc(), "databasename", timeout = 10)

and I access the table with these statements:

case_tbl <- tbl(con, dbplyr::in_schema("fhv_owner", "case"))
case <- collect(case_tbl)

The table has 500K rows and 60 columns, so it's not huge, but it takes about 2 minutes to download. Sometimes it takes 30 minutes if my internet is slow.

Is there any way to speed this up? Maybe something with data.table instead of dplyr?

I've tried selecting only the columns I need, but sometimes I turn out to need more columns later and have to rerun the whole download.
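For example, something like the sketch below, where the column names are just placeholders for the ones I actually pick:

library(dplyr)

# select() is translated to SQL before the query runs, so only the
# chosen columns get transferred; column names here are placeholders
case <- tbl(con, dbplyr::in_schema("fhv_owner", "case")) %>%
  select(case_id, status, created_date) %>%
  collect()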

Maybe I just need to pay for faster internet.

Thanks, Jennifer



Solution 1:[1]

Since you aren't doing any data manipulation here, data.table won't speed up the process.
To make sure that the delay is due to your internet connection speed and has nothing to do with dbplyr, use DBI directly:

library(DBI)
con <- dbConnect(odbc::odbc(), "databasename", timeout = 10)
# Time the raw query; if it is about as slow as collect(), the
# bottleneck is the connection, not dbplyr
system.time({case_tbl <- dbGetQuery(con, "SELECT * FROM fhv_owner.case")})
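For comparison, you can time the dbplyr route for the same table; if the two elapsed times are similar, the bottleneck is the transfer itself rather than dbplyr. A sketch, reusing con from above:

library(dplyr)

# Same pull through dbplyr; compare the elapsed time with the
# dbGetQuery() timing above
system.time({case <- collect(tbl(con, dbplyr::in_schema("fhv_owner", "case")))})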

Solution 2:[2]

Depending on your use case and your database system, it might make sense to replace odbc::odbc() with RPostgres::Postgres() (if you have a Postgres database, that is; otherwise there are alternative native-driver packages). This at least helped me a lot!
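A minimal sketch of what that swap could look like, assuming a Postgres backend; the host, port, user, and password values below are placeholders, not the asker's real settings:

library(DBI)

# Connect with the native Postgres driver instead of going through
# ODBC; every connection detail below is a placeholder
con <- dbConnect(
  RPostgres::Postgres(),
  host     = "mydb.example.com",
  port     = 5432,
  dbname   = "databasename",
  user     = "myuser",
  password = Sys.getenv("DB_PASSWORD")
)

# "case" is a reserved word in Postgres, so quote the table name
case <- dbGetQuery(con, 'SELECT * FROM fhv_owner."case"')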

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: David