How to import data from Azure to R faster

I'm dealing with a huge database (67 million rows and 55 columns). The database lives on Azure, and I want to analyze it in R. So I'm using the following:
library("odbc")
library("DBI")
library("tidyverse")
library("data.table")
con <- DBI::dbConnect(odbc::odbc(),
UID = rstudioapi::askForPassword("myEmail"),
Driver="ODBC Driver 17 for SQL Server",
Server = server, Database = database,
Authentication = "ActiveDirectoryInteractive")
#selecting columns
myData = data.table::setDT(DBI::dbGetQuery(conn = con, statement =
'SELECT Column1, Column2, Column3
FROM myTable'))
I'm using data.table::setDT to convert the data.frame to a data.table, but the query itself is still taking a very long time to load.
Any hint on how I can load this data faster?
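One thing worth trying (a minimal sketch, not from the original thread, assuming the same con and query as above): stream the result in fixed-size chunks with DBI::dbSendQuery()/dbFetch() and bind them with data.table::rbindlist(). This does not make the server or network any faster, but it avoids one huge allocation and lets you watch progress:
res <- DBI::dbSendQuery(con, 'SELECT Column1, Column2, Column3 FROM myTable')
chunks <- list()
while (!DBI::dbHasCompleted(res)) {
  # fetch one million rows at a time; tune n to your memory budget
  chunks[[length(chunks) + 1L]] <- DBI::dbFetch(res, n = 1e6)
}
DBI::dbClearResult(res)
myData <- data.table::rbindlist(chunks)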

Related

R - any way to speed up dbWriteTable?

I am working with a dataframe of roughly 700k rows and 400 columns which I'm uploading to Microsoft SQL Server. I'm currently using the code:
dbWriteTable(conn = con,
             name = "table_test",
             value = x,
             row.names = FALSE)
from the odbc library, and con is my connection created with dbConnect().
The code does run, but it takes a few hours. Is there any faster way I can do this?
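One thing worth experimenting with (a sketch, not from the original thread, assuming con and x as in the question): create the table once and append the rows in chunks via DBI, so each insert runs as a smaller transaction. Recent versions of odbc also expose a batch_rows argument to dbWriteTable(), which is worth checking in your installed version.
# create the empty table from the data frame's column types, then append in chunks
DBI::dbCreateTable(con, "table_test", x)
chunk_size <- 50000
row_groups <- split(seq_len(nrow(x)), ceiling(seq_len(nrow(x)) / chunk_size))
for (rows in row_groups) {
  DBI::dbAppendTable(con, "table_test", x[rows, , drop = FALSE])
}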

Define mlr3 task using data from a database (different tables)?

This is a newbie question.
How do I define a (classification) task that uses data from a (sqlite) database? The mlr3db example seems to write data from memory first. In my case, the data is already in the database. What is maybe a bigger problem: the target data and the features are in different tables.
What I tried:
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "my_data.db")
my_features <- dplyr::tbl(con, "my_features")
my_target <- dplyr::tbl(con, "my_targets")
task <- mlr3::TaskClassif$new("my_task", backend = my_features, target = "???")
and then I don't know how to specify the target argument.
Maybe a solution would be to create a VIEW in the database that joins features and targets?
Having the data split into (1) multiple tables or (2) multiple databases is possible. In your case, it looks like the data is just split into multiple tables, so you can use the same DBI connection to access them.
All you need is a key column to join the two tables. In the following example I'm using a simple integer key column and an inner_join() to merge the two tables into a single new table, but this somewhat depends on your database schema.
library(mlr3)
library(mlr3db)
# base data set
data = iris
data$row_id = 1:nrow(data)
# create a database with two tables: split the data into features and target and
# keep the key column `row_id` in both tables
path = tempfile()
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
DBI::dbWriteTable(con, "features", subset(data, select = - Species))
DBI::dbWriteTable(con, "target", subset(data, select = c(row_id, Species)))
DBI::dbDisconnect(con)
# re-open the database connection
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
# access tables with dplyr
tbl_features = dplyr::tbl(con, "features")
tbl_target = dplyr::tbl(con, "target")
# join tables with an inner_join
tbl_joined = dplyr::inner_join(tbl_features, tbl_target, by = "row_id")
# convert to a backend and create the task
backend = as_data_backend(tbl_joined, primary_key = "row_id")
mlr3::TaskClassif$new("my_task", backend, target = "Species")

How to execute RxSqlServerData method in R?

I'd like to create and deploy a model into SQL Server in order to use it with the new built-in PREDICT() function.
However, it seems I'm stuck with RxSqlServerData method in R.
Whenever I run my script, I get this error:
Error in rxExecJob(rxCallInfo(matchCall, .rxDeprecated =
"covariance"), : Data must be an RxSqlServerData data source for
this compute context.
Here's my code so far:
#Logistic plain select sql query
#input_query = 'SELECT app.ClientAgeToApplicationDate AS Age, IIF(conc.FirstInstallmentDelay>60,1,0) AS FPD60 FROM dim.Application app JOIN dim.Contract con ON app.ApplicationID = con.ApplicationID JOIN dim.Contract_Calculated conc ON con.ContractID = conc.ContractId'
#LinReg aggregated query
input_query = '
*SQL QUERY, too long to paste...*
'
connStr <- paste("Driver=SQL Server; Server=", "czphaddwh01\\dev",
";Database=", "DWH_Staging", ";Trusted_Connection=true", sep = "");
# Set compute context to SQL Server. Does not load any data into the local client's memory (plain ODBC can't do that).
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
input_data <- RxSqlServerData(sqlQuery = input_query, connectionString = connStr)
risk <- rxImport(input_data)
#head(risk)
#Binary regression for non-aggregated sql query
#logit_model <- rxLogit(Age ~ FPD60, data = risk)
#LinReg for aggregated sql query
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
I'm new to R. Any help would be greatly appreciated.
When you run
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
you tell R to run any Microsoft analytics functions (basically those starting with rx) in the SQL compute context. This means that all the processing will be handled inside the database; R is essentially acting as a shell around SQL Server.
Naturally, this requires that the dataset you're working with actually be in the database: a table, a view, or a query returning a result set.
When you then run
risk <- rxImport(input_data)
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
you import your data into a local data frame, and then try to fit a model on it. But you previously told R to do the number-crunching in-database, and your data is local. So it will complain.
The solution is to pass your RxSqlServerData object directly to rxLinMod:
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = input_data)
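If you do want the data in a local data frame (for example to inspect it with head(risk)), a hedged alternative is to switch the compute context back to the local machine before importing and fitting, at the cost of pulling the rows over to the client:
rxSetComputeContext("local")
risk <- rxImport(input_data)
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)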

export table to CSV in Redshift using R PostGreSQL

I connect to Redshift using R remotely from my workstation.
install.packages("RPostgreSQL")
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con1 <- dbConnect(drv, host = "url", port = "xxxx",
                  dbname = "db_name", user = "id", password = "password")
dbGetInfo(con1)
then I create a table:
dbSendQuery(con1, "create table schema.table_name as select * from schema.table_name;")
Now I want to export this table to a .csv file on my workstation. How can I do this? Again, I don't have a Postgres database installed on my workstation; I'm only using R to get to it.
Also, this table is LARGE, 4 columns, 14 million rows.
Thanks!
You'll need to pull down the results of a query into a local object, then dump the object to a CSV. Something along these lines should get you started:
res <- dbSendQuery(con1, "select * from schema.table_name")
dat <- dbFetch(res)
readr::write_csv(dat, "~/output.csv")
I figured this out after posting; sharing:
library(data.table)
system.time(fwrite(dbReadTable(con1, c("schema", "table")), file = "file.csv", sep = ",", na = "", row.names = FALSE, col.names = TRUE))
I hear feather is even faster?
This was for 43 million rows with 4 columns and took 15 minutes.
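If feather is worth trying, a minimal sketch with the arrow package (an assumption on my part, not benchmarked here) would be:
library(arrow)
dat <- dbReadTable(con1, c("schema", "table"))
write_feather(dat, "file.feather")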

Reading huge csv files into R with sqldf works but sqlite file takes twice the space it should and needs "vacuuming"

Reading around, I found out that the best way to read a larger-than-memory csv file is to use read.csv.sql from package sqldf. This function will read the data directly into a sqlite database, and consequently execute a sql statement.
I noticed the following: the data read into sqlite seems to be stored in a temporary table, so in order to make it accessible for future use, the sql statement has to explicitly create a persistent table from it.
As an example, the following code reads some sample data into sqlite:
# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new")
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file",
             dbname = "sample_db", header = TRUE, row.names = FALSE, sep = ",")
The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").
The problem is the following: the sqlite file takes up twice as much space as it should. My guess is that it contains the data twice: once for the temporary read, and once for the stored table. It is possible to clean up the database with sqldf("vacuum", dbname = "sample_db"). This will reclaim the empty space, but it takes a long time, especially when the file is big.
Is there a better solution that doesn't create this data duplication in the first place?
Solution: using RSQLite without going through sqldf:
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db")
# read the csv file directly into a table in the sql database
dbWriteTable(con, name = "sample_data", value = "sample_data.csv",
             row.names = FALSE, header = TRUE, sep = ",")
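The imported data can then be queried like any other table:
dbGetQuery(con, "select * from sample_data limit 5")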
