I am working with a data frame of roughly 700k rows and 400 columns, which I'm uploading to Microsoft SQL Server. I'm currently using the code:
dbWriteTable(conn = con,
             name = "table_test",
             value = x,
             row.names = FALSE)
from the odbc package, where con is my connection created with dbConnect().
The code does run, but it takes a few hours. Is there any faster way I can do this?
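One commonly suggested workaround (not from the original post) is to skip row-by-row inserts entirely: write the data frame to a delimited file and bulk-load it with SQL Server's bcp command-line utility. A rough sketch, assuming bcp is installed, the target table already exists, and that the server and database names below are placeholders:
# sketch only: write the data frame to a flat file, then bulk-load it with bcp
library(data.table)
data.table::fwrite(x, "table_test.tsv", sep = "\t", col.names = FALSE, quote = FALSE)
# -S server, -d database, -T trusted connection, -c character mode, -t field terminator
system2("bcp", c("dbo.table_test", "in", "table_test.tsv",
                 "-S", "myserver", "-d", "mydatabase",
                 "-T", "-c", "-t", "\\t"))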
I was hoping to get some clarification on optimizing sparklyr performance in R on my local machine.
I have imported a CSV file with 211 million rows (the CSV is 17 gigabytes, so it won't fit in memory) and just a few columns, and I would like to select only the distinct values of one column. To do this I imported the data as "test" using spark_read_csv with memory = FALSE and a schema generated from the data, saved separately in its own object (the import took a few minutes).
After importing, I ran very basic code to deduplicate one column.
It had been running for 2 hours, so I decided to try SAS instead, where I was able to accomplish what I needed in a few minutes.
This seems very problematic to me; even on a local machine this does not seem like a very difficult problem.
sc <- spark_connect(master = "local", version = "2.3")

download <- function(datapath, dataname) {
  # infer a column spec from the first 1000 rows
  spec_with_r <- sapply(read.csv(datapath, nrows = 1000), class)
  # spec_explicit <- c(x = "character", y = "numeric")
  system.time(dataname <- spark_read_csv(
    sc,
    path = datapath,
    columns = spec_with_r,
    memory = FALSE   # note: the argument is lower-case memory
  ))
  return(dataname)
}
test <- download("./data/metastases17.csv", test)
test2 <- test %>% select(DX) %>% distinct()
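A pattern that sometimes helps here (not from the original post) is to keep everything lazy in Spark and collect only the distinct values, so that a single small result crosses the Spark/R boundary. A sketch, reusing the column name DX and the file path from the question:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.3")

# read lazily (memory = FALSE avoids caching the full table in Spark memory)
test <- spark_read_csv(sc, name = "test", path = "./data/metastases17.csv",
                       memory = FALSE)

# push select() and distinct() down to Spark; only the small result is collected into R
distinct_dx <- test %>%
  select(DX) %>%
  distinct() %>%
  collect()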
I'm dealing with a huge database table (67 million rows and 55 columns). The database is on Azure and I want to do my analysis in R. So I'm using the following:
library("odbc")
library("DBI")
library("tidyverse")
library("data.table")
con <- DBI::dbConnect(odbc::odbc(),
                      UID = rstudioapi::askForPassword("myEmail"),
                      Driver = "ODBC Driver 17 for SQL Server",
                      Server = server, Database = database,
                      Authentication = "ActiveDirectoryInteractive")

# selecting columns
myData <- data.table::setDT(DBI::dbGetQuery(conn = con, statement =
  'SELECT Column1, Column2, Column3
   FROM myTable'))
I'm using data.table::setDT to convert the data.frame to a data.table, but the query is still taking a very long time to load.
Any hint on how can I load this data faster?
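One option worth trying (an assumption on my part, not something from the original post) is to stream the result set in chunks with dbSendQuery()/dbFetch() and bind the pieces with data.table::rbindlist(), which avoids building one huge intermediate data.frame in a single call. A sketch using the same connection and query:
library(DBI)
library(data.table)

res <- DBI::dbSendQuery(con, 'SELECT Column1, Column2, Column3 FROM myTable')

chunks <- list()
while (!DBI::dbHasCompleted(res)) {
  # fetch one million rows at a time; tune n to your memory budget
  chunks[[length(chunks) + 1L]] <- DBI::dbFetch(res, n = 1e6)
}
DBI::dbClearResult(res)

myData <- data.table::rbindlist(chunks)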
I'd like to create and deploy a model into SQL Server in order to use it with the new built-in PREDICT() function.
However, it seems I'm stuck on the RxSqlServerData method in R.
Whenever I run my script, I get this error:
Error in rxExecJob(rxCallInfo(matchCall, .rxDeprecated =
"covariance"), : Data must be an RxSqlServerData data source for
this compute context.
Here's my code so far:
#Logistic plain select sql query
#input_query = 'SELECT app.ClientAgeToApplicationDate AS Age, IIF(conc.FirstInstallmentDelay>60,1,0) AS FPD60 FROM dim.Application app JOIN dim.Contract con ON app.ApplicationID = con.ApplicationID JOIN dim.Contract_Calculated conc ON con.ContractID = conc.ContractId'
#LinReg aggregated query
input_query = '
*SQL QUERY, too long to paste...*
'
connStr <- paste("Driver=SQL Server; Server=", "czphaddwh01\\dev",
                 ";Database=", "DWH_Staging", ";Trusted_Connection=true", sep = "")
# Set compute context to SQL Server. Does not load any data into the memory of the local client. ODBC can't.
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
input_data <- RxSqlServerData(sqlQuery = input_query, connectionString = connStr)
risk <- rxImport(input_data)
#head(risk)
#Binary regression for non-aggregated sql query
#logit_model <- rxLogit(Age ~ FPD60, data = risk)
#LinReg for aggregated sql query
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
I'm new to R. Any help would be greatly appreciated.
When you run
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
you tell R to run any Microsoft analytics functions (basically those starting with rx) in the SQL compute context. This means that all the processing is handled inside the database; R is essentially acting as a shell to SQL Server.
Naturally, this requires that the dataset you're working with actually be in the database: a table, a view, or a query returning a result set.
When you then run
risk <- rxImport(input_data)
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
you import your data into a local data frame, and then try to fit a model on it. But you previously told R to do the number-crunching in-database, and your data is local. So it will complain.
The solution is to pass your RxSqlServerData object directly to rxLinMod:
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = input_data)
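Alternatively, if you do want to fit against the local data frame returned by rxImport(), you can switch the compute context back to local first; a minimal sketch:
rxSetComputeContext("local")   # run the rx functions locally again
risk <- rxImport(input_data)   # pull the query result into a local data frame
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)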
I am trying to query my MongoDB database from R.
I think part of the data was lost in the process.
Does R have any limit, and how can I ensure all my records are loaded into R?
Code:
# inspect the number of records in the mongo shell
db.complaints.count()
# 395853

# write a query to load the data into R
library(rmongodb)   # mongo.find(), mongo.cursor.next(), mongo.bson.to.list()
library(plyr)       # rbind.fill() lives in plyr, not dplyr

# mongo is an existing connection created with mongo.create()
complaints = data.frame(stringsAsFactors = FALSE)
db = "customers.complaints"
cursor = mongo.find(mongo, db)

while (mongo.cursor.next(cursor)) {
  tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
  tmp.df = as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
  complaints = rbind.fill(complaints, tmp.df)
}
I get [1] 47077 15 after checking the loading in R with dim(complaints).
How can I make sure I get all my records into R?
http://www.analyticbridge.com/profiles/blogs/time-issue-in-creating-a-huge-data-frame-from-mongodb-collection
The code at the link above, which uses environment variables, might help you. Please comment here if you find a solution.
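As a side note (not from the linked post), the mongolite package streams a whole collection into a data frame without a manual cursor loop, which sidesteps this kind of truncation; a sketch, assuming a local MongoDB instance and the same database and collection names:
library(mongolite)

# connection URL is an assumption; adjust to your server
m <- mongo(collection = "complaints", db = "customers",
           url = "mongodb://localhost:27017")

m$count()                   # should report all 395853 documents
complaints <- m$find('{}')  # stream every document into a data frame
dim(complaints)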
Reading around, I found out that the best way to read a larger-than-memory CSV file is to use read.csv.sql from the sqldf package. This function reads the data directly into a SQLite database and then executes a SQL statement.
I noticed the following: the data read into SQLite appears to be stored in a temporary table, so to make it accessible for later use it has to be requested explicitly in the SQL statement.
As an example, the following code reads some sample data into SQLite:
# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new")
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file",
             dbname = "sample_db", header = TRUE, row.names = FALSE, sep = ",")
The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").
The problem is the following: the sqlite file takes up twice as much space as it should. My guess is that it contains the data twice: once for the temporary read, and once for the stored table. It is possible to clean up the database with sqldf("vacuum", dbname = "sample_db"). This will reclaim the empty space, but it takes a long time, especially when the file is big.
Is there a better solution that doesn't create this data duplication in the first place?
Solution: using RSQLite without going through sqldf:
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db")

# read the csv file into the sql database
dbWriteTable(con, name = "sample_data", value = "sample_data.csv",
             row.names = FALSE, header = TRUE, sep = ",")
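For completeness, a short usage sketch querying the table created above and closing the connection (table and file names as in the example):
dbGetQuery(con, "SELECT * FROM sample_data LIMIT 5")
dbDisconnect(con)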