I'm trying to analyze data stored in an SQL database (MS SQL server) in R, and on a mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.
However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly: https://stackoverflow.com/a/59220710/8400969. I'm wondering if there is another way to more quickly pipe the data to my mac, and I like the line of thinking here: Quickly reading very large tables as dataframes
What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?
Thanks for your ideas and suggestions!
My answer includes use of a different package. I use RODBC which is found in cran at https://cran.r-project.org/web/packages/RODBC/index.html.
This has saved me SO MUCH frustration and wasted time that came from my previous method of exporting each query result to .csv to load it into my R environment. I found regular ODBC to be much slower than RODBC.
I use the following functions:
sqlQuery() wraps the function that opens the connection to the SQL db with the first argument (in parentheses) and the query itself as the second argument. Put the query itself in quote marks.
odbcConnect() is itself the first argument in sqlquery(). The argument in odbcConnect() is the name of your connection to the SQL db. Put the connection name in quote marks.
odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.
Here is a simple example.
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498")
odbcCloseAll()
Here is the same example PLUS data manipulation directly from the query result.
library(dplyr)
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498") %>%
mutate(matchid = paste0(schoolID, "-", studentID)) %>%
distinct(matchid, .keep_all - TRUE)
odbcCloseAll()
I would suggest using the dbcooper found on github. https://github.com/chriscardillo/dbcooper
I have found huge improvements in speed when querying large datasets.
Firstly, Add your connection to your environment.
conn <- DBI::dbConnect(odbc::odbc(),
Driver = "",
Server = "",
Database = "",
UID="",
PWD="")
devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con = conn,
con_id = "test",
tables = c("schema.table"))
This adds the function test_schema_table() to your environment which is used to call the data. To collect into your environment use scheme_table %>% collect()
Here is a microbenchmark I did to compare the results of both DBI and dbcooper.
mbm <- microbenchmark::microbenchmark(
DBI = DBI::dbFetch(DBI::dbSendQuery(conn,qry)),
dbcooper = ava_qry() %>% collect() , times=5
)
Here are the results of a microbenchmark I did to compare DBI with dbcooper.
Related
My colleague is using pyspark in Databricks and the usual step is to run an import using data = spark.read.format('delta').parquet('parquet_table').select('column1', column2') and then this caching step, which is really fast.
data.cache()
data.registerTempTable("data")
As an R user I am looking for this registerTempTable equivalent in sparklyr.
I would usually do
data = sparklyr::spark_read_parquet(sc = sc, path = "parquet_table", memory = FALSE) %>% dplyr::select(column1, column2)
In case I opt for memory = TRUE or tbl_cache(sc, "data") it keeps running and never stops. The contrast in time difference seems very obvious - my colleague's registerTempTable takes seconds whereas my option of sparklyr keeps running, i.e. unknown when it will stop. Is there a better function in sparklyr for R users which can do this registerTempTable faster?
You can try using cache and createOrReplaceTempView
library(SparkR)
df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")
cache(df)
createOrReplaceTempView(df,"df_temp")
The above works for SparkR.
For sparklyR, you're code the spark_read_parquet() with memory=FALSE is a pretty similar procedure to what is happening with SparkR.
You shouldn't need to cache the data if you just want to create a temporary table so I would just use memory = FALSE.
See this question.
I managed to load and merge the 6 heavy excel files I had from my RStudio instance (on EC2 server) into one single table in PostgreQSL (linked with RDS).
Now this table has 14 columns and 2,4 Million rows.
The size of the table in PostgreSQL is 1059MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with DBI package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing (to handle big tables like this) a ok/good practice?
2) If yes to 1), is the only way to be able to perform the unique() operation or other similar operations to increase the EC2 server memory?
3) If yes to 2), how can I know to which extent should I increase the EC2 server memory?
Thanks!
dbReadTable convert the entire table to a data.frame, which is not what you want to do for such a big tables.
As #cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery,dbBind,dbFetch or dbGetQuery.
For example, you could define a function to get the required data
filterBySQLString <- function(databaseDB,sqlString){
sqlString <- as.character(sqlString)
dbResponse <- dbSendQuery(databaseDB,sqlString)
requestedData <- dbFetch(dbResponse)
dbClearResult(dbResponse)
return(requestedData)
}
# write your query to get unique values
SQLquery <- "SELECT * ...
DISTINCT ..."
my_big_df <- filterBySQLString(myDB,SQLquery)
my_big_df <- unique(my_big_df)
If you cannot use SQL, then you have two options:
1) stop using Rstudio and try to run your code from the terminal or via Rscript.
2) beef up your instance
I'm trying to import a large Database table into R to do some global analysis.
I connect to Oracle DB with ROracle and use dbGetquery.
Make minimum selection and necessary where clauses directly in the query to reduce the scope of the dataset but still it is 40 columns for 12 million of rows.
My PC has only 8GB of RAM how can I handle this?
There is no way to store those data on the disk rather than on the RAM ? or something similar to that way?
The same things made in SAS works fine.
Any Idea?
Few ideas:
May be some aggregation could be done on server side?
You are going to do something with this data in R, right? So you can try not to load data, but to create tbl object and made manipulations and aggregations in R
library(dplyr)
my_tbl <- 'SELECT ... FROM ...' %>% sql() %>% tbl(con, .)
where con is your connection
Here are a couple ideas for you to consider.
library(RODBC)
dbconnection <- odbcDriverConnect("Driver=ODBC Driver 11 for SQL Server;Server=Server_Name; Database=DB_Name;Uid=; Pwd=; trusted_connection=yes")
initdata <- sqlQuery(dbconnection,paste("select * from MyTable Where Name = 'Asher';"))
odbcClose(channel)
If you can export the table as a CSV file...
require(sqldf)
df <- read.csv.sql("C:\\your_path\\CSV1.csv", "select * from file where Name='Asher'")
df
So I am currently working with a connecting to an Access database. I am able to get connected to the Access DB which is located on my local system. This is actually connected to a SharePoint list. I would love to automate the process handling this SharePoint list with an R and Access combo! What I want to be able to do actually pretty basic, I want to introduce new data via a .csv which is processed for the relevant content and then compared to the current Access DB and finally the new information uploaded from r to Access.
I've learned that you need to pair the bit version of your Windows OS, Office version, and R version. So I am x64 on all of the above. This allowed me to connect to the Access DB. You also need the 'Microsoft Access Database Engine 2016 Redistributable' which is essentially the driver for the connection.
So what I have so far is:
library(odbc)
library(DBI)
file_path <- "C:/user/Documents/R Projects/...pathtofile.../filename.accdb"
accdb_con <- dbConnect(drv = odbc(), .connection_string = paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=",file_path,";"))
access.db <- dbReadTable(accdb_con, "sNPS Deep Dives")
That now connects!
I then read in a .csv of new information
new.df <- read.csv("C:/user/Documents/R projects/...pathtofile.csv", header=T, stringsAsFactors=FALSE, na.strings=c("","NA"))
an example of the data set might just look something like this:
date <- c("15/10/2018","15/10/2018", "16/10/2018", "12/11/2018", "07/09/2018")
score <- c("6", "10", "7", "10", "9")
group <- c("a","b", "b", "a", "b")
CaseID <- c("301", "302", "303", "304", "305")
new.df <- data.frame(date,score,group,CaseID)
new.df$date <- as.character(new.df$date)
new.df$score <- as.numeric(new.df$score)
new.df$group <- as.character(new.df$group)
new.df$CaseID <- as.numeric(new.df$CaseID)
Notably there are more columns in the Access DB that people will fill in by hand with further information.
and I process it to be ready go into the Access DB.
probably not that interesting...
Then I compare the the new data against the Access DB as such:
library(dplyr)
new <- anti_join(new.df, access.db, by= "Case.ID")
Now I've tried:
dbWriteTable(access.db.copy, new, append = TRUE)
dbAppendTable(access.db.copy, new)
I don't seem to be able to get this to go anywhere
I am getting an error:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"ACCESS", "data.frame", "missing"’
I've seen plenty of posts in which people are having trouble connecting to an Access DB but I haven't seen anything about writing new data into that database.
I know this isn't quite a reproducible example but it seems like a difficult problem to recreate since it's a connection problem between different tools. I would be happy to provide example sets that might make this easier
I would appreciate any direction you all can provide.
Thanks!
Edit:
It appears that Bing Sun was right, I was missing an argument. So it appears that we need something more like:
dbWriteTable(access.db.copy, "Name of table",new, append = TRUE)
Which produces the error:
Error in result_insert_dataframe(rs#ptr, values) :
nanodbc/nanodbc.cpp:1944: HY104: [Microsoft][ODBC Microsoft Access Driver]Invalid precision value
I wonder if this may something that is an error from Access about a file type?
now if I use the append I don't get an error I get a 0 for output
dbAppendTable(access.db.copy, "Name of table", new, append= TRUE)
With output:
[1] 0
But I don't see any of the new values when I check the Access file.
I know it's years later, but hopefully this will help someone else with this issue since you're right CrayCrayTown, there aren't very many posts covering this issue.
I've run into this problem repeatedly when dealing with R and MS Access. The solution that I've come up with is pretty "hacky" but it accomplishes what's trying to be done...just not very eloquently.
The way I do this is with a combo of RODBC and DBI packages.
First, I open a connection to the DB with RODBC, and use that connection to write my data to the DB as an intermediary table:
chan <- RODBC::odbcDriverConnection(connection = "/path/to/database.accdb")
RODBC::sqlSave(channel = chan,
dat = df,
tablename = "tbl_intermediary",
rownames = FALSE,
append = FALSE)
RODBC::odbcClose(chan)
rm(chan)
Make sure to close the RODBC connection, I also destroy it for good measure, because why not? I use RODBC for the intermediary table because it supports batch insert statements. I know that the same thing can, in theory, be done with DBI with DBI::dbAppendTable()(but we wouldn't be on this post if that worked how we had hoped). I tried this in a previous SO question here, but it didn't solve my problem. I also don't know how big my intermediary tables could get in the future. Hopefully by the time they get too big we'll be in a different DBMS.
Next, I reopen the connection, this time with DBI, and send a statement to the DB to write those data from the intermediary table to the final resting place for those data, and then drop the intermediary table.
con <- DBI::dbConnect(odbc::odbc(), .connection_string = "/path/to/database.accdb")
DBI::dbSendStatement(
conn = con,
statement = 'UPDATE
tbl_intermediary INNER JOIN final_tbl ON tbl_intermediary.SampleID = final_tbl.sampleNumber
SET
final_tbl.field1 = [tbl_intermediary].[field1],
final_tbl.notes = IIf(Nz([tbl_intermediary].[Notes],"")="",[final_tbl].[notes],[final_tbl].[notes] & "; Newest Notes: " & [tbl_intermediary].[Notes]);'
)
DBI::dbSendStatement(
conn = con,
statement = 'DROP TABLE tbl_intermediary;'
DBI::dbDisconnect(con)
rm(con)
)
The main reason why I chose this method is because some of the SQL I use with Access also has some VBA in it. When I send the SQL-VBA hybrid string with RODBC, I get assorted errors in the IIF() and Nz() functions (see example above). From the RODBC CRAN docs the query argument for the sqlQuery() function is strictly assumed to be a valid SQL statement. So, RODBC has no clue how to interpret the IIf() and Nz() MS Access functions. I think this also has to do with how the ODBC driver handles communication as well (please, someone correct me if I'm wrong about this).
As I understand it, DBI::dbSendStatment() however lets the database engine you're working with interpret how to use the statement argument you provide. In the situation above, the VBA is executed exactly how I would expect if it were run in Access directly. As per the DBI docs, for interactive use you'll generally want to use dbGetQuery or dbExecute.
I've recently begun using RODBC to connect to PostgreSQL as I couldn't get RPostgreSQL to compile and run in Windows x64. I've found that read performance is similar between the two packages, but write performance is not. For example, using RODBC (where z is a ~6.1M row dataframe):
library(RODBC)
con <- odbcConnect("PostgreSQL84")
#autoCommit=FALSE seems to speed things up
odbcSetAutoCommit(con, autoCommit = FALSE)
system.time(sqlSave(con, z, "ERASE111", fast = TRUE))
user system elapsed
275.34 369.86 1979.59
odbcEndTran(con, commit = TRUE)
odbcCloseAll()
Whereas for the same ~6.1M row dataframe using RPostgreSQL (under 32-bit):
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="gisdb", user="postgres", password="...")
system.time(dbWriteTable(con, "ERASE222", z))
user system elapsed
467.57 56.62 668.29
dbDisconnect(con)
So, in this test, RPostgreSQL is about 3X as fast as RODBC in writing tables. This performance ratio seems to stay more-or-less constant regardless of the number of rows in the dataframe (but the number of columns has far less effect). I do notice that RPostgreSQL uses something like COPY <table> FROM STDIN while RODBC issues a bunch of INSERT INTO <table> (columns...) VALUES (...) queries. I also notice that RODBC seems to choose int8 for integers, while RPostgreSQL chooses int4 where appropriate.
I need to do this kind of dataframe copy often, so I would very sincerely appreciate any advice on speeding up RODBC. For example, is this just inherent in ODBC, or am I not calling it properly?
It seems there is no immediate answer to this, so I'll post a kludgy workaround in case it is helpful for anyone.
Sharpie is correct--COPY FROM is by far the fastest way to get data into Postgres. Based on his suggestion, I've hacked together a function that gives a significant performance boost over RODBC::sqlSave(). For example, writing a 1.1 million row (by 24 column) dataframe took 960 seconds (elapsed) via sqlSave vs 69 seconds using the function below. I wouldn't have expected this since the data are written once to disk then again to the db.
library(RODBC)
con <- odbcConnect("PostgreSQL90")
#create the table
createTab <- function(dat, datname) {
#make an empty table, saving the trouble of making it by hand
res <- sqlSave(con, dat[1, ], datname)
res <- sqlQuery(con, paste("TRUNCATE TABLE",datname))
#write the dataframe
outfile = paste(datname, ".csv", sep = "")
write.csv(dat, outfile)
gc() # don't know why, but memory is
# not released after writing large csv?
# now copy the data into the table. If this doesn't work,
# be sure that postgres has read permissions for the path
sqlQuery(con,
paste("COPY ", datname, " FROM '",
getwd(), "/", datname,
".csv' WITH NULL AS 'NA' DELIMITER ',' CSV HEADER;",
sep=""))
unlink(outfile)
}
odbcClose(con)