I'm trying to import a large database table into R to do some global analysis.
I connect to an Oracle DB with ROracle and use dbGetQuery.
I select only the columns I need and put the necessary WHERE clauses directly in the query to reduce the scope of the dataset, but it is still 40 columns by 12 million rows.
My PC has only 8 GB of RAM. How can I handle this?
Is there a way to store the data on disk rather than in RAM, or something similar?
The same thing done in SAS works fine.
Any ideas?
A few ideas:
Maybe some aggregation could be done on the server side?
You are going to do something with this data in R, right? So rather than loading the data, you can create a tbl object and do the manipulations and aggregations from R:
library(dplyr)
my_tbl <- 'SELECT ... FROM ...' %>% sql() %>% tbl(con, .)
where con is your connection
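To make the tbl idea concrete, here is a minimal sketch. It assumes an in-memory RSQLite database as a stand-in for the Oracle connection, and the table and column names (sales, region, amount) are made up for illustration. The point is that the grouping runs in the database and only the small summary is pulled into R by collect().

```r
library(DBI)
library(dplyr)  # tbl() on a DBI connection also needs the dbplyr package installed

# Stand-in database so the sketch runs anywhere (assumption);
# in practice `con` would be your ROracle connection.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sales", data.frame(
  region = c("north", "north", "south", "south", "south"),
  amount = c(10, 20, 5, 15, 25)
))

# Lazy table: no rows are pulled into R at this point.
sales_tbl <- tbl(con, sql("SELECT region, amount FROM sales"))

# group_by/summarise are translated to SQL and run in the database;
# collect() brings back only the aggregated rows.
totals <- sales_tbl %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE), n_rows = n()) %>%
  arrange(region) %>%
  collect()

dbDisconnect(con)
```

With 12 million rows, the same pattern should keep memory use in R proportional to the size of the summary rather than the size of the table.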
Here are a couple ideas for you to consider.
library(RODBC)
dbconnection <- odbcDriverConnect("Driver=ODBC Driver 11 for SQL Server;Server=Server_Name; Database=DB_Name;Uid=; Pwd=; trusted_connection=yes")
initdata <- sqlQuery(dbconnection,paste("select * from MyTable Where Name = 'Asher';"))
odbcClose(dbconnection)
If you can export the table as a CSV file...
require(sqldf)
df <- read.csv.sql("C:\\your_path\\CSV1.csv", "select * from file where Name='Asher'")
df
I am trying to ingest a large amount of data (a timestamp every second, with 222 variables for months) from an SQL Server database.
However, RStudio takes around 30 seconds to import just one day.
This is the actual code I use to import the data:
library(RODBC)
table_name <- "[dbo].[sqldata]"
query_string <- paste0("SELECT * FROM ", table_name)
df <- sqlQuery(connect_to_db(),query_string)
odbcCloseAll()
Is there a way to import the data faster?
Could importing data each day and then using 'rbind' to unite them be faster?
Have you seen this?
https://github.com/agstudy/rsqlserver/wiki/benchmarking
There is something called 'dbBulkCopy'. Look into that. Also, check out this link.
https://cran.r-project.org/doc/manuals/r-devel/R-data.html#RODBC
I have never had a problem with the traditional RODBC method of grabbing data from SQL Server.
library(RODBC)
dbconnection <- odbcDriverConnect("Driver=ODBC Driver 11 for SQL Server;Server=Server_Name; Database=DB_Name;Uid=; Pwd=; trusted_connection=yes")
initdata <- sqlQuery(dbconnection,paste("select * from MyTable;"))
odbcClose(dbconnection)
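On the idea of importing one day at a time and combining with rbind: this is not guaranteed to be faster, but if you try it, collect the pieces in a list and bind them once at the end instead of growing a data frame inside a loop. A minimal sketch, where fetch_one_day() is a hypothetical stand-in for a per-day sqlQuery() call and the columns are invented:

```r
# Hypothetical stand-in for a per-day query; in practice this would call
# sqlQuery() with a WHERE clause restricting the timestamp to one day.
fetch_one_day <- function(day) {
  data.frame(day = day, second = 1:3, value = c(0.1, 0.2, 0.3))
}

days <- c("2020-01-01", "2020-01-02", "2020-01-03")

# Collect each day's data frame in a list...
daily <- lapply(days, fetch_one_day)

# ...and bind them in a single call, which avoids re-copying the
# accumulated result on every iteration.
all_days <- do.call(rbind, daily)
```

Selecting only the columns you need instead of SELECT * is likely to matter more than how the pieces are combined.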
I'm trying to analyze data stored in an SQL database (MS SQL server) in R, and on a mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.
However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly: https://stackoverflow.com/a/59220710/8400969. I'm wondering if there is another way to more quickly pipe the data to my mac, and I like the line of thinking here: Quickly reading very large tables as dataframes
What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?
Thanks for your ideas and suggestions!
My answer uses a different package: RODBC, which is available on CRAN at https://cran.r-project.org/web/packages/RODBC/index.html.
This has saved me SO MUCH frustration and wasted time that came from my previous method of exporting each query result to .csv to load it into my R environment. I found regular ODBC to be much slower than RODBC.
I use the following functions:
sqlQuery() takes the connection as its first argument and the query as its second. Put the query itself in quote marks.
odbcConnect() opens the connection and is passed as the first argument to sqlQuery(). Its argument is the name of your ODBC connection to the SQL db. Put the connection name in quote marks.
odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.
Here is a simple example.
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498")
odbcCloseAll()
Here is the same example PLUS data manipulation directly from the query result.
library(dplyr)
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498") %>%
mutate(matchid = paste0(schoolID, "-", studentID)) %>%
distinct(matchid, .keep_all = TRUE)
odbcCloseAll()
I would suggest using the dbcooper package, found on GitHub: https://github.com/chriscardillo/dbcooper
I have found huge improvements in speed when querying large datasets.
First, add your connection to your environment.
conn <- DBI::dbConnect(odbc::odbc(),
Driver = "",
Server = "",
Database = "",
UID="",
PWD="")
devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con = conn,
con_id = "test",
tables = c("schema.table"))
This adds the function test_schema_table() to your environment, which is used to query the table. To pull the data into your R session, use test_schema_table() %>% collect()
Here is a microbenchmark I did to compare the results of both DBI and dbcooper.
mbm <- microbenchmark::microbenchmark(
DBI = DBI::dbFetch(DBI::dbSendQuery(conn,qry)),
dbcooper = ava_qry() %>% collect() , times=5
)
I managed to load and merge the 6 heavy Excel files I had from my RStudio instance (on an EC2 server) into one single table in PostgreSQL (linked with RDS).
Now this table has 14 columns and 2.4 million rows.
The size of the table in PostgreSQL is 1059MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with DBI package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing (to handle big tables like this) OK/good practice?
2) If yes to 1), is the only way to be able to perform the unique() operation or other similar operations to increase the EC2 server memory?
3) If yes to 2), how can I know to which extent should I increase the EC2 server memory?
Thanks!
dbReadTable converts the entire table to a data.frame, which is not what you want to do for such a big table.
As @cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery, dbBind, dbFetch or dbGetQuery.
For example, you could define a function to get the required data
filterBySQLString <- function(databaseDB, sqlString) {
  sqlString <- as.character(sqlString)
  dbResponse <- dbSendQuery(databaseDB, sqlString)
  requestedData <- dbFetch(dbResponse)
  dbClearResult(dbResponse)
  return(requestedData)
}
# write your query to get unique values
SQLquery <- "SELECT DISTINCT ...
FROM ..."
my_big_df <- filterBySQLString(myDB,SQLquery)
my_big_df <- unique(my_big_df)
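If even the filtered result is too large to hold at once, dbFetch can also be called repeatedly with a row limit. The following is only a sketch: it assumes an in-memory RSQLite database as a stand-in for the PostgreSQL connection, and the table contents and chunk size of 20 are arbitrary.

```r
library(DBI)

# Stand-in database (assumption); replace with your PostgreSQL connection.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "my_big_table",
             data.frame(id = rep(1:50, each = 2), x = 1))

# Let the database remove duplicates, then read the result in chunks.
res <- dbSendQuery(con, "SELECT DISTINCT id FROM my_big_table")
n_rows <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 20)   # at most 20 rows per call
  # Process or write out `chunk` here; this sketch only counts rows.
  n_rows <- n_rows + nrow(chunk)
}
dbClearResult(res)
dbDisconnect(con)
```

Only one chunk is held in memory at a time, as long as the loop body does not accumulate the chunks.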
If you cannot use SQL, then you have two options:
1) Stop using RStudio and try to run your code from the terminal or via Rscript.
2) Beef up your instance.
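For sizing the instance, one rough approach is to measure a sample of the table in R and extrapolate. This is a sketch under assumptions: sample_df is a made-up stand-in for, say, the first 10,000 rows of the real table, and the estimate ignores any overhead from the operations you run afterwards.

```r
# Made-up sample standing in for the first rows of the real table (assumption).
sample_df <- data.frame(
  id    = 1:10000,
  value = rnorm(10000),
  label = sample(c("a", "b", "c"), 10000, replace = TRUE),
  stringsAsFactors = FALSE
)

total_rows <- 2.4e6  # row count given in the question

# Average in-memory size per row of the sample, scaled to the full table.
bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
estimated_gb  <- bytes_per_row * total_rows / 1024^3

print(round(estimated_gb, 3))
```

Operations such as unique() can create additional copies of the data, so the memory actually needed may be a multiple of this figure.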
In R, I need to read a table from MySQL faster.
Currently this is used for the connection:
conn <- dbPool(
drv = RMySQL::MySQL(),
dbname = ............................)
and a construction like this is used for reading (it takes more than 10 seconds to execute for only ~100,000 rows):
my_query <- sqlInterpolate(conn,"SELECT * from Ttable")
result <- dbGetQuery(conn,my_query)
Is there any way to read the table faster?
You can use the following as a general rule for JDBC calls:
Make sure your JDBC driver supports configuring the fetch size.
The fetch size should be based on your memory settings (try various values).
Use prepared statements.
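The advice above is phrased for JDBC. A rough R counterpart with DBI could look like the sketch below. It assumes an in-memory RSQLite database as a stand-in for MySQL, a made-up grp column, and the `?` placeholder syntax; placeholder syntax and dbBind support vary by driver. Note that the n argument of dbFetch limits rows per call in R, which is related to, but not the same as, a driver-level fetch size.

```r
library(DBI)

# Stand-in database (assumption); replace with your MySQL connection.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "Ttable",
             data.frame(id = 1:10, grp = rep(c("a", "b"), 5)))

# Prepare the statement once, with a placeholder for the parameter.
res <- dbSendQuery(con, "SELECT id FROM Ttable WHERE grp = ? ORDER BY id")

# Bind a value and read the result in limited-size pieces.
dbBind(res, list("a"))
rows_a <- dbFetch(res, n = 3)  # first three matching rows
rest_a <- dbFetch(res)         # remaining rows for this binding

# Re-bind to reuse the same prepared statement with another value.
dbBind(res, list("b"))
rows_b <- dbFetch(res)

dbClearResult(res)
dbDisconnect(con)
```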
I've recently begun using RODBC to connect to PostgreSQL as I couldn't get RPostgreSQL to compile and run in Windows x64. I've found that read performance is similar between the two packages, but write performance is not. For example, using RODBC (where z is a ~6.1M row dataframe):
library(RODBC)
con <- odbcConnect("PostgreSQL84")
#autoCommit=FALSE seems to speed things up
odbcSetAutoCommit(con, autoCommit = FALSE)
system.time(sqlSave(con, z, "ERASE111", fast = TRUE))
user system elapsed
275.34 369.86 1979.59
odbcEndTran(con, commit = TRUE)
odbcCloseAll()
Whereas for the same ~6.1M row dataframe using RPostgreSQL (under 32-bit):
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="gisdb", user="postgres", password="...")
system.time(dbWriteTable(con, "ERASE222", z))
user system elapsed
467.57 56.62 668.29
dbDisconnect(con)
So, in this test, RPostgreSQL is about 3X as fast as RODBC in writing tables. This performance ratio seems to stay more-or-less constant regardless of the number of rows in the dataframe (but the number of columns has far less effect). I do notice that RPostgreSQL uses something like COPY <table> FROM STDIN while RODBC issues a bunch of INSERT INTO <table> (columns...) VALUES (...) queries. I also notice that RODBC seems to choose int8 for integers, while RPostgreSQL chooses int4 where appropriate.
I need to do this kind of dataframe copy often, so I would very sincerely appreciate any advice on speeding up RODBC. For example, is this just inherent in ODBC, or am I not calling it properly?
It seems there is no immediate answer to this, so I'll post a kludgy workaround in case it is helpful for anyone.
Sharpie is correct--COPY FROM is by far the fastest way to get data into Postgres. Based on his suggestion, I've hacked together a function that gives a significant performance boost over RODBC::sqlSave(). For example, writing a 1.1 million row (by 24 column) dataframe took 960 seconds (elapsed) via sqlSave vs 69 seconds using the function below. I wouldn't have expected this since the data are written once to disk then again to the db.
library(RODBC)
con <- odbcConnect("PostgreSQL90")
#create the table
createTab <- function(dat, datname) {
  # make an empty table, saving the trouble of making it by hand
  res <- sqlSave(con, dat[1, ], datname)
  res <- sqlQuery(con, paste("TRUNCATE TABLE", datname))

  # write the dataframe
  outfile <- paste(datname, ".csv", sep = "")
  write.csv(dat, outfile)
  gc() # don't know why, but memory is
       # not released after writing large csv?

  # now copy the data into the table. If this doesn't work,
  # be sure that postgres has read permissions for the path
  sqlQuery(con,
           paste("COPY ", datname, " FROM '",
                 getwd(), "/", datname,
                 ".csv' WITH NULL AS 'NA' DELIMITER ',' CSV HEADER;",
                 sep = ""))
  unlink(outfile)
}
odbcClose(con)
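To see what the function sends to the server without needing a database, the CSV and COPY steps can be tried on their own. This is only an illustrative sketch: build_copy_sql() is a hypothetical helper that mirrors the paste() call above, and the table name and data are made up.

```r
# Hypothetical helper mirroring the COPY statement built in createTab().
build_copy_sql <- function(table, csv_path) {
  paste0("COPY ", table, " FROM '", csv_path,
         "' WITH NULL AS 'NA' DELIMITER ',' CSV HEADER;")
}

# Made-up data with a missing value, written the same way as in createTab().
dat <- data.frame(a = c(1, NA), b = c("x", "y"))
outfile <- file.path(tempdir(), "demo_table.csv")
write.csv(dat, outfile)

# The statement that would be passed to sqlQuery().
copy_sql <- build_copy_sql("demo_table",
                           normalizePath(outfile, winslash = "/"))

# write.csv() puts row names in an unnamed first column by default.
header <- readLines(outfile, n = 1)
unlink(outfile)
```

As far as I can tell, the extra first column lines up with the rownames column that sqlSave() creates by default, which would explain why the column counts match in the function above. If you change one default, change the other too.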