Analyze big data in R on EC2 server - r

I managed to load and merge the 6 heavy Excel files I had from my RStudio instance (on an EC2 server) into one single table in PostgreSQL (linked with RDS).
Now this table has 14 columns and 2.4 million rows.
The size of the table in PostgreSQL is 1059MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with DBI package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing (to handle big tables like this) OK/good practice?
2) If yes to 1), is the only way to be able to perform the unique() operation or other similar operations to increase the EC2 server memory?
3) If yes to 2), how can I know by how much I should increase the EC2 server memory?
Thanks!

dbReadTable converts the entire table to a data.frame, which is not what you want to do for such a big table.
As @cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery, dbBind, dbFetch or dbGetQuery.
For example, you could define a function to get the required data
library(DBI)

# Run an SQL query and return the full result as a data.frame
filterBySQLString <- function(databaseDB, sqlString) {
  sqlString <- as.character(sqlString)
  dbResponse <- dbSendQuery(databaseDB, sqlString)
  requestedData <- dbFetch(dbResponse)
  # release the result set once the rows have been fetched
  dbClearResult(dbResponse)
  return(requestedData)
}
# write your query to get unique values
SQLquery <- "SELECT DISTINCT ...
FROM ..."
my_big_df <- filterBySQLString(myDB, SQLquery)
my_big_df <- unique(my_big_df)
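For queries that take a value from R, the dbSendQuery / dbBind / dbFetch combination mentioned above can be sketched roughly as follows. This is only an illustration: it assumes the RSQLite package and uses an in-memory SQLite database so that it runs without a server, and the table sales and its columns are made up. Placeholder syntax depends on the backend: RSQLite accepts ?, while RPostgres expects $1.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for the real PostgreSQL server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table, used only for this illustration
dbWriteTable(con, "sales", data.frame(
  region = c("north", "north", "south", "east"),
  amount = c(10, 20, 30, 40)
))

# Send the query with a placeholder, bind the value, then fetch the rows
res <- dbSendQuery(con, "SELECT region, amount FROM sales WHERE region = ?")
dbBind(res, list("north"))
north_df <- dbFetch(res)
dbClearResult(res)

dbDisconnect(con)
```

Only the rows matching the bound value are transferred to R, which is the point of filtering in SQL rather than after dbReadTable.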
If you cannot use SQL, then you have two options:
1) Stop using RStudio and try to run your code from the terminal or via Rscript.
2) Beef up your instance.
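Before resizing the instance, it may help to let the database do the de-duplication and to get a rough idea of how much memory the table would need in R. The sketch below is one way to do that, not a definitive recipe: it assumes the RSQLite package as a stand-in for PostgreSQL, the toy table contents are made up, and the estimate simply scales the size of a small sample by the row count.

```r
library(DBI)

# Assumption: in-memory SQLite replaces the real PostgreSQL connection
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical miniature version of my_big_table, with duplicate rows
dbWriteTable(con, "my_big_table", data.frame(
  id = c(1, 1, 2, 2, 3),
  label = c("a", "a", "b", "b", "c")
))

# Let the database remove duplicates; only distinct rows reach R
distinct_df <- dbGetQuery(con, "SELECT DISTINCT * FROM my_big_table")

# Rough memory estimate: measure a small sample and scale by the row count
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM my_big_table")$n
sample_df <- dbGetQuery(con, "SELECT * FROM my_big_table LIMIT 1000")
bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
estimated_mb <- bytes_per_row * n_rows / 1024^2

dbDisconnect(con)
```

The estimate is approximate, and operations such as unique() typically need working memory on top of the size of the data itself, so it is best read as a lower bound.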

Related

How to create non-clustered indexes for querying and collecting data from an SQLite DB in R for plotting?

I have a .csv file that contains 105M rows and 30 columns that I would like to query for plotting in an R shiny app.
it contains alpha-numeric data that looks like:
#Example data
df <- data.frame(percent = as.numeric(rep(c("50","80"), each = 5e2)),
                 maskProportion = as.numeric(rep(c("50","80"), each = 5e2)),
                 dose = runif(1e3),
                 origin = as.factor(rep(c("ABC","DEF"), each = 5e2)),
                 destination = as.factor(rep(c("XYZ","GHI"), each = 5e2))
)
write.csv(df, "PassengerData.csv")
In the terminal, I have ingested it into an SQLite database as follows:
$ sqlite3 -csv PassengerData.sqlite3 '.import PassengerData.csv df'
which is from:
Creating an SQLite DB in R from an CSV file: why is the DB file 0KB and contains no tables?
So far so good.
The problem I have is speed in querying in R so I tried indexing the DB back in the terminal.
In sqlite3, I tried creating indexes on percent, maskProportion, origin and destination following this link https://data.library.virginia.edu/creating-a-sqlite-database-for-use-with-r/ :
$ sqlite3 create index "percent" on PassengerData("percent");
$ sqlite3 create index "origin" on PassengerData("origin");
$ sqlite3 create index "destination" on PassengerData("destination");
$ sqlite3 create index "maskProp" on PassengerData("maskProp");
I run out of disk space because my DB seems to grow in size every time I make an index. E.g. after running the first command the size is 20GB. How can I avoid this?
I assume the concern is that running collect() to transfer data from SQL to R is too slow for your app. It is not clear how / whether you are processing the data in SQL before passing to R.
Several things to consider:
Indexes are not copied from SQL to R. SQL works with data off disk, so knowing where to look up specific parts of your data results in time savings. R works on data in memory, so indexes are not required.
collect transfers data from a remote table (in this case SQLite) into R memory. If your goal is to transfer data into R, you could read the csv directly into R instead of writing it to SQL and then reading from SQL into R.
SQL is a better choice for doing data crunching / preparation of large datasets, and R is a better choice for detailed analysis and visualisation. But if both R and SQL are running on the same machine then both are limited by the CPU speed. This is not a concern if SQL and R are running on separate hardware.
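On the indexing part of the question: each SQLite index is stored inside the database file as its own sorted copy of the indexed column(s), which is why the file grows with every index. A minimal sketch of creating just one index from R and checking that a query uses it is shown below. It assumes the DBI and RSQLite packages, an in-memory database, and a made-up miniature version of the df table; the index name idx_origin is hypothetical.

```r
library(DBI)

# Assumption: in-memory SQLite database with a small stand-in for the df table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "df", data.frame(
  origin = rep(c("ABC", "DEF"), each = 500),
  destination = rep(c("XYZ", "GHI"), each = 500),
  dose = runif(1000)
))

# Index only the column the app actually filters on
dbExecute(con, "CREATE INDEX idx_origin ON df (origin)")

# Ask SQLite how it would run a filtered query
plan <- dbGetQuery(con, "EXPLAIN QUERY PLAN SELECT * FROM df WHERE origin = 'ABC'")
print(plan$detail)

# An index that turns out to be unused can be dropped again
dbExecute(con, "DROP INDEX idx_origin")
dbDisconnect(con)
```

If the plan does not mention the index for the queries the app runs, that index only costs disk space and is a candidate for removal.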
Some things you can do to improve performance:
(1) Only read the data you need from SQL into R. Prepare the data in SQL first. For example, contrast the following:
# collect last
local_r_df = remote_sql_df %>%
group_by(origin) %>%
summarise(number = n()) %>%
collect()
# collect first
local_r_df = remote_sql_df %>%
collect() %>%
group_by(origin) %>%
summarise(number = n())
Both of these will produce the same output. However, in the first example, the summary takes place in SQL and only the final result is copied to R; while in the second example, the entire table is copied to R where it is then summarized. Collect last will likely have better performance than collect first because it transfers only a small amount of data between SQL and R.
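To make the "collect last" pattern concrete, here is a small runnable sketch. It assumes the dplyr, dbplyr, DBI and RSQLite packages, with an in-memory SQLite database and a made-up passengers table standing in for the real data. show_query() prints the SQL that dbplyr generates, so you can confirm that the grouping happens in the database.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite database with a hypothetical passengers table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "passengers", data.frame(
  origin = rep(c("ABC", "DEF"), each = 500),
  dose = runif(1000)
))

remote_sql_df <- tbl(con, "passengers")

# Build the summary lazily; nothing is transferred to R yet
summary_query <- remote_sql_df %>%
  group_by(origin) %>%
  summarise(number = n())

# Inspect the generated SQL (a GROUP BY with a COUNT)
show_query(summary_query)

# Only the two summary rows are copied into R
local_r_df <- collect(summary_query)

DBI::dbDisconnect(con)
```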
(2) Preprocess the data for your app. If your app will only examine the data from a limited number of directions, then the data could be preprocessed / pre-summarized.
For example, suppose users can pick at most two dimensions and receive a cross-tab, then you could calculate all the two-way cross-tabs and save them. This is likely to be much smaller than the entire database. Then at runtime, your app loads the prepared summaries and shows the user any summary they request. This will likely be much faster.
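A minimal sketch of the pre-summarizing idea is given below, assuming the DBI and RSQLite packages and a made-up miniature df table. The dimension names, the n column, and the crosstabs.rds file name are illustrative choices, not something taken from the original app.

```r
library(DBI)

# Assumption: in-memory SQLite database with a small stand-in for the df table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "df", data.frame(
  percent = rep(c(50, 80), each = 500),
  origin = rep(c("ABC", "DEF"), each = 500),
  destination = rep(c("XYZ", "GHI"), times = 500)
))

# All pairs of dimensions a user could pick
dims <- c("percent", "origin", "destination")
dim_pairs <- combn(dims, 2, simplify = FALSE)

# One GROUP BY query per pair, run inside the database.
# The column names come from the fixed vector above, never from user input.
crosstabs <- lapply(dim_pairs, function(p) {
  sql <- sprintf("SELECT %s, %s, COUNT(*) AS n FROM df GROUP BY %s, %s",
                 p[1], p[2], p[1], p[2])
  dbGetQuery(con, sql)
})
names(crosstabs) <- vapply(dim_pairs, paste, character(1), collapse = "_by_")
dbDisconnect(con)

# Save once; the app would read this small file at startup
summary_file <- file.path(tempdir(), "crosstabs.rds")
saveRDS(crosstabs, summary_file)
loaded <- readRDS(summary_file)
```

At runtime the app would look up something like loaded[["percent_by_origin"]] instead of querying the full table.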

Speed up odbc::dbFetch

I'm trying to analyze data stored in an SQL database (MS SQL server) in R, and on a mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.
However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly: https://stackoverflow.com/a/59220710/8400969. I'm wondering if there is another way to more quickly pipe the data to my mac, and I like the line of thinking here: Quickly reading very large tables as dataframes
What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?
Thanks for your ideas and suggestions!
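For reference, the chunked fetching mentioned in the question can be sketched as below. The DBI calls are the same ones the odbc package implements, but the example assumes RSQLite and an in-memory database so that it runs without SQL Server; the readings table and the chunk size of 1000 are made up. Each chunk is reduced as it arrives instead of being accumulated, so the full result never has to sit in memory at once. Whether this is faster depends on the driver and the network.

```r
library(DBI)

# Assumption: in-memory SQLite database stands in for the SQL Server connection
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "readings", data.frame(
  id = 1:2500,
  value = rep(c(1, 2), length.out = 2500)
))

res <- dbSendQuery(con, "SELECT id, value FROM readings")

n_rows <- 0
total <- 0
while (!dbHasCompleted(res)) {
  # Fetch at most 1000 rows per call
  chunk <- dbFetch(res, n = 1000)
  # Reduce the chunk immediately and let it be garbage collected
  n_rows <- n_rows + nrow(chunk)
  total <- total + sum(chunk$value)
}

dbClearResult(res)
dbDisconnect(con)
```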
My answer includes use of a different package. I use RODBC, which is available on CRAN at https://cran.r-project.org/web/packages/RODBC/index.html.
This has saved me SO MUCH frustration and wasted time that came from my previous method of exporting each query result to .csv to load it into my R environment. I found regular ODBC to be much slower than RODBC.
I use the following functions:
sqlQuery() wraps the function that opens the connection to the SQL db with the first argument (in parentheses) and the query itself as the second argument. Put the query itself in quote marks.
odbcConnect() is itself the first argument in sqlQuery(). The argument in odbcConnect() is the name of your connection to the SQL db. Put the connection name in quote marks.
odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.
Here is a simple example.
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498")
odbcCloseAll()
Here is the same example PLUS data manipulation directly from the query result.
library(dplyr)
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498") %>%
mutate(matchid = paste0(schoolID, "-", studentID)) %>%
distinct(matchid, .keep_all = TRUE)
odbcCloseAll()
I would suggest using the dbcooper package, found on GitHub: https://github.com/chriscardillo/dbcooper
I have found huge improvements in speed when querying large datasets.
First, add your connection to your environment.
conn <- DBI::dbConnect(odbc::odbc(),
Driver = "",
Server = "",
Database = "",
UID="",
PWD="")
devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con = conn,
con_id = "test",
tables = c("schema.table"))
This adds the function test_schema_table() to your environment, which is used to call the data. To collect into your environment, use test_schema_table() %>% collect()
Here is a microbenchmark I did to compare the results of both DBI and dbcooper.
mbm <- microbenchmark::microbenchmark(
DBI = DBI::dbFetch(DBI::dbSendQuery(conn,qry)),
dbcooper = ava_qry() %>% collect() , times=5
)
Here are the results of a microbenchmark I did to compare DBI with dbcooper.

How to import large Database Table into R

I'm trying to import a large Database table into R to do some global analysis.
I connect to an Oracle DB with ROracle and use dbGetQuery.
I make a minimal selection and put the necessary WHERE clauses directly in the query to reduce the scope of the dataset, but it is still 40 columns by 12 million rows.
My PC has only 8GB of RAM. How can I handle this?
Is there a way to store the data on disk rather than in RAM, or something similar?
The same thing done in SAS works fine.
Any ideas?
A few ideas:
Maybe some aggregation could be done on the server side?
You are going to do something with this data in R, right? So you can try not to load the data, but instead create a tbl object and do the manipulations and aggregations from R:
library(dplyr)
my_tbl <- 'SELECT ... FROM ...' %>% sql() %>% tbl(con, .)
where con is your connection
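As a rough illustration of how such a tbl object can then be used, the sketch below builds a lazy table from a SQL string, aggregates it, and only then collects the result. It assumes the dplyr, dbplyr, DBI and RSQLite packages, with an in-memory SQLite database in place of Oracle; the measurements table and its columns are made up.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite database stands in for the Oracle connection
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "measurements", data.frame(
  site = rep(c("A", "B", "C"), each = 4),
  year = rep(2019:2022, times = 3),
  value = 1:12
))

# Lazy table defined by a SQL query; no rows are loaded at this point
my_tbl <- tbl(con, sql("SELECT site, value FROM measurements WHERE year >= 2020"))

# The aggregation is translated to SQL and executed by the database;
# collect() brings back one row per site
site_means <- my_tbl %>%
  group_by(site) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  arrange(site) %>%
  collect()

DBI::dbDisconnect(con)
```

Only the aggregated rows occupy RAM on the client, however many rows the underlying query covers.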
Here are a couple ideas for you to consider.
library(RODBC)
dbconnection <- odbcDriverConnect("Driver=ODBC Driver 11 for SQL Server;Server=Server_Name; Database=DB_Name;Uid=; Pwd=; trusted_connection=yes")
initdata <- sqlQuery(dbconnection, "select * from MyTable Where Name = 'Asher';")
odbcClose(dbconnection)
If you can export the table as a CSV file...
require(sqldf)
df <- read.csv.sql("C:\\your_path\\CSV1.csv", "select * from file where Name='Asher'")
df

Joining across databases with dbplyr

I am working with database tables with dbplyr.
I have a local table and want to join it with a large (150m rows) table on the database.
The database PRODUCTION is read-only.
# Set up the connection and point to the table
library(DBI); library(odbc); library(dplyr); library(dbplyr)
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;DATABASE=PRODUCTION;UID=",
t2690_username,";PWD=",t2690_password, sep="")
t2690 <- dbConnect(odbc::odbc(), .connection_string=my_conn_string)
order_line <- tbl(t2690, "order_line") #150m rows
I also have a local table, let's call it orders
# fill df with random data
orders <- data.frame(matrix(rexp(50), nrow = 100000, ncol = 5))
names(orders) <- c("customer_id", paste0(rep("variable_", 4), 1:4))
let's say I wanted to join these two tables, I get the following error:
complete_orders <- orders %>% left_join(order_line)
> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)
The issue is, if I were to set copy = TRUE, it would try to download the whole of order_line and my computer would quickly run out of memory
Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
The only solution I have found is to upload into the writable database and use sql to join them, which is far from ideal
I have found the answer: you can use the in_schema() function within the tbl pointer to work across schemas within the same connection.
# Connect without specifying a database
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;UID=",
t2690_username,";PWD=",t2690_password, sep="")
t2690 <- dbConnect(odbc::odbc(), .connection_string=my_conn_string)
# Upload the local table to the TEMP db then point to it
orders <- tbl(t2690, in_schema("TEMP", "orders"))
order_line <- tbl(t2690, in_schema("PRODUCTION", "order_line"))
complete_orders <- orders %>% left_join(order_line)
> Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
In your use case, it seems (based on the accepted answer) that your databases are on the same server and it's just a matter of using in_schema. If this were not the case, another approach would be that given here, which in effect gives a version of copy_to that works on a read-only connection.
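The same-connection approach can be tried out locally with the sketch below. It is an analogy rather than the Teradata setup: it assumes the dplyr, dbplyr, DBI and RSQLite packages, and uses SQLite's built-in main and temp schemas in place of PRODUCTION and TEMP. The table contents are made up.

```r
library(dplyr)
library(dbplyr)

# Assumption: one in-memory SQLite connection; "main" plays the role of
# PRODUCTION and "temp" plays the role of the writable TEMP database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Stand-in for the large table
DBI::dbWriteTable(con, "order_line", data.frame(
  customer_id = c(1, 1, 2, 3),
  item = c("pen", "ink", "pad", "cap")
))

# Upload the small local table as a temporary table (it lands in "temp")
orders_local <- data.frame(customer_id = c(1, 2), variable_1 = c(0.5, 1.5))
DBI::dbWriteTable(con, "orders", orders_local, temporary = TRUE)

# Point to both tables through the same connection
orders <- tbl(con, in_schema("temp", "orders"))
order_line <- tbl(con, in_schema("main", "order_line"))

# The join runs in the database; only the joined rows are collected
complete_orders <- orders %>%
  left_join(order_line, by = "customer_id") %>%
  collect()

DBI::dbDisconnect(con)
```

Because both tbl objects share one connection, dplyr no longer raises the "must share the same src" error and the large table is never downloaded.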

Query MS SQL using R with criteria from an R data frame

I have a rather large table in MS SQL Server (120 million rows) which I would like to query. I also have a dataframe in R that has unique IDs that I would like to use as part of my query criteria. I am familiar with the dplyr package but not sure if it's possible to have the R query execute on the MS SQL server rather than bring all the data into my laptop's memory (which would likely crash my laptop).
Of course, the other option is to load the dataframe onto SQL as a table, which is what I am currently doing, but I would prefer not to do this.
Depending on what exactly you want to do, you may find value in the RODBCext package.
Let's say you want to pull columns from an MS SQL table where IDs are in a vector that you have in R. You might try code like this:
library(RODBC)
library(RODBCext)
library(tidyverse)
dbconnect <- odbcDriverConnect('driver={SQL Server};
server=servername;database=dbname;trusted_connection=true')
v1 <- c(34,23,56,87,123,45)
qdf <- data.frame(idlist=v1)
sqlq <- "SELECT * FROM tablename WHERE idcol IN ( ? )"
qr <- sqlExecute(dbconnect,sqlq,qdf,fetch=TRUE)
Basically you want to put all the info you want to pass to the query into a dataframe. Think of it like variables or parameters for your query; for each parameter you want a column in a dataframe. Then you write the query as a character string and store it in a variable. You put it all together using the sqlExecute function.
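Since the question mentions dplyr, a dbplyr-based alternative may also be worth trying: filtering a remote table with %in% and a local vector is translated into a SQL IN (...) clause, so the filter runs on the server. The sketch below assumes the dplyr, dbplyr, DBI and RSQLite packages and uses an in-memory SQLite database in place of MS SQL Server; the table contents are made up.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite database stands in for the MS SQL Server table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "tablename", data.frame(
  idcol = 1:200,
  payload = sprintf("row_%d", 1:200)
))

# Local data frame holding the IDs of interest
ids <- data.frame(idlist = c(34L, 23L, 56L, 87L, 123L, 45L))
id_vector <- ids$idlist

# `!!` inlines the local vector into the generated SQL as IN (...)
matches_query <- tbl(con, "tablename") %>%
  filter(idcol %in% !!id_vector)

show_query(matches_query)

# Only the matching rows are transferred to R
matches <- collect(matches_query)

DBI::dbDisconnect(con)
```

A very long ID vector produces a very long SQL statement, so for many thousands of IDs uploading them as a table and joining may remain the more practical route.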

Resources