rmongodb is very slow in creating a data.frame

I am using MongoDB to do tick data analysis in R. Initially I used MySQL, which worked fine, but I wanted to test MongoDB for this purpose. The data set contains about 200 million entries at the moment. Using RODBC I could get the query result into a data.frame very quickly with sqlQuery(conn, "select * from td where prd = 'TY' and date = '2012-01-03'").
In MongoDB I have documents like Document{{_id=5537ca647a3ad42a84374f0a, prd=TY, time=1325661600043, px=130.6875, sz=11}}
In Java I can retrieve a day's worth of tick data - roughly 100,000 entries - create Tick objects, and add them to an array, all in less than 2 seconds.
Using rmongodb, the below takes forever. Any ideas how to improve this?
query <- mongo.bson.from.list( list(product = "TY", date = as.POSIXct("2012-01-04")) )
res.cursor <- mongo.find(mongo, db.coll, query, limit = 100e3, options=mongo.find.exhaust)
resdf <- mongo.cursor.to.data.frame(res.cursor)
Using find.all is equally slow.
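One direction that may help (a hedged sketch of my own, not from the original post): mongo.cursor.to.data.frame is commonly reported as the bottleneck here because it grows the data.frame record by record, so iterating the cursor yourself into pre-allocated vectors and assembling the data.frame once at the end is usually much faster. The sketch reuses the mongo, db.coll and query objects above and assumes the fields of interest are time, px and sz, as in the sample document:
# iterate the cursor into pre-allocated vectors instead of growing a data.frame
n <- 100e3                                  # upper bound, matches the query limit
tm <- numeric(n); px <- numeric(n); sz <- numeric(n)
i <- 0
cur <- mongo.find(mongo, db.coll, query, limit = 100e3)
while (mongo.cursor.next(cur)) {
  i <- i + 1
  doc <- mongo.bson.to.list(mongo.cursor.value(cur))
  tm[i] <- doc$time
  px[i] <- doc$px
  sz[i] <- doc$sz
}
mongo.cursor.destroy(cur)
resdf <- data.frame(time = tm[seq_len(i)], px = px[seq_len(i)], sz = sz[seq_len(i)])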

Related

R: how to query from a database multiple times based on different dates

I have an R Markdown file in which I establish a database connection, query data, and store the data in a CSV file. The query is based on a specific date range. How can I automatically make multiple queries, one after another, so that e.g. each week of the year is queried from the database? I cannot make a single query for e.g. the whole year, because I need to store the data separately for each week. I could make a data frame with two columns for the start and end date of each week, which I would like to use for the query.
But how can I automatically run the queries multiple times depending on the date data frame?
My code so far:
# load libraries
library(DBI)
library(RPostgreSQL)
library(sf)          # for st_write
drv <- PostgreSQL()
db_con <- dbConnect(drv, host=my_host, user=my_user, dbname=my_name, port=my_port, password=my_password)
start = "2015-01-01"
end = "2015-01-02"
result <- dbGetQuery(
  db_con,
  "SELECT * FROM table WHERE date >= start AND date <= end;")
st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (wrapper to mapply) iteration:
db_con <- dbConnect(
  PostgreSQL(), host=my_host, user=my_user, dbname=my_name,
  port=my_port, password=my_password
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
  start=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="week"),
  end=seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by="week")
)
# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
  # PREPARED STATEMENT WITH PLACEHOLDERS
  sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"
  # BIND PARAMETERS AND QUERY
  stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
  result <- dbGetQuery(db_con, stmt)
  # WRITE DATA TO DISK
  st_write(result, pathname)
  # RETURN QUERY RESULTSET
  return(result)
}
# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
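One caveat with the setup above: pathname is fixed, so every call to query_db would write to the same file. Since the goal is to store each week separately, the write step needs a per-week file name. A hedged tweak (out_dir is a hypothetical output directory, and write.csv is used for a plain CSV since st_write is only needed for spatial data) would be to replace the write line inside query_db with:
  # WRITE EACH WEEK TO ITS OWN FILE
  out_file <- file.path(out_dir, paste0("week_", s, "_", e, ".csv"))
  write.csv(result, out_file, row.names = FALSE)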
I suggest DBI::dbBind and a data frame of bind parameters (similar to #Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
# n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like above, and allows the DBMS to optimize the query with the binding parameter ? instead of actual data. (Using sqlInterpolate, the DBMS would see four different queries. Using dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that, while it makes it really easy to get all of the data, nothing here inherently tells you which of your queries a particular row came from. I suspect you can determine that from your data, and that your main intent in breaking it down by week is to limit the amount retrieved per query.
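If that tagging matters, one hedged workaround (my own sketch, not part of the original answer) is to bind and fetch one parameter set at a time and label each chunk with its week:
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
out_list <- lapply(seq_len(nrow(ranges)), function(i) {
  DBI::dbBind(res, list(ranges$start[i], ranges$end[i]))   # one parameter set per fetch
  cbind(week_start = ranges$start[i], DBI::dbFetch(res))   # tag the rows with their week
})
DBI::dbClearResult(res)
out <- do.call(rbind, out_list)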
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
  # ...
  res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
  on.exit({
    suppressWarnings(DBI::dbClearResult(res))
  }, add = TRUE)
  DBI::dbBind(res, ranges)
  out <- DBI::dbFetch(res)
  DBI::dbClearResult(res)
  return(out)
}

RODBC sqlQuery returns different number of rows every time I run it

I have a simple SQL query that should return 74 million rows. I know so because I ran the same query in SSMS using count(*). However, when using RODBC sqlQuery in R, it returned fewer rows, and when I ran the same code many times I found that the returned data frame contains a different number of rows every time (ranging from 1.1m to 20m). I cannot replicate the SQL query exactly here due to confidentiality, but conceptually it is similar to the below.
con <- odbcConnect("db_name", uid="my_user_id", pwd="my_password")
df_2 <- sqlQuery(con, "SELECT *
                       FROM table_name
                       WHERE CountryName <> 'US' and Month in (1,2,3)",
                 stringsAsFactors = FALSE)
Why would the exact same code return different results every time? I tried running it more than 10 times.
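No accepted explanation is reproduced here, but one hedged diagnostic (my own suggestion, not from the original post) is to stop RODBC from silently swallowing a mid-fetch problem: ask sqlQuery for error codes and then inspect the ODBC error stack, since a partial result with a varying row count can mean the driver stopped the fetch early (e.g., a timeout or network hiccup):
df_2 <- sqlQuery(con, "SELECT *
                       FROM table_name
                       WHERE CountryName <> 'US' and Month in (1,2,3)",
                 stringsAsFactors = FALSE, errors = FALSE)
if (is.numeric(df_2) && length(df_2) == 1 && df_2 < 0) {
  # a negative return value signals an ODBC error when errors = FALSE
  print(odbcGetErrMsg(con))
} else {
  print(nrow(df_2))
}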

creating a looped SQL QUERY using RODBC in R

First and foremost - thank you for taking your time to view my question, regardless of if you answer or not!
I am trying to create a function that loops through my df and queries the necessary data from SQL using the RODBC package in R. However, I am having trouble setting up the query, since the parameters of the query change with each iteration (example below).
So my df looks like this:
ID Start_Date End_Date
1 2/2/2008 2/9/2008
2 1/1/2006 1/1/2007
1 5/7/2010 5/15/2010
5 9/9/2009 10/1/2009
How would I go about specifying the start date and end date in my SQL query?
Here's what I have so far:
data_pull <- function(df) {
  a <- data.frame()
  b <- data.frame()
  for (i in df$id)
  {
    dbconnection <- odbcDriverConnect(".....")
    query <- paste("Select ID, Date, Account_Balance from Table where ID = (",i,") and Date > (",df$Start_Date,") and Date <= (",df$End_Date,")")
    a <- sqlQuery(dbconnection, paste(query))
    b <- rbind(b,a)
  }
  return(b)
}
However, this doesn't query anything. I believe it has something to do with how I am specifying the start and the end date for each iteration.
If anyone can help on this it would be greatly appreciated. If you need further explanation, please don't hesitate to ask!
A couple of syntax issues arise from the current setup:
LOOP: You do not iterate through all rows of the data frame but only through the atomic ID values in the single column, df$ID. In that same loop you pass the entire vectors of df$Start_Date and df$End_Date into the query concatenation.
DATES: Your date formats do not align with the 'YYYY-MM-DD' format most databases expect. In still other databases, like Oracle, you need an explicit string-to-date conversion: TO_DATE(mydate, 'YYYY-MM-DD').
A couple of performance / best-practice issues also arise:
PARAMETERIZATION: While parameterization is not strictly needed for security here, since your values are not user input that could inject malicious SQL, parameterized queries are still advised for maintainability and readability. Hence, consider doing so.
GROWING OBJECTS: According to Patrick Burns's The R Inferno, Circle 2: Growing Objects, R programmers should avoid growing multi-dimensional objects like data frames inside a loop, which can cause excessive copying in memory. Instead, build a list of data frames to rbind once outside the loop (see the sketch right after this list).
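A minimal illustration of that list-then-rbind pattern (my own sketch, not part of the original answer; it still concatenates values into the SQL string purely to keep the example short, and it assumes Start_Date and End_Date have already been converted to Date class as in the block further below):
df_list <- lapply(seq_len(nrow(df)), function(r) {
  qry <- sprintf("SELECT ID, Date, Account_Balance FROM Table
                  WHERE ID = %s AND Date > '%s' AND Date <= '%s'",
                 df$ID[r], df$Start_Date[r], df$End_Date[r])
  sqlQuery(dbconnection, qry, stringsAsFactors = FALSE)
})
df_all <- do.call(rbind, df_list)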
With that said, you can avoid any looping or list building entirely by saving your data frame as a database table and joining it to the final table for a filtered query import. This assumes your database user has CREATE TABLE and DROP TABLE privileges.
# CONVERT DATE FIELDS TO DATE TYPE
df <- within(df, {
  Start_Date = as.Date(Start_Date, format="%m/%d/%Y")
  End_Date = as.Date(End_Date, format="%m/%d/%Y")
})
# SAVE DATA FRAME TO DATABASE
sqlSave(dbconnection, df, "myRData", rownames = FALSE, append = FALSE)
# IMPORT JOINED AND DATE FILTERED QUERY
q <- "SELECT t.ID, t.Date, t.Account_Balance
      FROM Table t
      INNER JOIN myRData r
        ON r.ID = t.ID
       AND t.Date BETWEEN r.Start_Date AND r.End_Date"
final_df <- sqlQuery(dbconnection, q)
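If the uploaded staging table is only needed for this one import, it can be dropped afterwards with RODBC's sqlDrop (a small follow-up suggestion, not part of the original answer):
# DROP THE TEMPORARY STAGING TABLE ONCE THE JOIN QUERY HAS RUN
sqlDrop(dbconnection, "myRData", errors = FALSE)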

Shiny + SQLite - why is Shiny extremely slow?

We have been developing a Shiny app for a few months now, but the app is extremely slow when it tries to load a huge amount of data. We even use the reactive function to reuse the data, but it is still as slow as before when we request different sets of data.
We have a log file, and it shows that Shiny takes at least 30.12672 seconds, and sometimes 52.24799 seconds, to load the data from our database each time.
What are the reasons that make Shiny so slow? Is it the server or the database? What can we do to speed it up?
We are using an SQLite database. Is that the reason Shiny is slow?
If so, what other type of database system should we go for to process huge data sets? Cassandra? HBase? Apache Spark?
EDIT:
For instance,
query <- "SELECT
s.timestamp,
s.particle_concentration as `PM2.5`,
n.code as site
FROM speckdata AS s
LEFT JOIN nodes AS n
ON n.nid = s.nid
AND n.datatype = 'speck'
WHERE strftime('%Y', s.localdate) = 'YEAR'
"
# Match the pattern and replace it.
dataQuery <- sub("YEAR", as.character(year), query)
# Store the result in data.
data = dbGetQuery(DB, dataQuery)
if(nrow(data) > 0) {
# Convert timestamp to date and bind it to the data.
data$date <- as.POSIXct(as.numeric(as.character(data$timestamp)), origin="1970-01-01", tz="GMT")
}
# Chosen to group the data in one panel.
timePlot(
data,
pollutant = c(species, condition),
avg.time = avg_time,
lwd = 2,
lty = 1,
name.pol = c(species_text_value, condition_text_value),
type = "site",
group = TRUE,
auto.text = FALSE
)
That is extremely slow in Shiny.
But when we query the data set using the SQLite manager, it only takes 1.9 seconds for 4719282 rows!
I would suggest testing the performance of your SQLite query directly against the database. If that's your slow point, you will want to optimize the query to make it more efficient. Before anyone can help further, it would be good to know exactly where the performance issue is.
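Two hedged things worth trying while narrowing that down (my own suggestions, not part of the original answer): first, time the query from the same R session the app uses, so database time is separated from plotting and rendering time; second, note that wrapping the column in strftime('%Y', s.localdate) prevents SQLite from using an index on localdate, so filtering by a plain date range (and indexing that column) usually lets the planner do far less work. The index name below is hypothetical and the range filter assumes localdate is stored as ISO-8601 text:
# 1. Measure the raw query time inside R, reusing DB and dataQuery from above.
print(system.time(data <- dbGetQuery(DB, dataQuery)))
# 2. Hypothetical index plus a range filter instead of strftime() on the column.
dbExecute(DB, "CREATE INDEX IF NOT EXISTS idx_speckdata_localdate ON speckdata(localdate)")
rangeQuery <- sprintf(
  "SELECT s.timestamp, s.particle_concentration AS `PM2.5`, n.code AS site
   FROM speckdata AS s
   LEFT JOIN nodes AS n ON n.nid = s.nid AND n.datatype = 'speck'
   WHERE s.localdate >= '%d-01-01' AND s.localdate < '%d-01-01'",
  as.integer(year), as.integer(year) + 1)
data <- dbGetQuery(DB, rangeQuery)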

RSQLite query with user specified variable in the WHERE field [duplicate]

This question already has answers here: Dynamic "string" in R (4 answers). Closed 5 years ago.
I am using the RSQLite library in R to manage a data set that is too large for RAM. For each regression I query the database to retrieve one fiscal year at a time. Right now I have the fiscal year hard-coded:
data.annual <- dbGetQuery(db, "SELECT * FROM annual WHERE fyear==2008")
I would like to make the fiscal year (2008 above) a variable, to make changes a bit easier (and fool-proof). Is there a way I can pass a variable into the SQL query string? I would love to use:
fiscal.year <- 2008
data.annual <- dbGetQuery(db, "SELECT * FROM annual WHERE fyear==fiscal.year")
SQLite will only see the string passed down for the query, so what you do is something like
sqlcmd <- paste("SELECT * FROM annual WHERE fiscal=", fiscal.year, sep="")
data.annual <- dbGetQuery(db, sqlcmd)
The nice thing is that you can use this in the usual way to unwind loops. Forgetting for a second that you have RAM restrictions, conceptually you can do
years <- seq(2000, 2010)
data <- lapply(years, function(y) {
  dbGetQuery(db, paste("SELECT * FROM annual WHERE fyear=", y, sep=""))
})
and now data is a list containing all your yearly data sets. Or, instead of keeping all the data, you could run your regression on each year and store only the summary object.
Dirk's answer is spot on. One little thing I try to do is change the formatting for easier testing, since it seems I have to cut and paste the SQL text into an SQL editor many times. So I format it like this:
sqlcmd <- paste("
SELECT *
FROM annual
WHERE fyear=
", fiscal.year, sep="")
data.annual <- dbGetQuery(db, sqlcmd)
This just makes it easier to cut and paste the SQL bits in and out for testing in your DB query environment. It's no big deal with a short query, but it can get cumbersome with a longer SQL string.
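For completeness, a more recent alternative (not part of the original answers, and assuming a reasonably current DBI/RSQLite) is to let the driver bind the value rather than pasting it into the string:
fiscal.year <- 2008
data.annual <- dbGetQuery(db, "SELECT * FROM annual WHERE fyear = ?",
                          params = list(fiscal.year))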
