RSQLite query with a user-specified variable in the WHERE field [duplicate]

This question already has answers here:
Dynamic "string" in R
(4 answers)
Closed 5 years ago.
I am using the RSQLite library in R to manage a data set that is too large for RAM. For each regression, I query the database to retrieve one fiscal year at a time. Right now the fiscal year is hard-coded:
data.annual <- dbGetQuery(db, "SELECT * FROM annual WHERE fyear==2008")
I would like to make the fiscal year (2008 above) a variable, to make changes a bit easier (and fool-proof). Is there a way I can pass a variable into the SQL query string? I would love to use:
fiscal.year <- 2008
data.annual <- dbGetQuery(db, "SELECT * FROM annual WHERE fyear==fiscal.year")

SQLite will only see the string passed down for the query, so what you do is something like
sqlcmd <- paste("SELECT * FROM annual WHERE fiscal=", fiscal.year, sep="")
data.annual <- dbGetQuery(db, sqlcmd)
The nice thing is that you can use this the usual way to unwind loops. Forgetting for a second that you have RAM restrictions, conceptually you can do
years <- seq(2000, 2010)
data <- lapply(years, function(y) {
    dbGetQuery(db, paste("SELECT * FROM annual WHERE fiscal=", y, sep=""))
})
and now data is a list containing all your yearly data sets. Or you could keep the data, run your regression and only store the summary object.
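As a side note, recent versions of DBI and RSQLite also support parameterized queries, which avoid string-pasting (and its quoting pitfalls) altogether. A minimal sketch, using a made-up in-memory table in place of the real annual data:

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the real 'annual' table
db <- dbConnect(SQLite(), ":memory:")
dbWriteTable(db, "annual",
             data.frame(fyear = c(2007, 2008, 2008), val = c(1, 2, 3)))

fiscal.year <- 2008
# The driver substitutes the value for the '?' placeholder
data.annual <- dbGetQuery(db, "SELECT * FROM annual WHERE fyear = ?",
                          params = list(fiscal.year))
nrow(data.annual)  # 2

dbDisconnect(db)
```

For numeric values either approach is fine; parameterization matters more once string values (which need quoting and escaping) enter the query.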

Dirk's answer is spot on. One little thing I try to do is change the formatting for easy testing, since I often end up cutting and pasting the SQL text into an SQL editor. So I format like this:
sqlcmd <- paste("
SELECT *
FROM annual
WHERE fiscal=
", fiscal.year, sep="")
data.annual <- dbGetQuery(db, sqlcmd)
This just makes it easier to cut and paste the SQL bits in/out for testing in your DB query environment. No big deal with a short query, but can get cumbersome with a longer SQL string.

Related

RODBC sqlQuery returns different number of rows every time I run it

I have a simple SQL query that should return 74m rows. I know so because I ran the same query in SSMS using count(*). However, when using RODBC's sqlQuery in R, it returned fewer rows, and running the same code many times produced a data frame with a different number of rows each time (ranging from 1.1m to 20m). I cannot replicate the SQL query exactly here due to confidentiality, but conceptually it is similar to the below.
con <- odbcConnect("db_name", uid="my_user_id", pwd="my_password")
df_2 <- sqlQuery(con, paste("SELECT *
FROM table_name
WHERE CountryName <> 'US' and Month in (1,2,3)"),
stringsAsFactors = FALSE)
Why would the exact same code return different results every time? I tried running it more than 10 times.

creating a looped SQL QUERY using RODBC in R

First and foremost - thank you for taking your time to view my question, regardless of if you answer or not!
I am trying to create a function that loops through my df and queries the necessary data from SQL using the RODBC package in R. However, I am having trouble setting up the query, since its parameters change with each iteration (example below).
So my df looks like this:
ID Start_Date End_Date
1 2/2/2008 2/9/2008
2 1/1/2006 1/1/2007
1 5/7/2010 5/15/2010
5 9/9/2009 10/1/2009
How would I go about specifying the start date and end date in my SQL program?
Here's what I have so far:
data_pull <- function(df) {
a <- data.frame()
b <- data.frame()
for (i in df$id)
{
dbconnection <- odbcDriverConnect(".....")
query <- paste("Select ID, Date, Account_Balance from Table where ID = (",i,") and Date > (",df$Start_Date,") and Date <= (",df$End_Date,")")
a <- sqlQuery(dbconnection, paste(query))
b <- rbind(b,a)
}
return(b)
}
However, this doesn't query in anything. I believe it has something to do with how I am specifying the start and the end date for the iteration.
If anyone can help on this it would be greatly appreciated. If you need further explanation, please don't hesitate to ask!
A couple of syntax issues arise from the current setup:
LOOP: You do not iterate through the rows of the data frame but only through the atomic ID values in the single column, df$ID. In that same loop you pass the entire vectors df$Start_Date and df$End_Date into the query concatenation.
DATES: Your date format does not align with the 'YYYY-MM-DD' format most databases expect. Some databases, like Oracle, additionally require an explicit string-to-date conversion: TO_DATE(mydate, 'YYYY-MM-DD').
And a couple of performance / best-practice issues:
PARAMETERIZATION: While parameterization is not needed here for security (your values are not user input that could inject malicious SQL), parameterized queries are still advised for maintainability and readability.
GROWING OBJECTS: According to Circle 2 (Growing Objects) of Patrick Burns's The R Inferno, R programmers should avoid growing multi-dimensional objects like data frames inside a loop, which causes excessive copying in memory. Instead, build a list of data frames and rbind them once outside the loop.
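Putting those fixes together, a corrected version of the loop might look like the sketch below (not runnable as-is: the connection string is elided just as in the question, and the table and column names are the question's placeholders):

```r
library(RODBC)

data_pull <- function(df) {
  # Connect once, not once per iteration; connection string elided
  dbconnection <- odbcDriverConnect(".....")
  on.exit(odbcClose(dbconnection))

  # Iterate over ROWS, picking scalar values from each row
  results <- lapply(seq_len(nrow(df)), function(i) {
    query <- sprintf(
      "SELECT ID, Date, Account_Balance FROM Table
       WHERE ID = %d
         AND Date >  '%s'
         AND Date <= '%s'",
      df$ID[i],
      format(as.Date(df$Start_Date[i], "%m/%d/%Y"), "%Y-%m-%d"),
      format(as.Date(df$End_Date[i],   "%m/%d/%Y"), "%Y-%m-%d"))
    sqlQuery(dbconnection, query, stringsAsFactors = FALSE)
  })

  # One rbind at the end, not one per iteration
  do.call(rbind, results)
}
```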
With that said, you can avoid any looping or listing needs by saving your data frame as a database table then joined to final table for a filtered, join query import. This assumes your database user has CREATE TABLE and DROP TABLE privileges.
# CONVERT DATE FIELDS TO DATE TYPE
df <- within(df, {
Start_Date = as.Date(Start_Date, format="%m/%d/%Y")
End_Date = as.Date(End_Date, format="%m/%d/%Y")
})
# SAVE DATA FRAME TO DATABASE
sqlSave(dbconnection, df, "myRData", rownames = FALSE, append = FALSE)
# IMPORT JOINED AND DATE FILTERED QUERY
q <- "SELECT ID, Date, Account_Balance
FROM Table t
INNER JOIN myRData r
ON r.ID = t.ID
AND t.Date BETWEEN r.Start_Date AND r.End_Date"
final_df <- sqlQuery(dbconnection, q)

looping a gsub to pull from hana based on a table of values

I hope my title makes sense! If not feel free to edit it.
I have a table in R that contains unique dates. Sometimes this table may have one date; at other times it may have multiple dates. I would like to loop these unique dates into an SQL query I have created to pull data and append to px_tbl. I am at a loss, however, as to where to start. Below is what I have so far; it obviously works when I have only 1 unique date, but when the table contains 2 dates it doesn't pull.
unique_dates_df
DATE
2016-12-15
2017-02-15
2017-03-02
2017-03-09
sqlCMD_px <- 'SELECT *
FROM "_SYS_BIC"."My.Table/PRICE"
(\'PLACEHOLDER\' = (\'$$P_EFF_DATE$$\',\'%D\'))'
sqlCMD_px <- gsub("%D", unique_dates_tbl, sqlCMD_px) ## <- the gsub is needed so that the dates are formatted correctly for the SQL pull
px_tbl <- sqlQuery(myconn, sqlCMD_px)
I am convinced that an apply function will work in one form or another but haven't been able to figure it out. Thanks for the help!
This should work:
#SQL command template
sqlCmdTemp <- 'SELECT *
FROM "_SYS_BIC"."My.Table/PRICE"
(\'PLACEHOLDER\' = (\'$$P_EFF_DATE$$\',\'%D\'))'
#Dates as character
unique_dates <- c("2017-03-08","2017-03-09", "2017-03-10")
#sapply command
res <- sapply(unique_dates, function(d) {
    sqlQuery(conn, gsub("%D", d, sqlCmdTemp))
}, simplify = FALSE)
#bind rows
tbl.df <- do.call(rbind, res)

rmongodb is very slow in creating data.frame

I am using MongoDB to do tick data analysis in R. Initially I used MySQL, which worked fine, but I wanted to test MongoDB for this purpose. The data set contains about 200 million entries at the moment. Using RODBC I could get the query result into a data.frame very quickly using sqlQuery(conn, "select * from td where prd = 'TY' and date = '2012-01-03'")
In MongoDB I have documents like Document{{_id=5537ca647a3ad42a84374f0a, prd=TY, time=1325661600043, px=130.6875, sz=11}}
In Java I can retrieve a day's worth of tick data - roughly 100,000 entries - create Tick objects and add them to an array, all in less than 2 seconds.
Using rmongodb, the below takes forever. Any ideas how to improve this?
query <- mongo.bson.from.list( list(product = "TY", date = as.POSIXct("2012-01-04")) )
res.cursor <- mongo.find(mongo, db.coll, query, limit = 100e3, options=mongo.find.exhaust)
resdf <- mongo.cursor.to.data.frame(res.cursor)
Using find.all is equally slow.

Add a dynamic value into RMySQL getQuery [duplicate]

This question already has answers here:
Dynamic "string" in R
(4 answers)
Closed 5 years ago.
Is it possible to pass a value into the query in dbGetQuery from the RMySQL package?
For example, if I have a set of values in a character vector:
df <- c('a','b','c')
And I want to loop through the values to pull out a specific value from a database for each.
library(RMySQL)
res <- dbGetQuery(con, "SELECT max(ID) FROM table WHERE columna='df[2]'")
When I try to add the reference to the value I get an error. Wondering if it is possible to add a value from an R object in the query.
One option is to manipulate the SQL string within the loop. At the moment you have a string literal, the 'df[2]' is not interpreted by R as anything other than characters. There are going to be some ambiguities in my answer, because df in your Q is patently not a data frame (it is a character vector!). Something like this will do what you want.
Store the output in a numeric vector:
require(RMySQL)
df <- c('a','b','c')
out <- numeric(length(df))
names(out) <- df
Now we can loop over the elements of df to execute your query three times. We can set the loop up two ways: i) with i as a number which we use to reference the elements of df and out, or ii) with i as each element of df in turn (i.e. a, then b, ...). I will show both versions below.
## Version i
for(i in seq_along(df)) {
    SQL <- paste("SELECT max(ID) FROM table WHERE columna='", df[i], "';", sep = "")
    out[i] <- dbGetQuery(con, SQL)[1, 1]
}
dbDisconnect(con)
OR:
## Version ii
for(i in df) {
    SQL <- paste("SELECT max(ID) FROM table WHERE columna='", i, "';", sep = "")
    out[i] <- dbGetQuery(con, SQL)[1, 1]
}
dbDisconnect(con)
Which you use will depend on personal taste. The second (ii) version requires the names set on the output vector out to match the values in df. Note also that in both versions the connection should only be closed (dbDisconnect) after the loop finishes, and dbGetQuery returns a one-cell data frame, so we extract the scalar with [1, 1] before storing it in out.
Having said all that, assuming your actual SQL query is similar to the one you posted, couldn't you do this in a single SQL statement, using the GROUP BY clause to group the data before computing max(ID)? Doing simple things like this in the database will likely be much quicker. Unfortunately, I don't have a MySQL instance around to play with and my SQL-fu is currently weak, so I can't give an example of this.
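For what it's worth, the single-statement version might look roughly like this (a sketch only: the table and column names are the question's placeholders, and the values are assumed not to contain quote characters):

```r
df <- c('a', 'b', 'c')

# Build one query that lets the database do the grouping
in_list <- paste0("'", df, "'", collapse = ", ")
SQL <- paste0("SELECT columna, max(ID) AS max_id FROM table ",
              "WHERE columna IN (", in_list, ") GROUP BY columna;")
SQL
# "SELECT columna, max(ID) AS max_id FROM table WHERE columna IN ('a', 'b', 'c') GROUP BY columna;"

# One round trip instead of three:
# res <- dbGetQuery(con, SQL)   # one row per value of columna
```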
You could also use the sprintf command to solve the issue (it's what I use when building Shiny Apps).
df <- c('a','b','c')
res <- dbGetQuery(con, sprintf("SELECT max(ID) FROM table WHERE columna='%s'", df[2]))
Something along those lines should work.
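Since sprintf is vectorized, you can also build all the queries at once and then loop over them (table and column names again taken from the question):

```r
df <- c('a', 'b', 'c')

# sprintf recycles the template over the whole vector
queries <- sprintf("SELECT max(ID) FROM table WHERE columna='%s'", df)
queries[2]
# "SELECT max(ID) FROM table WHERE columna='b'"

# Then, e.g.:
# res <- lapply(queries, function(q) dbGetQuery(con, q))
```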
