Creating a looped SQL query using RODBC in R

First and foremost - thank you for taking your time to view my question, regardless of if you answer or not!
I am trying to create a function that loops through my df and queries the necessary data from SQL using the RODBC package in R. However, I am having trouble setting up the query, since the parameters of the query change on each iteration (example below).
So my df looks like this:
ID  Start_Date  End_Date
1   2/2/2008    2/9/2008
2   1/1/2006    1/1/2007
1   5/7/2010    5/15/2010
5   9/9/2009    10/1/2009
How would I go about specifying the start date and end date in my SQL program?
Here's what I have so far:
data_pull <- function(df) {
  a <- data.frame()
  b <- data.frame()
  for (i in df$id)
  {
    dbconnection <- odbcDriverConnect(".....")
    query <- paste("Select ID, Date, Account_Balance from Table where ID = (",i,") and Date > (",df$Start_Date,") and Date <= (",df$End_Date,")")
    a <- sqlQuery(dbconnection, paste(query))
    b <- rbind(b,a)
  }
  return(b)
}
However, this doesn't pull in anything. I believe it has something to do with how I am specifying the start and end dates for each iteration.
If anyone can help on this it would be greatly appreciated. If you need further explanation, please don't hesitate to ask!

A couple of syntax issues arise from the current setup:
LOOP: You do not iterate through the rows of the data frame but only through the atomic ID values in the single column df$ID. In that same loop you are passing the entire df$Start_Date and df$End_Date vectors into the query concatenation.
DATES: Your date formats do not align with the 'YYYY-MM-DD' format most databases expect. Some others, like Oracle, require an explicit string-to-date conversion: TO_DATE(mydate, 'YYYY-MM-DD').
A couple of performance / best-practice issues:
PARAMETERIZATION: While parameterization is not strictly needed for security here, since your values are not user input that could inject malicious SQL, parameterized queries are still advised for maintainability and readability.
GROWING OBJECTS: According to Patrick Burns' The R Inferno, Circle 2: Growing Objects, R programmers should avoid growing multi-dimensional objects like data frames inside a loop, which causes excessive copying in memory. Instead, build a list of data frames and rbind them once outside the loop, as sketched below.
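A minimal sketch of that row-wise, list-based approach, assuming an open RODBC channel dbconnection and a Date column that accepts 'YYYY-MM-DD' literals (RODBC itself has no parameter binding, so the string is built per row; RODBCext or DBI/odbc would allow true parameterization):
# BUILD ONE QUERY PER ROW, COLLECT RESULTS IN A LIST, BIND ONCE OUTSIDE THE LOOP
df_list <- lapply(seq_len(nrow(df)), function(i) {
  qry <- sprintf(
    "SELECT ID, Date, Account_Balance FROM Table
     WHERE ID = %s AND Date > '%s' AND Date <= '%s'",
    df$ID[i],
    format(as.Date(df$Start_Date[i], "%m/%d/%Y")),
    format(as.Date(df$End_Date[i], "%m/%d/%Y"))
  )
  sqlQuery(dbconnection, qry)
})
final_df <- do.call(rbind, df_list)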
With that said, you can avoid any looping or list building by saving your data frame as a database table and joining it to the final table for a filtered join query import. This assumes your database user has CREATE TABLE and DROP TABLE privileges.
# CONVERT DATE FIELDS TO DATE TYPE
df <- within(df, {
  Start_Date = as.Date(Start_Date, format="%m/%d/%Y")
  End_Date = as.Date(End_Date, format="%m/%d/%Y")
})
# SAVE DATA FRAME TO DATABASE
sqlSave(dbconnection, df, "myRData", rownames = FALSE, append = FALSE)
# IMPORT JOINED AND DATE FILTERED QUERY
q <- "SELECT ID, Date, Account_Balance
FROM Table t
INNER JOIN myRData r
ON r.ID = t.ID
AND t.Date BETWEEN r.Start_Date AND r.End_Date"
final_df <- sqlQuery(dbconnection, q)
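Since this approach assumes DROP TABLE privileges, the staging table can be removed once the joined result has been imported (a small follow-up using RODBC's sqlDrop):
# CLEAN UP STAGING TABLE
sqlDrop(dbconnection, "myRData")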

Related

R query database tables iteratively without for loop with lambda or vectorized function for Shiny app

I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr create list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr pull to create list
table_list <- pull(db_tables , TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
  write.csv(
    tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
  )
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed with which direction to go. I researched the equivalent to Python list comprehension in R but have not been able to get the function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, the character vector passed in will be returned as the names of the elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the lambda function for multiple steps like write.csv or use a defined method. Just be sure to return a data frame as the last line. Below uses the new pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
  head_df <- dbReadTable(con, table) |> head()
  write.csv(
    head_df,
    file.path(
      "00_exports", "bibliometrics_tables", paste0(table, ".csv")
    )
  )
  return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no data frame functionality when the object is stored in a list, and you can extract elements by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
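If separate variables are still preferred for the dplyr wrangling the question mentions, the named list can also be unpacked into the global environment (an optional sketch; keeping the list is usually tidier):
# ASSIGN EACH ELEMENT OF df_list AS ITS OWN VARIABLE, NAMED AFTER ITS TABLE
list2env(df_list, envir = .GlobalEnv)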

R: how to query from a database multiple times based on different dates

I have an R Markdown file in which I establish a database connection, query data, and store the data in a csv file. The query is based on a specific date range. How can I automate making multiple queries, one after another, so that e.g. every week is queried from the database? I cannot make one query for e.g. the whole year, because I need to store the data separately for each week. I could make a data frame with two columns for the start and end date, which I would like to use for the query.
But how can I automatically run the queries multiple times depending on the date data frame?
My code so far:
# load libraries
library(RPostgreSQL)
drv <- PostgreSQL()
db_con <- dbConnect(drv, host=my_host, user=my_user, dbname=my_name, port=my_port, password=my_password)
start = "2015-01-01"
end = "2015-01-02"
result <- dbGetQuery(
db_con,
"SELECT * FROM table WHERE date >= start AND date <= end;")
st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (a wrapper around mapply) iteration:
db_con <- dbConnect(
  PostgreSQL(), host=my_host, user=my_user, dbname=my_name,
  port=my_port, password=my_password
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
  start=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="week"),
  end=seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by="week")
)
# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
  # PREPARED STATEMENT WITH PLACEHOLDERS
  sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"
  # BIND PARAMETERS AND QUERY
  stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
  result <- dbGetQuery(db_con, stmt)
  # WRITE DATA TO DISK
  st_write(result, pathname)
  # RETURN QUERY RESULTSET
  return(result)
}
# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
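As an optional follow-up (assuming the Map call above ran successfully), the weekly resultsets can be named by their start dates so individual weeks are easy to look up later:
# NAME LIST ELEMENTS BY WEEK START DATE
names(df_list) <- format(dates_df$start)
head(df_list[["2015-01-01"]])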
I suggest DBI::dbBind and a frame (similar to #Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
# n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like above, and allows the DBMS to optimize the query with the binding parameter ? instead of actual data. (Using sqlInterpolate, the DBMS would see four different queries. Using dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that, while it makes it really easy to get all of the data, nothing here inherently tells you which of your queries a particular row came from. I suspect you can determine that from your data, and that your main intent in breaking it down by week is the amount retrieved per query.
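For the count-per-week demo above, where each bound parameter set returns exactly one row, the originating week can be recovered by simply binding the ranges back onto the result (a small sketch; this would not work for a multi-row select * per week):
# one row per parameter set, so column-bind the week ranges back on
cbind(ranges, out)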
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
  # ...
  res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
  on.exit({
    suppressWarnings(DBI::dbClearResult(res))
  }, add = TRUE)
  DBI::dbBind(res, ranges)
  out <- DBI::dbFetch(res)
  DBI::dbClearResult(res)
  return(out)
}

Looping a gsub to pull from HANA based on a table of values

I hope my title makes sense! If not feel free to edit it.
I have a table in R that contains unique dates. Sometimes this table may have one date, at other times it may have multiple dates. I would like to loop these unique dates into a SQL query I have created to pull data and append it to px_tbl. However, I am at a loss where to start. Below is what I have so far; it obviously works when I have only 1 unique date, but when the table contains 2 dates it doesn't pull.
unique_dates_df
DATE
2016-12-15
2017-02-15
2017-03-02
2017-03-09
sqlCMD_px <- 'SELECT *
FROM "_SYS_BIC"."My.Table/PRICE"
(\'PLACEHOLDER\' = (\'$$P_EFF_DATE$$\',\'%D\'))'
sqlCMD_px <- gsub("%D", unique_dates_tbl, sqlCMD_px) ## <- the gsub is needed
## so that the dates are formatted correctly for the SQL pull
px_tbl <- sqlQuery(myconn, sqlCMD_px)
I am convinced that an apply function will work in one form or another but haven't been able to figure it out. Thanks for the help!
This should work:
#SQL command template
sqlCmdTemp <- 'SELECT *
FROM "_SYS_BIC"."My.Table/PRICE"
(\'PLACEHOLDER\' = (\'$$P_EFF_DATE$$\',\'%D\'))'
#Dates as character
unique_dates <- c("2017-03-08","2017-03-09", "2017-03-10")
#sapply command
res <- sapply(unique_dates, function(d) { sqlQuery(conn, gsub("%D", d, sqlCmdTemp)) }, simplify = FALSE)
#bind rows
tbl.df <- do.call(rbind, res)
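If the appended rows should also carry the effective date they were pulled with, it can be added before binding (a small sketch; assumes each element of res is a data frame, which sqlQuery returns on success):
#add the query date to each resultset, then bind
res_dated <- Map(function(d, x) cbind(EFF_DATE = d, x), names(res), res)
tbl.df <- do.call(rbind, res_dated)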

R to SQL server. Make a query joining a database table with a R dataframe

I have an ODBC connection to a SQL Server database. From R, I want to query a table with lots of data, but I want to get only those records that match my data frame in R on certain columns (INNER JOIN). Currently I do this by linking ODBC tables in MS Access 2003 (linked tables "dbo_name") and then running relational queries, without downloading the entire table. I need to reproduce this process in R while avoiding downloading the entire table (avoid sqlFetch()).
I have read the documentation for the ODBC, DBI, and rsqlserver packages without success. Is there any package or way to do this?
If you can't write a table to the database, there is another trick you can use. You essentially make a giant WHERE clause. Let's say you want to join the table table in the database to your data.frame called a on the column id. You could say:
ids <- paste0(a$id, collapse=',')
# If a$id is a character, you'll have to surround this in quotes:
# ids <- paste0(paste0("'", a$id, "'"), collapse=',')
dbGetQuery(con, paste0('SELECT * FROM table where id in (', ids, ')'))
From your comment, it seems that SQL Server has a problem with a query of that size. I suspect that you may have to "chunk" the query into smaller bits, and then join them all together. Here is an example of splitting the ids into 1000 chunks, querying, and then combining.
id.chunks <- split(a$id, seq(1000))
result.list <- lapply(id.chunks, function(ids)
  dbGetQuery(con,
    paste0('SELECT * FROM table where id in (', paste(ids, collapse=','), ')')))
combined.results <- do.call(rbind, result.list)
The problem was solved. The ids vector was divided into 1,000 chunks, and each chunk was then queried against the server. I show the unorthodox code. Thanks nograpes!
# "lani1" is the vector with 395.474 ids
id.chunks<-split(lani1,seq(1000))
for (i in 1:length(id.chunks)){
idsi<-paste0(paste0("'",as.vector(unlist(id.chunks[i])),"'"),collapse=',')
if(i==1){ani<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
}
else {ani1<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
ani<-rbind(ani,ani1)
}
}
I adapted the answer above and the following worked for me without needing SQL syntax. The table I used was from the adventureworks SQL Server database.
lazy_dim_customer <- dplyr::tbl(conn, dbplyr::in_schema("dbo", "DimCustomer"))
# Create data frame of customer ids
adv_customers <- dplyr::tbl(conn, "DimCustomer")
query1 <- adv_customers %>%
filter(CustomerKey < 20000) %>%
select(CustomerKey)
d00df_customer_keys <- query1 %>% dplyr::collect()
# Chunk customer ids, filter, collect and bind
id.chunks <- split(d00df_customer_keys$CustomerKey, seq(10))
result.list <- lapply(id.chunks, function(ids)
lazy_dim_customer %>%
filter(CustomerKey %in% ids) %>%
select(CustomerKey, FirstName, LastName) %>%
collect() )
combined.results <- do.call(rbind, result.list)

How to allocate/append a large column of Date objects to a data-frame

I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184Mb.
(it's backed by SQL, it is the contents of my dataset which I read in via read.csv.sql)
Column 2 is tr.sql$visit_date. SQL does not natively allow representing dates as an R Date object, and this is important for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data-frame tr
(where the visit_date column can be natively represented as a Date (chron::Date?). Trust me, this makes exploratory data analysis easier; for now this is how I want to do it - I might use native SQL eventually, but please don't quibble about that for now.)
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
then, as a workaround, copy tr.sql -> tr in chunks of (say) N/8 using a for-loop, so that the temporary involved in the string->Date conversion does not run out of memory, with a garbage-collect after each chunk:
for (i in 0:7) {
  from <- floor(i*N/8)
  to <- floor((i+1)*N/8) - 1
  if (i==7)
    to <- N
  print(c("Copying tr.sql$visit_date", from, to, " ..."))
  tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
  gc()
}
rm(tr.sql)
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
The problem is allocating then copying the visit_date column.
Here are the dataset and code; I am having multiple separate problems with this, explained below:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
and code:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
The second line that tries to declare data.frame(visit_date = Date(N)) fails; I don't know the correct namespace-qualified name for the Date object (I tried chron::Date and Dates::Date, which don't work).
Both the attempt to assign and the attempt to append fail. I am not even sure whether it is legal, or efficient, to use append on a single large column of a data frame.
Remember these objects are big, so avoid using temporaries.
Thanks in advance...
Try this ensuring that you are using the most recent version of sqldf (currently version 0.4-1.2).
(If you find you are running out of memory, try putting the database on disk by adding the dbname = tempfile() argument to the read.csv.sql call. If even that fails, then it's so large in relation to available memory that it's unlikely you are going to be able to do much analysis with it anyway.)
# create test data file
Lines <-
"customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52"
cat(Lines, file = "trainingtest.csv")
# read it back
library(sqldf)
DF <- read.csv.sql("trainingtest.csv", method = c("integer", "Date2", "numeric"))
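If memory is tight, the same call can keep the intermediate database on disk, as mentioned above; only the dbname argument changes (a sketch of that variant):
# read it back, keeping the temporary SQLite database on disk
DF <- read.csv.sql("trainingtest.csv", dbname = tempfile(),
                   method = c("integer", "Date2", "numeric"))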
It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:
tr <- tr.sql
tr$visit_date <- as.Date(tr.sql$visit_date)
Or even more efficient:
tr <- data.frame(colOne = tr.sql[,1], visit_date = as.Date(tr.sql$visit_date), colThree = tr.sql[,3])
As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a cbind() operation.
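For illustration, the cbind() form of that widening, equivalent to the data.frame() call above (a sketch assuming tr.sql has the three columns shown in the CSV):
tr <- cbind(tr.sql[, c("customer_id", "visit_spend")], visit_date = as.Date(tr.sql$visit_date))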
Try this:
tr <- data.frame(visit_date= as.Date(tr.sql$visit_date, origin="1970-01-01") )
This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If not one of those formats then post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin. E.g:
vdfrm <- data.frame(a = as.Date(c(1470, 1475, 1480), origin="1970-01-01") )
vdfrm
a
1 1974-01-10
2 1974-01-15
3 1974-01-20
