I have a rather simple question.
On a daily basis, I perform data analysis in R using the RODBC package. I connect to our data warehouse via SQL and pull the results into the R environment:
dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')
degrees <- sqlQuery(dbhandle, "select Inst, ID, DegreeDate, Degree from DEGREE where FY = ('2015') group by Inst, ID, DegreeDate, Degree order by Inst, ID, DegreeDate, Degree", as.is=TRUE)
You know how in MS Access you can have a window pop up that asks you, for example, which FY you want? You put in 2015 and you get all the degrees from that fiscal year.
Is there any way to do that in R? The parameter-query questions I see on Stack Overflow deal with changing the data in the SQL database, and I'm not interested in that. I just want to set some pretty basic limits so I can rerun the code.
Some may wonder, "why can't you just change the 5 to a 6?" That's a fair point, but I'm concerned that with more complicated queries users may miss the part of the SQL query where the 5 needs to become a 6, which would mess up the analysis or slow it down.
Thank you!
Walker
The input-parameter pop-up box is strictly an MSAccess.exe GUI feature. If you run MS Access as a backend database (outside the MS Office software) via ODBC, a query with an unknown parameter will fail and an error will be raised on the script making the ODBC call.
Therefore, you will need to build a similar GUI pop-up box in R, using libraries such as gWidgets or Shiny, and pass the user's input value into the query. Do so with actual parameterization, using RODBCext (an extension of RODBC), in case a malicious user attempts SQL injection that could wipe data or otherwise damage your SQL Server database.
Below is an example using gWidgets2 with a combo box of fiscal years.
Libraries
library(RODBC)
library(RODBCext)
library(gWidgets2)
library(gWidgets2tcltk)
options(guiToolkit="tcltk")
GUI Function (create the R and SQLServer gif image beforehand)
mainWindow <- function(){
  # TOP OF WINDOW
  win <- gWidgets2::gwindow("Fiscal Year User Input", height = 200, width = 300)
  tbl <- glayout(cont=win, spacing = 8, expand=TRUE)

  # IMAGE
  tbl[1,1] <- gimage(filename = "RSQLServerGUI.gif",
                     dirname = "/path/to/gif/image", container = tbl)

  # LABEL
  tbl[2,1] <- glabel("Fiscal Year Selection: ", container = tbl)
  font(tbl[2,1]) <- list(size=12, family="Arial")

  # COMBO BOX OF FISCAL YEARS
  tbl[3,1, expand=TRUE] <- fiscal_year_cbo <- gcombobox(as.character(c(2012:2018)),
                                                        selected = 1, editable = TRUE,
                                                        index=TRUE, container = tbl)
  font(tbl[3,1]) <- list(size=16, family="Arial")

  # COMBO BOX CHANGE HANDLER (2012 - 2018)
  addHandlerChanged(fiscal_year_cbo, handler=function(...) {
    fiscal_year_value <- svalue(fiscal_year_cbo)    # GET USER SELECTED VALUE
    gmessage(paste("You selected FY:", fiscal_year_value))
    degrees <<- getDegreesData(fiscal_year_value)   # GET DATABASE DATA (GLOBAL ASSIGN SO IT SURVIVES THE WINDOW CLOSING)
    dispose(win)                                    # CLOSE WINDOW
  })
}
Query Function (called in combo box change handler above)
getDegreesData <- function(fy_param) {
  dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')

  # PREPARED STATEMENT (NO CONCATENATED DATA)
  strSQL <- "select Inst, ID, DegreeDate, Degree
             from DEGREE
             where FY = ?
             group by Inst, ID, DegreeDate, Degree
             order by Inst, ID, DegreeDate, Degree"

  # PASS PARAMETER TO RETURN DATAFRAME
  sql_df <- sqlExecute(dbhandle, strSQL, fy_param, fetch=TRUE)
  odbcClose(dbhandle)

  return(sql_df)
}
Run GUI
m <- mainWindow()
You can create an input variable at the start and pass it to your query.
For example:
# Change your FY here
input_FY <- 2016
dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')
degrees <- sqlQuery(dbhandle, paste0("
select Inst, ID, DegreeDate, Degree
from DEGREE
where FY = ('", input_FY, "')
group by Inst, ID, DegreeDate, Degree
order by Inst, ID, DegreeDate, Degree"),
as.is=TRUE)
So for any complicated queries you can still pass the same input_FY variable or any other variable that you have declared at the start of code for a quick/easy update.
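If you would rather be prompted for the value each time, closer to the Access pop-up but without building a full GUI, base R's readline() can collect the input at run time. A minimal sketch, assuming the script is run in an interactive session and reusing the same query from above:

# prompt for the fiscal year (interactive sessions only; readline() returns "" otherwise)
input_FY <- readline(prompt = "Enter fiscal year (e.g. 2016): ")

dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')
degrees <- sqlQuery(dbhandle, paste0("
    select Inst, ID, DegreeDate, Degree
    from DEGREE
    where FY = ('", input_FY, "')
    group by Inst, ID, DegreeDate, Degree
    order by Inst, ID, DegreeDate, Degree"),
    as.is = TRUE)
odbcClose(dbhandle)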
Related
I have an R Markdown file in which I establish a database connection, query data, and store the data in a csv file. The query is based on a specific date range. How can I automate making multiple queries, one after another, so that e.g. every week is queried from the database? I cannot make one query for e.g. the whole year, because I need to store the data separately for each week. I could make a data frame with two columns for the start and end date of each week, which I would like to use for the query.
But how can I automatically run the query multiple times based on that date data frame?
My code so far:
# load libraries
library(RPostgreSQL)
library(sf)    # for st_write()

drv <- PostgreSQL()
db_con <- dbConnect(drv, host=my_host, user=my_user, dbname=my_name, port=my_port, password=my_password)
start = "2015-01-01"
end = "2015-01-02"
result <- dbGetQuery(
db_con,
"SELECT * FROM table WHERE date >= start AND date <= end;")
st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (wrapper to mapply) iteration:
db_con <- dbConnect(
  PostgreSQL(), host=my_host, user=my_user, dbname=my_name,
  port=my_port, password=my_password
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
start=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="week"),
end=seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by="week")
)
# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
  # PREPARED STATEMENT WITH PLACEHOLDERS
  sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"

  # BIND PARAMETERS AND QUERY
  stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
  result <- dbGetQuery(db_con, stmt)

  # WRITE DATA TO DISK
  st_write(result, pathname)

  # RETURN QUERY RESULTSET
  return(result)
}
# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
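One caveat with the sketch above: st_write(result, pathname) points at the same pathname on every iteration, so each week would overwrite the previous file. A possible adjustment (just a sketch; the "output" folder, the file naming, and the use of write.csv for the per-week csv files are assumptions):

query_db <- function(s, e) {
  sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"
  stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
  result <- dbGetQuery(db_con, stmt)

  # HYPOTHETICAL NAMING: ONE FILE PER WEEK, E.G. "week_2015-01-01_2015-01-08.csv"
  write.csv(result, file.path("output", paste0("week_", s, "_", e, ".csv")), row.names=FALSE)

  return(result)
}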
I suggest DBI::dbBind and a frame (similar to #Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
# n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like above, and allows the DBMS to optimize the query with the binding parameter ? instead of actual data. (Using sqlInterpolate, the DBMS would see four different queries. Using dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that, while it makes it really easy to get all of the data, nothing here inherently tells you which of your queries a particular row came from. I suspect you can determine that from your data, or that your main reason for breaking it down by week is simply the amount retrieved per query.
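If you do need to know which week each row came from, one option (a sketch layered on top of the code above, not part of the original answer) is to bind one range at a time and tag the fetched rows:

res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
out_list <- lapply(seq_len(nrow(ranges)), function(i) {
  DBI::dbBind(res, ranges[i, ])    # bind just this week's start/end
  cbind(week_start = ranges$start[i], DBI::dbFetch(res))
})
DBI::dbClearResult(res)
out <- do.call(rbind, out_list)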
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
  # ...
  res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
  on.exit({
    suppressWarnings(DBI::dbClearResult(res))
  }, add = TRUE)
  DBI::dbBind(res, ranges)
  out <- DBI::dbFetch(res)
  DBI::dbClearResult(res)
  return(out)
}
I'm writing a query using RPostgreSQL that is meant to fill a table one time. I really have no intention of doing anything else with the data within R. I just need to run the function that fills that table.
library(data.table)
library(RPostgreSQL)

MakeAndGetQuery <- function(id) {
  q <- paste0("INSERT INTO table_a SELECT * FROM table_c WHERE client_id = ",
              id,
              " AND event_date = CURRENT_DATE - 1")
  as.data.table(dbGetQuery(conn2, q))
}
all_yer_data <- rbindlist(lapply(generate_id$client_id, MakeAndGetQuery))
setkey(all_yer_data, id, ...)
So my question is, will not doing anything with the data frame within R affect whether it runs successfully? In theory, that SQL statement shouldn't even produce any results within R. It's an INSERT INTO against Redshift, so if I ran it in Redshift it wouldn't return any rows, just a message saying it was successful and "5 Rows affected".
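For what it's worth, DBI also provides dbExecute() for statements that return no result set; whether it works depends on the driver, but with a DBI-compliant connection the loop could be reduced to something like this sketch (reusing the question's conn2 connection and generate_id table, which are assumed to exist):

library(DBI)

RunInsert <- function(id) {
  q <- paste0("INSERT INTO table_a SELECT * FROM table_c WHERE client_id = ",
              id,
              " AND event_date = CURRENT_DATE - 1")
  dbExecute(conn2, q)    # returns the number of rows affected; nothing to fetch
}

rows_affected <- sapply(generate_id$client_id, RunInsert)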
I'd like to create and deploy a model into SQL Server in order to use it with the new in-built PREDICT() function.
However, it seems I'm stuck with the RxSqlServerData method in R.
Whenever I run my script, I get this error:
Error in rxExecJob(rxCallInfo(matchCall, .rxDeprecated =
"covariance"), : Data must be an RxSqlServerData data source for
this compute context.
Here's my code so far:
#Logistic plain select sql query
#input_query = 'SELECT app.ClientAgeToApplicationDate AS Age, IIF(conc.FirstInstallmentDelay>60,1,0) AS FPD60 FROM dim.Application app JOIN dim.Contract con ON app.ApplicationID = con.ApplicationID JOIN dim.Contract_Calculated conc ON con.ContractID = conc.ContractId'
#LinReg aggregated query
input_query = '
*SQL QUERY, too long to paste...*
'
connStr <- paste("Driver=SQL Server; Server=", "czphaddwh01\\dev",
";Database=", "DWH_Staging", ";Trusted_Connection=true", sep = "");
#Set compute context to SQL Server. Does not load any data into the memory of the local client. ODBC can't.
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
input_data <- RxSqlServerData(sqlQuery = input_query, connectionString = connStr)
risk <- rxImport(input_data)
#head(risk)
#Binary regression for non-aggregated sql query
#logit_model <- rxLogit(Age ~ FPD60, data = risk)
#LinReg for aggregated sql query
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
I'm new to R. Any help would be greatly appreciated.
When you run
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
you tell R to run any Microsoft analytics functions (basically those starting with rx) in the SQL compute context. This means that all the processing will be handled inside the database. R essentially is acting as a shell to SQL.
Naturally, this requires that the dataset you're working with actually be in the database: a table, a view, or a query returning a result set.
When you then run
risk <- rxImport(input_data)
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
you import your data into a local data frame, and then try to fit a model on it. But you previously told R to do the number-crunching in-database, and your data is local. So it will complain.
The solution is to pass your RxSqlServerData object directly to rxLinMod:
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = input_data)
I have downloaded the Corine 2012 land-cover data (available here) in order to create an online map via Shiny and Leaflet. I have uploaded the data to my PostgreSQL database and want to query parts of it for my Shiny application. I started retrieving the data, but it is quite slow and my main query produces an 80 MB data frame. How can I approach this differently to speed up retrieval and reduce the size? My code snippet (getting data for areas with coniferous plants) is:
library(RPostgreSQL)
library(postGIStools)
drv <- dbDriver("PostgreSQL") # loads the PostgreSQL driver
con <- dbConnect(drv, dbname = mydbname, host = myhost, port = myport,
user = myuser, password = mypassword)
# Getting data
coniferous <- get_postgis_query(con, "SELECT id, geom from mycorine WHERE code='312'",geom_name = "geom")
Thank you in advance!
I've had quite a lot of joy using rpostgis and sf when extracting large amounts of vector data from PostGIS into R. Also incorporate ST_Simplify to speed up geometry display in Leaflet:
# set up connection
conn <- dbConnect("PostgreSQL",user="user",password="mypass",port=5432,dbname="postgis_name")
# dummy query (obviously), including a spatial subset and ST_Simplify to simplify geometry
qry <- "SELECT ST_Simplify(geom,60) AS geom FROM mytable WHERE geom && ST_MakeEnvelope(xmin, ymin, xmax, ymax, EPSG)"
the.data = st_read_db(conn, query=qry, geom="geom")
This will return simplified sf objects, which are read in as a data frame and are very quick to load into R.
The above query was against 600,000 polygons and subset by a bounding box that read in about 8,000 of them. It took 0.4 seconds. Obviously this could be done by attribute instead of spatial bounding box (query times may differ though).
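Applied to the question's table, the same idea would look roughly like this (a sketch; the simplification tolerance of 60 is just the value from the dummy query above and should be tuned to your data and zoom levels):

qry <- "SELECT id, ST_Simplify(geom, 60) AS geom FROM mycorine WHERE code = '312'"
coniferous <- st_read_db(conn, query = qry, geom = "geom")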
https://cran.r-project.org/web/packages/sf/sf.pdf
You should always take into account how much data is reasonable to display and what level of geometrical detail is acceptable at your zoom level etc.
We have been developing a Shiny app for a few months now, but it is extremely slow when it tries to load a huge amount of data. We even use the reactive function to reuse the data, but it is still as slow as before when we request different sets of data.
We have a log file, and it shows that Shiny takes at least 30.12672 seconds, and up to 52.24799 seconds, each time it loads data from our database.
What is making Shiny so slow? Is it the server or the database? What can we do to speed it up?
We are using an SQLite database. Is that what makes Shiny slow?
If so, what other type of database system should we go for to process huge data sets? Cassandra? HBase? Apache Spark?
EDIT:
For instance,
query <- "SELECT
s.timestamp,
s.particle_concentration as `PM2.5`,
n.code as site
FROM speckdata AS s
LEFT JOIN nodes AS n
ON n.nid = s.nid
AND n.datatype = 'speck'
WHERE strftime('%Y', s.localdate) = 'YEAR'
"
# Match the pattern and replace it.
dataQuery <- sub("YEAR", as.character(year), query)
# Store the result in data.
data = dbGetQuery(DB, dataQuery)
if(nrow(data) > 0) {
# Convert timestamp to date and bind it to the data.
data$date <- as.POSIXct(as.numeric(as.character(data$timestamp)), origin="1970-01-01", tz="GMT")
}
# Chosen to group the data in one panel.
timePlot(
data,
pollutant = c(species, condition),
avg.time = avg_time,
lwd = 2,
lty = 1,
name.pol = c(species_text_value, condition_text_value),
type = "site",
group = TRUE,
auto.text = FALSE
)
That is extremely slow in Shiny.
But when we query the data set using the SQLite manager, it only takes 1.9 seconds for 4719282 rows!
I would suggest testing the performance of your SQLite query directly against the database. If that is the slow point, you will want to optimize the query to make it more efficient. Before anyone can help further, it would be good to know exactly where the performance issue lies.
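As a first step, a rough way to isolate the database from the rest of the Shiny code is to time the query alone in a plain R session. A minimal sketch, reusing the DB connection, dataQuery, and year from the question; the params variant assumes DB is a DBI/RSQLite connection:

library(DBI)

# time just the query, outside of Shiny
print(system.time(
  data <- dbGetQuery(DB, dataQuery)
))

# alternative: let the driver bind the year instead of sub()-ing it into the string
query <- "SELECT s.timestamp, s.particle_concentration AS `PM2.5`, n.code AS site
          FROM speckdata AS s
          LEFT JOIN nodes AS n ON n.nid = s.nid AND n.datatype = 'speck'
          WHERE strftime('%Y', s.localdate) = ?"
data <- dbGetQuery(DB, query, params = list(as.character(year)))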