I'd like to create and deploy a model into SQL Server in order to use it with the new built-in PREDICT() function.
However, it seems I'm stuck with the RxSqlServerData method in R.
Whenever I run my script, I get this error:
Error in rxExecJob(rxCallInfo(matchCall, .rxDeprecated =
"covariance"), : Data must be an RxSqlServerData data source for
this compute context.
Here's my code so far:
#Logistic plain select sql query
#input_query = 'SELECT app.ClientAgeToApplicationDate AS Age, IIF(conc.FirstInstallmentDelay>60,1,0) AS FPD60 FROM dim.Application app JOIN dim.Contract con ON app.ApplicationID = con.ApplicationID JOIN dim.Contract_Calculated conc ON con.ContractID = conc.ContractId'
#LinReg aggregated query
input_query = '
*SQL QUERY, too long to paste...*
'
connStr <- paste("Driver=SQL Server; Server=", "czphaddwh01\\dev",
";Database=", "DWH_Staging", ";Trusted_Connection=true", sep = "");
#Set compute context to SQL Server. Does not load any data into the local client's memory (unlike ODBC).
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
input_data <- RxSqlServerData(sqlQuery = input_query, connectionString = connStr)
risk <- rxImport(input_data)
#head(risk)
#Binary regression for non-aggregated sql query
#logit_model <- rxLogit(Age ~ FPD60, data = risk)
#LinReg for aggregated sql query
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
I'm new to R. Any help would be greatly appreciated.
When you run
cc <- RxInSqlServer(connectionString = connStr);
rxSetComputeContext(cc)
you tell R to run any Microsoft analytics functions (basically those starting with rx) in the SQL compute context. This means that all the processing is handled inside the database; R is essentially acting as a shell to SQL.
Naturally, this requires that the dataset you're working with actually be in the database: a table, a view, or a query returning a result set.
When you then run
risk <- rxImport(input_data)
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = risk)
you import your data into a local data frame, and then try to fit a model on it. But you previously told R to do the number-crunching in-database, and your data is local. So it will complain.
The solution is to pass your RxSqlServerData object directly to rxLinMod:
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = input_data)
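For reference, here is a minimal sketch of the corrected end-to-end flow (it assumes input_query and the connection string from the question; the rxSerializeModel() step is an assumption added only because the goal is to score the model with PREDICT(), which expects a serialized model stored in the database):
library(RevoScaleR)  # ships with Microsoft R Client / SQL Server Machine Learning Services

connStr <- paste("Driver=SQL Server; Server=", "czphaddwh01\\dev",
                 ";Database=", "DWH_Staging", ";Trusted_Connection=true", sep = "")

# run rx* functions inside SQL Server; no data is pulled to the client
rxSetComputeContext(RxInSqlServer(connectionString = connStr))

# lazy data source object pointing at the query
input_data <- RxSqlServerData(sqlQuery = input_query, connectionString = connStr)

# fit the model directly on the data source (no rxImport needed)
LinReg_model <- rxLinMod(RiskFPD60 ~ Age, data = input_data)

# assumed deployment step: serialize the model so it can be stored in a
# varbinary(max) column and scored in T-SQL with PREDICT()
serialized_model <- rxSerializeModel(LinReg_model, realtimeScoringOnly = TRUE)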
Related
I am connecting to a SQL Server database through an ODBC connection in R. I have two potential methods to get data and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so it has to be pulled while the app is loading rather than queried on the fly as the user uses the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date) {
  # named parameters of the stored procedure are passed inline
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
# today() and %m-% come from lubridate
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr to create a list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr::pull to create a character vector of table names
table_list <- pull(db_tables, TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
write.csv(
tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
)
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1", "TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs currently power a metrics plug-in, written in C++, that displays metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me, and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs also use different logic from one another, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date lists total page views for that day, or the values are already aggregated at the weekly or monthly scale and then repeated for each date. So joining and grouping causes drastic errors in the actual page view counts.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed with which direction to go. I researched the R equivalent of Python list comprehension but have not been able to get a function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, the character vector passed in is returned as the names of the elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the lambda function for multiple steps like write.csv, or use a defined method. Just be sure to return a data frame as the last line. Below uses the new pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
head_df <- dbReadTable(con, table) |> head()
write.csv(
head_df,
file.path(
"00_exports", "bibliometrics_tables", paste0(table, ".csv")
)
)
return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no data frame functionality by storing the data frames in a list, and you can extract them by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
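If the same wrangling has to run over every table afterwards, the named list also pairs nicely with lapply; a small sketch (distinct() is just a placeholder for whatever dplyr pipeline you actually need):
library(dplyr)

# number of rows imported from each table, named by table
row_counts <- sapply(df_list, nrow)

# apply the same dplyr pipeline to every element, keeping the names
cleaned_list <- lapply(df_list, function(df) df %>% distinct())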
I am trying to use R to access the clinicaltrials.gov AACT database to create a list of facility_investigators for a specific topic.
The following code is an example of how to get a list of all clinical trials on the topic TP53
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'studies')
x = study_tbl %>% filter(official_title %like% '%TP53%') %>% collect()
Similarly, if I want a list of principal investigators,
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'facility_investigators')
I am unable to make a list of only the TP53 facility_investigators (something like TP53 & facility_investigators). Any help would be appreciated.
This is a link where some explanation is provided, but my problem is not resolved: http://www.cancerdatasci.org/post/2017/03/approaches-to-accessing-clinicaltrials.gov-data/
Is this what you're asking? You're pulling from two different tables in the same database: the first one is 'studies' and the second one is 'facility_investigators'. What you need to do is run head() (or glimpse() or str()) on each of the tables and see if they have a common variable you can merge on after you load them into R. If they do, then you would do something like this:
library(dplyr)
merged_table <- inner_join(x, study_table, by = "common column")
If the columns have different names, it would look like:
library(dplyr)
merged_table <- inner_join(x, study_table, by = c("x_column_name" = "study_table_column_name"))
From there you can limit your dataset to just facility investigators that have the characteristics you want.
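As a concrete sketch of that idea against AACT (it assumes both tables share an nct_id key column, which you should confirm with glimpse() first), the filter and join can even happen in the database before collect():
library(dplyr)
library(RPostgreSQL)

study_tbl <- tbl(src = aact, 'studies')
fi_tbl    <- tbl(src = aact, 'facility_investigators')

tp53_investigators <- study_tbl %>%
  filter(official_title %like% '%TP53%') %>%
  select(nct_id, official_title) %>%
  inner_join(fi_tbl, by = "nct_id") %>%   # assumed shared key column
  collect()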
If you want to do it in one PostgreSQL query you can do it like so. For more information about this syntax in particular see page 18:
con <- dbConnect() # use the same parameters you use above to connect
# note: SQL string literals take single quotes, so the R string uses double quotes
query <- dbSendQuery(con, "
  select s.*, fi.*
  from (select * from studies where official_title like '%TP53%') as s
  inner join facility_investigators as fi
    on s.\"joining column\" = fi.\"joining column\"
")
r_dataset <- fetch(query)
# I would just close the connection in RStudio using the connection tab.
The above query has an inner join in the main query and a subquery in the FROM clause. The subquery performs the filtering you were trying to do in R; it essentially lets you select only from a table whose results are already filtered. An inner join combines all the records the two tables have in common into one table. If you need to join on more than one column, add an AND between the two conditions in the ON clause.
This is a newbie question.
How do I define a (classification) task that uses data from a (sqlite) database? The mlr3db example seems to write data from memory first. In my case, the data is already in the database. What is maybe a bigger problem: the target data and the features are in different tables.
What I tried:
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "my_data.db")
my_features <- dplyr::tbl(con, "my_features")
my_target <- dplyr::tbl(con, "my_targets")
task <- mlr3::TaskClassif$new("my_task", backend=my_features, target="???")
and then I don't know how to specify the target argument.
Maybe a solution would be to create a VIEW in the database that joins features and targets?
Having the data split into (1) multiple tables or (2) multiple databases is possible. In your case, it looks like the data is just split into multiple tables, but you can use the same DBI connection to access them.
All you need is a key column to join the two tables. In the following example I'm using a simple integer key column and an inner_join() to merge the two tables into a single new table, but this somewhat depends on your database schema.
library(mlr3)
library(mlr3db)
# base data set
data = iris
data$row_id = 1:nrow(data)
# create data base with two tables, split data into features and target and
# keep key column `row_id` in both tables
path = tempfile()
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
DBI::dbWriteTable(con, "features", subset(data, select = - Species))
DBI::dbWriteTable(con, "target", subset(data, select = c(row_id, Species)))
DBI::dbDisconnect(con)
# re-open table
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
# access tables with dplyr
tbl_features = dplyr::tbl(con, "features")
tbl_target = dplyr::tbl(con, "target")
# join tables with an inner_join
tbl_joined = dplyr::inner_join(tbl_features, tbl_target, by = "row_id")
# convert to a backend and create the task
backend = as_data_backend(tbl_joined, primary_key = "row_id")
mlr3::TaskClassif$new("my_task", backend, target = "Species")
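Alternatively, the VIEW you suggest works as well: create the joined view once in the database and point the backend at it. A sketch under the same toy schema as above (the view name "joined" is arbitrary):
# reuse the open connection from above to create a joined view
DBI::dbExecute(con, "
  CREATE VIEW IF NOT EXISTS joined AS
  SELECT f.*, t.Species
  FROM features f
  INNER JOIN target t ON f.row_id = t.row_id
")

# a dplyr::tbl() on the view behaves like any other lazy table
backend = as_data_backend(dplyr::tbl(con, "joined"), primary_key = "row_id")
mlr3::TaskClassif$new("my_task", backend, target = "Species")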
I have a rather simple question.
On a daily basis, I perform data analysis in R using the RODBC package. I connect R to our data warehouse using SQL and move the data into the R environment:
dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')
degrees <- sqlQuery(dbhandle, "select Inst, ID, DegreeDate, Degree from DEGREE where FY = ('2015') group by Inst, ID, DegreeDate, Degree order by Inst, ID, DegreeDate, Degree", as.is=TRUE)
You know how in MS Access you can have a window pop up that asks you, for example, which FY you want? You put in 2015 and you'll get all the degrees from that fiscal year.
Is there any way to do it in R? The parameter query questions I see on Stack Overflow deal with changing the data in the SQL database and I'm not interested in that. I just want to set some pretty basic limits so I can rerun code.
Some may wonder, "why can't you just change the 5 to a 6?" That's a fair point, but I'm concerned that, with more complicated queries, users may miss the part of the SQL query where the 5 needs to change to a 6, which would mess up the analysis or slow it down.
Thank you!
Walker
The Input Parameter pop-up box is strictly an MSAccess.exe GUI feature. If you run MS Access as a backend database (outside of the MS Office software) via ODBC, a query with an unknown parameter will fail, and an error will be raised on the script making the ODBC call.
Therefore, you will need to create a similar GUI pop-up box in R using libraries such as gWidgets or Shiny, then pass the user's input value into the query. And do so with actual parameterization, using RODBCext (an extension of RODBC), in case a malicious user attempts SQL injection that could wipe data or destroy your SQL Server database.
Below is an example using gWidgets2 with a combo box for fiscal years.
Libraries
library(RODBC)
library(RODBCext)
library(gWidgets2)
library(gWidgets2tcltk)
options(guiToolkit="tcltk")
GUI Function (create the R and SQLServer gif image beforehand)
mainWindow <- function(){
# TOP OF WINDOW
win <- gWidgets2::gwindow("Fiscal Year User Input", height = 200, width = 300)
tbl <- glayout(cont=win, spacing = 8, expand=TRUE)
# IMAGE
tbl[1,1] <- gimage(filename = "RSQLServerGUI.gif",
dirname = "/path/to/gif/image", container = tbl)
# LABEL
tbl[2,1] <- glabel("Fiscal Year Selection: ", container = tbl)
font(tbl[2,1]) <- list(size=12, family="Arial")
# COMBO BOX OF FISCAL YEARS
tbl[3,1, expand=TRUE] <- fiscal_year_cbo <- gcombobox(as.character(c(2012:2018)),
selected = 1, editable = TRUE,
index=TRUE, container = tbl)
font(tbl[3,1]) <- list(size=16, family="Arial")
# COMBO BOX CHANGE FUNCTION (2012 - 2018)
addHandlerChanged(fiscal_year_cbo, handler=function(...) {
fiscal_year_value <- svalue(fiscal_year_cbo) # GET USER SELECTED VALUE
gmessage(paste("You selected FY:", fiscal_year_value))
degrees <- getDegreesData(fiscal_year_value) # GET DATABASE DATA
dispose(win) # CLOSE WINDOW
})
}
Query Function (called in combo box change handler above)
getDegreesData <- function(fy_param) {
dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')
# PREPARED STATEMENT (NO CONCATENATED DATA)
strSQL <- "select Inst, ID, DegreeDate, Degree
from DEGREE
where FY = ?
group by Inst, ID, DegreeDate, Degree
order by Inst, ID, DegreeDate, Degree"
# PASS PARAMETER TO RETURN DATAFRAME
sql_df <- sqlExecute(dbhandle, strSQL, fy_param, fetch=TRUE)
odbcClose(dbhandle)
return(sql_df)
}
Run GUI
m <- mainWindow()
You can create an input variable at the start and pass it to your query.
For example:
# Change your FY here
input_FY <- 2016
dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')
degrees <- sqlQuery(dbhandle, paste0("
select Inst, ID, DegreeDate, Degree
from DEGREE
where FY = ('", input_FY, "')
group by Inst, ID, DegreeDate, Degree
order by Inst, ID, DegreeDate, Degree"),
as.is=TRUE)
So for any complicated query you can still pass the same input_FY variable, or any other variable that you have declared at the start of the code, for a quick and easy update.
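If SQL injection is a concern with the concatenation above (for example when the value comes from user input rather than a hard-coded variable), the same query can be parameterized with sqlExecute() from RODBCext, as in the first answer; a minimal sketch:
library(RODBC)
library(RODBCext)

input_FY <- 2016

dbhandle <- odbcDriverConnect('driver={SQL Server};server=SQLSERVER;database=MYDATABASE;trusted_connection=true')

degrees <- sqlExecute(dbhandle,
                      "select Inst, ID, DegreeDate, Degree
                       from DEGREE
                       where FY = ?
                       group by Inst, ID, DegreeDate, Degree
                       order by Inst, ID, DegreeDate, Degree",
                      data = data.frame(FY = input_FY),
                      fetch = TRUE)

odbcClose(dbhandle)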
I am trying to query my MongoDB database from R.
I think I lost part of it in the process.
Does R have any limit, and how can I ensure all my records are loaded into R?
Code:
# inspect the number of records in MongoDB (mongo shell)
db.complaints.count()
# returns 395853
# write a query to load data into R
library(rmongodb)  # mongo.find / mongo.cursor.* / mongo.bson.to.list
library(plyr)      # rbind.fill
mongo = mongo.create()  # connection to the MongoDB instance
complaints = data.frame(stringsAsFactors = FALSE)
db = "customers.complaints"
cursor = mongo.find(mongo, db)
while (mongo.cursor.next(cursor))
{
  tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
  tmp.df = as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
  complaints = rbind.fill(complaints, tmp.df)
}
I get [1] 47077 15 after checking the loading in R with dim(complaints).
How can I make sure I get all my records into R?
http://www.analyticbridge.com/profiles/blogs/time-issue-in-creating-a-huge-data-frame-from-mongodb-collection
The code at the link above, which uses environment variables, might help you! Please do comment here if you get a solution.
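A rough sketch of that idea, under the assumption (suggested by the linked post's title) that the problem is the loop slowing down as the data frame grows: collect each document into a list (or an environment, as in the post) and combine once at the end instead of calling rbind.fill() on every iteration:
library(rmongodb)
library(plyr)

mongo  <- mongo.create()                    # assumed local MongoDB instance
cursor <- mongo.find(mongo, "customers.complaints")

rows <- list()
i <- 1
while (mongo.cursor.next(cursor)) {
  tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
  rows[[i]] <- as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
  i <- i + 1
}
mongo.cursor.destroy(cursor)

# a single combine at the end instead of ~400k rbind.fill calls inside the loop
complaints <- rbind.fill(rows)
dim(complaints)   # compare against db.complaints.count()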