parLapply on sqlQuery from RODBC - r

R Version : 2.14.1 x64
Running on Windows 7
Connecting to a database on a remote Microsoft SQL Server 2012
I have an unordered vector of names, say:
names <- c("A", "B", "A", "C", "C")
each of which has an id in a table in my db. I need to convert the names to their corresponding ids.
I currently have the following code to do it.
###
names <- c("A", "B", "A", "C", "C")
dbConn <- odbcDriverConnect(connection = "connection string") # successfully connects
nameToID <- function(name, dbConn){
  # dbConn : active db connection formed via odbcDriverConnect
  # name : a char string
  sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep = ""))
}
sapply(names, nameToID, dbConn = dbConn)
###
Barring better ways to do this, which could involve loading the table into R and working with the problem there (which is possible), I understand why the following doesn't work, but I cannot seem to find a solution. Attempting to use parallelization via the package 'parallel':
###
names <- c("A", "B", "A", "C", "C")
dbConn <- odbcDriverConnect(connection = "connection string") # successfully connects
nameToID <- function(name, dbConn){
  # dbConn : active db connection formed via odbcDriverConnect
  # name : a char string
  sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep = ""))
}
mc <- detectCores()
cl <- makeCluster(mc)
clusterExport(cl, c("sqlQuery", "dbConn"))
parSapply(cl, names, nameToID, dbConn = dbConn) # incorrect passing of nameToID's second argument
###
As in the comment, this is not the correct way to assign the second argument to nameToID.
I have also tried the following:
parSapply(cl, names, function(x) nameToID(x, dbConn))
in place of the previous parSapply call, but that also does not work; the error thrown says "the first parameter is not an open RODBC connection", presumably referring to the first parameter of sqlQuery(). dbConn remains open, though.
The following code does work with parallelization.
###
names <- c("A", "B", "A", "C", "C")
dbConn <- odbcDriverConnect(connection = "connection string") # successfully connects
nameToID <- function(name){
  # name : a char string
  dbConn <- odbcDriverConnect(connection = "connection string")
  result <- sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep = ""))
  odbcClose(dbConn)
  result
}
mc <- detectCores()
cl <- makeCluster(mc)
clusterExport(cl, c("sqlQuery", "odbcDriverConnect", "odbcClose", "dbConn", "nameToID")) # throwing everything in
parSapply(cl, names, nameToID)
###
But the constant opening and closing of the connection ruins the gains from parallelization, and seems just a bit silly.
So the overall question would be how to pass the second parameter (the open db connection) to the function within parSapply, in much the same way as it is done in the regular apply? In general, how does one pass a second, third, nth parameter to a function within a parallel routine?
Thanks and if you need any more information let me know.
-DT

Database connection objects can't be exported or passed as function arguments because they contain socket connections. If you try, the object will be serialized, sent to the workers and deserialized, but it won't work correctly because the underlying socket connection won't be valid on the workers.
The solution is to create the database connection on each worker before calling parSapply. I often do that using clusterEvalQ:
clusterEvalQ(cl, {
  library(RODBC)
  dbConn <- odbcDriverConnect(connection = "connection string")
  NULL
})
Now the worker function can be written as:
nameToID <- function(name) {
  sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep = ""))
}
and called with:
parSapply(cl, names, nameToID)
Also note that since RODBC is loaded on each of the workers, you don't have to export functions defined in it, which I think is good programming practice.
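For completeness, here is a minimal end-to-end sketch of this pattern (the connection string and the table/column names are the question's placeholders):
library(parallel)
names <- c("A", "B", "A", "C", "C")
cl <- makeCluster(detectCores())
# open one connection per worker; return NULL so no connection object travels back to the master
clusterEvalQ(cl, {
  library(RODBC)
  dbConn <- odbcDriverConnect(connection = "connection string")
  NULL
})
nameToID <- function(name) {
  sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep = ""))
}
ids <- parSapply(cl, names, nameToID)
# close the per-worker connections before shutting the cluster down
clusterEvalQ(cl, odbcClose(dbConn))
stopCluster(cl)
As for the general question about passing a second, third, nth argument: for ordinary serializable values (anything that is not a connection handle or similar external pointer), parSapply forwards extra named arguments exactly as sapply does, e.g. parSapply(cl, names, nameToID, dbConn = dbConn). That call fails here only because the deserialized connection object is no longer valid on the workers.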

Related

R - sql query stored as object name does not work with R dbGetQuery

Need a little help with the following R code. I've got quite a lot of data to load from a Microsoft SQL database. I tried a few things to make the SQL queries manageable:
1) Stored the queries as objects with a unique prefix
2) Used a search to return a vector of the object names with that unique prefix
3) Used a for loop to loop through the vector to load the data <- this part didn't work.
library(odbc)
library(tidyverse)
library(stringr)
# setting up db connection, odbc pkg
db <- DBI::dbConnect(odbc::odbc(), Driver = 'SQL Server', Server = 'Server_name', Database = 'Database name', UID = 'User ID', trusted_connection = 'yes')
# defining the sql queries
Sql_query1 <- "select * from db1"
Sql_query2 <- "select top 100 * from db2"
# the following stores the sql query object names in a vector by searching for object names with prefix Sql_
Sql_list <- ls()[str_detect(ls(), regex("sql_", ignore_case = TRUE))]
# This is the part where the code didn't work
for (i in Sql_list){ i <- dbGetQuery(db, i) }
The error I've got is "Error: 'Sql_query1' nanodbc.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Could not find stored procedure 'Sql_query1'"
However, if I don't use the loop, no error occurs! That may be feasible if I've only got 2-3 queries to manage... unfortunately I've got 20 of them!
dbGetQuery(db, Sql_query1)
Can anyone help? Thank you!
Rohit's solution written down:
The first part from your side is fine:
#setting up dB connection, odbc pkg
db <- DBI::dbConnect(odbc::odbc(), Driver = 'SQL Server', Server = 'Server_name', Database = 'Database name', UID = 'User ID', trusted_connection = 'yes')
But then it would be more convenient to do something like this:
A more verbose version:
sqlqry_lst <- vector(mode = 'list', length = 2) # create a list to hold the queries; in real life, length = 20
names(sqlqry_lst) <- paste0('Sql_query', 1:2)   # assign names to your list; again, just use 1:20 in your real-life example
# put the SQL code into the list elements
sqlqry_lst['Sql_query1'] <- "select * from db1"
sqlqry_lst['Sql_query2'] <- "select top 100 * from db2"
# if you really want to use for loops
res <- vector(mode = 'list', length(sqlqry_lst)) # result list
for (i in seq_along(sqlqry_lst)) res[[i]] <- dbGetQuery(db, sqlqry_lst[[i]])
Or, as a two-liner, a bit more R-stylish and imho more elegant:
sqlqry_lst <- list(Sql_query1="select * from db1", Sql_query2="select top 100 * from db2")
res <- lapply(sqlqry_lst, FUN = dbGetQuery, conn=db)
I suggest you mix and match the verbose version (e.g. for creating, or more precisely for naming, the query list) and the short version for running the queries against the database, as suits you best.
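If you would rather keep the individually named Sql_query1, Sql_query2, ... objects from the question, a sketch of that variant (assuming Sql_list holds the object names as above): mget() turns the vector of names into a named list of query strings, and get(i) fixes the original loop, which failed because it passed the object name rather than the query text to the database.
# named list of query strings, then run them all
res <- lapply(mget(Sql_list), dbGetQuery, conn = db)
# or, keeping the original for loop, look the query text up with get()
res <- setNames(vector(mode = 'list', length(Sql_list)), Sql_list)
for (i in Sql_list) res[[i]] <- dbGetQuery(db, get(i))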

R - doRedis - Overwrite getTask to control the order of execution in parallel foreach loops

Problem: I need to control the order in which tasks are processed in parallel by a foreach loop. Unfortunately, this is not supported by foreach.
Solution in mind: Using doRedis, so that the database holds all tasks that are executed in the foreach loop. To control the order, I want to overwrite getTask via setGetTask so that tasks are fetched in a pre-specified order. However, I could not find much documentation on how to do this.
Additional Information:
There is a small paragraph on setGetTask with an example in the doRedis documentation.
getTask <- function ( queue , job_id , ...)
{
key <- sprintf("
redisEval("local x=redis.call('hkeys',KEYS[1])[1];
if x==nil then return nil end;
local ans=redis.call('hget',KEYS[1],x);
redis.call('hdel',KEYS[1],x);i
return ans",key)
}
setGetTask(getTask)
I do think, though, that the code in the documentation is syntactically incorrect (missing, imho, a " and a closing bracket ")"). I thought this should not be possible on CRAN, as the code in the documentation is executed on submission.
Changing the getTask function does not change anything with regard to the workers getting tasks (even when introducing obvious nonsense into the redisEval, like changing it to redisEval("dddddddddd((("))
I only had access to the setGetTask function after installing the package from source (which I downloaded from the official CRAN package page, version 1.1.1), which imho should make no difference compared to installing it directly from CRAN.
Data: The data frame of tasks to execute looks like the following:
taskName;taskQueuePosition;parameter1;parameterN
taskT;1;val1;10
taskK;2;val2;8
taskP;3;val3;7
taskA;4;val4;7
I want to use 'taskQueuePosition' to control the order, tasks with lower numbers should be executed first.
Questions:
Does anybody know any sources where I can get more information on doing this with doRedis or on setGetTask?
Does anybody know how I need to change getTask to achieve the above described?
Any other smart ideas to control the order of execution in a foreach loop? Preferably such that at some point I can use doRedis as the parallel back end (changing this would mean a major change in the processing, due to complicated technical infrastructure reasons).
Code (for easy reproduction):
The following assumes that the redis-server is started on the local machine.
Redis DB Filling:
library(doRedis)
library(foreach)
options('redis:num'=TRUE) # needed for proper execution
REDIS_JOB_QUEUE = "jobs"
registerDoRedis(REDIS_JOB_QUEUE)
# filling up the data frame
taskDF = data.frame(taskName=c("taskT","taskK","taskP","taskA"),
taskQueuePosition=c(1,2,3,4),
parameter1=c("val1","val2","val3","val4"),
parameterN=c(10,8,7,7))
foreach(currTask=iter(taskDF, by='row'),
.verbose = T
) %dopar% {
print(paste("Executing task: ",currTask$taskName))
Sys.sleep(currTask$parameterN)
}
removeQueue(REDIS_JOB_QUEUE)
Worker:
library(doRedis)
REDIS_JOB_QUEUE = "jobs"
startLocalWorkers(n=1, queue=REDIS_JOB_QUEUE)
I could solve the problem and now can control the order of task execution.
Additional information:
1) There seems to be a typo in the documentation that renders the getTask example non-functional. Judging by the form of the default_getTask function from the file task.R in the package, it should probably look something like:
getTaskDefault <- function ( queue , job_id , ...)
{
key <- sprintf("%s:%s",queue, job_id)
return(redisEval("local x=redis.call('hkeys',KEYS[1])[1];
if x==nil then return nil end;
local ans=redis.call('hget',KEYS[1],x);
redis.call('set', KEYS[1] .. '.start.' .. x, x);
redis.call('hdel',KEYS[1],x);
return ans",key))
}
It seems that the characters after the first percent sign in the first line of the function body got lost, which would explain the uneven number of brackets and quotes.
2) setGetTask still does not have any effect for me. When I instead set the getTask function through .options.redis while the DB is being filled (as described in the package vignette), it is successfully called.
3) The information in 2) means that I do not need the setGetTask function, so I can use the package from CRAN.
----- Answers to the questions -----
1) The doRedis vignette describes how a custom getTask can be successfully set.
2) and 3) When the Lua script in the getTask function is modified as below, the tasks are drawn from the database in the order they were submitted. This is not exactly what I was asking for, but due to time constraints and the fact that I have (or rather had) no idea about Lua scripting, it is imho a satisfying solution: the order of submission is controlled by the taskQueuePosition column.
getTaskInOrder <- function ( queue , job_id , ...)
{
key <- sprintf("%s:%s",queue, job_id)
return(redisEval("
local tasks=redis.call('hkeys',KEYS[1]); -- get all tasks
local x=tasks[1]; -- get the first available task
if x==nil then -- if there are no tasks left, stop processing
return nil
end;
local xMin = 65535; -- if we have more tasks than 65535, getting the
-- task with the lowest taskID is not guaranteed to be the first one
local i = 1;
-- local iMinFound = -1;
while (x ~= nil) do -- search the array until there are no tasks left
-- print('x: ',x)
local xNum = tonumber(x);
if(xNum<xMin) then
xMin = xNum;
-- iMinFound = i;
end
i=i+1;
-- print('i is now: ',i);
x=tasks[i];
end
-- print('Minimum is task number',xMin,' found at i ', iMinFound)
x=tostring(xMin) -- convert it back to a string (maybe it would
                 -- be better to keep the original string somewhere,
                 -- in case we lose some information while converting to number)
-- print('x is now:',x);
-- print(KEYS[1] .. '.start.' .. x, x);
-- print('');
local ans=redis.call('hget',KEYS[1],x);
redis.call('set', KEYS[1] .. '.start.' .. x, x);
redis.call('hdel',KEYS[1],x);
return ans",key))
}
Important note: I noticed that if a task is aborted, the order is screwed up and the resubmitted task (even though the task number remains the same), will be executed after the originally submitted tasks. This is okay for me.
------ Code (for easy reproduction):------
This leads to the following code example (with 12 entries in the task data frame instead of the original 4):
Redis DB Filling:
library(doRedis)
library(foreach)
options('redis:num'=TRUE) # needed for proper execution
REDIS_JOB_QUEUE = "jobs"
getTaskInOrder <- function ( queue , job_id , ...)
{
...like above
}
registerDoRedis(REDIS_JOB_QUEUE)
# filling up the data frame already in order of tasks to be executed
# otherwise the dataframe has to be sorted by taskQueuePosition
taskDF = data.frame(taskName=c("taskA","taskB","taskC","taskD","taskE","taskF","taskG","taskH","taskI","taskJ","taskK","taskL"),
taskQueuePosition=c(1,2,3,4,5,6,7,8,9,10,11,12),
parameter1=c("val1","val2","val3","val4","val1","val2","val3","val4","val1","val2","val3","val4"),
parameterN=c(5,5,5,4,4,4,4,3,3,3,2,2))
foreach(currTask=iter(taskDF, by='row'),
        .verbose = T,
        .options.redis = list(getTask = getTaskInOrder)
) %dopar% {
print(paste("Executing task: ",currTask$taskName))
Sys.sleep(currTask$parameterN)
}
removeQueue(REDIS_JOB_QUEUE)
Worker:
library(doRedis)
REDIS_JOB_QUEUE = "jobs"
startLocalWorkers(n=1, queue=REDIS_JOB_QUEUE)
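As the comment in the DB-filling code notes, the task data frame must already be sorted by taskQueuePosition; if it is not, a one-line sketch to sort it before the foreach call:
taskDF <- taskDF[order(taskDF$taskQueuePosition), ] # submission order now matches taskQueuePosition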
Another note: just in case you are processing long jobs, as I do, please be aware of a bug in doRedis 1.1.1 (the current version on CRAN) which leads to tasks being resubmitted (due to a timeout) while the workers are still working on them.

Create a stored procedure using RMySQL

Background: I am developing an R script that pulls data from a MySQL database, performs a logistic regression and then inserts the predictions back into the database. I want the entire system to be self-contained in the script in case of database failure. This includes all MySQL stored procedures that the script depends on to aggregate the data on the backend, since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
I however get the following error that seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?
Just wanted to follow up on this string of comments. Thank you for your thoughts on this issue. I have a couple Python scripts that need to have this functionality and I began researching the same topic for Python. I found this question that indicates the answer. The question states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
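Since DELIMITER is a client-side builtin, the DROP and CREATE from the original attempt should also be sent as two separate calls (most client APIs execute one statement per call). A sketch following the same pattern and cleanup idiom as above:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
# send each statement on its own, clearing the result after each
dbSendQuery(connect, "DROP PROCEDURE IF EXISTS Test.Tester")
dbClearResult(dbListResults(connect)[[1]])
dbSendQuery(connect, "CREATE PROCEDURE Test.Tester() BEGIN /***DO DATA AGGREGATION***/ END")
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)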

Show all open RODBC connections

Does anyone know how to do this? showConnections won't list any open connections from odbcConnect.
You can narrow down your search in the following way, which will return all variables in your environment currently of class RODBC.
envVariables <- ls()
bools <- sapply(envVariables, function(string){
  class(get(string)) == "RODBC"
})
rodbcObj <- envVariables[bools]
Closed connections are still of class RODBC, though, so there is still a little work to be done here.
We can define a function, using tryCatch, that tries to get the connection info of the associated RODBC object. If it is an open connection, this command runs fine, and we return the string of the variable name.
If the RODBC object is not an open connection, this throws an error, which we catch and, in the way I've implemented it, return NA. You could return any number of things here.
openConns <- function(string){
  tryCatch({
    result <- odbcGetInfo(get(string))
    string
  }, error = function(e){
    NA
  })
}
We then remove the return values that correspond to errors. In my case that's NA, so I call na.omit on the result.
na.omit(sapply(rodbcObj, openConns))
Or alternatively
result<-sapply(rodbcObj, openConns)
result[!is.na(result)]
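Putting the pieces together, a convenience wrapper (the name listOpenRODBC is mine, not part of RODBC) that returns the names of the variables in an environment that hold open RODBC connections:
listOpenRODBC <- function(env = .GlobalEnv) {
  vars <- ls(envir = env)
  isOpen <- vapply(vars, function(v) {
    obj <- get(v, envir = env)
    # open connections answer odbcGetInfo(); closed ones throw an error
    inherits(obj, "RODBC") && !inherits(try(odbcGetInfo(obj), silent = TRUE), "try-error")
  }, logical(1))
  vars[isOpen]
}
listOpenRODBC()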
Any questions or comments on it let me know
-DMT

SQL query error with ODBC connection in R using Informix driver

With functionality from the RODBC package, I have successfully created an ODBC connection but receive error messages when I try to query the database. I am using the Informix 3.31 32-bit driver (version 3.31.00.10287).
channel <- odbcConnect("exampleDSN")
unclass(channel)
[1] 3
attr(,"connection.string")
[1] "DSN=exampleDSN;UID=user;PWD=****;DB=exampleDB;HOST=exampleHOST;SRVR=exampleSRVR;SERV=exampleSERV;PRO=onsoctcp ... (more parameters)"
attr(,"handle_ptr")
<pointer: 0x0264c098>
attr(,"case")
[1] "nochange"
attr(,"id")
[1] 4182
attr(,"believeNRows")
[1] TRUE
attr(,"colQuote")
[1] "\""
attr(,"tabQuote")
[1] "\""
attr(,"interpretDot")
[1] TRUE
attr(,"encoding")
[1] ""
attr(,"rows_at_time")
[1] 100
attr(,"isMySQL")
[1] FALSE
attr(,"call")
odbcDriverConnect(connection = "DSN=exampleDSN")
When I try to query and investigate the structure of the returned object, I receive an error message 'chr [1:2] "42000 -201 [Informix][Informix ODBC Driver][Informix]A syntax error has occurred." ...'
Specifically, I wrote an expression to loop through all tables in the database, retrieve 10 rows, and investigate the structure of the returned object.
for (i in 1:153){res <- sqlFetch(channel, sqlTables(channel, tableType="TABLE")$TABLE_NAME[i], max=10); str(res)}
Each iteration returns the same error message. Any ideas where to start?
ADDITIONAL INFO: When I return the object 'res', I receive the following -
> res
[1] "42000 -201 [Informix][Informix ODBC Driver][Informix]A syntax error has occurred."
[2] "[RODBC] ERROR: Could not SQLExecDirect 'SELECT * FROM \"exampleTABLE\"'"
The error message you quote is:
"[RODBC] ERROR: Could not SQLExecDirect 'SELECT * FROM \"exampleTABLE\"'"
Informix only recognizes table names enclosed in double quotes if DELIMIDENT is set in the environment, either of the server or the client (or both). It doesn't much matter what it is set to; I use DELIMIDENT=1 when I want delimited identifiers.
How did you create the table in the Informix database? Unless you created the table with DELIMIDENT set, the table name will not be case sensitive; you do not need the quotes around the table name.
The fact that you're getting error -201 means you've got through the connection process; that is a good start, and simplifies what follows.
I'm not sure whether you're on a Unix machine or a Windows machine - it often helps to indicate that. On Windows, you might have to set the environment with SETNET32 (an Informix program), or there may be a way to specify the DELIMIDENT in the connect string. On Unix, you probably set it in your environment and the R software picks it up. However, there might be problems if you launch R via some sort of menu button or option in a GUI environment; the chances are that the profile is not executed before the R program is.
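If setting it in a profile or via SETNET32 is inconvenient, one option worth trying is to set the variable from R itself before the connection is opened, since the ODBC driver runs inside the R process. A sketch, assuming the driver reads the process environment:
library(RODBC)
Sys.setenv(DELIMIDENT = "1") # must happen before odbcConnect so the driver sees it
channel <- odbcConnect("exampleDSN")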
You can try using the sqlQuery() function in RODBC to retrieve your results. This is the function I use at work and have never had a problem with:
sqlQuery(channel, "select top 10 * from exampleTABLE")
(Note that top is SQL Server syntax; the Informix equivalent is first, e.g. select first 10 * from exampleTABLE.)
You should be able to put all of your queries into a list and iterate through them as you were before:
dat <- lapply(queries, function(x) sqlQuery(channel, x))
where queries is your list of queries and channel is your open ODBC connection. I guess I should also encourage you to close said connection when you're done, with odbcCloseAll()
