Create a stored procedure using RMySQL

Background: I am developing an R script that pulls data from a MySQL database, performs a logistic regression, and then inserts the predictions back into the database. I want the entire system to be self-contained in the script in case of a database failure. This includes all MySQL stored procedures the script depends on to aggregate the data on the back end, since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
However, I get the following error, which seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?

Just wanted to follow up on this thread of comments. Thank you for your thoughts on this issue. I have a couple of Python scripts that need the same functionality, so I began researching the same topic for Python. I found this question, which points to the answer. It states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
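Since the connector executes one statement per call, the DROP PROCEDURE from the original script can simply be sent as a separate statement before the CREATE PROCEDURE; no DELIMITER is needed. A minimal sketch of that idea (the connection group and procedure body are the placeholders from the question):
library(RMySQL)
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
# Each statement goes in its own call, so no DELIMITER is required
res <- dbSendQuery(connect, "DROP PROCEDURE IF EXISTS Test.Tester")
dbClearResult(res)
res <- dbSendQuery(connect, "
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
")
dbClearResult(res)
dbDisconnect(connect)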

Related

What is causing the syntax error in my SQL stored procedure?

I'm pretty new to stored procedures and I get the following syntax error in my code:
You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'DELIMITER //CREATE PROCEDURE sp_create_probe(IN matrix_id INT, IN oligo_id IN...' at line 2
Here's the code for my procedure:
DELIMITER //
CREATE PROCEDURE sp_create_probe(IN matrix_id INT, IN oligo_id INT)
BEGIN
SET FOREIGN_KEY_CHECKS=0;
INSERT INTO Probes (oligo, microarray) VALUES (oligo_id, matrix_id);
SET FOREIGN_KEY_CHECKS=1;
END //
DELIMITER ;
Hope someone can help me out
EDIT:
Well I ended up fixing it. Classic noob mistake.
I had this line before the code you see above:
DROP PROCEDURE IF EXISTS sp_create_probe
I didn't copy that line into the post for some reason, but it was missing the ";" at the end. Adding it fixed the error.

SQL Server - running daily jobs with R scripts

I was wondering if anybody had any ideas on how to set up a SQL Server job to run R scripts. Here is what I have so far in terms of the SQL code: I have to extract data out of a database (ETL) and want that data to be aggregated/analyzed by R at a specified time, after which the job would run automatically on its own. Does anybody have any ideas where the SQL ETL code (from the database) should go, and where the R script step should go, so that it all eventually runs automatically? Thanks!
DATABASE -> ETL code from the database (generating its own dataset) -> an R script then manipulates/transforms ONLY that dataset.
DECLARE @job_name NVARCHAR(128),
        @description NVARCHAR(512),
        @owner_login_name NVARCHAR(128),
        @database_name NVARCHAR(128);
SET @job_name = N'Some Title';
SET @description = N'Periodically do something';
SET @owner_login_name = N'login';
SET @database_name = N'DATABASE';
-- Delete job if it already exists:
IF EXISTS(SELECT job_id FROM msdb.dbo.sysjobs WHERE (name = @job_name))
BEGIN
    EXEC msdb.dbo.sp_delete_job
        @job_name = @job_name;
END
EXEC msdb.dbo.sp_add_job
    @job_name=@job_name,
    @enabled=1,
    @notify_level_eventlog=0,
    @notify_level_email=2,
    @notify_level_netsend=2,
    @notify_level_page=2,
    @delete_level=0,
    @description=@description,
    @category_name=N'[Uncategorized (Local)]',
    @owner_login_name=@owner_login_name;
-- Add server:
EXEC msdb.dbo.sp_add_jobserver @job_name=@job_name;
-- Add step to execute SQL:
EXEC msdb.dbo.sp_add_jobstep
    @job_name=@job_name,
    @step_name=N'Execute SQL',
    @step_id=1,
    @cmdexec_success_code=0,
    @on_success_action=1,
    @on_fail_action=2,
    @retry_attempts=0,
    @retry_interval=0,
    @os_run_priority=0,
    @subsystem=N'TSQL',
    @command=N'EXEC my_stored_procedure; -- OR ANY SQL STATEMENT',
    @database_name=@database_name,
    @flags=0;
-- Update job to set start step:
EXEC msdb.dbo.sp_update_job
    @job_name=@job_name,
    @enabled=1,
    @start_step_id=1,
    @notify_level_eventlog=0,
    @notify_level_email=2,
    @notify_level_netsend=2,
    @notify_level_page=2,
    @delete_level=0,
    @description=@description,
    @category_name=N'[Uncategorized (Local)]',
    @owner_login_name=@owner_login_name,
    @notify_email_operator_name=N'',
    @notify_netsend_operator_name=N'',
    @notify_page_operator_name=N'';
-- Schedule job:
EXEC msdb.dbo.sp_add_jobschedule
    @job_name=@job_name,
    @name=N'Daily',
    @enabled=1,
    @freq_type=4,
    @freq_interval=1,
    @freq_subday_type=1,
    @freq_subday_interval=0,
    @freq_relative_interval=0,
    @freq_recurrence_factor=1,
    @active_start_date=20170101, --YYYYMMDD
    @active_end_date=99991231, --YYYYMMDD (this represents no end date)
    @active_start_time=010000, --HHMMSS
    @active_end_time=235959; --HHMMSS
SQL Server supports running R (since SQL Server 2016) and Python (since SQL Server 2017) in-database. Take a look at this.
Method for versions before SQL Server 2016
However, if you are using an earlier version, you need another approach. My suggestion is to use R to kick off the SQL instead (if you can), as in the sketch below.
Use a library like taskscheduleR: first create a scheduled task that runs your R computation, then connect to SQL Server with a library like odbc or DBI to update whatever you want.
P.S. I haven't tried this before, but I can run some tests if you still have problems with it.
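A rough sketch of that approach, assuming a daily schedule; the connection details, query, table name, and script path are placeholders, and the call to taskscheduleR's taskscheduler_create() should be adapted to your environment:
# daily_job.R -- the script the scheduled task will run
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(),
                 Driver = "SQL Server",
                 Server = "my_server",       # placeholder
                 Database = "my_database",   # placeholder
                 Trusted_Connection = "Yes")
dat <- dbGetQuery(con, "EXEC my_etl_procedure;")  # placeholder ETL step
result <- dat                                     # placeholder: aggregate/analyze in R here
dbWriteTable(con, "r_results", result, overwrite = TRUE)
dbDisconnect(con)

# Run once interactively (Windows only) to register the daily task
library(taskscheduleR)
taskscheduler_create(taskname = "daily_r_job",
                     rscript = "C:/scripts/daily_job.R",
                     schedule = "DAILY",
                     starttime = "01:00")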

Sqlite (within SqliteStudio): invalid command name "parray"

I am exploring writing functions in Tcl for SQLite (https://github.com/pawelsalawa/sqlitestudio/wiki/ScriptingTcl).
I wanted to try a basic example found on the official SQLite page (http://sqlite.org/tclsqlite.html):
db eval {SELECT * FROM MyTable ORDER BY MyID} values {
parray values
puts ""
}
I get the following error:
Error while requesting the database « -- » : invalid command name "parray"
Help is very welcome :)
SqliteStudio does not seem to fully initialise Tcl, as you would expect from a non-embedded installation:
Using external Tcl packages or modules is not possible, because Tcl
interpreters are not initialized with "init.tcl".
See Wiki.
Background
Standard Tcl sources init.tcl early in a Tcl interpreter's initialisation. init.tcl, in turn, registers a number of Tcl procs for autoloading; parray is one of those lazily acquired procs.
Ways forward
I am not familiar with SqliteStudio. Why not stick with SQLite's standard Tcl frontend, which gives you full Tcl and comes free with Tcl distributions? But this certainly depends on your requirements.
That said, you could attempt to force-load init.tcl in SqliteStudio's embedded Tcl, but I don't know (and can't test) whether the distribution has pruned these scripts or effectively relocated them. Off the top of my head (untested):
source [file join $tcl_library init.tcl]
# ...
db eval {SELECT * FROM MyTable ORDER BY MyID} values {
parray values
puts ""
}

parLapply on sqlQuery from RODBC

R Version : 2.14.1 x64
Running on Windows 7
Connecting to a database on a remote Microsoft SQL Server 2012
I have an unordered vector of names, say:
names <- c("A", "B", "A", "C", "C")
each of which has an id in a table in my db. I need to convert the names to their corresponding ids.
I currently have the following code to do it.
###
names<-c("A", "B", "A", "C", "C")
dbConn<-odbcDriverConnect(connection="connection string") #successfully connects
nameToID<-function(name, dbConn){
#dbConn : active db connection formed via odbcDriverConnect
#name : a char string
sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}
sapply(names, nameToID, dbConn=dbConn)
###
Barring better ways to do this, which could involve loading the table into R and working with the problem there (which is possible), I understand why the following doesn't work, but I cannot seem to find a solution. Attempting to use parallelization via the 'parallel' package:
###
names<-c("A", "B", "A", "C", "C")
dbConn<-odbcDriverConnect(connection="connection string") #successfully connects
nameToID<-function(name, dbConn){
#dbConn : active db connection formed via odbcDriverConnect
#name : a char string
sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}
mc<-detectCores()
cl<-makeCluster(mc)
clusterExport(cl, c("sqlQuery", "dbConn"))
parSapply(cl, names, nameToID, dbConn=dbConn) #incorrect passing of nameToID's second argument
###
As the comment notes, this is not the correct way to pass the second argument to nameToID.
I have also tried the following:
parSapply(cl, names, function(x) nameToID(x, dbConn))
in place of the previous parSapply call, but that also does not work; the error thrown says "the first parameter is not an open RODBC connection", presumably referring to the first parameter of sqlQuery(). dbConn remains open, though.
The following code does work with parallelization.
###
names<-c("A", "B", "A", "C", "C")
dbConn<-odbcDriverConnect(connection="connection string") #successfully connects
nameToID<-function(name){
#name : a char string
dbConn<-odbcDriverConnect(connection="string")
result<-sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
odbcClose(dbConn)
result
}
mc<-detectCores()
cl<-makeCluster(mc)
clusterExport(cl, c("sqlQuery", "odbcDriverConnect", "odbcClose", "dbConn", "nameToID")) #throwing everything in
parSapply(cl, names, nameToID)
###
But the constant opening and closing of the connection ruins the gains from parallelization, and seems just a bit silly.
So the overall question would be how to pass the second parameter (the open db connection) to the function within parSapply, in much the same way as it is done in the regular apply? In general, how does one pass a second, third, nth parameter to a function within a parallel routine?
Thanks and if you need any more information let me know.
-DT
Database connection objects can't be exported or passed as function arguments because they contain socket connections. If you try, it will be serialized, sent to the workers and deserialized, but it won't work correctly since the socket connection won't be valid.
The solution is to create the database connection on each worker before calling parSapply. I often do that using clusterEvalQ:
clusterEvalQ(cl, {
    library(RODBC)
    dbConn <- odbcDriverConnect(connection="connection string")
    NULL
})
Now the worker function can be written as:
nameToID <- function(name) {
    sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}
and called with:
parSapply(cl, names, nameToID)
Also note that since RODBC is loaded on each of the workers you don't have to export functions defined in it, which I think is good programming practice.
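Putting the pieces together, a complete sketch of the pattern might look like the following (the connection string and query are placeholders); ordinary non-connection arguments can still be passed to the worker function through parSapply's ... as usual:
library(parallel)

names <- c("A", "B", "A", "C", "C")
cl <- makeCluster(detectCores())

# open one RODBC connection per worker
clusterEvalQ(cl, {
    library(RODBC)
    dbConn <- odbcDriverConnect(connection = "connection string")  # placeholder
    NULL
})

# each worker resolves dbConn to its own connection
nameToID <- function(name) {
    sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep = ""))
}

ids <- parSapply(cl, names, nameToID)

# close each worker's connection and shut the cluster down
clusterEvalQ(cl, odbcClose(dbConn))
stopCluster(cl)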

Multiple query execution in cloudera impala

Is it possible to execute multiple queries at the same time in Impala? If yes, how does Impala handle it?
I would certainly run some tests of your own, but I was not able to get multiple queries to execute in a single call:
I was using an Impala connection and reading the query from a .sql file. This works for single commands.
from impala.dbapi import connect
# actual server and port changed for this post for security
conn=connect(host='impala server', port=11111,auth_mechanism="GSSAPI")
cursor = conn.cursor()
cursor.execute((open("sandbox/z_temp.sql").read()))
This is the error I received.
HiveServer2Error: AnalysisException: Syntax error in line 2:
This is what the SQL looked like in the .sql file.
Select * FROM database1.table1;
Select * FROM database1.table2;
I was able to run multiple commands by putting the SQL commands in separate .sql files and iterating over all .sql files in a specified folder.
import glob
import pandas as pd

# Create a sorted list of file names for the recon .sql files.
# Numbers at the beginning of each filename are important so that the files
# are executed in the correct order.
file_names = glob.glob('folder/*.sql')
asc_names = sorted(file_names, reverse=False)

# Error log dataframe to print, or write to file, at the end of the job.
df_log = pd.DataFrame(columns=['test_name', 'test_status'])

for file_name in asc_names:
    str_filename = str(file_name)
    print(str_filename)
    query = open(str_filename).read()
    cursor = conn.cursor()
    try:
        # Each SQL command must be executed separately
        cursor.execute(query)
        df_id = pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'PASS'}])
        df_log = df_log.append(df_id, ignore_index=True)
    except:
        df_id = pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'FAIL'}])
        df_log = df_log.append(df_id, ignore_index=True)
        continue
Another way to do this would be to have all of the SQL statements in one .sql file separated by ';', then loop through the file, splitting the statements on ';' and running them one at a time.
from impala.dbapi import connect
from impala.util import as_pandas

conn = connect(host='impalaserver', port=11111, auth_mechanism='GSSAPI')
cursor = conn.cursor()

# Split the SQL statements from one file on ';'.
# Note: the last command will not have a semicolon at the end.
sql_file = open("sandbox/temp.sql").read()
sql = sql_file.split(';')

for cmd in sql:
    # This strips out the carriage returns / newlines you may have
    cmd = cmd.replace('\r', ' ')
    cmd = cmd.replace('\n', ' ')
    # This runs your SQL commands one at a time.
    cursor.execute(cmd)
    print(cmd)
Impala can execute multiple queries at the same time as long as it doesn't hit the memory cap.
You can issue a command like impala-shell -f <<file_name>>, where the file contains multiple queries, each complete query separated by a semicolon (;).
If you are a python geek, you can even try the impyla package to create multiple connections and run all your queries at once.
pip install impyla
