SQL Server - running daily jobs with R scripts

I was wondering if anybody had ideas on how to set up a SQL Server Agent job to run R scripts. Here is what I have so far in terms of the SQL code: I have to extract data out of a database (ETL) and want that data to be aggregated/analyzed by R on a specified date; after that, the job would run automatically. Does anybody have any ideas where the SQL ETL code (from the database) should go and where the R script step should go so that everything eventually runs automatically? Thanks!
DATABASE -> ETL code from the database (generating its own dataset) -> using ONLY that dataset, an R script then manipulates/transforms it.
DECLARE @job_name NVARCHAR(128),
        @description NVARCHAR(512),
        @owner_login_name NVARCHAR(128),
        @database_name NVARCHAR(128);

SET @job_name = N'Some Title';
SET @description = N'Periodically do something';
SET @owner_login_name = N'login';
SET @database_name = N'DATABASE';

-- Delete job if it already exists:
IF EXISTS (SELECT job_id FROM msdb.dbo.sysjobs WHERE name = @job_name)
BEGIN
    EXEC msdb.dbo.sp_delete_job
        @job_name = @job_name;
END

EXEC msdb.dbo.sp_add_job
    @job_name=@job_name,
    @enabled=1,
    @notify_level_eventlog=0,
    @notify_level_email=2,
    @notify_level_netsend=2,
    @notify_level_page=2,
    @delete_level=0,
    @description=@description,
    @category_name=N'[Uncategorized (Local)]',
    @owner_login_name=@owner_login_name;

-- Add server:
EXEC msdb.dbo.sp_add_jobserver @job_name=@job_name;

-- Add step to execute SQL:
EXEC msdb.dbo.sp_add_jobstep
    @job_name=@job_name,
    @step_name=N'Execute SQL',
    @step_id=1,
    @cmdexec_success_code=0,
    @on_success_action=1,
    @on_fail_action=2,
    @retry_attempts=0,
    @retry_interval=0,
    @os_run_priority=0,
    @subsystem=N'TSQL',
    @command=N'EXEC my_stored_procedure; -- OR ANY SQL STATEMENT',
    @database_name=@database_name,
    @flags=0;

-- Update job to set start step:
EXEC msdb.dbo.sp_update_job
    @job_name=@job_name,
    @enabled=1,
    @start_step_id=1,
    @notify_level_eventlog=0,
    @notify_level_email=2,
    @notify_level_netsend=2,
    @notify_level_page=2,
    @delete_level=0,
    @description=@description,
    @category_name=N'[Uncategorized (Local)]',
    @owner_login_name=@owner_login_name,
    @notify_email_operator_name=N'',
    @notify_netsend_operator_name=N'',
    @notify_page_operator_name=N'';

-- Schedule job:
EXEC msdb.dbo.sp_add_jobschedule
    @job_name=@job_name,
    @name=N'Daily',
    @enabled=1,
    @freq_type=4,
    @freq_interval=1,
    @freq_subday_type=1,
    @freq_subday_interval=0,
    @freq_relative_interval=0,
    @freq_recurrence_factor=1,
    @active_start_date=20170101, -- YYYYMMDD
    @active_end_date=99991231,   -- YYYYMMDD (this represents no end date)
    @active_start_time=010000,   -- HHMMSS
    @active_end_time=235959;     -- HHMMSS

SQL Server supports running R in-database since SQL Server 2016 (R Services) and Python since SQL Server 2017 (Machine Learning Services). Take a look at the Machine Learning Services documentation.
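For example, on SQL Server 2016 and later an R step can run inside the database through sp_execute_external_script (after enabling the "external scripts enabled" option with sp_configure). A minimal sketch, not from the question: the dbo.Sales table and its columns are placeholders, and a statement like this could go into a T-SQL job step such as the one above.
-- Hedged sketch: aggregate a query's rows with R inside SQL Server.
-- Prerequisite: sp_configure 'external scripts enabled', 1; RECONFIGURE;
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # InputDataSet holds the rows returned by @input_data_1;
        # whatever is assigned to OutputDataSet is returned to SQL Server.
        OutputDataSet <- aggregate(Amount ~ SaleDate, data = InputDataSet, FUN = sum)
    ',
    @input_data_1 = N'SELECT SaleDate, Amount FROM dbo.Sales'
WITH RESULT SETS ((SaleDate DATE, TotalAmount FLOAT));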
Method for earlier versions
However, if you are on a version before that (or cannot use Machine Learning Services), you need another approach. My suggestion is to use R to kick off the SQL (if you can).
Use a package like taskscheduleR: first create a scheduled R task that computes your data, then connect to your SQL Server with a package like odbc, DBI, etc. to write back what you want. A rough sketch follows.
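For instance, an untested sketch of that setup (the script path, DSN name, and table names are placeholders, not from the question); taskscheduleR registers the job with the Windows Task Scheduler:
# Hedged sketch: register a daily scheduled task that runs an R script.
library(taskscheduleR)
taskscheduler_create(
  taskname  = "daily_r_aggregation",
  rscript   = "C:/scripts/aggregate.R",
  schedule  = "DAILY",
  starttime = "01:00"
)

# Inside C:/scripts/aggregate.R, something along these lines:
# library(DBI)
# con <- dbConnect(odbc::odbc(), dsn = "MySqlServerDsn")
# dat <- dbGetQuery(con, "EXEC my_stored_procedure")  # the ETL/extract step
# ...aggregate/analyse dat in R...
# dbWriteTable(con, "r_results", results, overwrite = TRUE)
# dbDisconnect(con)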
P.S. I haven't tried this myself, but I can run some tests if you still have problems with it.

Related

Using Beeline as an example (vs hive cli)?

I have a sqoop job run via an oozie coordinator. After a major upgrade we can no longer use the hive cli and were told to use beeline. I'm not sure how to do this. Here is the current process:
I have a hive file: hive_ddl.hql
use schema_name;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
SET mapreduce.map.memory.mb=16384;
SET mapreduce.map.java.opts=-Xmx16G;
SET hive.exec.compress.output=true;
SET mapreduce.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
drop table if exists `table_name_stg` purge;

create external table if not exists `table_name_stg`
(
col1 string,
col2 string,
...
)
row format delimited
fields terminated by '\001'
stored as textfile
location 'my/location/table_name_stg';

drop table if exists `table_name` purge;

create table if not exists `table_name`
stored as parquet
tblproperties('parquet.compress'='snappy') as
select * from schema_name.table_name_stg;

drop table if exists `table_name_stg` purge;
This is pretty straightforward: make a stage table, then use it to build the final table.
It's then called in a .sh file as such:
hive cli -f $HOME/my/path/hive_ddl.hql
I'm new to most of this and not sure what beeline is, and I couldn't find any examples of how to use it to accomplish the same thing my hive cli call does. I'm hoping it's as simple as calling the hive_ddl.hql file differently, rather than having to rewrite everything.
Any help is greatly appreciated.
Beeline is a command line shell supported in Hive. In your case you can replace the hive cli call with a beeline command in the same .sh file. It would look roughly like the one given below.
beeline -u <hiveJDBCUrl> -f test.hql
You can explore more of the beeline command options at the link below:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93CommandLineShell
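For reference, a hedged example of how the call in your .sh file might end up; the JDBC URL, host, and port below are placeholders, and you may need -n/-p or Kerberos options depending on how your HiveServer2 is secured:
beeline -u "jdbc:hive2://hiveserver2-host:10000/schema_name" -f $HOME/my/path/hive_ddl.hql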

Sqlite (within SqliteStudio): invalid command name "parray"

I am exploring writing functions in Tcl for SQLite (https://github.com/pawelsalawa/sqlitestudio/wiki/ScriptingTcl).
I wanted to try a basic example found on the official SQLite page (http://sqlite.org/tclsqlite.html):
db eval {SELECT * FROM MyTable ORDER BY MyID} values {
parray values
puts ""
}
I get the following error:
Error while requesting the database « -- » : invalid command name "parray"
Help is very welcome :)
SqliteStudio does not seem to fully initialise Tcl the way you would expect from a non-embedded installation:
Using external Tcl packages or modules is not possible, because Tcl
interpreters are not initialized with "init.tcl".
See Wiki.
Background
Standard Tcl sources init.tcl early on, as part of a Tcl interpreter's initialisation. init.tcl, in turn, registers a number of Tcl procs for autoloading; parray is one of those lazily acquired procs.
Ways forward
I am not familiar with SqliteStudio. Why not stick with SQLite's standard Tcl frontend, which gives you full Tcl and comes for free with standard Tcl distributions? But this certainly depends on your requirements.
That said, you could attempt to force-load init.tcl in SqliteStudio's embedded Tcl, but I don't know (and can't test) whether the distribution has pruned these scripts or relocated them. Off the top of my head (untested):
source [file join $tcl_library init.tcl]
# ...
db eval {SELECT * FROM MyTable ORDER BY MyID} values {
parray values
puts ""
}
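If sourcing init.tcl turns out not to be possible in that embedded interpreter, another untested workaround is to define a minimal stand-in for parray before the query; this is not the stock implementation, just a rough equivalent:
# Minimal replacement for the autoloaded parray proc (sketch, untested in SqliteStudio)
proc parray {arrayName {pattern *}} {
    upvar 1 $arrayName arr
    foreach key [lsort [array names arr $pattern]] {
        puts "${arrayName}($key) = $arr($key)"
    }
}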

Issue executing a batch file using PeopleCode in an Application Engine program

I want to execute a batch file using PeopleCode in an Application Engine program, but the program has an issue: Exec returns a non-zero value (value: 1).
Below is the PeopleCode snippet:
Global File &FileLog;
Global string &LogFileName, &Servername, &commandline;
Local string &Footer, &ScriptName;
Local number &ExitCode;
If &Servername = "PSNT" Then
&ScriptName = "D: && D:\psoft\PT854\appserv\prcs\RNBatchFile.bat";
End-If;
&commandline = &ScriptName;
/* Need to commit work or Exec will fail */
CommitWork();
&ExitCode = Exec("cmd.exe /c " | &commandline, %Exec_Synchronous + %FilePath_Absolute);
If &ExitCode <> 0 Then
MessageBox(0, "", 0, 0, ("Batch File Call Failed! Exit code returned by script was " | &ExitCode));
End-If;
Any help on how to resolve this issue would be appreciated.
Best bet is to do a trace of the execution.
Thoughts:
Can you log on to the process scheduler server you are running this on and execute the script OK?
Is the AE being scheduled or called at run-time?
You should not need to change directory, as you are using a fully qualified path to the script.
You should not need to call "cmd /c", as this will create an additional shell for your application to run within, making debugging harder, etc.
Run a trace, and drop us the output. :) HTH
What about changing the working directory to D: inside the script instead? You are invoking two commands, and I'm wondering what the shell is returning to Exec. I'm assuming you wrote your script to give the appropriate return code and that isn't the problem.
I couldn't tell from the question text, but are you looking for a negative result, such as -1? I think return codes are usually positive: 0 for success, some other positive number for failure. Negative numbers may be acceptable, but I wonder whether Exec doesn't like negative numbers.
Perhaps the PeopleCode ChDir function still works as an alternative to two commands in one line? I haven't tried it for a LONG time.
Another alternative that gives you significant control over the process is to use java.lang.Runtime.exec from PeopleCode: http://jjmpsj.blogspot.com/2010/02/exec-processes-while-controlling-stdin.html.
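Pulling those suggestions together, here is a hedged PeopleCode sketch, assuming ChDir is still available on your Tools release; the paths are the ones from the question:
/* Change the working directory first, then call the script directly,
   without the extra "cmd.exe /c" shell. */
Local string &ScriptName;
Local number &ExitCode;

&ScriptName = "D:\psoft\PT854\appserv\prcs\RNBatchFile.bat";

CommitWork();   /* commit work or Exec will fail */
ChDir("D:\");
&ExitCode = Exec(&ScriptName, %Exec_Synchronous + %FilePath_Absolute);

If &ExitCode <> 0 Then
   MessageBox(0, "", 0, 0, "Batch file call failed! Exit code was " | &ExitCode);
End-If;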

Create a stored procedure using RMySQL

Background: I am developing an R script that pulls data from a MySQL database, performs a logistic regression, and then inserts the predictions back into the database. I want the entire system to be self-contained in the script in case of a database failure. This includes all MySQL stored procedures that the script depends on to aggregate the data on the backend, since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
However, I get the following error, which seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?
Just wanted to follow up on this string of comments. Thank you for your thoughts on this issue. I have a couple of Python scripts that need the same functionality, and I began researching the same topic for Python. I found a question that indicates the answer. It states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
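If the DROP PROCEDURE is needed as well, most client APIs execute one statement per call, so a hedged variation (untested, same RMySQL style as above) is to send the two statements separately:
# Sketch: send DROP and CREATE as two separate statements.
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)

statements <- c(
  "DROP PROCEDURE IF EXISTS Test.Tester",
  "CREATE PROCEDURE Test.Tester()
   BEGIN
     /***DO DATA AGGREGATION***/
   END"
)

for (stmt in statements) {
  res <- dbSendQuery(connect, stmt)
  dbClearResult(res)
}

dbDisconnect(connect)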

Multiple query execution in cloudera impala

Is it possible to execute multiple queries at the same time in Impala? If yes, how does Impala handle it?
I would certainly recommend doing some tests of your own, but I was not able to get multiple queries to execute in a single call:
I was using an Impala connection and reading the query from a .sql file. This works for single commands.
from impala.dbapi import connect
# actual server and port changed for this post for security
conn=connect(host='impala server', port=11111,auth_mechanism="GSSAPI")
cursor = conn.cursor()
cursor.execute((open("sandbox/z_temp.sql").read()))
This is the error I received.
HiveServer2Error: AnalysisException: Syntax error in line 2:
This is what the SQL looked like in the .sql file.
Select * FROM database1.table1;
Select * FROM database1.table2;
I was able to run multiple commands with the SQL commands in separate .sql files iterating over all .sql files in a specified folder.
import glob
import pandas as pd

# Create a list of file names for the recon .sql files; this will be sorted.
# Numbers at the beginning of the filenames are important so that the files
# are executed in the correct order.
file_names = glob.glob('folder/*.sql')
asc_names = sorted(file_names, reverse=False)

# Error log dataframe to print, or write to file, at the end of the job.
df_log = pd.DataFrame(columns=['test_name', 'test_status'])

for file_name in asc_names:
    str_filename = str(file_name)
    print(str_filename)
    query = open(str_filename).read()
    cursor = conn.cursor()  # conn is the Impala connection created above
    try:
        # Each SQL command must be executed separately.
        cursor.execute(query)
        df_id = pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'PASS'}])
        df_log = df_log.append(df_id, ignore_index=True)
    except:
        df_id = pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'FAIL'}])
        df_log = df_log.append(df_id, ignore_index=True)
        continue
Another way to do this would be to have all of the SQL statements in one .sql file separated by ';', then loop through the .sql file, splitting the statements out by ';' and running them one at a time.
from impala.dbapi import connect
from impala.util import as_pandas
conn=connect(host='impalaserver', port=11111, auth_mechanism='GSSAPI')
cursor = conn.cursor()
# Split the SQL statements from one file separated by ';'.
# Note: the last command will not have a semicolon at the end.
sql_file = open("sandbox/temp.sql").read()
sql = sql_file.split(';')
for cmd in sql:
    # Strip the line-break characters you may have in the file.
    cmd = cmd.replace('\r', '').replace('\n', ' ').strip()
    if not cmd:
        continue  # skip empty chunks (e.g. after a trailing semicolon)
    # This runs your SQL commands one at a time.
    cursor.execute(cmd)
    print(cmd)
Impala can execute multiple queries at the same time, as long as it doesn't hit the memory cap.
You can issue a command like impala-shell -f <<file_name>>, where the file has multiple queries, each complete query separated by a semicolon (;).
If you are a Python geek, you can even try the impyla package to create multiple connections and run all your queries at once.
pip install impyla
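A hedged sketch of that idea, running a few queries concurrently with one connection per thread; the host, port, and queries are placeholders:
# Sketch: run several Impala queries at the same time with impyla.
from concurrent.futures import ThreadPoolExecutor
from impala.dbapi import connect

QUERIES = [
    "SELECT COUNT(*) FROM database1.table1",
    "SELECT COUNT(*) FROM database1.table2",
]

def run_query(sql):
    # Each thread opens its own connection; Impala schedules the queries in parallel.
    conn = connect(host='impala-server', port=21050)
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=len(QUERIES)) as pool:
    for rows in pool.map(run_query, QUERIES):
        print(rows)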
