Automate the execution of an R script on Linux

I would like to run an R script on a Linux server (CentOS) in an automated way. This should be done once a day (if possible several times a day). I would like to download stock prices using R (and later enter them into a database).
For example, the R script looks like this:
library(tidyquant)
library(lubridate)
data <- tq_get("AAPL", from = "2021-01-01", to = today())
How should I write a job so that the script runs automatically at a certain interval?
Can anyone help me?
Many thanks in advance!

You might want to create a service; depending on the CentOS version this is either a systemd unit or an init daemon.
Full information on timed services and how they work, plus a simple tutorial on creating services, can be found in the systemd documentation.
This lets you create a service with the desired conditions and run your application/script.
Service example:
Unit files live in /etc/systemd/system/.
Create the file, for example: sudo touch /etc/systemd/system/updatestockdb.service
Open it and write your service: sudo vim /etc/systemd/system/updatestockdb.service
[Unit]
Description=Update stock price DB

[Service]
Type=simple
ExecStart=/opt/scripts/fetch_Stonks.sh --full --to-external
Restart=on-failure

The scheduling options go in a matching timer unit, e.g. /etc/systemd/system/updatestockdb.timer:
[Unit]
Description=Run updatestockdb.service on a schedule

[Timer]
OnCalendar=daily
AccuracySec=12h
Persistent=true

[Install]
WantedBy=timers.target
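The ExecStart above points at a small wrapper script; a minimal sketch of such a wrapper, assuming the R script lives at a hypothetical path /opt/scripts/fetch_stock_prices.R (adjust to your setup):
#!/usr/bin/env bash
# hypothetical wrapper invoked by updatestockdb.service; paths are placeholders
set -euo pipefail
/usr/bin/Rscript /opt/scripts/fetch_stock_prices.R
Once both unit files are in place, reload systemd and enable the timer, for example:
sudo systemctl daemon-reload
sudo systemctl enable --now updatestockdb.timer
systemctl list-timers updatestockdb.timer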

Related

How to use R libraries in Azure Databricks without depending on CRAN server connectivity?

We are using a few R libraries in Azure Databricks which do not come preinstalled. To install them during job runs on job clusters, we use an init script:
sudo R --vanilla -e 'install.packages("package_name",
repos="https://mran.microsoft.com/snapshot/YYYY-MM-DD")'
During one of our production runs, the Microsoft Server was down (could the timing be any worse?) and the job failed.
As a workaround, we now install libraries in /dbfs/folder_x and when we want to use them, we include the following block in our R code:
.libPaths("/dbfs/folder_x")
library("libraryName")
This does work for us, but what is the ideal solution? If we want to update a library to another version, remove a library, or add one, we have to go through the following steps every time, and there is a chance of forgetting this during code promotions:
install.packages("xyz")
system("cp -R /databricks/spark/R/lib/xyz /dbfs/folder_x/xyz")
It is a very simple and workable solution, but not ideal.
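A shorter way to script the same workaround from the init script (a sketch only — the lib path /dbfs/folder_x and the snapshot date are placeholders, as above) is to install straight into the DBFS folder with a pinned repository:
sudo R --vanilla -e 'install.packages("package_name",
    lib="/dbfs/folder_x",
    repos="https://mran.microsoft.com/snapshot/YYYY-MM-DD")'
This keeps the version pinned to the snapshot date and skips the manual cp step.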

Airflow user issues

We have installed Airflow from a service account, say 'ABC', using sudo root, in a virtual environment, but we are facing a few issues.
We call a Python script using the BashOperator. The Python script uses some environment variables from the Unix account 'ABC'. When run from Airflow, the environment variables are not picked up. To find out which user Airflow runs as, we created a dummy DAG with the BashOperator command 'whoami', and it returns the ABC user. So Airflow is using the same 'ABC' user; why, then, are the environment variables not picked up?
We then tried sudo -u ABC python script. The environment variables are still not picked up, because of the sudo usage. We did a workaround without the environment variables and it ran well in the development environment. But when moving to a different environment we got the error below, and we don't have permission to edit the sudoers file (admin policy does not allow it):
sudo: sorry, you must have a tty to run sudo
We then used the 'impersonation=ABC' option in the .cfg file and ran Airflow without sudo. This time the bash command fails on the environment variables, and it asks for all the packages used in the script to be installed in the virtual environment.
My questions:
1. Airflow is installed through ABC after sudoing root. Why is ABC not treated as the running user while executing the script?
2. Why are ABC's environment variables not picked up?
3. Why does even the impersonation option not pick up the environment variables?
4. Can Airflow be installed without a virtual environment?
5. Which is the best approach to install Airflow: using a separate user and sudoing root? We are using a dedicated user for running the Python script. Experts, kindly clarify.
It's always a good idea to use a virtualenv for installing any Python packages, so you should prefer installing Airflow in a virtualenv.
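A minimal sketch of that setup (the paths are assumptions; adjust the Airflow version and any constraints file to your environment):
# create and activate a dedicated virtualenv for the airflow user
python3 -m venv /home/airflow/venv
source /home/airflow/venv/bin/activate
pip install apache-airflow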
You can use systemd or supervisord and create programs for airflow webserver and scheduler. Example configuration for supervisor:
[program:airflow-webserver]
command=sh /home/airflow/scripts/start-airflow-webserver.sh
directory=/home/airflow
autostart=true
autorestart=true
startretries=3
stderr_logfile=/home/airflow/supervisor/logs/airflow-webserver.err.log
stdout_logfile=/home/airflow/supervisor/logs/airflow-webserver.log
user=airflow
environment=AIRFLOW_HOME='/home/airflow/'
[program:airflow-scheduler]
command=sh /home/airflow/scripts/start-airflow-scheduler.sh
directory=/home/airflow
autostart=true
autorestart=true
startretries=3
stderr_logfile=/home/airflow/supervisor/logs/airflow-scheduler.err.log
stdout_logfile=/home/airflow/supervisor/logs/airflow-scheduler.log
user=airflow
environment=AIRFLOW_HOME='/home/airflow/'
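After dropping that configuration in place, you would tell supervisor to pick it up and start both programs, e.g. (assuming a standard supervisord setup):
supervisorctl reread
supervisorctl update
supervisorctl start airflow-webserver airflow-scheduler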
We got the same issue:
sudo: sorry, you must have a tty to run sudo
The solution we found was to run the script via su instead, e.g.:
su ABC -c 'python script.py'

Accessing environmental variables on rundeck node

This is a two-part question. I am running a script through Rundeck that depends on system-wide environment variables, which I have set in /etc/environment on the node I'm executing the script on.
First, how do I get rundeck to ingest the system environment? I can't find any option in rundeck to do this.
Second, why doesn't this happen by default? I'm under the impression that rundeck works through ssh; shouldn't the system environment be loaded every time it logs in to the node?
First, how do I get rundeck to ingest the system environment? I can't find any option in rundeck to do this.
I managed to do this by adding the following lines:
set -a
. /etc/environment
. /etc/profile
1) Put those lines into the file /etc/rundeck/profile, or
2) Put those lines into a script step.
Remark: I use only script steps in my Rundeck, and I always start the script step with this first line:
#!/usr/bin/env bash
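Putting those pieces together, a complete script step would then look something like this (a sketch; the echo at the end is just a placeholder for whatever your job actually runs):
#!/usr/bin/env bash
# load the system-wide environment before doing any work
set -a
. /etc/environment
. /etc/profile
set +a
# ... your actual job commands go here, e.g.:
echo "PATH is now: $PATH"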
Second, why doesn't this happen by default? I'm under the impression that rundeck works through ssh; shouldn't the system environment be loaded every time it logs in to the node?
I think you need to configure something in the ssh_config file.
Check this link: Rundeck not setting up environment variable for remote execution with different ssh port
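If you go down the SSH route, the usual knobs are SendEnv on the client and AcceptEnv on the sshd side; a sketch, assuming a hypothetical variable MY_VAR that you want forwarded to the node:
# on the Rundeck server, in /etc/ssh/ssh_config (or ~/.ssh/config)
SendEnv MY_VAR
# on the node, in /etc/ssh/sshd_config, then restart sshd
AcceptEnv MY_VAR
Note that this only forwards variables set in the Rundeck user's own session; it does not by itself load the node's /etc/environment.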

How do I use Monit to keep an R script running?

I have an R script that I want to have running continuously on Ubuntu 10.10. I'm trying to set up Monit to ensure that it doesn't go down. As the script starts, it creates a pid file with the lines:
pid <- max(as.integer(system("pgrep -x R", intern = TRUE)))
write(pid, "/var/run/myscript.pid")
Then I've set up Monit with the lines:
check process myscript with pidfile /var/run/myscript.pid
start program = "/usr/bin/R --vanilla < /home/me/myscript.R > /home/me/myscript.out 2>&1"
Monit starts fine, but when I kill the R process, the R process is not started up again. I'm obviously doing something wrong. Is it in the syntax for starting the process? I noticed that the documentation says Monit first tries to stop the programme and I don't know any commands for stopping an R process.
Perhaps of relevance is that the above line for starting the program works when it is in the crontab for the root user, but not when started from my user crontab.
Any guidance greatly appreciated.
I can't comment about Monit, but there is a good article by Andrew Robinson in R News about using Linux/Unix tools to monitor R. In particular, screen and mail might be useful for your application.
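For example, a sketch of launching the script in a detached screen session so it keeps running after you log out (the paths reuse the ones from the question; screen itself does no restarting, so Monit or cron would still have to handle that):
# start the script in a detached session named "myscript"
screen -dmS myscript bash -c 'R --vanilla < /home/me/myscript.R > /home/me/myscript.out 2>&1'
# re-attach later to check on it
screen -r myscript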

How can I copy data over the Amazon's EC2 and run a script?

I am a novice as far as cloud computing goes, but I get the concept and am pretty good at following instructions. I'd like to do some simulations on my data, and each step takes several minutes. Given the hierarchy in my data, it takes several hours for each set. I'd like to speed this up by running it on Amazon's EC2 cloud.
After reading this, I know how to launch an AMI, connect to it via the shell, and launch R at the command prompt.
What I'd like help on is being able to copy data (.rdata files) and a script and just source it at the R command prompt. Then, once all the results are written to new .rdata files, I'd like to copy them back to my local machine.
How do I do this?
I don't know much about R, but I do similar things with other languages. What I suggest would probably give you some ideas.
Set up an FTP server on your local machine.
Create a "startup script" that you launch with your instance.
Let the startup script download the R files from your local machine, initialize R and do the calculations, then upload the new files back to your machine.
Startup script:
#!/bin/bash
set -e -x
# install curl plus any other packages your script needs
apt-get update && apt-get install -y curl
# fetch the R script from the FTP server on your local machine
wget -O /mnt/data_old.R ftp://yourlocalmachine:21/r_files
# run it non-interactively; the transcript goes to /mnt/data_new.Rout
R CMD BATCH /mnt/data_old.R /mnt/data_new.Rout
# push the result back to your machine
/usr/bin/curl -T /mnt/data_new.Rout -u user:pass ftp://yourlocalmachine:21/new_r_files
Start the instance with the startup script:
ec2-run-instances --key KEYPAIR --user-data-file my_start_up_script ami-xxxxxx
First, I'd use Amazon S3 for storing the files, both from your local machine and back from the instance.
As stated before, you can create startup scripts, or even bundle your own customized AMI with all the needed settings and run your instances from it.
So: download the files from a bucket in S3, execute and process them, and finally upload the results back to the same (or a different) bucket in S3.
Assuming the data is small (and scripts rarely are big), S3's cost and usability would be very effective.
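A sketch of that flow from inside the instance (assuming the AWS CLI is installed and configured, and that the bucket and file names are placeholders):
# pull the script and data down from S3
aws s3 cp s3://your-bucket/myscript.R /mnt/myscript.R
aws s3 cp s3://your-bucket/input.rdata /mnt/input.rdata
# run the script; it is assumed to write its results to /mnt/results.rdata
Rscript /mnt/myscript.R
# push the results back up
aws s3 cp /mnt/results.rdata s3://your-bucket/results.rdata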
