How to run cronjob with computer shut off (EC2 instance) - r

I outlined my small project in a different post - to summarize it again quickly, I am trying to do the following:
Write an R script that pulls data from a website
Schedule the R script to automatically run daily at the same time
Write / append the R script's output to a database
I am familiar with R web-scraping packages (rvest, RSelenium) for doing the first bullet. For the second bullet, just today I learned how to create a crontab entry to run my script when I want. However, from what I've read, cron does not run the script when my computer is off.
How can I make the cron job run even when my computer is off? I am somewhat (not really) familiar with EC2 instances. If I put my R script on an EC2 instance, could I schedule a cron job for the script there so that it runs while my own computer is off?
Thanks in advance for help!

Since cron is a service that runs on the instance you can't have it start the EC2 instance for you - it's a catch-22.
You can treat EC2 instances as computers that run in someone else's cellar (most of the time at least). You wouldn't expect a computer to run code when it's not turned on and it's exactly the same for an EC2 instance.
I suggest you consider whether this is really the setup you want. It sounds to me like you'd be better served by AWS Lambda combined with one of Amazon's hosted data stores (RDS, DynamoDB, SimpleDB, or even S3). The downside here is that you're limited to JavaScript, Python, and Java, so you can't use R directly (well, you can, but it's messy since you'll have to package everything you need in a JS/Python/Java app and start it from there).
If you really want to run your R script on the EC2 instance you can start the instance with a lambda and then shut it down from your script. Just make sure your instance isn't set to terminate on shutdown.
Regardless of which path you choose, you will need to create a lambda and run it from a scheduled CloudWatch Event.
Then you just need to implement the lambda, either to run your script or to use the EC2 API to start the instance.
If you use the lambda to start the EC2 instance you should not use cron on the instance to run the script at a specific time, but run it on startup. Then you have your script shut down the instance when it's finished.
Here's an example Python script for starting an EC2 instance from a lambda to get you started:
import logging

import boto3

# Set up logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Set up a boto session to get credentials and region
session = boto3.session.Session()

# Set up EC2
ec2 = session.resource("ec2")

# The instance to start
instance_id = "i-1234567890abcd"


def lambda_handler(event, context):
    logger.info('Start handling event.')
    logger.info('Starting instance ' + instance_id)
    instance = ec2.Instance(instance_id)
    response = instance.start()
    try:
        # CurrentState is a dict with 'Code' and 'Name'; compare the name
        current_state = response['StartingInstances'][0]['CurrentState']['Name']
    except (KeyError, IndexError):
        logger.warning('Unexpected response when starting instance: {}'.format(response))
    else:
        if current_state not in ('pending', 'running'):
            logger.warning('Instance {} is in unexpected state {} after starting'.format(instance_id, current_state))
        else:
            logger.info('Started instance ' + instance_id)
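The scheduled CloudWatch Event mentioned above can also be created from code rather than from the console. The following is only a minimal sketch, assuming you pass in boto3 "events" and "lambda" clients. The rule name, the schedule, and the function ARN are hypothetical placeholders, not values from the question.

```python
def schedule_lambda_daily(events_client, lambda_client, function_arn,
                          rule_name="daily-r-script",
                          schedule="cron(0 6 * * ? *)"):
    """Create a scheduled rule that invokes a Lambda function.

    rule_name and schedule are illustrative defaults (06:00 UTC every day).
    The clients are expected to be boto3.client("events") and
    boto3.client("lambda").
    """
    # Create (or update) the scheduled rule
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )
    # Allow the rule to invoke the function
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=rule_name + "-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # Point the rule at the function
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

You would call it once, for example as schedule_lambda_daily(boto3.client("events"), boto3.client("lambda"), "<your function ARN>"). Note that the schedule is evaluated in UTC, and that running it a second time with the same StatementId is likely to fail on add_permission because the statement already exists.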

Related

Is there an alternative to "dag_dir_list_interval" in Airflow to upload dags from storage to scheduler?

Hope you are all doing well.
I am using an Airflow instance deployed on Kubernetes using the Helm chart.
I set up my DAG folder inside Rook NFS storage.
I need these DAGs to be processed instantly by the Airflow scheduler.
Airflow provides a setting, namely "dag_dir_list_interval". In my configuration I set it to 1, which means that the scheduler checks every second whether there is a new DAG file in the DAG folder.
It works, but as you can imagine it is very inefficient, as it costs a lot in terms of CPU usage.
I wanted to know if there is any alternative to this setting, for example an API call that allows me to tell the scheduler "hey, there is a new DAG to be processed" without checking every second for new files in the NFS storage.
Thank you for your suggestions.

Do Airflow workers share the same file system, or are they isolated?

I have a task in Airflow which downloads a file from GitHub to the local file system, passes it to spark-submit, and then deletes it. I wanted to know if this will create any issues.
Is it possible that two workers running the same task concurrently for two different DAG runs reference the same file?
Sample code:
def python_task_callback():
    download_file(file_name='script.py')
    spark_submit(path='/temp/script.py')
    delete_file(path='/temp/script.py')
For your use case, if you do all of the actions you mentioned (download, submit, delete) in a single task, then you will have no problems regardless of which executor you are running.
If you are splitting the actions between several tasks, then you should use shared storage like S3, Google Storage, etc. In that case it will also work regardless of which executor you are using.
A possible workflow can be:
1st task: copy file from github to S3
2nd task: submit the file to processing
3rd task: delete the file from S3
As for your general question of whether tasks share a disk: that depends on the executor you are using.
With the Local Executor you have only one worker, so all tasks run on the same machine and share its disk.
With the Celery Executor, Kubernetes Executor, and others, tasks may run on different workers.
However, as mentioned, don't assume that tasks share a disk. If you need to scale up from the Local to the Celery executor, you don't want to find yourself having to refactor your code.
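One detail the single-task approach leaves open is the fixed path: two DAG runs that land on the same worker would both use the same file. A minimal sketch of one way around that is to give each call its own temporary directory. Here `download` and `submit` are hypothetical callables standing in for the question's download_file and spark_submit; they are not Airflow APIs.

```python
import os
import tempfile


def run_with_private_copy(download, submit, file_name="script.py"):
    """Download into a per-call temporary directory, submit, then clean up.

    download(path) and submit(path) are hypothetical callables supplied by
    the caller; each call gets its own directory, so concurrent runs on the
    same worker never share a file.
    """
    with tempfile.TemporaryDirectory(prefix="airflow_task_") as workdir:
        path = os.path.join(workdir, file_name)
        download(path)  # write the file to the private path
        submit(path)    # e.g. a wrapper around spark-submit
        return path     # the directory is removed when the block exits
```

The explicit delete step disappears because the directory and its contents are removed when the with block exits, even if submit raises.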

How to prevent "Execution failed:[Errno 32] Broken pipe" in Airflow

I just started using Airflow to coordinate our ETL pipeline.
I encountered the pipe error when I run a dag.
I've seen a general stackoverflow discussion here.
My case is more on the Airflow side. According to the discussion in that post, the possible root cause is:
The broken pipe error usually occurs if your request is blocked or
takes too long and after request-side timeout, it'll close the
connection and then, when the respond-side (server) tries to write to
the socket, it will throw a pipe broken error.
This might be the real cause in my case. I have a PythonOperator that starts another job outside of Airflow, and that job can be very lengthy (i.e. 10+ hours). I wonder what mechanism Airflow has in place that I can leverage to prevent this error.
Can anyone help?
UPDATE1 20190303-1:
Thanks to @y2k-shubham for the SSHOperator. I am able to use it to set up an SSH connection successfully and to run some simple commands on the remote side (indeed, the default SSH connection has to be set to localhost because the job is on localhost), and I see the correct results for hostname and pwd.
However, when I attempted to run the actual job, I received the same error. Again, the error is from the pipeline job instead of the Airflow dag/task.
UPDATE2: 20190303-2
I had a successful run (airflow test) with no error, followed by another failed run (scheduler) with the same error from the pipeline.
While I'd suggest you keep looking for a more graceful way of achieving what you want, I'm putting up example usage as requested.
First you've got to create an SSHHook. This can be done in two ways:
The conventional way, where you supply all requisite settings like host, user, password (if needed) etc. from the client code where you are instantiating the hook. I'm citing an example from test_ssh_hook.py, but you should go through SSHHook as well as its tests thoroughly to understand all possible usages.
ssh_hook = SSHHook(remote_host="remote_host",
                   port="port",
                   username="username",
                   timeout=10,
                   key_file="fake.file")
The Airflow way, where you put all connection details inside a Connection object that can be managed from the UI and only pass its conn_id to instantiate your hook.
ssh_hook = SSHHook(ssh_conn_id="my_ssh_conn_id")
Of course, if you're relying on SSHOperator, then you can directly pass the ssh_conn_id to the operator.
ssh_operator = SSHOperator(ssh_conn_id="my_ssh_conn_id")
Now if you're planning to have a dedicated task for running a command over SSH, you can use SSHOperator. Again I'm citing an example from test_ssh_operator.py, but go through the sources for a better picture.
task = SSHOperator(task_id="test",
                   command="echo -n airflow",
                   dag=self.dag,
                   timeout=10,
                   ssh_conn_id="ssh_default")
But then you might want to run a command over SSH as part of a bigger task. In that case you don't want an SSHOperator; you can use just the SSHHook. The get_conn() method of SSHHook gives you an instance of paramiko's SSHClient. With this you can run a command using the exec_command() call:
ssh_client = ssh_hook.get_conn()
my_command = "echo airflow"
stdin, stdout, stderr = ssh_client.exec_command(
    command=my_command,
    get_pty=my_command.startswith("sudo"),
    timeout=10)
If you look at SSHOperator's execute() method, it is a rather complicated (but robust) piece of code trying to achieve a very simple thing. For my own usage, I had created some snippets that you might want to look at
For using SSHHook independently of SSHOperator, have a look at ssh_utils.py
For an operator that runs multiple commands over SSH (you can achieve the same thing by using bash's && operator), see MultiCmdSSHOperator
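To make the get_conn() / exec_command() approach above a bit more concrete, here is a minimal sketch of a helper that runs one command and fails loudly on a non-zero exit status. The function name run_over_ssh is hypothetical; it only assumes an already-connected paramiko SSHClient (for instance the object returned by ssh_hook.get_conn()) and is not how SSHOperator itself is implemented.

```python
def run_over_ssh(ssh_client, command, timeout=10):
    """Run `command` on an already-connected paramiko SSHClient.

    Returns the decoded stdout and raises RuntimeError if the remote
    command exits with a non-zero status. Hypothetical helper for
    illustration only.
    """
    stdin, stdout, stderr = ssh_client.exec_command(
        command=command,
        get_pty=command.startswith("sudo"),
        timeout=timeout,
    )
    stdin.close()  # nothing is sent to the remote command
    out = stdout.read().decode("utf-8")
    err = stderr.read().decode("utf-8")
    # recv_exit_status() blocks until the remote command has finished
    exit_status = stdout.channel.recv_exit_status()
    if exit_status != 0:
        raise RuntimeError(
            "command {!r} exited with {}: {}".format(
                command, exit_status, err.strip()))
    return out
```

This reads stdout to the end before touching stderr, which is fine for short outputs but can stall if the remote command writes a lot to stderr; that is part of why the real SSHOperator code is more involved.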

Terminate and restart Google DataLab instance?

I am finding when working with larger datasets that the kernel may die, something I also experience on my local machine. Sometimes it comes back and sometimes not. Even the Tree panel won't react when I try to terminate an errant kernel: "restart" does not work and the server itself seems to die, so the tree view won't respond or refresh. On my local machine I just kill the terminal instance and start over.
What is the "proper" way to restart everything?
FWIW the instance seems pegged at 150% cpu utilization atm
Related: is there any way to allow long running stuff to work?
I am trying to use a report generator (pandas-profiling) on a 2 million record dataset. It works on my local machine.
Found it here: https://cloud.google.com/datalab/getting-started
FWIW, these commands can also be used in the new command-line shell on the Cloud Console page (see https://cloud.google.com/shell/docs/), without the SDK on your machine. You need to modify the commands slightly since you will already be logged into your project.
Stopping/starting VM instances
You may want to stop a Cloud Datalab managed VM instance to avoid incurring ongoing charges. To stop a Cloud Datalab managed machine instance, go to a command prompt, and run:
$ gcloud auth login
$ gcloud config set project <YOUR PROJECT ID>
$ gcloud preview app versions stop main
After confirming that you want to continue, wait for the command to complete, and make sure that the output indicates that the version has stopped. If you used a non-default instance name when deploying, please use that name instead of "main" in the stop command, above (and in the start command, below).
For restarting a stopped instance, run:
$ gcloud auth login
$ gcloud config set project <YOUR PROJECT ID>
$ gcloud preview app versions start main

have R halt the EC2 machine it's running on

I have a few workflows where I would like R to halt the Linux machine it's running on after completion of a script. I can think of two similar ways to do this:
run R as root and then call system("halt")
run R from a root shell script (could run the R script as any user) then have the shell script run halt after the R bit completes.
Are there other easy ways of doing this?
The use case here is for scripts running on AWS where I would like the instance to stop after script completion so that I don't get charged for machine time after the job has run. The instance I use for data analysis is an EBS-backed instance, so I don't want to terminate it, simply suspend it. Issuing a halt command from inside the instance has the same effect as a stop/suspend from the AWS console.
I'm impressed that works. (For anyone else surprised that an instance can stop itself, see notes 1 & 2.)
You can also try "sudo halt", as you wouldn't need to run as a root user, as long as the user account running R is capable of running sudo. This is pretty common on a lot of AMIs on EC2.
Be careful about assuming that R will exit cleanly: believe it or not, one can crash R. If the halt command is issued from inside R and R crashes, the call to halt is never reached. Calling it from a wrapper script can be dangerous, too. It may be better to have a separate script that watches the R PID and terminates the instance once that PID is no longer active. If you know Linux well, what you're looking for is the PID from starting R, which you can pass to another script that checks ps, say every second, and then terminates the instance once the PID is no longer running.
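As a rough illustration of the PID-watching idea, here is a minimal Python sketch for Linux. The function names are made up for this example, and halt_machine assumes the account running the watcher may run "sudo halt" without a password prompt.

```python
import os
import subprocess
import time


def pid_is_running(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks the PID
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_then(pid, on_exit, poll_seconds=1.0):
    """Poll until `pid` is gone, then call `on_exit` once."""
    while pid_is_running(pid):
        time.sleep(poll_seconds)
    return on_exit()


def halt_machine():
    """Stop the machine; assumes passwordless `sudo halt` is permitted."""
    subprocess.run(["sudo", "halt"], check=True)
```

A watcher would then call wait_then(r_pid, halt_machine), where r_pid is the PID recorded when R was started. Run the watcher as a separate process rather than as the parent of R: an exited child that its parent has not yet reaped still shows up as an existing PID.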
I think a better solution is to use the EC2 API tools (see: http://docs.amazonwebservices.com/AWSEC2/latest/APIReference/ for documentation) to terminate OR stop instances. There's a difference between the two of these, and it matters if your instance is EBS backed or S3 backed. You needn't run as root in order to terminate the instance - the fact that you have the private key and certificate shows Amazon that you're the BOSS, way above the hoi polloi who merely have root access on your instance.
Because these credentials can be used for mischief, be careful about running the API tools from a given server: you'll need your certificate and private key on that server, which is a bad idea in the event that you have a security problem. It would be better to send a message to a master server and have it shut down the instance. If you have messaging set up in any way between instances, this can do all the work for you.
Note 1: Eric Hammond reports that the halt will only suspend an EBS instance, so you still have storage fees. If you happen to start a lot of such instances, this can clutter things up. Your original question seems unclear about whether you mean to terminate or stop an instance. He has other good advice on this page
Note 2: A short thread on the EC2 developers forum gives advice for Linux & Windows users.
Note 3: EBS instances are billed for partial hours, even when restarted. (See this thread from the developer forum.) Having an auto-suspend close to the hour mark can be useful, assuming the R process isn't working, in case one might re-task that instance (i.e. to save on not restarting). Other useful tools to consider: setTimeLimit and setSessionTimeLimit, and various checkpointing tools (I have a Q that mentions a couple). Using an auto-kill is useful if one has potentially badly behaved code.
Note 4: I recently learned of the shutdown command in package fun. This is multi-platform. See this blog post for commentary, and code is here. Dangerous stuff, but it could be useful if you want to adapt to Windows. I haven't tried it, though.
Update 1. Three more ideas:
You could use .Last() and runLast = TRUE for q() and quit(), which could shut down the instance.
If using littler or a script that invokes the script via Rscript, the same command line functions could be used.
My favorite package of today, tcltk2, has a neat timer mechanism called tclTaskSchedule() that can be used to schedule the execution of an expression. You could then go crazy with the execution of stuff just before an hourly interval has elapsed.
system("echo 'rootpassword' | sudo -S halt")
However, the downside is having your password in plain text in the script. (The -S flag is what makes sudo read the password from standard input instead of the terminal.)
AFAIK those ways you mentioned are the only ones. In any case the script will have to run as root to be able to shut down the machine (if you find a way to do it without root that's possibly an exploit). You ask for an easier way but system("halt") is just an additional line at the end of your script.
sudo is an option -- it allows you to run certain commands without prompting for any password. Just put something like this in /etc/sudoers
<username> ALL=(ALL) PASSWD: ALL, NOPASSWD: /sbin/halt
(of course replacing <username> with the name of the user running R) and system('sudo halt') should just work.

Resources