Heartbeat print when running an R script via Oozie

I was trying to run an R script on Oozie.
Though the R script was triggered, the Oozie job kept printing Heartbeat in the stdout file without proceeding any further into the R code.
What may have caused this? And how can I avoid it?
Edit :
The script was supposed to read data from HDFS, and a series of R scripts were then supposed to operate on that data to generate a JSON output.
The script was triggered by the Oozie workflow from the command line.
This Oozie job is further planned to be exposed as an API.
I have tried changing the scheduler to the fair scheduler as mentioned in this post, but it still didn't work.

Related

R: Learning How to Use Task Scheduler

I am working with the R programming language.
I am trying to learn how to schedule tasks using the Task Scheduler. I found this post, Scheduling R Script, and I am trying to follow the instructions from the answer provided by user "petermeissner":
Step 1:
I opened a notepad document and wrote a small R program that I want the task scheduler to run. In this program, I want to generate 100 random numbers, save these random numbers as an RDS file - and then repeat/overwrite many times:
# R program to run: draw 100 random values (mean 100, sd 100)
# and overwrite a.RDS on each scheduled run
a <- rnorm(100, 100, 100)
saveRDS(a, "a.RDS")
Then, I saved this file as "myscript.exe"
Step 2:
I went to the "start menu" and typed in "Task Scheduler". I then clicked "Action" and "Create Task". Here, I created a new task and attached the ".exe" file.
Step 3: Now, I added the details about this task - for example, I want this task to run every 5 minutes. I went to the "Triggers" tab and entered the corresponding schedule information.
My Problem: I waited for 1 hour and noticed that my task did not run even once. Furthermore, when I looked at "All Running Tasks", the task I created was not even listed!
Can someone please tell me what I am doing wrong and what I can do to fix this?
Thanks!
In a text editor create a batch file: "MyRscript.bat"
The batch file will contain one line:
"C:\Program Files\R\R-4.2.2\bin\Rscript.exe" C:\rscripts\myscript.R
Note that the path to Rscript.exe must be quoted because it contains spaces; ensure that the paths to Rscript.exe and to your script are correct.
Now in task scheduler, schedule running "MyRscript.bat" at the desired time and frequency.
The advantage of creating the .bat file is that one can edit it after upgrading R or changing the script file, without the hassle of working in the Task Scheduler.
It is also good practice to define your working directory inside the script.
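A minimal sketch of what the top of myscript.R could look like, assuming the script and its output live in C:\rscripts (an assumed location; adjust to your setup):

# pin the working directory so relative paths like "a.RDS" resolve
# predictably when the script is launched by Task Scheduler
setwd("C:/rscripts")        # assumed folder; change to your own

a <- rnorm(100, 100, 100)   # 100 random values, mean 100, sd 100
saveRDS(a, "a.RDS")         # now always written to C:/rscripts/a.RDS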
See this question for more information: Scheduling R Script

Editing Oozie Workflow after submitting through Command Line

After running an Oozie workflow using the command line, I am unable to edit it using the Hue Workflow editor or even find it in the list of workflows.
I have an R script that generates the workflow.xml and job.properties, and then runs the commands necessary to submit the job:
workflow.path <- "workflow.xml" # Hard coded for the sake of this example
system2("hadoop", args = c("fs -put -f ", workflow.path, "/User/service/Test/" ))
system("oozie job --oozie http://localhost:11000/oozie -config job.properties -run")
Moving the workflow into HDFS works fine, and I have verified with Oozie that it is a valid workflow. Running the job also works like a charm; however, if I open up Hue and navigate to the workflow, I cannot edit it, only rerun it.
Some background on what I am trying to do: we have a large number of automated workflows and we are always adding more. They all follow the same pattern, so automating the creation of the coordinator and workflows is simple. Sometimes these workflows have to be modified by people, and they need to be able to use the web interface for that.
Any help would be appreciated.
Indeed, only workflows created via the Drag&Drop Editor can be edited.
Workflows submitted via the CLI can only be visualized.

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if I could somehow have this script run automatically at a pre-defined time, and a buddy of mine said AWS is capable of doing this.
Is this true? If so, what is the specific feature / aspect of AWS that can do this, so that I can look into it further?
Thanks!
Two options come to mind:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: use this AMI.
To execute the script, R offers a command-line interface, Rscript. See e.g. here for how to set this up; a small sketch follows.
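One possible way to register that cron entry from R itself is the cronR package; this is only a sketch, assuming cronR is installed on the instance and the script sits at /home/ec2-user/scrape.R (a placeholder path):

library(cronR)   # assumption: cronR is available on the EC2 instance

# build the shell command that runs the script through the Rscript CLI
cmd <- cron_rscript("/home/ec2-user/scrape.R")   # placeholder path

# register a daily cron entry; the time of day is just an example
cron_add(cmd, frequency = "daily", at = "07:00", id = "daily-scrape")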
Go serverless: AWS Lambda is a hosted microservice. Currently R is not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically, you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your function outputs CSV files. To save them properly you will need to put them in a file storage service like AWS S3. You can do this in R via the aws.s3 package. Another option would be to use the AWS SDK for Python, which is preinstalled in the Lambda function. You could, e.g., write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function.
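A rough sketch of that /tmp/-then-S3 pattern in R, assuming the aws.s3 package is installed and AWS credentials are already configured; the bucket and object names are placeholders:

library(aws.s3)   # assumption: aws.s3 is installed and credentials are set

# stand-in for the data the scraper produces
results <- data.frame(site = c("example-a", "example-b"), value = c(1, 2))

# write to the writable /tmp/ directory first...
write.csv(results, "/tmp/results.csv", row.names = FALSE)

# ...then push the file to S3 ("my-scrape-bucket" is a placeholder)
put_object(file = "/tmp/results.csv",
           object = "results/results.csv",
           bucket = "my-scrape-bucket")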
IMHO the first option is easier to set up, but the second one is more robust.
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run things periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are, potentially, quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set the frequency of the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running Windows, one of the easier solutions is to write a .bat script that runs your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Note the quotes around the path to Rscript.exe (it contains spaces). Just verify that the paths to the Rscript executable and to your R script are correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it is spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

Is it possible to run a Unix script using Oozie outside the Hadoop cluster?

We have written a Unix batch script, and it is hosted on a Unix server outside the Hadoop cluster. Is it possible to run that script via Oozie?
If so, how can this be achieved?
What is the script doing? If the script just needs to run regularly, you could just as well use a cron job or something like that.
Besides this, Oozie has an SSH action for running commands on remote hosts:
https://oozie.apache.org/docs/3.2.0-incubating/DG_SshActionExtension.html
Maybe you can work something out with that by logging into the remote host, running the script, waiting for completion, and continuing from there.

Oozie reading and writing in HDFS as the mapred user

I am running a Python script in an Oozie workflow. The Python script reads a file from HDFS, manipulates it, and writes it back to HDFS in a new folder. I am not getting any error while running the Oozie workflow, but the manipulated data is not written to HDFS. I do see that the new folder is, by default, owned by the mapred user. I am not sure whether this is related to the mapred user. I am running the Oozie workflow as the hdfs user. When run from a shell script, the Python script runs successfully and gives the expected result.
Any help would be appreciated.
Thanks!
Oozie would be running your script as the hdfs user.
Try logging in as hdfs and then browsing HDFS to view your output folder.
