Editing Oozie Workflow after submitting through Command Line - r

After running an oozie workflow using the command line I am unable to edit it using the Hue Workflow editor or even find it in the list of workflows.
I have an R script that generates the workflow.xml and job.properties and then runs the commands needed to submit the job:
workflow.path <- "workflow.xml" # Hard coded for the sake of this example
# Upload the workflow to HDFS; each command-line token is its own element of args
system2("hadoop", args = c("fs", "-put", "-f", workflow.path, "/User/service/Test/"))
# Submit and start the job using the local job.properties
system("oozie job --oozie http://localhost:11000/oozie -config job.properties -run")
Moving the workflow into HDFS works fine, and I have verified it is a valid workflow using oozie. Running the job also works like a charm. However, when I open Hue and find the workflow, I cannot edit it, only rerun it.
Some background on what I am trying to do: We have a large number of automated workflows and we are always adding more. They all follow the same pattern, so automating the creation of the coordinators and workflows is simple. Sometimes these workflows have to be modified by people, and they need to be able to use the web interface.
Any help would be appreciated.

Indeed, only workflows created via the Drag&Drop Editor can be edited.
Workflows submitted via the CLI can only be visualized.

Related

Best way to go from RStudio project development to scheduled R Script

I am developing processes for collecting, cleaning and storing various data sets. The development is done with RStudio projects. I won't say I'm following every tidyverse/RStudio workflow recommendation but in general I'm using that framework-- relevant now is that I'm using standard subdirectories and the here package for referencing them.
Every project has a MAIN.R script that ultimately sources the functions from the other scripts-- one only needs to run MAIN.R to execute the process. I did this not only for simplicity but also because the long-term intent is to have this be a scheduled process.
For now at least my method for scheduling R Scripts is with Windows Task Scheduler. Getting an R Script scheduled and running is not a problem. The issue is the contextual assumptions of developing within a project: source(here("CODE", "some-file.R")) fails when I run MAIN.R outside of the scope of the project.
One obvious solution would be to hard-code the project location as one of the parameters. I would need to have two different MAIN.R files, one for development that uses the project and one that uses that parameter for scheduling. I don't hate that idea, but I don't love it either, as it somewhat nullifies the whole point of the project/here approach. Is there a more elegant solution that someone else has created that I couldn't find on Google, or are there better workaround ideas?
I ended up using the solution described here: https://community.rstudio.com/t/how-to-play-nice-with-taskscheduler-r-studio-projects-and-here/24406/2 .
I didn't have to make any changes to the MAIN.R script. Instead, I scheduled it directly but entered the project directory in the "Start in" field of the Windows Task Scheduler task, so the script runs with the project root as its working directory.

Heartbeat print on running R script via Oozie

I was trying to run an R script on Oozie.
Though the R script was triggered, the Oozie job kept printing Heartbeat in the stdout file without proceeding any further into the R code.
What may have caused this, and how can I avoid it?
Edit :
The script was supposed to read data from HDFS and a series of R scripts were supposed to be operating on the data to generate a json output.
The script was triggered by the oozie workflow from command line.
This oozie job is further planned to be exposed as an API.
I have tried changing the scheduler to the fair scheduler as mentioned in this post, but it still didn't work.

Oozie editor in HUE doesn't show workflows I submitted by oozie command

I submitted a workflow and ran the job with the oozie command:
oozie job -oozie http://node1:11000/oozie -config job.properties -submit
oozie job -oozie http://node1:11000/oozie -start [job_id]
It worked well.
I then wanted to edit the workflow with the oozie editor in HUE but couldn't find it there. What should I do to make the workflow show up in the oozie editor?
CDH version: 5.9
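For anyone scripting the two-step submit/start shown above, here is a minimal sketch of how the job ID could be captured instead of being pasted in by hand. It assumes the submit command prints a line of the form "job: <id>"; the helper name extract_job_id and the sample ID are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read the submit output on stdin and print the job ID.
# Assumption: the CLI reports the new job as a line starting with "job: ".
extract_job_id() {
    sed -n 's/^job: *//p' | head -n 1
}

# Demonstration with a made-up ID in the assumed format
echo "job: 0000001-170101000000000-oozie-oozi-W" | extract_job_id

# Against a real server the two commands from the question would become:
#   job_id=$(oozie job -oozie http://node1:11000/oozie -config job.properties -submit | extract_job_id)
#   oozie job -oozie http://node1:11000/oozie -start "$job_id"
```

This only removes the copy-and-paste step; it does not change how Hue treats the workflow.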
I would not recommend using HUE for editing workflows. Because it relies on a set of metadata that it stores in its own database, it will not let you change a workflow that was submitted from outside of it.
Hue is a pretty nice tool for prototyping, for monitoring workflow/coordinator progress, for checking the logs, and for getting a first look when investigating an issue. You can import your XML to make it editable in the GUI (see http://blog.cloudera.com/blog/2013/03/how-to-import-a-pre-existing-oozie-workflow-into-hue/ ), but this is hard to manage and maintain in production-like environments, for example when running multiple instances of the workflow with various parameters.

Hue - Oozie Job failing - Unable to resolve parameters

I am using Hue to run my workflow which uses parameters. I would like the workflow to pickup parameter from job.properties file without prompting the user. I intend to generate/modify this job.properties before every run with new parameter values.
In my current setup, I have manually created a job.properties file in the same working directory as workflow.xml. I have not added the parameters to the Hive action, since doing so results in a prompt, but the Hive SQL uses the same parameters that are specified in the job.properties file.
When I run the Workflow it fails for being unable to resolve the parameters. I believe it is not picking up my job.properties file for some reason.
Any pointers would really help. I have been beating my head against this for almost 2 days now!
Are you using the Workflow Editor? At this time (Hue 3.7) job.properties is only picked up when submitting a workflow from File Browser.
Properties need to be entered as 'Oozie parameters' in the Properties section of the workflow. Would just doing this solve your problem?
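If you do go the job.properties route (submitting from File Browser or from the command line), regenerating the file before every run could look roughly like the sketch below. The function name, the run_date parameter, the nameNode value and the application path are all placeholders I made up; only oozie.wf.application.path is a standard Oozie property.

```shell
#!/bin/sh
# Hypothetical helper: write a fresh job.properties for one run.
# $1 = output file, $2 = value for the (made-up) run_date parameter.
write_job_properties() {
    out_file="$1"
    run_date="$2"
    # \${nameNode} is escaped so it reaches the file literally and is
    # resolved by Oozie; ${run_date} is expanded here by the shell.
    cat > "$out_file" <<EOF
nameNode=hdfs://localhost:8020
oozie.wf.application.path=\${nameNode}/user/example/workflow-dir
run_date=${run_date}
EOF
}
```

Calling write_job_properties job.properties 2024-01-31 before each submission would overwrite the file with that run's value, which the Hive script can then reference as a parameter.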

Can oozie control jobs outside of Hadoop?

From documentation, it isn't very clear whether oozie can schedule and control jobs outside of Hadoop? Can someone shed some light on this? If not, is there any open source based workflow engine which can do that?
Consider using Chronos (from Airbnb), an advanced version of cron with a UI, built on top of Mesos: airbnb.github.com/chronos/
Cheers.
I believe not. Oozie itself does not have a resource management policy; all it does is submit jobs to Hadoop's job tracker at the right time. Besides, for each Oozie workflow there is one launcher job, which is responsible for submitting the real jobs in the workflow to Hadoop. The launcher job is itself a Hadoop job. So, I think that for versions earlier than Oozie 3.2, the answer should be no.
You might consider trying Azkaban by LinkedIn. It was built specifically for Hadoop, but Unix commands can be specified in an Azkaban job file, so you can develop a workflow for any application(s) that can be run from the command line.
I've been working on a new workflow engine called Soop: https://github.com/radixCSgeek/soop . It is very lightweight and simple to set up and run using a cron-like syntax. It can run any Java POJO as well as shell processes, so you can kick off a bash script or whatever.
