Running a hadoop job without creating a jar file

I have written a simple hadoop job. Now I want to run it without creating a jar file, unlike most tutorials found on the net.
I am calling it from a shell script on an Ubuntu machine running the Cloudera CDH4 distribution of Hadoop (2.0.0+91).
I can't create a jar file for the job because it depends on several third-party jars and configuration files which are centrally deployed on my machine and are not accessible at jar-creation time. Hence I am looking for a way to include these custom jar files and configuration files.
I also can't use the -libjars and DistributedCache options because they only affect the map/reduce phase, while my driver class also uses these jars and configuration files. My job uses several in-house utility classes which internally use these third-party libraries and configuration files, which I can only read from the centrally deployed location.
Here is how I am calling it from a shell script.
sudo -u hdfs hadoop x.y.z.MyJob /input /output
It shows me:
Caused by: java.lang.ClassNotFoundException: x.y.z.MyJob
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
My calling shell script sets the Hadoop classpath and includes all my required third-party libraries and configuration files from a centrally deployed location.
I am sure that my class x.y.z.MyJob and all required libraries and configuration files are present in both the $CLASSPATH and $HADOOP_CLASSPATH environment variables, which I set before calling the hadoop job.
Why is my program unable to find the class when I run the script?
Can't I run the job as a normal Java class? All my other normal Java programs use the same classpath and can always find the classes and configuration files without any problem.
Please let me know how I can access centrally deployed Hadoop job code and execute it.
EDIT: Here is my code to set the classpath
CLASSES_DIR=$BASE_DIR/classes/current
BIN_DIR=$BASE_DIR/bin/current
LIB_DIR=$BASE_DIR/lib/current
CONFIG_DIR=$BASE_DIR/config/current
DATA_DIR=$BASE_DIR/data/current
CLASSPATH=./
CLASSPATH=$CLASSPATH:$CLASSES_DIR
CLASSPATH=$CLASSPATH:$BIN_DIR
CLASSPATH=$CLASSPATH:$CONFIG_DIR
CLASSPATH=$CLASSPATH:$DATA_DIR
LIBPATH=`$BIN_DIR/lib.sh $LIB_DIR`
CLASSPATH=$CLASSPATH:$LIBPATH
export HADOOP_CLASSPATH=$CLASSPATH
lib.sh is a script that concatenates all the third-party jar paths into a colon-separated string, and CLASSES_DIR contains my job class x.y.z.MyJob.
All my configuration files are under CONFIG_DIR.
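For reference, lib.sh itself is not shown here; a minimal, hypothetical sketch of such a script (assuming it only needs to join every jar under the given directory into a colon-separated list) could be:
#!/bin/bash
# lib.sh (hypothetical sketch): print all jars under the given directory, colon-separated
LIB_DIR=$1
find "$LIB_DIR" -name '*.jar' | tr '\n' ':' | sed 's/:$//'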
When I print my CLASSPATH and HADOOP_CLASSPATH they show the correct values. However, whenever I call hadoop classpath just before executing the job, it shows me the following output:
$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:myname:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*
$
It obviously does not include any of the previously set $CLASSPATH and $HADOOP_CLASSPATH values. Where did these environment variables go?
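A quick way to confirm whether an exported HADOOP_CLASSPATH is reaching the hadoop wrapper at all is to add a recognizable marker entry and look for it in the output (a minimal sketch; the marker path is made up):
export HADOOP_CLASSPATH=/tmp/marker-entry:$HADOOP_CLASSPATH
hadoop classpath | tr ':' '\n' | grep marker-entry   # prints the entry only if the variable was inherited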

Inside my shell script I was running the hadoop jar command as Cloudera's hdfs user:
sudo -u hdfs hadoop jar x.y.z.MyJob /input /output
The script was actually being invoked by a regular Ubuntu user, who set the CLASSPATH and HADOOP_CLASSPATH variables as described above. At execution time, however, the hadoop jar command was not run as that same regular Ubuntu user: sudo -u hdfs starts the command in a fresh environment for the hdfs user, so the exported variables were not carried over, hence the exception indicating that the class was not found.
So you have to run the job as the same user who actually sets the CLASSPATH and HADOOP_CLASSPATH environment variables.
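In other words, either drop the sudo and launch the job as the user that exported the variables, or pass the classpath through sudo explicitly. A minimal sketch (assuming the class and its dependencies really are on the exported paths):
# option 1: run as the same user that exported CLASSPATH/HADOOP_CLASSPATH
hadoop x.y.z.MyJob /input /output
# option 2: keep running as hdfs, but hand the classpath through sudo explicitly
sudo -u hdfs env HADOOP_CLASSPATH="$HADOOP_CLASSPATH" hadoop x.y.z.MyJob /input /output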
Thanks all for your time.

Related

(Dagster) Schedule my_hourly_schedule was started from a location that can no longer be found

I'm getting the following Warning message when trying to start the dagster-daemon:
Schedule my_hourly_schedule was started from a location Scheduler that can no longer be found in the workspace, or has metadata that has changed since the schedule was started. You can turn off this schedule in the Dagit UI from the Status tab.
I'm trying to automate some pipelines with Dagster and created a new project using dagster new-project Scheduler, where "Scheduler" is my project.
This command, as expected, created a directory with some hello_world files. Inside it I put the dagster.yaml file with the configuration for a Postgres DB to which I want to write the logs. The whole thing looks like this:
However, whenever I run dagster-daemon run from the directory where the workspace.yaml file is located, I get the message above. I tried running the daemon from other folders, but then it complains that it can't find any workspace.yaml files.
I guess I'm running into a "beginner mistake", but could anyone help me with this?
I appreciate any counsel.
One thing to note is that the dagster.yaml file will not do anything unless you've set your DAGSTER_HOME environment variable to point at the directory that this file lives in.
That being said, I think what's going on here is that you don't have the Scheduler package installed into the Python environment that you're running dagster-daemon in.
To fix this, you can run pip install -e . in the Scheduler directory, although the README.md inside that directory has more specific instructions for working with virtualenvs.
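Putting both points together, a minimal sketch of the setup (the project path is an assumption; adjust it to wherever dagster.yaml and workspace.yaml actually live):
# point DAGSTER_HOME at the directory containing dagster.yaml (assumed location)
export DAGSTER_HOME=~/Scheduler
# install the generated Scheduler package into the environment that runs the daemon
cd ~/Scheduler
pip install -e .
# run the daemon from the directory containing workspace.yaml
dagster-daemon run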

Running a jar file in the background on a server after closing the PuTTY session

I have tried running a Spring Boot jar file using PuTTY, but the problem is that after I closed the PuTTY session the service stopped.
Then I tried starting the jar file with the following command, and it works fine:
nohup java -jar /web/server.jar
You should avoid using nohup, as it just disassociates your terminal from the process. Instead, use the following command to run your process as a service.
sudo ln -s /path/to/your-spring-boot-app.jar /etc/init.d/your-spring-boot-app
This command creates a symbolic link to your JAR file, which you can then run as a service using sudo service your-spring-boot-app start. This will write the console log to /var/log/your-spring-boot-app.log.
Moreover, you can configure spring-boot/application.properties to write console logs to a location of your choice using logging.path=path-to-your-log-directory or logging.file=path-to-your-log-file.txt. It may also be worth noting that logging.file takes priority over logging.path.
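A minimal end-to-end sketch of the above (paths and the app name are placeholders; note that the init.d symlink approach assumes the jar was built as an executable Spring Boot jar):
# register the jar as an init.d service and start it
sudo ln -s /path/to/your-spring-boot-app.jar /etc/init.d/your-spring-boot-app
sudo service your-spring-boot-app start
# optional, in application.properties: redirect the console log (logging.file wins over logging.path)
# logging.file=/var/log/your-spring-boot-app/app.log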

What is meant by "Rnobody" in an R script installation

I am currently working with a Java-based Jetty server setup which involves a servlet and HTML. The objective is to call an R script from Java.
In one of the custom configuration properties files inside WEB-INF/classes, I have encountered the following entry:
RScriptLocation=/usr/local/bin/Rnobody
This properties file is not related to the Jetty server; the developer created it to store constants.
I have installed R from the Cygwin setup, but I could not locate that particular executable; I only see /usr/bin/R.
What is Rnobody and how do I install it?
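For what it's worth, Rnobody is not part of a standard R installation, so it is most likely a site-specific wrapper around R/Rscript. A minimal check of what is actually available on the machine (and a purely hypothetical fallback for the property, assuming it only needs the path of an Rscript-compatible executable):
which R Rscript                    # list the R executables that do exist
# hypothetical fallback in the properties file:
# RScriptLocation=/usr/bin/Rscript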

Dealing with dependencies in a SparkR job

How do I deal with dependencies in the case of an (interactive) SparkR job?
I know Java jobs can be submitted as a fat jar containing all the dependencies. For any other job, the --packages option can be specified on the spark-submit command. But I would like to connect from R (RStudio) using SparkR to my little cluster (this works pretty straightforwardly).
But I need some external packages, e.g. to connect to a database (Mongo, Cassandra) or to read a CSV file. In local mode I can easily specify these packages on launch, but this naturally does not work against the already running cluster.
https://github.com/andypetrella/spark-notebook provides a very convenient mode to load such external packages at runtime.
How can I similarly load Maven-coordinate packages into the Spark classpath, either at runtime from my SparkR (interactive) session or during image creation of the dockerized cluster?
You can also try configuring the two variables spark.driver.extraClassPath and spark.executor.extraClassPath in the SPARK_HOME/conf/spark-defaults.conf file and setting their values to the path of the jar file. Ensure that the same path exists on the worker nodes.
From No suitable driver found for jdbc in Spark
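A minimal sketch of what that could look like in SPARK_HOME/conf/spark-defaults.conf (the jar path is a placeholder and must exist on the driver and on every worker node):
spark.driver.extraClassPath    /opt/extra-jars/mongo-java-driver.jar
spark.executor.extraClassPath  /opt/extra-jars/mongo-java-driver.jar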

User in Unix not able to run hadoop command

I installed Hadoop, created a user named hduser, and changed the owner of the hadoop folder to hduser.
After installing Hadoop I tried to execute the hadoop command to check whether it was installed or not, but it gives "hadoop: command not found".
Then I gave hduser execute privileges on all the files inside the hadoop folder, including the bin folder.
But the output is still the same.
When I try the same hadoop command as the root user, it works fine.
I think this is a Unix-level issue. Please help me give my user the ability to execute the hadoop command.
One more thing: if I switch to root then the hadoop commands work fine.
It is not a problem of privileges. You can still execute hadoop if you type /usr/local/hadoop/bin/hadoop, right?
The problem is that $PATH is user-specific.
You have to add $HADOOP_HOME/bin to the $PATH as hduser, not as root. Log in as hduser first (or just type su hduser) and then export PATH=$PATH:$HADOOP_HOME/bin, as @iamkristian suggests, where $HADOOP_HOME is the directory in which you have placed Hadoop (usually /usr/local/hadoop).
It sounds like hadoop isn't in your path. You can test that with:
which hadoop
If that gives you "command not found", then you probably just need to add it to your path. Depending on where you installed Hadoop, you need to add this to your ~/.bashrc:
export PATH=$PATH:/usr/local/hadoop/bin/
And then reopen your terminal
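Combining both answers, a minimal sketch of the fix (assuming Hadoop lives in /usr/local/hadoop):
# as hduser (not root), append to ~/.bashrc:
echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc
source ~/.bashrc    # or reopen the terminal
which hadoop        # should now print /usr/local/hadoop/bin/hadoop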
