Apache spark-shell: error importing jars

I have a local Spark 1.5.2 (Hadoop 2.4) installation on Windows, as explained here.
I'm trying to import a jar file that I created in Java using Maven (the jar is jmatrw, which I uploaded to GitHub). Note that the jar does not contain a Spark program and has no dependencies on Spark. I tried the following steps, but none of them works in my installation:
I copied the library to "E:/installprogram/spark-1.5.2-bin-hadoop2.4/lib/jmatrw-v0.1-beta.jar"
I edited spark-env.sh and added SPARK_CLASSPATH="E:/installprogram/spark-1.5.2-bin-hadoop2.4/lib/jmatrw-v0.1-beta.jar"
In a command window I run > spark-shell --jars "E:/installprogram/spark-1.5.2-bin-hadoop2.4/lib/jmatrw-v0.1-beta.jar", but it says "Warning: skip remote jar"
In the spark shell I tried scala> sc.addJar("E:/installprogram/spark-1.5.2-bin-hadoop2.4/lib/jmatrw-v0.1-beta.jar"); it says "INFO: added jar ... with timestamp"
When I type scala> import it.prz.jmatrw.JMATData, spark-shell replies with error: not found: value it.
I spent a lot of time searching on Stack Overflow and Google; indeed, a similar Stack Overflow question is here, but I'm still not able to import my custom jar.
Thanks

There are two settings in 1.5.2 to reference an external jar. You can add it for the driver or for the executor(s).
I'm doing this by adding the settings to spark-defaults.conf, but you can also set them when launching spark-shell or in SparkConf.
spark.driver.extraClassPath /path/to/jar/*
spark.executor.extraClassPath /path/to/jar/*
I don't see anything really wrong with the way you are doing it, but you could try the conf approach above, or set them using SparkConf:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
// Make the jar visible to the driver's classloader
conf.set("spark.driver.extraClassPath", "/path/to/jar/*")
val sc = new SparkContext(conf)
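As a quick sanity check (just a sketch, using the it.prz.jmatrw.JMATData class name from the question), you can try loading the class reflectively from the spark-shell prompt; if it throws a ClassNotFoundException, the jar is not on the driver classpath:
// Run inside spark-shell: succeeds only if the jar is on the driver classpath.
Class.forName("it.prz.jmatrw.JMATData")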
In general, I haven't enjoyed working with Spark on Windows. Try to get onto Unix/Linux.

Related

Installing local .whl files on Databricks cluster

I am trying to connect to a Databricks cluster and install a local Python .whl using DatabricksSubmitRunOperator in Airflow (v2.3.2) with the following configuration. However, it doesn't work and throws a FileNotFound exception (I checked the file path multiple times; the file exists).
task1 = DatabricksSubmitRunOperator(
    task_id = <task_id>,
    job_name = <job_name>,
    existing_cluster_id = <cluster_id>,
    libraries = [
        {"whl": "file:/<local_absolute_path>"}
    ]
)
While the official documentation states that, for .whl files, only DBFS and S3 storage are supported, in Airflow I see the following error message when the file:/ prefix is not attached:
Library installation failed for library due to user error.
Error messages: Python wheels must be stored in dbfs, s3, adls, gs or as a local file. Make sure the URI begins with 'dbfs:', 'file:', 's3:', 'abfss:', 'gs:'
Is it possible to install local .whl files on a Databricks cluster?
An alternative approach I tried is to copy the .whl to DBFS storage and install it from there. The problem with that is that the installation status is stuck at "pending".
Any help is appreciated.
You can directly install or upload the .whl file from the cluster's Libraries tab in the Databricks UI.
or
follow the official documentation on installing .whl packages.

pyinstaller ImportError: C extension: No module named np_datetime not built

I am running a virtual environment with Python 2.7 for my program.
There seems to be a problem after creating the executable file on Windows.
I ran venv/Scripts/pyinstaller.exe -F main.py
Everything seems fine, but when I click on the created executable main.exe,
there is an error.
Tried and tested
I have re-installed pandas and pyinstaller
Added hook-pandas.py to the hooks folder in the environment.
hook-pandas
Ensured the environment is activated.
Checked that the program is running fine before building executable.
Re-created the environment.
Yet after all that, I am prompted with this ImportError when I run the executable file.
It is an extreme pain to debug this because the command prompt displaying the error will not pause but close almost immediately.
Similar issues
Looking for Suggestions
I am hoping for suggestions to troubleshoot Pyinstaller. Any resources to read up on would be nice.
Usually, I have no trouble with Python, as PyCharm has several handy debugging tools that help me identify the problem.
I ran into the same problem and found this thread, but I managed to solve it by borrowing from the reference you posted (about pandas._libs.tslibs.timedeltas), so thank you for that!
In that article, the module that resulted in the ImportError was, in fact, pandas._libs.tslibs.timedeltas, if you look at the poster's logs. But the error you and I ran into refers to np_datetime instead. So, from the traceback logs, I finally figured out that the code we have to write in hook-pandas.py should be the following:
hiddenimports = ['pandas._libs.tslibs.np_datetime']
Maybe that alone will solve your problem. However, in my case, once I solved the np_datetime issue, other very similar ImportError problems arose (also related to hiddenimports regarding pandas), so, in case you run into the same issues, just define hiddenimports as follows:
hiddenimports = ['pandas._libs.tslibs.np_datetime','pandas._libs.tslibs.nattype','pandas._libs.skiplist']
TL;DR:
You can first try to write
hiddenimports = ['pandas._libs.tslibs.np_datetime']
into hook-pandas.py. However, if for some reason you run into the exact same issues I did afterwards, try
hiddenimports = ['pandas._libs.tslibs.np_datetime','pandas._libs.tslibs.nattype','pandas._libs.skiplist']
If you wish to dive deeper (or run into a different pandas ImportError than the ones I did), this is the code in pandas's __init__.py referenced in your traceback log (lines 23 to 35):
from pandas.compat.numpy import *
try:
    from pandas._libs import (hashtable as _hashtable,
                              lib as _lib,
                              tslib as _tslib)
except ImportError as e:  # pragma: no cover
    # hack but overkill to use re
    module = str(e).replace('cannot import name ', '')
    raise ImportError("C extension: {0} not built. If you want to import "
                      "pandas from the source directory, you may need to run "
                      "'python setup.py build_ext --inplace --force' to build "
                      "the C extensions first.".format(module))
From that I went into the
C:\Python27\Lib\site-packages\pandas\_libs
and
C:\Python27\Lib\site-packages\pandas\_libs\tslibs
folders and found the exact names of the modules that resulted in the errors.
I hope that solves your problem as it did mine.
Cheers!

Multiple versions of Spark on CDH5.10: failing to launch spark-submit

I have installed Spark 2.0 on CDH5.10 by following the link https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html
After all the configuration, when I run spark2-submit --version it gives me the correct version, which is 2.0.
However, when I submit a Spark job, it first says
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
This clearly indicates that the Hadoop libs are not on the classpath. My question is: is something wrong with my installation of Spark 2? Also, once we add jars with sparkExtraLibClasspath for the driver and core, it then says SPARK_HADOOP_CONF is not set.
How can I verify that my installation is correct?
I am also trying to understand where my spark2 conf dirs are.
I saw a few previous questions on Stack Overflow, like https://community.cloudera.com/t5/Cloudera-Manager-Installation/CHD-5-7-spark-shell-java-lang-ClassNotFoundException-org-apache/td-p/42209 and "NoClassDefFoundError com.apache.hadoop.fs.FSDataInputStream when execute spark-shell", but these don't help.
I am using the spark2-shell and spark2-submit commands.
Some more investigation with https://community.cloudera.com/t5/Cloudera-Manager-Installation/CDH-5-5-pyspark-java-lang-NoClassDefFoundError-org-apache-hadoop/td-p/34424 suggests that if I can correctly set SPARK_EXTRA_LIB_PATH for spark2 then I can fix this issue. Can somebody guide me please? Thanks.
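One quick way to see what the Spark 2 shell actually has available (a minimal sketch, assuming spark2-shell starts at all) is to probe for the missing Hadoop class and print the relevant environment variables from the REPL:
// Run inside spark2-shell: throws ClassNotFoundException if the Hadoop
// client jars are not on the classpath.
Class.forName("org.apache.hadoop.fs.FSDataInputStream")
// Show which configuration directories the shell was launched with.
Seq("SPARK_HOME", "SPARK_CONF_DIR", "HADOOP_CONF_DIR").foreach { name =>
  println(name + "=" + sys.env.getOrElse(name, "<not set>"))
}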

SBT configuration failed to load in Typesafe Activator

I'm currently trying to start a play-slick application through the Typesafe Activator, but it fails to load the SBT configuration and I get this error:
/play-slick/build.sbt:30: error: reference to fork is ambiguous;
it is imported twice in the same scope by
import _root_.play.Project._
and import Keys._
fork in run := true
^
Type error in expression
Failed to load project.
Does this mean I have SBT downloaded twice and what can I do to resolve it? Thanks.
Just wanted to say that I came across this exact same issue when trying to use the Play-Slick example linked from the Play Tutorials page.
The solution to get it working seems to have indeed been to follow the suggestion in the GitHub link that Seth Tisue included in a comment above, where corruptmemory suggested removing the following line from build.sbt:
fork in run := true
In my case, this was enough to convince IntelliJ to open the project and let me tinker with it. (Just in case this is the first result for anyone else coming across this problem)
Just remove
fork in run := true
from build.sbt and run activator clean run from cmd.
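If you would rather keep the forked run than delete the line, one possible workaround (just a sketch, assuming the ambiguity sbt reports is limited to the fork key) is to qualify the key explicitly in build.sbt so the duplicate import no longer matters:
// Refer to sbt's fork key explicitly instead of the bare `fork`,
// which is imported by both Keys._ and play.Project._
Keys.fork in run := true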

Instantiating RInScala results in NoSuchMethodError

I'm trying to integrate R into Scala using JVMR. I am getting a NoSuchMethodError when attempting to instantiate RInScala.
I'm working on a Windows 7 machine with R installed under C:\Program Files\R\R-3.1.1 and Scala version 2.11.1 is installed under C:\Program Files (x86)\scala. I'm developing in IntelliJ with the Scala plugin and am using the Scala worksheet just to test this out as a POC. My Scala project does show JVMR 2.11-2.11.1.1.jar as an included library. The worksheet is very basic at present - just the import and the instantiation attempt.
import org.ddahl.jvmr.RInScala
val R = RInScala()
When running the worksheet in IntelliJ, I see the following output, so I can tell that it's successfully importing the class but can't instantiate it.
import org.ddahl.jvmr.RInScala
java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
at org.ddahl.jvmr.RInScala$.findROnWindows(RInScalaTest.sc2318647708135405919.tmp:804)
at org.ddahl.jvmr.RInScala$.defaultExecutable$lzycompute(RInScalaTest.sc2318647708135405919.tmp:822)
at org.ddahl.jvmr.RInScala$.defaultExecutable(RInScalaTest.sc2318647708135405919.tmp:821)
at org.ddahl.jvmr.RInScala.<init>(RInScalaTest.sc2318647708135405919.tmp:28)
at org.ddahl.jvmr.RInScala$.apply(RInScalaTest.sc2318647708135405919.tmp:838)
at com.xxxx.r_in_scala.A$A1$A$A1.R$lzycompute(RInScalaTest.sc2318647708135405919.tmp:2)
at com.xxxx.r_in_scala.A$A1$A$A1.R(RInScalaTest.sc2318647708135405919.tmp:2)
at com.xxxx.r_in_scala.A$A1$A$A1.get$$instance$$R(RInScalaTest.sc2318647708135405919.tmp:2)
at #worksheet#.#worksheet#(RInScalaTest.sc2318647708135405919.tmp:10)
I've drilled into the code for findROnWindows and my installation should be found based on the values of the registry keys that are being read. I'm sure I'm missing something simple, but am at that "I've been looking at the problem for too long without figuring it out and just need a new set of eyes" stage.
To my knowledge, this is currently some kind of bug in IntelliJ's worksheet implementation, or indeed a misconfiguration. To verify:
Put your code in an object:
package starter

object Test extends App {
  import org.ddahl.jvmr.RInScala

  val R = RInScala()
  println("works")
}
and try to run it as a Scala app and not from the worksheet.
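Another quick check you can run (a sketch only; scala.util.Properties is in the standard library) is to print which Scala runtime the worksheet or app is actually using, since scala.runtime.ObjectRef.create, the method reported missing, does not exist in Scala 2.10's runtime:
// Prints the Scala runtime version in use; a 2.10.x value here would
// explain why ObjectRef.create cannot be found at runtime.
println(scala.util.Properties.versionString)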
