I came from R and I am trying to use scala to explore the possibilities to do data science. I don't have any background in programming or computer science, my background is pretty much statistical. So far I am only using scala from the REPL, which I like because it reminds my of the R console.
I am encountering problems when I am trying to import new libraries. In R, within the R console, I would just type
library(tidyverse)
In scala I am trying to do something similar, however it doesn't really work. Here what I see:
Welcome to Scala 2.12.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172).
Type in expressions for evaluation. Or try :help.
scala> import org.apache.spark.mllib.linalg.vectors
<console>:11: error: object apache is not a member of package org
import org.apache.spark.mllib.linalg.vectors
^
What am I doing wrong?
Thanks
Apache Spark is not a simple package that you can import from the standard Scala library, but rather somewhat of an ecosystem on its own, consisting of JARs with Java/Scala API's, cluster managers, distributed file systems, various launcher scripts and interactive shells (for Scala, but e.g. also for Python).
It's not a single interactive script that you run on your computer. It's rather a complex conglomerate of cooperating programs running on a cluster.
You have several options:
Use SBT: declare spark as a dependency in build.sbt, run it in standalone-mode from the SBT console or as properly built project, with run
Essentially same as 1., but use Ammonite with $ivy imports for managing dependencies.
Just go to the Spark website and follow installation instructions there. Among many other things, it should sooner or later give you a script that starts an interactive Scala REPL with all the dependencies that are needed to run Spark jobs.
I'd suggest to go right to step 3. and download Spark from here.
Related
Is it possible to use R and Python using GraalVM / Truffle in a Java program on a standard OpenJDK / OracleJDK? I am able to run Javascript by simply including org.graalvm.js and org.graalvm.truffle in Maven, but I cannot find packages for R or Python to do the same.
Ideally the application would always run in GrallVM and it wouldn't be an issue, however I do not have any control over which actual VM my customers use for my application. Although I may recommend GraalVM, my customers might have valid reasons for using alternate JVMs since they will generally have to perform their own internal security audits and validations and would therefore accept only the specific VMs that they've certified for use in their environments which may or may not include GraalVM.
In short: GraalVM R and GraalVM Python, unlike GraalVM JavaScript, need more than just a jar file to run: the standard library and in case of R also native libraries. Python can do without native libraries, but using them improves compatibility.
You could bundle all that and configure the right options for your stock JDK, but for the time being this is not a scenario that is actively supported by those projects. Not saying it wouldn't work.
Details: https://github.com/oracle/graalpython/issues/96
I've created a program that runs in R that I plan on distributing among a lot of other people. Currently the R script is ran completely automatically and behind the scenes with one .sh script which is exactly how it is intended to be. I'm trying to make it so theres no need for client intervention. The R script itself loads the packages and installs them if they aren't present which takes away the task of them installing the packages themselves.
Is there a way I can provide a folder within my Application's folder that they already download that contains R-script and its dependencies so the code can use that location of Rscript to compile and run the R-program I have created. The goal is to be able to download it and run without the need of internet connection to download R and maybe even the programs required packages if possible.
Any help or ideas is appreciated.
I assume that process you want called "creating binary package". Binary is programs (like EXE files) which can run directly on target CPU without any interpreter software (like Python interpreter for python scripts, or Java VM for java applications). I'm not so familiar with packaging of R programs but I found some materials regarding this issue:
1 - Building binary package R
2 - https://seandavi.github.io/post/build-linux-r-binary-packages/
3 - https://support.rstudio.com/hc/en-us/articles/200486508-Building-Testing-and-Distributing-Packages
Second link assumes Linux as target system. Opposite to interpreted languages, binary files often OS dependent (Linux, Windows, or Mac). I, personally, don't know how compatible are packages between Linux systems with different library sets.
Please comment if you find some information misleading, I'll correct the answer.
We recently decided to make Julia Language available on our cluster systems. The cluster system is not able to connect to the internet.
Is there any way to download all Julia packages and make them available for our different users to install and use them offline?
Another option that we have is a system that can connect to the internet temporarily, but it is always connected to the main cluster system. Is there any way to use this system as a mirror for the Julia packages or not?
We want to use "Julia 1.0.1".
our cluster operation system is: "CentOS 5.5
notes: I have seen the question asked before here, but it is for Julia 0.6 and a single package that will be copied by hand. I want that user uses the Pkg.add <pkgName> command but instead of the internet, the package manager gets the packages from our offline system.
Thank you for your help and time.
Caution:
Side effects are not known!
May please be tested properly before put into production!
a) Collect the required packages along with their dependent packages in compiled form, put them in folder, stdlib (for example: /opt/julia/julia-1.1.0/shared/julia/stdlib/v1.1/)
b) add stdlib path to environment variables, JULIA_DEPOT_PATH and JULIA_LOAD_PATH
The following is a crosspost of https://stackoverflow.com/a/74800608/18431399
PackageCompiler.jl seems like the best tool for using modern Julia (v1.8) on secure systems. The following approach requires a build server with the same architecture as the deployment server, something your institution probably already uses for developing containers, etc.
Build a sysimage with PackageCompiler's create_sysimage()
Upload the build (sysimage and depot) along with the Julia binaries to the secure system
Alias a script to julia, similar to the following example:
#!/bin/bash
set -Eeu -o pipefail
unset JULIA_LOAD_PATH
export JULIA_PROJECT=/Path/To/Project
export JULIA_DEPOT_PATH=/Path/To/Depot
export JULIA_PKG_OFFLINE=true
/Path/To/julia -J/Path/To/sysimage.so "$#"
I've been able to run a research pipeline on my institution's secure system, for which there is a public version of the approach.
I'm developing a new R package to release to CRAN and would like to invoke the system() command directly within its source code. For example, I would like to use the gzip utility directly within my R package:
write.csv(mydat, "mydat.csv")
system("gzip mydat.csv", wait=FALSE)
Even more importantly, I would like to leverage other existing command-line utilities directly within my R package. And by command-line utilities, I mean actual large command-line software programs that are not trivial to rewrite in R.
So my question is: What are some best practices for specifying the usage of external (not R) command-line libraries during the development of an R package?
For example, the Imports and Depends fields in an R package DESCRIPTION file are only good for specifying the usage of existing R libraries within your R package. It would be a nuisance for users to have to manually install some existing non-R command-line library by using a package manager (e.g., brew), and this would go against best practices of self-contained work within an R Studio IDE. Besides, there is no guarantee that such a roundabout approach would work in a reproducible fashion, due to the difficulty of properly matching full paths to the command-line executable, coordinating with the R Studio IDE, etc.
Likewise, using tools such as https://cran.r-project.org/web/packages/ssh.utils/index.html will only serve basic command-line needs within the R environment, and hence does not apply to the needs of using large command-line software programs.
Note: The R package that I'm developing is not for personal use. It is intended for public release to CRAN and, hence, should comply with their checks. However, I could not find any specification from CRAN regarding the use of the system() command, particularly in the context of leveraging actual large command-line software programs that are not trivial to rewrite in R.
I would like to use the gzip utility directly within my R package
That is a code smell. Your package then needs to determine by means of configure (or similar) if such programs exist. So why bother? In this example, and on my box:
edd#don:~$ grep GZIP /etc/R/Renviron
R_GZIPCMD=${R_GZIPCMD-'/bin/gzip -n'}
edd#don:~$
You have access to it via most file-saving commands such as saveRDS(), the gzcon() and gzfile() functions and so on. See this older answer of mine.
For truly external programs you can rely on system(). See Christoph's seasonal package relying on our underlying x13binary binary package.
I've developed a .R script that works with a DB, does a bunch of processing and outputs graphs and tables. I can output that data as comma-separated values and pictures, to later import them on my software, that I have no issue.
The problem is how can I distribute my application without having to make a complete install of R on the client. I've seen things like RJava, but my app is on VB6 (yeah...) and I don't see any libraries, or ways to compile to exe. The compile package only makes compiled versions of any function you define, like what psyco used to do for Python (before Pypy).
Does anyone have some insight on compiling R to avoid having the user to install an entire additional software?
EDIT: Does an R compiler exist? This question relates deeply to mine, but I haven't seen how it can be used to make a full script an exe. You can just compile a main function and cat it to a file? Is that even possible?
The short answer is "no, that will not work".
There simply is no compiler that allows you to shrink-wrap your app. So your best best may be either
using the headless Rserve over the network, or
using the R (D)COM server used by RExcel et al