I'm a beginner with sparklyr. When I try to use the copy_to() function I get an error; how can I fix it?
My code:
library(sparklyr)
library(babynames)
sc <- spark_connect(master = "local", version = "2.0.1")
babynames_tbl <- copy_to(sc, babynames, "babynames")
Error:
Error: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost):
java.lang.NullPointerException
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
  at org.apache.hadoop.util.Shell.run(Shell.java:455)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
  at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
  at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
  at org.apache.spark.util.Utils$.fetchFile(Utils.scala:407)
  at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:430)
  at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:422)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike
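No answer is recorded for this post. A common cause of this particular trace (a NullPointerException thrown from java.lang.ProcessBuilder.start via org.apache.hadoop.util.Shell and FileUtil.chmod) is running Spark locally on Windows without winutils.exe / HADOOP_HOME configured. The sketch below is a hedged suggestion rather than a confirmed fix for this exact setup, and the HADOOP_HOME path is a placeholder:
# Hedged sketch (Windows only): point HADOOP_HOME at a directory containing bin/winutils.exe,
# make sure the matching local Spark build is installed, then retry the copy.
Sys.setenv(HADOOP_HOME = "C:/hadoop")   # placeholder path; must contain bin/winutils.exe

library(sparklyr)
library(babynames)

spark_install(version = "2.0.1")        # ensure a complete local Spark 2.0.1 installation
sc <- spark_connect(master = "local", version = "2.0.1")
babynames_tbl <- copy_to(sc, babynames, "babynames", overwrite = TRUE)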
Related
I am trying to run a function containing a for loop in parallel using spark_apply in Databricks on Azure.
My function is:
distribution <- function(sims){
  for (p in 1:100){
    increment_value <- list()
    profiles <- list()
    samples <- list()
    sample_num <- list()
    for (i in 1:length(samp_seq)){
      w <- sample(sims, size = batch)
      z <- sum(w)
      name3 <- as.character(z)
      samples[[name3]] <- data.frame(value = z)
    }
  }
}
When I pass the function to spark_apply like this:
sdf_len(sc,1) %>%
spark_apply(distribution)
I get the following error:
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 305.0 failed 4 times, most recent failure: Lost task 0.3 in stage 305.0 (TID 297, 10.139.64.6, executor 0): java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
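No answer is recorded for this post, but a status-255 worker failure is commonly the closure problem: spark_apply() ships distribution to the workers, but not the global objects it references (samp_seq, batch, and the sims data), and the function also never returns the data frame that spark_apply() expects. The following is only a hedged sketch of one way to restructure it, assuming sims, samp_seq and batch exist on the driver:
# Hedged sketch: pass driver-side objects explicitly via `context` and return a data frame
library(sparklyr)
library(dplyr)

distribution <- function(df, ctx) {
  # df is the (unused) partition from sdf_len(); ctx is the list supplied below
  rows <- lapply(seq_along(ctx$samp_seq), function(i) {
    w <- sample(ctx$sims, size = ctx$batch)
    data.frame(value = sum(w))
  })
  do.call(rbind, rows)
}

sdf_len(sc, 1) %>%
  spark_apply(distribution,
              context = list(sims = sims, samp_seq = samp_seq, batch = batch))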
I'm trying to install addmtoolbox in RStudio. Everything is fine, but at the end I get this error:
Error: Vignette re-building failed.
Execution halted
Error: Failed to install 'addmtoolbox' from GitHub:
System command 'Rcmd.exe' failed, exit status: 1, stdout + stderr (last 10 lines):
E> Quitting from lines 80-84 (addmtoolbox_modelfit_walkthrough.Rmd)
E> Error: processing vignette 'addmtoolbox_modelfit_walkthrough.Rmd' failed with diagnostics:
E> x is not a data.table
E> --- failed re-building 'addmtoolbox_modelfit_walkthrough.Rmd'
E>
E> SUMMARY: processing the following file failed:
E> 'addmtoolbox_modelfit_walkthrough.Rmd'
E>
E> Error: Vignette re-building failed.
E> Execution halted
I also tried installing the package from a zip file. The installation succeeds, but when I run the addm_preprocess function,
I get the same error:
> my.dat = addm_preprocess(choice.dat = addm_data_choice,
+ eye.dat = addm_data_eye,
+ timestep = 10,
+ rtbinsize = 100)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
x is not a data.table
Would you please help me? Thanks.
addmtoolbox:
https://rdrr.io/github/AlexanderFengler/addmtoolbox/
EDIT:
I found the code that produces the error:
eye$fixdur = timestep * round(eye$fixdur/timestep)
rts = eye %>% group_by(id) %>% summarize(rt = sum(fixdur))
setkey(rts,id)
choice = choice %>% select(-rt)
choice = choice[rts]
setkey(rts, id) returns the error:
> rts
# A tibble: 1 x 2
id rt
* <dbl> <dbl>
1 0 0
> setkey(rts,id)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
x is not a data.table
Solution (by hdkrgr):
It looks like you're providing a tibble where a data.table is expected.
You should be able to convert the object with as.data.table.
Alternatively, you could also use dtplyr for unified tibble/data.table
objects: github.com/tidyverse/dtplyr
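A minimal sketch of that suggestion, assuming rts is the tibble printed above:
library(data.table)
rts <- as.data.table(rts)   # convert the tibble to a data.table
setkey(rts, id)             # setkey() now succeeds because rts is a data.table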
I have a problem accessing a SparkDataFrame inside a spark.lapply function. The code is as follows:
df <- data.frame(x = c(1,2), y = c("a", "b"))
Sys.setenv(SPARK_HOME = "path/spark-2.0.0-bin-hadoop2.7/")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "spark://host:7077",
appName = "SparkR",
sparkHome = Sys.getenv("SPARK_HOME"),
sparkConfig = list(spark.driver.memory = "2g"),
enableHiveSupport = TRUE)
spark_df <- as.DataFrame(df)
fun_to_distribute <- function(i){
  data <- take(spark_df, 1)$x
  return(data + i)
}
spark.lapply(1:2, fun_to_distribute)
sparkR.session.stop()
Unfortunately, I always receive an error:
[Stage 1:> (0 + 2) / 2]17/04/28 01:57:56 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 173.38.82.173): org.apache.spark.SparkException: R computation failed with
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/28 01:57:56 WARN TaskSetManager: Lost task 1.1 in stage 1.0 (TID 4, 173.38.82.175): org.apache.spark.SparkException: R computation failed with
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/28 01:57:56 WARN TaskSetManager: Lost task 1.3 in stage 1.0 (TID 6, 173.38.82.175): org.apache.spark.SparkException: R computation failed with
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/28 01:57:56 ERROR TaskSetManager: Task 1 in stage 1.0 failed 4 times; aborting job
17/04/28 01:57:56 ERROR RBackendHandler: collect on 19 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 6, 173.38.82.175): org.apache.spark.SparkException: R computation failed with
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
Error in callJMethod(x@sdf, "limit", as.integer(num)) :
Invalid jobj 14. If SparkR was restarted, Spark operations need to be re-executed.
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.s
Of course, I could pass the data to the function as a complex argument such as a list, but I would rather upload the data to the Spark cluster once and let each executor access it at runtime.
Nesting of distributed contexts is not permitted in Apache Spark, and you cannot access distributed data structures from within a task.
Furthermore, a SparkDataFrame is not intended for cases where you need single-item access, which seems to be the desired behavior here. If you want to pass arguments to the function, you should do it directly using standard R objects.
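A minimal sketch of that advice, assuming the data fits on the driver: keep it as an ordinary R object and let spark.lapply() ship it inside the function's closure, instead of calling take() on a SparkDataFrame within the task.
local_x <- df$x                       # plain R vector on the driver, not a SparkDataFrame

fun_to_distribute <- function(i) {
  local_x[1] + i                      # uses an ordinary R object captured in the closure
}

spark.lapply(1:2, fun_to_distribute)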
I get a bounds error when running in parallel a function that runs fine sequentially, e.g. when I run:
parallelsol = @time pmap(i -> findividual(x,y,z), 1:50)
It gives me an error:
exception on 2: exception on exception on 16: 20exception on 5: : ERROR: BoundsError()
in getindex at array.jl:246 (repeats 2 times)
But when I run:
parallelsol = @time map(i -> findividual(prodexcint,firstrun,q,r,unempinc,VUnempperm,Indunempperm,i,VUnemp,poachedwagevec, mw,k,Vp,Vnp,reswage), 1:50)
It runs fine. Any ideas as to why this might be happening?
I am new to Spark as well as SparkR. I have successfully installed Spark and SparkR.
When I tried to build a logistic regression model with R and Spark over a CSV file stored in HDFS, I got the error "incorrect number of dimensions".
My code is:
points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/Henry/data.csv"), readPartition))
collect(points)
w <- runif(n=D, min = -1, max = 1)
cat("Initial w: ", w, "\n")
# Compute logistic regression gradient for a matrix of data points
gradient <- function(partition) {
  partition = partition[[1]]
  Y <- partition[, 1]   # point labels (first column of input file)
  X <- partition[, -1]  # point coordinates
  # For each point (x, y), compute gradient function
  dot <- X %*% w
  logit <- 1 / (1 + exp(-Y * dot))
  grad <- t(X) %*% ((logit - 1) * Y)
  list(grad)
}

for (i in 1:iterations) {
  cat("On iteration ", i, "\n")
  w <- w - reduce(lapplyPartition(points, gradient), "+")
}
The error message is:
On iteration 1
Error in partition[, 1] : incorrect number of dimensions
Calls: do.call ... func -> FUN -> FUN -> Reduce -> <Anonymous> -> FUN -> FUN
Execution halted
14/09/27 01:38:13 ERROR Executor: Exception in task 0.0 in stage 181.0 (TID 189)
java.lang.NullPointerException
at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 WARN TaskSetManager: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException:
edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 ERROR TaskSetManager: Task 0 in stage 181.0 failed 1 times; aborting job
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 181.0 failed 1 times, most recent failure: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException: edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:701) Driver stacktrace:
Dimensions of the data (sample):
data <- read.csv("/home/Henry/data.csv")
dim(data)
[1] 17 541
What could be the possible reason for this error?
The problem is that textFile() reads text data and returns a distributed collection of strings, each of which corresponds to a line of the text file. Therefore, later in the program, partition[, -1] fails. The program's real intent seems to be to treat points as a distributed collection of data frames. We are working on providing data frame support in SparkR soon (SPARKR-1).
To resolve the issue, simply manipulate your partition using string operations to extract X and Y correctly. Another option (which I think you've probably seen before) is to produce a different type of distributed collection from the beginning, as is done in examples/logistic_regression.R.
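A hedged sketch of the string-parsing approach, assuming each line of data.csv is a comma-separated numeric row with the label in the first column (this readPartition is an illustrative reimplementation of the asker's helper, not the original):
readPartition <- function(part) {
  # part is a list of character lines; split each line on commas and
  # stack the results into a numeric matrix
  mat <- do.call(rbind, lapply(part, function(line) {
    as.numeric(strsplit(line, ",")[[1]])
  }))
  list(mat)   # keep the list-of-one-matrix shape that gradient() unwraps via partition[[1]]
}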