For loop using spark_apply in Databricks - R

I am trying to run a function containing a for loop in parallel using spark_apply in Databricks on Azure.
My function is:
distribution <- function(sims){
  for (p in 1:100){
    increment_value <- list()
    profiles <- list()
    samples <- list()
    sample_num <- list()
    for (i in 1:length(samp_seq)){
      w <- sample(sims, size = batch)
      z <- sum(w)
      name3 <- as.character(z)
      samples[[name3]] <- data.frame(value = z)
    }
  }
}
When I pass the function to spark_apply like this:
sdf_len(sc, 1) %>%
  spark_apply(distribution)
I get the following error:
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 305.0 failed 4 times, most recent failure: Lost task 0.3 in stage 305.0 (TID 297, 10.139.64.6, executor 0): java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
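One likely culprit (an assumption, since the worker logs are not shown): spark_apply() runs the closure on the executors, so driver-side objects such as samp_seq and batch are not available there, and the closure has to return something coercible to a data frame. A minimal hedged sketch of a self-contained version, with the needed values passed in via the context argument (sims, samp_seq and batch below are illustrative stand-ins for the asker's objects):

library(sparklyr)
library(dplyr)

# Sketch only: everything the closure needs travels with it via `context`,
# and the closure returns a single data frame, as spark_apply() expects.
distribution <- function(df, ctx) {
  samples <- lapply(seq_along(ctx$samp_seq), function(i) {
    w <- sample(ctx$sims, size = ctx$batch)
    data.frame(value = sum(w))
  })
  do.call(rbind, samples)
}

sdf_len(sc, 1) %>%
  spark_apply(
    distribution,
    context = list(sims = runif(1000), samp_seq = 1:10, batch = 5)
  )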

Related

Reconnect if connection fails using R

I have written a small function but it does not work as expected.
I have a connection to a server, and sometimes the server is down so I cannot connect. The script runs in batch mode, so it has to be automated.
The script should run conn <- function(..) successfully (i.e. with no error message); if not, it should re-check and retry after roughly one minute, looping until a connection is established, for up to about 12 hours. The connection should be assigned to the conn object, so the object must hold a successful connection (something like <Connection established, # 20180522 20:43:41 CET>).
The function which does not work is here:
connect <- function(c) {
  message(paste("remaining run", c))
  withRestarts(
    err <- tryCatch({
      conn <- server.connect(settings...)
    }, error = function(e) {
      invokeRestart("reconnect")
    }),
    reconnect = function() {
      message("re-connecting")
      stopifnot(c > 0)
      for (i in 1:10) { Sys.sleep(6); cat(i) }
      connect(c - 1)
    }
  )
}
connect(1000)  # with ~1 min of sleep per retry, this covers over 12 hours of running
So the question is: what is wrong, and how should the function be rewritten so that it runs as expected? Thanks.
EDIT
It seems that the function should be:
connect <- function(c) {
  message(paste("remaining run", c))
  withRestarts(
    err <- tryCatch({
      server.connect(settings...)
    }, error = function(e) {
      invokeRestart("reconnect")
    }),
    reconnect = function() {
      message("re-connecting")
      stopifnot(c > 0)
      for (i in 1:10) { Sys.sleep(6); cat(i) }
    }
  )
}
conn <- connect(1000)
EDIT 2
Here is a comment on the above function, which I have tested:
I tested the EDIT function by simulating the connection: I first ran it without an internet connection (the function then retries in its 1:10 loop of 6-second sleeps), and while it was running I connected to the internet, expecting the next iteration to pick up the connection and reach the server if available. What actually happens is that the function never picks up the later opportunity to connect.
If you only want to loop over the connection establishment this will work:
# simulate an unstable connection
server.connect <- function(...) {
  if (round(as.numeric(Sys.time())) %% 10 != 0)  # about 90 % failed connections
    stop("Connection error")
  return(TRUE)  # success
}

connect <- function(attempts = 1, sleep.seconds = 6) {
  for (i in 1:attempts) {
    res <- try(server.connect("my connection string"), silent = TRUE)
    if (!("try-error" %in% class(res))) {
      print("Connected...")
      return(res)
    }
    print(paste0("Attempt #", i, " failed"))
    Sys.sleep(sleep.seconds)
  }
  stop("Maximum number of connection attempts exceeded")
}

con <- connect(attempts = 10, sleep.seconds = 1)
Example execution log:
[1] "Attempt #1 failed"
[1] "Attempt #2 failed"
[1] "Attempt #3 failed"
[1] "Attempt #4 failed"
[1] "Attempt #5 failed"
[1] "Attempt #6 failed"
[1] "Connected..."

How to solve org.apache.spark.SparkException in sparklyr?

I'm a beginner with sparklyr. When I try to use the copy_to() function I get an error; how can I fix it?
My code:
library(sparklyr)
library(babynames)
sc <- spark_connect(master = "local", version = "2.0.1")
babynames_tbl <- copy_to(sc, babynames, "babynames")
Error:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost
task 0.0 in stage 1.0 (TID 1, localhost):
java.lang.NullPointerException at
java.lang.ProcessBuilder.start(ProcessBuilder.java:1012) at
org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at
org.apache.hadoop.util.Shell.run(Shell.java:455) at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873) at
org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853) at
org.apache.spark.util.Utils$.fetchFile(Utils.scala:407) at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:430)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:422)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike
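No answer is recorded here; one hedged reading of the ProcessBuilder/Shell/chmod frames is that the local Spark installation the executor shells out to is incomplete or mismatched. A sketch of letting sparklyr install and manage a matching local Spark (an assumption, not a confirmed fix for this exact trace):

library(sparklyr)
library(babynames)

# Assumption: a sparklyr-managed local Spark avoids a broken or mismatched
# installation that the executor fails to launch helper processes from.
spark_install(version = "2.0.1")
sc <- spark_connect(master = "local", version = "2.0.1")
babynames_tbl <- copy_to(sc, babynames, "babynames", overwrite = TRUE)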

foreach loop (R/doParallel package) fails with a large number of iterations

I have the following R-code:
library(doParallel)

cl <- makeCluster(detectCores() - 4, outfile = "")
registerDoParallel(cl)

calc <- function(i){
  ...
  # returns a dataframe
}

system.time(
  res <- foreach(i = 1:106800, .verbose = TRUE) %dopar% calc(i)
)

stopCluster(cl)
If I run that code for 1:5, it finishes successfully.
The same happens if I run it for 106000:106800.
But it fails if I run the full vector 1:106800, or even 100000:106800 (these are not the exact numbers I am working with, but they are easier to read), with this error message:
...
got results for task 6813
numValues: 6814, numResults: 6813, stopped: TRUE
returning status FALSE
got results for task 6814
numValues: 6814, numResults: 6814, stopped: TRUE
calling combine function
evaluating call object to combine results:
fun(accum, result.6733, result.6734, result.6735, result.6736,
result.6737, result.6738, result.6739, result.6740, result.6741,
result.6742, result.6743, result.6744, result.6745, result.6746,
result.6747, result.6748, result.6749, result.6750, result.6751,
result.6752, result.6753, result.6754, result.6755, result.6756,
result.6757, result.6758, result.6759, result.6760, result.6761,
result.6762, result.6763, result.6764, result.6765, result.6766,
result.6767, result.6768, result.6769, result.6770, result.6771,
result.6772, result.6773, result.6774, result.6775, result.6776,
result.6777, result.6778, result.6779, result.6780, result.6781,
result.6782, result.6783, result.6784, result.6785, result.6786,
result.6787, result.6788, result.6789, result.6790, result.6791,
result.6792, result.6793, result.6794, result.6795, result.6796,
result.6797, result.6798, result.6799, result.6800, result.6801,
result.6802, result.6803, result.6804, result.6805, result.6806,
result.6807, result.6808, result.6809, result.6810, result.6811,
result.6812, result.6813, result.6814)
returning status TRUE
Error in calc(i) :
task 1 failed - "object of type 'S4' is not subsettable"
I have no clue why I get this error message. Unfortunately, I cannot provide a running example, as I cannot reproduce it with simple code. Is a single job failing? If so, how can I find out which one fails? Any other ideas on how to troubleshoot?
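Not a diagnosis of the S4 error itself, but one way to answer the "which task fails?" part: have each iteration catch its own error and return it, then inspect the results afterwards. A hedged sketch, assuming calc() is the asker's function:

library(doParallel)

cl <- makeCluster(detectCores() - 4)
registerDoParallel(cl)

# Each task traps its own failure, so the combine step never aborts,
# and the failing index is recorded in the error message.
res <- foreach(i = 1:106800) %dopar% {
  tryCatch(calc(i), error = function(e) {
    simpleError(sprintf("task %d failed: %s", i, conditionMessage(e)))
  })
}

failed <- which(vapply(res, inherits, logical(1), what = "error"))
res[failed]   # which iterations failed, and why

stopCluster(cl)

foreach() also has an .errorhandling = "pass" argument that returns the error object for a failing task instead of stopping, which achieves much the same thing.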

pmap bounds error: parallel Julia

I get a bounds error when running a function in parallel that runs fine normally (sequentially), e.g. when I run:
parallelsol = @time pmap(i -> findividual(x,y,z), 1:50)
It gives me an error:
exception on 2: exception on exception on 16: 20exception on 5: : ERROR: BoundsError()
in getindex at array.jl:246 (repeats 2 times)
But when I run:
parallelsol = @time map(i -> findividual(prodexcint,firstrun,q,r,unempinc,VUnempperm,Indunempperm,i,VUnemp,poachedwagevec, mw,k,Vp,Vnp,reswage), 1:50)
It runs fine. Any ideas as to why this might be happening?

How to build a logistic regression model in SparkR

I am new to Spark as well as SparkR. I have successfully installed Spark and SparkR.
When I tried to build a logistic regression model with R and Spark over a CSV file stored in HDFS, I got the error "incorrect number of dimensions".
My code is:
points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/Henry/data.csv"), readPartition))
collect(points)

w <- runif(n = D, min = -1, max = 1)
cat("Initial w: ", w, "\n")

# Compute logistic regression gradient for a matrix of data points
gradient <- function(partition) {
  partition = partition[[1]]
  Y <- partition[, 1]   # point labels (first column of input file)
  X <- partition[, -1]  # point coordinates

  # For each point (x, y), compute gradient function
  dot <- X %*% w
  logit <- 1 / (1 + exp(-Y * dot))
  grad <- t(X) %*% ((logit - 1) * Y)
  list(grad)
}

for (i in 1:iterations) {
  cat("On iteration ", i, "\n")
  w <- w - reduce(lapplyPartition(points, gradient), "+")
}
Error Message is:
On iteration 1
Error in partition[, 1] : incorrect number of dimensions
Calls: do.call ... func -> FUN -> FUN -> Reduce -> <Anonymous> -> FUN -> FUN
Execution halted
14/09/27 01:38:13 ERROR Executor: Exception in task 0.0 in stage 181.0 (TID 189)
java.lang.NullPointerException
at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 WARN TaskSetManager: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException:
edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 ERROR TaskSetManager: Task 0 in stage 181.0 failed 1 times; aborting job
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 181.0 failed 1 times, most recent failure: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException: edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:701) Driver stacktrace:
Dimensions of the data (sample):
data <- read.csv("/home/Henry/data.csv")
dim(data)
[1] 17 541
What could be the possible reason for this error?
The problem is that textFile() reads text data and returns a distributed collection of strings, each of which corresponds to a line of the text file. Therefore, later in the program, partition[, -1] fails. The program's real intent seems to be to treat points as a distributed collection of data frames. We are working on providing data frame support in SparkR soon (SPARKR-1).
To resolve the issue, manipulate your partition with string operations to extract X and Y correctly. Another option (which I think you have probably seen before) is to produce a different type of distributed collection from the beginning, as is done here: examples/logistic_regression.R.
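To make the string-operations suggestion concrete, here is a hedged sketch of what gradient() could do when each element of partition is one CSV line held as a character string (the comma separator and column layout are assumptions based on the data.csv sample above):

# Sketch only: parse each line into numbers before indexing; w is the global
# weight vector from the driver script, as in the original code.
gradient <- function(partition) {
  mat <- do.call(rbind, lapply(partition, function(line) {
    as.numeric(strsplit(line, ",")[[1]])
  }))
  Y <- mat[, 1]    # labels in the first column
  X <- mat[, -1]   # remaining columns are the features
  logit <- 1 / (1 + exp(-Y * (X %*% w)))
  list(t(X) %*% ((logit - 1) * Y))
}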
