I am new to Spark as well as SparkR. I have successfully installed Spark and SparkR.
When I tried to build a logistic regression model with R and Spark over a CSV file stored in HDFS, I got the error "incorrect number of dimensions".
My code is:
points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/Henry/data.csv"), readPartition))
collect(points)
w <- runif(n=D, min = -1, max = 1)
cat("Initial w: ", w, "\n")
# Compute logistic regression gradient for a matrix of data points
gradient <- function(partition) {
  partition <- partition[[1]]
  Y <- partition[, 1]  # point labels (first column of input file)
  X <- partition[, -1] # point coordinates
  # For each point (x, y), compute gradient function
  dot <- X %*% w
  logit <- 1 / (1 + exp(-Y * dot))
  grad <- t(X) %*% ((logit - 1) * Y)
  list(grad)
}
for (i in 1:iterations) {
  cat("On iteration ", i, "\n")
  w <- w - reduce(lapplyPartition(points, gradient), "+")
}
The error message is:
On iteration 1
Error in partition[, 1] : incorrect number of dimensions
Calls: do.call ... func -> FUN -> FUN -> Reduce -> <Anonymous> -> FUN -> FUN
Execution halted
14/09/27 01:38:13 ERROR Executor: Exception in task 0.0 in stage 181.0 (TID 189)
java.lang.NullPointerException
at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 WARN TaskSetManager: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException:
edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:701)
14/09/27 01:38:13 ERROR TaskSetManager: Task 0 in stage 181.0 failed 1 times; aborting job
Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 181.0 failed 1 times, most recent failure: Lost task 0.0 in stage 181.0 (TID 189, localhost): java.lang.NullPointerException: edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:125) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:701) Driver stacktrace:
Dimensions of the data (a sample):
data <- read.csv("/home/Henry/data.csv")
dim(data)
[1] 17 541
What could be the possible reason for this error?
The problem is that textFile() reads text data and returns a distributed collection of strings, each of which corresponds to one line of the text file. Therefore partition[, -1] fails later in the program. The program's real intent seems to be to treat points as a distributed collection of data frames. We are working on providing data frame support in SparkR soon (SPARKR-1).
To resolve the issue, manipulate each partition with string operations to extract X and Y correctly, as sketched below. Another option is to produce a different type of distributed collection from the beginning, as is done in examples/logistic_regression.R (which you have probably seen already).
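For example, a minimal sketch of that string-based approach, assuming each element of a partition is one comma-separated line of numbers (the helper name parsePartition is hypothetical):
# Turn a partition (a list of text lines) into a numeric matrix so that
# partition[, 1] and partition[, -1] behave as intended.
parsePartition <- function(part) {
  rows <- lapply(part, function(line) as.numeric(strsplit(line, ",")[[1]]))
  do.call(rbind, rows)  # one matrix row per input line
}
Inside gradient() you would then start from something like partition <- parsePartition(partition) rather than indexing the strings directly.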
Related
I'm solving a particular optimization problem in Julia using JuMP and Ipopt, and I have trouble retrieving the history of values, i.e. the value of x from every iteration.
I couldn't find anything useful in the documentation.
Minimal example:
using JuMP
import Ipopt
model = Model(Ipopt.Optimizer)
@variable(model, -2.0 <= x <= 2.0, start = -2.0)
@NLobjective(model, Min, (x - 1.0) ^ 2)
optimize!(model)
value(x)
I'd like to see the value of x from every iteration, not only the last one, so I can plot x against iteration.
Any help is appreciated :)
Each solver has a parameter controlling how verbose its output is.
In the case of Ipopt, before calling optimize!(model) you can do:
set_optimizer_attribute(model, "print_level", 7)
In the logs, look for curr_x (here is part of the log):
**************************************************
*** Summary of Iteration: 6:
**************************************************
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
6 3.8455657e-13 0.00e+00 8.39e-17 -5.7 5.74e-05 - 1.00e+00 1.00e+00f 1
**************************************************
*** Beginning Iteration 6 from the following point:
**************************************************
Current barrier parameter mu = 1.8449144625279479e-06
Current fraction-to-the-boundary parameter tau = 9.9999815508553747e-01
||curr_x||_inf = 9.9999937987374388e-01
||curr_s||_inf = 0.0000000000000000e+00
||curr_y_c||_inf = 0.0000000000000000e+00
||curr_y_d||_inf = 0.0000000000000000e+00
||curr_z_L||_inf = 6.1403864613595829e-07
This is currently not possible. But there's an open issue: https://github.com/jump-dev/Ipopt.jl/issues/281
I am trying to run a function containing a for loop in parallel using spark_apply in Databricks on Azure.
My function is:
distribution <- function(sims) {
  for (p in 1:100) {
    increment_value <- list()
    profiles <- list()
    samples <- list()
    sample_num <- list()
    for (i in 1:length(samp_seq)) {
      w <- sample(sims, size = batch)
      z <- sum(w)
      name3 <- as.character(z)
      samples[[name3]] <- data.frame(value = z)
    }
  }
}
When I pass the function to spark_apply like this:
sdf_len(sc,1) %>%
spark_apply(distribution)
I get the following error:
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 305.0 failed 4 times, most recent failure: Lost task 0.3 in stage 305.0 (TID 297, 10.139.64.6, executor 0): java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
I'm trying to make a layer in Keras (R) which computes the matrix exponential of a layer of shape (d, d).
That is, the input to the layer is a d x d matrix and the output is a d x d matrix which is the matrix exponential of the input.
What I've Implemented to Date:
Here's what I've done (it's a degree-4 approximation because I'm also not sure how to get the TensorFlow matrix exponential command working in Keras):
# Matrix Exponential
Matrix_Exp<- R6::R6Class("KerasLayer",
inherit = KerasLayer,
public = list(
call = function(x, mask = NULL) {
# Initialize Tensor-like objects -> Tensor objects
ord0 = k_eye((k_shape(x)[1]))
ord1 = x
ord2 = (1/2)*k_dot(x,x) # note x is square so this works
ord3 = (1/6)*k_dot(x,ord2)
ord4 = (1/24)*k_dot(x,ord3)
ord0+ord1+ord2 +ord3+ord4
},
compute_output_shape = function(input_shape) {
c(d,d)
}
)
)
# Create layer wrapper function
layer_Matrix_Exp <- function(object) {
create_layer(Matrix_Exp, object)
}
I'm plugging a model with this summary into the custom layer:
Model: "sequential_32"
_________________________________________________________________________________________________________________________________________________________________
Layer (type) Output Shape Param #
=================================================================================================================================================================
dense_63 (Dense) (None, 100) 400
_________________________________________________________________________________________________________________________________________________________________
dense_64 (Dense) (None, 4) 404
_________________________________________________________________________________________________________________________________________________________________
reshape_10 (Reshape) (None, 2, 2) 0
=================================================================================================================================================================
Total params: 804
Trainable params: 804
Non-trainable params: 0
_________________________________________________________________________________________________________________________________________________________________
Problem/Error:
But I run into this error when calling layers_NE %>% layer_Matrix_Exp:
WARNING:tensorflow:Entity <function wrap_fn.<locals>.fn at 0x7fbdd0cf2b90> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Evaluation error: object 'size' not found.
Error in py_call_impl(callable, dots$args, dots$keywords) :
RuntimeError: in converted code:
/scratch/users/BIM/R/x86_64-redhat-linux-gnu-library/3.6/keras/python/kerastools/layer.py:30 call *
return self.r_call(inputs, mask)
<string>:4 fn
/scratch/users/BIM/R/x86_64-redhat-linux-gnu-library/3.6/reticulate/python/rpytools/call.py:21 python_function
raise RuntimeError(res[kErrorKey])
RuntimeError: Evaluation error: object 'size' not found.
Note:
The problem is coming from the identity part, but I don't know how to fix it.
Question:
How can I fix this error?
How can I replace the manual order-4 approximation to the matrix exponential with the Keras equivalent of the TensorFlow matrix exponential command? (A sketch of what I mean is below.)
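Concretely, something like the following is what I would hope to write; it is an untested sketch, assuming the tensorflow R package exposes tf$linalg$expm and that layer_lambda can wrap it:
library(keras)
library(tensorflow)

# Untested sketch: wrap TensorFlow's matrix exponential in a lambda layer
# instead of the hand-rolled order-4 approximation above.
layer_matrix_exp <- function(object) {
  layer_lambda(object, f = function(x) tf$linalg$expm(x))
}

# Intended usage (hypothetical): layers_NE %>% layer_matrix_exp()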
Thanks in advance.
Using pvclust::pvclust, I got an error:
Error in solve.default(crossprod(X, X/vv)) : Lapack routine dgesv:
system is exactly singular: U[2,2] = 0 Calls: ...
pvclust.merge -> lapply -> FUN -> msfit -> solve -> solve.default
Execution halted
I don't want the analysis to stop even when crossprod(X, X/vv) is a singular matrix, so I tried to insert an if {...} block into pvclust::msfit that checks whether crossprod(X, X/vv) is singular using matrixcalc::is.singular.matrix and, if it is, returns NA and continues.
After saving my.msfit.R, which contains a version of msfit with if (!is.singular.matrix(...)) {...} else {...} inserted into the original pvclust::msfit, I ran:
methods::insertSource('/myFuncDir/my.msfit.R', package="pvclust", functions='msfit')
But I got the error below:
Error in assign(this, thisObj, envir = envwhere) :
cannot change value of locked binding for 'msfit'
In addition: Warning message:
In methods::insertSource(filename, package = "pvclust", functions = "msfit", :
cannot insert these (not found in source): "msfit"
Is there any solution? Should I make a request to the author of the pvclust package?
== Below added after posting ==
Good advice was given in the comments to use try/catch syntax, but I don't think it will solve my problem.
Since my English is poor, I present a toy example that illustrates the situation.
fun.a <- function(a1, a2, a3, a4) {
  sum1 <- a1 + a2
  sum2 <- a2 + a3
  sum3 <- a3 + a4
  return(list(sum1, sum2, sum3))
}
fun.a(1,2,3,'Char')
Because sum3 raises an error, fun.a(1, 2, 3, 'Char') fails.
But I want it to return
list(sum1, sum2, NaN)
If I use tryCatch(..., error = expr), each of sum1 through sum3 (actually, the solve(...) call in pvclust::msfit) would have to be wrapped, as in the sketch below.
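Here is what that wrapping would look like on the toy function (a minimal sketch; the safe_sum helper and the NaN fallback are just for illustration):
fun.a.wrapped <- function(a1, a2, a3, a4) {
  # Each sum is evaluated inside tryCatch so one failure does not stop the rest
  safe_sum <- function(x, y) tryCatch(x + y, error = function(e) NaN)
  sum1 <- safe_sum(a1, a2)
  sum2 <- safe_sum(a2, a3)
  sum3 <- safe_sum(a3, a4)
  return(list(sum1, sum2, sum3))
}

fun.a.wrapped(1, 2, 3, 'Char')  # list(3, 5, NaN) instead of an error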
But fun.a (i.e. msfit) is an internal function of a locked package (pvclust).
I get a bounds error when running a function in parallel that runs fine sequentially. For example, when I run:
parallelsol = @time pmap(i -> findividual(x,y,z), 1:50)
It gives me an error:
exception on 2: exception on exception on 16: 20exception on 5: : ERROR: BoundsError()
in getindex at array.jl:246 (repeats 2 times)
But when I run:
parallelsol = @time map(i -> findividual(prodexcint,firstrun,q,r,unempinc,VUnempperm,Indunempperm,i,VUnemp,poachedwagevec, mw,k,Vp,Vnp,reswage), 1:50)
It runs fine. Any ideas as to why this might be happening?