I am very new to SparkR (and to parallelization in general). I am running SparkR locally (I know that is not the intended usage of Spark, but I am just getting started) and I have tried to rewrite part of my code with SparkR. However, collect gives me the following errors as I increase the number of samples (there is no error for a small number of samples):
Error in unserialize(obj) :
ReadItem: unknown type 0, perhaps written by later version of R
Calls: assetForecast ... convertJListToRList -> lapply -> lapply -> FUN -> unserialize
Execution halted
and the other error, which is probably due to my low memory, is:
heap memory error (trying increasing JVM memory & driver memory did not help)
I would appreciate any help regarding the FIRST error (I posted the second error since I thought they may be related, even though I get them by setting different values for numSlices in parallelize). I think the first one may be caused by a version incompatibility between Spark, SparkR and R that leads to this serialization issue. I tried installing different versions, but pretty soon got stuck resolving dependencies.
Here is a sample script which simulates what I am doing in SparkR (the errors are generated for input.len > 950):
library(SparkR) # load sparkR library
sc <- sparkR.init() ## initialize the sparkR
input.len <- 8000 # size of the input
num.slice <- 2 # number of slices for parallelize function
## Define a few functions to simulate actual calculations
latemail <- function(N, st="2012/01/01", et="2015/12/31") {
  ## create random dates of length N
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit="sec"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}
encode <- function(ele1, ele2) {
  ## concatenate ele1 and ele2, separated by %
  return (paste(toString(ele1), toString(ele2), sep = "%"))
}
decode <- function(coded) {
  ## split the input string at %
  idx <- regexpr("%", coded)[1]
  ele1 <- as.numeric(substr(coded, 1, idx - 1))
  ele2 <- substr(coded, idx + 1, nchar(coded))
  return (list(ele1, ele2))
}
fakeFun <- function(asset.age, asset.year) {
  ## fake function to simulate my actual function
  return (as.list(rep(asset.age, 10)))
}
wrapperFun <- function(x) {
  asset.age <- decode(x)[[1]]
  asset.y <- decode(x)[[2]]  # second element of the decoded pair holds the date
  df <- fakeFun(asset.age, asset.y)
  return (df)
}
## Start of calculations with SparkR
calc.ts <- latemail(input.len) ## create fake years
asset.ages <- runif(input.len) * 10 ## create fake ages
paired <- list()
for (i in 1:length(asset.ages)) {
  ## keep information of both years and ages in one vector
  ## using encode function
  paired[[length(paired) + 1]] <- encode(asset.ages[[i]], calc.ts[[i]])
}
rdd.paired <- parallelize(sc, paired, numSlices = num.slice)
rdd.df <- lapply(rdd.paired, wrapperFun)
rdd.list <- collect(rdd.df)
print(rdd.list)
sparkR.stop()
Here is the full error report.
For numSlices = 5 in the parallelize function:
> rdd.list <- collect(rdd.df)
15/07/22 17:20:40 INFO RRDD: Times: boot = 0.434 s, init = 0.015 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.200 s, write-output = 0.004 s, total = 0.656 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.017 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.193 s, write-output = 0.004 s, total = 0.227 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.013 s, broadcast = 0.001 s, read-input = 0.002 s, compute = 0.191 s, write-output = 0.003 s, total = 0.220 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.011 s, broadcast = 0.000 s, read-input = 0.002 s, compute = 0.191 s, write-output = 0.004 s, total = 0.218 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.014 s, init = 0.015 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.213 s, write-output = 0.004 s, total = 0.249 s
Error in unserialize(obj) :
ReadItem: unknown type 0, perhaps written by later version of R
Calls: collect ... convertJListToRList -> lapply -> lapply -> FUN -> unserialize
Execution halted
For numSlices = 6 in the parallelize function:
15/07/22 17:18:52 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): java.lang.OutOfMemoryError: Java heap space
edu.berkeley.cs.amplab.sparkr.RRDD.readData(RRDD.scala:258)
edu.berkeley.cs.amplab.sparkr.RRDD.readData(RRDD.scala:243)
edu.berkeley.cs.amplab.sparkr.BaseRRDD.read(RRDD.scala:200)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.next(RRDD.scala:70)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.foreach(RRDD.scala:66)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.to(RRDD.scala:66)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.toBuffer(RRDD.scala:66)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.toArray(RRDD.scala:66)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/07/22 17:18:52 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
Error in readTypedObject(con, type) :
Unsupported type for deserialization
Calls: collect ... callJMethod -> invokeJava -> readObject -> readTypedObject
Execution halted
Is there really a problem with my SparkR installation? If so, why does it run for a small number of samples?
Thanks a lot.
The following is how it works (or should work) in Spark 1.4.0. First, initialize a sqlContext as well:
sqlContext <- sparkRSQL.init(sc)
And then change your code, starting from
paired <- list()
into:
# Create a vector instead of a list
paired <- c()
for (i in 1:length(asset.ages)) {
  ## keep information of both years and ages in one vector
  ## using encode function
  paired[length(paired) + 1] <- encode(asset.ages[[i]], calc.ts[[i]])
}
# What you actually need is a data.frame or SparkR DataFrame
paired.data.frame <- data.frame(paired=paired)
paired.DataFrame <- createDataFrame(sqlContext, paired.data.frame)
# The map function returns an RDD, which you cannot collect yet;
# therefore convert it to a DataFrame again
paired.df <- createDataFrame(sqlContext, map(paired.DataFrame,wrapperFun))
# This DataFrame you can collect
paired.result <- collect(paired.df)
Why did I say it should work in my first sentence? It works when I run it on my laptop, but I altered the SparkR source code to make map available.
I do not know how to fix this in SparkR 1.2, but I would suggest switching to Spark 1.4.0 anyway, since SparkR has been integrated into Spark from that release onwards.
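If rebuilding SparkR is not an option, a possible alternative is to reach the unexported function through the ::: operator. This is only a sketch, under the assumption that map() exists in your SparkR build but is simply not exported, and the usual caveat about relying on private functions applies:
# Sketch: call the unexported map() directly instead of patching the source
paired.df <- createDataFrame(sqlContext, SparkR:::map(paired.DataFrame, wrapperFun))
paired.result <- collect(paired.df)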
I am trying to bring real time data from LabVIEW (vibration of a bearing and temperature) into an app written in R to create a control chart. It works for a while but eventually crashes with the following error message:
Error in aggregate.data.frame(B, list(rep(1:(nrow(B)%/%n + 1), each = n, :
no rows to aggregate
The process works as follows: LabVIEW takes the data and writes it to two Excel files. Those files are read by the R code and used to produce a control chart in R. The process succeeds for some time, and the failure moment is not always after the same amount of time. Sometimes the control chart will run for 6-7 min, other times it will crash within 2 min.
My suspicion is that the Excel files are not being updated fast enough, so the R code tries to read that Excel file when it is empty.
Any suggestions would be great! Thank you!
I have tried to lower the sample size taken per second. That did not work.
library(data.table)  # provides fread()
getwd()
setwd("C:/Users/johnd/Desktop/R Data")
while(1) {
A = fread("C:/Users/johnd/Desktop/R Data/a1.csv" , skip = 4 , header = FALSE , col.names = c("t1","B2","t2","AM","t3","M","t4","B1"))
t1 = A$t1
B2 = A$B2
t2 = A$t2
AM = A$AM
t3 = A$t3
M = A$M
t4 = A$t4
B1 = A$B1
B = fread("C:/Users/johnd/Desktop/R Data/b1.csv" , skip = 4 , header = FALSE , col.names = c("T1","small","T2","big"))
T1 = B$T1
small = B$small
T2 = B$T2
big = B$big
DJ1 = A[seq(1,nrow(A),1),c('t1','B2','AM','M','B1')]
DJ1
n = 16
DJ2 = aggregate(B,list(rep(1:(nrow(B)%/%n+1),each=n,len=nrow(B))),mean)[-1]
DJ2
#------------------------------------------------------------------------
DJ6 = cbind(DJ1[,'B1'],DJ2[,c('small','big')]) # creates matrix for these three indicators
DJ6
#--------------T2 Hand made---------------------------------------------------------------------
new_B1 = DJ6[,'B1']
new_small = DJ6[,'small'] ### decompose the DJ6 matrix into vectors for each indicator(temperature, big & small accelerometers)
new_big = DJ6[,'big']
new_B1
new_small
new_big
mean_B1 = as.numeric(colMeans(DJ6[,'B1']))
mean_small = as.numeric(colMeans(DJ6[,'small'])) ##decomposes into vectors of type numeric
mean_big = as.numeric(colMeans(DJ6[,'big']))
cov_inv = data.matrix(solve(cov(DJ6))) # obtain inverse covariance matrix
cov_inv
p = ncol(DJ6) # number of quality characteristics, pulled from the number of columns in the original matrix (was hard-coded as p=3)
m=64 # #of samples (10 seconds of data)
a_alpha = 0.99
f= qf(a_alpha , df1 = p,df2 = (m-p)) ### calculates the F-Statistic for our data
f
UCL = (p*(m+1)*(m-1)*(f))/(m*(m-p)) ###produces upper control limit
UCL
diff_B1 = new_B1-mean_B1
diff_small = new_small-mean_small
diff_big = new_big-mean_big
DJ7 = cbind(diff_B1, diff_small , diff_big) #produces matrix of difference between average and observations (x-(x-bar))
DJ7
# DJ8 = data.matrix(DJ7[1,])
# DJ8
DJ9 = data.matrix(DJ7) ### turns matrix into appropriate numeric form
DJ9
# T2.1.1 = DJ8 %*% cov_inv %*% t(DJ8)
# T2.1.1
# T2.1 = t(as.matrix(DJ9[1,])) %*% cov_inv %*% as.matrix(DJ9[1,])
# T2.1
#T2 <- NULL
for(i in 1:64){ #### creates vector of T^2 statistic
T2<- t(as.matrix(DJ9[i,])) %*% cov_inv %*% as.matrix(DJ9[i,]) # calculation of T^2 test statistic ## there is no calculation of x-double bar
write.table(T2,"C:/Users/johnd/Desktop/R Data/c1.csv",append=T,sep="," , col.names = FALSE)#
#
DJ12 <-fread("C:/Users/johnd/Desktop/R Data/c1.csv" , header = FALSE ) #
}
# DJ12
DJ12$V1 = 1:nrow(DJ12)
# plot(DJ12 , type='l')
p1 = nrow(DJ12)-m
p2 = nrow(DJ12)
plot(DJ12[p1:p2,], type ='o', ylim =c(0,15), ylab="T2 Chart" , xlab="Data points") ### plots last 640 points
# plot(DJ12[p1:p2,], type ='o' , ylim =c(0,15) , ylab="T2 Chart" , xlab="Data points")
abline(h=UCL , col="red") ## displays upper control limit
Sys.sleep(1)
}
The process succeeds for some time, and the failure moment is not always after the same amount of time. Sometimes the control chart will run for 6-7 min, other times it will crash within 2 min.
My suspicion is that the Excel files are not being updated fast enough, so the R code tries to read that Excel file when it is empty.
Your suspicion is correct.
With your current design, your R application can crash depending on how fast it runs relative to your LabVIEW application. This is called a race condition; you must eliminate race conditions from your code.
A quick and dirty solution
One simple solution to avoid the crash is to call NROW to check if any data exists. If there's no data available, don't call aggregate. This is described here: error message in r: no rows to aggregate
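A minimal sketch of that guard, re-using the question's own objects (B and n); the try() wrapper is an extra assumption on my part, since fread() itself can fail while the file is still empty:
B <- try(fread("C:/Users/johnd/Desktop/R Data/b1.csv", skip = 4, header = FALSE,
               col.names = c("T1","small","T2","big")), silent = TRUE)
if (!inherits(B, "try-error") && nrow(B) > 0) {
  DJ2 <- aggregate(B, list(rep(1:(nrow(B) %/% n + 1), each = n, len = nrow(B))), mean)[-1]
} else {
  Sys.sleep(1)  # no rows yet -- skip this pass and retry on the next loop iteration
}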
A more robust solution
A better solution is to use a communications protocol like TCP to stream data from LabVIEW to R, instead of using CSV files to transfer real-time data. For example, your R program could listen for data on a TCP socket. Make it wait for data to be sent from LabVIEW before running your data processing code.
Here is an example on using socketConnection in R: http://blog.corynissen.com/2013/05/using-r-to-communicate-via-socket.html
Here is an example on sending/receiving data over TCP in LabVIEW: http://www.ni.com/product-documentation/2710/en/
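A minimal sketch of the listening side in R, assuming LabVIEW connects to port 6011 and sends one comma-separated sample per line (the port number and message format are placeholders, not part of the original setup):
con <- socketConnection(host = "localhost", port = 6011, blocking = TRUE,
                        server = TRUE, open = "r")  # waits here until LabVIEW connects
repeat {
  line <- readLines(con, n = 1)             # blocks until a full line arrives
  if (length(line) == 0) break              # LabVIEW closed the connection
  sample <- as.numeric(strsplit(line, ",")[[1]])
  ## ... update the control chart with `sample` here ...
}
close(con)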
I am developing neural networks in SQL Server 2017 with R.
I use the MicrosoftML package and the NYC taxi data.
Goal: a neural network to predict the "Ratecode" of a single taxi ride.
Here is the Code:
library(MicrosoftML)
library(dplyr)
dat_all <- InputData;
sizeAll <- length(InputData$tip_amount);
sample_train <- base::sample(nrow(dat_all),
                             size = (sizeAll * 0.9))
sample_test <- base::sample((1:nrow(dat_all))[-sample_train],
                            size = (sizeAll * 0.1))
dat_train <- dat_all %>%
  slice(sample_train)
dat_test <- dat_all %>%
  slice(sample_test);
form <- Rate ~ total_amount+trip_distance+duration_in_minutes+passenger_count+PULocationID+DOLocationID;
model <- rxNeuralNet(
  formula = form,
  data = dat_train,
  type = "multiClass",
  verbose = 1);
trained_model <- data.frame(payload = as.raw(serialize(model, connection=NULL)));
The Rate is successfully detected as a factor with size 5, representing different rates such as "Standard" or "JFK".
When running the code, I get the following error:
Error: All rows in data has missing values (N/A). Please clean missing data before training. Error in processing machine learning request.
Error in doTryCatch(return(expr), name, parentenv, handler) :
  Error: All rows in data has missing values (N/A). Please clean missing data before training. Error in processing machine learning request.
Calls: source ... tryCatch -> tryCatchList -> tryCatchOne -> doTryCatch -> .Call
The very same error occurs when replacing the rate with a rateID.
I assume that some form of transformation is needed to get this working, but the MS documentation is somewhat lacking on this point.
Here is the verbose output of my NN before it fails:
***** Net definition *****
input Data [6];
STDOUT message(s) from external script:
hidden H [100] sigmoid { // Depth 1
from Data all;
}
output Result [5] softmax { // Depth 0
from H all;
}
***** End net definition *****
Input count: 6
Output count: 5
Output Function: SoftMax
Loss Function: LogLoss
PreTrainer: NoPreTrainer
___________________________________________________________________
Starting training...
Learning rate: 0,001000
Momentum: 0,000000
InitWtsDiameter: 0,100000
___________________________________________________________________
Initializing 1 Hidden Layers, 1205 Weights...
Elapsed time: 00:00:00.7222942
I figured it out; here is the working code:
library(MicrosoftML)
library(dplyr)
netDefinition <- ("
input Data auto;
hidden Mystery [100] sigmoid from Data all;
hidden Magic [100] sigmoid from Mystery all;
output Result auto softmax from Magic all;
")
dat_all <- InputData;
LocationLevels <- as.factor(c(1:265));
dat_all$PULocationID <- factor(dat_all$PULocationID, levels=LocationLevels);
dat_all$DOLocationID <- factor(dat_all$DOLocationID, levels=LocationLevels);
dat_all$RatecodeID <- factor(dat_all$RatecodeID, levels=as.factor(c(1:6)) );
form <- RatecodeID ~ trip_distance+total_amount+duration_in_minutes+passenger_count+PULocationID+DOLocationID;
model <- rxNeuralNet(
  formula = form,
  data = dat_all,
  netDefinition = netDefinition,
  type = "multiClass",
  numIterations = 100,
  verbose = 1);
trained_model <- data.frame(payload = as.raw(serialize(model, connection=NULL)));
The main issue was factorizing the data correctly.
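One note on why this matters: any value that falls outside the levels handed to factor() becomes NA, which is exactly the kind of missing data rxNeuralNet refuses to train on. A quick sanity check along these lines (just a suggestion, not something the fix itself requires) confirms the factorization is clean:
colSums(is.na(dat_all[, c("PULocationID", "DOLocationID", "RatecodeID")]))  # should all be 0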
I'm trying to copy a big database into Spark using spark_read_csv, but I'm getting the following error as output:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 16.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235):
java.lang.IllegalArgumentException: requirement failed: Decimal
precision 8 exceeds max precision 7
data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE)
It's a big data set, about 5.8 million records; the columns in my dataset are of types int, num and chr.
I think you have a couple of options, depending on the Spark version you're using.
Spark >=1.6.1
from here: https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html
it seems you can explicitly specify your schema to force it to use doubles:
csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
diamondsLoadWithSchema <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                                  source = "csv", header = "true", schema = csvSchema)
Spark < 1.6.1
consider test.csv
1,a,4.1234567890
2,b,9.0987654321
you can easily make this more efficient, but I think you get the gist
linesplit <- function(x){
  tmp <- strsplit(x, ",")
  return (tmp)
}
lineconvert <- function(x){
  arow <- x[[1]]
  converted <- list(as.integer(arow[1]), as.character(arow[2]), as.double(arow[3]))
  return (converted)
}
rdd <- SparkR:::textFile(sc,'/path/to/test.csv')
lnspl <- SparkR:::map(rdd, linesplit)
ll2 <- SparkR:::map(lnspl,lineconvert)
ddf <- createDataFrame(sqlContext,ll2)
head(ddf)
_1 _2 _3
1 1 a 4.1234567890
2 2 b 9.0987654321
NOTE: the SparkR::: methods are private for a reason; the docs say 'be careful when you use this'.
I just started testing SparkR 2.0 and find the execution of dapply very slow.
For example, the following code
set.seed(2)
random_DF<-data.frame(matrix(rnorm(1000000),100000,10))
system.time(dummy_res<-random_DF[random_DF[,1]>1,])
   user  system elapsed
  0.005   0.000   0.006
is executed in 6ms
Now, if I create a Spark DataFrame with 4 partitions and run on 4 cores, I get:
sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF),4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark<-collect(subset_DF_Spark))
   user  system elapsed
  2.003   0.119  62.919
That is about 1 minute, which is abnormally slow. Am I missing something?
I also get a warning (TaskSetManager: Stage 64 contains a task of very large size (16411 KB). The maximum recommended task size is 100 KB.). Why is this 100 KB limit so low?
I am using R 3.3.0 on Mac OS 10.10.5
Any insight welcome!
I want to find documents whose similarity to other documents is larger than a given value (0.1), by cutting the documents into blocks.
library(tm)
data("crude")
sample.dtm <- DocumentTermMatrix(
  crude, control = list(
    weighting = function(x) weightTfIdf(x, normalize = FALSE),
    stopwords = TRUE
  )
)
step = 5
n = nrow(sample.dtm)
block = n %/% step
start = (c(1:block)-1)*step+1
end = start+step-1
j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))
ij <- cbind(i,j)
library(skmeans)
getdocs <- function(k){
  ci <- c(start[k[[1]]]:end[k[[1]]])
  cj <- c(start[k[[2]]]:end[k[[2]]])
  combi <- sample.dtm[ci]
  combj <- sample.dtm[cj]
  rownames(combi) <- ci
  rownames(combj) <- cj
  comb <- c(combi, combj)
  sim <- 1 - skmeans_xdist(comb)
  cat("Block", k[[1]], "with Block", k[[2]], "\n")
  flush.console()
  tri.sim <- upper.tri(sim, diag = F)
  results <- tri.sim & sim > 0.1
  docs <- apply(results, 1, function(x) length(x[x == TRUE]))
  docnames <- names(docs)[docs > 0]
  gc()
  return (docnames)
}
It works well when using apply:
system.time(rmdocs<-apply(ij,1,getdocs))
When using parRapply:
library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))
Error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
Timing stopped at: 0.01 0 0.04
It seems that sample.dtm couldn't be used in parRapply. I'm confused. Can anyone help me? Thanks!
In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there isn't a dimnames method defined for "DocumentTermMatrix" objects, causing rownames<- to fail.
You can load packages on the cluster workers with the clusterEvalQ function:
clusterEvalQ(cl, { library(tm); library(skmeans) })
After doing that, rownames(combi)<-ci will work correctly.
Also, if you want to see the output from cat, you should use the makeCluster outfile argument:
cl <- makeCluster(2, outfile='')
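Putting both pieces together, the corrected setup from the question would look roughly like this (a sketch re-using the question's own objects):
library(snow)
library(skmeans)
cl <- makeCluster(2, outfile = '')                    # outfile='' echoes cat() output from the workers
clusterEvalQ(cl, { library(tm); library(skmeans) })   # load the packages on every worker
clusterExport(cl, list("getdocs", "sample.dtm", "start", "end"))
system.time(rmdocs <- parRapply(cl, ij, getdocs))
stopCluster(cl)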