# data splitting
set.seed(12345)
train <- sample(1:nrow(student.mat.pass.or.fail),size =
ceiling(0.80*nrow(student.mat.pass.or.fail)),replace = FALSE)
# training set
students_train <- student.mat.pass.or.fail[train,]
# test set
students_test <- student.mat.pass.or.fail[-train,]
# penalty matrix
penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)
# building the classification tree with rpart
library(rpart)
tree <- rpart(class~.,
data = students_train, # as.matrix(students_train)
parms = list(loss = penalty.matrix),
method = "class")
I get the error "object is not a matrix". Can someone help? I'm new to R. I also tried as.matrix(students_train), but it still shows the same error.
Hi everyone, I am trying to calculate the accuracy statistics for a hierarchical time series using the hts package, but I get an error that says "Error in x - fcasts : non-conformable arrays".
library(hts)
abc <- matrix(sample(1:100, 32*140, replace=TRUE), ncol=32)
colnames(abc) <- c(
paste0("A0",1:5),
paste0("B0",1:9),"B10",
paste0("C0",1:8),
paste0("D0",1:5),
paste0("E0",1:4)
)
abc <- ts(abc, start=2019, frequency=365.25/7)
x <- hts(abc, characters = c(1,2))
data <- window(x, start = 2019.000, end = 2021.166)
test <- window(x, start = 2021.185)
fcasts <- forecast(data, h = 20, method = "bu")
accuracy(fcasts, test)
accuracy(fcasts, test, levels = 1)
Then the error message is:
> data <- window(x, start = 2019.000, end = 2021.166)
> test <- window(x, start = 2021.185)
> fcasts <- forecast(data, h = 20, method = "bu")
There were 32 warnings (use warnings() to see them)
> accuracy(fcasts, test)
Error in x - fcasts : non-conformable arrays
> accuracy(fcasts, test, levels = 1)
Error in x - fcasts : non-conformable arrays
Thank you
This is a bug in the hts package, which I've now fixed in the dev version (https://github.com/earowang/hts/commit/3f444cf6d6aca23a3a7f2d482df2e33bb078dc55).
Using the CRAN version, the problem is avoided by using the same forecast horizon (h) as the length of the test set.
There was another bug in accuracy() triggered by weekly data which I've also fixed.
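The error message itself can be reproduced with plain matrices, which may help when reading the workaround above. This is only a base-R sketch with made-up sizes (15 test rows against 20 forecast rows); it does not reproduce the internals of accuracy(). With hts, the equivalent of the fix would presumably be to pass something like h = nrow(test$bts) to forecast(), which is an assumption rather than tested code.

```r
# Hypothetical sizes: 15 observed periods vs. 20 forecast periods, 2 series
actual    <- matrix(1:30, nrow = 15, ncol = 2)
forecasts <- matrix(1:40, nrow = 20, ncol = 2)

# Subtracting matrices with different dimensions fails
mismatch <- tryCatch(actual - forecasts, error = function(e) e)

# Matching the horizon to the test length makes the subtraction valid
h <- nrow(actual)
errors <- actual - forecasts[seq_len(h), , drop = FALSE]
dim(errors)
```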
I think the problem occurs because of the list object for fcasts and test.
Try this:
accuracy(fcasts$bts, test$bts)
accuracy(fcasts$bts, test$bts, levels = 1)
I am trying to train a neural network for churn prediction with R package neuralnet. Here is the code:
data <- read.csv('C:/PredictChurn.csv')
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
scaled <- data
scaled[, -c(1)] <- scaled_temp[, -c(1)]
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
library(neuralnet)
n <- names(train_[, -c(1)])
f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F)
It works as it should. However, training on the full data set (millions of rows) takes too long. I know R is single-threaded by default, so I have been researching how to parallelize the work across all cores. Is it even possible to run this function in parallel? I have tried various packages with no success.
Has anyone been able to do this?
It doesn't have to be the neuralnet package, any solution that lets me train a neural network would work.
Thank you
I have had good experiences with the package Rmpi, and it may be applicable in your case too.
library(Rmpi)
Briefly, its usage is as follows:
nproc = 4 # could be automatically determined
# Specify one master and nproc-1 slaves
Rmpi::mpi.spawn.Rslaves(nslaves=nproc-1)
# Execute function "func_to_be_parallelized" on multiple CPUs; pass two variables to function
my_fast_results = Rmpi::mpi.parLapply(var1_passed_to_func,
func_to_be_parallelized,
var2_passed_to_func)
# Close slaves
Rmpi::mpi.close.Rslaves(dellog=T)
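For comparison, the same spawn / apply / close pattern can be sketched with the parallel package that ships with R. Note that this swaps in parallel for Rmpi, so it is an analogue and not the Rmpi API; square_plus and its offset argument are hypothetical stand-ins for the function to be parallelized.

```r
library(parallel)

# Hypothetical worker function; replace with the real workload
square_plus <- function(x, offset) x^2 + offset

cl <- makeCluster(2)                                   # start two worker processes
results <- parLapply(cl, 1:4, square_plus, offset = 1) # extra args are passed through
stopCluster(cl)                                        # always shut the workers down

unlist(results)
```

Whether this helps with neuralnet specifically depends on whether the training can be split into independent pieces, which a single network fit usually cannot.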
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some of the model types but may not work for all.
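library(caret)
library(doParallel) # provides makeCluster(), detectCores(), registerDoParallel()
library(tictoc)     # provides tic() and toc()
layer1 = c(6,12,18,24,30)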
layer2 = c(6,12,18,24,30)
layer3 = c(6,12,18,24,30)
cv.folds = 5
# In order to make models fully reproducible when using parallel processing, we need to pass seeds as a parameter
# https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
total.param.permutations = length(layer1) * length(layer2) * length(layer3)
seeds <- vector(mode = "list", length = cv.folds + 1)
set.seed(1)
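# one seed per parameter combination for each fold (drawing from 1:1 would make every seed identical)
for(i in 1:cv.folds) seeds[[i]] <- sample.int(n = 1000, size = total.param.permutations)
seeds[[cv.folds + 1]] <- sample.int(n = 1000, size = 1) # for the last model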
nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
cl <- makeCluster(detectCores()*0.5) # use 50% of cores only, leave rest for other tasks
registerDoParallel(cl)
train_control <- caret::trainControl(method = "cv"
,number=cv.folds
,seeds = seeds # user defined seeds for parallel processing
,verboseIter = TRUE
,allowParallel = TRUE
)
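tic("Total Time to NN Training: ")
set.seed(1)
# formula and scaled.train.data are assumed to be defined by the user
model.nn.caret = caret::train(form = formula,
data = scaled.train.data,
method = 'neuralnet',
tuneGrid = nn.grid,
trControl = train_control
)
toc()
# stop the cluster only after training, otherwise no workers are left to run in parallel
stopCluster(cl)
registerDoSEQ()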
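The shape of the seeds list is easy to get wrong, so here is a small base-R sketch that builds it and checks its structure. It assumes the layout described in the caret documentation for trainControl: one integer vector per resample, each with one seed per parameter combination, plus a single seed for the final model. The variable names and the upper bound of 1000 are arbitrary choices for illustration.

```r
cv_folds <- 5
n_models <- 5 * 5 * 5   # layer1 x layer2 x layer3 combinations

set.seed(1)
seeds <- vector(mode = "list", length = cv_folds + 1)
for (i in seq_len(cv_folds)) {
  # one seed per candidate model within this fold
  seeds[[i]] <- sample.int(n = 1000, size = n_models)
}
# a single seed for the final model fit
seeds[[cv_folds + 1]] <- sample.int(n = 1000, size = 1)

lengths(seeds)
```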
I was using the example given by h2o for ECG anomaly detection.
When trying to compute manually the MSE, I got different results.
To demonstrate the difference I used the last test case
but all 23 cases differ.
Attached is the full code:
Thanks,
Eli.
suppressMessages(library(h2o))
localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
nthreads = -1) # use all CPUs (8 on my personal computer :3)
# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
header = FALSE,
sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
header = FALSE,
sep = ",")
# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
training_frame = train_ecg,
activation = "Tanh",
autoencoder = TRUE,
hidden = c(50,20,50),
l1 = 1e-4,
epochs = 100)
# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)
t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]
> mse.23
[1] 2.607374
> recon_error[23,]
[1] 8.264768
This is not really an answer, but I did what @Arno Candel suggested. I combined the test and train data and normalized them to the range 0 to 1. I then split the combined, normalized data back into test and train sets and ran the scripts from the OP. However, I still get a different MSE from the manual calculation. The MSE also differs when I normalize the test and train data separately. Is there something I can do to get the manual calculation to match?
suppressMessages(library(purrr))
suppressMessages(library(dplyr))
suppressMessages(library(h2o))
localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
nthreads = -1) # use all CPUs (8 on my personal computer :3)
# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
header = FALSE,
sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
header = FALSE,
sep = ",")
### adding this section
# normalize data
train_ecg <- as.data.frame(train_ecg)
test_ecg <- as.data.frame(test_ecg)
dat <- rbind(train_ecg,test_ecg)
get_desc <- function(x) {
map(x, ~list(
min = min(.x),
max = max(.x),
mean = mean(.x),
sd = sd(.x)
))
}
normalization_minmax <- function(x, desc) {
map2_dfc(x, desc, ~(.x - .y$min)/(.y$max - .y$min))
}
desc <- dat %>%
get_desc()
dat <- dat %>%
normalization_minmax(desc)
train_ecg <- as.matrix(dat[1:20,]) ; test_ecg <- as.matrix(dat[21:43,])
# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
training_frame = train_ecg,
activation = "Tanh",
autoencoder = TRUE,
hidden = c(50,20,50),
l1 = 1e-4,
epochs = 100)
# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)
t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]
> mse.23
[1] 23.14947
> recon_error[23,]
[1] 8.076866
For autoencoders in H2O, the MSE math is done in the normalized space to avoid numerical scaling issues. For example, if you have categorical features or very large numbers, the neural network autoencoder can't directly operate on those numbers, but instead, it first does dummy one-hot encoding and normalization of numeric features, then it does the fwd/back propagation and computation of reconstruction errors (in the normalized and expanded space). You can manually divide each column by its range (max-min) first for purely numerical data, and your results should match.
Here is a JUnit test that does this check explicitly (on that very dataset):
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/test/java/hex/deeplearning/DeepLearningAutoEncoderTest.java#L86-L104
You can also see https://0xdata.atlassian.net/browse/PUBDEV-2078 for more info.
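To see why a raw-space MSE and a normalized-space MSE can differ so much, here is a minimal base-R sketch with made-up numbers (not the ECG data). It assumes that normalization amounts to dividing each column by its training range (max - min), as described above; the details of what H2O does internally may differ.

```r
# Made-up training data: two numeric columns on very different scales
train <- rbind(c(0, 0), c(10, 100), c(5, 50))
col_range <- apply(train, 2, function(v) max(v) - min(v))  # 10 and 100

x     <- c(4, 40)   # hypothetical test row
recon <- c(6, 60)   # hypothetical reconstruction of that row

# MSE on the original scale: dominated by the large-scale column
mse_raw <- mean((x - recon)^2)

# MSE after dividing each column's error by its range
mse_norm <- mean(((x - recon) / col_range)^2)

c(raw = mse_raw, normalized = mse_norm)
```

Subtracting the column minimum makes no difference here, because it cancels in the difference between the input and the reconstruction.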
I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when using each setup with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why do I get these errors from pvclust at 3 dimensions and below but not from hclust? Furthermore, why do the errors disappear when I use a data set with 4 or more dimensions?
Near the end of the pvclust function we see the line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
then digging deeper we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
Also note that the default value of the parameter r for pvclust is r=seq(.5,1.4,by=.1). However, as the progress output shows, this value is changed somewhere along the way:
Bootstrap (r = 0.33)...
So we get size <- round(3 * 0.33, digits = 0), which is 1, and data[smpl,] therefore has only 1 row, fewer than the 2 objects hclust needs. After correcting r, the call gives a warning, which is possibly harmless, and output is produced:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
Let me know if the results are satisfactory.
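The size arithmetic and both error messages can be checked in base R without pvclust. The sketch below uses a cut-down 3 x 3 version of the matrix from the question, and it assumes that pvclust's internal distance step boils down to dist(t(x)) for Euclidean distance and cor(x) for correlation, which is my reading of the errors and not a quote of the package source.

```r
n <- 3                                       # rows (dimensions) in the data
size_small <- round(n * 0.33, digits = 0)    # bootstrap sample size at r = 0.33
sizes_fixed <- round(n * seq(0.7, 1.4, by = 0.1), digits = 0)

# Cut-down version of the question's data: 3 rows, 3 columns
mat <- matrix(c(9000, 2, 238, 10000, 6, 224, 1001, 3, 259), nrow = 3,
              dimnames = list(NULL, c("A", "B", "C")))

# Selecting a single row drops the matrix to a plain vector
one_row <- mat[1, ]

# Euclidean path: only one object is left to cluster
euclid_result <- tryCatch(hclust(dist(t(one_row)), method = "average"),
                          error = function(e) e)

# Correlation path: cor() needs a matrix-like x when y is missing
cor_result <- tryCatch(cor(one_row, method = "pearson"),
                       error = function(e) e)

size_small
sizes_fixed
```

With r starting at 0.7, every bootstrap sample has at least 2 rows, which is why the corrected call gets past both errors.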