CONFUSION MATRIX, R

I need a little help with the code below. I have to set up a loop that trains a neural network model on the TRAINING data with a different number of epochs each time, starting from 5 and adding 3 until I reach 20. Then I have to plot a line chart showing the accuracy for each number of epochs, keeping all the other parameters the same as shown. Much of the code was given by our instructor. I added epochs= c(5,8,11,14,17,20) to create a list of epochs, and error.rate = vector(), where I intend to store the accuracy from each loop. The accuracy I want comes from the confusion matrix and is obtained with
h2o.hit_ratio_table(<model>,train = TRUE)[1,2]
The problem is with the loop itself: I am trying to collect the result from each iteration. I have labeled the model fitted in the loop as X and tried to store its accuracy in the vector with error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2]).
However, it gives an error.
> Error in is(object, "H2OModelMetrics") : object 'X' not found
> In addition: Warning messages:
> 1: In 1:epochs : numerical expression has 6 elements: only the first used
Moreover, when I remove the error.rate=...... part, the function runs fine but there is no way to find the values of the accuracy.
I am a noob at R so a little help will be much appreciated.
s <- proc.time()
epochs= c(5,8,11,14,17,20)
error.rate = vector()
for (epoch in 1:epochs){#set up loop to go around 6 times
X=h2o.deeplearning(x = 2:785, # column numbers for predictors
y = 1, # column number for label
training_frame = train_h2o, # data in H2O format
activation = "RectifierWithDropout", # mathematical activation function
input_dropout_ratio = 0.2, # % of inputs dropout, because some inputs might not matter.
hidden_dropout_ratios = c(0.25,0.25,0.25,0.25), # % for nodes dropout, because maybe we don't need full connections. Improves generalisability
balance_classes = T, # over/under samples so that all classes are similar size.
hidden = c(50,50,50,50), # two layers of 100 nodes
momentum_stable = 0.99,
nesterov_accelerated_gradient = T,
error.rate=h2o.hit_ratio_table(x,train=TRUE)[1,2])
proc.time() - s}

You are looping with for(epoch in 1:epochs). The loop variable epoch changes on each iteration, and normally you would use it inside the loop, but I don't see it used anywhere. 1:epochs does not do what you think: it takes only the first element of epochs (5), so the loop is effectively for(epoch in 1:5), with epoch equal to 1, then 2, ... and then 5. You want something like for(epoch in epochs); if you DO want a sequence from 1 up to each epoch value, write 1:epoch inside the loop.
Also, X is overwritten on every iteration. You should initialize a container before the loop and store each result in it instead:
epochs= c(5,8,11,14,17,20)
X <- list() # save as list #option 1
y <- list() # for an option 2
for (epoch in epochs){ #set up loop to go around 6 times
X[[epoch]] = h2o.deeplearning(... )
# or NOW you can somehow use 1:epoch where each loop epoch changes
}
But I would really focus on the fact that epoch is never used inside your for loop, as far as I can see in your post. Perhaps find out where you want to use it (presumably as the epochs argument of h2o.deeplearning)...
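The pattern described above can be sketched without H2O. This is only an illustration under stated assumptions: train_and_score() is a hypothetical stand-in for the h2o.deeplearning() call followed by h2o.hit_ratio_table(<model>, train = TRUE)[1,2], and the number it returns is made up. It loops over positions with seq_along() so the result vector has exactly one slot per run.

```r
epochs <- seq(5, 20, by = 3)          # 5, 8, 11, 14, 17, 20
accuracy <- numeric(length(epochs))   # preallocate one slot per run

# Hypothetical stand-in for "train the model, then read its accuracy".
# It returns a made-up value that grows with the number of epochs.
train_and_score <- function(n_epochs) {
  1 - 1 / (1 + n_epochs)
}

for (i in seq_along(epochs)) {
  # In the real loop, epochs[i] would be handed to the training call here.
  accuracy[i] <- train_and_score(epochs[i])
}

accuracy
```

With the accuracies stored this way, plot(epochs, accuracy, type = "l") draws the line chart.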

Related

Simulation with N trials in R

I am trying to create a simulation where a number 0:100 is chosen by a person, then a random number 0:100 is generated using sample(). The difference between their chosen number and the random number is calculated and stored. I would like to use a for loop to run this 10000 times and store the results in a vector so I can later plot the results. Can anyone point me to where I can read about this or see some examples? Below is what I have so far but I keep getting errors saying the lengths aren't the same multiple.
N = 10000
chosen.number = 0:100
generated.number = sample(0:100, N, replace = T)
differences = numeric(0)
for(i in 1:length(chosen.number)){
differences = (generated.number - chosen.number)
}
Then I'll make a scatterplot of the vector differences.
Here's an example of how you could go about it (if I understand your questions correctly).
You can set how many loops you want using Repeat.
Since you want a different randomly generated number each time, you'll have to put sample() within your loop. I didn't know where your user-selected number would come from, but in this example, it gets randomly generated with the same set of criteria as the random selection.
Then differences are collected in collect_differences for you to use downstream.
Repeat = 10 # Number of times to repeat/loop
collect_differences <- NULL
for(i in 1:Repeat){
randomly.generated.number = sample(0:100, size = 1, replace = T)
selected.number = sample(0:100, size = 1, replace = T)
differences = randomly.generated.number - selected.number
collect_differences = c(collect_differences, differences)
}
collect_differences
As for resources, you can look up anything related to the fundamentals of looping. You could also look through The Carpentries lessons in R as they have some resources for this as well.
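If the chosen number stays the same across trials, the loop can also be replaced by a vectorized subtraction. A minimal sketch, assuming a single fixed choice (the value 37 is hypothetical, not from the question):

```r
set.seed(42)
N <- 10000
chosen.number <- 37   # hypothetical fixed choice between 0 and 100

# Draw all N random numbers at once, then subtract the single chosen number.
generated.number <- sample(0:100, N, replace = TRUE)
differences <- generated.number - chosen.number

length(differences)
```

The length mismatch in the question comes from subtracting a vector of 101 values (0:100) from a vector of 10000 values; subtracting one number avoids it.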

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on x components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
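The same apply-a-function-per-column idea can be tried in base R without mixOmics. In this sketch everything is a stand-in: truth and pred_mat are randomly generated, and ber() is a hand-written balanced error rate used in place of get.confusion_matrix() plus get.BER().

```r
set.seed(1)

# Hypothetical data: true labels and one column of predictions per
# number of components (random, so the error rates are meaningless).
truth <- factor(sample(c("absent", "present"), 50, replace = TRUE))
pred_mat <- sapply(1:10, function(k) {
  sample(c("absent", "present"), 50, replace = TRUE)
})

# Balanced error rate: mean of the per-class error rates.
ber <- function(truth, predicted) {
  tab <- table(truth, factor(predicted, levels = levels(truth)))
  mean(1 - diag(tab) / rowSums(tab))
}

# One value per number of components, from 10 down to 2.
ber_by_comp <- sapply(10:2, function(k) ber(truth, pred_mat[, k]))
names(ber_by_comp) <- paste0("comp", 10:2)
ber_by_comp
```

sapply() is used here instead of lapply() only because each call returns a single number, so the result simplifies to a named vector.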

ChoiceModelR, none=TRUE, "Error in betadraw[good, ] = newbeta[good, ]"

I am using ChoiceModelR for hierarchical multinomial logit. I want to get estimates for the utility of the outside good (which follows a normal distribution). The outside good has no covariates like the inside goods - e.g. it cannot have a price or brand dummy - , so I set list(none=TRUE) and do not add this no-choice to the X data (as described in the documentation of ChoiceModelR) but only to the y (choice) data.
The iterations start normally, then at some point it stops and says
"Error in betadraw[good, ] = newbeta[good, ] : NAs are not allowed in subscripted assignments".
This likely happens because in row 388 of the function "choicemodelr", the "good" subscript is NA.
I looked at some questions about choicemodelr (this,this,this), and also about NAs in subscript (this,this), but my guess is that my problem is specific to this function in the sense that probably some inputs in the iteration just get so large/small such that "good" will turn to be NA.
Below is a very simple example. I generate data with 3 products with varying attributes. In half of the periods product 3 is not offered. The 2000 consumers have normally distributed preferences over 3 attributes (and a preference for the outside good). A logit error is added to be consistent with the model. The outside good is indexed as product 4 (both when 3 and when 2 products were in the choice set).
How could I avoid the NA error? Am I doing something wrong, or is it a general bug in the function?
I also searched online for examples that set the option none=TRUE, but I did not find any reproducible one. Perhaps only this option is problematic, as there is no problem recovering the true parameters if I set none=FALSE and do not let customers choose the outside option.
So the code which results in the NA bug is the following:
library("ChoiceModelR")
library("MASS")
set.seed(36)
# Set demand pars
beta_mu = c(-3,4,1)
beta_sigma = diag(c(1,1,1))
alfa_mu = 5 #outside good mean utility
alfa_sigma = 2 #outside good sd
# Three/two products, 3 vars (2 continuous,1 dummy)
threeprod <- list()
twoprod <- list()
purchase <- list()
for (t in 1:1000){
threeprod[[t]] = cbind(rep(t,3),c(1,1,1),c(1,2,3),runif(3),runif(3),ceiling(runif(3,-0.5,0.5)))
purchase[[t]] = which.max(rbind(threeprod[[t]][,c(4,5,6)]%*%mvrnorm(1,beta_mu,beta_sigma) +
matrix( -log(-log(runif(3))), 3, 1),rnorm(1,alfa_mu,alfa_sigma)) )
threeprod[[t]] = cbind(threeprod[[t]],c(purchase[[t]],0,0))
}
for (t in 1001:2000){
twoprod[[t]] = cbind(rep(t,2),c(1,1),c(1,2),runif(2),runif(2),ceiling(runif(2,-0.5,0.5)))
purchase[[t]] = which.max(rbind(twoprod[[t]][,c(4,5,6)]%*%mvrnorm(1,beta_mu,beta_sigma) +
matrix( -log(-log(runif(2))), 2, 1),rnorm(1,alfa_mu,alfa_sigma)) )
if (purchase[[t]] == 3) {purchase[[t]] <- 4}
twoprod[[t]] = cbind(twoprod[[t]],c(purchase[[t]],0))
}
X <- rbind(do.call(rbind,threeprod),do.call(rbind,twoprod))
xcoding <- c(1,1,1)
mcmc = list(R = 5000, use = 2000)
options = list(none=TRUE, save=TRUE, keep=5)
out = choicemodelr(X, xcoding, mcmc = mcmc,options = options)
You have to sort the data by ID, Set, Alt; that solved the error (the same one you got). The rows have to be sorted by respondent ID, then by set number (question), then by alternative within each question.
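A minimal sketch of that sort on a toy matrix, assuming the first three columns are ID, Set and Alt as in the question's X (the values here are hypothetical). The last column follows the question's convention of storing the choice in the first row of each set and 0 elsewhere.

```r
# Hypothetical toy data: columns are ID, Set, Alt, one attribute, y.
X <- rbind(c(2, 1, 1, 0.3, 2),
           c(1, 1, 2, 0.7, 0),
           c(2, 1, 2, 0.9, 0),
           c(1, 1, 1, 0.1, 1))

# Order rows by ID, then Set, then Alt.
X_sorted <- X[order(X[, 1], X[, 2], X[, 3]), ]
X_sorted
```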

How to do top down forecasted proportions for hts objects with 2 levels?

I had previously asked this question trying to get top down forecasted proportions forecast recombination using the hts package. The solution there works great for multilevel hierarchies, however I have found I get an error when I try to use the solution on a two level hierarchy.
library(hts)
# Create the hierarchy
newhts <- hts(htseg1$bts, list(ncol(htseg1$bts)))
# forecast creation adapted from the `combinef()` example
h <- 12
ally <- aggts(newhts)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# Earo Wang's solution to my previous question
hts:::TdFp(allf, nodes = htseg1$nodes)
Error in *.default(fcasts[, 1L], prop) : time-series/vector length mismatch
The problem seems to arise because a two level hierarchy skips the last if conditional, which has the condition if (l.levels > 2L). The last statement of this conditional includes a piece where prop is multiplied by the time series flist[[k + 1L]], which converts prop into a time series matrix. When this statement is skipped, prop remains a regular matrix, causing the error when the time series vector fcasts[, 1L] is multiplied by the matrix prop.
I understand that TdFp is a non exported function and therefore may not be as robust as the other functions in the package, but is there any way around this problem? Since it is a relatively simple case, I can code a solution myself, but since hts::forecast.hts() can handle two level hierarchies for method = "tdfp", I thought there might be a nice clean solution.
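For a hierarchy with only a total and bottom-level series, the hand-coded route mentioned above is short. The sketch below is not the hts implementation: it uses made-up forecasts and plain matrices rather than ts objects, and it assumes the usual top-down forecasted proportions rule for two levels, where each bottom series gets the share of its own forecast in the sum of all bottom forecasts at that horizon.

```r
# Hypothetical forecasts for h = 4 steps: column 1 is the total,
# the remaining columns are the bottom-level series.
allf <- cbind(Total = c(100, 110, 120, 130),
              A     = c(30, 33, 35, 40),
              B     = c(50, 52, 60, 61),
              C     = c(25, 20, 30, 35))

bottom <- allf[, -1, drop = FALSE]

# Forecasted proportions: each row sums to 1.
prop <- bottom / rowSums(bottom)

# Distribute the total forecast across the bottom series.
tdfp <- allf[, 1] * prop
tdfp
```

By construction, the rows of tdfp add up to the total forecast at each horizon.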

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks, over a rolling 36 month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100 one by one on columns 101-116? Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for( i in window.len:ndates ) {
window.index = (i - window.len + 1) : i
fit = lm.constraint( hist.returns[window.index, -1], hist.returns[window.index, 1], constraints )
style.weights[i,] = fit$coefficients
style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on columns 101-116."
Yes! You can use a for loop, but there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random / toy dataset and using lm(), but you can sub in whatever regression function you want.
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
models[[i]] <-
lm(formula=x~., data=data.frame(x=dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
lapply(1:100,
function(i) lm(formula=x~.,
data=data.frame(x=dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]
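Once the models are in a list, a summary statistic can be pulled from each one in the same way. A minimal sketch on the same toy data, collecting R-squared per model (similar in spirit to style.r.squared in the question, though this uses plain lm() rather than the constrained regression):

```r
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116 * 20), ncol = 116))

# Fit one model per response column, as in the lapply version above.
models <- lapply(1:100, function(i) {
  lm(x ~ ., data = data.frame(x = dat[, i], dat[, 101:116]))
})

# One R-squared value per model, as a numeric vector.
r.squared <- sapply(models, function(m) summary(m)$r.squared)
head(r.squared)
```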
