ChoiceModelR, none=TRUE, "Error in betadraw[good, ] = newbeta[good, ]"

I am using ChoiceModelR for a hierarchical multinomial logit. I want to get estimates for the utility of the outside good (which follows a normal distribution). The outside good has no covariates, unlike the inside goods (e.g. it cannot have a price or brand dummy), so I set list(none=TRUE) and add this no-choice alternative only to the y (choice) data, not to the X data, as described in the ChoiceModelR documentation.
The iterations start normally, but at some point they stop with
"Error in betadraw[good, ] = newbeta[good, ] : NAs are not allowed in subscripted assignments".
This likely happens because at line 388 of the choicemodelr function the good subscript contains NA.
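For reference, the restriction that trips it up is in base R's subscripted assignment, not in the package itself; a minimal illustration:

# an NA in a row subscript makes the assignment fail with the same message
betadraw <- matrix(0, nrow = 3, ncol = 2)
good <- c(1, NA)
betadraw[good, ] <- 1
#> Error: NAs are not allowed in subscripted assignments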
I looked at some questions about choicemodelr (this, this, this) and also about NAs in subscripts (this, this), but my guess is that my problem is specific to this function: probably some inputs in the iteration become so large or small that good ends up NA.
Below is a very simple example. I generate data for 3 products with varying attributes. In half of the periods product 3 is not offered. The 2000 consumers have normally distributed preferences over 3 attributes (and a preference for the outside good). A logit error is added to be consistent with the model. The outside good is indexed as product 4 (both when 3 and when 2 products are in the choice set).
How could I avoid the NA error? Am I doing something wrong, or is it a general bug in the function?
I also searched online for examples setting the option none=TRUE, but did not find a reproducible one. Perhaps this option alone is the problematic part: there is no problem recovering the true parameters if I set none=FALSE and do not let customers choose the outside option.
So the code which results in the NA bug is the following:
library("ChoiceModelR")
library("MASS")
set.seed(36)
# Set demand parameters
beta_mu = c(-3, 4, 1)
beta_sigma = diag(c(1, 1, 1))
alfa_mu = 5    # outside good mean utility
alfa_sigma = 2 # outside good sd

# Three/two products, 3 vars (2 continuous, 1 dummy)
threeprod <- list()
twoprod <- list()
purchase <- list()
for (t in 1:1000) {
  threeprod[[t]] = cbind(rep(t, 3), c(1, 1, 1), c(1, 2, 3),
                         runif(3), runif(3), ceiling(runif(3, -0.5, 0.5)))
  purchase[[t]] = which.max(rbind(threeprod[[t]][, c(4, 5, 6)] %*% mvrnorm(1, beta_mu, beta_sigma) +
                                    matrix(-log(-log(runif(3))), 3, 1),
                                  rnorm(1, alfa_mu, alfa_sigma)))
  threeprod[[t]] = cbind(threeprod[[t]], c(purchase[[t]], 0, 0))
}
for (t in 1001:2000) {
  twoprod[[t]] = cbind(rep(t, 2), c(1, 1), c(1, 2),
                       runif(2), runif(2), ceiling(runif(2, -0.5, 0.5)))
  purchase[[t]] = which.max(rbind(twoprod[[t]][, c(4, 5, 6)] %*% mvrnorm(1, beta_mu, beta_sigma) +
                                    matrix(-log(-log(runif(2))), 2, 1),
                                  rnorm(1, alfa_mu, alfa_sigma)))
  if (purchase[[t]] == 3) {purchase[[t]] <- 4}
  twoprod[[t]] = cbind(twoprod[[t]], c(purchase[[t]], 0))
}
X <- rbind(do.call(rbind, threeprod), do.call(rbind, twoprod))
xcoding <- c(1, 1, 1)
mcmc = list(R = 5000, use = 2000)
options = list(none = TRUE, save = TRUE, keep = 5)
out = choicemodelr(X, xcoding, mcmc = mcmc, options = options)

You have to sort them by ID, Set, Alt; that solved the error (the same one you got). The rows must be sorted by respondent ID, then by set (question) number, then by alternative within a given question.
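Assuming the first three columns of X are the respondent ID, set, and alternative (as in the example above), the sort is one line:

# reorder rows: respondent ID, then set, then alternative within set
X <- X[order(X[, 1], X[, 2], X[, 3]), ]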

Related

Mclust() - NAs in model selection

I recently tried to fit a GMM in R on a multivariate matrix (400 obs of 196 variables) whose elements belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were correctly classified, whereas k-means reaches more than 90%).
Here is my code:
library(mclust)
X <- read.csv("X.csv", sep = ",", header = TRUE)
y <- read.csv("y.csv", sep = ",")[, 1] # labels as a vector; assumes the label is the first column
gmm <- Mclust(X, G = 5) # I want 5 clusters
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
nclusters <- 5 # number of clusters, matching G
for (k in 1:nclusters) {
  ii = which(cl_gmm == k)  # individuals of group k
  counts = table(y[ii])    # number of occurrences of each label
  imax = which.max(counts) # majority label
  maj_lab = attributes(counts)$dimnames[[1]][imax]
  print(paste("Group ", k, ", majority label = ", maj_lab))
  cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y, cl_gmm_lab) # confusion matrix
The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is NA when looking at gmm$BIC.
So far I have not found any solution to this problem... are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every model when too many explanatory variables are involved. The right thing to do is to reduce the dimensions first, choosing a number of dimensions that makes it possible to properly fit GMMs while preserving as much information as possible about the data.
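For instance, a minimal sketch with a PCA step first (the 10 retained components are purely illustrative; pick whatever preserves enough variance for your data):

library(mclust)
pc <- prcomp(X, center = TRUE, scale. = TRUE) # compress the 196 variables
ndim <- 10                                    # illustrative choice
gmm <- Mclust(pc$x[, 1:ndim], G = 5)          # non-EII models now have a chance
summary(gmm$BIC)                              # check that BIC is no longer NA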

R function logLik() returning -Inf?

I am simulating an SIR model in R and have a data set I am trying to fit accurately with the model. I am currently using the particle filter function, and would then like to use the corresponding logLik method on the result. When I do this, I get [1] -Inf. I can't find in the documentation why this happens or how to avoid it. Are my parameters for the model not accurate enough? Is there something else wrong?
My function looks like this:
SIRsim %>%
  pfilter(Np = 5000) -> pf
logLik(pf)
The code comes from an online course lesson entitled Likelihood for POMPs (https://kingaa.github.io/sbied/pfilter/); below is the R script for the lesson. However, the code works there... I'm not sure how to reproduce my specific problem with it, and unfortunately I cannot share the dataset or code I am using because it is for academic research.
library(tidyverse)
library(pomp)
options(stringsAsFactors = FALSE)
stopifnot(packageVersion("pomp") >= "3.0")
set.seed(1350254336)

sir_step <- Csnippet("
  double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

sir_init <- Csnippet("
  S = nearbyint(eta*N);
  I = 1;
  R = nearbyint((1-eta)*N);
  H = 0;
")

dmeas <- Csnippet("
  lik = dbinom(reports,H,rho,give_log);
")

rmeas <- Csnippet("
  reports = rbinom(H,rho);
")

read_csv("https://kingaa.github.io/sbied/pfilter/Measles_Consett_1948.csv") %>%
  select(week, reports = cases) %>%
  filter(week <= 42) %>%
  pomp(
    times = "week", t0 = 0,
    rprocess = euler(sir_step, delta.t = 1/7),
    rinit = sir_init,
    rmeasure = rmeas,
    dmeasure = dmeas,
    accumvars = "H",
    statenames = c("S", "I", "R", "H"),
    paramnames = c("Beta", "mu_IR", "eta", "rho", "N"),
    params = c(Beta = 15, mu_IR = 0.5, rho = 0.5, eta = 0.06, N = 38000)
  ) -> measSIR

measSIR %>%
  pfilter(Np = 5000) -> pf
logLik(pf)

library(doParallel)
library(doRNG)
registerDoParallel()
registerDoRNG(652643293)
foreach (i = 1:10, .combine = c) %dopar% {
  measSIR %>% pfilter(Np = 5000)
} -> pf
logLik(pf) -> ll
logmeanexp(ll, se = TRUE)
If I set Beta=100 in the code above I can get a negative-infinite log-likelihood.
Replacing the measurement-error snippet with this:
dmeas <- Csnippet("
double ll = dbinom(reports,H,rho,give_log);
lik = (!isfinite(ll) ? -1000 : ll );
")
appears to 'solve' the problem, although you should be a little bit careful; papering over numerical cracks like this is sometimes OK, but could conceivably come back to bite you in some way later on. If you just need to avoid non-finite values long enough to get into a reasonable parameter range this might be OK ...
Some guesses as to why this is happening:
you are somehow getting an "impossible" situation like a positive number of reported cases when the underlying true number of infections is zero.
Sometimes non-finite log-likelihoods occur when a very small positive probability underflows to zero. The equivalent here is likely that the probability of infection 1-exp(-Beta*I/N*dt) goes to 1.0; then any observed outcome where less than 100% of the population is infected is impossible.
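The saturation half of this is easy to check at the R prompt: once the success probability rounds to exactly 1, any outcome short of all trials succeeding has probability 0:

1 - exp(-40)                               # exactly 1 in double precision
dbinom(5, size = 10, prob = 1)             # 0: fewer than 10 successes is impossible
dbinom(5, size = 10, prob = 1, log = TRUE) # -Inf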
You can try to diagnose the situation by seeing what the filtered trajectory actually looks like and comparing it with the data, or by adding debugging statements to the code. If there's a way to run just the deterministic simulation with your parameter values that might tell you pretty quickly what's going wrong.
An easier/more direct way to debug would be to replace the Csnippet you're using for dmeas with an R function: this will be slower but easier to work with (especially if you're not familiar with C coding). If you uncomment the browser() statement below, the code will drop into debug mode when you encounter the bad situation ...
dmeas <- function(reports, H, rho, log, ...) {
  lik <- dbinom(reports, size = H, prob = rho, log = log)
  if (!is.finite(lik)) {
    lik <- -1000
    ## browser()
  }
  return(lik)
}
For example, the browser drops in with these argument values:
(t = 3, reports = 2, S = 2280, I = 0, R = 35721, H = 0, Beta = 100, mu_IR = 0.5, rho = 0.5, eta = 0.06, N = 38000, log = TRUE)
Browse[1]> debug at /tmp/SO65554258.R!ZlSILG#7: return(lik)
Browse[2]> reports
[1] 2
Browse[2]> H
[1] 0
Browse[2]> rho
[1] 0.5
This shows that the problem is indeed that you have a positive number of reported cases when there have been zero infections: R is trying to compute the binomial probability of observing reports cases when there are H infections that are potentially reportable, each reported with probability rho. When the number of trials N in a binomial distribution Binom(N, p) is zero, the only possible outcome is zero 'successes' (reported cases), with probability 1; every other outcome has probability 0 (and log-probability -Inf).
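You can confirm this in one line:

dbinom(2, size = 0, prob = 0.5)             # 0: two reports from zero infections
dbinom(2, size = 0, prob = 0.5, log = TRUE) # -Inf, which poisons the whole log-likelihood
dbinom(0, size = 0, prob = 0.5)             # 1: the only possible outcome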

CONFUSION MATRIX, R

I need a little help with the code below. I have to set up a loop that trains a neural network model on the TRAINING data with a different number of epochs each time, starting from 5 and adding 3 until I reach 20. Then I have to produce a line chart showing the accuracy for the differing numbers of epochs, keeping all other parameters the same as shown. Much of the code is what was given by our instructor. I added epochs = c(5,8,11,14,17,20) to create a vector of epoch counts, and error.rate = vector() where I intend to store the accuracy from each iteration. The accuracy I want is from the confusion matrix and is given by
h2o.hit_ratio_table(<model>, train = TRUE)[1,2]
The problem is in the loop itself. I labeled the first part of the loop X to try to put the accuracy of each iteration into the vector via error.rate = h2o.hit_ratio_table(x, train = TRUE)[1,2].
However, it gives an error:
> Error in is(object, "H2OModelMetrics") : object 'X' not found In
> addition: Warning messages: 1: In 1:epochs : numerical expression has
> 6 elements: only the first used
Moreover, when I remove the error.rate = ... part, the function runs fine, but then there is no way to get at the accuracy values.
I am a noob at R, so a little help would be much appreciated.
s <- proc.time()
epochs = c(5, 8, 11, 14, 17, 20)
error.rate = vector()
for (epoch in 1:epochs) { # set up loop to go around 6 times
  X = h2o.deeplearning(x = 2:785, # column numbers for predictors
                       y = 1, # column number for label
                       training_frame = train_h2o, # data in H2O format
                       activation = "RectifierWithDropout", # mathematical activation function
                       input_dropout_ratio = 0.2, # % of inputs dropout, because some inputs might not matter
                       hidden_dropout_ratios = c(0.25, 0.25, 0.25, 0.25), # % for node dropout; improves generalisability
                       balance_classes = T, # over/under-samples so that all classes are of similar size
                       hidden = c(50, 50, 50, 50), # four layers of 50 nodes
                       momentum_stable = 0.99,
                       nesterov_accelerated_gradient = T,
                       error.rate = h2o.hit_ratio_table(x, train = TRUE)[1, 2])
  proc.time() - s
}
You are doing for(epoch in 1:epochs). Here the epoch term changes on each pass (and usually you would use it within the loop, but I don't see that here). 1:epochs will not work as you think it should: it takes the first element of epochs (5) and effectively says for(epoch in 1:5), where epoch is 1, then 2, ..., then 5. You want something like for(epoch in epochs), and if you DO want a sequence 1:epoch somewhere, build it within the loop.
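You can see the truncation (and the warning from your error message) directly:

epochs <- c(5, 8, 11, 14, 17, 20)
1:epochs
#> Warning message: numerical expression has 6 elements: only the first used
#> [1] 1 2 3 4 5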
Also, X is overwritten on each pass through the loop. You should initialize it as a list and save each fitted model into it:
epochs = c(5, 8, 11, 14, 17, 20)
X <- list() # save the models as a list
for (epoch in epochs) { # set up loop to go around 6 times
  X[[as.character(epoch)]] = h2o.deeplearning( ... ) # indexing by name avoids NULL gaps in the list
  # or NOW you can somehow use 1:epoch, where epoch changes each loop
}
But I would really focus on the fact that epoch is never actually used inside the loop you posted. Figure out where you want to use it; a sketch of the whole thing follows.
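Putting the pieces together, here is a rough sketch (keeping your model settings, and assuming the assignment wants each value passed to h2o.deeplearning's epochs argument):

epochs <- c(5, 8, 11, 14, 17, 20)
error.rate <- numeric(length(epochs)) # one accuracy per epoch setting
for (j in seq_along(epochs)) {
  model <- h2o.deeplearning(x = 2:785, y = 1,
                            training_frame = train_h2o,
                            activation = "RectifierWithDropout",
                            input_dropout_ratio = 0.2,
                            hidden_dropout_ratios = c(0.25, 0.25, 0.25, 0.25),
                            balance_classes = TRUE,
                            hidden = c(50, 50, 50, 50),
                            momentum_stable = 0.99,
                            nesterov_accelerated_gradient = TRUE,
                            epochs = epochs[j]) # the only thing that varies
  error.rate[j] <- h2o.hit_ratio_table(model, train = TRUE)[1, 2]
}
plot(epochs, error.rate, type = "b", xlab = "epochs", ylab = "accuracy") # the required line chart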

How to solve errors in the frbs package of R using the GFS.GCCL method?

I'm using the frbs package in R on my data set with 5-fold stratified cross-validation, which I've implemented myself. I use the GFS.GCCL method for the frbs.learn function in each fold and predict the result using the test data. I get this error, as well as 30 identical warning messages:
Error: object 'temp.rule.degree' not found
Warning: In max(MF.temp[m, ], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
My code is below:
library(frbs)
data <- read.csv(file.address)
data[, 30] <- unclass(data[, 30]) # column 30 holds the class of the samples
data <- data[, c(1, 14, 20, 26, 27, 30)] # I choose 5 attributes since my data is high-dimensional
k <- 5 # 5-fold
seed <- 1
folds <- strf.cv(data, k, seed) # my stratification function for CV
range.data.inp <- matrix(apply(data[, -ncol(data)], 2, range), nrow = 2)
data <- norm.data(as.matrix(data[, -ncol(data)]), range.data.inp,
                  min.scale = 0.1, max.scale = 1)
ctrl <- list(popu.size = 30, num.class = 2, num.labels = 3,
             persen_cross = 0.9, max.gen = 200, persen_mutant = 0.3,
             name = "sim-1")
for (i in 1:k) {
  str <- paste("fold", i)
  print(str)
  test.ind <- folds[[str]]
  test.data <- data[test.ind, ]
  train.data <- data[-test.ind, ]
  obj <- frbs.learn(train.data, method.type = "GFS.GCCL",
                    range.data.inp, ctrl)
  pred <- predict(obj, test.data)
  print("Predicted classes:")
  print(pred)
}
I don't have any idea what causes the error and warnings. Please let me know what I should do.
I've had a similar problem (and others) trying to reproduce the SLAVE learning starting from the iris example data. I had two format issues to solve before being able to run this on my artificial data:
my data frame import was giving me integers, where the learner needs at least numerics (see the one-liner below);
my distribution of criteria was not flat. When I flattened the distribution (3 values, so n/3 samples per value), everything went fine.
That's all I know.
Hope it helps.
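For the first item, the coercion can be a one-liner before calling frbs.learn() (a sketch; apply it to whichever columns your import left as integer):

data[] <- lapply(data, as.numeric) # make every column numeric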
I encountered the same issue when running SLAVE and GFS.GCCL, and looked at the source code of the library. In frbs.learn(), each method has its own code to handle the range of the input data, so I think it might be a problem with the range of the input data. For example, in the source code of GFS.GCCL, the parameter setup looks like this:
range.data.input <- range.data
data.train.ori <- data.train
popu.size <- control$popu.size
persen_cross <- control$persen_cross
persen_mutant <- control$persen_mutant
max.gen <- control$max.gen
name <- control$name
n.labels <- control$num.labels
n.class <- control$num.class
num.labels <- matrix(rep(n.labels, ncol(range.data)), nrow = 1)
num.labels <- cbind(num.labels, n.class)
## normalize range of data and data training
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
range.data.input.ori <- range.data.input
data.tra.norm <- norm.data(data.train[, 1 : ncol(data.train) - 1], range.data.input, min.scale = 0, max.scale = 1)
data.train <- cbind(data.tra.norm, matrix(data.train[, ncol(data.train)], ncol = 1))
In the first line, range.data comes either from your specification or from the default setting of frbs.learn(). For the default, it takes the min and max of each input variable; in the source code:
range.data <- rbind(dt.min, dt.max)
After that, the range of the data used by GFS.GCCL is
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
which is between 0 and 1. GFS.GCCL also takes range.data.input itself as a parameter, so it works with both range.data.norm and range.data.input.
Therefore, I think some internal calculation expects range.data.input to hold the min and max of each input variable; when what is passed is not actually the min and max, the error is generated.
But, in summary: after I removed "range.data" from the frbs.learn() call, both GFS.GCCL and SLAVE worked for me.
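In terms of the question's loop, that means dropping the positional range argument (a sketch, reusing the question's objects):

# let frbs.learn() derive the range from the training data itself
obj <- frbs.learn(train.data, method.type = "GFS.GCCL", control = ctrl)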
You can download the source code from here:
https://cran.r-project.org/web/packages/frbs/index.html
You can find the code for GFS.GCCL and SLAVE in:
FRBS.MainFunction.R
GFS.Methods.R
In addition to @Pilip38's good advice, I have three other ideas that have fixed similar errors for me while working with the frbs package; sketches of all three follow below.
Most important: make sure your output variable is never equal to 0. It looks like you have a binary output variable, so I am hoping that just adding 1 to it, making it 1/2 instead of 0/1, will work.
Try setting your range.data.inp matrix to all 0's in the first row and all 1's in the second. Naturally it's better to have a tighter range, but the loose one may avoid your bug.
Try decreasing the number of labels to 2.
It can be a brittle procedure.
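In terms of the question's code, the three suggestions might look roughly like this (a sketch; adapt the column indices to your data):

data[, ncol(data)] <- data[, ncol(data)] + 1    # classes become 1/2 instead of 0/1
range.data.inp <- rbind(rep(0, ncol(data) - 1), # loose [0, 1] input range
                        rep(1, ncol(data) - 1))
ctrl$num.labels <- 2                            # fewer linguistic labels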

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks over a rolling 36-month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100 one by one on columns 101-116? This also means printing 100 different plots, one for each asset. I am new to R and have been stuck on this for several days.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA

# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)

# main loop
for (i in window.len:ndates) {
  window.index = (i - window.len + 1):i
  fit = lm.constraint(hist.returns[window.index, -1], hist.returns[window.index, 1], constraints)
  style.weights[i, ] = fit$coefficients
  style.r.squared[i, ] = fit$r.squared
}

# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but you there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random / toy dataset and using lm(), but you can sub in whatever regression function you want
# data frame of 116 cols and 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116 * 20), ncol = 116))

# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
  models[[i]] <-
    lm(formula = x ~ ., data = data.frame(x = dat[, i], dat[, 101:116]))
}

# with lapply
models2 <-
  lapply(1:100,
         function(i) lm(formula = x ~ .,
                        data = data.frame(x = dat[, i], dat[, 101:116])))

# compare: they give the same results!
all.equal(models, models2)

# to access a single model, use [[#]]
models2[[1]]
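And since the original question also needs a result per asset, note that you can pull any statistic out of all 100 fits at once, e.g.:

# extract the R-squared of every model in one go
r2 <- sapply(models2, function(m) summary(m)$r.squared)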
