Resampling from multiple strata using boot() - r

I'm attempting to bootstrap a zero-inflated Poisson (ZIP) estimation while resampling from within specific populations. Each of the populations (clusters) is fundamentally different in some way, so I would like to represent them proportionally in the bootstrapping. The strata argument of boot() will do that.
I sometimes encounter the following error:
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 2.02001e-16
Here's a way to replicate the problem, and it should only take about a minute or so to run, depending on your computer:
#Load dependencies
library(AER)
library(boot)
library(pscl)
library(sampling)
#generate some fake data. The seed makes it replicable.
set.seed(1)
x1<-rpois(1000,1)
set.seed(1)
x2<-rnorm(1000,0,1)
set.seed(1)
e<-round(runif(1000,0,1)) #this should add some disruptions and prevent any multicollinearity.
pop<-rep(1:10,length.out=1000) #there are 10 populations
y<-x1*abs(floor(x2*sqrt(pop)))+e #the populations each impact the y variable somewhat differently
fake_data<-as.data.frame(cbind(y,x1,x2,pop))
fake_data$pop<-factor(pop) #they are not actually simple scalars.
#Run the ZIP process and confirm it works. I understand it's not a matching model.
system.time(zip<-zeroinfl(y ~ x1+x2+pop | x1+x2+pop, data=fake_data))
#storing estimates to speed up bootstrapping phase. General technique from http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm
count_hold<-as.data.frame(dput(coef(zip, "count")))
count_short<-c(count_hold[,1])
zero_hold<-as.data.frame(dput(coef(zip, "zero")))
zero_short<-c(zero_hold[,1])
#bootstrapping
f <- function(fake_data, i) {
zip_boot<- zeroinfl(y ~ x1+x2+pop | x1+x2+pop, data=fake_data[i,], start=list(count=count_short, zero=zero_short))
return(coef(zip_boot))
} #defines function for R to repeat in bootstrapping phase.
set.seed(1)
system.time(res <- boot(fake_data, f, R=50, strata=fake_data$pop)) #add parallel= and ncpus= arguments to boot() to use more cores.
There ought to be enough samples: I have 900+ degrees of freedom and at least 100 observations in each population to draw my resampling estimates from.
My questions:
1) What did I do that is causing this multicollinearity?

Related

No convergence for hard competitive learning clustering (flexclust package)

I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighted clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response to a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control@iter.max <- 500 ### 500 iterations
fc_control@verbose <- 1 # this sets verbose output
fc_control@tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestion. Should I increase the number of iterations? Is this an indicator that something else is not going well, or did I make a mistake with the R commands?
Thanks in advance.
Two things answer my question (along with a comment on weighting variables for k-means, or rather for hard competitive learning):
The weights are for observations (= rows of x), not variables (= columns of x), so using hardcl to weight variables is wrong.
In hardcl or neural gas you need many more iterations than in standard k-means: one k-means iteration uses the complete data set to update the centroids, whereas hard competitive learning uses only a single observation per iteration. Compared to k-means, multiply the number of iterations by your sample size (see the sketch below).
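A minimal sketch of that advice, assuming df is your data frame and treating the original 500 iterations as a per-observation budget (an illustrative choice, not a recommendation):
fc_control <- new("flexclustControl")
fc_control@iter.max <- 500 * nrow(df) #hard competitive learning updates one observation per iteration
fc_control@tolerance <- 0.01
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=7:20, nrep=100,
FUN=cclust, dist="euclidean", method="hardcl",
control=fc_control, multicore=TRUE)
Note that the weights= argument is dropped here, since in cclust the weights apply to observations rather than variables.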

Finding the best LCA model in poLCA R package

I am applying LCA with the poLCA R package, but the analysis has not finished after three days (it has not found the best model yet), and occasionally it gives the following error: "ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND". So I cancelled the process at 35 latent classes. I am analyzing 16 variables (all of them categorical) and 36036 rows of data. When I tested the variable importance of the 16 variables with the Boruta package, all 16 came out as important, so I used all 16 variables in the LCA with poLCA. Which path should I follow? Should I use another clustering method, such as k-modes, for clustering the categorical variables in this dataset? I use 500 iterations and nrep=10 model estimation runs. The R script I use to find the best model in LCA and one of the outputs are as follows:
min_bic <- Inf #initialise before the loop; f and data are defined elsewhere
for(i in 2:50){
lc <- poLCA(f, data, nclass=i, maxiter=500,
tol=1e-5, na.rm=FALSE,
nrep=10, verbose=TRUE, calc.se=TRUE)
if(lc$bic < min_bic){
min_bic <- lc$bic
LCA_best_model<-lc
}
}
=========================================================
Fit for 35 latent classes:
=========================================================
number of observations: 36036
number of estimated parameters: 2029
residual degrees of freedom: 34007
maximum log-likelihood: -482547.1
AIC(35): 969152.2
BIC(35): 986383
G^2(35): 233626.8 (Likelihood ratio/deviance statistic)
X^2(35): 906572555 (Chi-square goodness of fit)
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
The script you are using sequentially tests every model from 2 to 50 classes and keeps the one with the lowest BIC. BIC is not the only, or necessarily the best, way to select "the best" model, but fair enough.
The problem is that you are estimating a LOT of parameters, especially in the later steps. The more classes you fit, the more time-consuming the process is. Also, in these cases convergence problems are to be expected because you are fitting so many classes. That's what the error message reports: it can't find the maximum likelihood for a model with 35 classes.
I don't know what problem you are trying to solve, but models with over 10 classes are unusual in LCA. You do LCA to reduce the complexity of your data as much as possible. If you NEED to fit models with many (over 10) classes:
fit them one by one, so RAM consumption will be less of a problem.
increase the nrep= argument in the call, so the probability of the model not finding the maximum likelihood by chance (bad random initial values) is reduced; this also increases computing time (see the sketch below).
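A minimal sketch of those two points combined, assuming f and data are your formula and data frame (the class count and nrep value are purely illustrative):
lc20 <- poLCA(f, data, nclass=20, maxiter=1000,
nrep=50, #many more random starts than nrep=10
verbose=TRUE, calc.se=FALSE) #skip standard errors while exploring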
Alternatively, you can reduce computing time by running models in parallel. Almost every modern PC has 2 or more cores. The function acl() in the next block does this with foreach() and %dopar%, so it is OS independent.
library(poLCA)
library(foreach)
library(doParallel)
registerDoParallel(cores=2) #as many physical cores as available.
acl <- function(datos, #a data.frame with your data
k, #the maximum number of classes to fit
formula) {
foreach(i=1:k, .packages="poLCA") %dopar% poLCA(formula, datos, nclass=i)
}
acl() returns a list of models, so you can pick "the best" one later. The next function will retrieve the quantities of interest from the list and create a nicely formatted data.frame with useful information to select the right number of classes.
comparar_clases_acl <- function(modelo) {
entropy<-function (p) sum(-p*log(p)) #to assess the quality of classification
tabla_LCA <- data.frame(Modelo=0, BIC=0, Lik_ratio=0, Entropia=0, MenorClase=0) #empty data.frame to preallocate memory.
for(i in 1:length(modelo)){
tabla_LCA [i,1] <- paste("Modelo", i)
tabla_LCA [i,2] <- modelo[[i]]$bic
tabla_LCA [i,3] <- modelo[[i]]$Gsq
error_prior <- entropy(modelo[[i]]$P)
error_post <- mean(apply(modelo[[i]]$posterior,1, entropy),na.rm = TRUE)
tabla_LCA [i,4]<-round(((error_prior-error_post) / error_prior),3)
tabla_LCA [i,5] <- min(modelo[[i]]$P)*100
}
return(tabla_LCA)
}
It takes only one argument: an object with a list of LCA models, exactly what acl() returns.
This parallel approach should reduce computing time. Still, 50 classes are too many, and you are probably getting the smallest BIC well before 50 classes. Remember, BIC penalizes models as the number of estimated parameters increases, helping you find the point of diminishing returns of an extra class in your model.
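A hedged usage sketch tying the two functions together, assuming f and datos are your formula and data frame:
modelos <- acl(datos, k=10, formula=f) #fits models with 1 to 10 classes in parallel
tabla <- comparar_clases_acl(modelos) #BIC, likelihood ratio, entropy, smallest class share
tabla[which.min(tabla$BIC), ] #candidate "best" model by BIC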

number of trees in h2o.gbm

In traditional gbm, we can use
predict.gbm(model, newdata=..., n.trees=...)
so that I can compare results with different numbers of trees on the test data.
In h2o, although I can pass n.tree to h2o.predict, it seems to have no effect on the result. It's all the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can I solve it? h2o.gbm is much faster than gbm, so if I could get the detailed result for each number of trees, that would be great.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
validation_frame = valid,
ntrees = 100, #Max desired
score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
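A minimal sketch of switching early stopping on, using the same iris split as above (the particular stopping values are illustrative assumptions):
m_es <- h2o.gbm(1:4, 5, train,
validation_frame = valid,
ntrees = 1000, #generous upper bound
score_tree_interval = 1,
stopping_rounds = 5, #stop once 5 scoring rounds show no improvement
stopping_tolerance = 1e-4)
h2o.scoreHistory(m_es) #shows where training actually stopped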
As of 3.20.0.6 H2O does support this. The method you are looking for is staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when the response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2, etc.), assuming that's what you're after; a rough sketch follows the API examples below.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)
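As a rough sketch of turning those staged predictions into a per-tree metric, assuming a regression model, a hypothetical numeric response column y in the test frame, and that each returned column holds the cumulative prediction after one more tree (these are assumptions, not documented behaviour):
staged <- as.data.frame(h2o.staged_predict_proba(model, prostate.test))
actual <- as.data.frame(prostate.test)$y #hypothetical response column "y"
r2_per_tree <- sapply(staged, function(pred)
1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2))
plot(r2_per_tree, type="b", xlab="number of trees", ylab="R^2")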

how to resample and compare the results when I just want to predict the last row of the data using surv. functions in the mlr package, R?

I have just started trying the R package mlr, and I am wondering if I can customize the training set and test set. For example, all the data of a time sequence are the training set except for the last row, and that last row is the test set.
Here is my example:
library(mlr)
library(survival)
library(dplyr) #needed for %>% and select()
data(lung)
myData2 <- lung %>%
select(time,status,age)
myData2$status = (myData2$status == 2)
myTrain <- c(1:(nrow(myData2)-1))
myTest <- nrow(myData2)
The lung data are from the survival package. I just use three columns: time, status and age. Now, let's suppose they do not mean the patients' ages and how long they survived. Let's say this is the ink purchase history of one customer.
age=74 means this customer bought 74 bottles of ink on that day, and time=306 means the customer ran out of ink after 306 days. So I want to build a survival model using all the data except the last row. Then, when I have the data of the last row (age=58, meaning the customer bought 58 bottles of ink on that day), I can make a prediction on time; a number close to 177 would be a good estimate. So my training set and test set are fixed and do not need to be resampled.
In addition, I need to change the hyperparameters for a comparison. Here is my code:
surv.task <- makeSurvTask(data=myData2,target=c('time','status'))
surv.lrn <- makeLearner("surv.cforest")
ps <- makeParamSet(
makeDiscreteParam('mincriterion',values=c(1.281552,2,3)),
makeDiscreteParam('ntree',values=c(100,200,300))
)
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc('Holdout',split=1,predict='train')
lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=rdesc,par.set=ps,
measures = list(setAggregation(cindex,train.mean)))
mod <- train(learner=lrn,task=surv.task,subset=myTrain)
surv.pred <- predict(mod,task=surv.task,subset=myTest)
surv.pred
You can see that I use split=1 in makeResampleDesc because I have a fixed training set which does not need to be resampled. The measures argument in makeTuneWrapper is currently not meaningful to me, as I need to customize my own measures. Because of the fixed data split, I cannot use functions like resample or tuneParams to get an evaluation on the test data when using different hyperparameters.
So, my question is: when the training set and test set are fixed, can mlr provide a comprehensive comparison for every hyperparameter setting? If so, how do I do it?
Incidentally, it looks like there is a function makeFixedHoldoutInstance which might do this; I just do not know how to use it. For example, when I use makeFixedHoldoutInstance in this way, I get the following error:
> f <- makeFixedHoldoutInstance(train.inds=myTrain,test.inds=myTest,size=length(myTrain)+1)
> lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=f,par.set=ps)
> resample(learner=lrn,task=surv.task,resampling=f)
[Resample] holdout iter 1: [Tune] Started tuning learner surv.cforest for parameter set:
Type len Def Constr Req Tunable Trafo
mincriterion discrete - - 1.281552,2,3 - TRUE -
ntree discrete - - 100,200,300 - TRUE -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mincriterion=1.281552; ntree=100
Error in resample.fun(learner2, task, resampling, measures = measures, :
Size of data set: 227 and resampling instance: 228 differ!
With makeFixedHoldoutInstance you get the resampling you asked for.
But you can not use the same fixed resampling indices for the tuning inside the tuning wrapper and the resampling.
This is because resample will first split the data according to the fixed holdout instance f. The tuning inside the tuning wrapper then also needs a resampling method to calculate the performance for a given configuration. As the tuning only sees the data after the split done by resample, it cannot apply the same fixed resampling (a sketch follows below).
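A minimal sketch of that separation, if you still want the wrapper: use an ordinary inner resampling (e.g. mlr's predefined cv2) for the tuning, and keep the fixed holdout instance f only for the outer resample():
lrn2 <- makeTuneWrapper(surv.lrn, control=ctrl, resampling=cv2, par.set=ps)
res <- resample(learner=lrn2, task=surv.task, resampling=f)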
From reading your question, I guess you don't want to use the tune wrapper but want to tune your learner directly. So you should simply use tuneParams:
tr = tuneParams(learner = surv.lrn, task = surv.task, resampling = cv2, par.set = ps, control = ctrl)
Note: this does not work on the given example because the cindex needs at least one uncensored observation, and even then it does not make much sense because the cindex is only meaningful for a bigger test set.

Using Kolmogorov Smirnov Test in R

I designed 3000 experiments, so that in each experiment there are 4 groups (treatments), and in each group there are 50 individuals (subjects). For each experiment I run a standard one-way ANOVA and check whether the p-values follow a uniform distribution under the null hypothesis, but ks.test rejects this assumption and I can't see why.
library(mvtnorm) #needed for rmvnorm()
subject<-50
treatment<-4
experiment<-list()
R<-3000
seed<-split(1:(R*subject),1:R)
for(i in 1:R){
e<-c()
for(j in 1:subject){
set.seed(seed[[i]][j])
e<-c(e,rmvnorm(mean=rep(0,treatment),sigma=diag(3,4),n=1,method="chol"))
}
experiment<-c(experiment,list(matrix(e,subject,treatment,byrow=T)))
}
p.values<-c()
for(e in experiment){
d<-data.frame(response=c(e),treatment=factor(rep(1:treatment,each=subject)))
p.values<-c(p.values,anova(lm(response~treatment,d))[1,"Pr(>F)"])
}
ks.test(p.values, punif,alternative = "two.sided")
I commented out the lines in your code that change the random seed, and got a P-value of .34. That was with an unknown seed, so for reproducibility, I did set.seed(1) and ran it again. This time, I got a P-value of 0.98.
As to why this makes a difference, I'm not an expert in PRNGs, but any decent generator will ensure successive draws are statistically independent for all practical purposes. The best ones will ensure the same for greater lags, eg the Mersenne Twister which is R's default PRNG guarantees it for lags up to 623 (IIRC). In fact, meddling with the seed is likely to impair the statistical properties of the draws.
Your code is also doing things in a really inefficient way. You're creating a list for the experiments, and adding one item for each experiment. Within each experiment, you also create a matrix, and add a row for each observation. Then you do something very similar for the P-values. I'll see if I can fix that up.
This is how I'd replace your code. Strictly speaking I could make it even tighter, by avoiding formulas, creating the bare model matrix, and calling lm.fit directly. But that would mean having to manually code up the ANOVA test rather than simply calling anova, which is more trouble than it's worth.
set.seed(1) # or any other number you like
x <- factor(rep(seq_len(treatment), each=subject))
p.values <- sapply(seq_len(R), function(r) {
y <- rnorm(subject * treatment, sd=3)
anova(lm(y ~ x))[1,"Pr(>F)"]
})
ks.test(p.values, punif,alternative = "two.sided")
One-sample Kolmogorov-Smirnov test
data: p.values
D = 0.0121, p-value = 0.772
alternative hypothesis: two-sided
