Multilevel models in R using the nlme package

I am using the nlme package to learn multilevel models, following examples from the textbook "Discovering Statistics Using R", when I ran into an error.
Mixed Models Code
The data set is Honeymoon Period.dat, which is also downloadable from the book's companion website.
Data Set - Multilevel Models
require(nlme)
require(reshape2)
satisfactionData = read.delim("Honeymoon Period.dat", header = TRUE)
restructuredData<-melt(satisfactionData, id = c("Person", "Gender"), measured = c("Satisfaction_Base", "Satisfaction_6_Months", "Satisfaction_12_Months", "Satisfaction_18_Months"))
names(restructuredData)<-c("Person", "Gender", "Time", "Life_Satisfaction")
#print(restructuredData)
#restructuredData.sorted<-restructuredData[order(Person),]
intercept <- gls(Life_Satisfaction ~ 1, data = restructuredData, method = "ML", na.action = na.exclude)
randomIntercept <- lme(Life_Satisfaction ~ 1, data = restructuredData, random = ~1|Person, method = "ML", na.action = na.exclude, control = list(opt = "optim"))
anova(intercept, randomIntercept)
timeRI <- update(randomIntercept, .~. + Time)
timeRS <- update(timeRI, random = ~Time|Person)
ARModel <- update(timeRS, correlation = corAR1(0, form = ~Time|Person))
The error occurred at the last line, when updating the "timeRS" model.
The error message is as follows:
Error in as.character.factor(X[[i]], ...) : malformed factor
Are there any stats people/programmers here who know what this means?

I have looked at this book, and it appears that the coding is wrong. The error you are getting is because your time variable is a factor and you need it to be numeric. In the book's first figure the author represents time as numeric (0-3), but his code for the models is incorrect. I've recoded it for you:
## First, Time needs to be recoded as a numeric
restructuredData$Time.Numeric <- with(restructuredData,
    ifelse(Time == "Satisfaction_Base", 0,
    ifelse(Time == "Satisfaction_6_Months", 1,
    ifelse(Time == "Satisfaction_12_Months", 2,
    ifelse(Time == "Satisfaction_18_Months", 3, NA)))))
## Baseline Model
intercept <- gls(Life_Satisfaction ~ 1, data = restructuredData, method = "ML", na.action = na.exclude)
summary(intercept)
## Model where intercept can vary for Individuals
randomIntercept <- lme(Life_Satisfaction ~ 1, data = restructuredData, random = ~1|Person, method = "ML", na.action = na.exclude, control = list(opt = "optim"))
summary(randomIntercept)
## Add time as a fixed effect
timeRI <- lme(Life_Satisfaction ~ Time.Numeric, data = restructuredData, random = ~1|Person, method = "ML", na.action = na.exclude, control = list(opt = "optim"))
summary(timeRI)
## Add a random slope for time, varying across individuals
timeRS <- lme(Life_Satisfaction ~ Time.Numeric, data = restructuredData, random = ~Time.Numeric|Person, method = "ML", na.action = na.exclude, control = list(opt = "optim"))
summary(timeRS)
## Modeling the covariance structure of the errors with a first-order autoregressive structure
ARModel <- update(timeRS, correlation = corAR1(0, form = ~Time.Numeric|Person))
summary(ARModel)
anova(intercept, randomIntercept, timeRI, timeRS, ARModel)
The anova output for the model comparisons now matches the book exactly.

Related

nlme error with lme function: function evaluation limit reached without convergence

I am conducting a linear mixed-effects model with random intercepts and slopes using the lme function from the nlme package in R.
library(readr)
library(nlme)
#Data
post = read_csv("C:/data.csv")
post$regions <- as.numeric(as.factor(post$regions))
post.sc = data.frame(scale(post))
post.sc = na.omit(post.sc)
#Regressions
lm.reg <- lm(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc)
lmm.null <- lme(V1 ~ 1, data = post.sc, random = ~1 | regions, method = 'ML')
lmm.reg <- lme(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc, random = ~1 | regions, method = 'ML')
lmm.reg.slope <- lme(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc, random = ~1+V2+V3+V4+V5+V6+V7+V8+V9+V10 | regions, method = 'ML', control = lmeControl(maxIter = 10000, msMaxIter = 10000))
I get the following error message:
Error in lme.formula(V1 ~ V2+V3+V4+V5+V6+V7+ :
nlminb problem, convergence error code = 1
message = function evaluation limit reached without convergence (9)
I have not found any questions with similar problems.
Is there a way to fix this?
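One avenue worth trying (my suggestion, not from the original post): the message means nlminb hit its function-evaluation cap, which lmeControl exposes as msMaxEval, separately from maxIter and msMaxIter. Raising it, or switching to the "optim" optimizer, sometimes gets past the error, though a random-effects structure with nine correlated slopes may simply be too rich for the data:
## Sketch: raise nlminb's evaluation limit (the one the error message refers to)
ctrl <- lmeControl(maxIter = 10000, msMaxIter = 10000, msMaxEval = 10000)
## or, alternatively, switch optimizers:
## ctrl <- lmeControl(opt = "optim", msMaxIter = 10000)
lmm.reg.slope <- lme(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc,
                     random = ~1+V2+V3+V4+V5+V6+V7+V8+V9+V10 | regions,
                     method = 'ML', control = ctrl)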

R: Caret package: Brier Score

I want to perform a logistic regression with the train() function from the caret package. My model looks something like this:
model <- train(Y ~ .,
               data = train_data,
               family = "binomial",
               method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. For this I know how to obtain the ROC curve and AUC. However, I am also interested in the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of one prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate for each prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score within the caret package. I am aware that with glmnet's cv.glmnet() the predict function allows type = "response", which would solve my problem. However, for personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the Wikipedia definition of the Brier score:
The most common formulation of the Brier score is
$BS = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2$
where $f_t$ is the probability that was forecast, $o_t$ the actual outcome of the event at instance $t$ (0 or 1), and $N$ is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you calculate the probability and the 0/1 outcome with respect to that level. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~ ., data = train_data, family = "binomial", method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the probability of the class:
head(pred)
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second class, which is the default when your response is labeled 0/1. Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
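For reference, a minimal sketch of that call (my illustration; it assumes a binary outcome whose factor levels are valid R names, e.g. "no"/"yes", since classProbs = TRUE requires that):
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,            # needed for the probability columns
                     summaryFunction = BigSummary)
model <- train(Y ~ ., data = train_data, method = "glmnet",
               family = "binomial", trControl = ctrl,
               metric = "Brier", maximize = FALSE)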

How to fix "variable length differ" error in cv.zipath?

Trying to run a cross-validation of a zero-inflated Poisson model using cv.zipath from the mpath package.
Fitting the LASSO
fit.lasso = zipath(estimation_sample_nomiss ~ . | .,
                   data = missings,
                   nlambda = 100,
                   family = "poisson",
                   link = "logit")
Cross validation
n <- dim(docvisits)[1]
K <- 10
set.seed(197)
foldid <- split(sample(1:n), rep(1:K, length = n))
fitcv <- cv.zipath(F_time_unemployed ~ . | .,
                   data = estimation_sample_nomiss, family = "poisson",
                   nlambda = 100, lambda.count = fit.lasso$lambda.count[1:30],
                   lambda.zero = fit.lasso$lambda.zero[1:30], maxit.em = 300,
                   maxit.theta = 1, theta.fixed = FALSE, penalty = "enet",
                   rescale = FALSE, foldid = foldid)
I encounter the following error:
Error in model.frame.default(formula = F_time_unemployed ~ . + ., data = list(: variable lengths differ (found for '(weights)')
I have cleaned the sample of all NA's but still encounter the error message.
The solution turns out to be that the cv.zipath() command does not accept tibbles - at least in this instance (no guarantee as to how far this generalises). Having used dplyr commands, one needs to convert the data back to a plain data frame, so the fix is as simple as as.data.frame().
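For example, using the data frame name from the question (a one-line illustration on my part):
estimation_sample_nomiss <- as.data.frame(estimation_sample_nomiss)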

R session crashes when gbm() is applied to a factor response variable? Please advise

Below is an excerpt from code I am trying on the German credit dataset.
I am trying to make a generic function for ensemble techniques for my shinydashboard.
The problem is with gbm: the R session will crash if the response variable is converted to a factor.
But if the response variable is not converted to a factor, then randomForest will not produce the OOB error rate and confusion matrix in its output component.
Please advise.
The response variable is "default". Before fitting the models, it is preprocessed as follows:
## load the dataset
data_x = read.csv("credit.csv")
## Preprocessing the dataset
data_x$default <- ifelse(data_x$default == "yes", 1, 0)
## Loading packages
pacman::p_load(shiny, shinydashboard, gbm, randomForest, ggplot2,
               ipred, caret, ROCR, dplyr, ModelMetrics)
The user-defined function:
model = function(algo = gbm, distribution = 'bernoulli',
                 type = 'response', set = 'AUC', n.trees = 10000){
  ## Fit the model
  model <- algo(formula = default ~ .,
                distribution = distribution,
                data = train,
                n.trees = n.trees,
                cv.folds = 3)
  ## Generate the predictions on the test set
  pred <- predict(object = model,
                  newdata = test,
                  n.trees = n.trees,
                  type = type)
  ## Generate the test set AUC using the predictions
  AUC <- auc(actual = test$default, predicted = pred)
  if (set == 'AUC'){
    return(AUC)
  }
  if (set == 'predictions'){
    return(pred)
  }
  if (set == 'model'){
    return(model)
  }
  else
    return(NULL)
}
Now call the different models:
get_model <- function(algo, type = 'response', ntrees = 10000){
  z = model(algo = algo, type = type, set = 'model')
}
Bag_model <- get_model(algo = bagging, type = 'prob')
RF_model <- get_model(algo = randomForest)
GBM_model <- get_model(algo = gbm)
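A possible way out of the dilemma (a sketch of mine, not from the original post): gbm with distribution = "bernoulli" expects a numeric 0/1 response, while randomForest needs a factor response to run in classification mode (and thus report the OOB error rate and confusion matrix), so the wrapper can coerce the response per algorithm:
## Sketch: coerce the response to what each algorithm expects
fit_one <- function(algo_name, train) {
  if (algo_name == "randomForest") {
    train$default <- as.factor(train$default)   # factor -> classification mode
    randomForest::randomForest(default ~ ., data = train)
  } else if (algo_name == "gbm") {
    train$default <- as.numeric(train$default)  # numeric 0/1 for bernoulli
    gbm::gbm(default ~ ., distribution = "bernoulli",
             data = train, n.trees = 10000, cv.folds = 3)
  }
}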

"Wrong model type for regression" error in 10-fold cross-validation for Naive Bayes using R

I am implementing 10-fold cross-validation for Naive Bayes on some test data with 2 classes (0 and 1).
I followed the steps below and am getting an error.
data(testdata)
attach(testdata)
X <- subset(testdata, select=-Class)
Y <- Class
library(e1071)
naive_bayes <- naiveBayes(X,Y)
library(caret)
library(klaR)
nb_cv <- train(X, Y, method = "nb", trControl = trainControl(method = "cv", number = 10))
## Error:
## Error in train.default(X, Y, method = "nb", trControl = trainControl(number = 10)) :
## wrong model type for regression
dput(testdata)
structure(list(Feature.1 = 6.534088, Feature.2 = -19.050915,
Feature.3 = 7.599378, Feature.4 = 5.093594, Feature.5 = -22.15166,
Feature.6 = -7.478444, Feature.7 = -59.534652, Feature.8 = -1.587918,
Feature.9 = -5.76889, Feature.10 = 95.810563, Feature.11 = 49.124086,
Feature.12 = -21.101489, Feature.13 = -9.187984, Feature.14 = -10.53006,
Feature.15 = -3.782506, Feature.16 = -10.805074, Feature.17 = 34.039509,
Feature.18 = 5.64245, Feature.19 = 19.389724, Feature.20 = 16.450196,
Class = 1L), .Names = c("Feature.1", "Feature.2", "Feature.3",
"Feature.4", "Feature.5", "Feature.6", "Feature.7", "Feature.8",
"Feature.9", "Feature.10", "Feature.11", "Feature.12", "Feature.13",
"Feature.14", "Feature.15", "Feature.16", "Feature.17", "Feature.18",
"Feature.19", "Feature.20", "Class"), class = "data.frame", row.names = c(NA,
-1L))
Also, how do I calculate R-squared or AUC for this model?
Dataset: there are 10,000 records with 20 features and a binary class.
Naive Bayes is a classifier, and hence converting Y to a factor (or boolean) is the right way to tackle the problem. Your original formulation was using a classification tool with numeric values, so caret assumed you wanted a regression.
As far as R-squared is concerned, that metric is only computed for regression problems, not classification problems. To evaluate classification problems there are other metrics, like precision and recall.
Please refer to this Wikipedia article for more information on these metrics:
http://en.wikipedia.org/wiki/Binary_classification
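Since the question also asks about AUC, here is a minimal sketch (my addition, not part of the original answer; it assumes Y has already been converted to a factor and, purely for brevity, evaluates on the training data):
## Sketch: AUC for the cross-validated Naive Bayes model via pROC
library(pROC)
probs <- predict(nb_cv, newdata = X, type = "prob")   # class probabilities
roc_obj <- roc(response = Y, predictor = probs[, 2])  # 2nd column = 2nd level
auc(roc_obj)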
It works after converting the label vector: Y <- as.factor(Y)
Alternatively, when reading the data in, add colClasses = c("Class" = "character") so that the class column is never treated as numeric.
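Presumably this refers to the read step; a sketch (the file name is hypothetical):
testdata <- read.csv("testdata.csv", colClasses = c("Class" = "character"))
testdata$Class <- as.factor(testdata$Class)  # factor response for classification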
