Get test error in a logistic regression model in R

I'm performing some experiments with logistic regression in R, using the Auto dataset included in R.
I've split the data into a training part (80%) and a test part (20%), normalizing each part individually.
I can create the model without any problem with the line:
mlr <- glm(mpg ~ displacement + horsepower + weight, data = train)
I can even predict train$mpg with the train set:
trainpred<-predict(mlr,train,type="response")
And with this I calculate the in-sample error:
etab <- table(trainpred, train[,1])
insampleerror<-sum(diag(etab))/sum(etab)
The problem comes when I want to predict with the test set. I use the following line:
testpred<-predict(model_rl,test,type="response")
Which gives me this warning:
'newdata' had 79 rows but variables found have 313 rows
but it doesn't work, because testpred has the same length as trainpred (it should be shorter). When I try to calculate the test error using testpred with the following line:
etabtest <- table(testpred, test[,1])
I get the following error:
Error in table(testpred, test[, 1]) :
all arguments must have the same length
What am I doing wrong?

I'm answering my own question in case someone has the same problem:
When I pass the arguments to glm, I'm specifying what I want to predict, namely the Auto$mpg labels using the training data; hence, my glm call must be:
attach(Auto)
mlr <- glm(mpg ~ displacement + horsepower + weight, data = Auto, subset = indexes_train)
If I now call predict, table, etc., there is no longer any mismatch in structure sizes. After fixing this mistake, it works for me.

As imo points out:
"More importantly, you might check that this creates a logistic regression. I think it is actually OLS. You have to set the link and family arguments."
So set family = "binomial".
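Putting both fixes together, here is a minimal sketch of the corrected workflow. This is not from the original thread: it assumes a hypothetical binary recoding of mpg called mpg01 (family = binomial needs a 0/1 response) and that indexes_train holds the training row indices.
# mpg01 is an assumed binary recoding of mpg (e.g. above/below the median)
mlr <- glm(mpg01 ~ displacement + horsepower + weight,
           data = Auto, subset = indexes_train, family = binomial)
# Predict on the held-out rows and tabulate against the true labels
testpred <- predict(mlr, newdata = Auto[-indexes_train, ], type = "response")
etabtest <- table(testpred > 0.5, Auto[-indexes_train, "mpg01"])
testaccuracy <- sum(diag(etabtest)) / sum(etabtest)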

Related

Error in glsEstimate(object, control = control) : computed "gls" fit is singular, rank 19

First time asking in the forums; this time I couldn't find a solution in other answers.
I'm just starting to learn R, so I can't help but think this has a simple solution I'm failing to see.
I'm analyzing the relationship between different insect species (SP) and temperature (T), the explanatory variables, and the femur area of the resulting adult (Femur.area), the response variable.
This is my linear model:
ModeloP <- lm(Femur.area ~ T * SP, data=Datos)
No error, but when I want to model variance with gls,
modelo_varPower <- gls(Femur.area ~ T*SP,
weights = varPower(),
data = Datos
)
I get the following errors...
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 19
The linear model barely passes the Shapiro-Wilk normality test; could this be the issue?
Shapiro-Wilk normality test
data: re
W = 0.98269, p-value = 0.05936
Strangely, I've run this model using another explanatory variable and had no errors. Everything I can find in the forums has to do with repeated sampling over a period of time, and that's not my case.
Since the only difference is the response variable, I'm uploading an image of what the table looks like in case it helps.
You have some missing cells in your SP:T interaction. lm() tolerates these (if you look at coef(lm(Femur.area~SP*T,data=Datos)) you'll see some NA values for the missing interactions). gls() does not. One way to deal with this is to create an interaction variable and drop the missing levels, then fit the model as (effectively) a one-way rather than a two-way ANOVA. (I called the data dd rather than datos.)
dd3 <- transform(na.omit(dd), SPT=droplevels(interaction(SP,T)))
library(nlme)
gls(Femur.area~SPT,weights=varPower(form=~fitted(.)),data=dd3)
If you want the main effects and the interaction term and the power-law variance that's possible, but it's harder.
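To confirm the diagnosis, a quick check (a sketch, assuming the data frame is called Datos as in the question) is to tabulate the design and look for empty SP:T combinations:
# Any zero cell here is a missing SP:T combination that gls() cannot handle
with(Datos, table(SP, T))
# The corresponding coefficients show up as NA in the lm() fit
coef(lm(Femur.area ~ SP * T, data = Datos))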

Ridge Regression accuracy in R

I have been stuck on this for some time and am in need of some help. I am new to R and have never done ridge regression using glmnet. I am trying to learn ML via the MNIST-fashion dataset (https://www.kaggle.com/zalando-research/fashionmnist). To streamline the training (to make sure it works before I attempt to train on the full dataset), I take a stratified random sample, which produces a training dataset of 60 observations (6 per label):
MNIST.sample.train = sample.split(MNIST.train$label, SplitRatio=0.001)
sample.train = MNIST.train[MNIST.sample.train,]
Next, I attempt to run ridge regression, using alpha=1...
x=model.matrix(label ~ . ,data=sample.train)
y=sample.train$label
rr.m <- glmnet(x,y,alpha=1, family="multinomial")
This seems to work. However, when I attempt to run the prediction with
predict.rr.m <- predict(rr.m, MNIST.test, type = "class")
I get the following error:
Error in cbind2(1, newx) %*% (nbeta[[i]]) : not-yet-implemented method for %*%
Ultimately, I am looking to obtain a single measure of the accuracy of the ridge regression. I believe that to do so, I must first obtain a prediction.
Any thoughts on how to fix my code would be greatly appreciated.
Kevin
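No answer is recorded here, but for reference: glmnet's predict() expects the new data as a numeric matrix passed via newx, built the same way as the training matrix, and in glmnet alpha = 0 gives ridge regression while alpha = 1 gives the lasso. A minimal sketch under those assumptions (column names taken from the question):
x.test <- model.matrix(label ~ ., data = MNIST.test)  # assumes MNIST.test also has a label column
pred <- predict(rr.m, newx = x.test, type = "class", s = min(rr.m$lambda))  # pick a single lambda
accuracy <- mean(pred == MNIST.test$label)  # single accuracy measure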

SuperLearner Predict error

I am using SuperLearner R package.
I am trying to generate predicted y values for both train and test set.
After fitting a SuperLearner model without defining newX (so that I can first get predictions on the training set, compute the MSE, and plot predictions vs. actual Y values), I use predict to generate Y values for the test set by running the following code:
sl.cv<-SuperLearner(Y = label, X = train,
SL.library=c("SL.randomForest", "SL.glmnet", "SL.svm"),
method = "method.NNLS", verbose=TRUE, cvControl=list(V=10))
pred.sl.cv <- predict(sl.cv, newdata=test, onlySL = T)
Then, I get the following error after "predict":
"Error in object$whichScreen : $ operator is invalid for atomic vectors"
I browsed many online sources to learn how to use predict after fitting a SuperLearner model, and I am doing exactly what others do: I pass the fitted SuperLearner object (in this case, sl.cv) followed by the new test set. I didn't even type a $ operator.
Why am I getting this error message? How do I solve this problem?
Another question: does adding cvControl = list(V = 10) as an option change anything? I think the default for SuperLearner is 10-fold cross-validation, so removing cvControl = list(V = 10) should not change anything, right?
I would appreciate your advice. Thank you!
The problem is that you are using matrices for your train and/or test data. You should use a data.frame instead. Change your code to the following:
sl.cv<-SuperLearner(Y = label, X = as.data.frame(train),
SL.library=c("SL.randomForest", "SL.glmnet", "SL.svm"),
method = "method.NNLS", verbose=TRUE, cvControl=list(V=10))
pred.sl.cv <- predict(sl.cv, newdata=as.data.frame(test), onlySL = T)
Also, make sure your labels are a list.
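As a follow-up to the original goal of computing MSE: the fitted object stores its in-sample ensemble predictions in sl.cv$SL.predict, and predict() returns the test-set predictions in $pred. A small sketch, assuming label is a numeric outcome vector and test.y is a hypothetical vector of test-set outcomes:
mse.train <- mean((label - sl.cv$SL.predict)^2)   # training MSE
mse.test  <- mean((test.y - pred.sl.cv$pred)^2)   # test MSE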

Computational error with lmerTest and undefined columns selected when using mixed()

I want to fit a mixed-effects linear regression. The dependent variable is acceptability judgments on a 4-point rating scale (Totally unacceptable to Totally acceptable). These judgments were assigned numeric values (1, 2, 3, 4), and that vector was centered and scaled.
I call the model with the following code:
ln1 = lmer(RatingNorm ~ Group + ProfScore + RegularityInflectedForm +
             RegRhyme * SimilarityReal + Tense + VerbClass +
             (1 | SubjectID) + (1 | Infinitive), data = AJT1)
Then I try to get p-values with:
mixed(ln1, AJT1)
No error messages appeared when fitting the model, but using mixed() from the afex package to get p-values gives a strange error message:
Fitting one lmer() model. [DONE]
Calculating p-values.
anova from lme4 is returned
some computational error has occurred in lmerTest
Error in `[.data.frame`(anova_table, , c("NumDF", "DenDF", "F.value", :
undefined columns selected
The same thing happens when I call the same model using the lmerTest package. I have also tried simpler models with only one of the fixed effects (just Group or just Tense, which are categorical, or just ProfScore, which is continuous), as well as using only one of the two random effects. The same error always repeats. However, I am able to use anova(model) to see p-values. I would like to know why I cannot use mixed() successfully in this case. I also have the most recent version of R installed, and am not seeing any posts describing similar errors for this kind of scenario.
Here are links to code and dataset:
R code
Dataset
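For reference, afex::mixed() is normally given the formula and the data directly, since it refits the model itself, rather than an already fitted lmer object; a minimal sketch with the same formula as above:
library(afex)
m1 <- mixed(RatingNorm ~ Group + ProfScore + RegularityInflectedForm +
              RegRhyme * SimilarityReal + Tense + VerbClass +
              (1 | SubjectID) + (1 | Infinitive),
            data = AJT1, method = "KR")  # Kenward-Roger p-values
m1  # prints the ANOVA-style table with p-values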
I am getting a similar problem trying to use piecewiseSEM in R. It runs fine if I eliminate the mixed model portion of the DVpopStand model (and use lm).
dumpster = list(
lmer(DVpopStand ~ Treatment + Site + (1|fLine/fBowl)),
lmer(centroidStand ~ Treatment + Site+ (1|fLine/fBowl)),
lmer(Crush ~ Treatment + Site + centroidStand + DVpopStand + (1|fLine/fBowl))
)
Model.result = sem.fit(dumpster, data= F1culled)
And get the error message:
summary from lme4 is returned
some computational error has occurred in lmerTest
Error in `[.data.frame`(ret, 3:4) : undefined columns selected

Using lme4 glmer function for unbalanced treatment comparison results in variable length error

I am using the lme4 package to run a generalized linear mixed model for proportion data using a binary response. I have unequal sample sizes for my treatments and am getting the following error, which I understand is due to the very fact that I have unequal sample sizes:
Error in model.frame.default(data = POL3, drop.unused.levels = TRUE,
formula = X2 ~ : variable lengths differ (found for 'Trtmt')
Here is the code that leads to the error:
#Excluding NA from the data set
POL3<-na.exclude(POL)
#Indicating the binary response
X2<-cbind(POL3$CHSd, POL3$TotSd-POL3$CHSd)
#Running the model
MMCHS4<-glmer(X2~Trtmt+(1|BSD)+(1|Hgt), family=binomial, data=POL3)
I have read that lme4 can deal with unbalanced samples but can't get this to work.
Impossible to say for sure without a reproducible example, but you probably need to make sure that the Trtmt variable is contained within POL3 (i.e., that there isn't another Trtmt variable lying around in your global workspace).
I would probably implement the model in this way:
glmer(CHSd/TotSd~Trtmt+(1|BSD)+(1|Hgt),
weights=TotSd,
family=binomial,
na.action=na.exclude,
data=POL)
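A variant of the same idea that keeps the original two-column (successes, failures) response but builds it inside the formula, so that every variable comes from the same data frame (a sketch, not part of the original answer):
glmer(cbind(CHSd, TotSd - CHSd) ~ Trtmt + (1 | BSD) + (1 | Hgt),
      family = binomial, na.action = na.exclude, data = POL)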
