Ridge Regression accuracy in R

I have been stuck on this for some time and am in need of some help. I am new to R and have never done ridge regression using glmnet. I am trying to learn ML via the MNIST-fashion dataset (https://www.kaggle.com/zalando-research/fashionmnist). To streamline the training (to make sure it works before I attempt to train on the full dataset), I take a stratified random sample, which produces a training dataset of 60 observations (6 per label):
MNIST.sample.train = sample.split(MNIST.train$label, SplitRatio=0.001)
sample.train = MNIST.train[MNIST.sample.train,]
Next, I attempt to run ridge regression, using alpha=1...
x=model.matrix(label ~ . ,data=sample.train)
y=sample.train$label
rr.m <- glmnet(x,y,alpha=1, family="multinomial")
This seems to work. However, when I attempt to run the prediction:
predict.rr.m <- predict(rr.m, MNIST.test, type = "class")
I get an error:
Error in cbind2(1, newx) %*% (nbeta[[i]]) : not-yet-implemented method for %*%
Ultimately, I am looking to obtain a single measure of the accuracy of the ridge regression. I believe that to do so, I must first obtain a prediction.
Any thoughts on how to fix my code would be greatly appreciated.
Kevin
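For what it's worth: the error happens because predict() for glmnet models expects a numeric matrix for newx, not a data frame, which is what triggers the "not-yet-implemented method for %*%" message. Also note that in glmnet, alpha=1 is the lasso penalty; ridge regression is alpha=0. A minimal sketch of the fix, assuming MNIST.test has the same label and pixel columns as the training data (cv.glmnet, which the original code does not use, is added here to pick a lambda):
library(glmnet)
# build the test design matrix the same way as the training one
x.test = model.matrix(label ~ . , data=MNIST.test)
# cross-validate to choose lambda; alpha=0 gives the ridge penalty
cv.m = cv.glmnet(x, y, alpha=0, family="multinomial")
# predicted class labels at the chosen lambda
pred = predict(cv.m, newx=x.test, s="lambda.min", type="class")
# a single accuracy measure: fraction of correctly classified rows
mean(pred == MNIST.test$label)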

Related

Parallel regression assumption on Imputed (MICE) Data with Brant Test in R

My data is ordinal, so missing values are imputed with the polr method from the MICE package. Now I have multiple imputed datasets on which I can run an ordinal logistic regression. But, as the title mentions, I want to perform a Brant test to check the parallel regression assumption. How can I perform such a test on my imputed datasets?
olr <- with(imputed, polr(target ~ var1+var2))
olrsummary <- summary(pool(olr))
> brant(olr)
Error in formula.default(model) : invalid formula
> brant(olrsummary)
Error in temp.data[, name] : incorrect number of dimensions
I know I can take the first dataset with complete(imputed, 1) and use that for my Brant test. But that just doesn't seem right.
Thanks in advance
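One hedged workaround, since with() returns a list of fitted models that brant() cannot digest directly: run the Brant test separately on each completed dataset and check whether the parallel-regression conclusion is stable across imputations (there is no standard way to pool Brant test statistics). This assumes the brant package is installed; target, var1, and var2 are the names from the question:
library(mice)
library(MASS)
library(brant)
# fit polr on each completed dataset and Brant-test it individually
brant.results <- lapply(seq_len(imputed$m), function(i) {
  dat <- complete(imputed, i)
  fit <- polr(target ~ var1 + var2, data = dat, Hess = TRUE)
  brant(fit)
})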

Get test error in a logistic regression model in R

I'm performing some experiments with logistic regression in R with the Auto dataset (from the ISLR package).
I've split the data into a training part (80%) and a test part (20%), normalizing each part individually.
I can create the model without any problem with the line:
mlr<-glm(mpg ~ displacement + horsepower + weight, data=train)
I can even predict train$mpg with the train set:
trainpred<-predict(mlr,train,type="response")
And with this calculate the sample error:
etab <- table(trainpred, train[,1])
insampleerror<-sum(diag(etab))/sum(etab)
The problem comes when I want to predict with the test set. I use the following line:
testpred<-predict(mlr,test,type="response")
Which gives me this warning:
'newdata' had 79 rows but variables found have 313 rows
but it doesn't work, because testpred has the same length as trainpred (it should be shorter). When I try to calculate the test error using testpred with the following line:
etabtest <- table(testpred, test[,1])
I get the following error:
Error in table(testpred, test[, 1]) : all arguments must have the same length
What am I doing wrong?
I'm answering my own question in case someone else has the same problem:
When I put the arguments in glm I am saying what I want to predict, namely the Auto$mpg labels using the training data; hence, my glm call must be:
attach(Auto)
mlr<-glm(mpg ~ displacement + horsepower + weight, data=Auto, subset=indexes_train)
If I now call predict, table, etc., there isn't any problem with structure sizes. After fixing this mistake it works for me.
As imo says:
"More importantly, you might check that this creates a logistic regression. I think it is actually OLS. You have to set the link and family arguments."
So set family = 'binomial'.
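Putting the pieces together, a hedged end-to-end sketch. Since glm with family=binomial needs a 0/1 response and mpg is continuous, a binary high/low-mpg indicator is created first; mpg01 and indexes_train are illustrative names for this sketch, not taken from the original post:
library(ISLR)  # provides the Auto dataset
# illustrative binary response: 1 if mpg is above the median
Auto$mpg01 <- as.numeric(Auto$mpg > median(Auto$mpg))
set.seed(1)
indexes_train <- sample(nrow(Auto), floor(0.8 * nrow(Auto)))
test <- Auto[-indexes_train, ]
# fit on the training subset, but against the full data frame,
# so that predict() lines the variables up correctly
mlr <- glm(mpg01 ~ displacement + horsepower + weight,
           data = Auto, subset = indexes_train, family = binomial)
# predicted probabilities on the held-out rows, then a confusion table
testpred <- predict(mlr, newdata = test, type = "response")
etabtest <- table(testpred > 0.5, test$mpg01)
testerror <- 1 - sum(diag(etabtest)) / sum(etabtest)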

Gaussian process classification with R kernlab package: issue predicting test set larger than training set

I'm using the gausspr function from the kernlab package for Gaussian process classification, and I run into the following error message:
Error in votematrix[i, ret > 0] : (subscript) logical subscript too long
anytime I try to use the classifier to make predictions on a dataset that has more observations than the training set.
Here's a very simple example to reproduce this problem:
data(iris)
gp1 = gausspr(Species ~., data=iris)
predict(gp1,iris[c(1:150,1),-5])
Has anyone else run into this problem? Any insights into how to get around it other than calling predict many times on smaller subsets of the test data?
Thanks!
I don't have time to review the code right now, but predicting 'probabilities' bypasses the faulty code path, so try this instead:
data(iris)
gp1 = gausspr(Species ~., data=iris)
predict(gp1,iris[c(1:150,1),-5], type = 'probabilities')
And work with the probabilities.
This is the loop that outputs that error if you want to review it.
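If hard class labels are still needed, a small follow-up sketch: take, for each row, the column with the highest probability (the column names of the returned probability matrix should be the class levels):
probs = predict(gp1, iris[c(1:150,1),-5], type = 'probabilities')
# hard class labels from the probability matrix
pred.class = colnames(probs)[max.col(probs)]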

Random forest evaluation in R

I am a newbie in R and I am trying to do my best to create my first model. I am working on a two-class random forest project and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get an overall prediction for my model. I must be missing something in my code. I get the OOB estimate and the per-case predictions for the test set, but not an overall measure for the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot get to understand everything. It would be very helpful if anyone can help with the code or post a link to a good practical sample.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[,2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
Your problem is that predict on a randomForest object with type='prob' returns two predictions: each column contains the probability to belong to each class (for binary prediction).
You have to decide which of these predictions to use to build the ROC curve. Fortunately for binary classification they are identical (just reversed):
auc1 <-roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <-roc(test$goodkit, prediction[,2])
print(auc2)
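If a single AUC number is all that is wanted from pROC, its auc() helper (a wrapper around roc()) returns it directly:
library(pROC)
# one number: area under the ROC curve for the class-2 probabilities
as.numeric(auc(test$goodkit, prediction[,2]))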

How do I plot predictions from new data fit with gee, lme, glmer, and gamm4 in R?

I have fit my discrete count data using a variety of functions for comparison. I fit a GEE model using geepack, a linear mixed effect model on the log(count) using lme (nlme), a GLMM using glmer (lme4), and a GAMM using gamm4 (gamm4) in R.
I am interested in comparing these models and would like to plot the expected (predicted) values for a new set of data (predictor variables). My goal is to compare the predicted effects for each model under particular conditions (x variables). Of particular interest is the comparison between marginal (GEE) and conditional estimates.
I think my main problem might be getting the new data in the correct form with the correct labels and attributes and such. I am still very much an R novice and struggle with this stuff (no course on this at my university unfortunately).
I currently have fitted models
gee1, lme1, lmer1, and gamm1
and can extract their fixed effect coefficients and standard errors without a problem. I also don't have a problem converting them from the log scale or estimating confidence intervals accounting for the random effects.
I also have my new dataframe newdat which has 365 observations of 23 variables (average environmental data for each day of the year).
I am stuck on how to predict new count estimates from this. I played around with the model.matrix function but couldn't get it to work. For example, I tried:
mm = model.matrix(terms(glmm1), newdat)
# Error in model.frame.default(object, data, xlev = xlev) :
#   object is not a matrix
newdat$pcount = mm %*% fixef(glmm1)
Any suggestions or good references would be greatly appreciated. Can anyone help with the error above?
Getting predictions for lme() and lmer() is documented on http://glmm.wikidot.com/faq
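Following that FAQ, here is a hedged sketch of the usual recipe for population-level (fixed-effects-only) predictions, assuming glmm1 is the glmer fit; the "object is not a matrix" error typically goes away once the response is dropped from the terms, since newdat contains no response column:
library(lme4)
# design matrix for the new data, with the response removed
form <- formula(glmm1, fixed.only = TRUE)
mm <- model.matrix(delete.response(terms(form)), newdat)
# fixed-effects-only predictions on the link (log) scale
newdat$pcount <- drop(mm %*% fixef(glmm1))
# approximate 95% intervals from the fixed-effect covariance matrix
pvar <- diag(mm %*% tcrossprod(as.matrix(vcov(glmm1)), mm))
newdat$lo <- newdat$pcount - 1.96 * sqrt(pvar)
newdat$hi <- newdat$pcount + 1.96 * sqrt(pvar)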
