std values of principal component object differs in prcomp and caret - r

I was trying Principal Component Analysis in the following data set. I tried through prcomp function and caret preProcess function.
library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
# from prcomp
names1 <-names(training)[substr(names(training),1,2)=="IL"]
prcomp.data <- prcomp(training[,names1],center=TRUE, scale=TRUE)
prcomp.data$sdev
## from caret package
preProcess(training[, names1], method=c("center", "scale", "pca"))$std
I was wondering why sdev values differ in the above processes.
Thanks

The first method is giving you standard deviations of 12 principal components (which you can see with prcomp.data$rotation).
Also, this is mentioned in the documentation for the sdev value:
the standard deviations of the principal components (i.e., the square
roots of the eigenvalues of the covariance/correlation matrix, though
the calculation is actually done with the singular values of the data
matrix).
The 2nd is giving you standard deviations on the pre-processed input data (hence the variable names associated with each standard deviation).
A small side note -- caret PCA's are automatically scaled and centered unless otherwise specified.

Related

User specified variance-covariance matrix in car::Anova not working

I am trying to use the car::Anova function to carry out joint Wald chi-squared tests for interaction terms involving categorical variables.
I would like to compare results when using bootstrapped variance-covariance matrix for the model coefficients. I have some concerns about the normality of residuals and am doing this as a first step before considering permutation tests as an alternative to joint Wald chi-squared tests.
I have found the variance covariance from the model fitted on 1000 bootstrap resamples of the data. The problem is that the car::Anova.merMod function does not seem to use the user-specified variance covariance matrix. I get the same results whether I specify vcov. or not.
I have made a very simple example below where I try to use the identity matrix in Anova(). I have tried this with the more realistic bootstrapped var-cov as well.
I looked at the code on github and it looks like there is a line where vcov. is overwritten using vcov(mod), so that might be an error. However I thought I'd see if anyone here had come across this issue or could see if I had made a mistake.
Any help would be great!
df1 = data.frame( y = rbeta(180,2,5), x = rnorm(180), group = letters[1:30] )
mod1 = lmer(y ~ x + (1|group), data = df1)
# Default, uses variance-covariance from the model
Anova(mod1)
# Should use user-specified varcov matrix but does not - same results as above
Anova(mod1, vcov. = diag(2))
# I'm not bootstrapping the var-cov matrix here to save space/time
p.s. Using car::linearHypothesis works for user-specified vcov, but this does not give results using type 3 sums of squares. It is also more laborious to use for more than one interaction term. Therefore I'd prefer to use car::Anova if possible.

I am setting seed on Gradient Boosting Machine(GBM) Model but I keep on getting different prediction

I am performing credit risk modelling using the Gradient Boosting Machine (GBM) algorithm and on making predictions of Probability of Default (PD) I keep on getting different PDs for each run even when I have set.seed(1234) in my code.
What could be causing this to happen and how do I fix it. Here is my code below:
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 5)
modelLookup(model='gbm')
#Creating grid
grid <- expand.grid(n.trees=c(10,20,50,100,500,1000),shrinkage=c(0.01,0.05,0.1,0.5),n.minobsinnode
= c(3,5,10),interaction.depth=c(1,5,10))
#SetSeed
set.seed(1234)
# training the model
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneGrid=grid)
# summarizing the model
print(model_gbm)
plot(model_gbm)
#using tune length
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneLength=10)
print(model_gbm)
plot(model_gbm)
#Checking variable importance for GBM
#Variable Importance
library(gbm)
varImp(object=model_gbm, numTrees = 50)
#Plotting Varianle importance for GBM
plot(varImp(object=model_gbm),main="GBM - Variable Importance")
#Checking variable importance for RF
varImp(object=model_rf)
#Plotting Varianle importance for Random Forest
plot(varImp(object=model_rf),main="RF - Variable Importance")
#Checking variable importance for NNET
varImp(object=model_nnet)
#Plotting Variable importance for Neural Network
plot(varImp(object=model_nnet),main="NNET - Variable Importance")
#Checking variable importance for GLM
varImp(object=model_glm)
#Plotting Variable importance for GLM
plot(varImp(object=model_glm),main="GLM - Variable Importance")
#Predictions
predictions<-predict.train(object=model_gbm,testSet[,predictors],type="raw")
table(predictions)
confusionMatrix(predictions,testSet[,outcomeName])
PD <- predict.train(object=model_gbm,credit_transformed[,predictors],type="prob")
I assume you are using train() from caret.
I recommend you use the more complex but customizable trainControl() from the same package.
As you can see from ?trainControl, the parameter seeds is:
an optional set of integers that will be used to set the seed at each
resampling iteration. This is useful when the models are run in
parallel. A value of NA will stop the seed from being set within the
worker processes while a value of NULL will set the seeds using a
random set of integers. Alternatively, a list can be used. The list
should have B+1 elements where B is the number of resamples, unless
method is "boot632" in which case B is the number of resamples plus 1.
The first B elements of the list should be vectors of integers of
length M where M is the number of models being evaluated. The last
element of the list only needs to be a single integer (for the final
model). See the Examples section below and the Details section.
Fixing seeds should do the trick.
Please, next time try to offer a dput o analogous of your data in order to be reproducible.
Best!

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the package glmnet, I need to run several LASSO analysis for the calibration of a large number of variables (%reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts on the procedure and on the results I wish to solve. I show my provisional code below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwords, I run a cross-validated LASSO model for the training set and extract and writte the non-zero coefficients for lambdamin. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the function “predict” and apply the object “cv.lasso1” (the model obtained previously) to the variables of the testing set (x.2) in order to get the prediction of the variable and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx=x.2, type = "response", s =
"lambda.min")
cor.test(x=c(predict.1.2), y=c(y.2))
This is a simplified code and had no problem so far, the point is that I would like to make a loop (of one hundred repetitions) of the whole code and get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs actual values (for the testing set) for each repetition. I've tried but couldn't get any clear results. Can someone give me some hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky. And in your case, may not be necessary the way in which you have outlined it.
If you are trying to find the variables most predictive, you can use PCA, Principal Component Analysis to select variables with the most variation within the a variable AND between variables, but it does not consider your outcome at all, so if you have poor model design it will pick the least correlated data in your repository but it may not be predictive. So you should be very aware of all variables in the set. This would be a way of reducing the dimensionality in your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential to making these models work right, by removing the distance between your various variables setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors...keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to resolve with much data, try an LDA, Linear Discriminant Analysis which looks to reduce variables by optimizing the variance of each predictor with respect to the OUTCOME variable...it specifically considers your outcome.
require(MASS)
yourLDA =r <- lda(formula = outcome ~ .,
data = yourdata)
You can also set the prior probabilities in LDA if you know what a global probability for each class is, or you can leave it out, and R/ lda will assign the probabilities of the actual classes from a training set. You can read about that here:
LDA from MASS package
So this gets you headed in the right direction for reducing the complexity of data via feature selection in a computationally solid method. In looking to build the most robust model via repeated model building, this is known as crossvalidation. There is a cv.glm method in boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourCVGLM<- cv.glmnet(y = outcomeVariable, x = allPredictorVariables, family="gaussian", K=100) .
Here K=100 specifies that you are creating 100 randomly sampled models from your current data OBSERVATIONS not variables.
So the process is two fold, reduce variables using one of the two methods above, then use cross validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called booting and it is powerful and available in many different model types.
Not as much code and you might hope for, but pointing you in a decent direction.

Boosting classification tree in R

I'm trying to boost a classification tree using the gbm package in R and I'm a little bit confused about the kind of predictions I obtain from the predict function.
Here is my code:
#Load packages, set random seed
library(gbm)
set.seed(1)
#Generate random data
N<-1000
x<-rnorm(N)
y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
z<-rep(0,N)
for(i in 1:N){
if(x[i]-y[i]+0.2*rnorm(1)>1.0){
z[i]=1
}
}
#Create data frame
myData<-data.frame(x,y,z)
#Split data set into train and test
train<-sample(N,800,replace=FALSE)
test<-(-train)
#Boosting
boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
pred.boost
pred.boost is a vector with elements from the interval (0,1).
I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values - either 0 or 1 - and I'm using distribution="bernoulli".
How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values or is there anything I'm doing wrong with the predict function?
Your observed behavior is correct. From documentation:
If type="response" then gbm converts back to the same scale as the
outcome. Currently the only effect this will have is returning
probabilities for bernoulli.
So you should be getting probabilities when using type="response" which is correct. Plus distribution="bernoulli" merely tells that labels follows bernoulli (0/1) pattern. You can omit that and still model will run fine.
To proceed do predict_class <- pred.boost > 0.5 (cutoff = 0.5) or else plot ROC curve to decide on cutoff yourself.
Try using adabag. Class, probabilities, votes and error are inbuilt in adabag which makes it easy to interpret, and of course less lines of codes.

Cross validation of PCA+lm

I'm a chemist and about an year ago I decided to know something more about chemometrics.
I'm working with this problem that I don't know how to solve:
I performed an experimental design (Doehlert type with 3 factors) recording several analyte concentrations as Y.
Then I performed a PCA on Y and I used scores on the first PC (87% of total variance) as new y for a linear regression model with my experimental coded settings as X.
Now I need to perform a leave-one-out cross validation removing each object before perform the PCA on the new "training set", then create the regression model on the scores as I did before, predict the score value for the observation in the "test set" and calculate the error in prediction comparing the predicted score and the score obtained by the projection of the object in the test set in the space of the previous PCA. So repeated n times (with n the number of point of my experimental design).
I'd like to know how can I do it with R.
Do the calculations e.g. by prcomp and then lm. For that you need to apply the PCA model returned by prcomp to new data. This needs two (or three) steps:
Center the new data with the same center that was calculated by prcomp
Scale the new data with the same scaling vector that was calculated by prcomp
Apply the rotation calculated by prcomp
The first two steps are done by scale, using the $center and $scale elements of the prcomp object. You then matrix multiply your data by $rotation [, components.to.use]
You can easily check whether your reconstruction of the PCA scores calculation by calculating scores for the data you input to prcomp and comparing the results with the $x element of the PCA model returned by prcomp.
Edit in the light of the comment:
If the purpose of the CV is calculating some kind of error, then you can choose between calculating error of the predicted scores y (which is how I understand you) and calculating error of the Y: the PCA lets you also go backwards and predict the original variates from scores. This is easy because the loadings ($rotation) are orthogonal, so the inverse is just the transpose.
Thus, the prediction in original Y space is scores %*% t (pca$rotation), which is faster calculated by tcrossprod (scores, pca$rotation).
There is also R library pls (Partial Least Squares), which has tools for PCR (Principal Component Regression)

Resources