Cross validation of PCA+lm - r

I'm a chemist, and about a year ago I decided to learn more about chemometrics.
I'm working on a problem that I don't know how to solve:
I ran an experimental design (Doehlert type with 3 factors), recording several analyte concentrations as Y.
Then I performed a PCA on Y and used the scores on the first PC (87% of total variance) as the new y for a linear regression model with my coded experimental settings as X.
Now I need to perform a leave-one-out cross-validation. For each object in turn, I want to remove it, perform the PCA on the remaining "training set", and fit the regression model on the scores as before. Then I want to predict the score for the left-out observation (the "test set") and compute the prediction error by comparing the predicted score with the score obtained by projecting that observation into the space of the training-set PCA. This is repeated n times, where n is the number of points in my experimental design.
I'd like to know how I can do this in R.

Do the calculations e.g. by prcomp and then lm. For that you need to apply the PCA model returned by prcomp to new data. This needs two (or three) steps:
Center the new data with the same center that was calculated by prcomp
Scale the new data with the same scaling vector that was calculated by prcomp
Apply the rotation calculated by prcomp
The first two steps are done by scale, using the $center and $scale elements of the prcomp object. You then matrix multiply your data by $rotation [, components.to.use]
You can easily check whether your reconstruction of the PCA scores calculation is correct by calculating scores for the data you passed to prcomp and comparing the results with the $x element of the PCA model returned by prcomp.
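A minimal sketch of these steps, assuming simulated stand-in data (the matrix Y and the helper name project are made up for illustration, not taken from the question). It also compares the manual result with predict(), which for prcomp objects should do the same centering, scaling and rotation:

```r
# Simulated stand-in data: 15 runs, 4 responses (not the asker's data)
set.seed(1)
Y <- matrix(rnorm(15 * 4), nrow = 15, ncol = 4)
colnames(Y) <- paste0("analyte", 1:4)

pca <- prcomp(Y, center = TRUE, scale. = TRUE)

# Hypothetical helper: apply a fitted prcomp model to (possibly new) data
project <- function(pca, newdata, components = seq_len(ncol(pca$rotation))) {
  # steps 1 and 2: same center and scaling vector as the fitted PCA
  standardized <- scale(newdata, center = pca$center, scale = pca$scale)
  # step 3: rotate onto the chosen components
  standardized %*% pca$rotation[, components, drop = FALSE]
}

manual  <- project(pca, Y)
builtin <- predict(pca, newdata = Y)

# Both comparisons should report TRUE
all.equal(manual, pca$x, check.attributes = FALSE)
all.equal(manual, builtin, check.attributes = FALSE)
```

If the check against $x fails, the usual suspect is a mismatch between the center/scale arguments used here and the ones given to prcomp.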
Edit in the light of the comment:
If the purpose of the CV is calculating some kind of error, then you can choose between calculating the error of the predicted scores y (which is how I understand you) and calculating the error of Y: the PCA also lets you go backwards and predict the original variates from the scores. This is easy because the loadings ($rotation) are orthogonal, so the inverse is just the transpose.
Thus, the prediction in Y space is scores %*% t (pca$rotation), which is calculated faster by tcrossprod (scores, pca$rotation). If you keep only some components, use the matching columns of $rotation, and remember that the result is still centered (and scaled), so undo the scaling and add the center back to get the original units.
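Putting the pieces together, a leave-one-out loop could look roughly like the sketch below. Everything about the data is an assumption: the settings are random numbers rather than a real Doehlert matrix, and the names x1, x2, x3 and the first-order model formula are placeholders for whatever model was fitted originally.

```r
# Simulated stand-in for the design: 13 runs, 3 coded factors, 5 analytes
set.seed(42)
n <- 13
X <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1), x3 = runif(n, -1, 1))
latent <- 2 * X$x1 - X$x2 + 0.5 * X$x3
Y <- sapply(1:5, function(j) j * latent + rnorm(n, sd = 0.3))
colnames(Y) <- paste0("analyte", 1:5)

score_err <- numeric(n)                  # error in PC1 score space
Y_err <- matrix(NA_real_, n, ncol(Y))    # error in original Y units

for (i in seq_len(n)) {
  # PCA on the training set only
  pca_i <- prcomp(Y[-i, , drop = FALSE], center = TRUE, scale. = TRUE)

  # Regress the PC1 training scores on the coded settings
  train <- data.frame(X[-i, , drop = FALSE], score = pca_i$x[, 1])
  fit_i <- lm(score ~ x1 + x2 + x3, data = train)

  # Predicted score vs. score obtained by projecting the left-out row
  pred_score <- predict(fit_i, newdata = X[i, , drop = FALSE])
  obs_score  <- predict(pca_i, newdata = Y[i, , drop = FALSE])[, 1]
  score_err[i] <- pred_score - obs_score

  # Back-project the predicted score, then undo scaling and centering
  Y_hat <- as.numeric(tcrossprod(matrix(pred_score, nrow = 1),
                                 pca_i$rotation[, 1, drop = FALSE]))
  Y_hat <- Y_hat * pca_i$scale + pca_i$center
  Y_err[i, ] <- Y_hat - Y[i, ]
}

rmsecv_score <- sqrt(mean(score_err^2))
rmsecv_Y <- sqrt(colMeans(Y_err^2))
```

One thing to keep in mind: each fold has its own PCA, so the score-space errors come from slightly different coordinate systems (even the sign of PC1 may flip between folds). The errors in Y units do not have that problem, which is one argument for reporting them as well.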

There is also the R package pls (Partial Least Squares), which has tools for PCR (Principal Component Regression).

Related

How do I extract the principal components' values for all observations using the psych package

I'm performing dimensionality reduction using the psych package. After analyzing the scree plot I decided to use the 9 most important PCs (out of 15 variables) to build a linear model.
My question is, how do I extract the values of the 9 most important PCs for each of the 500 observations I have? Is there any built in function for that, or do I have to manually compute it using the loadings matrix?
Returns eigen values, loadings, and degree of fit for a specified number of components after performing an eigen value decomposition. Essentially, it involves doing a principal components analysis (PCA) on n principal components of a correlation or covariance matrix. Can also display residual correlations. By comparing residual correlations to original correlations, the quality of the reduction in squared correlations is reported. In contrast to princomp, this only returns a subset of the best nfactors. To obtain component loadings more characteristic of factor analysis, the eigen vectors are rescaled by the sqrt of the eigen values.
principal(r, nfactors = 1, residuals = FALSE,rotate="varimax",n.obs=NA, covar=FALSE,
scores=TRUE,missing=FALSE,impute="median",oblique.scores=TRUE,
method="regression",...)
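I think so: with scores=TRUE (the default in the signature above), principal should return the component scores for each observation in the scores element of the result, provided you pass it the raw data rather than a correlation matrix.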

How does one extract hat values and Cook's Distance from an `nlsLM` model object in R?

I'm using the nlsLM function to fit a nonlinear regression. How does one extract the hat values and Cook's Distance from an nlsLM model object?
With objects created using the nls or nlreg functions, I know how to extract the hat values and the Cook's Distance of the observations, but I can't figure out how to get them using nlsLM.
Can anyone help me out on this? Thanks!
So, it's not Cook's Distance or based on hat values, but you can use the function nlsJack in the nlstools package to jackknife your nls model: it removes each point in turn and refits the model to see, roughly speaking, how much the model coefficients change with or without a given observation.
Reproducible example:
library(nlstools)          # provides nlsJack
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)    # response computed at the nominal x values
xs = rnorm(100, xs, 2)     # add noise to each x value
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data=df1, start=c(a=3, b=2, d=-0.5))
plot(nlsJack(nls1))
The plot shows the percentage change in each model coefficient as each individual observation is removed, and it marks points above a certain threshold as "influential". The documentation for nlsJack describes how this threshold is determined:
An observation is empirically defined as influential for one parameter if the difference between the estimate of this parameter with and without the observation exceeds twice the standard error of the estimate divided by sqrt(n). This empirical method assumes a small curvature of the nonlinear model.
My impression so far is that this is a fairly liberal criterion: it tends to mark a lot of points as influential.
nlstools is a pretty useful package overall for diagnosing nls model fits though.
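To make the idea concrete, here is a hand-rolled leave-one-out sketch in base R. It is not nlsJack's actual code: the data are simulated, and the threshold is just a literal reading of the rule quoted above, so treat the flags as illustrative only.

```r
# Simulated data from the same model form, with noise in y (assumption)
set.seed(7)
x <- rep(1:10, times = 3)
y <- 3 + 2 * exp(-0.5 * x) + rnorm(length(x), sd = 0.05)
dat <- data.frame(x, y)

full <- nls(y ~ a + b * exp(d * x), data = dat,
            start = c(a = 3, b = 2, d = -0.5))
est <- coef(full)
se  <- summary(full)$coefficients[, "Std. Error"]
n   <- nrow(dat)

# Refit once per observation, leaving that observation out
loo <- t(sapply(seq_len(n), function(i) {
  coef(nls(y ~ a + b * exp(d * x), data = dat[-i, ], start = est))
}))

# Absolute change in each coefficient when observation i is dropped
abs_diff <- abs(sweep(loo, 2, est))

# Literal reading of the quoted rule: change exceeds 2 * SE / sqrt(n)
influential <- sweep(abs_diff, 2, 2 * se / sqrt(n), FUN = ">")
colSums(influential)   # number of flagged observations per parameter
```

Starting each refit from the full-data estimates keeps the refits quick. With a less well-behaved model, some leave-one-out fits may fail to converge and would need a tryCatch around the nls call.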

How to determine the coefficient of svm classifiers for linear kernels in R?

I am using the kernlab package for SVMs in R. I am using the linear kernel so that I can directly check the importance of my features, that is, my variables. Using the coefficients of these features, I need to calculate the weight of the various factors in the model, so that the linear separating plane that the SVM draws in my feature space can be evaluated. Basically, I want to calculate the w in transpose(w)*x + b. Could someone please suggest how to do this? I used the alpha, b and alphaindex fields and tried to calculate the weight vector logically, but when I tried to verify my calculation by predicting the score of a test sample, the result did not match the value predicted by the built-in predict function. How do I calculate the weights?

Multiple Linear Regression and MSE from R

I have a dataset (found here: https://netfiles.umn.edu/users/nacht001/www/nachtsheim/Kutner/Appendix%20C%20Data%20Sets/APPENC01.txt) and I have done some R coding for linear regression. In the attached dataset the columns are not labeled. I had to label the columns of the dataset and save it as a CSV, and I apologize that I can't get that on here... but the columns I am using are column 3 (age), column 4 (infection), column 5 (culratio), column 10 (census), column 12 (service), and column 9 (region). I named the dataset hospital.
I am supposed to "For each geographic region, regress infection risk (Y) against the predictor variables age, culratio, census, service using a first order regression model." Then I need to find the MSE for each region. This is the code I have.
NE<- subset(hospital, region=="1")
NC<- subset(hospital, region=="2")
S<- subset(hospital, region=="3")
W<- subset(hospital, region=="4")
Then, to fit a first-order linear regression model, I use this basic code for each region:
NE.Model<- lm(infection ~ age + culratio + census + service, data = NE)
summary(NE.Model)
I can get the adjusted R-squared value, but how do I find the MSE from this output?
Moving my comment to an answer. The "errors" or "residuals" are part of the model object, NE.Model$residuals, so getting the mean square error is as easy as that: mean(NE.Model$residuals^2).
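Just as a note, you could do this in fewer steps by fitting a single model in which region (as a factor) is interacted with every predictor, and then calculating the MSE for each region's subset of the residuals. A region fixed effect on its own would not be quite the same thing, because it forces the slopes to be equal across regions.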
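One caveat worth spelling out: mean(residuals^2) divides the error sum of squares by n, whereas regression textbooks usually define MSE as SSE divided by the residual degrees of freedom (n minus the number of estimated coefficients), which is also the square of the "Residual standard error" that summary() prints. The sketch below computes both per region; the data frame is simulated with made-up coefficients, since the real hospital data isn't reproduced here.

```r
# Simulated stand-in for the hospital data (column names follow the question)
set.seed(3)
n <- 100
sim <- data.frame(
  region   = rep(1:4, each = 25),
  age      = rnorm(n, 53, 4),
  culratio = runif(n, 2, 60),
  census   = runif(n, 20, 800),
  service  = runif(n, 5, 80)
)
sim$infection <- 1 + 0.02 * sim$age + 0.05 * sim$culratio +
  0.001 * sim$census + 0.02 * sim$service + rnorm(n, sd = 1)

# One first-order model per region
fits <- lapply(split(sim, sim$region), function(d) {
  lm(infection ~ age + culratio + census + service, data = d)
})

# Two versions of "MSE" for each region
mse_table <- t(sapply(fits, function(m) {
  sse <- sum(residuals(m)^2)
  c(mean_sq_resid = sse / nobs(m),          # mean(residuals^2)
    mse_df        = sse / df.residual(m),   # SSE / (n - p)
    sigma_sq      = summary(m)$sigma^2)     # same as mse_df
}))
mse_table
```

Which version is wanted depends on the course or textbook; for a Kutner-style exercise I would guess the degrees-of-freedom version, but that is an assumption on my part.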

How do I get individual tree probabilities from Random Forests in R?

I'm using the randomForest package in R on a classification problem (outcome is binary).
I want to get the probability output of each one of the trees (to get a prediction interval).
I've set the predict.all=TRUE argument in the predictions, but it gives me a matrix of 800 columns (=the number of trees in my forest) and each of them is a 1 or a 0. How do I get the probability output rather than class?
PS: my node size is 1, which means that this should make sense. However, when I changed the node size to 50, I still got all 0's and 1's and no probabilities.
Here's what I'm doing:
#build model (node size=1)
rf<-randomForest(y~. ,data=train, ntree=800,replace=TRUE, proximity=TRUE, keep.inbag=TRUE)
#get the predictions
#store the predictions from all the trees
all_tree_train<-predict(rf, test, type="prob", predict.all= TRUE)$individual
This gives a matrix of 0's and 1's rather than probabilities.
I realise this question is old, but it might help anyone with a similar question.
If you query the trees for their results, you'll always get the end classifications, which are deterministic given an initialised forest. You can extract the probabilities by setting predict.all=TRUE, as you've done, and averaging the votes across trees.
If you have more than 2 classes, the forest classifies an item 'm' as class 'x' with probability
(number of trees which bin m as x)/(number of trees)
As you only have a binary classification, the fraction of trees voting for class 1 in each row of the prediction matrix (rows are observations, columns are trees) gives you the probability of being in class 1.
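As a small illustration, the sketch below uses a simulated character matrix of votes as a stand-in for the $individual component (so it runs without a fitted forest). The names votes and p_true are made up for the example.

```r
# Stand-in for predict(rf, test, predict.all = TRUE)$individual:
# one row per observation, one column per tree, entries "0" or "1"
set.seed(11)
n_obs   <- 6
n_trees <- 800
p_true  <- seq(0.1, 0.9, length.out = n_obs)   # made-up vote rates
votes <- matrix(
  as.character(rbinom(n_obs * n_trees, 1, rep(p_true, times = n_trees))),
  nrow = n_obs, ncol = n_trees
)

# Fraction of trees voting for class "1", per observation
prob_class1 <- rowMeans(votes == "1")
prob_class1
```

With a real forest the same rowMeans call would be applied to the $individual matrix. Calling predict with type="prob" and without predict.all should give the same forest-level vote fractions directly.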
So the documentation for predict.randomForest states:
If predict.all=TRUE, then the individual component of the returned
object is a character matrix where each column contains the predicted
class by a tree in the forest.
...so it does not appear that it is possible to have a probability returned for each individual tree.
If you want something like a prediction interval for classification, you might try fitting a random forest with many more trees and then generating predictions from many different (random?) subsets of the forest.
One thing you need to be careful of though is that you appear to be feeding your training data to predict.randomForest. This will of course give you biased predictions, unless you use the information from the inbag component of the random forest object to only select trees on which that observation was out of bag.
