Get LD1 values when applying LDA with CV = TRUE in R

We wanted to see whether the recorded EEG data could serve as a good classifier for assigning participants to group A or B.
I used the LDA function in the MASS package in R for a linear discriminant analysis. To make sure the model would be generalizable, we added a leave-one-out cross validation (CV = TRUE). All of this worked very well.
Now I would like to visualise this classification based on the discriminant scores: more specifically, a plot with both groups on the x-axis and the discriminant scores on the y-axis.
However, when applying CV = TRUE, it is not possible to use the predict function on the LDA object to generate LD1 values. No coefficients of linear discriminants are returned either, which I could otherwise have used to calculate the discriminant scores myself (based on these links: What are "coefficients of linear discriminants" in LDA? and Discriminant functions).
When I remove CV = TRUE, I can use the predict function afterwards and get the LD1 values (i.e. the discriminant scores) needed for the visualisation. However, my classification results differ depending on whether or not I apply the leave-one-out cross-validation: with CV = TRUE, fewer participants are classified into the correct group. So if I plotted the data from the analysis without leave-one-out cross-validation, the plot would not correspond to the results I describe.
My question is whether there is a way to get the LD1 values, or the coefficients of linear discriminants, when applying CV = TRUE in the lda function of the MASS package.
Or does someone know another way to produce a plot like the one described above?
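One way around this (a sketch, not from the original post): run the leave-one-out loop by hand, refitting lda() without CV = TRUE on all rows but one and calling predict() on the held-out row, which yields both its cross-validated class and its LD1 score. The data frame eeg with a factor group and five numeric predictors below is simulated purely as a stand-in.
library(MASS)

## Simulated stand-in data: a factor `group` plus five numeric EEG predictors
set.seed(1)
eeg <- data.frame(group = factor(rep(c("A", "B"), each = 20)),
                  matrix(rnorm(40 * 5), nrow = 40))

n <- nrow(eeg)
loo_ld1   <- numeric(n)
loo_class <- character(n)

for (i in seq_len(n)) {
  fit_i <- lda(group ~ ., data = eeg[-i, ])                  # refit without row i
  pred  <- predict(fit_i, newdata = eeg[i, , drop = FALSE])  # score the held-out row
  loo_ld1[i]   <- pred$x[, 1]
  loo_class[i] <- as.character(pred$class)
}

## loo_class should closely reproduce the classification from lda(..., CV = TRUE),
## and loo_ld1 gives one cross-validated discriminant score per participant.
## Caveat: the sign of LD1 is arbitrary and can flip between refits; align it
## (e.g. against a full-data fit) before plotting if that happens.
boxplot(loo_ld1 ~ eeg$group, xlab = "Group", ylab = "LD1 (leave-one-out)")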

Related

Ordinal logistic regression (or Beta regression) with a LASSO regularization in R?

I was wondering if someone knows of an R package that would allow me to fit an ordinal logistic regression with LASSO regularization or, alternatively, a beta regression, also with the LASSO. And if you also know of a nice tutorial to help me code that in R (with appropriate cross-validation), that would be even better!
Some context: My response variable is a satisfaction score between 0 and 10 (actually, values lie between 2 and 10) so I can model it with a Beta regression or I can convert its values into ranked categories. My interest is to identify important variables explaining this score but as I have too many potential explanatory variables (p = 12) compared to my sample size (n = 105), I need to use a penalized regression method for model selection, hence my interest in the LASSO.
The ordinalNet package does this. There's a paper with a worked example here:
https://www.jstatsoft.org/article/download/v099i06/1440
Also the glmnetcr package: https://cran.r-project.org/web/packages/glmnetcr/vignettes/glmnetcr.pdf
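A minimal ordinalNet sketch under the setup described in the question (12 predictors, n = 105, the score treated as ranked categories); the data here are simulated stand-ins, and ordinalNetTune is the cross-validation helper described in the linked paper.
library(ordinalNet)

## Simulated stand-in data: 12 predictors, a satisfaction score between 2 and 10
set.seed(1)
dat <- data.frame(matrix(rnorm(105 * 12), nrow = 105,
                         dimnames = list(NULL, paste0("x", 1:12))))
dat$score <- sample(2:10, 105, replace = TRUE)

x <- as.matrix(dat[, 1:12])
y <- factor(dat$score, ordered = TRUE)     # ranked categories of the score

## Cumulative-logit model with a LASSO penalty (alpha = 1)
fit <- ordinalNet(x, y, family = "cumulative", link = "logit", alpha = 1)
summary(fit)                               # fits along the lambda path
coef(fit)                                  # penalised coefficients

## Cross-validated tuning of lambda (see the JSS paper linked above for details)
tuned <- ordinalNetTune(x, y, family = "cumulative", link = "logit", alpha = 1)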

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I have hit a block: I do not know how to test and compare the different models based on how well they predict the scores in the hold-out set. In the guide linked above, the author uses confusionMatrix to calculate the specificity and accuracy of each model after building a predict.train object that applies the newly built models to the hold-out data (referred to as test in the link). However, confusionMatrix can only be applied to classification models, where the outcome (or response) is categorical (as far as my research indicates; please correct me if this is wrong).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create prediction objects using the recently built models and the hold-out data, then calculate R-squared and RMSE values using caret's R2 and RMSE functions. But I'm not sure whether that approach is the best way to compare the models and pick the final one.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold on my outcome, wherein any genetic sequence with a fluorescence value over 100 is considered useful, while sequences scoring under 100 are not. This would allow me to use confusionMatrix. But I'm not sure how to implement these two classes in my outcome variable in my R code, and I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and their ranges are not normally distributed.
Here is the code I have been using thus far:
library(caret)

## Read the data and sort by fluorescence (the outcome); read.csv assumes a comma-separated file
data <- read.csv("mydata.csv")
sorted_Data <- data[order(data$fluorescence, decreasing = TRUE), ]

## 80/20 split into a training-testing set and a hold-out set
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p = splitprob, list = FALSE)
holdoutset  <- sorted_Data[-traintestindex, ]
trainingset <- sorted_Data[traintestindex, ]
traindata <- trainingset[c("x1", "x2", "x3", "x4", "x5", "fluorescence")]

## Repeated 20-fold cross-validation for model tuning
cvCtrl <- trainControl(method = "repeatedcv", number = 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence ~ ., data = traindata, method = "glmStepAIC",
                         preProc = c("center", "scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence ~ ., data = traindata, method = "rlm",
                   preProc = c("center", "scale"), trControl = cvCtrl)

## Predict on the hold-out set through the train objects so the preProc
## centering/scaling is applied (predict.lm on $finalModel would skip it)
pred_glmStepAIC <- predict(modelglmStepAIC, newdata = holdoutset)
pred_rlm        <- predict(model_rlm, newdata = holdoutset)

## Hold-out performance of each model
glmStepAIC_r2   <- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse <- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2   <- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse <- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by caret are RMSE, MAE and the squared correlation between fitted and observed values (called R2). See more info here: https://topepo.github.io/caret/measuring-performance.html
At least in the time series regression context, RMSE is the standard measure of out-of-sample performance for regression models.
I would advise against discretising a continuous outcome variable, because you are essentially throwing away information by discretising.
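One convenient helper for the hold-out comparison is caret's postResample(), which returns RMSE, R-squared and MAE in a single call; a small sketch reusing the objects from the question's code:
## Predictions through the train objects, so preProc centering/scaling is applied
pred_glmStepAIC <- predict(modelglmStepAIC, newdata = holdoutset)
pred_rlm        <- predict(model_rlm, newdata = holdoutset)

## One row of hold-out RMSE, R-squared and MAE per candidate model
rbind(glmStepAIC = postResample(pred_glmStepAIC, holdoutset$fluorescence),
      rlm        = postResample(pred_rlm, holdoutset$fluorescence))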

How to determine the coefficient of svm classifiers for linear kernels in R?

I am using the kernlab package for SVM in R. I am using the linear kernel so that I can directly check the importance of the feature vectors, that is, my variables. Using the coefficients of these feature vectors, I need to calculate the weight of the various factors in the model, so that the linear separating plane the SVM draws in my feature space can be evaluated. Basically I want to calculate the w in transpose(w)*x + b. Could someone please suggest what to do? I used the fields alpha, b and alpha index and tried to calculate the weight vector, but to verify my calculation I predicted the score of a test sample, and it did not match the value returned by the built-in predict function. How do I calculate the weights?
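For a linear (vanilladot) kernel in kernlab, the usual recipe is to sum the support vectors weighted by their coefficients, w = sum_i alpha_i * y_i * x_i. A sketch on stand-in data (two iris species), with scaling switched off so the weights refer to the original feature space:
library(kernlab)

## Stand-in data: a binary problem built from two iris species
x <- as.matrix(iris[iris$Species != "setosa", 1:4])
y <- factor(iris$Species[iris$Species != "setosa"])

svp <- ksvm(x, y, kernel = "vanilladot", scaled = FALSE)

## w = sum over support vectors of (alpha_i * y_i) * x_i; b(svp) is the offset
w <- colSums(coef(svp)[[1]] * xmatrix(svp)[[1]])
b <- b(svp)

## Manual decision values f(x) = w'x - b should match the built-in ones
## (possibly up to sign, depending on which factor level is treated as positive)
dec_manual  <- as.vector(x %*% w - b)
dec_builtin <- as.vector(predict(svp, x, type = "decision"))
all.equal(dec_manual, dec_builtin)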

R, Multinomial Regression: How to Find Conditional Probabilities?

In R, given a multinomial logit regression, I need to obtain the conditional probabilities of the response classes given some values of the predictors.
For example, using the function multinom from the package nnet, imagine having computed fit <- multinom(response ~ predictor). From fit, how can I obtain the probability weights of the different response classes, given a certain value of the predictor?
I thought of using something like predict(fit, newdata, type = ???), but I have no idea how to continue.
I found a possible solution: predict(fit, newdata = predictor, "probs"). In this way, I was able to find the probability weights for all the values of the predictor: every row corresponds to a certain value.
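A small worked sketch of that solution (iris stands in for the real data; note that newdata should be a data frame whose column names match the predictors in the formula):
library(nnet)

## iris as stand-in data: predict species probabilities from sepal length
fit <- multinom(Species ~ Sepal.Length, data = iris)

## One row of class probabilities per requested predictor value
newvals <- data.frame(Sepal.Length = c(5, 6, 7))
predict(fit, newdata = newvals, type = "probs")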

Cross validation of PCA+lm

I'm a chemist and about a year ago I decided to learn more about chemometrics.
I'm working on a problem that I don't know how to solve:
I performed an experimental design (Doehlert type with 3 factors), recording several analyte concentrations as Y.
Then I performed a PCA on Y and used the scores on the first PC (87% of total variance) as the new y for a linear regression model with my experimental coded settings as X.
Now I need to perform a leave-one-out cross-validation: remove each object before performing the PCA on the new "training set", create the regression model on the scores as before, predict the score value for the observation in the "test set", and calculate the prediction error by comparing the predicted score with the score obtained by projecting the test-set object into the space of that PCA. This is repeated n times (with n the number of points in my experimental design).
I'd like to know how I can do this with R.
Do the calculations e.g. by prcomp and then lm. For that you need to apply the PCA model returned by prcomp to new data. This needs two (or three) steps:
Center the new data with the same center that was calculated by prcomp
Scale the new data with the same scaling vector that was calculated by prcomp
Apply the rotation calculated by prcomp
The first two steps are done by scale, using the $center and $scale elements of the prcomp object. You then matrix-multiply your data by $rotation[, components.to.use].
You can easily check your reconstruction of the PCA score calculation by computing scores for the data you fed into prcomp and comparing the results with the $x element of the PCA model returned by prcomp.
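A sketch of the complete leave-one-out loop built from these steps; Y (analyte concentrations) and X (coded design settings) are simulated stand-ins for the 13 points of a three-factor Doehlert design:
## Simulated stand-ins: 13 design points, 3 coded factors, 6 analyte concentrations
set.seed(1)
X <- as.data.frame(matrix(runif(13 * 3, -1, 1), ncol = 3,
                          dimnames = list(NULL, c("A", "B", "C"))))
Y <- as.matrix(X) %*% matrix(rnorm(3 * 6), nrow = 3) + matrix(rnorm(13 * 6, sd = 0.2), nrow = 13)

n <- nrow(Y)
pred_score <- obs_score <- numeric(n)

for (i in seq_len(n)) {
  pca     <- prcomp(Y[-i, ], center = TRUE, scale. = TRUE)    # PCA on the training set only
  train_i <- data.frame(X[-i, , drop = FALSE], pc1 = pca$x[, 1])
  reg     <- lm(pc1 ~ ., data = train_i)                      # regression of PC1 scores on the coded settings

  ## project the left-out object into the training PCA space: center, scale, rotate
  obs_score[i]  <- scale(Y[i, , drop = FALSE], center = pca$center,
                         scale = pca$scale) %*% pca$rotation[, 1]
  pred_score[i] <- predict(reg, newdata = X[i, , drop = FALSE])
}

## leave-one-out prediction error on the score scale
sqrt(mean((pred_score - obs_score)^2))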
Edit in the light of the comment:
If the purpose of the CV is calculating some kind of error, then you can choose between calculating error of the predicted scores y (which is how I understand you) and calculating error of the Y: the PCA lets you also go backwards and predict the original variates from scores. This is easy because the loadings ($rotation) are orthogonal, so the inverse is just the transpose.
Thus, the prediction in original Y space is scores %*% t(pca$rotation), which is faster calculated by tcrossprod(scores, pca$rotation).
There is also the R library pls (Partial Least Squares), which has tools for PCR (Principal Component Regression).
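For reference, a generic PCR call with built-in leave-one-out validation, using the yarn example data shipped with pls (note that pcr() runs the PCA on the predictor block, not on the responses as in the question):
library(pls)
data(yarn)    # NIR spectra and yarn density, shipped with the pls package

## PCR of density on the NIR spectra, with leave-one-out cross-validation
fit <- pcr(density ~ NIR, ncomp = 6, data = yarn, validation = "LOO")
summary(fit)  # cross-validated RMSEP per number of components
RMSEP(fit)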
