Interpreting SIMPLS regression results in R

I have run a SIMPLS regression in R but I am not sure how to interpret the results. This is my function call:
yarn.simpls<-mvr(Pcubes~X1+X2+X3,data=dtj,validation="CV",method="simpls")
and these are the results from
summary(yarn.simpls)
X dimension: 33471 3
Y dimension: 33471 1
Fit method: simpls
Number of components considered: 3
VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps
CV          0.5729   0.4449   0.4263   0.4175
adjCV       0.5729   0.4449   0.4263   0.4175

TRAINING: % variance explained
        1 comps  2 comps  3 comps
X         86.77    97.67      100
Pcubes    39.74    44.72       47
What I would like to know is: what are my coefficients? Are they the adjCV row under VALIDATION: RMSEP? And is TRAINING: % variance explained something like the significance of the variables? I just want to make sure I interpret the results correctly.

The % variance explained describes how much of the variation in the predictors (X) and in the response (Pcubes) is captured as components are added, so it can be read as the relative ability of each number of components to summarise the information in your data; it is not a significance measure for the individual variables.
CV and adjCV are the cross-validated and bias-adjusted values of the root mean squared error of prediction (RMSEP), which tell you how well the model with each number of components predicts the outcome variable; lower is better. In your case most of the improvement comes from the first component, with only small additional gains from the second and third.
If you want the regression coefficients for the underlying variables, use coef(yarn.simpls). These are not the CV/adjCV rows (those are prediction errors); coef() gives the coefficients of the fitted model for a chosen number of components.
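For example, a minimal sketch using the pls package the question already uses (RMSEP() and explvar() are standard pls helpers; ncomp = 1:3 asks for the coefficients at each number of components):
library(pls)
# regression coefficients of X1, X2, X3 for the 1-, 2- and 3-component models
coef(yarn.simpls, ncomp = 1:3, intercept = TRUE)
# cross-validated prediction error per number of components (the CV/adjCV rows above)
RMSEP(yarn.simpls)
plot(RMSEP(yarn.simpls))   # see where the error stops improving
# % of X variance captured by each component
explvar(yarn.simpls)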

Related

Structural Topic Model (STM): How to determine the optimal number of topics

I am trying topic extraction with STM and have a question about how to determine the optimal number of topics.
kResult <- searchK(out$documents, out$vocab, K=c(7,8,9,10), prevalence=~rating+s(day), data=meta)
kResult$results
plot(kResult)
The searchK function outputs numerical values and graphs, but I don't know how to determine the optimal number of topics from these results.
> kResult$results
K exclus semcoh heldout residual bound lbound em.its
1 7 8.937433 -52.95924 -7.80857 9.328384 -23391733 -23391725 17
2 8 9.090138 -58.20191 -7.793394 8.950438 -23337625 -23337614 20
3 9 9.168978 -61.09091 -7.781923 8.710382 -23296459 -23296447 25
4 10 9.256421 -61.51863 -7.764806 8.504863 -23247891 -23247876 55
I read the paper, but could not understand what the following values represent:
exclus: Exclusivity of each model.
semcoh: Semantic coherence of each model.
heldout: Heldout likelihood for each model.
residual: Residual for each model.
bound: Bound for each model.
lbound: lbound for each model.
em.its: Total number of EM iterations used in fitting the model.
Also, I don't know what each of the graphs produced by plot(kResult) represents.
[plot(kResult) output not shown]
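As an illustration of how these diagnostics are typically weighed against each other, here is a minimal sketch assuming the kResult object produced by searchK above (higher semantic coherence, higher exclusivity and higher held-out likelihood are all desirable, so look for a K that does reasonably well on all of them):
# extract the diagnostics; unlist() in case the columns are stored as lists
res     <- kResult$results
K       <- unlist(res$K)
semcoh  <- unlist(res$semcoh)
exclus  <- unlist(res$exclus)
heldout <- unlist(res$heldout)
# trade-off plot: semantic coherence vs. exclusivity, labelled by K
plot(semcoh, exclus, type = "n", xlab = "Semantic coherence", ylab = "Exclusivity")
text(semcoh, exclus, labels = K)
# held-out likelihood by K: prefer a K beyond which it stops improving
plot(K, heldout, type = "b", xlab = "Number of topics (K)", ylab = "Held-out likelihood")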

Continuous Boyce Index (CBI) in R

I am predicting species' environmental suitability scores. To do so, I use the caret package and run classification models. The literature advocates the continuous Boyce index (CBI) as a performance measure for model reliability (see 1). I currently tune the models to maximize AUC, not the CBI. Nonetheless, I obtain perfect CBI scores (a correlation of 1) for both the training and the testing data, using ecospat::ecospat.boyce as well as adamlilith/enmSdm::contBoyce2x available via rdrr.io.
I am not sure whether I am making a mistake and passing the wrong vectors. The relevant logistic regression output looks something like this:
head(lg$pred)
pred obs absence presence rowIndex parameter Resample
1 presence absence 0.4144518 0.5855482 1 none Fold1
2 presence presence 0.4402172 0.5597828 2 none Fold1
3 presence presence 0.3647270 0.6352730 3 none Fold1
4 absence absence 0.7154779 0.2845221 6 none Fold1
5 presence presence 0.1574952 0.8425048 9 none Fold1
6 presence presence 0.0146231 0.9853769 10 none Fold1
I call the functions as follows
ecospat::ecospat.boyce
fit = A vector or Raster-Layer containing the predicted suitability values
obs = A vector containing the predicted suitability values or xy-coordinates (if "fit" is a Raster-Layer) of the validation points (presence records)
ecospat::ecospat.boyce(fit=lg$pred$presence,obs=lg$pred$presence[lg$pred$obs=="presence"],PEplot=T)
adamlilith/enmSdm::contBoyce2x
pres = Numeric vector. Predicted values at presence sites.
bg = Numeric vector. Predicted values at absence/background sites.
contBoyce2x(pres=lg$pred$presence[lg$pred$obs=="presence"],bg=lg$pred$presence[lg$pred$obs=="absence"],graph=T)
Have I understood the input arguments correctly? Is it possible to obtain a perfect CBI score without even trying to tune the models to maximize this metric?
(1) Hirzel, A.H., Le Lay, G., Helfer, V., Randin, C., Guisan, A. (2006). Evaluating the ability of the habitat suitability models to predict species presences. Ecological Modelling 199, 142-152.
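One quick way to sanity-check the metric itself is to feed it made-up data (the fake_pred and fake_pres vectors below are purely illustrative): with random suitability values and randomly drawn "presences" the Boyce index should come out near 0, so obtaining 1 on such input would point to the inputs rather than to the models.
# illustrative sanity check with fake data (assumes ecospat is installed)
set.seed(1)
fake_pred <- runif(1000)             # "suitability" at all evaluated sites
fake_pres <- sample(fake_pred, 200)  # suitability at randomly drawn "presence" sites
ecospat::ecospat.boyce(fit = fake_pred, obs = fake_pres, PEplot = FALSE)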

how to find prediction error after finding prediction model [duplicate]

Given two simple sets of data:
head(training_set)
x y
1 1 2.167512
2 2 4.684017
3 3 3.702477
4 4 9.417312
5 5 9.424831
6 6 13.090983
head(test_set)
x y
1 1 2.068663
2 2 4.162103
3 3 5.080583
4 4 8.366680
5 5 8.344651
I want to fit a linear regression line on the training data, and then use that line (its coefficients) to calculate the "test MSE", i.e. the mean squared error of the residuals when the fitted line is applied to the test data.
model = lm(y~x,data=training_set)
train_MSE = mean(model$residuals^2)
test_MSE = ?
In this case, it is more precise to call it MSPE (mean squared prediction error):
mean((test_set$y - predict.lm(model, test_set)) ^ 2)
This is a more useful measure as all models aim at prediction. We want a model with minimal MSPE.
In practice, if we do have a spare test data set, we can directly compute MSPE as above. However, very often we don't have spare data. In statistics, the leave-one-out cross-validation is an estimate of MSPE from the training dataset.
There are also several other statistics for assessing prediction error, such as Mallows's Cp and AIC.
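For instance, a minimal end-to-end sketch (the simulated training_set and test_set below merely stand in for the data frames shown above):
# simulated stand-ins for the real training and test data
set.seed(42)
training_set <- data.frame(x = 1:20)
training_set$y <- 2 * training_set$x + rnorm(20, sd = 2)
test_set <- data.frame(x = 1:10)
test_set$y <- 2 * test_set$x + rnorm(10, sd = 2)
model <- lm(y ~ x, data = training_set)
train_MSE <- mean(residuals(model)^2)                           # in-sample error
test_MSPE <- mean((test_set$y - predict(model, test_set))^2)    # out-of-sample error
# leave-one-out CV estimate of MSPE from the training data alone,
# using the closed form for linear models (PRESS / n)
loocv_MSPE <- mean((residuals(model) / (1 - hatvalues(model)))^2)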

Does the varIdent function used in lme work fine?

I would be glad if somebody could help me to solve this problem. I have data with repeated measurements design, where we tested a reaction of birds (time.dep) before and after the infection (exper). We have also FL (fuel loads, % of lean body mass), fat score and group (Experimental vs Control) as explanatory variables. I decided to use LME, because distribution of residuals doesn’t deviate from normality. But there is a problem with homogeneity of residuals. Variances of groups “before” and “after” and also between fat levels differ significantly (Fligner-Killeen test, p=0.038 and p=0.01 respectively).
ring group fat time.dep FL exper
1 XZ13125 E 4 0.36 16.295 before
2 XZ13125 E 3 0.32 12.547 after
3 XZ13126 E 3 0.28 7.721 before
4 XZ13127 C 3 0.32 9.157 before
5 XZ13127 C 3 0.40 -1.902 after
6 XZ13129 C 4 0.40 10.382 before
After selecting the random part of the model, which is a random intercept (~1|ring), I applied the weights argument for both "fat" and "exper": weights = varComb(varIdent(form = ~1|fat), varIdent(form = ~1|exper)). Now the plot of standardized residuals vs. fitted values looks better, but I still get a violation of homogeneity for these variables (the same values in the Fligner-Killeen test). What am I doing wrong?
A common trap in lme is that the default is to give raw residuals, i.e. not adjusted for any of the heteroscedasticity (weights) or correlation (correlation) sub-models that may have been used. From ?residuals.lme:
type: an optional character string specifying the type of residuals
to be used. If ‘"response"’, as by default, the “raw”
residuals (observed - fitted) are used; else, if ‘"pearson"’,
the standardized residuals (raw residuals divided by the
corresponding standard errors) are used; else, if
‘"normalized"’, the normalized residuals (standardized
residuals pre-multiplied by the inverse square-root factor of
the estimated error correlation matrix) are used. Partial
matching of arguments is used, so only the first character
needs to be provided.
Thus if you want your residuals to be corrected for heteroscedasticity (as included in the model) you need type="pearson"; if you want them to be corrected for correlation, you need type="normalized".
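For example, here is a hedged sketch of what that looks like in practice; the model formula and the bird_data name are assumptions based on the variables described in the question, not the poster's actual call:
library(nlme)
# assumed model along the lines described: random intercept per bird,
# separate residual variances per fat score and per experimental phase
m1 <- lme(time.dep ~ exper * group + FL,
          random = ~ 1 | ring,
          weights = varComb(varIdent(form = ~ 1 | fat),
                            varIdent(form = ~ 1 | exper)),
          data = bird_data)
r_raw <- residuals(m1)                    # default: raw residuals, ignoring the weights
r_std <- residuals(m1, type = "pearson")  # standardized: divided by the fitted SDs
plot(fitted(m1), r_std)                   # spread should now look roughly even
fligner.test(r_std, bird_data$exper)      # re-check homogeneity on the corrected residuals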
