PLS coefficients with r - r

I'm building a PLS model using the packages "pls" and "ChemometricsWithR". The model fits, but I have a problem: I ran a leave-one-out validation, and when I ask for the coefficients I see only one equation (I suppose it is the average of all the equations developed during the leave-one-out validation).
Is there a way to see all "n" equations (where n is the number of observations in my matrix) with all the slope coefficients?
This is the model I used: mod2<-plsr(SH_uve~matrix_uve,ncomp=11, data=dataset_uve, validation="LOO",jackknife = TRUE)

This would be easier to answer if you gave more information, such as how you are calling the functions. Based on what you said, I'm assuming you are using the functions crossval() and PCA() from the packages "pls" and "ChemometricsWithR" respectively. I'm not familiar with these functions, but the documentation states that coefficients is "(only if jackknife is TRUE) an array with the jackknifed regression coefficients. The dimensions correspond to the predictors, responses, number of components, and segments, respectively". So I would say make sure jackknife=TRUE and that you are specifying the correct number of segments in crossval(). If you are using different functions, you should edit your question and add the relevant information.

OK, I found the solution.
The model I used is:
mod2<-plsr(SH_uve~matrix_uve,ncomp=11,data=dataset_uve,validation="LOO",jackknife = TRUE)
The jackknifed coefficients are stored inside the mod2 object. I extracted them with the command:
coefficients<-mod2$validation$coefficients[,,11,]
This gave me the coefficient matrix for all the equations used in the leave-one-out cross-validation.
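For readers without the original data, the same extraction can be sketched on the yarn dataset that ships with the pls package. This is only an illustration under assumptions: the dataset, the choice of 3 components, and the object names (mod, jack, loo_coefs) are not from the question.

```r
library(pls)

# Example data bundled with pls (an assumption; the question used its own data)
data(yarn)

# Leave-one-out cross-validation, keeping the per-segment (jackknife) coefficients
mod <- plsr(density ~ NIR, ncomp = 3, data = yarn,
            validation = "LOO", jackknife = TRUE)

# 4-d array: predictors x responses x components x segments
jack <- mod$validation$coefficients
dim(jack)

# Coefficients of the 3-component model, one column per left-out observation
loo_coefs <- jack[, 1, 3, ]

# Transposed view: one row per leave-one-out fit, one column per predictor
loo_by_fit <- t(loo_coefs)
```

With leave-one-out validation each segment holds a single observation, so the last dimension of the array has as many entries as there are rows in the data.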

Related

Initial parameters in nonlinear regression in R

I want to learn how to do nonlinear regression in R. I managed to learn the basics of the nls function, but as we know, it's crucial in nonlinear regression to use good initial parameters. I tried to figure out how the selfStart and getInitial functions work but failed. The documentation is very scarce and not very useful. I wanted to learn these functions via simple simulated data. I simulated data from a logistic model:
n<-100 #Number of observations
d<-10000 #our parameters
b<--2
e<-50
set.seed(n)
X<-rnorm(n, -e/b, 2) #Thanks to it we'll have many observations near the point where logistic function grows the faster
Y<-d/(1+exp(b*X+e))+rnorm(n, 0, 200) #I simulate data
Now I want to do regression with the function f(x)=d/(1+exp(b*x+e)), but I don't know how to use selfStart or getInitial. Could you help me? But please, don't tell me about SSlogis. I'm aware it's a function designed to find initial parameters in logistic regression, but it seems it only works in regression with one explanatory variable, and I'd like to learn how to do logistic regression with more than one explanatory variable, and even how to do general nonlinear regression with a function that I defined myself.
I will be very grateful for your help.
I don't know why the calculation of good initial parameters fails in R. The aim of my answer is to provide a method to find good enough initial parameters.
Note that a non-iterative method exists which doesn't require initial parameters. The principle is explained in this paper, pp.37-46 : https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
A simplified version is shown below.
If the results are not accurate enough, they can be used as initial parameters in the usual non-linear regression software, such as R.
A numerical example is shown below. Usually the number of points is much higher. Here it is deliberately low to make checking easier when one edits the code and tests it.
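Separately from the non-iterative method described above, the selfStart mechanism asked about in the question can be sketched as follows. This is a minimal sketch under assumptions, not a definitive recipe: the names logisModel, logisInit and SSmylogis are made up for illustration, and the initializer uses a simple linearization (taking d as roughly max(y), then regressing log(d/y - 1) on x over the central part of the curve) rather than anything from the paper cited above.

```r
# Simulated data, as in the question
n <- 100
d <- 10000; b <- -2; e <- 50
set.seed(n)
X <- rnorm(n, -e / b, 2)
Y <- d / (1 + exp(b * X + e)) + rnorm(n, 0, 200)
dat <- data.frame(X = X, Y = Y)

# User-defined model function (hypothetical name)
logisModel <- function(x, d, b, e) d / (1 + exp(b * x + e))

# Initializer: receives the matched call, the data and the left-hand side
logisInit <- function(mCall, data, LHS, ...) {
  xy <- sortedXyData(mCall[["x"]], LHS, data)
  d0 <- max(xy$y)                              # rough guess for the asymptote
  ok <- xy$y > 0.1 * d0 & xy$y < 0.9 * d0      # keep the central, informative band
  z  <- log(d0 / xy$y[ok] - 1)                 # linearized: z = b * x + e
  lin <- coef(lm(z ~ xy$x[ok]))
  value <- c(d0, lin[[2]], lin[[1]])
  # Name the start values after the parameter names used in the nls formula
  names(value) <- as.character(mCall[c("d", "b", "e")])
  value
}

# Bundle model and initializer into a self-starting model
SSmylogis <- selfStart(logisModel, initial = logisInit,
                       parameters = c("d", "b", "e"))

# Inspect the automatic starting values, then fit without supplying 'start'
init <- getInitial(Y ~ SSmylogis(X, d, b, e), data = dat)
fit  <- nls(Y ~ SSmylogis(X, d, b, e), data = dat)
coef(fit)
```

Because the model is passed as a plain function, nls falls back on numerical derivatives. The same pattern extends to several explanatory variables by giving the model function more input arguments and computing the start values from them inside the initializer.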

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova

Which methods can I use to calculate correlation among words in quanteda?

My question is a continuation of this.
After cleaning my text data and visualizing it using a wordcloud, I want to see which words are correlated to each other. Here comes the problem:
quanteda has the function textstat_simil, but it says similarity. So, are "similarity" and "correlation" in this case the same thing? (Is distance also related?)
Moreover, my dfm looks like a binary matrix. Is phi correlation (from the chi-squared statistic) more appropriate in this case? Can I calculate this via quanteda?
Do you have any other material, apart from the source code on GitHub, that explains in more detail the methods used to calculate similarity or distance measures? (I couldn't understand it from this code, sorry.)
Thanks for your patience!
To compute Pearson’s product-moment correlations among features, you would use:
textstat_simil(x, method = "correlation", margin = "features")
The documentation makes this pretty clear, and the correlation method is the default.
Pearson’s correlation would not be the most appropriate for binary data, and we currently do not implement Spearman’s or other correlation methods more appropriate for categorical or ordinal data. However you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use the stats::cor() methods, which include Spearman’s.
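The coercion route can be sketched in base R. The small count matrix below is a made-up stand-in for the result of as.matrix() on a dfm (documents in rows, features in columns); its values and feature names are assumptions for illustration only.

```r
# Hypothetical stand-in for as.matrix(my_dfm): 5 documents x 4 features
m <- matrix(c(1, 0, 2, 0,
              0, 1, 1, 3,
              2, 0, 3, 0,
              0, 2, 0, 1,
              1, 1, 2, 2),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("tax", "spend", "econom", "health")))

# cor() correlates columns, so this gives feature-by-feature correlations
feat_spearman <- cor(m, method = "spearman")
feat_pearson  <- cor(m, method = "pearson")

round(feat_spearman, 2)
```

For a large dfm the dense matrix can be very big, so it may be worth trimming the features before coercing.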
As for the last question, we use the standard implementation of these measures. If you want more clarity on what they mean, I suggest asking on Cross-Validated.

R - How to get one "summary" prediction map instead for 5 when using 5-fold cross-validation in maxent model?

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent function (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) in R, through the dismo package. I have used the argument "replicates = 5", which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics is produced, including the prediction maps. In R, an html file is also produced, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package. When I do this, I get 5 maps, because of the 5-fold cross-validation setting. However (and this is the problem), I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that result from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
I may be a couple of years late, but you could do something like this:
xm <- maxent(predictors, pres_train) # maxent model for one partition
px <- predict(predictors, xm, ext=ext, progress='') # prediction from model xm
px2 <- predict(predictors, xm2, ext=ext, progress='') # prediction from model xm2, fitted the same way on another partition
models <- stack(px, px2) # stack the predictions from all the models
final_map <- mean(models) # cell-wise mean of all the predictions
plot(final_map) # plot the averaged map
Here xm, xm2, ... are the maxent models for each partition in the cross-validation, and px, px2, ... are the corresponding predicted maps.

what is the difference between lmFit and rlm

I want to use robust limma on my microarray data and R's user guide says rlm is the correct function to use according to:
http://rss.acs.unt.edu/Rdoc/library/limma/html/mrlm.html
I currently have:
lmFit(ExpressionMatrix, design, method = "robust", na.omit=T)
I can see that I chose the method to be robust. Does that mean that rlm will be called by this lmFit? And if I don't want it to be robust, what method should I use?
The help page says:
The function mrlm is used if method="robust".
And then goes on:
If method="ls", then gls.series is used if a correlation structure has been specified, i.e., if ndups>1 or block is non-null and correlation is different from zero. If method="ls" and there is no correlation structure, lm.series is used.
If you follow the links from the help page for lmFit (06.LinearModels), you find:
Fitting Models
The main function for model fitting is lmFit. This is recommended interface for most users. lmFit produces a fitted model object of class MArrayLM containing coefficients, standard errors and residual standard errors for each gene. lmFit calls one of the following three functions to do the actual computations:
lm.series: Straightforward least squares fitting of a linear model for each gene.
mrlm: An alternative to lm.series using robust regression as implemented by the rlm function in the MASS package.
gls.series: Generalized least squares taking into account correlations between duplicate spots (i.e., replicate spots on the same array) or related arrays. The function duplicateCorrelation is used to estimate the inter-duplicate or inter-block correlation before using gls.series.
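To see what the robust option changes, here is a single-gene illustration that compares ordinary least squares (lm) with rlm from the MASS package, the function that mrlm is described as using. It is a sketch with made-up expression values and a hypothetical two-group design, not limma itself.

```r
library(MASS)

# Made-up log-expression values for one gene: 6 control and 6 treated arrays.
# The last treated array is an outlier.
group <- rep(c(0, 1), each = 6)
expr  <- c(5.0, 5.1, 4.9, 5.2, 4.8, 5.0,
           6.0, 6.1, 5.9, 6.2, 5.8, 9.0)

# Ordinary least squares: the outlier pulls the estimated group difference up
fit_ls  <- lm(expr ~ group)

# Robust M-estimation (Huber weights by default): the outlier is downweighted
fit_rob <- rlm(expr ~ group)

c(ls = coef(fit_ls)[["group"]], robust = coef(fit_rob)[["group"]])
```

The non-outlying treated arrays sit about 1 unit above the controls, and the robust estimate stays closer to that value than the least squares one. In lmFit, the non-robust fit corresponds to method="ls", as the help text quoted above states.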

Resources