caret: Linear Model parameter estimates (mean and s.e.) via LOOCV - r

Is there a way to extract all of the lm parameter estimates from the cross-validation training runs of caret::train() with an lm model?
I have a gist of the R code I used to check this: I access the train() output object directly and pull out the data.frame row indexes used in each cross-validation run. I was wondering whether functions already exist that do this for me, because I would think that 1.) if it were a good or reasonable idea, the functionality would be there, or 2.) if the functionality is not there, my desire to do this may not be a good idea.
As a second part of the question: in the gist, when I compute the mean and standard error of the single linear parameter over all of the cross-validation parameter estimates, the mean of the CV estimates matches the estimate from a linear model fit on the entire training set well, but the standard error from the CV estimates is much smaller than the one from the single lm run on the whole training set. Is that expected, or am I computing or thinking about it wrong?
EDIT: I think the second part can be found by reading this answer.
Thanks in advance,
Matt
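The refit-per-fold idea from the question can be sketched in base R without going through caret at all. This is only an illustration under assumptions: the mtcars data and the formula mpg ~ wt are stand-ins, and the leave-one-out loop is written by hand (in a caret train object the per-resample training rows appear to be stored in the control$index element, which could replace the manual loop). The last lines also compute the jackknife standard error, which rescales the spread of the leave-one-out estimates and is one way to see why their raw standard deviation is so much smaller than the lm standard error: the n fits share almost all of their data.

```r
# Leave-one-out refits of an lm, done by hand in base R.
# mtcars and the formula mpg ~ wt are illustrative assumptions.
dat  <- mtcars
form <- mpg ~ wt
n    <- nrow(dat)

# One row of coefficients per left-out observation
loo_coefs <- t(sapply(seq_len(n), function(i) coef(lm(form, data = dat[-i, ]))))

# Reference fit on the full data set
full_fit <- lm(form, data = dat)
full_se  <- summary(full_fit)$coefficients[, "Std. Error"]

loo_mean <- colMeans(loo_coefs)
loo_sd   <- apply(loo_coefs, 2, sd)        # spread across heavily overlapping fits
jack_se  <- (n - 1) / sqrt(n) * loo_sd     # jackknife standard error

print(rbind(full = coef(full_fit), loo_mean = loo_mean))
print(rbind(full_se = full_se, loo_sd = loo_sd, jack_se = jack_se))
```

The raw standard deviation of the leave-one-out estimates is not an estimate of the coefficient's sampling variability; the jackknife-scaled value is the quantity that is comparable to the lm standard error.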

Related

Changing coefficients in logistic regression

I will try to explain my problem as best I can. I am trying to externally validate a prediction model made by one of my colleagues. For the external validation, I have collected data from a set of new patients.
I want to test the accuracy of the prediction model on this new dataset. Online I have found a way to do so, using the coef.orig function to extract the coefficients of the original prediction model (see the picture I added). Here is the problem: I cannot repeat the steps my colleague took to obtain the original prediction model. He used multiple imputation and bootstrapping for the development and internal validation, which makes his steps very complex to reproduce. What I do have are the computed intercept and coefficients from the original model. Step 1 from the picture I added could therefore be skipped.
My question is: how can I put these coefficients into the regression model without using the 'coef()' function?
Steps to externally validate the prediction model:
The coefficients I need to use:
I thought that the offset function might be of use, but it does not seem to let me set the intercept and all the variable coefficients at the same time.
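One way to use known coefficients without refitting is to compute the linear predictor by hand and pass it to glm() as a single offset or covariate. The sketch below is only an illustration: the coefficient values, the variable names age and marker, and the simulated outcome are all hypothetical placeholders for the real model and patient data.

```r
# Apply a published logistic model to new data without refitting it.
# All coefficient values, variable names, and data below are hypothetical.
set.seed(1)
new_data <- data.frame(age = rnorm(200, 60, 10), marker = rnorm(200, 0, 1))

orig_coef <- c("(Intercept)" = -6, age = 0.08, marker = 0.9)  # taken from the original model

# Linear predictor: design matrix of the new data times the fixed coefficients
X <- model.matrix(~ age + marker, data = new_data)
new_data$lp <- drop(X %*% orig_coef[colnames(X)])
pred_prob <- plogis(new_data$lp)          # predicted risk for each new patient

# Simulated outcome, only so that the example runs end to end
new_data$y <- rbinom(200, 1, plogis(0.3 + 0.8 * new_data$lp))

# Calibration-in-the-large: intercept estimated with the linear predictor fixed as an offset
cal_large <- glm(y ~ offset(lp), family = binomial, data = new_data)
# Calibration slope: a value near 1 means the original coefficients suit the new data
cal_slope <- glm(y ~ lp, family = binomial, data = new_data)

print(coef(cal_large))
print(coef(cal_slope))
```

Because the whole linear predictor enters as one offset, the intercept and every coefficient are fixed at once, which is the step a separate offset per variable makes awkward.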

Initial parameters in nonlinear regression in R

I want to learn how to do nonlinear regression in R. I managed to learn the basics of the nls function, but as we know, good initial parameters are crucial in nonlinear regression. I tried to figure out how the selfStart and getInitial functions work but failed. The documentation is scarce and not very useful. I wanted to learn these functions on simple simulated data, so I simulated data from a logistic model:
n <- 100     # number of observations
d <- 10000   # model parameters
b <- -2
e <- 50
set.seed(n)
X <- rnorm(n, -e/b, 2)   # centers X where the logistic curve rises fastest, so many observations fall there
Y <- d/(1 + exp(b*X + e)) + rnorm(n, 0, 200)   # simulate the response with noise
Now I want to fit the function f(x)=d/(1+exp(b*x+e)), but I don't know how to use selfStart or getInitial. Could you help me? Please don't point me to SSlogis. I'm aware it is a function designed to find initial parameters in logistic regression, but it seems to work only with one explanatory variable, and I'd like to learn how to do logistic regression with more than one explanatory variable, and even general nonlinear regression with a function that I define myself.
I will be very grateful for your help.
I don't know why the calculation of good initial parameters fails in R. The aim of my answer is to provide a method for finding good enough initial parameters.
Note that a non-iterative method exists which doesn't require initial parameters. The principle is explained in this paper, pp.37-46 : https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
A simplified version is shown below.
If the results are not accurate enough, they can be used as initial parameters in ordinary nonlinear regression software such as R.
A numerical example is shown below. Usually the number of points is much higher. Here it is deliberately low to make checking easier when one edits the code and verifies it.
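Separately from the integral-equation method of the paper, a simpler and cruder route to starting values is linearization. This is a swapped-in technique, not the paper's method and not selfStart itself: guess the upper asymptote d from the largest response, then use the fact that log(d/Y - 1) = b*X + e is linear in X and fit it with lm(). The sketch below re-simulates the question's data (with the true values renamed d_true, b_true, e_true to avoid clashing with the nls parameter names) and assumes the cut-offs 0.1 and 0.9 are reasonable for keeping the logarithm stable.

```r
# Simulated data as in the question
n <- 100; d_true <- 10000; b_true <- -2; e_true <- 50
set.seed(n)
X <- rnorm(n, -e_true / b_true, 2)
Y <- d_true / (1 + exp(b_true * X + e_true)) + rnorm(n, 0, 200)

# Rough guess for the upper asymptote
d0 <- max(Y)

# Since Y = d / (1 + exp(b*X + e)), log(d/Y - 1) = b*X + e is linear in X.
# Keep only points well inside (0, d0), where the logarithm is defined and stable.
mid <- Y > 0.1 * d0 & Y < 0.9 * d0
lin_data <- data.frame(z = log(d0 / Y[mid] - 1), x = X[mid])
lin <- lm(z ~ x, data = lin_data)

start <- list(d = d0, b = unname(coef(lin)["x"]), e = unname(coef(lin)["(Intercept)"]))
print(unlist(start))

# Use the linearized estimates as starting values for the actual nonlinear fit
fit <- nls(Y ~ d / (1 + exp(b * X + e)), start = start)
print(coef(fit))
```

The same idea extends to several explanatory variables by putting them all on the right-hand side of the lm() call, since the transformed response stays linear in each of them.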

Did I screw up my entire data science homework assignment by standardizing my data?

Our professor wanted us to run 10-fold cross-validation on a data set to find the model with the lowest RMSE, and to use its coefficients to write a function that takes in parameters and returns a predicted "Fitness Factor" score, which ranges between 25 and 75.
He encouraged us to try transforming the data, so I did. I used scale() on the entire data set to standardize it and then ran my regression and 10-fold cross-validation. I then found the model I wanted and copied the coefficients over. The problem is that my function's predictions are WAY off when I put unstandardized parameters into it to predict a y.
Did I completely screw this up by standardizing the data to a mean of 0 and an sd of 1? Is there any way I can undo this mess if I did screw up?
My coefficients are extremely small numbers and I feel like I did something wrong here.
Build a proper pipeline, not just a hack with some R functions.
The problem is that you treat scaling as part of loading the data, not as part of the prediction process.
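The proper protocol is as follows:
1. "Learn" the transformation parameters (here the means and standard deviations) from the training data
2. Transform the training data
3. Train the model
4. Transform the new data with the same parameters
5. Predict the value
6. Inverse-transform the predicted value
During cross-validation these steps need to run separately for each fold; otherwise information from the held-out data leaks into the transformation and you may overestimate the quality of your model.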
Standardization is a linear transform, so the inverse is trivial to find.
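The protocol above can be sketched in base R as follows. The data set (mtcars), the chosen columns, and the train/test split are illustrative assumptions; the point is only that the means and standard deviations are learned once from the training rows and then reused for the new data and for the inverse transform.

```r
# Scaling as part of the prediction pipeline; mtcars and the columns are illustrative.
set.seed(42)
test_idx <- sample(nrow(mtcars), 8)
train <- mtcars[-test_idx, c("mpg", "wt", "hp")]
test  <- mtcars[test_idx,  c("mpg", "wt", "hp")]

# 1. Learn the transformation parameters from the training data only
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# 2. Transform the training data
train_s <- as.data.frame(scale(train, center = mu, scale = sigma))

# 3. Train the model on the standardized data
fit_s <- lm(mpg ~ wt + hp, data = train_s)

# 4. Transform the new data with the *training* parameters
test_s <- as.data.frame(scale(test, center = mu, scale = sigma))

# 5. Predict on the standardized scale, then 6. inverse-transform to original units
pred_s <- predict(fit_s, newdata = test_s)
pred   <- pred_s * sigma["mpg"] + mu["mpg"]

# For comparison: a model fitted on the raw data gives the same predictions
fit_raw <- lm(mpg ~ wt + hp, data = train)
print(cbind(pipeline = pred, raw = predict(fit_raw, newdata = test)))
```

For a plain linear model the two columns agree, which also shows that the small standardized coefficients are not wrong in themselves; they only need standardized inputs and an inverse transform on the output.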

R - Testing for homo/heteroscedasticity and collinearity in a multivariate regression model

I'm trying to optimize a multivariate linear regression model lmMod=lm(depend_var~var1+var2+var3+var4....,data=df) and I'm presently checking the assumptions of the model: constant variance of the residuals and absence of autocorrelation. For this I'm using:
Breusch-Pagan test for homo/heteroscedasticity: lmtest::bptest(lmMod)
Durbin Watson test for autocorrelation: car::durbinWatsonTest(lmMod)
I found examples which test either one independent variable at a time:
example for Breusch-Pagan test – one independent variable:
https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin Watson test - one independent variable:
http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin Watson test – multiple independent variables:
https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
1. Can durbinWatsonTest() and bptest() be fed a whole multivariate model?
2. If the answer to 1 is yes, how is it then possible to determine which variable is causing heteroscedasticity or autocorrelation in the model in order to fix it, given that each of those tests gives only one p-value for the entire multivariate model?
3. If the answer to 1 is no, the test should then be performed with one independent variable at a time. But homoscedasticity can only be tested AFTER a particular regression has been fitted. Hence the pattern of homo/heteroscedasticity in a univariate regression model lmMod_1=lm(depend_var~var1, data=df) will be different from the pattern in a multivariate regression model lmMod_2=lm(depend_var~var1+var2+var3+var4....,data=df)
Thank you very much in advance for your help!
I'll try to give a first answer.
The answer to the first question: Yes, you can use the Breusch-Pagan test and the Durbin Watson test for multivariate models. (However, I have always used dwtest() from lmtest instead of durbinWatsonTest().)
Also note that dwtest() only checks first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing the heteroscedasticity or autocorrelation. If you do encounter these problems, one possible remedy is robust standard errors: coeftest(model, vcov = NeweyWest) for autocorrelation, or coeftest(model, vcov = vcovHC) for heteroscedasticity. coeftest() comes from the lmtest package, and NeweyWest() and vcovHC() from sandwich; loading the AER package attaches both.
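To make it concrete that both tests take the whole model, here is a base-R sketch of the two statistics computed by hand. It is an illustration under assumptions: mtcars and the formula are stand-ins, the Breusch-Pagan statistic is the studentized (Koenker) form, which should correspond to the default of lmtest::bptest, and reading the per-variable t-tests of the auxiliary regression as a hint about which regressor drives the variance is a heuristic, not a formal procedure.

```r
# Studentized (Koenker) Breusch-Pagan statistic and Durbin-Watson statistic by hand.
# mtcars and the formula are illustrative assumptions.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res <- residuals(fit)
n   <- length(res)

# Auxiliary regression: squared residuals on the same regressors
aux_data <- cbind(mtcars, res_sq = res^2)
aux <- lm(res_sq ~ wt + hp + qsec, data = aux_data)
k   <- length(coef(aux)) - 1                  # number of regressors
bp_stat <- n * summary(aux)$r.squared         # n * R^2, chi-squared with k df under H0
bp_p    <- pchisq(bp_stat, df = k, lower.tail = FALSE)

# Per-variable t-tests of the auxiliary regression: a rough hint at which
# regressor the residual variance depends on
print(summary(aux)$coefficients)

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
dw_stat <- sum(diff(res)^2) / sum(res^2)
print(c(BP = bp_stat, p.value = bp_p, DW = dw_stat))
```

The Durbin-Watson statistic depends on the order of the rows, so it is only meaningful when the observations have a natural ordering such as time.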

what is the difference between lmFit and rlm

I want to use robust limma on my microarray data and R's user guide says rlm is the correct function to use according to:
http://rss.acs.unt.edu/Rdoc/library/limma/html/mrlm.html
I currently have:
lmFit(ExpressionMatrix, design, method = "robust", na.omit=T)
I can see that I chose the method to be robust. Does that mean that rlm will be called by this lmFit? And if I want it not to be robust, what method should I use?
The help page says:
The function mrlm is used if method="robust".
And then goes on:
If method="ls", then gls.series is used if a correlation structure has been specified, i.e., if ndups>1 or block is non-null and correlation is different from zero. If method="ls" and there is no correlation structure, lm.series is used.
If you follow the links from the help page for lmFit (06.LinearModels), you find:
Fitting Models
The main function for model fitting is lmFit. This is recommended
interface for most users. lmFit produces a fitted model object of
class MArrayLM containing coefficients, standard errors and residual
standard errors for each gene. lmFit calls one of the following three
functions to do the actual computations:
lm.series
Straightforward least squares fitting of a linear model for
each gene.
mrlm
An alternative to lm.series using robust regression as
implemented by the rlm function in the MASS package.
gls.series
Generalized least squares taking into account correlations
between duplicate spots (i.e., replicate spots on the same array) or
related arrays. The function duplicateCorrelation is used to estimate
the inter-duplicate or inter-block correlation before using
gls.series.
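The practical difference between the two fitting styles can be seen on a single simulated "gene" outside limma. This is a sketch under assumptions: the data are simulated, lm() stands in for the least squares fit that method="ls" performs per gene, and MASS::rlm() is called directly with its default settings rather than through mrlm.

```r
# Least squares versus robust regression on one simulated series with an outlier.
# The data are simulated for illustration; MASS ships with standard R installations.
set.seed(1)
x <- 1:20
y <- 1 + 2 * x + rnorm(20, sd = 0.5)
y[10] <- y[10] + 50                  # one gross outlier

ls_fit  <- lm(y ~ x)                 # analogous to method = "ls"
rob_fit <- MASS::rlm(y ~ x)          # the function mrlm relies on for method = "robust"

print(rbind(ls = coef(ls_fit), robust = coef(rob_fit)))
```

The robust fit downweights the outlying observation, so its intercept stays close to the true value of 1 while the least squares intercept is pulled upward.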
