down sampling in r for regression (NOT classification) - r

At the moment I am using simply:
down_sample_size = 3000
train <- train[sample(nrow(train), down_sample_size),]
to down-sample my training data to make my model fitting faster (in the context of hyper parameters search and CV - above is simplified). Are there better ways of doing this? In the context of classification, for example, class priors and stratification have to be taking into account. However, maybe the above is acceptable for regression? Thanks.

This seems perfectly acceptable unless you have clusters or any other viable reason to sample non-randomly. I've done something similar hundreds of times for linear regression.

Related

Imputation missing data for MLM in R

Maybe anyone can help me with this question. I conducted a follow-up study and obviously now have to face missing data. Now I am considering how to impute the missing data at best using MLM in R (f.e. participants concluded the follow up 2 survey, but not the follow up 1 survey, therefore I am missing L1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet, I have troubles understanding it completely. Is there maybe another way to impute missing data in R? Or maybe somebody could illustrate the process of the imputation method a bit more detailed, that would be so great! Do I have to conduct the imputation for every model I built in my MLM? (f.e. when I compared, whether a random intercept versus a random intercept and random slope model fits better for my data, do I have to use the imputation code for every model, or do I use it at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
Do I have to conduct the imputation for every model I built in my MLM? (f.e. when I compared, whether a random intercept versus a random intercept and random slope model fits better for my data, do I have to use the imputation code for every model, or do I use it at the beginning of all my calculations?)
You have to specify the imputation model. Basically that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effect, and only changing the random effects (in particular comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
perform the imputations;
run the model on all the imputed datasets,
pool the results (typically using Rubin's rules)
So you will need to do this twice, to end up with 2 sets of pooled results - one for each model. The software should provide functionality for doing all of this.
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.

Initial parameters in nonlinear regression in R

I want to learn how to do nonlinear regression in R. I managed to learn the basics of the nls function, but how we know it's crucial in nonlinear regression to use good initial parameters. I tried to figure out how selfStart and getInitial functions works but failed. The documentation is very scarce and not very usefull. I wanted to learn these functions via a simple simulation data. I simulated data from logistic model:
n<-100 #Number of observations
d<-10000 #our parameters
b<--2
e<-50
set.seed(n)
X<-rnorm(n, -e/b, 2) #Thanks to it we'll have many observations near the point where logistic function grows the faster
Y<-d/(1+exp(b*X+e))+rnorm(n, 0, 200) #I simulate data
Now I wanted to do regression with a function f(x)=d/(1+exp(b*x+e)) but I don't know how to use selfStart or getInitial. Could you help me? But please, don't tell me about SSlogis. I'm aware it's a functon destined to find initial parameters in logistic regression, but It seems it only works in regression with one explanatory variable and I'd like to learn how to do logistic regression with more than one explanatory variables and even how to do general nonlinear regression with a function that I defined mysefl.
I will be very gratefull for your help.
I don't know why the calculus of good initial parameters fails in R. The aim of my answer is to provide a method to find good enough initial parameters.
Note that a non-iterative method exists which doesn't requires initial parameters. The principle is explained in this paper, pp.37-46 : https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
A simplified version is shown below.
If the results are not sufficient, they can be used as initial parameters in an usual non-linear regression software such as in R.
A numerical example is shown below. Usually the number of points is much higher. Here it is deliberately low in order to make easier the checking when one edit the code and check it.

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova

Does the glmulti function (from the gmulti package) need a set.seed value?

I am using glmulti to select a set of candidate generalized linear models and my variable importance values and 'best' model keep changing each time I run the model.
I am struggling to understand why this is, does glmulti need a set.seed value to make results reproducible?
Thanks.
I believe the glmulti function is in the glmulti package. (You should state this in your question...) If so, the help page says that sometimes it uses a genetic algorithm to find the best model. Those are indeed random algorithms, so you can expect to get an answer that depends on the random number seed.

Best alternative for stepwise regression in R

I know that there dosens of similar questions/answers, and lots of papers. But please read till the end.
Non-statisticians tend to use stepwise regressions which is strongly argued by statisticians. This is stomething that I don't understand, but I just obey them. "Ok this is not a good way to do your modelling".
Here is (was) my model:
b <- lmer(metric1~a+b+c+d+e+f+g+h+i+j+k+l+(1|X/Y) + (1|Z), data = dataset)
drop1 (b, test="Chisq")
(Just a small note: Watch out for the random effects in my model; random effects are Year, Month, Sampling.location; one of my variables is 1/0: I allready log-transformed my variables)
I am trying to find a exploratory model (with drop1 to reach final model) and evaluating it with my biological knowledge to see if the dependent ("metric" in this case) seems to be responding variables. I will repeat this process with 100 metrics just to evaulate which metrics seems to be responding environmental variables.
I was in the search for an acceptable model instead of stepwise according to the suggestions of statistics gurus.
However, there are lots of alternatives. I read alot, but still feel myself lost. Some say Lasso, some say elastic modelling, some say ridge regression... Which one fits for my purpose?
Any advise for a better alternative and an easy model or a help page for dummies, or examples (that could be better) would be much appreciated.
Thanks in advance.

Resources