How to set up multiple regression in R?

I don't know how to set up a multiple regression in R and run a simple OLS estimation for it.
I only have a multiple regression equation (e.g. Salary = 0.3 + 0.5*age + experience + residual) and equations for each variable (e.g. age = ...). The error term is normally distributed with mean 0 and SD 0.3.
How can I run a simple OLS estimation of salary on age and compute a standard error for it?
Thank you.

Since you have only one predictor, this is simple linear regression rather than multiple regression. In R you can do:
model <- lm(Salary ~ age, data = your_dataset)
summary(model) # gives summary statistics such as the coefficients and their standard errors
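If you specifically want the standard error of the age coefficient, you can pull it out of the coefficient table returned by summary(). A minimal sketch, reusing the model object fitted above:
coefs <- coef(summary(model))  # matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs["age", "Std. Error"]     # standard error of the age coefficient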

Related

Having trouble with overfitting in simple R logistic regression

I am a newbie to R and I am trying to perform a logistic regression on a set of clinical data.
My independent variables are AGE, TEMP, WBC, NLR, CRP, PCT, ESR, IL6, and TIME.
My dependent variable is binomial CRKP.
After running glm, I got this warning message:
glm.fit <- glm(CRKP ~ AGE + TEMP + WBC + NLR + CRP + PCT + ESR, data = cv, family = binomial, subset = train)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
I looked up potential problems and used the corrplot function to see whether there is multicollinearity that could potentially result in overfitting.
This is what I have as the plot.
The correlation plot shows that my ESR and PCT variables are highly correlated. Similarly, CRP and IL6 are highly correlated. But they are all important clinical indicators that I would like to include in the model.
I tried using VIF to selectively discard variables, but wouldn't that be biased? I would also have to sacrifice some of my variables of interest.
Does anyone know what I can do? Please help. Thank you!
You have a multicollinearity problem but don't want to drop variables. In this case you can use Partial Least Squares (PLS) or Principal Component Regression (PCR).
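As a rough sketch of the PCR idea with a binary outcome (this is an illustration, not code from the original answer; it assumes the predictors in the question's data frame cv are numeric and have no missing values), you can compute principal components of the predictors and then fit the logistic model on the first few component scores:
pcs <- prcomp(cv[, c("AGE", "TEMP", "WBC", "NLR", "CRP", "PCT", "ESR")], center = TRUE, scale. = TRUE)
scores <- as.data.frame(pcs$x[, 1:3])  # keep the first three components; how many to keep is a modelling choice
scores$CRKP <- cv$CRKP
pcr_fit <- glm(CRKP ~ ., data = scores, family = binomial)
summary(pcr_fit)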

How to obtain analysis of variance table for a nonlinear regression model in R

Previously I used SAS to fit data to a nonlinear regression model. SAS was able to produce an analysis of variance table for the model. The table displays the degrees of freedom, sums of squares, and mean squares, along with the model F test.
Please refer to Table 69.4 in this pdf file.
Source: https://support.sas.com/documentation/onlinedoc/stat/132/nlin.pdf
How can I re-create something similar in R? Thanks in advance.
I'm not sure what type of nonlinear regression you're interested in, but the general approach would be to run the model and call summary() on it. The typical linear model would be:
linearmodel <- lm(`outcomevar` ~ `predictorvar`, data = dataset)
linearmodel # gives coefficients
summary(linearmodel) # gives model fit
For nonlinear regression you would add a polynomial term. For a quadratic fit it would be y = b0 + b1*Var + b2*Var^2, or in R:
nonlinmodel <- lm(`outcomevar` ~ `predictorvar` + I(`predictorvar`^2), data = dataset)
nonlinmodel
summary(nonlinmodel)
Other methods are described here: https://data-flair.training/blogs/r-nonlinear-regression/
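If the model is fit with nls() rather than a polynomial lm(), R will not print a SAS-style ANOVA table directly, but one can be assembled by hand from the fitted object. A hedged sketch, where the data frame dat, the exponential model form, and the starting values are assumptions for illustration, and the layout loosely follows the uncorrected-total table that PROC NLIN prints:
fit <- nls(y ~ a * exp(b * x), data = dat, start = list(a = 1, b = 0.1))
n <- length(resid(fit))
p <- length(coef(fit))
ss_error <- sum(resid(fit)^2)  # error (residual) sum of squares
ss_total <- sum(dat$y^2)       # uncorrected total sum of squares
ss_model <- ss_total - ss_error
f_value <- (ss_model / p) / (ss_error / (n - p))
p_value <- pf(f_value, p, n - p, lower.tail = FALSE)  # approximate model F test
data.frame(Source = c("Model", "Error", "Uncorrected Total"),
           DF = c(p, n - p, n),
           SumSq = c(ss_model, ss_error, ss_total),
           MeanSq = c(ss_model / p, ss_error / (n - p), NA))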

What post-hoc test should be used for a glmer model with a continuous and a categorical predictor variable?

I'm a bit of a newbie with stats and R, so need a bit of direction to find a suitable post-hoc test for my glmer model.
The model has a binary dependent variable (absent/present), and the predictor variables are interaction terms between a continuous variable (e.g. temp) and a categorical variable (species, n = 3). Only the interaction terms, rather than the continuous variable in isolation, produce significant results when an anova is run on the model. Species by itself has a large effect because one species is much rarer than the others. I'm trying to tease apart how the presence of these species varies across pH and between species.
I've tried lsmeans/emmeans with Tukey adjustment, and Firth's bias-reduced logistic regression (logistf). I ran the effects function on the interaction terms, so I had a rough expectation of what a post hoc could show, but the results logistf produced were not what I was expecting. emmeans and Tukey both gave the same results and ignored the continuous variable, I assume because it's not a factor.
When I run Firth's regression it produces chi-squared values that are infinite or p values that are astronomically small, even though what I saw through effects suggested no significant difference. With the interaction term I can't tell whether there truly is an effect of the environmental variable or whether the significant effect is due to the difference between species. Based on what I have seen of the logistf function, I didn't think it would produce a chi-square score. Is this a coding issue or is it because of my data?
If I wasn't clear enough about something please let me know and if anyone has any suggestions or advice, they would be massively appreciated. Thanks!
The model and test code I used are below:
### glmer model
library(lme4)
Large <- glmer(Abs.Pres ~ Species:Q.Depth + Species:Conductivity + Species:Temp + Species:pH + Species:DO.P + (1 | QID),
               nAGQ = 0,
               family = binomial,
               data = Stacked_Pref)
anova(Large)
Output: Analysis of Variance Table
npar Sum Sq Mean Sq F value
Species:Q.Depth 3 234.904 78.301 78.3014
Species:Conductivity 3 32.991 10.997 10.9970
Species:Temp 3 39.001 13.000 13.0004
Species:pH 3 25.369 8.456 8.4562
Species:DO.P 3 34.930 11.643 11.6434
### Firth's
library(logistf)
Lp <- logistf(Abs.Pres ~ Species:pH, data = Stacked_Pref, contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
> Lp
logistf(formula = Abs.Pres ~ Species:pH, data = Stacked_Pref,
contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood
coef se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) 1.9711411 0.57309880 0.8552342 3.1015114 12.09107 5.066380e-04
SpeciesGoby:pH -0.3393185 0.07146049 -0.4804047 -0.2003108 23.31954 1.371993e-06
SpeciesMosquito:pH -0.3001385 0.07127771 -0.4408186 -0.1614419 18.24981 1.937453e-05
SpeciesRFBE:pH -0.4771393 0.07232469 -0.6200179 -0.3365343 45.73750 1.352096e-11
Likelihood ratio test=267.0212 on 3 df, p=0, n=3945
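One possible direction, sketched under the assumption that the Large model above has been fit successfully: because the interaction is between a factor and a continuous variable, the emmeans package can compare the per-species pH slopes (rather than marginal means) via emtrends().
library(emmeans)
ph_slopes <- emtrends(Large, ~ Species, var = "pH")  # per-species slope of presence on pH (logit scale)
ph_slopes
pairs(ph_slopes, adjust = "tukey")                   # pairwise comparisons of the slopes between species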

How to get confidence interval for hypothesis test of non-linear multiple parameters

I am trying to do something that seems very simple, yet I cannot find any good advice out there. I would like to get the confidence interval for a non-linear combination of two coefficients in a regression model. I can use linearHypothesis() to conduct an F-test and get the p-value for a linear combination. The code I ran for that part is:
reg4 <- lm(bpsys ~ current_tobac + male + wtlb + age, data=NAMCS2010)
linearHypothesis(reg4, "current_tobac + male = 0")
I can use glht() from the multcomp package to get the confidence interval for a linear combination of parameters:
confcm <- summary(glht(reg4, linfct = c("current_tobac + male = 0")))
confint(confcm)
But I'm not sure what to do for a non-linear combination like (summary(reg4)$coefficients[2])/ (summary(reg4)$coefficients[4])
Any advice?
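One common approach for a ratio of coefficients is the delta method; the car package provides deltaMethod() for this. A minimal sketch, assuming the ratio of interest is the current_tobac coefficient over the wtlb coefficient (coefficients [2] and [4] of reg4):
library(car)
deltaMethod(reg4, "current_tobac / wtlb")  # estimate, standard error, and 95% CI for the ratio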

How do you fit a linear mixed model with an AR(1) random effects correlation structure in R?

I am trying to use R to rerun someone else's project, so we need to use some macros in R.
Here comes a very basic question:
m1.nlme = lme(log.bp.dia ~ M25.9to9.ma5iqr + temp.c.9to9.ma4iqr + o3.ma5iqr + sea_spring + sea_summer + sea_fall + BMI + male + age_ini, data=barbara.1.clean, random = ~ 1|study_id)
Since the model uses an AR(1) [first-order autoregressive] covariance structure in SAS for the within-person variance, I am not sure how to do this in R.
And where can I see the index of different covariance structures, like unstructured?
Thanks
I don't know what you mean by "index" for different models, but to specify an AR(1) covariance structure for the residuals, you can add correlation = corAR1() to your lme call. The correlation structures available in nlme are listed on the ?corClasses help page (corSymm is the general, "unstructured" one).
The correlation at lag $1$ is, say, $r$, where $-1 < r < 1$ for a stationary AR(1) model. The correlation at lag $k \geq 1$ is $r^k$. Multiplying by the variance of $X_t$ then gives the autocovariance matrix.
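A minimal sketch of what that looks like with the model from the question; the form argument below assumes the observations are in time order within each study_id:
library(nlme)
m1.ar1 <- lme(log.bp.dia ~ M25.9to9.ma5iqr + temp.c.9to9.ma4iqr + o3.ma5iqr + sea_spring + sea_summer + sea_fall + BMI + male + age_ini,
              data = barbara.1.clean,
              random = ~ 1 | study_id,
              correlation = corAR1(form = ~ 1 | study_id))  # AR(1) residual correlation within each study_id
summary(m1.ar1)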
