Calculating F-Ratio (F-value) for multiple regression in R

In a statistics course for management and economics I am taking we were given a dataset consisting of information about job applicants (age, school and bachelor grades, number of internships performed, time spent abroad, performance during an interview) and an assessment of their job performance after they were hired. We are supposed to analyze which variables predict later job performance using multiple linear regressions. In total we are supposed to fit four models that subsequently add more and more predictors and compare their R^2, F-value, p-value for the F-test, and explain if the model holds explanatory value.
So far I have the following code:
# import a summary() method that reports heteroskedasticity-robust standard errors
library(RCurl)  # provides getURL()
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir = .GlobalEnv)
#calculate linear models
lm_1 <- lm(performance ~ age + sex, data_A)
summary(lm_1, robust = T)
lm_2 <- lm(performance ~ age + sex + school + bachelor, data_A)
summary(lm_2, robust = T)
lm_3 <- lm(performance ~ age + sex + school + bachelor + abroad + internships, data_A)
summary(lm_3, robust = T)
lm_4 <- lm(performance ~ age + sex + school + bachelor + abroad + internships + interview, data_A)
summary(lm_4, robust = T)
Here is the summary output for lm_4:
Call:
lm(formula = performance ~ age + sex + school + bachelor + abroad +
internships + interview, data = data_A)
Residuals:
Min 1Q Median 3Q Max
-1.26292 -0.31620 -0.00085 0.29548 1.51859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.073422 1.046419 1.981 0.0527 .
age 0.011915 0.023007 0.518 0.6067
sex 0.046603 0.149346 0.312 0.7562
school -0.192044 0.123340 -1.557 0.1254
bachelor 0.243797 0.111866 2.179 0.0338 *
abroad 0.015325 0.009493 1.614 0.1124
internships 0.037040 0.041557 0.891 0.3768
interview 0.339886 0.216935 1.567 0.1231
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4776 on 53 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.3195, Adjusted R-squared: 0.2297
F-statistic: 4.18 on 7 and 53 DF, p-value: 0.0009912
Additionally I have the solutions we are supposed to obtain for each of the parameters from our model as well as the coefficients of model 4 (the most complex model with highest number of predictors).
So far the models I fitted seem to be correct, since I obtain the same coefficients with the same t- and p-values. My issue is that the F-test on the multiple linear model gives me the F-statistic instead of the F-value. Based on my research, I believe that the difference between the F-statistic and the F-value (sometimes also called the F-ratio) is that the F-statistic is the critical F-value (i.e. the threshold below which the F-test is no longer statistically significant), while the F-value/F-ratio is the actual test value. Is this correct?
In R I would also like to obtain the F-value. Our professor uses gretl instead of R, and gretl apparently always reports the F-value for linear regression models instead of the F-statistic. I have tried many different methods of getting this F-value, but I can't get any of them to work. I will post my failed approach below. Any suggestions on how I can get the F-value for each of these models would be highly appreciated! Also, if anyone could explain how the F-value would in theory be calculated for one of my models (lm_1 would probably be the most practical since it has the fewest predictors), that would really help me, since I think I have misunderstood something about calculating the F-value.
(For simplicity's sake I always started with the least complex model, wanting to find a solution for that and then work my way up to the more complex ones.)
Approach 1: perform an anova on my model
anova(lm_1)
Console Output:
Analysis of Variance Table
Response: performance
Df Sum Sq Mean Sq F value Pr(>F)
age 1 0.1212 0.121247 0.4054 0.5266
sex 1 0.0170 0.016995 0.0568 0.8124
Residuals 63 18.8425 0.299087
Here I obtain two different F-values. As far as I understand the ANOVA output, the first one (in the row for age) tests only the effect of age without sex, and the second one tests the effect of sex after removing the effects analyzed in the row above? I am still a bit confused about this.
Since F value = variance of the group means (Mean Square Between) / mean of the within-group variances (Mean Squared Error), I wanted to calculate the F-value manually from the ANOVA output, but I wasn't exactly sure how to proceed. None of the solutions I tried led to the F-value I should be obtaining. Could someone explain how I would go about calculating it?
In my research I also often came across the phrase "nested models", and that comparing them would give me the F-value that tests the entire model. Sadly, I couldn't figure out what exactly a nested model is and how I would have to specify it in R.
Thanks for your help!
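A note on terminology, followed by a minimal sketch. The "F-statistic" that summary() prints for an lm fit is the observed test value (the F-ratio), not a critical value; the p-value next to it is computed from that value and the two degrees of freedom. The sketch below uses the built-in mtcars data as a stand-in, because data_A is not available, and it assumes the classical (non-robust) F-test. It computes the overall F-value in four equivalent ways: from summary(), from R^2, from a comparison of nested models (the intercept-only model is nested in the full model because it uses a subset of its terms), and by pooling the predictor rows of the sequential ANOVA table.

```r
# Stand-in data: mtcars replaces data_A, which is not available here.
fit  <- lm(mpg ~ wt + hp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model, nested in fit

# 1) Overall F as stored by summary(): value, numerator df, denominator df
f_summary <- summary(fit)$fstatistic

# 2) From R^2: F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r2 <- summary(fit)$r.squared
k  <- length(coef(fit)) - 1          # number of predictors
n  <- nobs(fit)                      # observations actually used
f_from_r2 <- (r2 / k) / ((1 - r2) / (n - k - 1))

# 3) From a nested model comparison
f_nested <- anova(null, fit)$F[2]

# 4) From the sequential ANOVA table: pool all predictor rows,
#    then divide by the residual mean square (last row)
tab <- anova(fit)
res <- nrow(tab)
f_anova <- (sum(tab[["Sum Sq"]][-res]) / sum(tab[["Df"]][-res])) / tab[["Mean Sq"]][res]

# p-value of the overall F-test
p_value <- pf(f_summary[["value"]], f_summary[["numdf"]], f_summary[["dendf"]],
              lower.tail = FALSE)

c(summary = f_summary[["value"]], r2 = f_from_r2, nested = f_nested, anova = f_anova)
```

Applied to the anova(lm_1) table in the question, method 4 gives ((0.121247 + 0.016995) / 2) / 0.299087, which is about 0.231. Plugging the printed R^2 of lm_4 into method 2 gives (0.3195 / 7) / (0.6805 / 53), about 3.55, rather than the printed 4.18. One possible reading, which the post does not confirm, is that the printed value comes from the robust summary, so a classical F reported by other software may differ from it.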

Related

Predictor in logistic regression for a large sample size (1.8 million obs.) predicts only 0's

I am trying to run a logistic regression model to predict the default probabilities of individual loans. I have a large sample of 1.85 million observations, about 81% of which were fully paid off; the rest defaulted. I ran the logistic regression with 20+ other predictors that were statistically significant and got the warning "fitted probabilities 0 or 1 occurred". By adding predictors step by step, I found that only one predictor was causing this problem: annual income (annual_inc). I ran a logistic regression with only this predictor and found that it predicts only 0's (fully paid off loans), although there is a significant proportion of defaulted loans. I tried different proportions of training and testing data. If I split the data so that 80% of the original sample goes to the testing set and 20% to the training set, R doesn't show the fitted probabilities warning, but the model still predicts only 0's on the testing set. Below I attach the short piece of code concerned, just in case. I doubt that adding a small sample of my data would be of any use given the circumstances, but if I am mistaken, please let me know and I will add it.
>set.seed(42)
>indexes <- sample(1:nrow(df), 0.8*nrow(df))
>df_test = df[indexes,]
>df_train = df[-indexes,]
>mymodel_2 <- glm(loan_status ~ annual_inc, data = df_train, family = 'binomial')
>summary(mymodel_2)
Call:
glm(formula = loan_status ~ annual_inc, family = "binomial",
data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6902 -0.6530 -0.6340 -0.5900 5.4533
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.308e+00 8.290e-03 -157.83 <2e-16 ***
annual_inc -2.426e-06 9.382e-08 -25.86 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 352917 on 370976 degrees of freedom
Residual deviance: 352151 on 370975 degrees of freedom
AIC: 352155
Number of Fisher Scoring iterations: 4
>res <- predict(mymodel_2, df_test, type = "response")
>confmatrix <- table(Actual_value = df_test$loan_status, Predicted_value = res >0.5)
>confmatrix
Predicted_value
Actual_value FALSE
0 1212481
1 271426
Moreover, when I searched for a solution on the Internet, I saw that this issue is often attributed to perfect separation, but my model predicts 0's only, and the analogous cases I have seen had small sample sizes. So far I am hesitant about implementing penalised logistic regression, because I think my issue is not perfect separation. It is also worth pointing out that I want to use logistic regression specifically, due to the specifics of my research. How can I overcome the issue at hand?
As @deschen suggested, I used the ROSE resampling technique from the ROSE package for R and it solved my issue, although over-sampling, under-sampling, and a combination of both worked as well.
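For readers wondering why the single-predictor model never predicts a default, a small sketch using only the coefficients printed above may help. It assumes that annual_inc is never negative; the income values are hypothetical. With a negative slope, the fitted probability is highest at an income of 0, where it is plogis(-1.308), roughly 0.21, so a 0.5 cutoff can never be exceeded. Resampling is one remedy; lowering the cutoff is another that this sketch illustrates.

```r
# Coefficients copied from the summary(mymodel_2) output above
b0 <- -1.308e+00
b1 <- -2.426e-06

# Hypothetical incomes, assumed non-negative
income <- c(0, 30000, 60000, 120000, 500000)

# Fitted default probabilities on the response scale
p_default <- plogis(b0 + b1 * income)

max(p_default)          # largest fitted probability, reached at income 0
any(p_default > 0.5)    # FALSE: a 0.5 cutoff classifies every loan as paid off

# A cutoff near the observed default rate (about 19% here) separates the cases
cutoff <- 0.19
table(predicted_default = p_default > cutoff)
```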

How to implement a non linear model regression in R

I am quite new to both R and statistics and really need your help. I need to analyze some data to find an analytical model that describes it.
I have 2 responses (y1, y2) and 4 predictors.
I thought of performing the analysis using R and followed these steps:
1) For each response, I tested a linear model (lm command) and I found:
Call:
lm(formula = data_mass$m ~ ., data = data_mass)
Residuals:
Min 1Q Median 3Q Max
-7.805e-06 -1.849e-06 -1.810e-07 2.453e-06 7.327e-06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.367e-04 1.845e-05 -7.413 1.47e-06 ***
d 1.632e-04 1.134e-05 14.394 1.42e-10 ***
L 2.630e-08 1.276e-07 0.206 0.83927
D 1.584e-05 5.103e-06 3.104 0.00682 **
p 1.101e-06 1.195e-07 9.215 8.46e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.472e-06 on 16 degrees of freedom
Multiple R-squared: 0.9543, Adjusted R-squared: 0.9429
F-statistic: 83.51 on 4 and 16 DF, p-value: 1.645e-10
2) So I analyzed how good the model is by taking a look at plot(model) graphs.
Looking at the "residual vs fitted value" plot, the model should not be linear!! Is it correct?
3) I tried to eliminate some factors (like "L") and to introduce some quadratic terms (d^2 ; D^2), but the "residual vs fitted value" plot has the same trend.
What can I do now? Should I use a non-linear model?
Thank you to everyone who can help me =)
UPDATE:
Thank you again. I attached graph of plot(model) and DATA. The responses are m, Fz and the predictors d,L,D,p. The model is a linear model of response m.
[Residual vs Fitted][1]
[Normal Q-Q][2]
[Scale Location][3]
[Residual vs Leverage][4]
[DATA][5]
Looking at the "residual vs fitted value" plot, the model should not be linear!! Is it correct?
Yes and no. If the absolute values of the residuals are strongly correlated with the fitted values, that could mean heteroscedasticity (heterogeneity of variance).
In that case the residuals would not be equally spread along the fitted values. Heteroscedasticity is one of the things you can look for in the residuals vs fitted plot, because it can invalidate statistical tests such as the t-tests reported for an lm fit. You can also check it with the scale-location plot (which is quite similar but slightly better suited to this).
On the other hand, a nonlinear pattern in the residuals indicates nonlinearity, and you would probably want to change the structure of your model. You want neither a linear nor a nonlinear relationship between residuals and fitted values: ideally, the values should be more or less randomly and symmetrically scattered around 0, between two parallel lines with 0 slope.
You can find more discussion on the issue here: 1 2 3
What can I do now? Should I use a non-linear model?
If your diagnostic plots indicate nonlinearity, you may want to change/restructure/readjust your model (or transform the data) - there is some discussion on the options here
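As an illustration of restructuring a model, here is a minimal sketch on simulated data (the variables x and y are hypothetical and are not the poster's data). It also shows a common pitfall: inside an R formula, x^2 is formula syntax rather than arithmetic, so a quadratic term has to be wrapped in I(). If the quadratic terms in the question were entered without I(), which the post does not say, that would explain why the residual plot did not change.

```r
set.seed(1)

# Hypothetical data with a curved relationship
x <- seq(1, 10, length.out = 50)
y <- 2 + 0.5 * x^2 + rnorm(50, sd = 1)
dat <- data.frame(x, y)

fit_lin  <- lm(y ~ x, data = dat)
fit_noI  <- lm(y ~ x + x^2, data = dat)      # x^2 is formula syntax: same model as y ~ x
fit_quad <- lm(y ~ x + I(x^2), data = dat)   # I() makes it an arithmetic square

length(coef(fit_noI))    # 2 coefficients: the "quadratic" term was silently dropped
length(coef(fit_quad))   # 3 coefficients

# F-test of the nested models: does the quadratic term improve the fit?
anova(fit_lin, fit_quad)

# Residuals vs fitted for the restructured model
plot(fitted(fit_quad), resid(fit_quad), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)
```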

Back-transforming gamma GLM to natural units to be able to predict values in unsampled locations

I'm working with ecological data, where I have used cameras to sample animal detections (converted to biomass) and run various models to identify the best-fitting model, chosen by looking at diagnostic plots, AIC and parameter effect sizes. The model is a gamma GLM (because biomass is a continuous, positive response) with a log link. The chosen model has the predictor variables distance to water ("dist_water") and distance to forest patch ("dist_F3"). This is the model summary:
glm(formula = RAI_biomass ~ Dist_water.std + Dist_F3.std, family = Gamma(link = "log"),
data = biomass_RAI)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3835 -1.0611 -0.3937 0.4355 1.5923
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3577 0.2049 26.143 2.33e-16 ***
Dist_water.std -0.7531 0.2168 -3.474 0.00254 **
Dist_F3.std 0.5831 0.2168 2.689 0.01452 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.9239696)
Null deviance: 41.231 on 21 degrees of freedom
Residual deviance: 24.232 on 19 degrees of freedom
AIC: 287.98
Number of Fisher Scoring iterations: 7
The covariates were standardised prior to running the model. What I need to do now is to back-transform this model into natural units in order to predict biomass values at unsampled locations (in this case, farms). I made a table of each farm and their respective distance to water, and distance to forest patch. I thought the way to do this would be to use the exp(coef(biomass_glm)), but when I did this the dist_water.std coefficient changed direction and became positive.
exp(coef(biomass_glm8))
## Intercept Dist_water.std Dist_F3.std
## 212.2369519 0.4709015 1.7915026
To me this seems problematic, as in the original GLM an increasing distance to water meant a decrease in biomass (this makes sense), but now we are seeing the opposite? The calculated biomass response had a very narrow range, from 210.97 to 218.9331 (for comparison, in the original data, biomass ranged from 3.04 to 2227.99).
I then tried to take the exponent of the entire model, without taking the exponent of each coefficient individually:
farms$biomass_est2 <- exp(5.3577 + (-0.7531*farms$Farm_dist_water_std) + (0.5831*farms$Farm_dist_F3_std))
and this gave me a new biomass response that makes a bit more sense, i.e. more variation given the variation in the two covariates (2.93-1088.84).
I then tried converting the coefficient estimates by computing e^B - 1, which again gave different results, although they were most similar to the ones obtained through exp(coef(biomass_glm)):
exp(-0.7531) - 1 #dist_water = -0.5290955
exp(0.5831) - 1 #dist_F3 = 0.7915837
exp(5.3577) - 1 #intercept = 211.2362
My question is, why are these estimates different, and what is the best way to take this gamma GLM with a log link and convert it into a format that can be used to calculate predicted values? Any help would be greatly appreciated!
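A minimal sketch of one way to read these numbers and to predict at new locations. It uses simulated data: the data frames sites and farms and all column names in them are hypothetical stand-ins for biomass_RAI and the farm table. With a log link, exp(coef) gives multiplicative factors on the response, so exp(-0.7531), about 0.47, is below 1 and still describes a decrease (biomass is multiplied by roughly 0.47 per one standard deviation of distance to water). exp(coef) - 1 is the same effect expressed as a proportional change. Neither is a prediction by itself: a prediction needs the exponential of the whole linear predictor, and the new covariates have to be standardised with the mean and standard deviation of the original survey data, not of the farms.

```r
set.seed(1)

# Hypothetical survey data standing in for biomass_RAI
sites <- data.frame(dist_water  = runif(40, 0, 5000),
                    dist_forest = runif(40, 0, 3000))
mu <- exp(5 - 0.0003 * sites$dist_water + 0.0004 * sites$dist_forest)
sites$biomass <- rgamma(40, shape = 2, rate = 2 / mu)

# Standardise, keeping the centring and scaling constants of the survey data
m_w <- mean(sites$dist_water);  s_w <- sd(sites$dist_water)
m_f <- mean(sites$dist_forest); s_f <- sd(sites$dist_forest)
sites$water_std  <- (sites$dist_water  - m_w) / s_w
sites$forest_std <- (sites$dist_forest - m_f) / s_f

fit <- glm(biomass ~ water_std + forest_std,
           family = Gamma(link = "log"), data = sites)

# Hypothetical unsampled locations in natural units,
# standardised with the SAME constants as the survey data
farms <- data.frame(dist_water = c(500, 2500), dist_forest = c(200, 1500))
farms$water_std  <- (farms$dist_water  - m_w) / s_w
farms$forest_std <- (farms$dist_forest - m_f) / s_f

# Predictions on the response (biomass) scale
farms$biomass_pred <- predict(fit, newdata = farms, type = "response")

# The same thing by hand: exponentiate the whole linear predictor
b <- coef(fit)
by_hand <- exp(b[[1]] + b[[2]] * farms$water_std + b[[3]] * farms$forest_std)

exp(b)       # multiplicative change in mean biomass per 1 SD of each covariate
exp(b) - 1   # the same effects as proportional changes
```

This matches the second attempt in the question (taking the exponent of the entire model), provided the farm distances were standardised with the survey means and standard deviations.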

R: Calculate and interpret odds ratio in logistic regression

I am having trouble interpreting the results of a logistic regression. My outcome variable is Decision and is binary (0 or 1, not take or take a product, respectively).
My predictor variable is Thoughts; it is continuous, can be positive or negative, and is rounded to two decimal places.
I want to know how the probability of taking the product changes as Thoughts changes.
The logistic regression equation is:
results <- glm(Decision ~ Thoughts, family = binomial, data = data)
According to this model, Thoughts has a significant impact on probability of Decision (b = .72, p = .02). To determine the odds ratio of Decision as a function of Thoughts:
exp(coef(results))
Odds ratio = 2.07.
Questions:
How do I interpret the odds ratio?
Does an odds ratio of 2.07 imply that a .01 increase (or decrease) in Thoughts affect the odds of taking (or not taking) the product by 0.07 OR
Does it imply that as Thoughts increases (decreases) by .01, the odds of taking (not taking) the product increase (decrease) by approximately 2 units?
How do I convert odds ratio of Thoughts to an estimated probability of Decision?
Or can I only estimate the probability of Decision at a certain Thoughts score (i.e. calculate the estimated probability of taking the product when Thoughts == 1)?
The coefficient returned by a logistic regression in R is a logit, or the log of the odds. To convert logits to odds ratios, you can exponentiate them, as you've done above. To convert logits to probabilities, you can use the function exp(logit)/(1+exp(logit)). However, there are some things to note about this procedure.
First, I'll use some reproducible data to illustrate
library('MASS')
data("menarche")
m<-glm(cbind(Menarche, Total-Menarche) ~ Age, family=binomial, data=menarche)
summary(m)
This returns:
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4
The coefficients displayed are for logits, just as in your example. If we plot these data and this model, we see the sigmoidal function that is characteristic of a logistic model fit to binomial data
#predict gives the predicted value in terms of logits
plot.dat <- data.frame(prob = menarche$Menarche/menarche$Total,
age = menarche$Age,
fit = predict(m, menarche))
#convert those logit values to probabilities
plot.dat$fit_prob <- exp(plot.dat$fit)/(1+exp(plot.dat$fit))
library(ggplot2)
ggplot(plot.dat, aes(x=age, y=prob)) +
geom_point() +
geom_line(aes(x=age, y=fit_prob))
Note that the change in probabilities is not constant - the curve rises slowly at first, then more quickly in the middle, then levels out at the end. The difference in probabilities between 10 and 12 is far less than the difference in probabilities between 12 and 14. This means that it's impossible to summarise the relationship of age and probabilities with one number without transforming probabilities.
To answer your specific questions:
How do you interpret odds ratios?
Exponentiating the intercept gives the odds of a "success" (in your data, the odds of taking the product) when x = 0 (i.e. zero thoughts); strictly speaking this is an odds, not an odds ratio. Exponentiating the slope coefficient gives the odds ratio: the factor by which those odds are multiplied when x increases by one whole unit (e.g. from x = 0 to x = 1; one thought). Using the menarche data:
exp(coef(m))
(Intercept) Age
6.046358e-10 5.113931e+00
We could interpret this as the odds of menarche occurring at age = 0 being about 0.0000000006 (6e-10). Or, basically impossible. Exponentiating the age coefficient tells us the expected multiplicative change in the odds of menarche for each unit of age. In this case, it's just over a quintupling. An odds ratio of 1 indicates no change, whereas an odds ratio of 2 indicates a doubling, etc.
Your odds ratio of 2.07 implies that a 1 unit increase in 'Thoughts' increases the odds of taking the product by a factor of 2.07.
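To address the 0.01 increments asked about in the question, here is a small sketch of rescaling an odds ratio to a different step size. It assumes only the reported odds ratio of 2.07; the variable names are made up for illustration. Because the model is linear on the log-odds scale, the odds ratio for a step of size delta is exp(b * delta), not the per-unit odds ratio multiplied by delta.

```r
or_per_unit <- 2.07            # odds ratio per 1-unit increase, as reported in the question
b <- log(or_per_unit)          # implied coefficient on the log-odds scale

delta <- 0.01                  # step size of interest
or_per_step <- exp(b * delta)  # odds ratio for a 0.01 increase in Thoughts

# The same effect as a percentage change in the odds
pct_change <- (or_per_step - 1) * 100

or_per_step
pct_change
```

So a 0.01 increase in Thoughts multiplies the odds by a factor only slightly above 1 (well under a 1% increase in the odds), while 100 such steps compound back to the per-unit odds ratio of 2.07.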
How do you convert odds ratios of thoughts to an estimated probability of decision?
You need to do this for selected values of thoughts, because, as you can see in the plot above, the change is not constant across the range of x values. If you want the probability of some value for thoughts, get the answer as follows:
exp(intercept + coef*THOUGHT_Value) / (1 + exp(intercept + coef*THOUGHT_Value))
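The formula above can be checked against R's own prediction on the menarche example. This is a sketch: the age of 13 is an arbitrary illustrative value, and plogis() is base R's inverse logit, i.e. exp(x) / (1 + exp(x)).

```r
library(MASS)   # ships with R; provides the menarche data
data(menarche)

m <- glm(cbind(Menarche, Total - Menarche) ~ Age,
         family = binomial, data = menarche)

age <- 13       # arbitrary value at which to evaluate the probability

# By hand: inverse logit of intercept + coefficient * value
logit   <- coef(m)[["(Intercept)"]] + coef(m)[["Age"]] * age
by_hand <- plogis(logit)

# The same via predict(), which applies the inverse link for type = "response"
by_predict <- predict(m, newdata = data.frame(Age = age), type = "response")

c(by_hand = by_hand, by_predict = unname(by_predict))
```

For the model in the question, the same pattern would be predict(results, newdata = data.frame(Thoughts = 1), type = "response"), assuming the fitted object is called results.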
Odds and probability are two different measures, both addressing the same aim of measuring the likeliness of an event to occur. They should not be compared to each other, only among themselves!
While odds of two predictor values (while holding others constant) are compared using "odds ratio" (odds1 / odds2), the same procedure for probability is called "risk ratio" (probability1 / probability2).
In general, odds are preferred over probabilities when it comes to ratios, since probability is bounded between 0 and 1 while odds range from 0 to +inf (and log-odds from -inf to +inf).
To easily calculate odds ratios including their confidence intervals, see the oddsratio package:
library(oddsratio)
fit_glm <- glm(admit ~ gre + gpa + rank, data = data_glm, family = "binomial")
# Calculate OR for specific increment step of continuous variable
or_glm(data = data_glm, model = fit_glm,
incr = list(gre = 380, gpa = 5))
predictor oddsratio CI.low (2.5 %) CI.high (97.5 %) increment
1 gre 2.364 1.054 5.396 380
2 gpa 55.712 2.229 1511.282 5
3 rank2 0.509 0.272 0.945 Indicator variable
4 rank3 0.262 0.132 0.512 Indicator variable
5 rank4 0.212 0.091 0.471 Indicator variable
Here you can simply specify the increment of your continuous variables and see the resulting odds ratios. In this example, the odds of the response admit are about 55 times higher when the predictor gpa is increased by 5.
If you want to predict probabilities with your model, simply use type = "response" when predicting from your model. This will automatically convert log odds to probabilities. You can then calculate risk ratios from the calculated probabilities. See ?predict.glm for more details.
I found the epiDisplay package, which works fine! It might be useful for others, but note that your confidence intervals or exact results will vary according to the package used, so it is good to read the package details and choose the one that works well for your data.
Here is a sample code:
library(epiDisplay)
data(Wells, package="carData")
glm1 <- glm(switch~arsenic+distance+education+association,
family=binomial, data=Wells)
logistic.display(glm1)
The above formula for converting logits to probabilities, exp(logit)/(1+exp(logit)), may not have any meaning when applied to an odds ratio. This formula is normally used to convert odds to probabilities. However, in logistic regression an odds ratio is a ratio between two odds values (which happen to already be ratios). How would a probability be defined using the above formula? Instead, it may be more correct to subtract 1 from the odds ratio to obtain a percentage, and then interpret it as the odds of the outcome increasing/decreasing by x percent per unit change in the predictor.

What is the difference between Multiple R-squared and Adjusted R-squared in a single-variate least squares regression?

Could someone explain to the statistically naive what the difference between Multiple R-squared and Adjusted R-squared is? I am doing a single-variate regression analysis as follows:
v.lm <- lm(epm ~ n_days, data=v)
print(summary(v.lm))
Results:
Call:
lm(formula = epm ~ n_days, data = v)
Residuals:
Min 1Q Median 3Q Max
-693.59 -325.79 53.34 302.46 964.95
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2550.39 92.15 27.677 <2e-16 ***
n_days -13.12 5.39 -2.433 0.0216 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 410.1 on 28 degrees of freedom
Multiple R-squared: 0.1746, Adjusted R-squared: 0.1451
F-statistic: 5.921 on 1 and 28 DF, p-value: 0.0216
The "adjustment" in adjusted R-squared is related to the number of variables and the number of observations.
If you keep adding variables (predictors) to your model, R-squared will improve - that is, the predictors will appear to explain the variance - but some of that improvement may be due to chance alone. So adjusted R-squared tries to correct for this, by taking into account the ratio (N-1)/(N-k-1) where N = number of observations and k = number of variables (predictors).
It's probably not a concern in your case, since you have a single variate.
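The adjustment described above can be written as adjusted R-squared = 1 - (1 - R-squared) * (N - 1) / (N - k - 1). A minimal sketch follows: the first part plugs in the numbers printed in the question (28 residual degrees of freedom plus 2 estimated coefficients gives N = 30), and the second part checks the formula against summary() on the built-in mtcars data, which is only a stand-in for the poster's data frame v.

```r
# Values read from the summary output in the question
r2 <- 0.1746
n  <- 30     # 28 residual df + 2 estimated coefficients (intercept and slope)
k  <- 1      # one predictor

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2       # about 0.1451, matching the printed Adjusted R-squared

# Check the same formula against summary() on stand-in data
fit   <- lm(mpg ~ wt, data = mtcars)
s     <- summary(fit)
n_fit <- nobs(fit)
k_fit <- length(coef(fit)) - 1
manual <- 1 - (1 - s$r.squared) * (n_fit - 1) / (n_fit - k_fit - 1)

c(manual = manual, from_summary = s$adj.r.squared)
```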
Some references:
How high, R-squared?
Goodness of fit statistics
Multiple regression
Re: What is "Adjusted R^2" in Multiple Regression
The Adjusted R-squared is close to, but different from, the value of R2. Instead of being based on the explained sum of squares SSR and the total sum of squares SSY, it is based on the overall variance (a quantity we do not typically calculate), s2T = SSY/(n - 1) and the error variance MSE (from the ANOVA table) and is worked out like this: adjusted R-squared = (s2T - MSE) / s2T.
This approach provides a better basis for judging the improvement in a fit due to adding an explanatory variable, but it does not have the simple summarizing interpretation that R2 has.
If I haven't made a mistake, you should be able to verify the values of adjusted R-squared and R-squared as follows:
s2T <- sum(anova(v.lm)[[2]]) / sum(anova(v.lm)[[1]])
MSE <- anova(v.lm)[[3]][2]
adj.R2 <- (s2T - MSE) / s2T
On the other side, R2 is: SSR/SSY, where SSR = SSY - SSE
SSE <- deviance(v.lm) # or SSE <- sum((v$epm - fitted(v.lm))^2)
SSY <- deviance(lm(epm ~ 1, data = v)) # or SSY <- sum((v$epm - mean(v$epm))^2)
SSR <- (SSY - SSE) # or SSR <- sum((fitted(v.lm) - mean(v$epm))^2)
R2 <- SSR / SSY
The R-squared is not dependent on the number of variables in the model. The adjusted R-squared is.
The adjusted R-squared adds a penalty for adding variables to the model that are uncorrelated with the variable you're trying to explain. You can use it to test whether a variable is relevant to the thing you're trying to explain.
Adjusted R-squared is R-squared with some divisions added to make it dependent on the number of variables in the model.
Note that, in addition to number of predictive variables, the Adjusted R-squared formula above also adjusts for sample size. A small sample will give a deceptively large R-squared.
Ping Yin & Xitao Fan, J. of Experimental Education 69(2): 203-224, "Estimating R-squared shrinkage in multiple regression", compares different methods for adjusting r-squared and concludes that the commonly-used ones quoted above are not good. They recommend the Olkin & Pratt formula.
However, I've seen some indication that population size has a much larger effect than any of these formulas indicate. I am not convinced that any of these formulas are good enough to allow you to compare regressions done with very different sample sizes (e.g., 2,000 vs. 200,000 samples; the standard formulas would make almost no sample-size-based adjustment). I would do some cross-validation to check the r-squared on each sample.
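A minimal sketch of the cross-validation check suggested above, using the built-in mtcars data as a stand-in and an arbitrary choice of 5 folds. Each observation is predicted by a model that was fitted without it, and R-squared is then computed from those out-of-fold predictions.

```r
set.seed(42)

k_folds <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k_folds folds
folds <- sample(rep(seq_len(k_folds), length.out = n))

pred <- numeric(n)
for (i in seq_len(k_folds)) {
  test <- folds == i
  fit_i <- lm(mpg ~ wt, data = mtcars[!test, ])            # fit without fold i
  pred[test] <- predict(fit_i, newdata = mtcars[test, ])   # predict fold i
}

# Out-of-fold R-squared versus the usual in-sample R-squared
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
cv_r2 <- 1 - sum((mtcars$mpg - pred)^2) / ss_tot
in_sample_r2 <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

c(cross_validated = cv_r2, in_sample = in_sample_r2)
```

The cross-validated value cannot exceed the in-sample one for least squares, and the size of the gap gives a rough idea of how optimistic the in-sample R-squared is at a given sample size.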
