How to interpret the linear regression coefficients summarized by R?

Please find reprex below:
library(tidyverse)
# Work days for January from 2010 - 2018
data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
sale = c(1205,2111,2452,2054,2440,1212,1211,2111))
# Apply linear regression
model = lm(sale ~ work_days, data)
summary(model)
Call:
lm(formula = sale ~ work_days, data = data)
Residuals:
Min 1Q Median 3Q Max
-677.8 -604.5 218.7 339.0 645.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2643.82 5614.16 0.471 0.654
work_days -38.05 268.75 -0.142 0.892
Residual standard error: 593.4 on 6 degrees of freedom
Multiple R-squared: 0.00333, Adjusted R-squared: -0.1628
F-statistic: 0.02005 on 1 and 6 DF, p-value: 0.892
Could you please help me understand the coefficients: does every work day decrease the sale by 38.05?
data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
sale = c(1212,1211,2111,1205,2111,2452,2054,2440))
model = lm(sale ~ work_days, data)
summary(model)
Call:
lm(formula = sale ~ work_days, data = data)
Residuals:
Min 1Q Median 3Q Max
-686.8 -301.0 -8.6 261.3 599.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6220.0 4555.9 -1.365 0.221
work_days 386.6 218.1 1.772 0.127
Residual standard error: 481.5 on 6 degrees of freedom
Multiple R-squared: 0.3437, Adjusted R-squared: 0.2343
F-statistic: 3.142 on 1 and 6 DF, p-value: 0.1267
Does this mean every workday increases the sales by 387? And how should I interpret the negative intercept?
Similar questions, but I couldn't apply what I learned from them:
Interpreting regression coefficients in R
Interpreting coefficients from Logistic Regression from R
Linear combination of regression coefficients in R

Could you please help me understand the coefficients: does every work day decrease the sale by 38.05?
Yes and no. Given only the 8 data points, the best regression line has a negative slope of -38.05, which appears counterintuitive.
However, you need to take the standard error of this -38.05 value into account, which is 268.75. The result can be read as: "in this sample the slope looks negative, but it might just as well be positive; anything between -38.05 + 2*268.75 and -38.05 - 2*268.75 is a reasonable guess." So do not extrapolate from this small sample to anything beyond this sample.
Also look at
Multiple R-squared: 0.00333
This means less than 1% of the sample variance is explained by this regression. Do not take it too seriously, and do not try to read much into numbers from such a small sample.
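That range is essentially a 95% confidence interval. A minimal sketch to compute it directly, assuming model is the first fit from the question (sale ~ work_days); confint() uses the exact t quantile with 6 degrees of freedom, so its interval is slightly wider than the rough plus/minus 2 standard errors, but the conclusion is the same:
# 95% confidence intervals for the intercept and the work_days slope;
# the slope interval spans zero, so even the sign of the effect is undetermined
confint(model)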
Every workday increases the sales by 387? How about the negative intercept?
Judging only from the small sample you investigated, it looks like every workday increases sales by about 387. However, the standard error is high, so you cannot tell whether additional workdays would increase or decrease sales outside of this small sample. The whole model is not significant, so nobody claims this model is better than pure guessing.
How about the negative intercept?
You forced the computer to fit a linear model. Such a model will happily produce nonsensical predictions, like "what if sales were a linear function of work days and a month had zero or negative work days?". You could of course force R to fit a linear model in which zero workdays lead to zero sales, and this brings us back on topic. Forcing R to fit a model through the point (0, 0) takes the following syntax (note that the response in the question's data is called sale, not sales):
model <- lm(sale ~ work_days - 1, data = data)
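A short usage sketch of that through-the-origin fit, using the data frame from the question; note that the R-squared reported for a no-intercept model is computed differently and is not comparable to the one above:
# Fit a regression forced through (0, 0): a single slope, no intercept
fit0 <- lm(sale ~ work_days - 1, data = data)
coef(fit0)   # average sale per work day under the zero-intercept constraint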

The intercept of the regression line is interpreted as the predicted sale when work_days is equal to zero. If the predictor (work_days in this case) cannot plausibly be zero, the intercept has no useful interpretation on its own. The slope estimate of -38.05 can be interpreted as follows: for each additional work day, the predicted sale decreases by 38.05.
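As a small illustration, assuming model is the first fit from the question, predicting at work_days = 0 simply returns the intercept, which is why the intercept is only meaningful when zero is a plausible value of the predictor:
# Prediction at work_days = 0 equals the intercept (about 2643.8 here),
# an extrapolation far outside the observed range of 20-22 work days
predict(model, newdata = data.frame(work_days = 0))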

Related

How can I test a variable as confounding in linear regression in R?

I'm currently doing the statistical analysis I'm going to use in my article. It's about sleep and some functioning/cognitive measures on mood disorder patients.
The problem I have is: I correlated a functioning score (continuous variable) with a sleep quality score (continuous variable) using Spearman's correlation. It had a significant p-value (p < 0.05).
And now I would like to test some variables as confounders, such as years of education (numeric), use of hypnotics and sedatives (dichotomous), suicide risk (dichotomous) and psychotherapeutic/pharmacological treatment.
I use R to run all of my analyses. My advisor said that I should treat as confounders those variables that are associated with both the exposure and the outcome with a p-value < 0.20 in the crude analysis, using a linear regression model.
What I've tried (I don't actually know whether it's correct, or how I should interpret the output):
summary(lm(functioning_score ~ sleep_score + years_of_education + sleep_score*years_of_education, data = data))
Call:
lm(formula = functioning_score ~ sleep_score + years_of_education + sleep_score*years_of_education, data = data)
Residuals:
Min 1Q Median 3Q Max
-29.08309673 -7.39316605 -1.09011226 5.49959525 31.53154265
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9339474261 9.4999174669 0.83516 0.4073078
sleep_score 2.1791574956 0.7761289987 2.80773 0.0069287
years_of_education -0.3209011778 0.8309350862 -0.38619 0.7008713
sleep_score:years_of_education -0.0144874634 0.0746344060 -0.19411 0.8468163
---
Residual standard error: 11.106846 on 54 degrees of freedom
Multiple R-squared: 0.519259282, Adjusted R-squared: 0.492551464
F-statistic: 19.4422205 on 3 and 54 DF, p-value: 0.0000000112304043
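A minimal sketch of the crude (one-variable-at-a-time) screening the advisor describes: fit each candidate confounder alone against the outcome and keep those with p < 0.20. The confounder column names below are assumptions for illustration, not taken from the question:
# Variable names are hypothetical placeholders for the actual columns in `data`
candidates <- c("years_of_education", "hypnotics_use", "suicide_risk", "treatment")
crude_p <- sapply(candidates, function(v) {
  f <- reformulate(v, response = "functioning_score")
  coef(summary(lm(f, data = data)))[2, "Pr(>|t|)"]
})
crude_p[crude_p < 0.20]   # candidates to carry into the adjusted model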

How to fit a known linear equation to my data in R?

I used a linear model, the lm() function, to obtain the best fit to my data.
From the literature I know that the optimal fit would be a linear regression with slope = 1 and intercept = 0. I would like to see how well this equation (y = x) fits my data. How do I proceed in order to find an R^2 as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how well the equation y = x fits the data presented above (R^2 and p-value).
I would be very grateful if somebody could help me with this (basic) problem, as I found no answer to my question on Stack Overflow.
Best regards Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled) / var(measured)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled, in that the slope in the lm model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
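A quick way to check that claim directly:
# 95% CIs: the intercept interval should cover 0 and the slope interval should cover 1
confint(lm(measured ~ modelled))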
As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If the slope = 1 and the intercept = 0, essentially you already have a fit and your modelled values are already the predicted values. You need the R squared from this fit. R squared is defined as:
R2 = MSS/TSS = (TSS - RSS)/TSS
where RSS is the residual sum of squares and TSS is the total sum of squares.
We can only work with observations that are complete (non-NA), so we first identify the complete pairs and then calculate each term:
nonNA = !is.na(modelled) & !is.na(measured)
# residuals from your prediction
RSS = sum((modelled[nonNA] - measured[nonNA])^2,na.rm=T)
# total residuals from data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2,na.rm=T)
1 - RSS/TSS
[1] 0.7116585
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, and fm is the lm object for that undisclosed model, then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled, but the formula differs depending on whether or not there is an intercept in the undisclosed model. The signs are that there is no intercept, since if there were an intercept, sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
In any case R^2 and the p value are shown in the output of the summary(fm) where fm is the undisclosed linear model so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.
For example, if the undisclosed model is the following, using the built-in CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we get the following output, where the last two lines show the R squared and the p value.
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16

Access z-value and other statistics in output of Zelig relogit

I want to compute a logit regression for rare events. I decided to use the Zelig package (relogit function) to do so.
Usually, I use stargazer to extract and save regression results. However, there seem to be compatibility issues with these two packages (Using stargazer with Zelig).
I now want to extract the following information from the Zelig relogit output:
Coefficients, z values, p values, number of observations, log likelihood, AIC
I have managed to extract the p-values and coefficients, but I failed at the rest. I am sure these values must be accessible somehow, because they are reported in the summary() output (though I did not manage to store the summary output as a useful R object). The summary cannot be processed in the same way as a regular glm summary (https://stats.stackexchange.com/questions/176821/relogit-model-from-zelig-package-in-r-how-to-get-the-estimated-coefficients).
A reproducible example:
##Initiate package, model and data
require(Zelig)
data(mid)
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years,
data = mid, model = "relogit")
##Call summary on output (reports in console most of the needed information)
summary(z.out1)
##Storing the summary fails and only produces a useless object
summary(z.out1) -> z.out1.sum
##Some of the output I can access as follows
z.out1$get_coef() -> z.out1.coeff
z.out1$get_pvalue() -> z.out1.p
z.out1$get_se() -> z.out1.se
However, I did not find similar commands for the other elements, such as z values, AIC, etc. Since they are shown in the summary() call, they should be accessible somehow.
The summary call result:
Model:
Call:
z5$zelig(formula = conflict ~ major + contig + power + maxdem +
mindem + years, data = mid)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0742 -0.4444 -0.2772 0.3295 3.1556
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.535496 0.179685 -14.111 < 2e-16
major 2.432525 0.157561 15.439 < 2e-16
contig 4.121869 0.157650 26.146 < 2e-16
power 1.053351 0.217243 4.849 1.24e-06
maxdem 0.048164 0.010065 4.785 1.71e-06
mindem -0.064825 0.012802 -5.064 4.11e-07
years -0.063197 0.005705 -11.078 < 2e-16
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3979.5 on 3125 degrees of freedom
Residual deviance: 1868.5 on 3119 degrees of freedom
AIC: 1882.5
Number of Fisher Scoring iterations: 6
Next step: Use 'setx' method
Use from_zelig_model for deviance, AIC.
m <- from_zelig_model(z.out1)
m$aic
...
Z values are the coefficient divided by its standard error:
z.out1$get_coef()[[1]]/z.out1$get_se()[[1]]
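If from_zelig_model() hands back an ordinary glm object here, as it should for a logit-family model, the remaining quantities can be pulled with the standard extractors; a small sketch (treat the exact return class as an assumption for your Zelig version):
m <- from_zelig_model(z.out1)
AIC(m)                           # Akaike information criterion
logLik(m)                        # log likelihood
nobs(m)                          # number of observations
coef(summary(m))[, "z value"]    # z values as reported by summary()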

Residual standard error in survey package

I am trying to calculate the residual standard error of a linear regression model using the survey package. I am working with a complex design, and the sampling weight of the complex design is given by "weight" in the code below.
fitM1 <- lm(med~x1+x2,data=pop_sample,weights=weight)
fitM2 <- svyglm(med~x1+x2,data=pop_sample,design=design)
First, if I call "summary(fitM1)", I get the following:
Call: lm(formula=med~x1+x2,data=pop_sample,weights=weight)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.001787 0.042194 0.042 0.966
x1 0.382709 0.061574 6.215 1.92e-09 ***
x2 0.958675 0.048483 19.773 < 2e-16 ***
Residual standard error: 9.231 on 272 degrees of freedom
Multiple R-squared: 0.8958, Adjusted R-squared: 0.8931
F-statistic: 334.1 on 7 and 272 DF, p-value: < 2.2e-16
Next, if I call "summary(fitM2)", I get the following:
summary(fitM2)
Call: svyglm(formula=med~x1+x2,data=pop_sample,design=design)
Survey design: svydesign(id=~id_cluster,strat=~id_stratum,weight=weight,data=pop_sample)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.001787 0.043388 0.041 0.967878
x1 0.382709 0.074755 5.120 0.000334 ***
x2 0.958675 0.041803 22.933 1.23e-10 ***
When using "lm", I can extract the residual standard error by calling:
fitMvariance <- summary(fitM1)$sigma^2
However, I can't find an analogous function for "svyglm" anywhere in the survey package. The point estimates are the same when comparing the two approaches, but the standard errors of the coefficients (and, presumably, the residual standard error of the model) are different.
Survey Analysis
Use the survey library in R to perform survey analysis; it offers a wide range of functions to calculate statistics such as percentages, lower and upper confidence limits, population totals and the relative standard error (RSE).
RSE
We can use the svyby function in the survey package to get all of these statistics, including the relative standard error.
library("survey")
Survey design: svydesign(id=~id_cluster,strat=~id_stratum,weight=weight,data=pop_sample)
svyby(~med, ~x1+x2, design, svytotal, deff=TRUE, verbose=TRUE,vartype=c("se","cv","cvpct","var"))
The cvpct column gives the relative standard error as a percentage.
Refer to the svyby documentation for further information.
Because svyglm is built on glm not lm, the variance estimate is called $dispersion rather than $sigma
> data(api)
> dstrat<-svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
+ fpc = ~fpc)
> model<-svyglm(api00~ell+meals+mobility, design=dstrat)
> summary(model)$dispersion
variance SE
[1,] 5172 492.28
This is the estimate of $\sigma^2$, which is the population residual variance. In this example we actually have the whole population, so we can compare
> popmodel<-lm(api00~ell+meals+mobility, data=apipop)
> summary(popmodel)$sigma
[1] 70.58365
> sqrt(5172)
[1] 71.91662
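So a rough analogue of summary(fitM1)$sigma for the svyglm fit is the square root of that dispersion estimate, for example (assuming the dispersion component can be coerced to a plain number):
# Residual standard error (population-scale sigma) estimated from the survey fit
sqrt(as.numeric(summary(model)$dispersion))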

Using zelig for simulation

I am very confused about the package Zelig and in particular the function sim.
What I want to do is estimate a logistic regression using a subset of my data and then estimate the fitted values of the remaining data to see how well the estimation performs. Some sample code follows:
library(Zelig)
library(data.table)
data(turnout)
turnout <- data.table(turnout)
# Shuffle the data
turnout <- turnout[sample(.N, 2000)]
# Create a sample for regression
turnout_sample <- turnout[1:1800,]
# Create a sample for out-of-sample testing
turnout_sample2 <- turnout[1801:2000,]
# Run the regression
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout_sample)
summary(z.out1)
Model:
Call:
z5$zelig(formula = vote ~ age + race, data = turnout_sample)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9394 -1.2933 0.7049 0.7777 1.0718
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.028874 0.186446 0.155 0.876927
age 0.011830 0.003251 3.639 0.000274
racewhite 0.633472 0.142994 4.430 0.00000942
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2037.5 on 1799 degrees of freedom
Residual deviance: 2002.9 on 1797 degrees of freedom
AIC: 2008.9
Number of Fisher Scoring iterations: 4
Next step: Use 'setx' method
# Set the x values to the remaining 200 observations
x.out1 <- setx(z.out1, fn = NULL, data = turnout_sample2)
# Simulate
s.out1 <- sim(z.out1, x = x.out1)
# Get the fitted values
fitted <- s.out1$getqi("ev")
What I don't understand is that the list fitted now contains 1000 values, and all the values are between 0.728 and 0.799.
1. Why are there 1000 values when what I am trying to estimate is the fitted value of 200 observations?
2. And why are the observations so closely grouped?
I hope someone can help me with this.
Best regards
The first question: from the signature of sim (sim(obj, x = NULL, x1 = NULL, y = NULL, num = 1000, ...)) you can see that the default number of simulations is 1000. If you want 200, set num = 200.
However, the sim call in this example from the documentation actually generates (simulates) the probability that a person will vote given certain covariate values (either computed by setx, or fixed at some value as in setx(z.out, race = "white")).
So in your case you have 1000 simulated probability values between 0.728 and 0.799, which is what you are supposed to get.
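If what you actually want is one fitted probability per held-out observation rather than simulated expected values, one option (a sketch, assuming from_zelig_model() is available in your Zelig version) is to convert the Zelig fit to an ordinary glm and use predict():
# Convert the Zelig fit to a plain glm, then predict on the 200 held-out rows
m <- from_zelig_model(z.out1)
fitted_probs <- predict(m, newdata = turnout_sample2, type = "response")
length(fitted_probs)   # 200, one fitted probability per held-out observation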
