Understanding lm Multiple R when fit is a horizontal line - r

I have been using R and the lm function to do linear regression and report R2.
y = c(1,2,3,4)
x = c(1,2,3,4)
f = lm(y~x)
r2 = summary(f)$r.squared
However, someone gave me this case-
y = c(1,1,1,1,1)
x = c(75,33,50,33,50)
Excel reports an intercept of 1, coefficient of 0 and Multiple R and r2 of 1.
R reports an intercept of 1, a coefficient on the order of 1e-17, and a Multiple R-squared of 0.3392
Not being a statistician, I'm not understanding where lm() gets that number for Multiple R-squared from. Could someone help me out with an explanation?
If I change the data to
y = c(1,1,1,1,1)
x = c(1,1,1,1,1)
Excel still gives y = 1 + 0 * x with r2 = 1,
whereas lm() reports the slope as NA and does not report a Multiple R-squared.
While this seems like an edge case, I'm still being told that my program, which calls lm(), does not work because it fails these tests while Excel gives the 'expected' answer.
Thanks

I thought I would summarize the very helpful but long series of comments related to my original question, which I will restate as: what is the appropriate value of r2 when y does not vary, i.e., when the y data can be fit exactly by the equation y = c?
a. Excel reports an r2 of 1, which is what my users want, since the data is fit exactly.
b. The r2 value should reflect the fraction of variation accounted for by the model as compared with that accounted for by the null hypothesis, i.e., the mean. The equation is
R2 = 1 - SSR/SST
where SSR is the sum of the squared distances between the actual and model (predicted) values and SST is the sum of the squared distances between the actual and mean values.
When the data exactly fits a horizontal line, there is no deviation from the mean. So asking what proportion of the deviation is accounted for by the model is actually meaningless. From the equation, one is dividing 0 by 0.
The values reported by R are thus likely to be nothing more than round-off error in values that are effectively zero.
I should therefore check for this condition and not report an r2 at all, rather than either reporting the number coming from R (lm) or the value that Excel would give (1).
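A quick way to see this in R, using the problem data from the question:
y <- c(1, 1, 1, 1, 1)
x <- c(75, 33, 50, 33, 50)
f <- lm(y ~ x)
SSR <- sum(residuals(f)^2)    # effectively zero; anything non-zero is rounding error
SST <- sum((y - mean(y))^2)   # exactly zero: y never deviates from its mean
c(SSR = SSR, SST = SST)       # so R2 = 1 - SSR/SST is 0/0 in exact arithmetic
summary(f)$r.squared          # the 0.3392 is an artefact of that rounding error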

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individual. Based on the observations, many of the balances are zero, but lm still tries to calculate a value for them.
To overcome this I created a new variable that results in zero if X and Y are true.
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so it correctly results in zeros when the proposition is true and the calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
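A minimal sketch with made-up data (since no reproducible example was given); the variable names and effect size here are invented purely for illustration:
set.seed(1)
d <- data.frame(Balance = runif(100, 0, 1000),
                Balzero = rbinom(100, 1, 0.7))
d$y <- with(d, Balzero * 0.8 * Balance + rnorm(100, sd = 10))
fit <- lm(y ~ -1 + Balzero:Balance, data = d)
coef(fit)                             # estimate of beta1
unique(fitted(fit)[d$Balzero == 0])   # exactly zero whenever Balzero == 0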
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

GAM smooths interaction differences - calculate p value using mgcv and gratia 0.6

I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether or where the y = 0 line crosses the CI, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is reported for each fitted smooth, testing the null hypothesis that its coefficients are all equal to 0, based on a chi-square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
The difference_smooths() function is described in the gratia 0.4 release notes, but gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by-factor levels is to vary the ci_level argument of difference_smooths() (the default is 0.95) until you find the level at which the y = 0 line is no longer within the CI bands. If, for example, this occurred at ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. It takes some manual experimentation, and it may be difficult to discern accurately when zero drops out of the CI, although a function could be written to search the data frame returned by difference_smooths() as ci_level is varied. It is also not fully satisfactory because detecting a non-zero CI depends on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM fitted with mgcv, so that shouldn't be too much of a problem.
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89, so an approximate p value would be 1 - 0.88 = 0.12.
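As a sketch of how to avoid the manual search: the band at each of the 100 evaluation points is just diff ± z * se, so the level at which zero first drops out (and hence the approximate p value above) can be computed directly from the difference_smooths() output. The column names diff and se are assumptions about what your gratia version returns; check names(diff1).
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
z <- abs(diff1$diff) / diff1$se   # standardised difference at each evaluation point
pointwise_p <- 2 * pnorm(-z)      # pointwise p value at each point
min(pointwise_p)                  # = 1 - (ci_level at which zero first drops out)
# Note these are pointwise, not simultaneous, intervals, so taking the minimum
# as a single overall p value is anti-conservative.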
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure that using the criterion >= 0 (for negative diffs) is a good way to go. Because of the draws from the posterior, there are likely to be many diffs that meet this criterion. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, calculate the percentage, and that is the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
(vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
group = c(rep(1,no.draws), rep(2,no.draws),
rep(3,no.draws), rep(4,no.draws),
rep(5,no.draws), rep(6,no.draws),
rep(7,no.draws), rep(8,no.draws)) )
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if the coefficients are not all equal to each other, then they cannot all be equal to zero, but that isn't what we want here.
The basis of the smooth tests of fit given by summary(gam.model1) in mgcv
is a joint test of all coefficients == 0. This is similar in spirit to a likelihood ratio test, in which the model fit with and without a term is compared.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I have got this far, I had a rethink of your original suggestion of using the criterion >= 0 (for negative diffs). I reinterpreted this as meaning: for each simulated coefficient-difference distribution (in my case 8), count how often this occurs and build a table where each row (in my case 8) corresponds to one of these distributions, with two columns holding this count and (number of simulation draws minus count). Then run a chi-square test on this table. When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth-difference CI across almost all levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>%
  mutate(interact.var = ifelse(factor.2levels == "yes", 1, 0) * cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interactive variable.
Then we place this interactive variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.
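A sketch of what I mean, reusing the names from the dplyr snippet above (method = "REML" is just my preference, not essential):
library(mgcv)
m2 <- gam(outcome ~ factor.2levels + s(cont.var) + s(interact.var),
          data = my.dat, method = "REML")
summary(m2)  # the approximate test of s(interact.var) == 0 is the test of interest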
@GavinSimpson actually posted a method of how to get the difference between two smooths and assess its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
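For reference, a minimal sketch of that ordered-factor set-up, reusing the variable names from the first example (this is my reconstruction, not Gavin's exact code):
my.data$fact_ord <- as.ordered(my.data$fact_var)
m_ord <- gam(outcome ~ fact_ord + s(dep_var) + s(dep_var, by = fact_ord),
             data = my.data, method = "REML")
summary(m_ord)  # the test for the s(dep_var):fact_ord... difference smooth is the one of interest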
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested (see the main post under the heading "Follow up thought Feb 24"), where the interaction variable is created, gives an almost identical p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept or the linear term for the by categorical variable, both of which changed with the ordered-factor approach.

ANOVA (AOV function) in R: Misleading p_value reported on equal values

I would greatly appreciate any guidance on the following: I am running ANOVA (aov) to retrieve p-values for a number of subsets of a larger data set. I bumped into a subset where all of my numeric values are equal to 36. Because it is part of a loop, the ANOVA is still executed and reports a seemingly infinitely small p-value of 1.2855e-134. Correct me if I am wrong, but the smaller the p-value, the stronger the evidence that the groups differ significantly?
For simplicity this is the subset:
(screenshot: SUBSET_FOR_ANOVA)
Here is how I calculate ANOVA and retrieve p_value, where TEMP_DF2 is just the subset you see attached:
#
anova_sweep <- aov(GOOD_PTS ~ MACH, data = TEMP_DF2)
p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]]
p_value <- p_value[1]
#
Many thanks for any guidance,
I can't replicate your findings. Let's produce an example dataset with all values being 36:
df <- data.frame(gr = rep(letters[1:2], 100),
y = 36)
summary(aov(y~gr, data = df))
Gives:
             Df    Sum Sq   Mean Sq F value Pr(>F)
gr            1 1.260e-27 1.262e-27       1  0.319
Residuals   198 2.499e-25 1.262e-27
Basically, depending on the sample size, we obtain a p-value around 0.3 or so. The F statistic is (by definition) always 1, since the between and within group variances are equal.
Are these results misleading? To some extent, yes. The estimated SS and MS should be 0, but aov calculates them as very, very small numbers. Some other statistical tests in R and in some packages check for zero variance and would produce an error, but aov apparently does not.
However, more importantly, I would say your data violates the assumptions of ANOVA, so any result cannot be trusted as a basis for conclusions. The expectation in R when it comes to statistical tests is usually that it is up to the user to apply the tests in the correct circumstances.
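If the loop has to keep running over arbitrary subsets, one defensive option is to check for a constant response before calling aov(); a sketch using the column names from the question:
if (isTRUE(all.equal(var(TEMP_DF2$GOOD_PTS), 0))) {
  p_value <- NA_real_   # constant response: ANOVA is not meaningful here
} else {
  anova_sweep <- aov(GOOD_PTS ~ MACH, data = TEMP_DF2)
  p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]][1]
}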

MLE regression that accounts for two constraints

So I am wanting to create a logistic regression that simultaneously satisfies two constraints.
The link here outlines how to use the Excel Solver to maximize the log-likelihood of a logistic regression, but I want to implement something similar in R.
What I am trying to create in the end is an injury risk function; these take an S-shaped form.
As we see, the risk curves are calculated from the following equation
Let's take some dummy data to begin with
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)
df1 <- data.frame(A,B,C)
Let's assume A indicates whether a bone was broken, B is the bone density and C is the force applied.
So we can form a logistic regression model that uses B and C to explain the outcome variable A. A simple example of the regression may be:
glm(A ~ B + C, data=df1, family=binomial())
Now we want to make the first assumption: at zero force, we should have zero risk. This is further explained as assumption A1 on pg. 124 here
Here we set our A1=0.05 and solve the equation
A1 = 1 - (1-P(0))^n
where P(0) is the probability of injury when the injury related parameter is zero and n is the sample size.
We have our sample size and can solve for P(0); we get 3.4E-4.
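In R, that back-calculation is just:
A1 <- 0.05
n  <- 153
p0 <- 1 - (1 - A1)^(1 / n)
p0   # ~3.35e-04, the 3.4E-4 quoted above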
The second assumption is that we should maximize the log-likelihood function of the regression. That is, we want to maximize
LL = sum over i of [ yi * log(pi) + (1 - yi) * log(1 - pi) ]
where pi is the injury probability estimated from the risk equation above and yi is the observed break / non-break outcome for each observation.
From what I understand, I have to use one of two functions in R to define a function for maximizing the LL: mle from the stats4 package that ships with R, or mle2 from the bbmle package.
I guess I need to write a function along these lines
log.likelihood.sum <- function(sequence, p) {
  sum(log(p) * (sequence == 1)) + sum(log(1 - p) * (sequence == 0))
}
But I am not sure where I should account for the first assumption. I.e., am I best to build it into the above code, and if so, how? Or would it be more efficient to write a separate function that combines the two assumptions? Any advice would be great, as I have very limited experience in writing and understanding functions.
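One way to build the first assumption in (a sketch, not necessarily the approach in the linked paper): the zero-force constraint pins the intercept at qlogis(P(0)), so you can supply that as an offset and let glm() maximize the likelihood over the remaining slope. Using only the force variable C from the dummy data above:
p0  <- 3.4e-4                               # P(0) from the first assumption
fit <- glm(A ~ C - 1, family = binomial, data = df1,
           offset = rep(qlogis(p0), nrow(df1)))
summary(fit)
# implied risk curve, which passes through p0 at zero force:
force_grid <- seq(0, 200, by = 1)
risk <- plogis(qlogis(p0) + coef(fit)["C"] * force_grid)
The same constrained fit could be written with mle2 from bbmle using the log.likelihood.sum function above, with p = plogis(qlogis(p0) + beta * C); the offset just lets glm handle the maximization for you.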

Weighted censored regression in R?

I am very new to R (mostly program in SQL) but was faced with a problem that SQL couldn't help me with. I'll try to simplify the problem below.
Assume I have a set of data with 100 rows where each row has a different weight associated with it. Out of those 100 rows of data, 5 have an X value that is top-coded at 1000. Also assume that X can be represented by the linear equation X ~ Y + Z + U + 0 (want a positive value so I don't want a Y-intercept).
Now, without taking the weights of each row of data into consideration, the formula I used in R was:
fit = censReg(X ~ Y + Z + U + 0, left = -Inf, right = 1000, data = dataset)
If I computed summary(fit) I would get 0 left-censored values, 95 uncensored values, and 5 right censored values which is exactly what I want, minus the fact that the weights haven't been sufficiently added into the mix. I checked the reference manual on the censReg function and it doesn't seem like it accepts a weight argument.
Is there something I'm missing about the censReg function or is there another function that would be of better use to me? My end goal is to estimate X in the cases where it is censored (i.e. the 5 cases where it is 1000).
You should use Tobit regression for this situation; it is designed specifically for linear modelling of latent variables such as the one you describe.
The regression accounts for your weights and the censored observations, which can be seen in the derivation of the log-likelihood function for the Type I Tobit (upper and lower bounded).
Tobit regression is available in the VGAM package via the vglm function with the tobit() family function. An excellent example can be found here:
http://www.ats.ucla.edu/stat/r/dae/tobit.htm
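A sketch of a weighted call, assuming your weights live in a column called wt (check the current VGAM documentation for the exact argument behaviour):
library(VGAM)
fit_w <- vglm(X ~ Y + Z + U,
              family = tobit(Upper = 1000),   # right-censored at the top-code of 1000
              weights = dataset$wt,
              data = dataset)
summary(fit_w)
The intercept is kept here for simplicity; dropping it in vglm affects all of the model's linear predictors (including the one for the error SD), so handle that separately. survival::survreg() also accepts a weights argument if you prefer a survival-style parameterisation of the censored model.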
