GLM/LM predicted values sum and high dispersion in R

Is there any reason why the sum of the predicted values and the sum of the dependent variable are the same?
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl*100, trt*20)
lm.D9 <- glm(weight ~ group,family = gaussian())
summary(lm.D9)
y <- predict(lm.D9, newdata = data.frame(group = group), type = "response")
sum(weight)
sum(y)
Also, the dispersion is very high (in my actual data). Any leads on how to tackle this? My real problem is building a model of actual vs. expected, and I have tried two different models:
- ratio of actual to expected as the dependent variable, with a Gaussian GLM;
- actual minus expected difference as the dependent variable.
But the dispersion in the second case is very high, and neither model validates.
Help appreciated!

You have two groups; when you perform a linear regression on a single grouping factor, the predicted value for each observation is the mean of its group:
predict(lm.D9,newdata=data.frame(group=c("Ctl","Trt")))
1 2
503.20 93.22
You can check this:
tapply(weight,group,mean)
Ctl Trt
503.20 93.22
And if you sum up the predicted values, each group contributes its number of observations times its mean, which gives you back the sum of the values you started with.
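A quick check of that identity with the objects above: each group's size times its mean adds back to the original total.
sum(tapply(weight, group, mean) * table(group))   # 10 * 503.20 + 10 * 93.22
sum(weight)                                       # same total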
We can also check how the data look, and to me they look fine, with no crazy outliers:
boxplot(weight ~ group)
You can check out this post; the dispersion in lm/glm is the sum of squared residuals divided by the residual degrees of freedom, essentially the squared deviation from your predicted values:
sum(residuals(lm.D9)^2)/lm.D9$df.residual
[1] 1825.962
Given that the mean of your data is 298.21, an average deviation of sqrt(1825.962) = 42.73 is pretty reasonable.
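For a Gaussian GLM the same quantity is also reported by summary(), so you can read it off directly from the fit above:
summary(lm.D9)$dispersion         # equals sum(residuals(lm.D9)^2)/lm.D9$df.residual
sqrt(summary(lm.D9)$dispersion)   # typical deviation from the fitted group means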

Related

Test for Poisson residuals in the analysis of variance model

I am trying to find a way to test Poisson residuals, analogous to the normality test on residuals from aov(). In my hypothetical example:
# For normal distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y1 <- rnorm(length(x), mean=10, sd=1.5)
#Normality test in aov residuals
y1.av<-aov(y1 ~ x)
shapiro.test(y1.av$res)
# Shapiro-Wilk normality test
#
#data: y1.av$res
#W = 0.99782, p-value = 0.7885
Sounds silly, OK!!
Now I'd like to take the same approach, but for a Poisson distribution:
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
#Normality test in aov residuals
y2.av<-aov(y2 ~ x)
poisson.test(y2.av$res)
Error in poisson.test(y2.av$res) :
'x' must be finite, nonnegative, and integer
Is there any statistical approach for doing this?
Thanks!
You could analyse your data in a counting context. Discrete data, such as Poisson-distributed variables, can be analysed on the basis of their observed frequencies. You can formulate a hypothesis test for this task: with data y, contrast the null hypothesis that y follows a Poisson distribution with some parameter lambda against the alternative hypothesis that y does not come from a Poisson distribution. Let's sketch the test with your data:
#Data
set.seed(123)
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
Now we obtain the counts, which are elemental for the test:
#Values
df <- as.data.frame(table(y2),stringsAsFactors = F)
df$y2 <- as.integer(df$y2)
After that we separate the observed frequencies O from their groups or classes. Together these two elements describe the variable y:
#Observed values
O <- df$Freq
#Groups
classes <- df$y2
As we are testing a Poisson distribution, we must estimate the lambda parameter. This can be done by maximum likelihood estimation (MLE). The MLE of lambda for the Poisson is the sample mean (here computed from the counts and their classes), so we compute it with the following code:
#MLE
meanval <- sum(O*classes)/sum(O)
Now, we have to get the probabilities of each class:
#Probs
prob <- dpois(classes,meanval)
The Poisson distribution has unbounded support, so we must also compute the probability of values greater than our last class in order to have probabilities that sum to one:
prhs <- 1-sum(prob)
This probability can simply be added to the last class so that it accounts for all values greater than or equal to it (for example, instead of only the probability that y equals 20, we use the probability that y is greater than or equal to 20):
#Add probability
prob[length(prob)]<-prob[length(prob)]+prhs
With this we can conduct a goodness-of-fit test using the chisq.test() function in R. It requires the observed counts O and the probabilities prob that we have computed. A reminder: chisq.test() reports the wrong degrees of freedom here, so we correct it using the formulation of the test with k - q - 1 degrees of freedom, where k is the number of classes and q is the number of estimated parameters (we estimated one parameter by MLE). Now the test:
chisq.test(O,p=prob)
The output:
Chi-squared test for given probabilities
data: O
X-squared = 7.6692, df = 17, p-value = 0.9731
The key value from the test is the X-squared test statistic. We can reuse it to obtain the correct p-value (in our example k = 18 and q = 1, so the degrees of freedom are 18 - 1 - 1 = 16).
The p-value can be obtained with the following code:
p.value <- 1-pchisq(7.6692, 16)
The output:
[1] 0.9581098
As this p-value is greater than the usual significance levels, we do not reject the null hypothesis, and the data are consistent with a Poisson distribution.
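The whole procedure can be wrapped in a small helper (a sketch; the function name poisson_gof and its structure are mine, not from any package):
poisson_gof <- function(y) {
  tab <- table(y)
  O <- as.vector(tab)                        # observed frequency of each count value
  classes <- as.integer(names(tab))          # the count values themselves
  lambda_hat <- sum(O * classes) / sum(O)    # MLE of lambda
  prob <- dpois(classes, lambda_hat)
  prob[length(prob)] <- prob[length(prob)] + (1 - sum(prob))   # fold the upper tail into the last class
  stat <- unname(chisq.test(O, p = prob)$statistic)
  df <- length(O) - 1 - 1                    # k - q - 1, with q = 1 estimated parameter
  c(statistic = stat, df = df, p.value = 1 - pchisq(stat, df))
}
poisson_gof(y2)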

Regression with weights: fewer standardized residuals than observations

I modelled a multiple regression based on the Mincer wage equation, and I added a weighting factor to make it representative of the whole population.
But when I add the weights argument to my model, R calculates fewer standardized residuals than I have observations.
Here's my model:
lm(log(earings) ~ Gender + Age + I(Age^2) + Education, weights = phrf)   # note: the quadratic term needs I()
So I have problems analysing the residuals, because when I try to plot rstandard against the fitted values, R reports that different variable lengths were found in rstandard().
This problem occurs only with rstandard() and rstudent(); when I plot the ordinary resid() against the fitted values there is no problem.
And when I leave out the weights there is no problem either.
In the help file for rstudent():
Note that cases with weights == 0 are dropped from all these functions, but that if a linear model has been fitted with na.action = na.exclude, suitable values are filled in for the cases excluded during fitting.
A simple example to demonstrate:
set.seed(123)
x <- 1:100
y <- x + rnorm(100)
w <- runif(100)
w[44] <- 0
fit <- lm(y ~ x, weights=w)
length(fitted(fit))
length(rstudent(fit))
Gives:
> length(fitted(fit))
[1] 100
> length(rstudent(fit))
[1] 99
And this makes sense. If an observation has a weight of 0, its theoretical variance is 0, which would make its studentized or standardized residual infinite.
Since you are effectively deleting those observations, you can subset the call to lm with subset = w != 0, or you can use that condition to subset the fitted values:
plot(fitted(fit)[w!=0], rstudent(fit))
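Alternatively, refit with the zero-weight cases excluded up front, so the lengths match by construction (reusing x, y and w from the example above):
fit2 <- lm(y ~ x, weights = w, subset = w != 0)
length(fitted(fit2))     # 99
length(rstudent(fit2))   # 99
plot(fitted(fit2), rstudent(fit2))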

Set contrasts in glm

I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=F)
df$k <- sapply(1:nrow(df), function(x) rbetabinom(n=1, prob=df$p[x], size=df$n[x],theta = df$theta[x], shape1=1, shape2=1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effect over all conditions, rather than relative to the reference condition A?
Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
size=df$n,theta = df$theta, shape1=1, shape2=1)
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k)~cond,data=df,
family=betabinomialff(zero=2)
## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2 cond1 cond2
## 0.4312181 0.5197579 -0.3121925 0.3011559
## Log-likelihood: -147.7304
Here the intercept is the mean shape parameter across the levels; cond1 and cond2 are the deviations of levels 1 and 2 from that mean (this doesn't directly give you the deviation of level 3 from the mean, but by construction it is -(cond1 + cond2); see the check below).
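To make that explicit, assign the vglm fit above to an object (I call it fit_vglm here; this assumes the contr.sum option set above is still in effect) and recover the third level's deviation from its coefficients:
fit_vglm <- vglm(cbind(k, n - k) ~ cond, data = df,
                 family = betabinomialff(zero = 2))
cc <- coef(fit_vglm)
-(cc["cond1"] + cc["cond2"])   # deviation of the third level (C) from the mean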
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k~dbetabinom(k, prob=plogis(lprob),
size=n, theta=exp(ltheta)),
parameters=list(lprob~cond),
data=df,
start=list(lprob=0,ltheta=0))
## Coefficients:
## lprob.(Intercept) lprob.cond1 lprob.cond2 ltheta
## -0.09606536 -0.31615236 0.17353311 1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is slightly better); in theory, if we allowed both shape1 and shape2 (VGAM) or both lprob and ltheta (bbmle) to vary across conditions, we'd get identical log-likelihoods from the two parameterizations.
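For reference, a sketch of that fully varying model in the bbmle parameterization (same data and variable names as above; I have only changed the parameters argument, and convergence will depend on the data):
mle2(k~dbetabinom(k, prob=plogis(lprob),
     size=n, theta=exp(ltheta)),
     parameters=list(lprob~cond, ltheta~cond),
     data=df,
     start=list(lprob=0,ltheta=0))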
Effects must be estimated relative to some base level; the effect of having any of the 3 conditions would be indistinguishable from the constant in the regression.
Since the intercept is the expected mean value when both estimated dummy levels (i.e. "B" and "C") are 0, it is the mean value of the reference group only (i.e. "A").
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus that group's coefficient. The comparison groups' coefficients therefore give you the effect of the comparison group being present (bearing in mind that each level of your categorical variable is a dummy variable which equals 1 when that level is present) relative to the reference group.
So your results give you the means and relative effects of each level. You can of course switch out the reference level according to your preference (see the sketch below).
That should hopefully give you all the information you need. If not then you need to ask yourself precisely what information it is that you're after.
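For example, to make "B" the reference level before refitting (a minimal sketch reusing df and the glm.nb call from the question):
df$cond <- relevel(factor(df$cond), ref = "B")
fit_B <- glm.nb(k ~ cond + offset(log(n)), data = df)
summary(fit_B)   # coefficients for A and C are now relative to condition B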

glm model dataset summarisation

first post, so go easy.
In the insurance world of GLMing, the classic approach is to model claims frequency and average severity. With that in mind, I built a couple of models to experiment for myself and now have a question.
Could somebody please explain how GLM handles varying levels of summarisation of a dataset, particularly with regard to error estimates?
Consider the example below. The data exhibits strong severity trends for both variables:
- A has more expensive claims than B
- Ford > Kia > Vaux > Jag
I fitted a model to an unsummarised and a summarised version of the dataset, and the GLM fitted the same parameters in both cases.
However, the GLM indicates a well-fitted model on the unsummarised data, but when I summarise and use a weighted mean, i.e. average severity, the model fits poorly. Maybe this is as you would expect; after all, the unsummarised data have more points to model with. Also, it appears the weights only indicate relative strength, so here specifying the weighted mean is pointless, since the weights are all the same.
But more fundamentally, can I not model average severity with a GLM? I know the result of fitting a GLM to an unsummarised dataset will be an average severity, but I was hoping to fit a model to already-summarised data. It appears that modelling on aggregated datasets will not give a true indication of the model fit.
Apologies if this is a stupid question; I'm not a statistician, so I don't fully understand the Hessian matrix.
Please see code below:
library(boot)
library(plyr)   # ddply() below comes from plyr, not reshape
dataset <- data.frame(
Person = rep(c("A", "B"), each=200),
Car = rep(c("Ford", "Kia", "Vaux", "Jag"), 2, each=50),
Amount = c(rgamma(50, 200), rgamma(50, 180), rgamma(50, 160), rgamma(50, 140),
rgamma(50, 100), rgamma(50, 80), rgamma(50, 60), rgamma(50, 40))
)
Agg1 <- ddply(dataset, .(Person, Car), summarise, mean=mean(Amount), length=length(Amount))
m1 <- glm(Amount ~ Person + Car, data = dataset, family = Gamma(link="log"))
m2 <- glm(mean ~ Person + Car, data = Agg1, family = Gamma(link="log"), weights=length)
summary(m1)
summary(m2)
Thanks,
Nick
The bottom line is that both models are identical - the reason the aggregated model "fits poorly" is entirely the reduction in degrees of freedom caused by aggregation.
Before getting into why the models are identical, I should point out that this does not necessarily mean that either model is a good fit. You should run diagnostics on both, especially using:
par(mfrow=c(2,2))
plot(m1)
When you do this, you'll see that the residuals are normally distributed (which is essential), but that they follow a pattern (-, +, -), which is troubling. I would want to understand that before declaring this a good model. [Admittedly, this is made-up data, but the principle applies nevertheless.]
Comparing the aggregated to base models, look at the values of the coefficients.
coef.m1 <- summary(m1)$coefficients
coef.m2 <- summary(m2)$coefficients
cbind(coef.m1[,1],coef.m2[,1])
# [,1] [,2]
# (Intercept) 5.4096980 5.4096976
# PersonB -0.9249371 -0.9249366
# CarJag -0.6144606 -0.6144602
# CarKia -0.1786556 -0.1786555
# CarVaux -0.3597925 -0.3597923
The reason you think the aggregated model is "worse" is the p-values, but these depend on t = coefficient / se. The ratio of the standard errors in m2 vs. m1 is the same for all coefficients:
coef.m2[,2]/coef.m1[,2]
# (Intercept) PersonB CarJag CarKia CarVaux
# 7.836171 7.836171 7.836171 7.836171 7.836171
Since
se ~ sd / sqrt(df),
the ratio of the standard errors for the two models should be approximately
se_m2 / se_m1 = sqrt( (n_m1 - 1) / (n_m2 - 1) )
sqrt((nrow(dataset)-1)/(nrow(Agg1)-1))
# [1] 7.549834
Frankly I'm puzzled why the ratio is not exactly equal to 7.55.
Put another way, glm(...) has no way of knowing that you aggregated your data. It thinks you are trying to fit a model with 4 parameters and an intercept to 8 data points.
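If you want to convince yourself that the two fits really are the same model, you can compare them directly (a quick check using m1, m2 and Agg1 from above; the coefficients only agree to about six decimal places because of the IRLS convergence tolerance):
all.equal(coef(m1), coef(m2), tolerance = 1e-5)   # should be TRUE
cbind(unsummarised = predict(m1, newdata = Agg1, type = "response"),
      summarised   = predict(m2, newdata = Agg1, type = "response"))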

Choice of statistical test (in R) of two apparently different distributions

I have the following data vectors, each with 10 samples.
The values indicate binding strength of a particular molecule.
What I want to show is that 'x' is statistically different from 'y', 'z' and 'w', which it appears to be if you look at x: it has more large values (2.85, 1.01, 5.40, etc.) than the others.
I tried a t-test, but all of the comparisons show an insignificant difference with a high p-value.
What's the appropriate test for that?
Below is my code:
#!/usr/bin/Rscript
x <-c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,0.888600425,0.728667099,0.000375047,0.911153571,0.553786408);
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
t.test(x,y)
t.test(x,z)
You have not specified in what way you expect the samples to differ. One typically assumes you mean that the means differ across samples; in that case, the t-test is appropriate. While x has some high values, it also has some low values which pull the mean in. It seems that what you thought was a significant difference (visually) is actually a larger variance.
If your question is about variance, then you need an F-test.
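In R, an F test comparing two variances can be run with var.test(), for example (using the vectors from the question):
var.test(x, y)   # F test of equal variances for x and y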
The classic test for this type of data is analysis of variance (ANOVA). It tells you whether the means of all four categories are likely the same (failure to reject the null hypothesis) or whether at least one mean likely differs from the others (rejection of the null hypothesis).
If the ANOVA is significant, you will often want to perform the Tukey HSD post-hoc test to figure out which category differs from the others. Tukey HSD yields p-values that are already adjusted for multiple comparisons.
library(ggplot2)
library(reshape2)
x <- c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,
0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,
0.888600425,0.728667099,0.000375047,0.911153571,0.553786408);
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,
0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,
0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
dat = data.frame(x, y, z, w)
mdat = melt(dat)
anova_results = aov(value ~ variable, data=mdat)
summary(anova_results)
# Df Sum Sq Mean Sq F value Pr(>F)
# variable 3 5.83 1.9431 2.134 0.113
# Residuals 36 32.78 0.9105
The ANOVA p-value is 0.113, and the Tukey test p-values for your "x" category are in a similar range. This is the quantification of your intuition that "x" is different from the others. Most researchers would find p = 0.11 suggestive, but it still carries too high a risk of being a false positive. Note that the large difference in means (diff column), along with the boxplot figure below, might be more persuasive than the p-value alone.
TukeyHSD(anova_results)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = value ~ variable, data = mdat)
#
# $variable
# diff lwr upr p adj
# y-x -0.92673335 -2.076048 0.2225815 0.1506271
# z-x -0.82314118 -1.972456 0.3261737 0.2342515
# w-x -0.88266565 -2.031981 0.2666492 0.1828672
# z-y 0.10359217 -1.045723 1.2529071 0.9948795
# w-y 0.04406770 -1.105247 1.1933826 0.9995981
# w-z -0.05952447 -1.208839 1.0897904 0.9990129
plot_1 = ggplot(mdat, aes(x=variable, y=value, colour=variable)) +
geom_boxplot() +
geom_point(size=5, shape=1)
ggsave("plot_1.png", plot_1, height=3.5, width=7, units="in")
In your question you referred to the distributions being different because some of them had more values greater than zero. If you define the distributions according to the number of values exceeding some cutoff, then you can use the binomial distribution (after converting the values to 1s and 0s). A function you could then use is prop.test().
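A minimal sketch of that approach (the cutoff of 1 is arbitrary and purely for illustration, since every value in these data is above 0):
counts <- sapply(list(x = x, y = y, z = z, w = w), function(v) sum(v > 1))
prop.test(counts, n = rep(10, 4))   # test whether the four proportions differ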
