Comparing point patterns, each with differing windows, in R

I am working on point pattern statistics in R, using the spatstat package. I have a number of ppp objects and would like to compare them all, to find slight differences between the patterns that I might miss by just looking at the heatmaps, etc. I would also like to quantify the differences between the patterns somehow.
One problem is that the windows are differently shaped (slightly) for each pattern.
I am applying spatstat to plant leaves; here are the resulting tessellations to give you an idea of window shape, etc.:
Tessellation, Leaf 1
Tessellation, Leaf 2
Tessellation, Leaf 3
How would I go about comparing the patterns, and seeing where they differ?
I would also like to see, for example, if after analyzing 10 patterns, there's commonly a band of increased density across the midsection of the leaf, that is hard to detect by simply looking at individual density images. Is there a way to go about this?

Some ideas with two artificial datasets:
library(spatstat)
set.seed(42)
W1 <- ellipse(1,2)
W2 <- rotate(W1, angle = pi/4)
P1 <- rpoispp(20, win = W1)
P2 <- rpoispp(20, win = W2)
plot(solist(P1=P1, P2=P2), main = "", equal.scales = TRUE)
Manually make a relevant center line for each window (you can use
ends <- clickppp(2, add = TRUE) on top of a plot of each individual pattern
to interactively choose the end points by clicking the plot and then use the
coordinates to create the line with psp):
L1 <- psp(0, 1, 0, -1, window = W1)
L2 <- rotate(L1, angle = pi/4)
plot(solist(L1=L1, L2=L2), main = "", equal.scales = TRUE)
Define the distance from the center line:
D1 <- distfun(L1)
D2 <- distfun(L2)
plot(solist(D1=D1, D2=D2), main = "", equal.scales = TRUE)
Then you can fit point process models with this covariate.
E.g. the simple log-linear model for one of the datasets:
ppm(P1 ~ D1)
#> Nonstationary Poisson process
#>
#> Log intensity: ~D1
#>
#> Fitted trend coefficients:
#> (Intercept) D1
#> 3.1368706 -0.1768286
#>
#>                Estimate      S.E.    CI95.lo   CI95.hi Ztest       Zval
#> (Intercept)  3.1368706 0.1912389  2.7620491 3.5116920   *** 16.4028872
#> D1          -0.1768286 0.3328005 -0.8291056 0.4754484       -0.5313351
Here the log-linear effect of distance to the center line is insignificant, as expected (since the data were generated with homogeneous intensity).
From here you can explore different types of models (e.g. proportional to
distance via offset(), Gibbs models, models for replicated experiments via
mppm(), which may be very relevant here, etc.). E.g. a joint log-linear
model for the two datasets:
dat <- hyperframe(points = list(P1, P2), linedist = list(D1, D2))
mppm(points ~ linedist, data = dat)
#> Point process model fitted to 2 point patterns
#> Call: mppm(points ~ linedist, data = dat)
#> Log trend formula: ~linedist
#> Fitted trend coefficients:
#> (Intercept) linedist
#> 2.98511651 0.06387864
#>
#> Interaction for all patterns: Poisson process
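To make the offset() and Gibbs ideas mentioned above concrete, here is a minimal sketch (purely illustrative: the Strauss radius is an arbitrary choice, and the log-offset may need a small positive shift if any quadrature point lies exactly on the line):
# intensity proportional to distance from the center line,
# i.e. log(lambda) = intercept + log(D1)
ppm(P1 ~ offset(log(D1)))
# a Gibbs (Strauss) model with the same covariate; r = 0.1 is arbitrary
ppm(P1 ~ D1, Strauss(r = 0.1))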
Chapters 9 (freely available online), 13 and 16 in the
spatstat book
may be useful (disclaimer: I’m a co-author).

Related

Troubles predicting fixed effects from a hierarchical GAM in mgcv

I have been fitting different hierarchical GAMs (hereafter: HGAM) using mgcv in R. I can extract and plot their predictions for their random effects without problems. Conversely, extracting and plotting their predictions for their fixed effects only works for some models, and I don't know why.
Here is a practical example, which refers to the color spectra of flowers from two species (Taxon) sampled at various localities (also discussed here):
rm(list=ls()) # wipe R's memory clean
library(pacman) # load packages, installing them from CRAN if needed
p_load(RCurl) # allows accessing data from URL
ss <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
head(ss)
ss$density <- ifelse(ss$density<0, 0, ss$density) # set spurious negative reflectance values to zero
ss$clr <- ifelse(ss$Taxon=="SpeciesB", "red", "black")
ss <- with(ss, ss[order(Locality, wl), ])
These are the mean color spectra at the population level for the two species (rolling means were used):
Each color refers to a different species. Each line refers to a different locality.
The following model is a HGAM of type G according to Pedersen et al.'s classification (2019) and it does not give any issues:
gam_G1 <- bam(density ~ Taxon                       # main effect
                        + s(wl, by = Taxon, k = 20) # interaction
                        + s(Locality, bs = "re"),   # "re" is short for "random effect"
              data = ss, method = 'REML',
              family = "quasipoisson"
              )
# gam.check(gam_G1)
# k.check(gam_G1)
# MuMIn::AICc(gam_G1)
# gratia::draw(gam_G1)
# plot(gam_G1, pages=1)
# use gam_G1 to predict wl by Locality
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# predict:
pred <- predict(object= gam_G1, newdata=nn, type="response", se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
# use gam_G1 to predict wl by Taxon
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_G1,
                type = "response",
                newdata = nn,
                se.fit = T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
R warns me that "factor levels 0 not in original fit", but it executes the task without issues:
Left panel: gam_G1 predictions at the Locality level. Right panel: gam_G1 predictions for the fixed effects.
Troublesome models
The following model is a HGAM of type "GI" sensu Pedersen et al. (2019). It produces more accurate predictions at the Locality level, but I can only get NA as predictions at the level of fixed effects:
# GI: models with a global smoother for all observations,
# plus group-level smoothers, the wiggliness of which is estimated individually
start_time <- Sys.time()
gam_GI1 <- bam(density ~ Taxon                       # main effect
                         + s(wl, by = Taxon, k = 20) # interaction
                         + s(wl, by = Locality, bs = "tp", m = 1)
                         # "tp" is short for "thin plate [regression spline]"
                         + s(Locality, bs = "re"),
               family = "quasipoisson",
               data = ss, method = 'REML'
               )
end_time <- Sys.time()
end_time - start_time # it took ~2.2 minutes on my computer
# gam.check(gam_GI1)
# k.check(gam_GI1)
# MuMIn::AICc(gam_GI1)
Attempt at drawing predictions for the fixed effects (Taxon and wl) according to gam_GI1:
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_GI1,
                type = "response",
                # exclude = "c(Locality)",
                # # this should turn random effect off
                # # (doesn't work for me)
                newdata = nn,
                se.fit = T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
head(nn)
#       wl    Taxon Locality clr fit se
# 1 298.34 SpeciesB        0 red  NA NA
# 2 305.82 SpeciesB        0 red  NA NA
# 3 313.27 SpeciesB        0 red  NA NA
# 4 320.72 SpeciesB        0 red  NA NA
# 5 328.15 SpeciesB        0 red  NA NA
# 6 335.57 SpeciesB        0 red  NA NA
Left panel: gam_GI1 predictions at the Locality level. Right panel (blank): gam_GI1 predictions for the fixed effects.
The following model, which includes a global smoother for all observations, plus group-level smoothers, all with the same "wiggliness", doesn't provide fixed-effect predictions either:
gam_GS1 <- bam(density ~ Taxon                       # main effect
                         + s(wl, by = Taxon, k = 20) # interaction
                         + s(wl, by = Locality, bs = "fs", m = 1),
               # "fs" is short for "factor-smoother [interaction]"
               family = "quasipoisson",
               data = ss, method = 'REML'
               )
Why don't gam_GI1 and gam_GS1 produce predictions for their fixed effects, and how can I obtain them?
The models can take a few minutes to run. To save time, their output can be downloaded from here as an RData file. My R scripts (which include the code for plotting the figures) are available here.
I think you are conflating several things here. The by trick to turn off random effects only works for bs = "re" smooths. Locality is a factor (otherwise your random effect isn't a random intercept), and setting it to 0 is creating a new level (although it could be creating an NA, as 0 isn't among the original levels).
If what you want to do is turn off anything to do with Locality, you should use exclude; however, you have the invocation wrong. The reason it's not working is that you are creating a character vector with a single element "c(Locality)". This fails for obvious reasons once you realize that c(Locality) doesn't relate to anything in your model. What you need to provide here is a vector of smooth names as printed by summary(). For example, to exclude the smooth s(Locality, bs = "re"), {mgcv} knows this as s(Locality), so you would use exclude = "s(Locality)".
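A minimal sketch of that exclude usage for gam_G1 (Locality is left at its original levels in newdata; its value is irrelevant once s(Locality) is excluded):
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# s(Locality) is set to zero, so this predicts at the fixed-effects level
pred <- predict(gam_G1, newdata = nn, type = "response",
                exclude = "s(Locality)", se.fit = TRUE)
nn$fit <- pred$fit
nn$se <- pred$se.fit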
In your case, it is tedious to type out all the "s(wl):LocalityLevelX" labels for each smooth. As you have only two taxa, it would be easier to use the complementary argument terms, where you list the smooth labels that you want to include when predicting. So you could do terms = c("s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC") or whatever summary() displays for these smooths.
You also need to include the Taxon term in terms, which I think needs to be:
terms = c("TaxonSpeciesB", TaxonSpeciesC",
"s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC")
If you install and load my {gratia} package, you can use smooths(gam_GI1) to list all the smooth labels as far as {mgcv} knows them.
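Putting that together, a sketch of the terms-based call for gam_GI1; the labels below are placeholders and should be replaced by whatever summary() or smooths(gam_GI1) actually report:
pred <- predict(gam_GI1, newdata = nn, type = "response", se.fit = TRUE,
                terms = c("TaxonSpeciesB", "TaxonSpeciesC",
                          "s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC"))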
The by trick works like this:
gam(y ~ x + s(z) + s(id, bs = "re", by = dummy))
where dummy is set to the numeric value 1 when fitting and to 0 when you are predicting. As this is a numeric by variable, you are multiplying the smooth by dummy, which is why setting it to 0 excludes the term. The reason your code isn't working is that you really want separate smooths of wl for each Locality; Locality is an actual variable of interest in your data/model, not a dummy variable we create to achieve the aim of excluding a term from the model.
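In code, the dummy trick looks roughly like this (y, x, z, id and dat are hypothetical names, not objects from your data):
dat$dummy <- 1                    # random effect switched on while fitting
m <- gam(y ~ x + s(z) + s(id, bs = "re", by = dummy), data = dat)
newd <- transform(dat, dummy = 0) # multiplying the smooth by 0 removes it
pred <- predict(m, newdata = newd, type = "response")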
Hopefully now you can see why exclude and terms are much better solutions than this dummy trick.
FYI, in bs = "tp", the "tp" doesn't mean tensor product smooth. It mean thin plate regression spline (TPRS). You only get tensor product smooths through te(), t2(), or ti() terms.

Simulate data for mixed-effects model with predefined parameter

I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
- a <- rnorm() measured at the subject level (e.g. nSubjects = 50)
- y is measured at the observation level (e.g. nObs = 7 for each subject)
- b <- rnorm() measured at the observation level and correlated at a given r with a
- the variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at, for example, 50/50 or 10/90 (and so on)
- some random noise is present (so that a full model does not explain all the variance)
- the effect size of the fixed effects can be set at a predefined level (e.g. dCohen = 0.5)
I played with various packages such as powerlmm, simstudy and simr, but still fail to find a working solution that will accommodate the number of parameters I'd like to define beforehand.
Also, for my learning purposes, I'd prefer a base R method to a package solution.
The closest example I found is a blog post by Ben Ogorek, "Hierarchical linear models and lmer", which looks great, but I can't figure out how to control the parameters listed above.
Any help would be appreciated.
Also, if there is a package that I don't know of that can do this type of simulation, please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors that are different lengths? I'm not sure: I'll sample 350 values (nObs*nSubject) and throw away most of the values for the subject-level effect.
Not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma), e.g. if sigma=2, theta=2, then the residual std dev is 2 and the among-subject std dev is 4
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a,b are 0 without loss of generality
sdvec <- c(a=1,b=1)
rho <- 0.5 ## correlation
betavec <- c(intercept=0,a=1,b=2)
beta_sc <- c(betavec[1], betavec[-1]*sdvec) ## scale slope values by sd; keep the intercept
theta <- 0.4 ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
                    mu=c(0,0),
                    Sigma=matrix(c(1,rho,rho,1),2,2)*outer(sdvec,sdvec))
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
Simulate:
dd$y <- simulate(~a+b+(1|Subject),
                 newdata=dd,
                 newparams=list(beta=beta_sc,theta=theta,sigma=1),
                 family=gaussian)[[1]]
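An optional sanity check is to refit the target model to the simulated data and compare the estimates with the parameters chosen above:
## sanity check (optional): fixed effects should be near beta_sc,
## the among-subject SD near theta*sigma = 0.4, and sigma near 1
m <- lmer(y ~ a + b + (1 | Subject), data = dd)
fixef(m)
VarCorr(m)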

R, AR(1) and FGLS Computations. Unable to create a lagged variable

Problem
Data
Hello everyone. I have been tasked with trying to solve this problem for the past two weeks with little to no help from my professor. I am not really asking for the exact code, but I can't start the problem...
We are given a set of data on the variables (M, R, Y) and are asked to run the regression shown in the image. My issue is that I can't seem to create a lagged variable. I have been told that 1) we don't need to use time series objects and 2) I should be using time series objects (based on my research). Furthermore, when we get to the part where we run the regression for t = 2 onwards to get p_hat, the only coefficient I can get is 1.0000, and the B1 we should be getting ends up being something like 3.14e-11, which is clearly wrong. Here is what I have currently for my variables, based on the data given. If anyone is able to guide me in the right direction, I would really appreciate it.
#Creates Time Series Objects Which Can Be Lagged using lag()
Mt2 <- ts(data=DATA$M, start=1,end=180,frequency=1)
Mt1 <- ts(data=DATA$M, start=1,end=180,frequency=1)
Rt2 <- ts(data=DATA$R, start=1,end=180,frequency=1)
Yt2 <- ts(data=DATA$Y, start=1,end=180,frequency=1)
#Dependent Variable starts at t=2 and ends at t=181
#Lag Variable starts at t=1 and ends at t=180
Model_A <- lm( Mt2 ~ lag(Mt2,1) + Rt2 + Yt2, data=DATA)
bgtest(Model_A) #Conclude there is Autocorrelation
e <- resid(Model_A)
et <- ts(e,start=2,end=180,frequency=1)
et2 <- ts(e,start=1, end=179, frequency=1)
Model_e <- lm(et ~ et2)
Using lag() inside a plain lm() formula does not work because lag() just shifts the time index but keeps the data vector unchanged. And as lm() (or more precisely model.frame()) ignores the time index, this leads to including the response variable as one of the regressors and hence a regression coefficient of 1. As an illustration:
lm(Nile ~ lag(Nile, -1))
## Call:
## lm(formula = Nile ~ lag(Nile, -1))
##
## Coefficients:
## (Intercept) lag(Nile, -1)
## -1.819e-13 1.000e+00
There are various solutions to this problem, including:
Using dedicated time series infrastructure such as ar() or arima(). The latter can also easily incorporate additional regressors via xreg:
arima(Nile, c(1, 0, 0))
## Call:
## arima(x = Nile, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.5063 919.5685
## s.e. 0.0867 29.1410
##
## sigma^2 estimated as 21125: log likelihood = -639.95, aic = 1285.9
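Purely to illustrate the xreg argument mentioned above, with a made-up covariate (x below is just random noise, not part of the Nile data):
set.seed(1)
x <- rnorm(length(Nile))
arima(Nile, order = c(1, 0, 0), xreg = x)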
Using convenience interfaces to lm() with additional model.frame
processing such as dyn or dynlm:
library("dynlm")
dynlm(Nile ~ L(Nile, 1))
## Time series regression with "ts" data:
## Start = 1872, End = 1970
##
## Call:
## dynlm(formula = Nile ~ L(Nile, 1))
##
## Coefficients:
## (Intercept) L(Nile, 1)
## 452.7668 0.5043
Doing the lag pre-processing "by hand" before calling lm():
nile <- ts.intersect(Nile = Nile, NileLag = lag(Nile, -1))
lm(Nile ~ NileLag, data = nile)
## Call:
## lm(formula = Nile ~ NileLag, data = nile)
##
## Coefficients:
## (Intercept) NileLag
## 452.7668 0.5043
Depending on whether you need a particular estimation method or returned model class for further computations, one or the other strategy might be more handy. In your particular case I would recommend the last solution and then go through the different steps in the exercise by using plain lm().
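Applied to your data, that last strategy could look roughly like this (assuming DATA has the columns M, R and Y used in your code):
d <- ts(DATA[, c("M", "R", "Y")], start = 1, frequency = 1)
dd <- ts.intersect(M = d[, "M"], Mlag = lag(d[, "M"], -1),
                   R = d[, "R"], Y = d[, "Y"])
Model_A <- lm(M ~ Mlag + R + Y, data = dd)
summary(Model_A)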
Use a data.frame and create your lagged variable "by hand" or using something like dplyr::lag. Like this:
X <- 1:10
Xlag <- c(NA, X[1:9])
The base R lag() function is not very useful for anything other than ts objects, and those aren't very useful except with the specific modelling functions that support them.
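For completeness, a dplyr::lag version of the same idea applied to a data frame (again assuming DATA has columns M, R and Y); lm() drops the NA row created by the lag automatically:
library(dplyr)
DATA2 <- DATA
DATA2$Mlag <- dplyr::lag(DATA2$M)
Model_A <- lm(M ~ Mlag + R + Y, data = DATA2)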

Using a survival tree from the 'rpart' package in R to predict new observations

I'm attempting to use the "rpart" package in R to build a survival tree, and I'm hoping to use this tree to then make predictions for other observations.
I know there have been a lot of SO questions involving rpart and prediction; however, I have not been able to find any that address a problem that (I think) is specific to using rpart with a "Surv" object.
My particular problem involves interpreting the results of the "predict" function. An example is helpful:
library(rpart)
library(OIsurv)
# Make Data:
set.seed(4)
dat = data.frame(X1 = sample(x = c(1,2,3,4,5), size = 1000, replace=T))
dat$t = rexp(1000, rate=dat$X1)
dat$t = dat$t / max(dat$t)
dat$e = rbinom(n = 1000, size = 1, prob = 1-dat$t )
# Survival Fit:
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
plot(sfit)
# Tree Fit:
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
So far so good. My understanding of what's going on here is that rpart is attempting to fit exponential survival curves to subsets of my data. Based on this understanding, I believe that when I call predict(tfit), I get, for each observation, a number corresponding to the parameter of the exponential curve for that observation. So, for example, if predict(tfit)[1] is 0.46, then this means that for the first observation in my original dataset, the curve is given by the equation P(t) = exp(−λt), where λ = 0.46.
This seems like exactly what I'd want. For each observation (or any new observation), I can get the predicted probability that this observation will be alive/dead for a given time point. (EDIT: I'm realizing this is probably a misconception— these curves don't give the probability of alive/dead, but the probability of surviving an interval. This doesn't change the problem described below, though.)
However, when I try and use the exponential formula...
# Predict:
# an attempt to use the rates extracted from the tree to
# capture the survival curve formula in each tree node.
rates = unique(predict(tfit))
for (rate in rates) {
  grid = seq(0, 1, length.out = 100)
  lines(x = grid, y = exp(-rate*(grid)), col = 2)
}
What I've done here is split the dataset in the same way the survival tree did, then used survfit to plot a non-parametric curve for each of these partitions. That's the black lines. I've also drawn lines corresponding to the result of plugging in (what I thought was) the 'rate' parameter into (what I thought was) the survival exponential formula.
I understand that the non-parametric and the parametric fit shouldn't necessarily be identical, but this seems more than that: it seems like I need to scale my X variable or something.
Basically, I don't seem to understand the formula that rpart/survival is using under the hood. Can anyone help me get from (1) rpart model to (2) a survival equation for any arbitrary observation?
The survival data are scaled internally exponentially so that the predicted rate in the root node is always fixed to 1.000. The predictions reported by the predict() method are then always relative to the survival in the root node, i.e., higher or lower by a certain factor. See Section 8.4 in vignette("longintro", package = "rpart") for more details. In any case, the Kaplan-Meier curves you reported correspond exactly to what is also reported in the rpart vignette.
If you want to obtain directly the plots of the Kaplan-Meier curves in the tree and get predicted median survival times, you can coerce the rpart tree to a constparty tree as provided by the partykit package:
library("partykit")
(tfit2 <- as.party(tfit))
## Model formula:
## Surv(t, event = e) ~ X1
##
## Fitted party:
## [1] root
## | [2] X1 < 2.5
## | | [3] X1 < 1.5: 0.192 (n = 213)
## | | [4] X1 >= 1.5: 0.082 (n = 213)
## | [5] X1 >= 2.5: 0.037 (n = 574)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
##
plot(tfit2)
The print output shows the median survival time and the visualization shows the corresponding Kaplan-Meier curve. Both can also be obtained with the predict() method by setting the type argument to "response" and "prob", respectively.
predict(tfit2, type = "response")[1]
## 5
## 0.03671885
predict(tfit2, type = "prob")[[1]]
## Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 574.0000 574.0000 574.0000 542.0000 0.0367 0.0323 0.0408
As an alternative to the rpart survival trees you might also consider the non-parametric survival trees based on conditional inference in ctree() (using logrank scores) or fully parametric survival trees using the general mob() infrastructure from the partykit package.
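For example, a minimal sketch of the ctree() alternative on the simulated data from the question, with default settings (partykit is already loaded above):
cfit <- ctree(Surv(t, event = e) ~ X1, data = dat)
plot(cfit)
predict(cfit, type = "response")[1]   ## predicted median survival time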
Achim Zeileis's answer is very helpful, but it seems that jwdink's exact question was not answered. I understood it as: "If the rpart tree splits by best exponential survival fit, what are the lambdas for these fits in absolute terms, so we can use these exponential survival functions to make predictions?" The rpart summary does show the estimated rate, but only in relative terms, assuming that the entire population has a rate of 1. To overcome this, one can fit an exponential survreg, take the reference lambda from there and then multiply the rpart predicted rates by that number (see code below).
That said, this is not how survival rates are predicted out of an rpart tree. I did not find a survival prediction function directly in rpart; however, as Achim pointed out above, partykit uses Kaplan-Meier estimates, i.e. the non-parametric survival of the observations ending up in the respective terminal leaf. I think it is the same in survival random forests, where K-M curves are used in the terminal leaves.
The simulated data in this question use an exponential distribution, so the K-M and exponential survival curves will be similar by design; however, for a different simulated or real-life distribution, the exponential rates estimated by the rpart tree and the K-M curves in the terminal leaves (of the same tree) will give different survival estimates.
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)
# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
table(dat$node)
s0 = survreg(Surv(t,e)~ 1, data = dat, dist = "exponential") #-0.6175
e0 = exp(-summary(s0)$coefficients[1]); e0 #1.854
rates = unique(predict(tfit))
#1) plot K-M curves by node (black):
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )
#2) plot exponential survival with rates = e0 * RPart rates (red):
for (rate in rates) {
  grid = seq(0, 1, length.out = 100)
  lines(x = grid, y = exp(-e0*rate*(grid)), col = 2)
}
#3) plot partykit survival curves based on RPart tree (green)
library(partykit)
tfit2 <- as.party(tfit)
col_n = 1
for (node in names(table(dat$node))){
  predict_curve = predict(tfit2, newdata = dat[dat$node == node, ], type = "prob")
  surv_estimated = approxfun(predict_curve[[1]]$time, predict_curve[[1]]$surv)
  lines(x = grid, y = surv_estimated(grid), col = 2 + col_n)
  col_n = col_n + 1
}

glm model dataset summarisation

first post, so go easy.
In the insurance world of GLMing, the classic approach is to model claims frequency and average severity. With that in mind, I built a couple of models to experiment for myself and now have a question.
Could somebody please explain how GLM handles varying levels of summarisation of a dataset, particularly with regard to error estimates?
Consider the example below. The data exhibits strong severity trends for both variables:
- A has more expensive claims than B
- Ford > Kia > Vaux > Jag
I fitted a model to an unsummarised and a summarised version of the dataset, and accordingly GLM fitted the same parameters in both cases.
However, GLM indicates a well-fitted model for the unsummarised data. But when I summarise and use a weighted mean, i.e. average severity, the model fits poorly. Maybe this is as you would expect; after all, the unsummarised data has more points to model with. Also, it appears the weighted mean is used to indicate RELATIVE strength, so here specifying the weighted mean is pointless, since the weights are all the same.
But more fundamentally, can I not model average severity with GLM? I mean, I know the result of fitting a GLM to an unsummarised dataset will be an average severity, but I was hoping to fit a model to already summarised data. It appears that modelling on aggregated datasets will not give a true indication of the model fit.
Apologies if this is a stupid question; I'm not a statistician, so I don't fully understand the Hessian matrix.
Please see code below:
library(boot)
library(plyr) # provides ddply(), used below
dataset <- data.frame(
  Person = rep(c("A", "B"), each = 200),
  Car = rep(c("Ford", "Kia", "Vaux", "Jag"), 2, each = 50),
  Amount = c(rgamma(50, 200), rgamma(50, 180), rgamma(50, 160), rgamma(50, 140),
             rgamma(50, 100), rgamma(50, 80), rgamma(50, 60), rgamma(50, 40))
)
Agg1 <- ddply(dataset, .(Person, Car), summarise, mean=mean(Amount), length=length(Amount))
m1 <- glm(Amount ~ Person + Car, data = dataset, family = Gamma(link="log"))
m2 <- glm(mean ~ Person + Car, data = Agg1, family = Gamma(link="log"), weights=length)
summary(m1)
summary(m2)
Thanks,
Nick
Bottom line is that both models are identical - the reason the aggregated model "fits poorly" is entirely due to the reduction in degrees of freedom due to aggregation.
Before getting into why the models are identical, I should point out that this does not necessarily mean that either model is a good fit. You should run diagnostics on both, especially using:
par(mfrow=c(2,2))
plot(m1)
When you do this, you'll see that the residuals are normally distributed (which is essential), but that they follow a pattern (-, +, -), which is disturbing. I would want to understand that before declaring this a good model. [Admittedly, this is made-up data, but the principles apply nevertheless.]
Comparing the aggregated model to the base model, look at the values of the coefficients.
coef.m1 <- summary(m1)$coefficients
coef.m2 <- summary(m2)$coefficients
cbind(coef.m1[,1],coef.m2[,1])
# [,1] [,2]
# (Intercept) 5.4096980 5.4096976
# PersonB -0.9249371 -0.9249366
# CarJag -0.6144606 -0.6144602
# CarKia -0.1786556 -0.1786555
# CarVaux -0.3597925 -0.3597923
The reason you think the aggregated model is "worse" is the p-values, but these depend on t = coeff/se. The ratio of the standard errors in m2 vs. m1 is the same for all coefficients:
coef.m2[,2]/coef.m1[,2]
# (Intercept) PersonB CarJag CarKia CarVaux
# 7.836171 7.836171 7.836171 7.836171 7.836171
Since
se ≈ sd / sqrt(df),
the ratio of the standard errors for the two models should be approximately
se_m2 / se_m1 = sqrt( (n_m1 - 1) / (n_m2 - 1) )
sqrt((nrow(dataset)-1)/(nrow(Agg1)-1))
# [1] 7.549834
Frankly I'm puzzled why the ratio is not exactly equal to 7.55.
Put another way, glm(...) has no way of knowing that you aggregated your data. It thinks you are trying to fit a model with 4 parameters and an intercept to 8 data points.
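A quick check of that last point, using the models fitted above: the coefficients agree to within numerical precision, while the residual degrees of freedom differ drastically, and it is the latter that drives the standard errors and p-values.
all.equal(coef(m1), coef(m2), tolerance = 1e-5)
m1$df.residual  # 400 observations - 5 coefficients = 395
m2$df.residual  # 8 aggregated rows - 5 coefficients = 3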
