how to add regression lines for each factor on a plot - r

I've created a model and I'm trying to add curves that fit the two parts of the data, insulation and no insulation. I was thinking about using the insulation coefficient as a true/false term, but I'm not sure how to translate that into code. Entries 1:56 are "w/o" and 57:101 are "w/". I'm not sure how to include the data I'm using but here's the head and tail:
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
1 8 2003 476 21 a 33.32 69 -8 22.66667 1 w/o
2 9 2003 1052 30 e 112.33 73 -1 35.05172 2 w/o
3 10 2003 981 28 a 24.98 60 -6 35.05172 3 w/o
4 11 2003 1094 32 a 73.51 53 2 34.18750 4 w/o
5 12 2003 1409 32 a 93.23 44 6 44.03125 5 w/o
6 1 2004 1083 32 a 72.84 34 3 33.84375 6 w/o
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
96 7 2011 551 29 e 55.56 72 0 19.00000 96 w/
97 8 2011 552 27 a 61.17 78 1 20.44444 97 w/
98 9 2011 666 34 e 73.87 71 -2 19.58824 98 w/
99 10 2011 416 27 a 48.03 64 0 15.40741 99 w/
100 11 2011 653 31 e 72.80 53 1 21.06452 100 w/
101 12 2011 751 33 a 83.94 45 2 22.75758 101 w/
bill$id <- seq(1:101)
bill$insulation <- as.factor(ifelse(bill$id > 56, c("w/"), c("w/o")))
m1 <- lm(kWhd.1 ~ avgT + insulation + I(avgT^2), data=bill)
with(bill, plot(kWhd.1 ~ avgT, xlab="Average Temperature (F)",
ylab="Daily Energy Use (kWh/d)", col=insulation))
no_ins <- data.frame(bill$avgT[1:56], bill$insulation[1:56])
curve(predict(m1, no_ins=x), add=TRUE, col="red")
ins <- data.frame(bill$avgT[57:101], bill$insulation[57:101])
curve(predict(m1, ins=x), add=TRUE, lty=2)
legend("topright", inset=0.01, pch=21, col=c("red", "black"),
legend=c("No Insulation", "Insulation"))

ggplot2 makes this a lot easier than base plotting. Something like this should work:
ggplot(bill, aes(x = avgT, y = kWhd.1, color = insulation)) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) +
geom_point()
In base, I'd create a data frame with point you want to predict on, something like
pred_data = expand.grid(
kWhd.1 = seq(min(bill$kWhd.1), max(bill$kWhd.1), length.out = 100),
insulation = c("w/", "w/o")
)
pred_data$prediction = predict(m1, newdata = pred_data)
And then use lines to add the predictions to your plot. My base graphics is pretty rusty, so I'll leave that to you (or another answerer) if you want it.

In base R it's important to order the x-values. Since this is to be done on multiple factors, we can do this with by, resulting in a list L.
Since your example data is not complete, here's an example with iris where we consider Species as the "factor".
L <- by(iris, iris$Species, function(x) x[order(x$Petal.Length), ])
Now we can do the plot and add loess predictions as lines with a sapply.
with(iris, plot(Sepal.Width ~ Petal.Length, col=Species))
sapply(seq(L), function(x)
lines(L[[x]]$Petal.Length,
predict(loess(Sepal.Width ~ Petal.Length, L[[x]], span=1.1)), # span=1.1 for smoothing
col=x))
Yields

Related

Creating and plotting confidence intervals

I have fitted a gaussian GLM model to my data, i now wish to create 95% CIs and fit them to my data. Im having a couple of issues with this when plotting as i cant get them to capture my data, they just seem to plot the same line as the model without captuing the data points. Also Im also unsure that I've created my CIs the correct way here for the mean. I entered my data and code below if anyone knows how to fix this
data used
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
my code used to create the model and intervals before plotting
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
outcome i currently get with no data points, the model is correct here but the CI isnt as i have no data points, so the CIs are made incorrectly i think somewhere
Edit: Response to OP's providing full data set.
This started out as a question about plotting data and models on the same graph, but has morphed considerably. You seem you have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots it seems clear that poisson glm is just not a good model. To say it differently, the number of cases may vary with date, but is also influenced by other things not in your model (external regressors).
Plotting just your data suggests strongly that you have at least two and perhaps more regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one actually). The code below uses the segmented package to build successive poisson glm using 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred:=
predict(
segmented(glm(cases~date, .SD, family = poisson), seg.Z = ~date, npsi=1),
type='response', se.fit=TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the count of breakpoints, but not where they are - the algorithm figures that out for you. So here, we see a regime prior to 3Q87 which is well modeled using poission glm, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
fit <- glm(cases~date, p.DT, family=poisson)
seg.fit <- segmented(fit, seg.Z = ~date, npsi=p.n)
predict(seg.fit, type='response', se.fit=TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se'):=get.pred(x, .SD)][, npsi:=x] } ))
ggplot(gg.dt, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))+
facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the poisson glm the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
stat_smooth(method = glm, method.args=list(family=poisson))+
geom_point()

Add category mean value to faceted scatter plots in ggplot

I am using facet wrap to plot Weight Gain versus Caloric Intake for four different diets. Diet is a four-level factor, Weight Gain and Caloric Intake are numeric. I am adding a regression line to each plot facet. What I want to do is add a horizontal line for the group mean weight gain for each diet in the plot (4 different mean values). The problem is when I use the geom_hline function it puts the global mean on all of the plots, which is not what I want.
I tried using stat_summary(fun.y=mean,geom="line"), but it gives me line segments joining each of the points in every plot.
Below is the code I am using that is giving me the single global mean on all plots. Also the data set I am using. I've included the labeller code for completeness but I really just need help with drawing the group mean lines.
Thanks in advance for any help.
# Calculate slopes and means to use for facet labels
#
wgSlope<-rep(NA,nlevels(vitaminData$Diet))
dietMeans<-rep(NA,nlevels(vitaminData$Diet))
for (i in 1:nlevels(vitaminData$Diet)){
dietMeans[i]<-mean(filter(vitaminData,Diet==i)$WeightGain)
#
# Get regression lines and coefficients for each facet
#
lm<-lm(WeightGain~CaloricIntake,data=filter(vitaminData,Diet==i))
wgSlope[i]<-lm$coefficients[2]
}
#
# Build facet labels
#
dietLabel<-c(`1`=
paste("Diet 1, Slope=",round(wgSlope[1],2),", Mean=",round(dietMeans[1],1)),
`2`=paste("Diet 2, Slope=",round(wgSlope[2],2),", Mean=",round(dietMeans[2],1)),
`3`=paste("Diet 3, Slope =",round(wgSlope[3],2),", Mean=",round(dietMeans[3],1)),
`4`=paste("Diet 4, Slope =",round(wgSlope[4],2),", Mean=",round(dietMeans[4],1)))
#
# Draw the plots
#
ggplot(data=vitaminData,
aes(y=WeightGain,x=CaloricIntake,color=Diet))+
theme_bw()+
geom_point(aes(color=Diet,fill=Diet,shape=Diet))+
geom_smooth(method="lm",se=FALSE,linetype=2,alpha=0.5)+
labs(x="Caloric Intake",y="Weight Gain")+
scale_color_manual(values=c("red","blue","orange","darkgreen"))+
geom_hline(yintercept=mean(vitaminData$WeightGain))+
facet_wrap(~Diet,labeller=labeller(Diet=dietLabel))+
theme(legend.position="none")
Diet WeightGain CaloricIntake
<fct> <dbl> <dbl>
1 1 48 35
2 1 67 44
3 1 78 44
4 1 69 51
5 1 53 47
6 2 65 40
7 2 49 45
8 2 37 37
9 2 73 53
10 2 63 42
11 3 79 51
12 3 52 41
13 3 63 47
14 3 65 47
15 3 67 48
16 4 59 53
17 4 50 52
18 4 59 52
19 4 42 45
20 4 34 38
Here's an approach using dplyr. (Add library(dplyr) or library(tidyverse) if not already loaded.)
geom_hline(data = vitaminData %>%
group_by(Diet) %>%
summarize(mean = mean(WeightGain)),
aes(yintercept = mean)) +

fit a normal distribution to grouped data, giving expected frequencies

I have a frequency distribution of observations, grouped into counts within class intervals.
I want to fit a normal (or other continuous) distribution, and find the expected frequencies in each interval according to that distribution.
For example, suppose the following, where I want to calculate another column, expected giving the
expected number of soldiers with chest circumferences in the interval given by chest, where these
are assumed to be centered on the nominal value. E.g., 35 = 34.5 <= y < 35.5. One analysis I've seen gives the expected frequency in this cell as 72.5 vs. the observed 81.
> data(ChestSizes, package="HistData")
>
> ChestSizes
chest count
1 33 3
2 34 18
3 35 81
4 36 185
5 37 420
6 38 749
7 39 1073
8 40 1079
9 41 934
10 42 658
11 43 370
12 44 92
13 45 50
14 46 21
15 47 4
16 48 1
>
> # ungroup to a vector of values
> chests <- vcdExtra::expand.dft(ChestSizes, freq="count")
There are quite a number of variations of this question, most of which relate to plotting the normal density on top of a histogram, scaled to represent counts not density. But none explicitly show the calculation of the expected frequencies. One close question is R: add normal fits to grouped histograms in ggplot2
I can perfectly well do the standard plot (below), but for other things, like a Chi-square test or a vcd::rootogram plot, I need the expected frequencies in the same class intervals.
> bw <- 1
n_obs <- nrow(chests)
xbar <- mean(chests$chest)
std <- sd(chests$chest)
plt <-
ggplot(chests, aes(chest)) +
geom_histogram(color="black", fill="lightblue", binwidth = bw) +
stat_function(fun = function(x)
dnorm(x, mean = xbar, sd = std) * bw * n_obs,
color = "darkred", size = 1)
plt
here is how you could calculate the expected frequencies for each group assuming Normality.
xbar <- with(ChestSizes, weighted.mean(chest, count))
sdx <- with(ChestSizes, sd(rep(chest, count)))
transform(ChestSizes, Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))
chest count Expected
1 33 3 4.7600583
2 34 18 20.8822328
3 35 81 72.5129162
4 36 185 199.3338028
5 37 420 433.8292832
6 38 749 747.5926687
7 39 1073 1020.1058521
8 40 1079 1102.2356155
9 41 934 943.0970605
10 42 658 638.9745241
11 43 370 342.7971793
12 44 92 145.6089948
13 45 50 48.9662992
14 46 21 13.0351612
15 47 4 2.7465640
16 48 1 0.4579888

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform a two-sample t-tests to compare the differences of a series of numeric variables between each level of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg")
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
fit <- t.test(frml, two.sided = TRUE, data = df) # perform the t-test
# hint: use str(fit) to figure out how to pull out each value you care about
fit_df <- data.frame(
dv = col,
f_mean = xxx,
m_mean = xxx,
t = xxx,
df = xxx,
p = xxx
)
fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).

How to give a function a specific column of a list using a for-loop but prevent that output is named according to the iterator command

Given the following example:
library(metafor)
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg, data = dat.bcg, append = TRUE)
dat
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
trial author year tpos tneg cpos cneg ablat alloc yi vi
1 1 Aronson 1948 4 119 11 128 44 random -0.8893 0.3256
2 2 Ferguson & Simes 1949 6 300 29 274 55 random -1.5854 0.1946
3 3 Rosenthal et al 1960 3 228 11 209 42 random -1.3481 0.4154
4 4 Hart & Sutherland 1977 62 13536 248 12619 52 random -1.4416 0.0200
5 5 Frimodt-Moller et al 1973 33 5036 47 5761 13 alternate -0.2175 0.0512
6 6 Stein & Aronson 1953 NA NA NA NA 44 alternate NA NA
7 7 Vandiviere et al 1973 8 2537 10 619 19 random -1.6209 0.2230
8 8 TPT Madras 1980 505 87886 499 87892 NA random 0.0120 0.0040
9 9 Coetzee & Berjak 1968 29 7470 45 7232 27 random -0.4694 0.0564
10 10 Rosenthal et al 1961 17 1699 65 1600 42 systematic -1.3713 0.0730
11 11 Comstock et al 1974 186 50448 141 27197 18 systematic -0.3394 0.0124
12 12 Comstock & Webster 1969 5 2493 3 2338 33 systematic 0.4459 0.5325
13 13 Comstock et al 1976 27 16886 29 17825 33 systematic -0.0173 0.0714
Now what i basically want is to iterate with the rma() command (only for mods argument) from - let's say - [7:8] and to store this result in a variable equal to the columnname.
Two problems:
1) When i enter the command:
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
The modname is named as dat[[8]]. But I want the modname to be the columname (i.e. colnames(dat[i]))
Model Results:
estimate se tval pval ci.lb ci.ub
intrcpt 0.5543 1.4045 0.3947 0.7312 -5.4888 6.5975
dat[[8]] -0.0312 0.0435 -0.7172 0.5477 -0.2185 0.1560
2) Now imagine that I have a lot of columns more and I want to iterate from [8:53], such that each result gets stored in a variable named equal to the columnname.
Problem 2) has been solved:
for(i in 7:8){
assign(paste(colnames(dat[i]), i, sep=""), rma(yi, vi, data = dat, mods = ~dat[[i]], subset = (alloc=="systematic"), knha = TRUE))}
To answers 1st part of your question, you can change the names by accessing the attributes of the model object.
In this case
# inspect the attributes
attr(model$vb, which = "dimnames")
# assign the name
attr(model$vb, which = "dimnames")[[1]][2] <- paste(colnames(dat)[8])

Resources