How to get predicted values from geom_smooth graph in R? [duplicate]

Is there a way to extract the values of the fitted line returned from stat_smooth?
The code I am using looks like this:
p <- ggplot(df1, aes(x=Days, y=Qty, group=Category, color=Category))
p <- p + stat_smooth(method=glm, fullrange=TRUE) + geom_point()
This new R user would greatly appreciate any guidance.

Riffing off of @James's example:
p <- qplot(hp,wt,data=mtcars) + stat_smooth()
You can use the intermediate stages of the ggplot building process to pull out the plotted data. The result of ggplot_build() is a list, one component of which is data: a list of data frames containing the computed values to be plotted. In this case the list holds two data frames, since the original qplot() creates one for the points and stat_smooth() creates one for the smoothed curve.
> ggplot_build(p)$data[[2]]
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
x y ymin ymax se PANEL group
1 52.00000 1.993594 1.149150 2.838038 0.4111133 1 1
2 55.58228 2.039986 1.303264 2.776709 0.3586695 1 1
3 59.16456 2.087067 1.443076 2.731058 0.3135236 1 1
4 62.74684 2.134889 1.567662 2.702115 0.2761514 1 1
5 66.32911 2.183533 1.677017 2.690049 0.2465948 1 1
6 69.91139 2.232867 1.771739 2.693995 0.2244980 1 1
7 73.49367 2.282897 1.853241 2.712552 0.2091756 1 1
8 77.07595 2.333626 1.923599 2.743652 0.1996193 1 1
9 80.65823 2.385059 1.985378 2.784740 0.1945828 1 1
10 84.24051 2.437200 2.041282 2.833117 0.1927505 1 1
11 87.82278 2.490053 2.093808 2.886297 0.1929096 1 1
12 91.40506 2.543622 2.145018 2.942225 0.1940582 1 1
13 94.98734 2.597911 2.196466 2.999355 0.1954412 1 1
14 98.56962 2.652852 2.249260 3.056444 0.1964867 1 1
15 102.15190 2.708104 2.303465 3.112744 0.1969967 1 1
16 105.73418 2.764156 2.357927 3.170385 0.1977705 1 1
17 109.31646 2.821771 2.414230 3.229311 0.1984091 1 1
18 112.89873 2.888224 2.478136 3.298312 0.1996493 1 1
19 116.48101 2.968745 2.531045 3.406444 0.2130917 1 1
20 120.06329 3.049545 2.552102 3.546987 0.2421773 1 1
21 123.64557 3.115893 2.573577 3.658208 0.2640235 1 1
22 127.22785 3.156368 2.601664 3.711072 0.2700548 1 1
23 130.81013 3.175495 2.625951 3.725039 0.2675429 1 1
24 134.39241 3.181411 2.645191 3.717631 0.2610560 1 1
25 137.97468 3.182252 2.658993 3.705511 0.2547460 1 1
26 141.55696 3.186155 2.670350 3.701961 0.2511175 1 1
27 145.13924 3.201258 2.687208 3.715308 0.2502626 1 1
28 148.72152 3.235698 2.721744 3.749652 0.2502159 1 1
29 152.30380 3.291766 2.782767 3.800765 0.2478037 1 1
30 155.88608 3.353259 2.857911 3.848607 0.2411575 1 1
31 159.46835 3.418409 2.938257 3.898561 0.2337596 1 1
32 163.05063 3.487074 3.017321 3.956828 0.2286972 1 1
33 166.63291 3.559111 3.092367 4.025855 0.2272319 1 1
34 170.21519 3.634377 3.165426 4.103328 0.2283065 1 1
35 173.79747 3.712729 3.242093 4.183364 0.2291263 1 1
36 177.37975 3.813399 3.347232 4.279565 0.2269509 1 1
37 180.96203 3.910849 3.447572 4.374127 0.2255441 1 1
38 184.54430 3.977051 3.517784 4.436318 0.2235917 1 1
39 188.12658 4.037302 3.583959 4.490645 0.2207076 1 1
40 191.70886 4.091635 3.645111 4.538160 0.2173882 1 1
41 195.29114 4.140082 3.700184 4.579981 0.2141624 1 1
42 198.87342 4.182676 3.748159 4.617192 0.2115424 1 1
43 202.45570 4.219447 3.788162 4.650732 0.2099688 1 1
44 206.03797 4.250429 3.819579 4.681280 0.2097573 1 1
45 209.62025 4.275654 3.842137 4.709171 0.2110556 1 1
46 213.20253 4.295154 3.855951 4.734357 0.2138238 1 1
47 216.78481 4.308961 3.861497 4.756425 0.2178456 1 1
48 220.36709 4.317108 3.859541 4.774675 0.2227644 1 1
49 223.94937 4.319626 3.851025 4.788227 0.2281358 1 1
50 227.53165 4.316548 3.836964 4.796132 0.2334829 1 1
51 231.11392 4.308435 3.818728 4.798143 0.2384117 1 1
52 234.69620 4.302276 3.802201 4.802351 0.2434590 1 1
53 238.27848 4.297902 3.787395 4.808409 0.2485379 1 1
54 241.86076 4.292303 3.772103 4.812503 0.2532567 1 1
55 245.44304 4.282505 3.754087 4.810923 0.2572576 1 1
56 249.02532 4.269040 3.733184 4.804896 0.2608786 1 1
57 252.60759 4.253361 3.710042 4.796680 0.2645121 1 1
58 256.18987 4.235474 3.684476 4.786473 0.2682509 1 1
59 259.77215 4.215385 3.656265 4.774504 0.2722044 1 1
60 263.35443 4.193098 3.625161 4.761036 0.2764974 1 1
61 266.93671 4.168621 3.590884 4.746357 0.2812681 1 1
62 270.51899 4.141957 3.553134 4.730781 0.2866658 1 1
63 274.10127 4.113114 3.511593 4.714635 0.2928472 1 1
64 277.68354 4.082096 3.465939 4.698253 0.2999729 1 1
65 281.26582 4.048910 3.415849 4.681971 0.3082025 1 1
66 284.84810 4.013560 3.361010 4.666109 0.3176905 1 1
67 288.43038 3.976052 3.301132 4.650972 0.3285813 1 1
68 292.01266 3.936392 3.235952 4.636833 0.3410058 1 1
69 295.59494 3.894586 3.165240 4.623932 0.3550782 1 1
70 299.17722 3.850639 3.088806 4.612473 0.3708948 1 1
71 302.75949 3.804557 3.006494 4.602619 0.3885326 1 1
72 306.34177 3.756345 2.918191 4.594499 0.4080510 1 1
73 309.92405 3.706009 2.823813 4.588205 0.4294926 1 1
74 313.50633 3.653554 2.723308 4.583801 0.4528856 1 1
75 317.08861 3.598987 2.616650 4.581325 0.4782460 1 1
76 320.67089 3.542313 2.503829 4.580796 0.5055805 1 1
77 324.25316 3.483536 2.384853 4.582220 0.5348886 1 1
78 327.83544 3.422664 2.259739 4.585589 0.5661643 1 1
79 331.41772 3.359701 2.128512 4.590891 0.5993985 1 1
80 335.00000 3.294654 1.991200 4.598107 0.6345798 1 1
Knowing a priori where the one you want is in the list isn't easy, but if nothing else you can look at the column names.
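A minimal sketch of that inspection (reusing the plot object p from above): look at the column names of each layer's computed data to spot the smoother.
built <- ggplot_build(p)
# column names of each layer's computed data frame
lapply(built$data, names)
# here the smoother is the second layer
smooth_df <- built$data[[2]]
head(smooth_df[, c("x", "y", "ymin", "ymax", "se")])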
It is still better to do the smoothing outside the ggplot call, though.
EDIT:
It turns out replicating what ggplot2 does to make the loess is not as straightforward as I thought, but this will work. I copied it out of some internal functions in ggplot2.
# Fit the same loess model that stat_smooth() uses by default for this data
model <- loess(wt ~ hp, data = mtcars)
# Evaluate it on 80 evenly spaced points across the range of hp
xrange <- range(mtcars$hp)
xseq <- seq(from = xrange[1], to = xrange[2], length.out = 80)
pred <- predict(model, newdata = data.frame(hp = xseq), se = TRUE)
# Reproduce the columns stat_smooth() computes: the fit plus a 95% t-based interval
y <- pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin <- y - ci
ymax <- y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth(aes_auto(loess.DF), data=loess.DF, stat="identity")
That gives a plot that looks identical to
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth()
(which is the expanded form of the original p).
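Note that aes_auto() has since been deprecated in ggplot2; if it is not available in your version, a sketch of the same layer with the aesthetics spelled out explicitly:
ggplot(mtcars, aes(x = hp, y = wt)) +
  geom_point() +
  geom_smooth(data = loess.DF,
              aes(x = x, y = y, ymin = ymin, ymax = ymax),
              stat = "identity")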

stat_smooth does produce output that you can use elsewhere, and in a slightly hacky way you can put it into a variable in the global environment.
You enclose the output variable in .. on either side to use it. So if you add an aes() in the stat_smooth call and use the global assignment operator, <<-, to assign the output to a variable in the global environment, you can get the fitted values, or the other computed columns - see below.
qplot(hp,wt,data=mtcars) + stat_smooth(aes(outfit=fit<<-..y..))
fit
[1] 1.993594 2.039986 2.087067 2.134889 2.183533 2.232867 2.282897 2.333626
[9] 2.385059 2.437200 2.490053 2.543622 2.597911 2.652852 2.708104 2.764156
[17] 2.821771 2.888224 2.968745 3.049545 3.115893 3.156368 3.175495 3.181411
[25] 3.182252 3.186155 3.201258 3.235698 3.291766 3.353259 3.418409 3.487074
[33] 3.559111 3.634377 3.712729 3.813399 3.910849 3.977051 4.037302 4.091635
[41] 4.140082 4.182676 4.219447 4.250429 4.275654 4.295154 4.308961 4.317108
[49] 4.319626 4.316548 4.308435 4.302276 4.297902 4.292303 4.282505 4.269040
[57] 4.253361 4.235474 4.215385 4.193098 4.168621 4.141957 4.113114 4.082096
[65] 4.048910 4.013560 3.976052 3.936392 3.894586 3.850639 3.804557 3.756345
[73] 3.706009 3.653554 3.598987 3.542313 3.483536 3.422664 3.359701 3.294654
The outputs you can obtain are:
y, the predicted value
ymin, the lower pointwise confidence interval around the mean
ymax, the upper pointwise confidence interval around the mean
se, the standard error
Note that by default it predicts on 80 data points, which may not be aligned with your original data.
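If you need the fitted values at your own x positions, one sketch is to pull the 80-point curve out with ggplot_build() (as in the answer above) and interpolate it onto the original hp values with approx():
smooth_df <- ggplot_build(qplot(hp, wt, data = mtcars) + stat_smooth())$data[[2]]
# linear interpolation of the fitted curve at the observed hp values
fit_at_data <- approx(smooth_df$x, smooth_df$y, xout = mtcars$hp)$y
head(data.frame(hp = mtcars$hp, wt = mtcars$wt, fit = fit_at_data))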

A more general approach is to simply use the predict() function to get predictions over any range of values of interest.
# define the model
model <- loess(wt ~ hp, data = mtcars)
# predict fitted values for each observation in the original dataset
modelFit <- data.frame(predict(model, se = TRUE))
# define data frame for ggplot
df <- data.frame(hp = mtcars$hp,
                 wt = mtcars$wt,
                 fit = modelFit$fit,
                 upperBound = modelFit$fit + 2 * modelFit$se.fit,
                 lowerBound = modelFit$fit - 2 * modelFit$se.fit)
# build the plot using the fitted values from the predict() function
# geom_linerange() and the second geom_point() in the code are built using the values from the predict() function
# for comparison ggplot's geom_smooth() is also shown
g <- ggplot(df, aes(hp, wt))
g <- g + geom_point()
g <- g + geom_linerange(aes(ymin = lowerBound, ymax = upperBound))
g <- g + geom_point(aes(hp, fit), size = 1)
g <- g + geom_smooth(method = "loess")
g
# Predict any range of values and include the standard error in the output
predict(model, newdata = 100:300, se = TRUE)
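A sketch of carrying that further (the names hp_grid and pred_df are ad hoc): evaluate the model on a grid, collect the fit and a rough 2-SE band into a data frame, and draw it with geom_ribbon() and geom_line().
hp_grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 200)
pr <- predict(model, newdata = data.frame(hp = hp_grid), se = TRUE)
pred_df <- data.frame(hp = hp_grid,
                      fit = pr$fit,
                      lowerBound = pr$fit - 2 * pr$se.fit,
                      upperBound = pr$fit + 2 * pr$se.fit)
ggplot(mtcars, aes(hp, wt)) +
  geom_point() +
  geom_ribbon(data = pred_df, aes(x = hp, ymin = lowerBound, ymax = upperBound),
              inherit.aes = FALSE, alpha = 0.2) +
  geom_line(data = pred_df, aes(hp, fit), colour = "blue")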

If you want to bring in the power of the tidyverse, you can use the broom package to add the predicted values from the loess fit to your original dataset. This builds on @phillyooo's solution.
library(tidyverse)
library(broom)
# original graph with smoother
ggplot(data=mtcars, aes(hp,wt)) +
stat_smooth(method = "loess", span = 0.75)
# Create model that will do the same thing as under the hood in ggplot2
model <- loess(wt ~ hp, data = mtcars, span = 0.75)
# Add predicted values from model to original dataset using broom library
mtcars2 <- augment(model, mtcars)
# Plot both lines
ggplot(data=mtcars2, aes(hp,wt)) +
geom_line(aes(hp, .fitted), color = "red") +
stat_smooth(method = "loess", span = 0.75)

Save the graph object and use ggplot_build() or layer_data() to obtain the elements/estimates for the layers. e.g.
pp<-ggplot(mtcars, aes(x=hp, y=wt)) + geom_point() + geom_smooth();
ggplot_build(pp)
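For example, a minimal sketch: layer_data() takes the saved plot and a layer index, so the smoother (the second layer here) can be pulled out directly.
smooth_df <- layer_data(pp, 2)   # x, y, ymin, ymax, se for the geom_smooth layer
head(smooth_df)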

Related

geom_hline with multiple points and facet_wrap

I am trying to plot horizontal lines at specific points of my data. The idea is that I would like, for each of my axes (SA, VLA, HLA), a horizontal line at the y value of the first point, i.e. the value at 0 equivalent iterations. My question will become clearer with the data.
iterations subsets equivalent_iterations axis ratio1 ratio2
0 0 0 SA 0.023569024 0.019690577
0 0 0 SA 0.023255814 0.019830028
0 0 0 VLA 0.025362319 0.020348837
0 0 0 HLA 0.022116904 0.021472393
2 2 4 SA 0.029411765 0.024911032
2 2 4 SA 0.024604569 0.022838499
2 2 4 VLA 0.026070764 0.022727273
2 2 4 HLA 0.027833002 0.027888446
4 15 60 SA 0.019746121 0.014403292
4 15 60 SA 0.018691589 0.015538291
4 15 60 VLA 0.021538462 0.01686747
4 15 60 HLA 0.017052375 0.017326733
16 5 80 SA 0.019021739 0.015021459
16 5 80 SA 0.020527859 0.015384615
16 5 80 VLA 0.023217247 0.017283951
16 5 80 HLA 0.017391304 0.016298021
And this is my plot using ggplot:
ggplot(df)+
aes(x = equivalent_iterations, y = ratio1, color = equivalent_iterations)+
geom_point() +
facet_wrap(~axis) +
expand_limits(x = 0, y = 0)
What I want is, for each axis SA, VLA, HLA (i.e. each facet), a horizontal line through the first point (which is at 0 equivalent iterations), at the y-intercept given by ratio1 in the first four rows. Any help will be greatly appreciated. Thank you in advance.
You can treat it like any other geom_*. Just create a new column with the value of ratio1 at which you want to plot the horizontal line. I do this by subsetting the data to the rows where iterations == 0 (note SA has two of these) and joining the ratio1 column onto the original data frame. This column can then be passed to the aesthetics call in geom_hline().
library(tidyverse)
df %>%
left_join(df %>%
filter(iterations == 0) %>%
select(axis, intercept = ratio1)) %>%
ggplot(aes(x = equivalent_iterations, y = ratio1,
color = equivalent_iterations)) +
geom_point() +
geom_hline(aes(yintercept = intercept)) +
facet_wrap(~axis) +
expand_limits(x = 0, y = 0)
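An alternative sketch in the same spirit: compute one intercept per axis in a small helper data frame (first() picks a single value where SA has two zero-iteration rows) and hand it to geom_hline() through its own data argument.
intercepts <- df %>%
  filter(iterations == 0) %>%
  group_by(axis) %>%
  summarise(intercept = first(ratio1))

ggplot(df, aes(x = equivalent_iterations, y = ratio1,
               color = equivalent_iterations)) +
  geom_point() +
  geom_hline(data = intercepts, aes(yintercept = intercept)) +
  facet_wrap(~axis) +
  expand_limits(x = 0, y = 0)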

geom_bar removed 3 rows with missing values

I'm trying to create a histogram using ggplot2 in R.
This is the code I'm using:
library(tidyverse)
dat_male$explicit_truncated <- trunc(dat_male$explicit_mean)
means2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), mean, na.rm=TRUE)
colnames(means2) <- c("explicit", "id", "IAT_D")
sd2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), sd, na.rm=TRUE)
length2 <- aggregate(dat_male$IAT_D, by=list(dat_male$explicit_truncated,dat_male$id), length)
se2 <- sd2$x / sqrt(length2$x)
means2$lo <- means2$IAT_D - 1.6*se2
means2$hi <- means2$IAT_D + 1.6*se2
ggplot(data = means2, aes(x = factor(explicit), y = IAT_D, fill = factor(id))) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_errorbar(aes(ymin=lo,ymax=hi, width=.2), position=position_dodge(0.9), data=means2) +
xlab("Explicit attitude score") +
ylab("D-score")
For some reason I get the following warning message:
Removed 3 rows containing missing values (geom_bar).
And I get the following histogram:
I really have no clue what is going on.
Please let me know if you need to see anything else of my code, I'm never really sure what to include.
dat_male is a dataset that looks like this (I have only included the variables that I mentioned in this question, as the dataset contains 68 variables):
id explicit_mean IAT_D explicit_truncated
5 1 3.1250 0.366158652 3
6 1 3.3125 0.373590066 3
9 1 3.6250 0.208096230 3
11 1 3.1250 0.661983618 3
15 1 2.3125 0.348246184 2
19 1 3.7500 0.562406383 3
28 1 2.5625 -0.292888526 2
35 1 4.3750 0.560039531 4
36 1 3.8125 -0.117455439 3
37 1 3.1250 0.074375196 3
46 1 2.5625 0.488265849 2
47 1 4.2500 -0.131005579 4
53 1 2.0625 0.193040876 2
55 1 2.6875 0.875420303 2
62 1 3.8750 0.579146056 3
63 1 3.3125 0.666095380 3
66 1 2.8125 0.115607820 2
68 1 4.3750 0.259929946 4
80 1 3.0000 0.502709149 3
means2 is a dataset I have used to calculate means, and that looks like this:
explicit id IAT_D lo hi
1 0 0 NaN NaN NaN
2 2 0 0.23501191 0.1091807 0.3608431
3 3 0 0.31478389 0.2311406 0.3984272
4 4 0 -0.24296625 -0.3241166 -0.1618159
5 1 1 -0.04010111 NA NA
6 2 1 0.21939286 0.1109138 0.3278719
7 3 1 0.29097806 0.1973051 0.3846511
8 4 1 0.22965463 0.1209229 0.3383864
Now that I see it in front of me, it probably has something to do with the NaNs?
From your dataset it seems like everything is alright.
The warnings you get are an indication that your data frame has missing values (i.e. NaN and NA).
I actually got two warning messages:
Warning messages:
1: Removed 1 rows containing missing values
(geom_bar).
2: Removed 2 rows containing missing values
(geom_errorbar).
Regarding the plot: because IAT_D is NaN for the explicit == 0 row, that bar is dropped and doesn't appear in the graph. Similarly, because lo and hi are NA for the explicit == 1 row, you don't get the corresponding error bar.
Dataset:
means2 <- read.table(text = " explicit id IAT_D lo hi
1 0 0 NaN NaN NaN
2 2 0 0.23501191 0.1091807 0.3608431
3 3 0 0.31478389 0.2311406 0.3984272
4 4 0 -0.24296625 -0.3241166 -0.1618159
5 1 1 -0.04010111 NA NA
6 2 1 0.21939286 0.1109138 0.3278719
7 3 1 0.29097806 0.1973051 0.3846511
8 4 1 0.22965463 0.1209229 0.3383864",
header = TRUE)
plot:
means2 %>%
ggplot(aes(x = factor(explicit), y = IAT_D, fill = factor(id))) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_errorbar(aes(ymin=lo,ymax=hi, width=.2),
position=position_dodge(0.9)) +
xlab("Explicit attitude score") +
ylab("D-score")

Remove inflections at end of lines from geom_line()

I am trying to plot the predictions of a lmer model with the following code:
p1 <- ggplot(Mac_Data_Tracking, aes(x = Rspan, y = SubjEff, colour = NsCond)) +
geom_point(size=3) +
geom_line(data=newdat, aes(y=predict(SubjEff.model,newdata=newdat)),lineend="round")
print(p1)
I get weird inflections at the end of each line; is there a way to remove them? I have changed the data in newdat, but the lines always have these inflections.
Lines with Inflections at ends:
Note that you have geom_line(data=newdat, aes(y=predict(SubjEff.model,newdata=newdat))). So you've fed newdat to geom_line as the data frame to use for plotting, but then for your y value you provide a separate vector of predictions (based on newdat), when y should actually be just a column of newdat. I'm not sure exactly why that causes the inflections at the ends (probably two different y values end up being supplied for each of the endpoint x values), but that's probably the source of your problem.
Instead, you should create a column in newdat with the predictions (if you haven't already) and feed that column name to ggplot as the y in geom_line. To add a column of predictions, do the following:
newdat$pred = predict(SubjEff.model,newdata=newdat)
You should also give geom_line the x values that correspond to the y values in newdat. So your code would be:
geom_line(data=newdat, aes(y=pred, x=Rspan), lineend="round")
(Where Rspan will (automatically) be the Rspan column in newdat.)
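Putting those two suggestions together, a consolidated sketch (object and column names taken from the question; pred is the column added above):
newdat$pred <- predict(SubjEff.model, newdata = newdat)
ggplot(Mac_Data_Tracking, aes(x = Rspan, y = SubjEff, colour = NsCond)) +
  geom_point(size = 3) +
  geom_line(data = newdat, aes(x = Rspan, y = pred), lineend = "round")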
It was a problem with having duplicate x values; actually, it was having two Subject values.
The linear mixed model is:
Mixed.model <- lmer(Outcome ~ NsCond + Rspan + (1|Subject), data=Data)
For newdat, I was initially using:
newdat <- expand.grid(Subject=c(min(Data$Subject),max(Data$Subject)),Rspan=c(min(Data$Rspan), max(Data$Rspan)),NsCond=unique(Data$NsCond))
Which gave me:
Subject Rspan NsCond
1 1 0.2916667 Pink
2 18 0.2916667 Pink
3 1 1.0000000 Pink
4 18 1.0000000 Pink
5 1 0.2916667 Babble
6 18 0.2916667 Babble
7 1 1.0000000 Babble
8 18 1.0000000 Babble
9 1 0.2916667 Loss
10 18 0.2916667 Loss
11 1 1.0000000 Loss
12 18 1.0000000 Loss
For each Rspan (x) there are 2 "Subjects" (1 and 18).
I changed newdat to:
newdat <- expand.grid(Subject=1,Rspan=c(min(Data$Rspan), max(Data$Rspan)),NsCond=unique(Data$NsCond))
Which results in:
Subject Rspan NsCond
1 1 0.2916667 Pink
2 1 1.0000000 Pink
3 1 0.2916667 Babble
4 1 1.0000000 Babble
5 1 0.2916667 Loss
6 1 1.0000000 Loss
Now it looks good

Plotting Logistic Regression in R, but I keep getting errors

I'm trying to plot a logistic regression in R, for a continuous independent variable and a dichotomous dependent variable. I have very limited experience with R, but my professor has asked me to add this graph to a paper I'm writing, and he said R would probably be the best way to create it. Anyway, I'm sure there are tons of mistakes here, but this is the sort of thing previously suggested on Stack Overflow:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=FALSE)
curve(predict(ggg, data.frame(V1=x), type="response"), add=TRUE)
where vvv is the name of my csv file (31 obs. of 2 variables), V1 is the continuous variable, and V2 is the dichotomous one. Also, ggg (List of 30?) is the following:
ggg<- glm(formula = vvv$V2 ~ vvv$V1, family = "binomial", data = vvv)
The ggplot function produces a graph of my data points, but no logistic regression curve. The curve function results in the following error:
"Error in curve(predict(ggg, data.frame(V1 = x), type = "resp"), add = TRUE) : 'expr' did not evaluate to an object of length 'n'
In addition: Warning message:'newdata' had 101 rows but variables found have 31 rows"
I'm not sure what the problem is, and I'm having trouble finding resources for this specific issue. Can anybody help? It would be greatly appreciated :)
Edit: Thanks to anyone who responded! My data, vvv, is the following, where the percent was the initial probability for presence/absence of a species in a specific area, and the 1 and 0 indicate whether or not the species ended up being observed:
V1 V2
1 95.00% 1
2 95.00% 0
3 95.00% 1
4 92.00% 1
5 92.00% 1
6 92.00% 1
7 92.00% 1
8 92.00% 1
9 92.00% 1
10 92.00% 1
11 85.00% 1
12 85.00% 1
13 85.00% 1
14 85.00% 1
15 85.00% 1
16 80.00% 1
17 80.00% 0
18 77.00% 1
19 77.00% 1
20 75.00% 0
21 70.00% 1
22 70.00% 0
23 70.00% 0
24 70.00% 1
25 70.00% 0
26 69.00% 1
27 65.00% 0
28 60.00% 1
29 50.00% 1
30 35.00% 0
31 25.00% 0
As @MrFlick commented, V1 is probably a factor. So first you have to convert it to numeric. This substitutes "%" with nothing and divides by 100, so you end up with proportions of class numeric:
vvv$V1<-as.numeric(sub("%","",vvv$V1))/100
Doing this, you can use your own code and you will have a plot for a logistic regression:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=F)
This should draw not only the points but also the logistic regression curve. I don't see the point of also using curve(); from what I understand of your question, this is enough for what you need.
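Two side notes, offered as sketches rather than as part of the original answer. Newer ggplot2 releases want the family passed through method.args, and the curve() error in the question comes from fitting the model with vvv$V2 ~ vvv$V1, which makes predict() ignore newdata; refitting with plain column names fixes that.
# newer ggplot2: pass the family through method.args
ggplot(vvv, aes(x = V1, y = V2)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)

# if the base-R curve() overlay is still wanted, refit using column names
# so that predict() honours newdata
ggg <- glm(V2 ~ V1, family = "binomial", data = vvv)
plot(vvv$V1, vvv$V2)
curve(predict(ggg, data.frame(V1 = x), type = "response"), add = TRUE)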

How to plot relative frequencies in R or Stata

I have this dataset :
> head(xc)
wheeze3 SmokingGroup_Kai TG2000 TG2012 PA_Score asthma3 tres3 age3 bmi bmi3
1 0 1 2 2 2 0 0 47 20.861 21.88708
2 0 5 2 3 3 0 0 57 20.449 23.05175
3 0 1 2 3 2 0 0 45 25.728 26.06168
4 0 2 1 1 3 0 0 48 22.039 23.50780
5 1 4 2 2 1 0 1 61 25.391 25.63692
6 0 4 2 2 2 0 0 54 21.633 23.66144
education3 group_change
1 2 0
2 2 3
3 3 3
4 3 0
5 1 0
6 2 0
Here
asthma3 is a variable that takes values 0, 1;
group_change takes values 0, 1, 2, 3, 4, 5, 6;
age3 represents the age.
I would like to plot the percentage of people with asthma3 == 1 as a function of the variable age3.
I would like 6 lines on the same plot, obtained by dividing the sample by group_change.
I think that this should be possible using ggplot2.
Here's a ggplot2 approach:
library(ggplot2)
library(dplyr)
# Create fake data
set.seed(10)
xc=data.frame(age3=sample(40:50, 500, replace=TRUE),
asthma3=sample(0:1,500, replace=TRUE),
group_change=sample(0:6, 500, replace=TRUE))
# Summarize asthma percent by group_change and age3 (using dplyr)
xc1 = xc %>%
group_by(group_change, age3) %>%
summarize(asthma.pct = mean(asthma3) * 100)
# Plot using ggplot2
ggplot(xc1, aes(x=age3, y=asthma.pct, colour=as.factor(group_change))) +
geom_line() +
geom_point() +
scale_x_continuous(breaks=40:50) +
xlab("Age") + ylab("Asthma Percent") +
scale_colour_discrete(name="Group Change")
Here's another ggplot2 approach that works directly with the original data frame and calculates the percentages on the fly. I've also formatted the y-axis in percent format.
library(scales) # Need this for "percent_format()"
ggplot(xc, aes(x=age3, y=asthma3, colour=as.factor(group_change))) +
stat_summary(fun.y=mean, geom='line') +
stat_summary(fun.y=mean, geom='point') +
scale_x_continuous(breaks=40:50) +
scale_y_continuous(labels=percent_format()) +
xlab("Age") + ylab("Asthma Percent") +
scale_colour_discrete(name="Group Change")
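In current ggplot2 (3.3.0 and later) fun.y is deprecated in favour of fun, so the same plot would be written as (a sketch):
library(scales)
ggplot(xc, aes(x = age3, y = asthma3, colour = as.factor(group_change))) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  scale_x_continuous(breaks = 40:50) +
  scale_y_continuous(labels = percent_format()) +
  xlab("Age") + ylab("Asthma Percent") +
  scale_colour_discrete(name = "Group Change")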
Here is one way using Stata. The example data has three groups.
The proportions are computed by taking the mean of asthma3, which you identify as a binary variable.
clear all
set more off
*----- example data -----
set obs 500
set seed 135
gen age3 = floor((50-40+1)*runiform() + 40)
gen asthma3 = round(runiform())
egen group_change = seq(), to(3)
*----- pretty list -----
order age3 group_change
sort age3 group_change asthma3
list, sepby(age3)
*----- compute proportions -----
collapse (mean) asthma3, by(age3 group_change)
list
*----- syntax for graph and graph -----
levelsof(group_change), local(gc)
local i = 1
foreach g of local gc {
local call `call' || connected asthma3 age3 if group_change == `g', sort
local leg `leg' label(`i++' "Group`g'") // syntax for legend
}
twoway `call' legend(`leg') /// graph
title("Proportion with asthma by group")
This coincides with one of my first questions in Statalist. In Nick's words, you "build up the syntax" using a local macro and then feed that to twoway.
@NickCox, in a comment, suggests an alternative:
<snip>
*----- compute proportions -----
collapse (mean) asthma3, by(age3 group_change)
list
*----- graph -----
separate asthma3, by(group_change) veryshortlabel
twoway connected asthma31-asthma33 age3, sort ///
title("Proportion with asthma by group")
<snip>
This second alternative creates new variables from the original asthma3 which I abbreviate in the call to twoway connected as asthma31-asthma33.
Both alternatives produce a legend identifying groups. Labels I leave to you (see help graph).
