How to plot relative frequencies in R or Stata - r

I have this dataset :
> head(xc)
wheeze3 SmokingGroup_Kai TG2000 TG2012 PA_Score asthma3 tres3 age3 bmi bmi3
1 0 1 2 2 2 0 0 47 20.861 21.88708
2 0 5 2 3 3 0 0 57 20.449 23.05175
3 0 1 2 3 2 0 0 45 25.728 26.06168
4 0 2 1 1 3 0 0 48 22.039 23.50780
5 1 4 2 2 1 0 1 61 25.391 25.63692
6 0 4 2 2 2 0 0 54 21.633 23.66144
education3 group_change
1 2 0
2 2 3
3 3 3
4 3 0
5 1 0
6 2 0
Here
asthma3 is a variable that takes values 0,1 ;
group_change takes values 0,1,2,3,4,5,6 ;
age3 represents the age.
I would like to plot the percentage of people with asthma3==1 as a function of the variable age3.
I would like 6 lines on the same plot obtained dividing the samples by group_change.
I think that this should be possible using ggplot2.

Here's a ggplot2 approach:
library(ggplot2)
library(dplyr)
# Create fake data
set.seed(10)
xc=data.frame(age3=sample(40:50, 500, replace=TRUE),
asthma3=sample(0:1,500, replace=TRUE),
group_change=sample(0:6, 500, replace=TRUE))
# Summarize asthma percent by group_change and age3 (using dplyr)
xc1 = xc %.%
group_by(group_change, age3) %.%
summarize(asthma.pct=mean(asthma3)*100)
# Plot using ggplot2
ggplot(xc1, aes(x=age3, y=asthma.pct, colour=as.factor(group_change))) +
geom_line() +
geom_point() +
scale_x_continuous(breaks=40:50) +
xlab("Age") + ylab("Asthma Percent") +
scale_colour_discrete(name="Group Change")
Here's another ggplot2 approach that works directly with the original data frame and calculates the percentages on the fly. I've also formatted the y-axis in percent format.
library(scales) # Need this for "percent_format()"
ggplot(xc, aes(x=age3, y=asthma3, colour=as.factor(group_change))) +
stat_summary(fun.y=mean, geom='line') +
stat_summary(fun.y=mean, geom='point') +
scale_x_continuous(breaks=40:50) +
scale_y_continuous(labels=percent_format()) +
xlab("Age") + ylab("Asthma Percent") +
scale_colour_discrete(name="Group Change")

Here is one way using Stata. The example data has three groups.
The proportions are computed taking the mean of asthma3 which you identify as a binary variable.
clear all
set more off
*----- example data -----
set obs 500
set seed 135
gen age3 = floor((50-40+1)*runiform() + 40)
gen asthma3 = round(runiform())
egen group_change = seq(), to(3)
*----- pretty list -----
order age3 group_change
sort age3 group_change asthma3
list, sepby(age3)
*----- compute proportions -----
collapse (mean) asthma3, by(age3 group_change)
list
*----- syntax for graph and graph -----
levelsof(group_change), local(gc)
local i = 1
foreach g of local gc {
local call `call' || connected asthma3 age3 if group_change == `g', sort
local leg `leg' label(`i++' "Group`g'") // syntax for legend
}
twoway `call' legend(`leg') /// graph
title("Proportion with asthma by group")
This coincides with one of my first questions in Statalist. In Nick's words, you "build up the syntax" using a local macro and then feed that to twoway.
#NickCox, in a comment, suggests an alternative:
<snip>
*----- compute proportions -----
collapse (mean) asthma3, by(age3 group_change)
list
*----- graph -----
separate asthma3, by(group_change) veryshortlabel
twoway connected asthma31-asthma33 age3, sort ///
title("Proportion with asthma by group")
<snip>
This second alternative creates new variables from the original asthma3 which I abbreviate in the call to twoway connected as asthma31-asthma33.
Both alternatives produce a legend identifying groups. Labels I leave to you (see help graph).

Related

How to get predicted values from geom_smooth graph in R? [duplicate]

Is there a way to extract the values of the fitted line returned from stat_smooth?
The code I am using looks like this:
p <- ggplot(df1, aes(x=Days, y= Qty,group=Category,color=Category))
p <- p + stat_smooth(method=glm, fullrange=TRUE)+ geom_point())
This new r user would greatly appreciate any guidance.
Riffing off of #James example
p <- qplot(hp,wt,data=mtcars) + stat_smooth()
You can use the intermediate stages of the ggplot building process to pull out the plotted data. The results of ggplot_build is a list, one component of which is data which is a list of dataframes which contain the computed values to be plotted. In this case, the list is two dataframes since the original qplot creates one for points and the stat_smooth creates a smoothed one.
> ggplot_build(p)$data[[2]]
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
x y ymin ymax se PANEL group
1 52.00000 1.993594 1.149150 2.838038 0.4111133 1 1
2 55.58228 2.039986 1.303264 2.776709 0.3586695 1 1
3 59.16456 2.087067 1.443076 2.731058 0.3135236 1 1
4 62.74684 2.134889 1.567662 2.702115 0.2761514 1 1
5 66.32911 2.183533 1.677017 2.690049 0.2465948 1 1
6 69.91139 2.232867 1.771739 2.693995 0.2244980 1 1
7 73.49367 2.282897 1.853241 2.712552 0.2091756 1 1
8 77.07595 2.333626 1.923599 2.743652 0.1996193 1 1
9 80.65823 2.385059 1.985378 2.784740 0.1945828 1 1
10 84.24051 2.437200 2.041282 2.833117 0.1927505 1 1
11 87.82278 2.490053 2.093808 2.886297 0.1929096 1 1
12 91.40506 2.543622 2.145018 2.942225 0.1940582 1 1
13 94.98734 2.597911 2.196466 2.999355 0.1954412 1 1
14 98.56962 2.652852 2.249260 3.056444 0.1964867 1 1
15 102.15190 2.708104 2.303465 3.112744 0.1969967 1 1
16 105.73418 2.764156 2.357927 3.170385 0.1977705 1 1
17 109.31646 2.821771 2.414230 3.229311 0.1984091 1 1
18 112.89873 2.888224 2.478136 3.298312 0.1996493 1 1
19 116.48101 2.968745 2.531045 3.406444 0.2130917 1 1
20 120.06329 3.049545 2.552102 3.546987 0.2421773 1 1
21 123.64557 3.115893 2.573577 3.658208 0.2640235 1 1
22 127.22785 3.156368 2.601664 3.711072 0.2700548 1 1
23 130.81013 3.175495 2.625951 3.725039 0.2675429 1 1
24 134.39241 3.181411 2.645191 3.717631 0.2610560 1 1
25 137.97468 3.182252 2.658993 3.705511 0.2547460 1 1
26 141.55696 3.186155 2.670350 3.701961 0.2511175 1 1
27 145.13924 3.201258 2.687208 3.715308 0.2502626 1 1
28 148.72152 3.235698 2.721744 3.749652 0.2502159 1 1
29 152.30380 3.291766 2.782767 3.800765 0.2478037 1 1
30 155.88608 3.353259 2.857911 3.848607 0.2411575 1 1
31 159.46835 3.418409 2.938257 3.898561 0.2337596 1 1
32 163.05063 3.487074 3.017321 3.956828 0.2286972 1 1
33 166.63291 3.559111 3.092367 4.025855 0.2272319 1 1
34 170.21519 3.634377 3.165426 4.103328 0.2283065 1 1
35 173.79747 3.712729 3.242093 4.183364 0.2291263 1 1
36 177.37975 3.813399 3.347232 4.279565 0.2269509 1 1
37 180.96203 3.910849 3.447572 4.374127 0.2255441 1 1
38 184.54430 3.977051 3.517784 4.436318 0.2235917 1 1
39 188.12658 4.037302 3.583959 4.490645 0.2207076 1 1
40 191.70886 4.091635 3.645111 4.538160 0.2173882 1 1
41 195.29114 4.140082 3.700184 4.579981 0.2141624 1 1
42 198.87342 4.182676 3.748159 4.617192 0.2115424 1 1
43 202.45570 4.219447 3.788162 4.650732 0.2099688 1 1
44 206.03797 4.250429 3.819579 4.681280 0.2097573 1 1
45 209.62025 4.275654 3.842137 4.709171 0.2110556 1 1
46 213.20253 4.295154 3.855951 4.734357 0.2138238 1 1
47 216.78481 4.308961 3.861497 4.756425 0.2178456 1 1
48 220.36709 4.317108 3.859541 4.774675 0.2227644 1 1
49 223.94937 4.319626 3.851025 4.788227 0.2281358 1 1
50 227.53165 4.316548 3.836964 4.796132 0.2334829 1 1
51 231.11392 4.308435 3.818728 4.798143 0.2384117 1 1
52 234.69620 4.302276 3.802201 4.802351 0.2434590 1 1
53 238.27848 4.297902 3.787395 4.808409 0.2485379 1 1
54 241.86076 4.292303 3.772103 4.812503 0.2532567 1 1
55 245.44304 4.282505 3.754087 4.810923 0.2572576 1 1
56 249.02532 4.269040 3.733184 4.804896 0.2608786 1 1
57 252.60759 4.253361 3.710042 4.796680 0.2645121 1 1
58 256.18987 4.235474 3.684476 4.786473 0.2682509 1 1
59 259.77215 4.215385 3.656265 4.774504 0.2722044 1 1
60 263.35443 4.193098 3.625161 4.761036 0.2764974 1 1
61 266.93671 4.168621 3.590884 4.746357 0.2812681 1 1
62 270.51899 4.141957 3.553134 4.730781 0.2866658 1 1
63 274.10127 4.113114 3.511593 4.714635 0.2928472 1 1
64 277.68354 4.082096 3.465939 4.698253 0.2999729 1 1
65 281.26582 4.048910 3.415849 4.681971 0.3082025 1 1
66 284.84810 4.013560 3.361010 4.666109 0.3176905 1 1
67 288.43038 3.976052 3.301132 4.650972 0.3285813 1 1
68 292.01266 3.936392 3.235952 4.636833 0.3410058 1 1
69 295.59494 3.894586 3.165240 4.623932 0.3550782 1 1
70 299.17722 3.850639 3.088806 4.612473 0.3708948 1 1
71 302.75949 3.804557 3.006494 4.602619 0.3885326 1 1
72 306.34177 3.756345 2.918191 4.594499 0.4080510 1 1
73 309.92405 3.706009 2.823813 4.588205 0.4294926 1 1
74 313.50633 3.653554 2.723308 4.583801 0.4528856 1 1
75 317.08861 3.598987 2.616650 4.581325 0.4782460 1 1
76 320.67089 3.542313 2.503829 4.580796 0.5055805 1 1
77 324.25316 3.483536 2.384853 4.582220 0.5348886 1 1
78 327.83544 3.422664 2.259739 4.585589 0.5661643 1 1
79 331.41772 3.359701 2.128512 4.590891 0.5993985 1 1
80 335.00000 3.294654 1.991200 4.598107 0.6345798 1 1
Knowing a priori where the one you want is in the list isn't easy, but if nothing else you can look at the column names.
It is still better to do the smoothing outside the ggplot call, though.
EDIT:
It turns out replicating what ggplot2 does to make the loess is not as straightforward as I thought, but this will work. I copied it out of some internal functions in ggplot2.
model <- loess(wt ~ hp, data=mtcars)
xrange <- range(mtcars$hp)
xseq <- seq(from=xrange[1], to=xrange[2], length=80)
pred <- predict(model, newdata = data.frame(hp = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth(aes_auto(loess.DF), data=loess.DF, stat="identity")
That gives a plot that looks identical to
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth()
(which is the expanded form of the original p).
stat_smooth does produce output that you can use elsewhere, and with a slightly hacky way, you can put it into a variable in the global environment.
You enclose the output variable in .. on either side to use it. So if you add an aes in the stat_smooth call and use the global assign, <<-, to assign the output to a varible in the global environment you can get the the fitted values, or others - see below.
qplot(hp,wt,data=mtcars) + stat_smooth(aes(outfit=fit<<-..y..))
fit
[1] 1.993594 2.039986 2.087067 2.134889 2.183533 2.232867 2.282897 2.333626
[9] 2.385059 2.437200 2.490053 2.543622 2.597911 2.652852 2.708104 2.764156
[17] 2.821771 2.888224 2.968745 3.049545 3.115893 3.156368 3.175495 3.181411
[25] 3.182252 3.186155 3.201258 3.235698 3.291766 3.353259 3.418409 3.487074
[33] 3.559111 3.634377 3.712729 3.813399 3.910849 3.977051 4.037302 4.091635
[41] 4.140082 4.182676 4.219447 4.250429 4.275654 4.295154 4.308961 4.317108
[49] 4.319626 4.316548 4.308435 4.302276 4.297902 4.292303 4.282505 4.269040
[57] 4.253361 4.235474 4.215385 4.193098 4.168621 4.141957 4.113114 4.082096
[65] 4.048910 4.013560 3.976052 3.936392 3.894586 3.850639 3.804557 3.756345
[73] 3.706009 3.653554 3.598987 3.542313 3.483536 3.422664 3.359701 3.294654
The outputs you can obtain are:
y, predicted value
ymin, lower pointwise confidence interval around
the mean
ymax, upper pointwise confidence interval around the mean
se, standard error
Note that by default it predicts on 80 data points, which may not be aligned with your original data.
A more general approach could be to simply use the predict() function to predict any range of values that are interesting.
# define the model
model <- loess(wt ~ hp, data = mtcars)
# predict fitted values for each observation in the original dataset
modelFit <- data.frame(predict(model, se = TRUE))
# define data frame for ggplot
df <- data.frame(cbind(hp = mtcars$hp
, wt = mtcars$wt
, fit = modelFit$fit
, upperBound = modelFit$fit + 2 * modelFit$se.fit
, lowerBound = modelFit$fit - 2 * modelFit$se.fit
))
# build the plot using the fitted values from the predict() function
# geom_linerange() and the second geom_point() in the code are built using the values from the predict() function
# for comparison ggplot's geom_smooth() is also shown
g <- ggplot(df, aes(hp, wt))
g <- g + geom_point()
g <- g + geom_linerange(aes(ymin = lowerBound, ymax = upperBound))
g <- g + geom_point(aes(hp, fit, size = 1))
g <- g + geom_smooth(method = "loess")
g
# Predict any range of values and include the standard error in the output
predict(model, newdata = 100:300, se = TRUE)
If you want to bring in the power of the tidyverse, you can use the "broom" library to add the predicted values from the loess function to your original dataset. This is building on #phillyooo's solution.
library(tidyverse)
library(broom)
# original graph with smoother
ggplot(data=mtcars, aes(hp,wt)) +
stat_smooth(method = "loess", span = 0.75)
# Create model that will do the same thing as under the hood in ggplot2
model <- loess(wt ~ hp, data = mtcars, span = 0.75)
# Add predicted values from model to original dataset using broom library
mtcars2 <- augment(model, mtcars)
# Plot both lines
ggplot(data=mtcars2, aes(hp,wt)) +
geom_line(aes(hp, .fitted), color = "red") +
stat_smooth(method = "loess", span = 0.75)
Save the graph object and use ggplot_build() or layer_data() to obtain the elements/estimates for the layers. e.g.
pp<-ggplot(mtcars, aes(x=hp, y=wt)) + geom_point() + geom_smooth();
ggplot_build(pp)

How to make a bar plot using ggplot that uses multiple columns for the x-axis?

I am trying to use multiple column names as the x-axis in a barplot. So each column name will be the "factor" and the data it contains is the count for that.
I have tried iterations of this:
ggplot(aes( x = names, y = count)) + geom_bar()
I tried concatenating the x values I want to show with aes(c(col1, col2))
but the aesthetics length does not match and won't work.
library(dplyr)
library(ggplot2)
head(dat)
Sample Week Response_1 Response_2 Response_3 Response_4 Vaccine_Type
1 1 1 300 0 2000 100 1
2 2 1 305 0 320 15 1
3 3 1 310 0 400 35 1
4 4 1 400 1 410 35 1
5 5 1 405 0 180 35 2
6 6 1 410 2 800 75 2
dat %>%
group_by(Week) %>%
ggplot(aes(c(Response_1, Response_2, Response_3, Response_4)) +
geom_boxplot() +
facet_grid(.~Week)
dat %>%
group_by(Week) %>%
ggplot(aes(Response_1, Response_2, Response_3, Response_4)) +
geom_boxplot() +
facet_grid(.~Week)
> Error: Aesthetics must be either length 1 or the same as the data
> (24): x
Both of these failed (kind of expected based on aes length error code), but hopefully you know the direction I was aiming for and can help out.
Goal is to have 4 separate groups, each with their own boxplot (1 for every response). And also have them faceted by week.
Using the simple code below got mostly what I want. Unfortunately I don't think it would be as easy to include the points and other characteristics to the plot like you can with ggplot.
boxplot(dat[,3:6], use.cols = TRUE)
And I could pretty easily just filter by the different weeks and use mfrow for faceting. Not as informative as ggplot, but gets the job done. If anyone else has other workarounds, I'd be interested in seeing.

How to account for categorical variables in calculating a risk score from regression model?

I have a data set which has a number of variables that I'd like to use to generate a risk score for getting a disease.
I have created a basic version of what I'm trying to do.
The dataset looks like this:
ID DISEASE_STATUS AGE SEX LOCATION
1 1 20 1 FRANCE
2 0 22 1 GERMANY
3 0 24 0 ITALY
4 1 20 1 GERMANY
5 1 20 0 ITALY
So the model I ran was:
glm(disease_status ~ age + sex + location, data=data, family=binomial(link='logit'))
The beta values produced by this model were as follows:
bage = −0.193
bsex = −0.0497
blocation= 1.344
To produce a risk score, I want to multiply the values for each individual by the beta values, eg:
risk score = (-0.193 * 20 (age)) + (-0.0497 * 1 (sex)) + (1.344 * ??? (location))
However, what value would I use to multiply the beta score for location by?
Thank you!

Plotting Logistic Regression in R, but I keep getting errors

I'm trying to plot a logistic regression in R, for a continuous independent variable and a dichotomous dependent variable. I have very limited experience with R, but my professor has asked me to add this graph to a paper I'm writing, and he said R would probably be the best way to create it. Anyway, I'm sure there are tons of mistakes here, but this is the sort of this previously suggested on StackOverflow:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=FALSE)
curve(predict(ggg, data.frame(V1=x), type="response"), add=TRUE)
where vvv is the name of my csv file (31 obs. of 2 variables), V1 is the continuous variable, and V2 is the dichotomous one. Also, ggg (List of 30?) is the following:
ggg<- glm(formula = vvv$V2 ~ vvv$V1, family = "binomial", data = vvv)
The ggplot function produces a graph of my data points, but no logistic regression curve. The curve function results in the following error:
"Error in curve(predict(ggg, data.frame(V1 = x), type = "resp"), add = TRUE) : 'expr' did not evaluate to an object of length 'n'
In addition: Warning message:'newdata' had 101 rows but variables found have 31 rows"
I'm not sure what the problem is, and I'm having trouble finding resources for this specific issue. Can anybody help? It would be greatly appreciated :)
Edit: Thanks to anyone who responded! My data, vvv, is the following, where the percent was the initial probability for presence/absence of a species in a specific area, and the 1 and 0 indicate whether or not a species ended up being observed.:
V1 V2
1 95.00% 1
2 95.00% 0
3 95.00% 1
4 92.00% 1
5 92.00% 1
6 92.00% 1
7 92.00% 1
8 92.00% 1
9 92.00% 1
10 92.00% 1
11 85.00% 1
12 85.00% 1
13 85.00% 1
14 85.00% 1
15 85.00% 1
16 80.00% 1
17 80.00% 0
18 77.00% 1
19 77.00% 1
20 75.00% 0
21 70.00% 1
22 70.00% 0
23 70.00% 0
24 70.00% 1
25 70.00% 0
26 69.00% 1
27 65.00% 0
28 60.00% 1
29 50.00% 1
30 35.00% 0
31 25.00% 0
As #MrFlick commented, V1 is probably a factor. So, first you have to change it to numeric class. This just substitutes "%" for nothing and divides by 100, so you will have proportions as numeric class:
vvv$V1<-as.numeric(sub("%","",vvv$V1))/100
Doing this, you can use your own code and you will have a plot for a logistic regression:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=F)
This should print not only the points, but also the logistic regression curve. I don't understand what is the point of using curves. From what I could understand from your question, this is enough for what you need.

R stacked percentage bar plot with percentage of binary factor and labels (with ggplot)

I want to produce a graphic that looks something like this:
My original data set looks something like this:
> bb[sample(nrow(bb), 20), ]
IMG QUANT FIX
25663 1 1 0
7936 2 2 0
23586 3 2 0
23017 2 2 1
31363 1 3 1
7886 2 2 0
23819 3 3 1
29838 2 2 1
8169 2 3 1
9870 2 3 0
31440 2 1 0
35564 3 1 0
24066 1 2 0
12020 3 2 0
6742 3 2 0
6189 2 3 0
26692 2 3 0
1387 3 2 0
31839 2 3 1
28637 3 2 0
So the idea is that the bars display where FIX = 1 per factor QUANT and per
factor IMG.
I've aggregated my data set into percentages using plyr
library(plyr)
bb.perc <- ddply(bb,.(QUANT,IMG),summarise,FIX.PROP = sum(FIX) / length(FIX))
It does almost the right thing:
QUANT IMG FIX.PROP
1 1 1 0.52439024
2 1 2 0.19085366
3 1 3 0.13658537
4 2 1 0.20414201
5 2 2 0.53964497
6 2 3 0.09585799
7 3 1 0.29000000
8 3 2 0.13000000
9 3 3 0.40705882
But now if I make a graph, it doesn't account for the FIX==0 cases, i.e. all bars have the same height, namely 100%, which isn't what I want. Note how the individual QUANT subframes don't add up to 100%:
> sum(bb.perc[1:3,]$FIX.PROP)
[1] 0.8518293
> sum(bb.perc[4:6,]$FIX.PROP)
[1] 0.839645
> sum(bb.perc[7:9,]$FIX.PROP)
[1] 0.8270588
The best I could do with R is to display counts:
# Take only the positive samples
bb.pos <- bb[bb$FIX == 1,]
# Plot the counts
ggplot(bb,aes(factor(QUANT),fill=factor(IMG))) + geom_bar() +
scale_y_continuous(labels=percent)
And results in:
This is also not what I want:
The percentage scale is way off. I need a way to pass the 100% point to the
percent function, but I have no idea how.
It lacks the labels.
There are a great deal of similar questions on SO already, but I seem to lack
the sufficient amount of intelligence (or understanding of R) to extrapolate
from them to a solution to my particular problem.
Thanks for any pointers!
EDIT: Sven Hohenstein provided an answer already, but here's how I ended up doing it myself as well:
> ggplot(bb.perc,aes(x=factor(QUANT),y=FIX.PROP,label=paste(round(FIX.PROP*100),
"%"),fill=factor(IMG)))+ geom_bar(stat="identity") + geom_text(position="stack",
aes(ymax=1),vjust=5) + scale_y_continuous(labels = percent)
Using the bb.perc that I defined further up using plyr. This one has the
advantage that the percentages are computed locally per column, and not
globally.
Thanks everyone for the help. The following two questions and their respective
answers helped me greatly in getting it right:
Stacked Bar Graph Labels with ggplot2
Adding labels to ggplot bar chart
What I did wrong initially, was pass the position = "fill" parameter to
geom_bar(), which for some reason made all the bars have the same height!
This is a way to generate the plot:
ggplot(bb[bb$FIX == 1, ],aes(x = factor(QUANT), fill = factor(IMG),
y = (..count..)/sum(..count..))) +
geom_bar() +
stat_bin(geom = "text",
aes(label = paste(round((..count..)/sum(..count..)*100), "%")),
vjust = 5) +
scale_y_continuous(labels = percent)
Change the value of the vjust parameter to adjust the vertical position of the labels.

Resources