ggplot2 missing data when plotting histogram with custom x axis limits

I am trying to plot six histograms (2 columns of data (Calories, Sodium) x 3 types (Beef, Meat, Poultry)) from these data, and I want them all to share the same x and y scales. I'm limiting the axes with scale limits (lims(), the shorthand for scale_x_continuous(limits = ...)), which according to various sources removes data that won't appear on the plot. Here is my code:
# src2_12.table is the data frame containing my data
library(ggplot2)
histogram <- function(df, dataset, n_bins, label) {
  ggplot(df, aes(x = .data[[dataset]])) +
    geom_histogram(color = "darkblue", fill = "lightblue", bins = n_bins) +
    xlab(label)
}
src2_12.beef <- src2_12.table[src2_12.table$Type == "Beef",]
src2_12.meat <- src2_12.table[src2_12.table$Type == "Meat",]
src2_12.poultry <- src2_12.table[src2_12.table$Type == "Poultry",]
src2_12.calories_scale <- lims(x = c(min(src2_12.table$Calories), max(src2_12.table$Calories)), y = c(0, 6))
src2_12.sodium_scale <- lims(x = c(min(src2_12.table$Sodium), max(src2_12.table$Sodium)), y = c(0, 6))
#src2_12.calories_scale <- lims()
#src2_12.sodium_scale <- lims()
src2_12.plots <- list(
histogram(src2_12.beef, "Calories", 10, "Calories-Beef") + src2_12.calories_scale,
histogram(src2_12.meat, "Calories", 10, "Calories-Meat") + src2_12.calories_scale,
histogram(src2_12.poultry, "Calories", 10, "Calories-Poultry") + src2_12.calories_scale,
histogram(src2_12.beef, "Sodium", 10, "Sodium-Beef") + src2_12.sodium_scale,
histogram(src2_12.meat, "Sodium", 10, "Sodium-Meat") + src2_12.sodium_scale,
histogram(src2_12.poultry, "Sodium", 10, "Sodium-Poultry") + src2_12.sodium_scale
)
multiplot(plotlist = src2_12.plots, cols = 2, layout = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE))
Here is the output:
vs. what the data are supposed to look like:
I can't understand why some data points are missing, given that the limits I set are already the min and max of the data.

You probably want to use coord_cartesian instead of lims. Unexpected things can happen when you set scale limits on histograms, because a fair number of transformations have to happen to get from your raw data to the actual histogram.
Let's peer under the hood for one example:
p <- ggplot(src2_12.beef,aes(x = Calories)) +
geom_histogram(bins = 10)
p1 <- ggplot(src2_12.beef,aes(x = Calories)) +
geom_histogram(bins = 10) +
lims(x = c(86,195))
a <- ggplot_build(p)
b <- ggplot_build(p1)
> a$data[[1]][, 1:5]
   y count        x     xmin     xmax
1  1     1 114.1111 109.7222 118.5000
2  0     0 122.8889 118.5000 127.2778
3  3     3 131.6667 127.2778 136.0556
4  2     2 140.4444 136.0556 144.8333
5  5     5 149.2222 144.8333 153.6111
6  2     2 158.0000 153.6111 162.3889
7  0     0 166.7778 162.3889 171.1667
8  2     2 175.5556 171.1667 179.9444
9  3     3 184.3333 179.9444 188.7222
10 2     2 193.1111 188.7222 197.5000
> b$data[[1]][, 1:5]
   y count         x      xmin      xmax
1  0     0        NA        NA  90.83333
2  0     0  96.88889  90.83333 102.94444
3  1     1 109.00000 102.94444 115.05556
4  0     0 121.11111 115.05556 127.16667
5  4     4 133.22222 127.16667 139.27778
6  4     4 145.33333 139.27778 151.38889
7  4     4 157.44444 151.38889 163.50000
8  1     1 169.55556 163.50000 175.61111
9  4     4 181.66667 175.61111 187.72222
10 2     2 193.77778 187.72222        NA
So now you're wondering how the heck that happened, right?
Well, when you tell ggplot that you want 10 bins and the x limits go from 86 to 195, the histogram algorithm tries to create ten bins that span that entire range rather than just the range of the data. That's why it creates bins below 100 even though there's no data there.
Further oddities can happen because the bars may extend past the limits (the xmin and xmax values), since the outermost bins generally reach a little below and a little above the ends of the range. In the second table, the bin edges that fall outside the limits have become NA, and a bar with a missing edge is not drawn, which is how the two observations in bin 10 disappear from the plot.
coord_cartesian adjusts the x limits after all this processing has happened, so it bypasses these quirks.
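As a minimal sketch of the coord_cartesian approach: the data frame `beef` below is a made-up stand-in for the beef subset (its values are random, not the asker's data), and `x_range` is assumed to be the common range across all three types.

```r
library(ggplot2)

# Hypothetical stand-in for the beef subset; the values are made up.
set.seed(1)
beef <- data.frame(Calories = round(runif(20, 110, 195)))

# Assumed common x range shared by all panels.
x_range <- c(86, 195)

p <- ggplot(beef, aes(x = Calories)) +
  geom_histogram(bins = 10, color = "darkblue", fill = "lightblue") +
  # Zoom the view without touching the data that go into the bins.
  coord_cartesian(xlim = x_range, ylim = c(0, 6))

# The binned counts still add up to the number of observations.
built <- ggplot_build(p)
sum(built$data[[1]]$count) == nrow(beef)
```

Note that each plot still chooses its own bins from its own data. If identical bins across panels matter, passing the same `breaks` vector to geom_histogram in every plot is one option.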

Related

Proportion of patients in different strata

I am trying to distribute 20 patients into 4 strata. The covariates are jointly distributed. Below is my code, along with the error. Can anyone help me debug this?
runif(20)
if (0<u<3/20) then {i1} =1
if (3/20<u<14/20) then {i2} =2
if (14/20<u<19/20) then {i3} =3
if (19/20<u<20/20) then {i4} =4
Error: unexpected '<' in "if (0<u<"

You can use the cut function to assign your strata:
breaks <- c(0, 3/20, 14/20, 19/20, 1)  # the upper bound is 20/20 = 1
labels <- 1:4
set.seed(7)
x <- runif(20)
stratum <- cut(x, breaks = breaks, labels = labels)
df <- data.frame(x, stratum)
head(df)
           x stratum
1 0.98890930       4
2 0.39774545       2
3 0.11569778       1
4 0.06974868       1
5 0.24374939       2
6 0.79201043       3
As AnilGoya mentioned, you have to clean up your comparison operators: R does not accept chained comparisons such as 0<u<3/20.
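For comparison, here is a sketch of how the original if-style logic could be written with valid comparison operators. Each bound is tested separately and the two tests are combined with `&`. The variable names `u`, `stratum` and `via_cut` are only illustrative.

```r
set.seed(7)
u <- runif(20)

# R has no chained comparison, so 0 < u < 3/20 becomes two tests joined by &.
stratum <- integer(length(u))
stratum[u > 0     & u <= 3/20]  <- 1L
stratum[u > 3/20  & u <= 14/20] <- 2L
stratum[u > 14/20 & u <= 19/20] <- 3L
stratum[u > 19/20 & u <= 1]     <- 4L

# Cross-check against cut(), whose intervals are also open on the left
# and closed on the right by default.
via_cut <- as.integer(cut(u, breaks = c(0, 3/20, 14/20, 19/20, 1), labels = 1:4))
identical(stratum, via_cut)

# Number of patients per stratum.
table(stratum)
```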

Plot lines with their standard deviation on different x axis

In the following example, I want to have zone on the y axis, then plot D1 with its standard deviation D1sd (as shading) on the x axis. Next, I want to add D1b and its standard deviation on a second x axis. My second question: is it possible to plot the second set of data, D2, in a panel next to the first one? I'm thinking of the way spplot puts panels next to each other. Thanks!
zone D1 D1sd D1b D1bsd D2 D2sd D2b D2bsd
-10 6.018198819 1.353674355 0.820238734 0.299921523 6.149905542 1.559112995 0.71903318 0.281436916
-9 6.016694189 1.348320178 0.790463895 0.320471326 6.225247218 1.810133214 0.690944285 0.291123921
-8 6.075920068 1.268199241 0.792396958 0.295767298 6.452827975 1.890055573 0.698130383 0.285354803
-7 6.014926533 1.15754388 0.826652396 0.269340472 6.364786271 1.677836628 0.748784125 0.262342978
-6 5.934024155 1.097224151 0.876312952 0.287715603 6.167672962 1.558124318 0.755995918 0.265152681
-5 6.180879693 1.115373166 0.911045374 0.302416557 6.429580579 1.485044161 0.783518016 0.255475422
-4 6.215761357 1.287465467 0.930981232 0.302896699 6.579955644 1.388358072 0.810873074 0.234479504
-3 6.191414137 1.297136068 0.859521028 0.301839757 6.72533907 1.383269712 0.786424272 0.242793151
-2 6.249558839 1.484243431 0.870789671 0.315339266 6.738830636 1.39348093 0.822833797 0.28853238
-1 6.279693424 1.462642241 0.890051094 0.313090388 6.665698185 1.272444414 0.849884276 0.309606843
0 6.389352438 1.653046732 0.911295197 0.332748249 6.623842834 1.3384852 0.860175975 0.311888845
1 6.421109477 1.954238381 0.917046385 0.349039084 6.633736605 1.627187751 0.880706612 0.346350393
2 6.187522396 1.994178951 0.881417644 0.38571426 6.422238767 1.685610306 0.875399565 0.351651773
3 5.975654953 2.180870669 0.871365681 0.444535385 6.245207747 1.925609129 0.915266481 0.424662193
4 5.681784682 2.182018258 0.846469896 0.38550673 6.004553419 1.947533306 0.890484046 0.404342645
5 5.550390285 2.189799132 0.834608476 0.340348644 5.831848009 1.849502381 0.887486532 0.387460845
6 5.382758749 2.460409982 0.832118248 0.360057614 5.810419947 2.06423957 0.954814407 0.38078381
7 4.819027419 2.643911373 0.78895866 0.38043413 5.42194855 2.259929373 0.935858628 0.37891625
8 3.782918423 2.584426217 0.643611576 0.335647266 4.418220284 2.186679796 0.790979174 0.364691895
9 3.064023314 2.528951519 0.496242154 0.294101493 3.64670387 2.091471213 0.592464821 0.341064247
10 2.62392179 2.707531426 0.380282732 0.249942178 3.159422995 2.392110771 0.452474888 0.334645666

Load in data
dat <- read.table(text = "zone D1 D1sd D1b D1bsd D2 D2sd D2b D2bsd
-10 6.018198819 1.353674355 0.820238734 0.299921523 6.149905542 1.559112995 0.71903318 0.281436916
-9 6.016694189 1.348320178 0.790463895 0.320471326 6.225247218 1.810133214 0.690944285 0.291123921
-8 6.075920068 1.268199241 0.792396958 0.295767298 6.452827975 1.890055573 0.698130383 0.285354803
-7 6.014926533 1.15754388 0.826652396 0.269340472 6.364786271 1.677836628 0.748784125 0.262342978
-6 5.934024155 1.097224151 0.876312952 0.287715603 6.167672962 1.558124318 0.755995918 0.265152681
-5 6.180879693 1.115373166 0.911045374 0.302416557 6.429580579 1.485044161 0.783518016 0.255475422
-4 6.215761357 1.287465467 0.930981232 0.302896699 6.579955644 1.388358072 0.810873074 0.234479504
-3 6.191414137 1.297136068 0.859521028 0.301839757 6.72533907 1.383269712 0.786424272 0.242793151
-2 6.249558839 1.484243431 0.870789671 0.315339266 6.738830636 1.39348093 0.822833797 0.28853238
-1 6.279693424 1.462642241 0.890051094 0.313090388 6.665698185 1.272444414 0.849884276 0.309606843
0 6.389352438 1.653046732 0.911295197 0.332748249 6.623842834 1.3384852 0.860175975 0.311888845
1 6.421109477 1.954238381 0.917046385 0.349039084 6.633736605 1.627187751 0.880706612 0.346350393
2 6.187522396 1.994178951 0.881417644 0.38571426 6.422238767 1.685610306 0.875399565 0.351651773
3 5.975654953 2.180870669 0.871365681 0.444535385 6.245207747 1.925609129 0.915266481 0.424662193
4 5.681784682 2.182018258 0.846469896 0.38550673 6.004553419 1.947533306 0.890484046 0.404342645
5 5.550390285 2.189799132 0.834608476 0.340348644 5.831848009 1.849502381 0.887486532 0.387460845
6 5.382758749 2.460409982 0.832118248 0.360057614 5.810419947 2.06423957 0.954814407 0.38078381
7 4.819027419 2.643911373 0.78895866 0.38043413 5.42194855 2.259929373 0.935858628 0.37891625
8 3.782918423 2.584426217 0.643611576 0.335647266 4.418220284 2.186679796 0.790979174 0.364691895
9 3.064023314 2.528951519 0.496242154 0.294101493 3.64670387 2.091471213 0.592464821 0.341064247
10 2.62392179 2.707531426 0.380282732 0.249942178 3.159422995 2.392110771 0.452474888 0.334645666", header = TRUE)
First simple solution
This first approach is the 'normal' way of doing it, with zone on the x axis. Normally we could then flip x and y with coord_flip(), but unfortunately that doesn't work with facets and free scales.
library(ggplot2)
dat2 <- data.frame(D = rep(c("D1", "D1b", "D2", "D2b"), each = nrow(dat)),
group = rep(c('1', '2'), each = nrow(dat) * 2),
zone = dat$zone,
value = unlist(dat[c(2, 4, 6, 8)]),
SD = unlist(dat[c(3, 5, 7, 9)]))
ggplot(dat2, aes(zone, value, ymin = value - SD, ymax = value + SD, fill = group)) +
geom_point() + geom_line() + geom_ribbon(alpha = 0.2) +
facet_wrap(~D, scales = 'free') +
theme_bw()
A solution with flipped axes
You can actually get flipped axes when you manually draw the polygons. This code is hardly pretty, but you should get the idea.
polydat <- data.frame(D = rep(c("D1", "D1b", "D2", "D2b"), each = nrow(dat) * 2),
value = c(dat$D1 - dat$D1sd, rev(dat$D1 + dat$D1sd),
dat$D1b - dat$D1bsd, rev(dat$D1b + dat$D1bsd),
dat$D2 - dat$D2sd, rev(dat$D2 + dat$D2sd),
dat$D2b - dat$D2bsd, rev(dat$D2b + dat$D2bsd)),
zone = c(dat$zone, rev(dat$zone)),
group = rep(c('1', '2'), each = nrow(dat) * 4))
ggplot(dat2, aes(value, zone, fill = group)) +
geom_point() + geom_path() +
geom_polygon(data = polydat, alpha = 0.2) +
facet_wrap(~D, scales = 'free') +
theme_bw()
One way of getting this into two plots (D1 with D1b, and D2 with D2b) is to normalize each pair onto a common x-axis first, using scale() for example.
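A rough sketch of that normalization, assuming z-scores are an acceptable common axis. Only three rows (zones -10, 0 and 10) are copied from the table above to keep it short, and the names `dat_small` and `normalized` are made up for this illustration.

```r
# Three rows copied from the data above (zone -10, 0 and 10).
dat_small <- data.frame(
  zone  = c(-10, 0, 10),
  D1    = c(6.018198819, 6.389352438, 2.62392179),
  D1sd  = c(1.353674355, 1.653046732, 2.707531426),
  D1b   = c(0.820238734, 0.911295197, 0.380282732),
  D1bsd = c(0.299921523, 0.332748249, 0.249942178)
)

# scale() centers each column and divides it by its standard deviation.
z <- scale(dat_small[c("D1", "D1b")])
spreads <- attr(z, "scaled:scale")

normalized <- data.frame(
  zone  = dat_small$zone,
  D1    = z[, "D1"],
  D1b   = z[, "D1b"],
  # The width of a +/- SD band only needs dividing by the same spread.
  D1sd  = dat_small$D1sd  / spreads[["D1"]],
  D1bsd = dat_small$D1bsd / spreads[["D1b"]]
)
normalized
```

After this step D1 and D1b are on the same unitless scale, so they could share one panel. The original units are lost on the axis, which may or may not be acceptable.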

Adding vertical line in plot ggplot

I am plotting a graph using the following piece of code:
library (ggplot2)
png (filename = "graph.png")
stats <- read.table("processed-r.dat", header=T, sep=",")
attach (stats)
stats <- stats[order(best), ]
sp <- stats$A / stats$B
index <- seq (1, sum (sp >= 1.0))
stats <- data.frame (x=index, y=sp[sp>=1.0])
ggplot (data=stats, aes (x=x, y=y, group=1)) + geom_line()
dev.off ()
1 - How can one add a vertical line to the plot that intersects the curve at a particular value of y (for example 2)?
2 - How can one make the y-axis start at 0.5 instead of 1?

You can add vertical line with geom_vline(). In your case:
+ geom_vline(xintercept=2)
If you also want to see the value 0.5 on your y axis, add scale_y_continuous() and set limits= and breaks=
+ scale_y_continuous(breaks=c(0.5,1,2,3,4,5),limits=c(0.5,6))

Regarding the first question:
This answer is assuming that the value of Y you desire is specifically within your data set. First, let's create a reproducible example as I cannot access your data set:
set.seed(9999)
stats <- data.frame(y = sort(rbeta(250, 1, 10)*10 ,decreasing = TRUE), x = 1:250)
ggplot(data=stats, aes (x=x, y=y, group=1)) + geom_line()
What you need to do is to use the y column in your data frame to search for the specific value. Essentially you will need to use
ggplot(data=stats, aes (x=x, y=y, group=1)) + geom_line() +
geom_vline(xintercept = stats[stats$y == 2, "x"])
Using the data I generated above, here's an example. Since my data frame does not likely contain the exact value 2, I will use the trunc function to search for it:
stats[trunc(stats$y) == 2, ]
# y x
# 9 2.972736 9
# 10 2.941141 10
# 11 2.865942 11
# 12 2.746600 12
# 13 2.741729 13
# 14 2.693501 14
# 15 2.680031 15
# 16 2.648504 16
# 17 2.417008 17
# 18 2.404882 18
# 19 2.370218 19
# 20 2.336434 20
# 21 2.303528 21
# 22 2.301500 22
# 23 2.272696 23
# 24 2.191114 24
# 25 2.136638 25
# 26 2.067315 26
Now we know where all the values between 2 and 3 are. Since this graph is decreasing, the value closest to 2 is in the last of these rows:
tail(stats[trunc(stats$y) == 2, ], 1)
# y x
# 26 2.067315 26
And we can use that value to specify where the x intercept should be:
ggplot(data=stats, aes (x=x, y=y, group=1)) + geom_line() +
geom_vline(xintercept = rev(stats[trunc(stats$y) == 2, "x"])[1])
Hope that helps!
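An alternative sketch that avoids trunc(): pick the row whose y is nearest to the target with which.min(). It reuses the simulated data from this answer, and `target`, `idx` and `x_at_target` are illustrative names. Unlike the trunc() approach, the nearest point may lie slightly below the target rather than above it.

```r
library(ggplot2)

set.seed(9999)
stats <- data.frame(y = sort(rbeta(250, 1, 10) * 10, decreasing = TRUE), x = 1:250)

target <- 2
# Index of the row whose y value is closest to the target.
idx <- which.min(abs(stats$y - target))
x_at_target <- stats$x[idx]

p <- ggplot(stats, aes(x = x, y = y, group = 1)) +
  geom_line() +
  geom_vline(xintercept = x_at_target, linetype = "dashed")
```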

Coloring line segments in ggplot2

Suppose I have the following data for a student's score on a test.
set.seed(1)
df <- data.frame(question = 0:10,
resp = c(NA,sample(c("Correct","Incorrect"),10,replace=TRUE)),
score.after.resp=50)
for (i in 1:10) {
  # add 5 points for a correct response, subtract 5 for an incorrect one
  step <- if (df$resp[i + 1] == "Correct") 5 else -5
  df$score.after.resp[i + 1] <- df$score.after.resp[i] + step
}
df
question resp score.after.resp
1 0 <NA> 50
2 1 Correct 55
3 2 Correct 60
4 3 Incorrect 55
5 4 Incorrect 50
6 5 Correct 55
7 6 Incorrect 50
8 7 Incorrect 45
9 8 Incorrect 40
10 9 Incorrect 35
11 10 Correct 40
I want to get the following graph:
library(ggplot2)
ggplot(df,aes(x = question, y = score.after.resp)) + geom_line() + geom_point()
My problem is that I want to color the segments of this line according to the student's response: if the response is correct (increasing), the line segment should be green, and if it is incorrect (decreasing), it should be red.
I tried the following code, but it did not work:
ggplot(df,aes(x = question, y = score.after.resp, color=factor(resp))) +
geom_line() + geom_point()
Any ideas?

I would probably approach this a little differently, and use geom_segment instead:
df1 <- as.data.frame(with(df,cbind(embed(score.after.resp,2),embed(question,2))))
colnames(df1) <- c('yend','y','xend','x')
df1$col <- ifelse(df1$y - df1$yend >= 0,'Decrease','Increase')
ggplot(df1) +
geom_segment(aes(x = x,y = y,xend = xend,yend = yend,colour = col)) +
geom_point(data = df,aes(x = question,y = score.after.resp))
A brief explanation:
I'm using embed to transform the x and y variables into starting and ending points for each line segment, and then simply adding a variable that indicates whether each segment went up or down. Then I used the previous data frame to add the original points themselves.
Alternatively, I suppose you could use geom_line something like this:
df$resp1 <- c(as.character(df$resp[-1]),NA)
ggplot(df,aes(x = question, y = score.after.resp, color=factor(resp1),group = 1)) +
geom_line() + geom_point(color = "black")
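To get the green and red colours the question asks for, a manual colour scale can be added to the segment approach. The sketch below is self-contained: the scores are copied from the printed data frame in the question, the segments are built with plain indexing instead of embed, and the names `seg` and `direction` are made up for this example.

```r
library(ggplot2)

# Scores copied from the printed data frame in the question.
df <- data.frame(question = 0:10,
                 score.after.resp = c(50, 55, 60, 55, 50, 55, 50, 45, 40, 35, 40))

# One row per line segment: start point (x, y) and end point (xend, yend).
n <- nrow(df)
seg <- data.frame(x    = df$question[-n],
                  xend = df$question[-1],
                  y    = df$score.after.resp[-n],
                  yend = df$score.after.resp[-1])
seg$direction <- ifelse(seg$yend > seg$y, "Increase", "Decrease")

p <- ggplot(seg) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend, colour = direction)) +
  geom_point(data = df, aes(x = question, y = score.after.resp)) +
  # Named vector so each level gets the intended colour regardless of order.
  scale_colour_manual(values = c(Increase = "green", Decrease = "red"))
```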

By default ggplot2 groups the data according to the aesthetics that are mapped to factors. You can override this default by setting group explicitly,
last_plot() + aes(group=NA)

How to create summary tables and graphs in R by looping through the response variables (in columns)

I have a dataset with multiple response variables and three treatments. Treatment 2 is nested within treatment 1, and treatment 3 is nested within treatment 2. I would like to run this over 22 response variables, of which only three are shown in the demo table for the sake of simplicity.
My objectives:
1 - To visualize how the response variable(s) change based on the treatment combination. I have created a script to perform this on one response variable, and I am copy-pasting this code to run it on the other columns, which seems to me an extremely crude way to do it. This leads to my second objective.
2 - To automate or modify the following script so that it automatically loops through the columns and produces the desired tables and graphs.
Demo data:
demo.table
Here is my script:
library(doBy)
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
attach (demo)
cdataNA <- summaryBy(tyr ~ spp + wat + ins, data=demo, FUN=c(length2,mean,sd), na.rm=TRUE)
# Rename column tyr.length2 to just N
names(cdataNA)[names(cdataNA)=="tyr.length2"] <- "N"
# Calculate standard error of the mean
cdataNA$tyr.SE <- cdataNA$tyr.sd / sqrt(cdataNA$N)
cdataNA
# Now create a barplot using ggplot2
library(ggplot2)
a <- ggplot(cdataNA, aes(x = wat, y = tyr.mean, fill = ins))
b <- a + geom_bar(stat = "identity", position = "dodge") + facet_grid (~ spp)
# Now put errorbars.
c <- b + geom_errorbar(aes(ymin=tyr.mean-tyr.SE, ymax=tyr.mean+tyr.SE),
width=.2, # Width of the error bars
position=position_dodge(.9)) +
xlab ("wat") +
ylab ("tyr (PA/PA std)")
c
## esc
library(doBy)
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
cdataNA1 <- summaryBy(esc ~ spp + wat + ins, data=demo, FUN=c(length2,mean,sd), na.rm=TRUE)
# Rename column esc.length2 to just N
names(cdataNA1)[names(cdataNA1)=="esc.length2"] <- "N"
# Calculate standard error of the mean
cdataNA1$esc.SE <- cdataNA1$esc.sd / sqrt(cdataNA1$N)
cdataNA1
# Now create a barplot using ggplot2
library(ggplot2)
a1 <- ggplot(cdataNA1, aes(x = wat, y = esc.mean, fill = ins))
b1 <- a1 + geom_bar(stat = "identity", position = "dodge") + facet_grid (~ spp)
# Now put errorbars.
c1 <- b1 + geom_errorbar(aes(ymin=esc.mean-esc.SE, ymax=esc.mean+esc.SE),
width=.2, # Width of the error bars
position=position_dodge(.9)) +
xlab ("wat") +
ylab ("esc (PA/PA std)")
c1
Resulting table for tyr:
spp wat ins N tyr.mean tyr.sd tyr.SE
1 Bl High No 4 0.305325 0.034102041 0.017051020
2 Bl High Yes 5 0.186140 0.045165894 0.020198802
3 Bl Low No 5 0.310540 0.061810096 0.027642315
4 Bl Low Yes 5 0.202840 0.029034944 0.012984822
5 Man High No 4 0.122725 0.075867005 0.037933503
6 Man High Yes 5 0.081800 0.013463469 0.006021046
7 Man Low No 5 0.079880 0.009569587 0.004279650
8 Man Low Yes 4 0.083550 0.018431947 0.009215973
Resulting graph for esc:
[Figure: demo figure for esc]
So the whole thing works, but it still requires considerable manual labor, which impedes the workflow. It would be great to achieve automation.
Thanks in advance.

You can organize the data in just two calls, using melt and cast from the reshape package (dta is your data frame):
library(reshape)
melt.dta <- melt(dta, id.vars = c("spp", "wat", "ins"), measure.vars = "tyr")
cast(melt.dta, spp + wat + ins ~ .,
function (x) c("N" = sum(!is.na(x)),
"mean" = mean(x, na.rm = TRUE),
"sd" = sd(x, na.rm = TRUE),
"se" = sd(x, na.rm = TRUE)/sqrt(sum(!is.na(x)))))
It returns:
spp wat ins N mean sd se
1 Bl High No 4 0.3053 0.03410 0.01705
2 Bl High Yes 5 0.1861 0.04517 0.02020
3 Bl Low No 5 0.3105 0.06181 0.02764
4 Bl Low Yes 5 0.2028 0.02903 0.01298
5 Man High No 4 0.1227 0.07587 0.03793
6 Man High Yes 5 0.0818 0.01346 0.00602
7 Man Low No 5 0.0799 0.00957 0.00428
8 Man Low Yes 4 0.0835 0.01843 0.00922
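To loop over all response columns, one possible sketch wraps the summary and the plot in two small functions and applies them to a vector of column names. It uses base R aggregate() rather than reshape or doBy. The `demo` data frame below is a random stand-in (the real demo table is not shown here), and `summarise_response` and `plot_response` are hypothetical helper names.

```r
library(ggplot2)

# Hypothetical stand-in for the demo table; the values are random.
set.seed(42)
demo <- expand.grid(rep = 1:5,
                    spp = c("Bl", "Man"),
                    wat = c("High", "Low"),
                    ins = c("No", "Yes"))
demo$tyr <- runif(nrow(demo))
demo$esc <- runif(nrow(demo))
demo$tyr[3] <- NA  # one missing value to exercise the NA handling

# Summarise one response column by treatment combination.
summarise_response <- function(data, response, groups = c("spp", "wat", "ins")) {
  stats <- aggregate(data[[response]], by = data[groups],
                     FUN = function(x) c(N = sum(!is.na(x)),
                                         mean = mean(x, na.rm = TRUE),
                                         sd = sd(x, na.rm = TRUE)))
  # aggregate() returns the three statistics as one matrix column; unpack it.
  out <- cbind(stats[groups], as.data.frame(stats[[length(groups) + 1]]))
  out$se <- out$sd / sqrt(out$N)
  out
}

# Dodged bar plot with error bars for one summary table.
plot_response <- function(tab, response) {
  ggplot(tab, aes(x = wat, y = mean, fill = ins)) +
    geom_col(position = "dodge") +
    geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                  width = 0.2, position = position_dodge(0.9)) +
    facet_grid(~ spp) +
    labs(x = "wat", y = paste(response, "(PA/PA std)"))
}

responses <- c("tyr", "esc")  # extend to all 22 column names
tables <- lapply(responses, function(r) summarise_response(demo, r))
names(tables) <- responses
plots <- Map(plot_response, tables, responses)
```

`tables$tyr` then holds the summary table for tyr and `plots$esc` the bar plot for esc. Because the summary columns have generic names (N, mean, sd, se), the same plotting function works for every response.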
