I would like to plot a grouped boxplot using ggplot. Something like the picture below:
Below please see a sample (10 rows) from my data:
alpha colsample_bytree best_F1
35 0.00 0.5 0.5825656
78 0.10 0.3 0.4716612
68 0.00 0.3 0.4714286
27 0.40 1.0 0.4786216
49 0.15 0.5 0.4943968
62 0.00 0.3 0.4938805
70 0.00 0.3 0.4849785
73 0.10 0.3 0.4997061
59 0.30 0.5 0.4856369
88 0.20 0.3 0.4552402
sort(unique(data$alpha))
0 0.1 0.15 0.2 0.3 0.4
sort(unique(data$colsample_bytree))
0.3 0.5 1
My code is the following:
library(ggplot2)
library(ggthemes)
ggplot(data, aes(x= colsample_bytree, y = best_F1, fill = as.factor(alpha))) +
geom_boxplot(alpha = 0.5, position=position_dodge(1)) + theme_economist() +
ggtitle("F1 for alpha and colsample_bytree")
This produces the following plot:
and the following Warning:
Warning message:
"position_dodge requires non-overlapping x intervals"
Since the variable colsample_bytree takes 3 discrete values and the variable alpha takes 6, I would expect to see 3 groups of boxplots, each group made up of 6 boxplots corresponding to the different alpha values and each group positioned at a different value of colsample_bytree, i.e. 0.3, 0.5 and 1.
I would expect the boxplots to not overlap just like in the example I cited.
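You just have to convert colsample_bytree to a factor, i.e. run data$colsample_bytree <- as.factor(data$colsample_bytree), before you plot your data with the ggplot command. With a numeric x, ggplot2 treats the axis as continuous and groups the boxes by alpha only, so the boxes end up with overlapping x intervals, which is what the warning reports.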
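As a minimal sketch of that fix, assuming synthetic data (the data frame `demo` and its random best_F1 values are made up for illustration and only mimic the column names in the question), converting the x variable to a factor should give one dodged group of boxes per colsample_bytree level:

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: 6 alpha values x 3 colsample_bytree values, 10 runs each
demo <- expand.grid(
  alpha = c(0, 0.1, 0.15, 0.2, 0.3, 0.4),
  colsample_bytree = c(0.3, 0.5, 1),
  run = 1:10
)
demo$best_F1 <- runif(nrow(demo), min = 0.45, max = 0.6)

# Treat the x variable as discrete so that dodging works within each x level
demo$colsample_bytree <- as.factor(demo$colsample_bytree)

p <- ggplot(demo, aes(x = colsample_bytree, y = best_F1, fill = as.factor(alpha))) +
  geom_boxplot(alpha = 0.5, position = position_dodge(1)) +
  ggtitle("F1 for alpha and colsample_bytree")
p
```

The theme_economist() call from the question is left out here only to avoid the extra ggthemes dependency; it can be added back unchanged.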
I have this data frame "df" (showing 15 of the 1000 tuples)
inf sup frec prob
1 1.000318 1.005308 12 0.060
2 1.005308 1.010297 5 0.025
3 1.010297 1.015286 5 0.025
4 1.015286 1.020276 2 0.010
5 1.020276 1.025265 3 0.015
6 1.025265 1.030254 3 0.015
7 1.030254 1.035244 8 0.040
8 1.035244 1.040233 2 0.010
9 1.040233 1.045223 3 0.015
10 1.045223 1.050212 0 0.000
11 1.050212 1.055201 4 0.020
12 1.055201 1.060191 1 0.005
13 1.060191 1.065180 1 0.005
14 1.065180 1.070169 0 0.000
15 1.070169 1.075159 1 0.005
For each row i, I want to plot a horizontal segment spanning the x interval [inf[i], sup[i]] at height y = prob[i].
I tried this solution, using a "for loop" to plot each segment:
plot <- ggplot(data = df)
for(i in 1:15){
plot <- plot + geom_segment(aes(x = df$inf[i], xend = df$sup[i], y = df$prob[i], yend = df$prob[i]))
}
plot
But all I get is a single line at y = 0; I assume this is because my "prob" has values close to zero. The other problem is that if the for loop runs over a larger number of rows, an error pops up saying:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Is there any way to plot those segments by their x intervals?
Or would it be better to abandon the idea of intervals and plot a few points per interval instead?
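One possible approach, sketched here with the first five rows of the sample above: geom_segment() works on whole columns, so the columns can be mapped once inside aes() and no loop is needed. The loop version probably misbehaves because aes() stores the expression df$inf[i] unevaluated, so every layer ends up drawn with the final value of i. That explanation is my reading of how ggplot2 evaluates aesthetics, not something stated in the question.

```r
library(ggplot2)

# First five rows of the sample data from the question
df <- data.frame(
  inf  = c(1.000318, 1.005308, 1.010297, 1.015286, 1.020276),
  sup  = c(1.005308, 1.010297, 1.015286, 1.020276, 1.025265),
  frec = c(12, 5, 5, 2, 3),
  prob = c(0.060, 0.025, 0.025, 0.010, 0.015)
)

# A single layer draws one horizontal segment per row
p <- ggplot(df) +
  geom_segment(aes(x = inf, xend = sup, y = prob, yend = prob))
p
```

Because there is only one layer, this should also avoid the deep nesting that triggers the recursion error when many layers are added in a loop.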
I am using ggplot2 2.1.0 to plot histograms, and I see unexpected behaviour concerning the histogram bins.
Here is an example with left-closed bins (i.e. [ 0, 0.1 [ ) and a binwidth of 0.1.
mydf <- data.frame(myvar=c(-1,-0.5,-0.4,-0.1,-0.1,0.05,0.1,0.1,0.25,0.5,1))
myplot <- ggplot(mydf, aes(myvar)) + geom_histogram(aes(y=..count..),binwidth = 0.1, boundary=0.1,closed="left")
myplot
ggplot_build(myplot)$data[[1]]
In this example, one would expect the value -0.4 to be within the bin [-0.4, -0.3[, but it falls instead (mysteriously) in the bin [-0.5,-0.4[. The same goes for the value -0.1, which falls in [-0.2,-0.1[ instead of [-0.1,0[, and so on.
Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?
Thanks in advance,
Best regards,
Arnaud
PS: Also asked here: https://github.com/hadley/ggplot2/issues/1651
Edit: The problem described below was fixed in a recent release of ggplot2.
Your issue is reproducible and appears to be caused by rounding errors, as suggested in the comments by Roland. At this point, this looks to me like a bug introduced in version ggplot2_2.0.0. I speculate below about its origin, but first let me present a workaround based on the boundary option.
PROBLEM:
df <- data.frame(var = seq(-100,100,10)/100)
as.list(df) # check the data
$var
[1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
[10] -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
[19] 0.8 0.9 1.0
library("ggplot2")
p <- ggplot(data = df, aes(x = var)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.1,
boundary = 0.1,
closed = "left")
p
SOLUTION
Tweak the boundary parameter. In this example, setting it just below 1, say 0.99, works. Your use case should be amenable to tweaking too.
ggplot(data = df, aes(x = var)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.05,
boundary = 0.99,
closed = "left")
(I have made the binwidth narrower to make the plot easier to read)
Another workaround is to introduce your own fuzziness, e.g. multiply the data by 1 plus a small multiple of the machine epsilon (see eps below). ggplot2's own fuzziness uses a factor of 1e-7 (earlier versions) or 1e-8 (later versions).
CAUSE:
The problem appears clearly in ncount:
str(ggplot_build(p)$data[[1]])
## 'data.frame': 20 obs. of 17 variables:
## $ y : num 1 1 1 1 1 2 1 1 1 0 ...
## $ count : num 1 1 1 1 1 2 1 1 1 0 ...
## $ x : num -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25 -0.15 -0.05 ...
## $ xmin : num -1 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 ...
## $ xmax : num -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 ...
## $ density : num 0.476 0.476 0.476 0.476 0.476 ...
## $ ncount : num 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0 ...
## $ ndensity: num 1.05 1.05 1.05 1.05 1.05 2.1 1.05 1.05 1.05 0 ...
## $ PANEL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ group : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax : num 1 1 1 1 1 2 1 1 1 0 ...
## $ colour : logi NA NA NA NA NA NA ...
## $ fill : chr "grey35" "grey35" "grey35" "grey35" ...
## $ size : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype: num 1 1 1 1 1 1 1 1 1 1 ...
## $ alpha : logi NA NA NA NA NA NA ...
ggplot_build(p)$data[[1]]$ncount
## [1] 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.0 1.0 0.5
## [13] 0.5 0.5 0.0 1.0 0.5 0.0 1.0 0.5
ROUNDING ERRORS?
Looks like:
df <- data.frame(var = as.integer(seq(-100,100,10)))
# eps <- 1.000000000000001 # on my system
eps <- 1+10*.Machine$double.eps
p <- ggplot(data = df, aes(x = eps*var/100)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.05,
closed = "left")
p
(I have removed the boundary option altogether)
This behaviour appears some time after ggplot2_1.0.1. Looking at the source code, e.g. bin.R and stat-bin.r in https://github.com/hadley/ggplot2/blob/master/R, and tracing the computations of count leads to function bin_vector(), which contains the following lines:
bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
... STUFF HERE I HAVE DELETED FOR CLARITY ...
cut(x, bins$breaks, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
}
By comparing the current versions of these functions with older ones, you should be able to find the reason for the different behaviour... to be continued...
SUMMING UP DEBUGGING
By "patching" the bin_vector function and printing the output to screen, it appears that:
bins$fuzzy correctly stores the fuzzy parameters
The non-fuzzy bins$breaks are used in the computations, but as far as I can see (and correct me if I'm wrong) the bins$fuzzy are not.
If I simply replace bins$breaks with bins$fuzzy at the top of bin_vector, the correct plot is returned. Not a proof of a bug, but a suggestion that perhaps more could be done to emulate the behaviour of previous versions of ggplot2.
At the top of bin_vector I expected to find a condition upon which to return either bins$breaks or bins$fuzzy. I think that's missing now.
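The difference between exact and fuzzy breaks can be illustrated in base R, independently of ggplot2. This is only a sketch: the fuzz of 1e-8 is the value mentioned above, but the way the edges are shifted here (all edges down, the last one up) is an assumption made for the illustration and is not copied from the ggplot2 source.

```r
x <- seq(-100, 100, 10) / 100      # data values, as in the example above
breaks <- seq(-1, 1, by = 0.1)     # nominal bin edges for a binwidth of 0.1

# Values that print identically need not be bit-for-bit equal
exact_equal <- x == breaks
approx_equal <- isTRUE(all.equal(x, breaks))

# Left-closed bins with exact edges: any value a hair below its edge
# would be counted in the bin to the left
counts_exact <- table(cut(x, breaks, right = FALSE, include.lowest = TRUE))

# Shift the edges by a small fuzz so that tiny rounding errors
# can no longer move a value across an edge
fuzz <- 1e-8
fuzzy_breaks <- breaks + c(rep(-fuzz, length(breaks) - 1), fuzz)
counts_fuzzy <- table(cut(x, fuzzy_breaks, right = FALSE, include.lowest = TRUE))

print(exact_equal)
print(counts_exact)
print(counts_fuzzy)
```

With the fuzzy edges, every value is counted in the bin that starts at its own edge, and the last bin also takes the maximum, which is the behaviour one would expect from closed = "left".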
PATCHING
To "patch" the bin_vector function, copy the function definition from the github source or, more conveniently, from the terminal, with:
ggplot2:::bin_vector
Modify it (patch it) and assign it into the namespace:
library("ggplot2")
bin_vector <- function (x, bins, weight = NULL, pad = FALSE)
{
... STUFF HERE I HAVE DELETED FOR CLARITY ...
## MY PATCH: Replace bins$breaks with bins$fuzzy
bin_idx <- cut(x, bins$fuzzy, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
ggplot2:::bin_out(bin_count, bin_x, bin_widths)
## THIS IS THE PATCHED FUNCTION
}
assignInNamespace("bin_vector", bin_vector, ns = "ggplot2")
df <- data.frame(var = seq(-100,100,10)/100)
ggplot(data = df, aes(x = var)) + geom_histogram(aes(y = ..count..), binwidth = 0.05, boundary = 1, closed = "left")
Just to be clear, the code above is edited for clarity: the function has a lot of type-checking and other calculations which I have removed, but which you would need to keep in order to patch the function. Before you run the patch, restart your R session or detach your currently loaded ggplot2.
OLD VERSIONS
The unexpected behaviour is NOT observed in versions ggplot2_0.9.3 or ggplot2_1.0.1 and appears to originate in the current release ggplot2_2.0.1 (or perhaps the earlier ggplot2_2.0.0, which gave me an error when I tried to call it).
To install and load an old version, say ggplot2_0.9.3, create a separate directory (no point in overwriting the current version), say ggplot2093:
URL <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.3.tar.gz"
install.packages(URL, repos = NULL, type = "source",
lib = "~/R/testing/ggplot2093")
To load the old version, call it from your local directory:
library("ggplot2", lib.loc = "~/R/testing/ggplot2093")
I want to get the hexadecimal codes of the colors that the scale_fill_grey function uses to fill the categories of the barplot produced by the following code:
library(ggplot2)
data <- data.frame(
Meal = factor(c("Breakfast","Lunch","Dinner","Snacks"),
levels=c("Breakfast","Lunch","Dinner","Snacks")),
Cost = c(9.75,13,19,10.20))
ggplot(data=data, aes(x=Meal, y=Cost, fill=Meal)) +
geom_bar(stat="identity") +
scale_fill_grey(start=0.8, end=0.2)
scale_fill_grey() uses grey_pal() from the scales package, which in turn uses grey.colors(). So, you can generate the codes for the scale of four colours that you used as follows:
grey.colors(4, start = 0.8, end = 0.2)
## [1] "#CCCCCC" "#ABABAB" "#818181" "#333333"
This shows a plot with the colours:
image(1:4, 1, matrix(1:4), col = grey.colors(4, start = 0.8, end = 0.2))
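To see which code belongs to which category, the colours can be paired with the factor levels in a named vector. This is a small base-R sketch; the names `meals` and `meal_cols` are introduced here for illustration only.

```r
# The factor levels from the example above, in level order
meals <- c("Breakfast", "Lunch", "Dinner", "Snacks")

# One grey per level, using the same start and end as scale_fill_grey()
cols <- grey.colors(length(meals), start = 0.8, end = 0.2)
meal_cols <- setNames(cols, meals)
print(meal_cols)
```

A named vector like this could also be passed to scale_fill_manual(values = meal_cols) if the same greys are needed in another plot.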
Using the ggplot_build() function:
#assign ggplot to a variable
myplot <- ggplot(data=data, aes(x=Meal, y=Cost, fill=Meal)) +
geom_bar(stat="identity") +
scale_fill_grey(start=0.8, end=0.2)
#get build
myplotBuild <- ggplot_build(myplot)
#see colours
myplotBuild$data
# [[1]]
# fill x y PANEL group ymin ymax xmin xmax colour size linetype alpha
# 1 #CCCCCC 1 9.75 1 1 0 9.75 0.55 1.45 NA 0.5 1 NA
# 2 #ABABAB 2 13.00 1 2 0 13.00 1.55 2.45 NA 0.5 1 NA
# 3 #818181 3 19.00 1 3 0 19.00 2.55 3.45 NA 0.5 1 NA
# 4 #333333 4 10.20 1 4 0 10.20 3.55 4.45 NA 0.5 1 NA
I need to create a matrix of histograms with ggplot and facet_wrap(). More or less the code I have is the following:
df_3<-data.frame(rnorm(1000),...,rnorm(1000))
#The data frame has 1000 observations and 16 variables.
colnames(df_3) <- letters[1:16]
library(ggplot2)
gr12 <- ggplot(df_3, aes(x=observations)) + geom_histogram()
My question is: how can I plot the matrix of histograms with facet_wrap() without a factor variable?
Melt the dataframe into a long format (see reshape2), thus your data frame goes from being
a b c ... p
1 0.1 0.2 0.3 ... 0.16
2 0.1 0.1 0.2 ... 0.00
(My internal randomizer is really bad.)
to
variable value
a 0.1
b 0.2
c 0.3
... ...
p 0.16
a 0.1
b 0.1
c 0.2
... ...
p 0.00
Then variable is your factor that you wish to facet by. If the long formatted data frame is df_4, I imagine you could do
ggplot(df_4, aes(x=value)) + geom_histogram() + facet_wrap(~variable)
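A runnable sketch of the whole idea, under a few assumptions: the data are simulated, and base R's stack() is used as a stand-in for reshape2::melt() so that no extra package is needed. stack() names its columns `values` and `ind` rather than `value` and `variable`, so the aesthetics below use those names.

```r
library(ggplot2)

set.seed(42)
# Stand-in for df_3: 1000 observations of 16 variables named a to p
df_3 <- as.data.frame(matrix(rnorm(16 * 1000), ncol = 16))
colnames(df_3) <- letters[1:16]

# Wide to long: one row per (variable, value) pair
df_4 <- stack(df_3)

# One histogram panel per original column
p <- ggplot(df_4, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ ind)
p
```

With reshape2 installed, melt(df_3) should give an equivalent long data frame with the column names used in the answer above.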