geom_histogram: wrong bins? - r

I am using ggplot 2.1.0 to plot histograms, and I have an unexpected behaviour concerning the histogram bins.
I put here an example with left-closed bins (i.e. [ 0, 0.1 [ ) with a binwidth of 0.1.
mydf <- data.frame(myvar=c(-1,-0.5,-0.4,-0.1,-0.1,0.05,0.1,0.1,0.25,0.5,1))
myplot <- ggplot(mydf, aes(myvar)) + geom_histogram(aes(y=..count..),binwidth = 0.1, boundary=0.1,closed="left")
myplot
ggplot_build(myplot)$data[[1]]
On this example, one may expect the value -0.4 to be within the bin [-0.4, -0.3[, but it falls instead (mysteriously) in the bin [-0.5,-0.4[. Same thing for the value -0.1 which falls in [-0.2,-0.1[ instead of [-0.1,0[...etc.
Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?
Thanks in advance,
Best regards,
Arnaud
PS: Also asked here: https://github.com/hadley/ggplot2/issues/1651

Edit: The problem described below was fixed in a recent release of ggplot2.
Your issue is reproducible and appears to be caused by rounding errors, as suggested in the comments by Roland. At this point, this looks to me like a bug introduced in version ggplot2_2.0.0. I speculate below about its origin, but first let me present a workaround based on the boundary option.
PROBLEM:
df <- data.frame(var = seq(-100,100,10)/100)
as.list(df) # check the data
$var
[1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
[10] -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
[19] 0.8 0.9 1.0
library("ggplot2")
p <- ggplot(data = df, aes(x = var)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.1,
boundary = 0.1,
closed = "left")
p
SOLUTION
Tweak the boundary parameter. In this example, setting it just below 1, say 0.99, works. Your use case should be amenable to similar tweaking.
ggplot(data = df, aes(x = var)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.05,
boundary = 0.99,
closed = "left")
(I have made the binwidth narrower for better visual)
Another workaround is to introduce your own fuzziness, e.g. multiply the data by 1 plus a small multiple of the machine epsilon (see eps below). ggplot2's own fuzz factor is 1e-7 (earlier versions) or 1e-8 (later versions).
CAUSE:
The problem appears clearly in ncount:
str(ggplot_build(p)$data[[1]])
## 'data.frame': 20 obs. of 17 variables:
## $ y : num 1 1 1 1 1 2 1 1 1 0 ...
## $ count : num 1 1 1 1 1 2 1 1 1 0 ...
## $ x : num -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25 -0.15 -0.05 ...
## $ xmin : num -1 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 ...
## $ xmax : num -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 ...
## $ density : num 0.476 0.476 0.476 0.476 0.476 ...
## $ ncount : num 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0 ...
## $ ndensity: num 1.05 1.05 1.05 1.05 1.05 2.1 1.05 1.05 1.05 0 ...
## $ PANEL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ group : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax : num 1 1 1 1 1 2 1 1 1 0 ...
## $ colour : logi NA NA NA NA NA NA ...
## $ fill : chr "grey35" "grey35" "grey35" "grey35" ...
## $ size : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype: num 1 1 1 1 1 1 1 1 1 1 ...
## $ alpha : logi NA NA NA NA NA NA ...
ggplot_build(p)$data[[1]]$ncount
## [1] 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.0 1.0 0.5
## [13] 0.5 0.5 0.0 1.0 0.5 0.0 1.0 0.5
ROUNDING ERRORS?
Looks like:
df <- data.frame(var = as.integer(seq(-100,100,10)))
# eps <- 1.000000000000001 # on my system
eps <- 1+10*.Machine$double.eps
p <- ggplot(data = df, aes(x = eps*var/100)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.05,
closed = "left")
p
(I have removed the boundary option altogether)
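The rounding effect can be seen without ggplot2 at all. The following is a minimal base R sketch, not ggplot2's internal code: the way the bin edges are built and the 1e-8 tolerance are assumptions chosen only for illustration. It bins three of the values from the question with left-closed intervals, once with raw computed edges and once with edges nudged down slightly.

```r
# Illustration only: base R, not ggplot2's internal code.
binwidth <- 0.1
breaks <- -1 + (0:20) * binwidth        # bin edges built by floating-point arithmetic
x <- c(-0.4, -0.1, 0.1)                 # values that sit exactly on a nominal edge

# Decimal fractions are not exact in binary floating point.
print(0.1 * 3 == 0.3)                   # FALSE
print(isTRUE(all.equal(0.1 * 3, 0.3)))  # TRUE: equal within tolerance

# Signed gap between each value and the nearest computed edge.
nearest <- sapply(x, function(v) breaks[which.min(abs(breaks - v))])
print(format(nearest - x, digits = 3))

# Left-closed binning with the raw edges, then with edges shifted down by a
# hypothetical tolerance so that values on an edge land in the bin to its right.
raw   <- cut(x, breaks, right = FALSE, include.lowest = TRUE)
fuzzy <- cut(x, breaks - 1e-8, right = FALSE, include.lowest = TRUE)
print(data.frame(x, raw, fuzzy))
```

Whenever a computed edge ends up a hair above its nominal value, the raw column puts the observation one bin too far left, while the fuzzy column does not.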
This behaviour appears some time after ggplot2_1.0.1. Looking at the source code, e.g. bin.R and stat-bin.r in https://github.com/hadley/ggplot2/blob/master/R, and tracing the computations of count leads to function bin_vector(), which contains the following lines:
bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
... STUFF HERE I HAVE DELETED FOR CLARITY ...
cut(x, bins$breaks, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
}
By comparing the current versions of these functions with older ones, you should be able to find the reason for the different behaviour... to be continued...
SUMMING UP DEBUGGING
By "patching" the bin_vector function and printing the output to screen, it appears that:
bins$fuzzy correctly stores the fuzzy parameters
The non-fuzzy bins$breaks are used in the computations, but as far as I can see (and correct me if I'm wrong) the bins$fuzzy are not.
If I simply replace bins$breaks with bins$fuzzy at the top of bin_vector, the correct plot is returned. Not a proof of a bug, but a suggestion that perhaps more could be done to emulate the behaviour of previous versions of ggplot2.
At the top of bin_vector I expected to find a condition upon which to return either bins$breaks or bins$fuzzy. I think that's missing now.
PATCHING
To "patch" the bin_vector function, copy the function definition from the github source or, more conveniently, from the terminal, with:
ggplot2:::bin_vector
Modify it (patch it) and assign it into the namespace:
library("ggplot2")
bin_vector <- function (x, bins, weight = NULL, pad = FALSE)
{
... STUFF HERE I HAVE DELETED FOR CLARITY ...
## MY PATCH: Replace bins$breaks with bins$fuzzy
bin_idx <- cut(x, bins$fuzzy, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
ggplot2:::bin_out(bin_count, bin_x, bin_widths)
## THIS IS THE PATCHED FUNCTION
}
assignInNamespace("bin_vector", bin_vector, ns = "ggplot2")
df <- data.frame(var = seq(-100,100,10)/100)
ggplot(data = df, aes(x = var)) + geom_histogram(aes(y = ..count..), binwidth = 0.05, boundary = 1, closed = "left")
Just to be clear, the code above is edited for clarity: the function has a lot of type-checking and other calculations which I have removed, but which you would need to patch the function. Before you run the patch, restart your R session or detach your currently loaded ggplot2.
OLD VERSIONS
The unexpected behaviour is NOT observed in ggplot2_0.9.3 or ggplot2_1.0.1. It appears in the 2.x releases (the question uses 2.1.0), perhaps starting with ggplot2_2.0.0, which gave me an error when I tried to call it.
To install and load an old version, say ggplot2_0.9.3, create a separate directory (no point in overwriting the current version), say ggplot2093:
URL <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.3.tar.gz"
install.packages(URL, repos = NULL, type = "source",
lib = "~/R/testing/ggplot2093")
To load the old version, call it from your local directory:
library("ggplot2", lib.loc = "~/R/testing/ggplot2093")

Related

Splitting up a correlation matrix in R

I have a dataset with 400 variables that I am using to produce a correlation matrix (i.e. comparing each variable against one another). The resulting matrix has the following structure:
        var_1   var_2   var_3   var_4   var_5   var_6   etc (to 400 vars)
var_1   1       0.1     0.2     0.2     0.4     0.8
var_2   0.1     1       0.15    0.3     0.11    0.6
var_3   0.2     0.15    1       0.47    0.05    0.72
var_4   0.2     0.3     0.47    1       0.25    0.54
var_5   0.4     0.11    0.05    0.25    1       0.84
var_6   0.8     0.6     0.72    0.54    0.84    1
etc (to 400 vars)
I am then generating a figure of the correlation matrix with the corrplot package using the following command:
library(corrplot)
corrplot(df, order = "hclust",
tl.col = "black", tl.srt = 45)
Unsurprisingly, this results in a very large figure that is uninterpretable. I was hoping to split up my matrix and then create separate correlation matrix figures (realistically 20 pairwise comparisons of variables at a time). I am struggling to find code that will help me split up my matrix and plug it back into corrplot. Any help would be hugely appreciated!
Thank you.
One approach would be to split your correlation matrix up and then plot all submatrices. The challenge with that is that you are setting order=, which you (apparently) do not know a priori. Assuming you want to allow corrplot to determine the order, then here's a method: plot the whole thing first capturing the function's return value (which contains order information), then split the matrix and plot the components.
Helpful: while most plotting functions operate by side-effect (creating a plot, not necessarily returning values), some return information useful for working with or around its plot components. corrplot is no different; from ?corrplot:
Value:
(Invisibly) returns a 'list(corr, corrTrans, arg)'. 'corr' is a
reordered correlation matrix for plotting. 'corrPos' is a data
frame with 'xName, yName, x, y, corr' and 'p.value'(if p.mat is
not NULL) column, which x and y are the position on the
correlation matrix plot. 'arg' is a list of some corrplot() input
parameters' value. Now 'type' is in.
With this, let's get started. I'll be using M, the six-variable example matrix from the question.
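The code that follows assumes a correlation matrix named M that is never defined on this page. A sketch of one way to build it from the six-variable table in the question (the name vars is introduced here only for illustration):

```r
# Build the 6 x 6 example correlation matrix from the question's table.
vars <- paste0("var_", 1:6)
M <- matrix(c(
  1,    0.1,  0.2,  0.2,  0.4,  0.8,
  0.1,  1,    0.15, 0.3,  0.11, 0.6,
  0.2,  0.15, 1,    0.47, 0.05, 0.72,
  0.2,  0.3,  0.47, 1,    0.25, 0.54,
  0.4,  0.11, 0.05, 0.25, 1,    0.84,
  0.8,  0.6,  0.72, 0.54, 0.84, 1
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# A correlation matrix must be symmetric with a unit diagonal.
stopifnot(isSymmetric(M), all(diag(M) == 1))
```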
Plot the whole thing. If this takes a long time or you don't want it to try to plot in R's graphics pane, then uncomment the png and dev.off, intended just to dump the plot itself to "nothing". ("NUL" is a windows thing ... I suspect "/dev/null" should work on most other OSes, untested.)
# png("NUL")
CP <- corrplot::corrplot(M, order="hclust", tl.col="black", tl.srt=45)
# dev.off()
str(CP)
# List of 3
# $ corr : num [1:6, 1:6] 1 0.4 0.8 0.1 0.2 0.2 0.4 1 0.84 0.11 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr [1:6] "var_1" "var_5" "var_6" "var_2" ...
# .. ..$ : chr [1:6] "var_1" "var_5" "var_6" "var_2" ...
# $ corrPos:'data.frame': 36 obs. of 5 variables:
# ..$ xName: chr [1:36] "var_1" "var_1" "var_1" "var_1" ...
# ..$ yName: chr [1:36] "var_1" "var_5" "var_6" "var_2" ...
# ..$ x : num [1:36] 1 1 1 1 1 1 2 2 2 2 ...
# ..$ y : num [1:36] 6 5 4 3 2 1 6 5 4 3 ...
# ..$ corr : num [1:36] 1 0.4 0.8 0.1 0.2 0.2 0.4 1 0.84 0.11 ...
# $ arg :List of 1
# ..$ type: chr "full"
rownames(CP$corr) (which equals colnames(CP$corr)) provides the column/row order, so we can use those as our ordering (resulting from order="hclust").
M_reord <- M[rownames(CP$corr), colnames(CP$corr)]
M_reord
# var_1 var_5 var_6 var_2 var_3 var_4
# var_1 1.0 0.40 0.80 0.10 0.20 0.20
# var_5 0.4 1.00 0.84 0.11 0.05 0.25
# var_6 0.8 0.84 1.00 0.60 0.72 0.54
# var_2 0.1 0.11 0.60 1.00 0.15 0.30
# var_3 0.2 0.05 0.72 0.15 1.00 0.47
# var_4 0.2 0.25 0.54 0.30 0.47 1.00
Now we split the matrix up. For this example, I'll assume your "20" is really "3". Some helper objects:
grps <- (seq_len(nrow(M)) - 1) %/% 3
grps
# [1] 0 0 0 1 1 1
eg <- expand.grid(row = unique(grps), col = unique(grps))
eg
# row col
# 1 0 0
# 2 1 0
# 3 0 1
# 4 1 1
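where the rows of eg number the submatrices columns-first (top to bottom, then left to right): 1 is the top-left block, 2 the bottom-left, 3 the top-right, and 4 the bottom-right.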
Given a specific submatrix number (row of eg), plot it. Let's try "2":
subplt <- 2
rows <- which(grps == eg[subplt, "row"])
cols <- which(grps == eg[subplt, "col"])
corrplot::corrplot(M_reord[rows, cols], tl.col="black", tl.srt=45)
If you want to automate plotting these (e.g., in an rmarkdown document, in shiny), then you can loop over them with:
for (subplt in seq_len(nrow(eg))) {
rows <- which(grps == eg[subplt, "row"])
cols <- which(grps == eg[subplt, "col"])
corrplot::corrplot(M_reord[rows, cols], tl.col="black", tl.srt=45)
}
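For the real 400-variable case, the index bookkeeping above can be wrapped in a small helper. This is a sketch in base R only; block_index and its upper_only argument are hypothetical names, not part of corrplot. It returns the row and column indices of each block so that the plotting call stays separate.

```r
# Hypothetical helper: row/column indices for each size x size block of an n x n matrix.
block_index <- function(n, size, upper_only = FALSE) {
  grps <- (seq_len(n) - 1) %/% size                  # block id for every row/column
  eg <- expand.grid(row = unique(grps), col = unique(grps))
  if (upper_only) {
    # A correlation matrix is symmetric, so blocks below the diagonal are redundant.
    eg <- eg[eg$row <= eg$col, ]
  }
  lapply(seq_len(nrow(eg)), function(i) {
    list(rows = which(grps == eg$row[i]),
         cols = which(grps == eg$col[i]))
  })
}

blocks <- block_index(400, 20)
length(blocks)                                        # 20 x 20 = 400 blocks
length(block_index(400, 20, upper_only = TRUE))       # 210 blocks
```

Each element b of the list can then be passed on as M_reord[b$rows, b$cols] to the corrplot call used in the loop.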

R - NA handling in a mann-whitney u test

I need to compute weighted Mann Whitney U test results a few hundred times. Each iteration is a two-sample test for differences between two groups. I can't figure out how to get the existing function to handle missing values without dynamically deleting cases.
The data for a few of the comparisons are here, in a data frame I call dat. All variables with numbers in this sheet are numeric in type.
Here's how I call the sjstats::mannwhitney() function:
mannwhitney(dat, measure1, group)
When I do so, I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
I suspect this is because of the missing value in the 212th observation of measure1. But wrapping the vector names in na.omit() or !is.na() doesn't address the problem, perhaps because doing so still results in a data frame where the number of non-NA values of group is greater than the number of non-NA values in measure1.
Any thoughts on how I could incorporate dynamic NA handling into the function call?
I am not sure what class your group column is, but if I do it like this:
library(sjstats)
dat = read.csv("question - Sheet1.csv")
str(dat)
'data.frame': 301 obs. of 5 variables:
$ measure1 : num 2 1.6 2.2 2.7 1.8 1.8 4 4 3.9 -3.7 ...
$ measure2 : num 0.9 0.1 0 0.4 -1 -1.3 2.1 0 -1.1 -3.9 ...
$ measure3 : num 1.1 1.1 2.2 1.2 1.9 1.2 0 3 1.9 -3.8 ...
$ measurre4: num 2 2 2 3 3 2 3 4 3 2.36 ...
$ group : int 0 0 0 0 0 0 0 0 0 0 ...
I get:
mannwhitney(dat, measure1, group)
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
Factor your group:
dat$group = factor(dat$group)
mannwhitney(dat, measure1, group)
# Mann-Whitney-U-Test
Groups 1 = 0 (n = 110) | 2 = 1 (n = 190):
U = 16913.000, W = 10808.000, p = 0.621, Z = 0.495
effect-size r = 0.029
rank-mean(1) = 153.75
rank-mean(2) = 148.62
Reading the code, the bug comes from this:
labels <- sjlabelled::get_labels(grp, attr.only = F, values = NULL,
non.labelled = T)
If your group is numeric, it doesn't have attributes and hence you get no labels:
sjlabelled::get_labels(0:1)
NULL
sjlabelled::get_labels(factor(0:1))
[1] "0" "1"
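If missing values still need to be dropped explicitly, one option is to subset to complete cases for just the two columns involved before calling the test. The sketch below uses synthetic data and base R's unweighted wilcox.test as a stand-in. It is not the weighted sjstats::mannwhitney from the question, and whether that function needs the NA step at all is not established here.

```r
# Synthetic stand-in for the question's data: one measure, a 0/1 group, one NA.
set.seed(42)
dat <- data.frame(
  measure1 = rnorm(20),
  group    = rep(0:1, each = 10)
)
dat$measure1[7] <- NA

# Keep only rows where both the measure and the group are present,
# then convert the numeric group to a factor as suggested above.
keep   <- complete.cases(dat[, c("measure1", "group")])
dat_cc <- dat[keep, ]
dat_cc$group <- factor(dat_cc$group)

# Unweighted two-sample rank test from base R (a swapped-in technique).
res <- wilcox.test(measure1 ~ group, data = dat_cc)
print(res$p.value)
```

Doing the subsetting per measure, rather than once for the whole data frame, avoids discarding rows that are only missing in some other column.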

X needs to be a numerical value

I have an assignment for a statistics class where I copied code from the teacher, but I keep getting an error saying that x must be numeric, and I don't see where the issue is.
Survey=read.csv("ClassSurvey.csv", na.strings = c("", " ", "NA"))
attach(Survey)
library(FSAdata)
op = par(oma=c(0,0,1.5,0), mar=c(3,3,2,1))
hist(Height ~ Gender,
las=1,
nrow=2, ncol=1,
cex.main=0.9, cex.lab=0.8, cex.axis=0.8,
mgp=c(1.8,0.6,0),
xlab = "Height (cms)")
thank you!!
You have a couple of problems. First, you can't make a histogram on a variable that is not numeric. Try this:
data(ToothGrowth)
str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
attach(ToothGrowth)
Attempt to draw a histogram of supp, a variable of class "factor" (it's not numeric).
hist(supp)
Error in hist.default(supp) : 'x' must be numeric
Your second problem is that the hist function doesn't accept a formula y ~ x. So
hist(len~supp)
fails as well, even though len is numeric. This works though:
hist(len)
I wonder if you copied the teacher's code incorrectly? Or the teacher's code is incorrect? Or something else.
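If the goal is one histogram per group, it can be done in base graphics by subsetting the numeric variable for each level and stacking the panels. A minimal sketch with ToothGrowth standing in for the survey data (the shared breaks are a choice made here so the two panels are comparable, not something taken from the question):

```r
data(ToothGrowth)

# Two stacked panels, one per level of the grouping factor.
op <- par(mfrow = c(2, 1), mar = c(3, 3, 2, 1))

# Common bin edges covering the full range of the numeric variable.
shared_breaks <- pretty(ToothGrowth$len, n = 10)

for (g in levels(ToothGrowth$supp)) {
  hist(ToothGrowth$len[ToothGrowth$supp == g],
       breaks = shared_breaks,
       main = g, xlab = "len", las = 1)
}

par(op)  # restore the previous graphics settings
```

For the survey data, checking is.numeric(Survey$Height) first would show whether the column was read in as text.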

density histogram in R

I'm new to R and probability&statistics. I have a question regarding the histograms...
hist(rbinom(10000, 10, 0.1), freq=FALSE)
It produces a histogram that is not clear to me.
If the y-axis is density, the total should be 100%, or am I wrong?
But in the histogram, I can see that it is bigger than 100%.
Function hist returns a list object with all information necessary to answer the question.
I will set the RNG seed to make the example reproducible.
set.seed(1234)
h <- hist(rbinom(10000, 10, 0.1), freq=FALSE)
str(h)
#List of 6
# $ breaks : num [1:11] 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 ...
# $ counts : int [1:10] 3448 3930 0 1910 0 588 0 112 0 12
# $ density : num [1:10] 0.69 0.786 0 0.382 0 ...
# $ mids : num [1:10] 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75
# $ xname : chr "rbinom(10000, 10, 0.1)"
# $ equidist: logi TRUE
# - attr(*, "class")= chr "histogram"
The relevant list members are breaks and density.
breaks is a vector of length 11, so there are 10 bins.
density is a vector of length 10, each corresponding to one of the bins.
Now compute the area of each bar by multiplying the bins lengths by the respective densities.
diff(h$breaks) # bins lengths
# [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
h$density*diff(h$breaks)
# [1] 0.3448 0.3930 0.0000 0.1910 0.0000 0.0588 0.0000 0.0112 0.0000 0.0012
Total area:
sum(h$density*diff(h$breaks))
#[1] 1
The area under the curve should be 1. Since your boxes appear to have width 1/2, the sum of the heights should be 2. To make this make more sense, use the breaks parameter to hist
hist(rbinom(10000, 10, 0.1), freq=FALSE, breaks = 5)
Or maybe even better
hist(rbinom(10000, 10, 0.1), freq=FALSE, breaks=seq(-0.5,5.5,1))
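With unit-width bins centred on the integers, each bar's height is simply the proportion of observations at that value, so the heights themselves sum to 1. A short check, using plot = FALSE so that only the computed values are inspected (the breaks run to 10.5 here to cover every possible outcome of the binomial):

```r
set.seed(1234)
x <- rbinom(10000, 10, 0.1)

# Unit-width bins centred on 0, 1, ..., 10.
h1 <- hist(x, breaks = seq(-0.5, 10.5, 1), plot = FALSE)

# Proportion of observations at each integer value.
props <- as.numeric(table(factor(x, levels = 0:10))) / length(x)

all.equal(h1$density, props)  # heights equal the proportions
sum(h1$density)               # total of the heights
```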
You can also sum the probability mass function estimated from your sample. The total is 1, so there is no contradiction.
set.seed(444)
s <- rbinom(10000, 10, 0.1)
dens_s <- table(s)/sum(table(s))
sum(dens_s)

how can i find the total frequency of a given range in a histogram?

I created a histogram for a simulation, and now I need to find the total number of instances where the x-variable is greater than a given value. Specifically, my data is correlation (ranging from -1 to 1, with bin size 0.05), and I want to find the percent of the events where the correlation is greater than 0.1. Finding the total number of events greater than 0.1 is fine, because it's an easy percent to compute.
library(psych)
library(lessR)
corrData=NULL
for (i in 1:1000){
x1 <- rnorm(mean=0, sd = 1, n=20)
x2 <- rnorm(mean=0, sd = 1, n=20)
data <- data.frame(x1,x2)
r <- with(data, cor(x1, x2))
corrData <- append(corrData,r)
}
describe(corrData)
hist <- hist(corrData, breaks=seq(-1,1,by=.05), main="N=20")
describe(hist) count(0.1, "N=20")
Try something like this:
N=500
bh=hist(runif(N,-1,1))
#str(bh)
sum(bh$counts[bh$mids>=.1])/N
Look at what hist is actually giving you (see ?hist):
set.seed(10230)
x<-hist(2*runif(1000)-1)
str(x)
List of 6
$ breaks : num [1:11] -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 ...
$ counts : int [1:10] 92 99 100 105 92 116 95 102 100 99
$ density : num [1:10] 0.46 0.495 0.5 0.525 0.46 0.58 0.475 0.51 0.5 0.495
$ mids : num [1:10] -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
$ xname : chr "2 * runif(1000) - 1"
$ equidist: logi TRUE
- attr(*, "class")= chr "histogram"
The breaks list item tells you the endpoints of the "catching" intervals. The counts item tells you the counts in the (one fewer) bins defined by these breaks.
So, to get as close as you can to what you want using only your hist object, you could do:
sum(x$counts[which(x$breaks>=.1)-1L])/sum(x$counts)
But, as @Frank said, this may be incorrect, particularly if the bin containing .1 does not have an endpoint at .1.
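When the raw simulated values are still available, the histogram object is not needed at all: the proportion can be computed directly from the data. A sketch that regenerates correlations like those in the question (the seed and the use of replicate are choices made here) and, for comparison, also counts from the bins with a small tolerance on the edge comparison:

```r
set.seed(1)
# 1000 sample correlations between two independent N(0, 1) samples of size 20.
corrData <- replicate(1000, cor(rnorm(20), rnorm(20)))

# Direct answer from the raw values.
n_above    <- sum(corrData > 0.1)
prop_above <- mean(corrData > 0.1)

# Same idea from the histogram: add up bins whose lower edge is at least 0.1.
# The tolerance guards against the computed edge differing slightly from 0.1.
h <- hist(corrData, breaks = seq(-1, 1, by = 0.05), plot = FALSE)
lower_edges <- h$breaks[-length(h$breaks)]
from_bins <- sum(h$counts[lower_edges >= 0.1 - 1e-8]) / sum(h$counts)

print(c(direct = prop_above, from_bins = from_bins))
```

The bin-based figure is only reliable when the threshold coincides with a bin edge, which is the caveat raised above.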
