Getting counts on bins in a heat map using R

This question follows from these two topics:
How to use stat_bin2d() to compute counts labels in ggplot2?
How to show the numeric cell values in heat map cells in r
In the first topic, a user wants to use stat_bin2d to generate a heat map and then write the count of each bin on top of it. The method the user initially tries doesn't work; the best answer states that stat_bin2d is designed to work with geom = "rect" rather than "text". No satisfactory response is given.
The second question is almost identical to the first, with one crucial difference: the variables in the second question are text, not numeric. The answer produces the desired result, placing the count value for each bin over that bin in a stat_bin2d heat map.
To compare the two methods I've prepared the following code:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))
We know this first version gives the error:
"Error: geom_text requires the following missing aesthetics: x, y".
Same issue as in the first question. Interestingly, changing from stat_bin2d to stat_binhex works fine:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
geom_hex() +
stat_binhex(geom="text", aes(label=..count..))
Which is great and all, but generally I don't think hex binning is very clear, and it won't work for the data I'm trying to describe. I really want to use stat_bin2d.
To get this to work, I've prepared the following workaround based on the second answer:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
# round to whole numbers so that each integer value becomes one bin
x_t <- as.character(round(data$x, 0))
y_t <- as.character(round(data$y, 0))
x_x <- as.character(seq(-3, 3, 1))
y_y <- as.character(seq(-3, 3, 1))
data<-cbind(data,x_t,y_t)
ggplot(data, aes(x = x_t, y = y_t)) +
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))+
scale_x_discrete(limits =x_x) +
scale_y_discrete(limits=y_y)
This workaround allows one to bin numerical data, but to do so you have to determine the bin width (I did it via rounding) before bringing the data into ggplot. I actually figured it out while writing this question, so I may as well finish.
This is the result: (turns out I can't post images)
So my real question here is: does anyone have a better way to do this? I'm happy I at least got it to work, but so far I haven't seen an answer for putting labels on stat_bin2d bins when using a numerical variable.
Does anyone have a method for passing x and y arguments on to geom_text from stat_bin2d without having to use a workaround? Can anyone explain why it works with text variables but not with numbers?

Another workaround (but perhaps less work): similar to the ..count.. method, you can extract the counts from the plot object in two steps.
library(ggplot2)
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
# plot
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d()
# Get data - this includes counts and x,y coordinates
newdat <- ggplot_build(p)$data[[1]]
# add in text labels
p + geom_text(data=newdat, aes((xmin + xmax)/2, (ymin + ymax)/2,
label=count), col="white")
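A further variation, sketched here under the assumption that a fixed bin width is acceptable, is to bin the numeric data by hand and draw the tiles and labels from the resulting counts, so no stat is involved. The bin width of 0.5 and the column names bx, by and n are illustrative choices, not part of the original question.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

binwidth <- 0.5  # assumed bin width, chosen only for illustration

# Snap every point to the centre of the square bin it falls in
dat$bx <- (floor(dat$x / binwidth) + 0.5) * binwidth
dat$by <- (floor(dat$y / binwidth) + 0.5) * binwidth

# Count the points in each occupied bin
counts <- aggregate(n ~ bx + by, data = transform(dat, n = 1), FUN = sum)

# Tiles coloured by count, with the count written on top of each tile
p <- ggplot(counts, aes(x = bx, y = by)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = n), colour = "white", size = 3)
```

Because the counts live in an ordinary data frame, the same numbers feed both the fill and the label, and the axes stay numeric.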

Related

ggplot function like points

Is there any way to add points to a ggplot graph like with the points() function in base graphics? I don't often use ggplot and always prefer base graphics, but this time I have to deal with it. With + geom_point(x = c(1,2,3), y = c(1,2,3)) there is an error:
Error: Aesthetics must be either length 1 or the same as the data (33049): x, y
I'm not quite sure what you're looking for, but you can use the data= argument to geom_point() to override the default behaviour (which is to inherit data from the original ggplot call); as @dc37 points out, x and y need to be specified within a data frame, but you can do this on the fly. You might also need to specify the mapping, if the original x and y variables aren't called x and y ...
+ geom_point(data= data.frame(x = c(1,2,3), y = c(1,2,3)),
mapping = aes(x=x, y=y))
Alternatively (and maybe better):
+ annotate( geom="point", x = 1:3, y = 1:3)
From ?annotate:
This function adds geoms to a plot, but unlike [a typical] geom
function, the properties of the geoms are not mapped from
variables of a data frame, but are instead passed in as vectors.
This is useful for adding small annotations (such as text labels)
or if you have your data in vectors, and for some reason don't
want to put them in a data frame.
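Put together as a complete plot, the two options might look like the sketch below. The mtcars dataset and the chosen coordinates are stand-ins picked only for illustration.

```r
library(ggplot2)

# Base plot drawn from a data frame (mtcars is only a stand-in dataset)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Option 1: a layer with its own small data frame and its own mapping
extra <- data.frame(wt = c(2, 3, 4), mpg = c(15, 20, 25))
p1 <- p + geom_point(data = extra, mapping = aes(x = wt, y = mpg),
                     colour = "red", size = 3)

# Option 2: annotate() takes plain vectors, no data frame needed
p2 <- p + annotate(geom = "point", x = c(2, 3, 4), y = c(15, 20, 25),
                   colour = "blue", size = 3)
```

In both cases the second layer holds three points, independent of the number of rows in the original data.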

How do I loop a ggplot2 function to export and save about 40 plots?

I am trying to loop a ggplot2 plot with a linear regression line over it. It works when I type the y column name manually, but the loop method I am trying does not work. It is definitely not a dataset issue.
I've tried many solutions from various websites on how to loop a ggplot and the one I've attempted is the simplest I could find that almost does the job.
The code that works is the following:
plots <- ggplot(Everything.any, mapping = aes(x = stock_VWRETD, y = stock_10065)) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
But I do not want to do this another 40 times (and then 5 times more for other reasons). The code that I found online and tried to adapt to my needs is the following:
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in seq_along(nm)){
plots <- ggplot(z, mapping = aes(x = stock_VWRETD, y = nm[i])) +
geom_point() +
labs(x = 'Market Returns', y = 'Stock Returns', title ='Stock vs Market Returns') +
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1",nm[i],".png",sep=" "))
}
}
plotRegression(Everything.any)
I expected a Stock Returns vs Market Returns scatter plot, but instead the y-axis shows a single value, the name of the respective column, and the market returns are plotted normally along x, as if on a number line at that one y value. Please let me know what I am doing wrong.
Desired Plot:
Actual Plot:
Sample Data is available on Google Drive here:
https://drive.google.com/open?id=1Xa1RQQaDm0pGSf3Y-h5ZR0uTWE-NqHtt
The problem is that when you assign variables to aesthetics in aes, you mix bare names and strings. In this example, both X and Y are supposed to be variables in z:
aes(x = stock_VWRETD, y = nm[i])
You refer to stock_VWRETD using a bare name (as required with aes), however for y=, you provide the name as a character vector produced by colnames. See what happens when we replicate this with the iris dataset:
ggplot(iris, aes(Petal.Length, 'Sepal.Length')) + geom_point()
Since aes expects variable names to be given as bare names, it doesn't interpret 'Sepal.Length' as a variable in iris but as a separate vector (consisting of a single character value) which holds the y-values for each point.
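A quick way to see the difference is to compare the y positions that each version actually draws. This is only a sketch; the names p_string, p_bare, n_string and n_bare are made up for the comparison.

```r
library(ggplot2)

# Quoted name: treated as a constant string, not as a column of iris
p_string <- ggplot(iris, aes(Petal.Length, 'Sepal.Length')) + geom_point()

# Bare name: looked up as a column of iris
p_bare <- ggplot(iris, aes(Petal.Length, Sepal.Length)) + geom_point()

# Number of distinct y positions drawn by each plot
n_string <- length(unique(layer_data(p_string)$y))
n_bare   <- length(unique(layer_data(p_bare)$y))
```

The quoted version collapses every point onto a single y position, which matches the flat "number line" plot described in the question.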
What can you do? Here are 2 options that both give the proper plot
1) Use aes_string and change both variable names to character:
ggplot(iris, aes_string('Petal.Length', 'Sepal.Length')) + geom_point()
2) Use square bracket subsetting to manually extract the appropriate variable:
ggplot(iris, aes(Petal.Length, .data[['Sepal.Length']])) + geom_point()
You need to use aes_string instead of aes, with double quotes around your x variable, and then you can use your i variable directly. You can also simplify your for loop call. Here is an example using iris.
library(ggplot2)
plotRegression <- function(z,na.rm=TRUE,...){
nm <- colnames(z)
for (i in nm){
plots <- ggplot(z, mapping = aes_string(x = "Sepal.Length", y = i)) +
geom_point()+
geom_smooth(method='lm',formula=y~x)
ggsave(plots,filename=paste("regression1_",i,".png",sep=""))
}
}
myiris<-iris
plotRegression(myiris)
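The same loop can also be written with the .data pronoun from option 2, which avoids aes_string. The sketch below assumes ggplot2 3.0.0 or later; the helper name plot_regressions and its x_col argument are hypothetical, and non-numeric columns are skipped as a design choice.

```r
library(ggplot2)

# Hypothetical helper: one scatter plot with an lm fit per numeric column,
# each plotted against the column named in x_col
plot_regressions <- function(df, x_col) {
  is_num <- vapply(df, is.numeric, logical(1))
  y_cols <- setdiff(names(df)[is_num], x_col)
  plots <- lapply(y_cols, function(y_col) {
    ggplot(df, aes(x = .data[[x_col]], y = .data[[y_col]])) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ x) +
      labs(x = x_col, y = y_col)
  })
  names(plots) <- y_cols
  plots
}

plots <- plot_regressions(iris, "Sepal.Length")
```

Each element of the returned list can then be passed to ggsave() with a file name built from names(plots), which keeps plotting and saving separate.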

scatterplot summary stat (e.g. sum or mean) for each point instead of individual data points

I am looking for a way to summarize data within a ggplot call, not before. I could pre-aggregate the data and then plot it, but I know there is a way to do it within a ggplot call. I'm just unsure how.
In this example, I want to get a mean for each (x,y) combo, and map it onto the colour aes
library(tidyverse)
df <- tibble(x = rep(c(1,2,4,1,5),10),
y = rep(c(1,2,3,1,5),10),
col = sample(c(1:100), 50))
df_summar <- df %>%
group_by(x,y) %>%
summarise(col_mean = mean(col))
ggplot(df_summar, aes(x=x, y=y, col=col_mean)) +
geom_point(size = 5)
I think there must be a better way to avoid the pre-ggplot step (yes, I could also have piped dplyr transformations into the ggplot, but the mechanics would be the same).
For instance, geom_count() counts the instances and plots them onto size aes:
ggplot(df, aes(x=x, y=y)) + geom_count()
I want the same, but mean instead of count, and col instead of size
I'm guessing I need stat_summary() or a stat() call (a replacement for ..xxx.. notation), but I can't get it to give me what I need.
You'll need stat_summary_2d:
ggplot(df, aes(x, y, z = col)) +
stat_summary_2d(aes(col = ..value..), fun = 'mean', geom = 'point', size = 5)
(Or after_stat(value) in ggplot2 3.3.0 and later, which supersedes the ..value.. notation; the dev version at the time of this answer called the helper calc().)
You can pass any arbitrary function to fun.
While stat_summary seems like it would be useful, it is not in this case. It is specialized in the common transformation for plotting, summarizing a range of y values, grouped by x, into a set of summary statistics that are plotted as y(, ymin and ymax). You want to group by both x and y, so 2d it is.
Note that this uses binning, however, so to get the points to line up accurately you need to increase the number of bins (e.g. bins = 1e3). Unfortunately, there is no non-binning 2d summary stat.
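If exact positions matter more than using a stat, one alternative is to pass a function as the layer's data argument: ggplot2 calls it on the plot data and draws the result. This is a swapped-in technique rather than a summary stat, and the helper name mean_by_xy is made up for this sketch.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(
  x = rep(c(1, 2, 4, 1, 5), 10),
  y = rep(c(1, 2, 3, 1, 5), 10),
  col = sample(1:100, 50)
)

# Hypothetical helper: mean of col for every distinct (x, y) pair
mean_by_xy <- function(d) aggregate(col ~ x + y, data = d, FUN = mean)

# The layer receives the summarised data; the raw df stays the plot data
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(colour = col), data = mean_by_xy, size = 5)
```

The aggregation still happens, but it lives inside the ggplot call and the points sit exactly at the original coordinates, with no binning involved.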

geom_step error for single point

This seems to be a duplicate of R - ggplot geom_step error, which has not been answered yet.
When the data to plot consists of one point only, geom_step throws an error while geom_line gives a warning only:
library(ggplot2)
data <- data.frame(x = 1, y = 2)
# works
ggplot(data = data, aes(x = x, y = y)) + geom_line()
# does not work
ggplot(data = data, aes(x = x, y = y)) + geom_step()
geom_step gives the error message: invalid line type. Is this a bug or the desired behaviour? With this behaviour geom_step loses part of ggplot's flexibility, since the single-point case needs to be dealt with manually. One brute-force solution is to manually check the number of points to be plotted and only add the step layer if there are at least two points. But surely there must be a more elegant workaround?!
packageVersion("ggplot2")
[1] ‘1.0.1’
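The manual check mentioned above can at least be wrapped in a small helper so the plotting code stays uniform. This is only a sketch of that brute-force idea; the helper name step_or_point and the fallback to a point layer are assumptions.

```r
library(ggplot2)

# Hypothetical helper: a step layer when there is something to connect,
# otherwise a plain point layer
step_or_point <- function(df) {
  if (nrow(df) >= 2) geom_step() else geom_point()
}

single  <- data.frame(x = 1, y = 2)
several <- data.frame(x = 1:3, y = c(2, 1, 3))

p_single  <- ggplot(single, aes(x = x, y = y)) + step_or_point(single)
p_several <- ggplot(several, aes(x = x, y = y)) + step_or_point(several)
```

The call site is then identical for both cases, and the single-point data set gets a visible marker instead of an error.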

Transform color scale, but keep a nice legend with ggplot2

I have seen somewhat similar questions to this, but I'd like to ask my specific question as directly as I can:
I have a scatter plot with a "z" variable encoded into a color scale:
library(ggplot2)
myData <- data.frame(x = rnorm(1000),
y = rnorm(1000))
myData$z <- with(myData, x * y)
badVersion <- ggplot(myData,
aes(x = x, y = y, colour = z))
badVersion <- badVersion + geom_point()
print(badVersion)
Which produces this:
As you can see, since the "z" variable is normally distributed, very few of the points are colored with the "extreme" colors of the distribution. This is as it should be, but I am interested in emphasizing difference. One way to do this would be to use:
betterVersion <- ggplot(myData,
aes(x = x, y = y, colour = rank(z)))
betterVersion <- betterVersion + geom_point()
print(betterVersion)
Which produces this:
By applying rank() to the "z" variable, I get a much greater emphasis on minor differences within the "z" variable. One could imagine using any transformation here, instead of rank, but you get the idea.
My question is, essentially, what is the most straightforward way, or the most "true ggplot2" way, of getting a legend in the original units (units of z, as opposed to the rank of z), while maintaining the transformed version of the colored points?
I have a feeling this uses rescaler() somehow, but it is not clear to me how to use rescaler() with arbitrary transformations, etc. In general, more clear examples would be useful.
Thanks in advance for your time.
Have a look at the package scales
especially
?trans
I think that a transformation that maps the colour given the probability of getting the value or more extreme should be reasonable (basically pnorm(z))
I think that scale_colour_continuous(trans = probability_trans(distribution = 'norm')) should work, but it throws warnings.
So I defined a new transformation (see ?trans_new)
I have to define a transformation and an inverse
library(scales)
norm_trans <- function(){
trans_new('norm', function(x) pnorm(x), function(x) qnorm(x))
}
badVersion + geom_point() + scale_colour_continuous(trans = 'norm')
Using the supplied probability_trans throws a warning and doesn't seem to work
# this throws a warning
badVersion + geom_point() +
scale_colour_continuous(trans = probability_trans(distribution = 'norm'))
## Warning message:
## In qfun(x, ...) : NaNs produced
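For reference, a self-contained version of the custom transformation might look like the sketch below. It passes the transformation object directly instead of looking it up by name, and it assumes pnorm on the raw z values is an acceptable stand-in for rank; z = x * y is not standard normal, so the spread of colours is only approximate.

```r
library(ggplot2)
library(scales)

set.seed(1)
myData <- data.frame(x = rnorm(1000), y = rnorm(1000))
myData$z <- with(myData, x * y)

# Colour positions follow pnorm(z); the inverse maps them back to units of z
norm_trans <- trans_new("norm", transform = pnorm, inverse = qnorm)

# The legend is labelled in the original units of z
p <- ggplot(myData, aes(x = x, y = y, colour = z)) +
  geom_point() +
  scale_colour_continuous(trans = norm_trans)
```

Recent ggplot2 releases name this scale argument transform rather than trans, so a deprecation message may appear depending on the installed version.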
