ggplot stat_summary_bin glitch? - r

I was happy to discover that ggplot has binned scatter plots, which are useful for exploring and visualizing relationships in large data. Yet the top bin appears to misbehave. Here's an example: All bin averages are roughly linearly aligned, as they should be, but the top one is off on both dimensions:
the code:
library(ggplot2)
# simulate an example of linear data
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x=x, y=y)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point')
Is there a simple workaround (and where should this be posted)?

stat_summary_bin is actually excluding the two rows with the largest x-values from the bins, and those two values are ending up with bin = NA. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First I show what is going wrong in your original plot, then I provide a workaround to get the desired behavior.
What's going wrong in the original plot
To see what's going wrong in your original plot, create a plot with two calls to stat_summary_bin where we calculate the mean of each bin and the number of values in each bin. Then use ggplot_build to capture all of the internal data that ggplot generated to create the plot.
p1 = ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y=mean, bins=10, size=5, geom='text',
aes(label=..y..)) +
stat_summary_bin(fun.y=length, bins=10, size=5, geom='text',
aes(label=..y.., y=0))
p1b = ggplot_build(p1)
Now let's look at the data for the mean and length layers, respectively. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin and you can see that it contains only 2 values (its label is 2 in the second table below), and that the mean of those two values is -0.1309998, as can be seen in the first table below.
p1b$data[[2]][9:11,c(1,2,4,6,7)]
label bin y x width
9 0.8158320 9 0.8158320 0.8498505 0.09998242
10 0.9235531 10 0.9235531 0.9498329 0.09998242
11 -0.1309998 11 -0.1309998 1.0498154 0.09998244
p1b$data[[3]][9:11,c(1,2,4,6,7)]
label bin y x width
9 1025 9 1025 0.8498505 0.09998242
10 1042 10 1042 0.9498329 0.09998242
11 2 11 2 1.0498154 0.09998244
Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:
mean(dt[order(-dt$x), "y"][1:2])
[1] -0.1309998
I'm not sure how stat_summary_bin is managing to bin the data such that the two highest x values are excluded.
Workaround to get the desired behavior
A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've included the dplyr package so that I can use the chaining operator (%>%) to summarize the data on the fly:
library(dplyr)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
geom_point(data=dt %>%
group_by(bins=cut(x,breaks=seq(min(x),max(x),length.out=11), include.lowest=TRUE)) %>%
summarise(x=mean(x), y=mean(y)),
aes(x,y), size=3, color="blue") +
theme_bw()
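The manual binning can also be checked without plotting. Below is a minimal base R sketch, assuming the same simulated data as in the question; the names dt2, breaks and binned are only illustrative. It confirms that with breaks running from min(x) to max(x) and include.lowest = TRUE, no row ends up in an NA bin.

```r
# Same simulated data as in the question
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt2 <- data.frame(x = x, y = y)

# 11 break points give 10 bins that span the full range of x
breaks <- seq(min(dt2$x), max(dt2$x), length.out = 11)
dt2$bin <- cut(dt2$x, breaks = breaks, include.lowest = TRUE)

# Number of rows that fell outside every bin (expected: none)
n_unbinned <- sum(is.na(dt2$bin))

# Mean x and mean y per bin, one row per bin
binned <- aggregate(cbind(x, y) ~ bin, data = dt2, FUN = mean)
print(n_unbinned)
print(binned)
```

The binned data frame can be passed to geom_point() in the same way as the dplyr summary above.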

@eipi10 has already explained why this is happening.
Perhaps the simplest solution is to add a scale_x_continuous with limits to your plot, so that the extra "NA" bin is excluded from the plot.
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
scale_x_continuous(limits = range(x))
This should be acceptable with large data such as in the example, where the small number of data points excluded from the bins will not significantly bias the statistics. However, if missing a couple of data points from the summary statistics matters, then the solution provided by @eipi10 will be better.

Related

R, ggplot, How do I keep related points together when using jitter?

One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually, however, the "spend" amounts are divorced from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
#▼ 3 day, points centered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
#▼ 3 day, jittered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
group_by(id) %>%
mutate(jitter = runif(1, min = -.3, max = .3)) %>%
ungroup()
Then the plot. I use a geom_blank() layer so that the categorical site axis is drawn before I add the jitter. I convert site to be numeric from a factor and add the jitter on; this only works for factors so luckily categorical axes in ggplot2 are based on factors.
Now paired ID's move together.
ggplot(ef2, aes(x = date, y = site)) +
geom_blank() +
geom_point(aes(size = amount, color = gainspend,
y = as.numeric(factor(site)) + jitter),
alpha=0.5) +
scale_color_manual(values = ccode) +
scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
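The per-id offset can also be drawn without dplyr. This is a minimal base R sketch on a small toy data frame (toy, ids and offsets are illustrative names, not part of the original data): one offset is drawn per unique id and then looked up by id, so both rows of a pair share it.

```r
set.seed(16)

# Toy data: two ids, each with a "Gain" row and a "Spend" row
toy <- data.frame(
  id = c("C123", "T93", "C123", "T93"),
  gainspend = c("Gain", "Gain", "Spend", "Spend")
)

# One random offset per unique id, stored as a named lookup vector
ids <- unique(as.character(toy$id))
offsets <- setNames(runif(length(ids), min = -0.3, max = 0.3), ids)

# Look up the offset for every row by its id
toy$jitter <- unname(offsets[as.character(toy$id)])

# Each id should map to exactly one distinct offset
per_id <- tapply(toy$jitter, as.character(toy$id), function(v) length(unique(v)))
print(toy)
print(per_id)
```

Checking that every id maps to a single offset is a quick way to confirm that paired points will move together.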
You can add some jitter by id outside the ggplot() call.
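# draw one offset per unique id (not one per row), so both rows of a pair share the same offset
jj <- data.frame(id = unique(ef$id), jtr = runif(length(unique(ef$id)), -0.3, 0.3))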
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
ggplot(ef,aes(date,sitej)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work, but there are still a few gain/spend circles without a partner--perhaps there is a problem with the id variable.
Perhaps someone else has a better approach!

Again, I have 4 graphs on R, different x axis, but similar trend profile. I tried to overlay them but they are not aligned

I was assisted to overlay two graphs with different x-axis on this link I have 2 graphs on R. They have different x axis, but similar trend profile. how do I overlay them on r?.
However, I am now trying to overlay 4 graphs. I tried to overlay them but they are not aligned.
I need assistance to overlay these four graphs.
My raw data is at the following link: https://drive.google.com/drive/folders/1ZZQAATkbeV-Nvq1YYZMYdneZwMvKVUq1?usp=sharing
My initial attempt was as follows:
# plot each dataset on its own; the plots get their own names so the data frames are not overwritten
p_first <- ggplot(data = first,
aes(x, y)) +
geom_line()
p_second <- ggplot(data = second,
aes(x, y)) +
geom_line()
p_third <- ggplot(data = third,
aes(x, y)) +
geom_line()
p_fourth <- ggplot(data = fourth,
aes(x, y)) +
geom_line()
first$match <- first$x
second$match <- second$x - second$x[second$y == max(second$y)] + first$x[first$y == max(first$y)]
third$match <- third$x
fourth$match <- fourth$x
first$series = "first"
second$series = "second"
third$series = "third"
fourth$series = "fourth"
all_data <- rbind(first, second, third, fourth)
ggplot(all_data) + geom_line(aes(x = match, y, color = series)) +
scale_x_continuous(name = "X, arbitrary units") +
theme(axis.text.x = element_blank())
Would greatly appreciate the help indeed.
OP, I thought I would propose a solution for your question. OP has 4 datasets with x and y columns, and wants to align the peaks in each dataset so that they stack on top of one another. Here's what it looks like when we plot all datasets together:
p <- ggplot(mapping=aes(x=x, y=y)) + theme_bw() +
geom_line(data=first, aes(color="first")) +
geom_line(data=second, aes(color="second")) +
geom_line(data=third, aes(color="third")) +
geom_line(data=fourth, aes(color="fourth"))
The approach will be as follows:
Find the peak x value for each dataset.
Shift each dataset's x values so that its peak lines up with a common reference peak (the code below uses the smallest peak x value).
Combine the datasets into one data frame, following Tidy Data principles, and plot them together.
Finding peaks and adjusting x values
To find the peaks, I like to use the findpeaks() function from the pracma library. You feed the function your dataset's y values (arranged by increasing x value), and the function will return a matrix with each row representing a "peak" and the columns give you height of peak in y value, index or row of dataset for the peak, where the peak begins, and where the peak ends. As an example, here's how we can apply this principle and what the result looks like on one of the datasets:
library(pracma)
library(dplyr) # for arrange()
first <- arrange(first, x) # arrange first by increasing x
findpeaks(first$y, sortstr = TRUE, npeaks=1)
[,1] [,2] [,3] [,4]
[1,] 1047.54 402 286 515
The argument sortstr= indicates we want the list of peaks sorted by "highest" first, and we are only interested in picking the first peak. In this case, we can see that 402 is the index of the x,y value in first for the peak. So we can access that x value via first[index,]$x.
The one concern we may have here is that this may not work for fourth, since the max value of y is actually not the peak of interest; however, if we run the function and test this out, using the findpeaks() method where we return the highest peak works fine: apparently the function does not find there is a "peak" at the right since it has an "up", but not a "down".
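To illustrate why a rise at the edge of the data does not count as a peak, here is a small base R sketch. It is not pracma's implementation: highest_interior_peak() is a hypothetical helper that only accepts points with a lower neighbour on both sides, and the data are synthetic.

```r
# Hypothetical helper: index of the highest point that has a lower
# neighbour on BOTH sides (an "up" followed by a "down")
highest_interior_peak <- function(y) {
  interior <- 2:(length(y) - 1)
  is_peak <- y[interior] > y[interior - 1] & y[interior] > y[interior + 1]
  candidates <- interior[is_peak]
  candidates[which.max(y[candidates])]
}

# Synthetic curve: a bump centred at x = 3, plus a steady rise after x = 4
# that ends higher than the bump but never comes back down
x <- seq(0, 10, by = 0.1)
y <- exp(-(x - 3)^2) + 0.2 * pmax(x - 4, 0)

idx_peak <- highest_interior_peak(y)  # index of the bump
idx_max <- which.max(y)               # index of the overall maximum (right edge)
print(x[idx_peak])
print(x[idx_max])
```

Here the overall maximum sits at the last point, while the peak-based search returns the bump, which is the behaviour described for fourth above.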
The function below handles all the steps to do what we need to: arranging, finding peaks, and adjusting peaks.
# find the minimum peak. We know it's from third, but here's
# how you do it if you don't "know" that
peaks_first <- findpeaks(first$y, sortstr = TRUE, npeaks=1)
peaks_second <- findpeaks(second$y, sortstr = TRUE, npeaks=1)
peaks_third <- findpeaks(third$y, sortstr = TRUE, npeaks=1)
peaks_fourth <- findpeaks(fourth$y, sortstr = TRUE, npeaks=1)
# minimum peak x value
peak_x <- min(c(first[peaks_first[2],]$x, second[peaks_second[2],]$x, third[peaks_third[2],]$x, fourth[peaks_fourth[2],]$x))
# function to use to fix each dataset
fix_x <- function(peak_x, dataset) {
dataset <- arrange(dataset, x)
d_peak <- findpeaks(dataset$y, sortstr = TRUE, npeaks=1)
d_peak_x <- dataset[d_peak[2],]$x
x_adj <- peak_x - d_peak_x
dataset$x <- dataset$x + x_adj
return(dataset)
}
# apply and fix each dataset
fix_first <- fix_x(peak_x, first)
fix_second <- fix_x(peak_x, second)
fix_third <- fix_x(peak_x, third)
fix_fourth <- fix_x(peak_x, fourth)
# combine datasets
fix_first$measure <- 'First'
fix_second$measure <- 'Second'
fix_third$measure <- 'Third'
fix_fourth$measure <- 'Fourth'
fixed <- rbind(fix_first, fix_second, fix_third, fix_fourth)
fixed$measure <- factor(fixed$measure, levels=c('First','Second','Third','Fourth'))
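The shifting step can be tried on synthetic data before running it on the real files. The sketch below is a simplified stand-in for fix_x(): it assumes each series has a single clean peak, so it uses which.max() instead of findpeaks(), and make_series() and shift_to() are hypothetical helpers.

```r
# Synthetic series with a single Gaussian-shaped peak at `center`
make_series <- function(center) {
  x <- seq(0, 20, by = 0.5)
  data.frame(x = x, y = exp(-(x - center)^2))
}

# Shift x so that the highest point of the series lands on `target`
shift_to <- function(d, target) {
  d$x <- d$x + (target - d$x[which.max(d$y)])
  d
}

a <- make_series(5)
b <- make_series(12)

# Use the smallest peak position as the common reference
target <- min(a$x[which.max(a$y)], b$x[which.max(b$y)])
a_fixed <- shift_to(a, target)
b_fixed <- shift_to(b, target)

print(a_fixed$x[which.max(a_fixed$y)])
print(b_fixed$x[which.max(b_fixed$y)])
```

After the shift both series peak at the same x value, which is what lets the lines overlap when plotted together.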
Plot Together
Now fixed contains all the data, and we can plot them all together:
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line()
Alternate Plotting Methods
If you want to "stack" the lines on top of one another, this is what is known as a ridgeline plot. There are two methods I can show for how to create the ridgeline plot: faceting or using ggridges and geom_ridgeline(). I can demonstrate both.
# Using facets
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line(show.legend = FALSE) +
facet_grid(measure~.)
Note I chose not to show the legend, since the strip text indicates this same information.
# Using ggridges and geom_ridgeline
library(ggridges)
ggplot(fixed, aes(x=x, y=measure, color=measure)) + theme_bw() +
geom_ridgeline(aes(height=y), fill=NA, scale=0.001)
When using geom_ridgeline(), you'll notice that the y= aesthetic becomes the column used for the stacking, and your original y value is instead mapped to the height= aesthetic. I also had to play around with scale=, since for discrete values, each measure will be treated as integers (1, 2, 3, 4). Your height= values are waaaay higher than that, so we have to scale them down so that they are around this range (scaled down by about 1000).

R ggplot2 histogram bin allocation

My problem is that when I construct histograms with ggplot2 of certain bin width greater than the resolution of the data, bins sometimes contain uneven numbers of increments from the underlying data. This results in large peaks in the histogram which give a false impression of how peaky the data are. Is there a built-in way to prevent this? Maybe allocate increments between bins?
require(ggplot2)
require(ggplot2movies)
m <- ggplot(movies, aes(x = rating))
#Original resolution
plot(m + geom_histogram(binwidth = 0.1) + scale_y_sqrt())
#Downsampled
plot(m + geom_histogram(binwidth = 0.25) + scale_y_sqrt())
I don't know whether there is a built-in way. geom_histogram() has a default of 30 bins, which you can override.
One possible solution is to count the number of distinct x values and use that as the number of bins (or a fraction of it):
plot(m + geom_histogram(bins = nlevels(as.factor(movies$rating))))
Workaround for now is to simply modify binwidth as a function of data resolution, as opposed to number of bins.
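The effect of the bin width can be seen by counting how many distinct data values land in each bin. The sketch below is a simplification: it assumes ratings on a 0.1 grid from 1.0 to 10.0 and uses plain floor-based bins, which is not exactly how ggplot2 places its bin boundaries. Ratings are kept as integer tenths to avoid floating-point surprises, and values_per_bin() is a hypothetical helper.

```r
# Possible rating values in tenths: 10 means 1.0, 100 means 10.0
tenths <- 10:100

# Number of distinct rating values per bin for a given bin width (in tenths),
# ignoring the first and last bin, which may be only partly filled
values_per_bin <- function(width_tenths) {
  tab <- table(floor(tenths / width_tenths))
  as.vector(tab[-c(1, length(tab))])
}

uneven <- values_per_bin(2.5)  # binwidth 0.25: not a multiple of 0.1
even <- values_per_bin(5)      # binwidth 0.5: a multiple of 0.1

print(sort(unique(uneven)))
print(sort(unique(even)))
```

With a width of 0.25 the bins alternate between two and three distinct values, which produces artificial peaks; with a width that is a multiple of the data resolution every bin covers the same number of values.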

R - Control Histogram Y-axis Limits by second-tallest peak

I've written an R script that loops through a data.frame making multiple complex plots that include a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 and it obscures the rest of the data which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram as seen in the code below - but what I really need to figure out is how to define the y-axis limits such that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in c(1:length(cols))) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
maxDensities <- sapply(df, function(i) max(hist(i)$density))
# take the second highest peak:
myYlim <- rev(sort(maxDensities))[2]
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]
Working from the inside out
cut will bin the data (not really needed with integer data like you have but probably needed with real data)
table then creates a table with the count of each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value
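Put together, the steps above might look like the following sketch. It uses a small deterministic vector instead of the simulated data, with one dominant value and several small ones; vals, counts and second_tallest are illustrative names.

```r
# Deterministic example: the value 1 dominates, the rest are rare
vals <- c(rep(1, 80), rep(2:10, times = c(5, 3, 2, 2, 2, 2, 2, 1, 1)))

# Bin the data into ten unit-wide bins and count the rows per bin
counts <- table(cut(vals, breaks = seq(0.5, 10.5, by = 1)))

# Sort the counts from highest to lowest and take the second one
second_tallest <- sort(as.vector(counts), decreasing = TRUE)[2]

print(counts)
print(second_tallest)
```

The resulting number can then be used as the upper y limit inside the loop, for example with coord_cartesian(ylim = c(0, second_tallest)). coord_cartesian() zooms the view without discarding data, whereas limits set through scale_y_continuous() drop bars that exceed them.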

ggplot: coordinate axes are unordered when using geom_point()

I want to create a scatter plot, but the scale of the axes is messed up. I want it to have an increasing order, but in the plot y = 7 lies between y = 8.8 and y = 11.8.
It is a bit difficult to explain, so I uploaded a picture of the plot to
splot <- ggplot(df, aes(x_val, y_val)) + geom_point() + ggtitle(title) + xlab(label) + ylab(label)
df looks like that
             x_val            y_val            x_min            x_max            y_min            y_max series
1        8.2640626        7.1605616 7.43370308695577 9.09442211304423 5.62731954407747 8.69380365592253   1IWG
2       10.0321728        8.8790822 8.43774194466477 11.6266036553352 6.97682936735609 10.7813350326439   1J4N
3 13.4994332665331 11.8238683366733 12.4200921869666 14.5787743460995 9.99549351881522 13.6522431545315   1KPL
Thanks for any help.
Use str(df) to examine your data frame df. If the variables you are trying to plot are factors, convert them with as.numeric(as.character(...)) so that they are interpreted as numbers; calling as.numeric() directly on a factor returns the internal level codes rather than the values. Or you can try to specify that they are numeric when you create your data set, depending on how the frame is defined.
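One caveat when converting: as.numeric() applied directly to a factor returns the level codes, not the numbers shown. A short sketch, assuming y_val was read in as a factor (the three values are taken from the data shown in the question):

```r
# y_val stored as a factor, as it might be after reading text data
y_val <- factor(c("7.1605616", "8.8790822", "11.8238683366733"))

# Levels are sorted as text, so "11.8..." comes before "7.1..." and "8.8..."
lev <- levels(y_val)

codes <- as.numeric(y_val)                 # internal level codes
values <- as.numeric(as.character(y_val))  # the actual numbers

print(lev)
print(codes)
print(values)
```

The text ordering of the levels is also why the axis in the plot appears unordered: a factor axis follows the level order, not the numeric order.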

Resources