Overlay ggplot2 stat_density2d plots with alpha channels constant across groups - r

I would like to plot multiple groups in a stat_density2d plot with alpha values related to the counts of observations in each group. However, the levels formed by stat_density2d seem to be normalized to the number of observations in each group. For example,
library(ggplot2)
library(ggplot2movies)  # the movies data set now lives in this package
temp <- rbind(movies[1:2,], movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
ggplot(temp, aes(x=rating, y=length)) +
  stat_density2d(geom="tile", aes(fill = mpaa, alpha=..density..), contour=FALSE) +
  theme_minimal()
Creates a plot like this:
Because I only included 2 points without ratings, they result in densities that look much tighter/stronger than the other two, and so wash out the other two densities. I've tried looking at Overlay two ggplot2 stat_density2d plots with alpha channels and Specifying the scale for the density in ggplot2's stat_density2d but they don't really address this specific issue.
Ultimately, with my real data, I have "power" samples from discrete 2d locations for multiple conditions, and I am trying to plot their relative powers and spatial distributions. I am duplicating points at each location in proportion to its power, but this has resulted in low-power conditions with just a few locations looking the strongest when using stat_density2d. Please let me know if there is a better way of going about this!
Thanks!

stat_binhex, which understands ..count.. in addition to ..density.., may work for you:
ggplot(temp, aes(x=rating,y=length)) +
stat_binhex(geom="hex", aes(fill = mpaa, alpha=..count..)) +
theme_minimal()
Although you may want to adjust the bin width.
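As a rough sketch of adjusting the bin width, the example below uses made-up stand-in data (`pts` and its columns are assumptions, not the movies data) and passes `binwidth` to stat_binhex. It spells the computed variable as `after_stat(count)`, the newer equivalent of `..count..`, which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)  # stat_binhex() also needs the hexbin package installed

set.seed(1)
# Hypothetical stand-in data: one large group and one small group.
pts <- data.frame(
  x = c(rnorm(500), rnorm(20, mean = 3)),
  y = c(rnorm(500), rnorm(20, mean = 3)),
  group = rep(c("large", "small"), times = c(500, 20))
)

# alpha follows the raw count in each hexagon, so sparse groups stay faint.
p <- ggplot(pts, aes(x = x, y = y)) +
  stat_binhex(geom = "hex",
              aes(fill = group, alpha = after_stat(count)),
              binwidth = c(0.5, 0.5)) +  # hexagon width in x and y units
  theme_minimal()
```

Printing `p` draws the plot; smaller `binwidth` values give finer hexagons but lower counts per bin.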

Not the most elegant r code, but this seems to work. I normalize my real data a bit differently than this, but this gets the solution I found across. I use a for loop where I find the average power for the condition and add a new stat_density2d layer with the alpha scaled by that average power.
temp <- rbind(movies[1:2,], movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
mpaa <- unique(temp$mpaa)
p <- ggplot() + theme_minimal()
for (ii in seq_along(mpaa)) {
  # fraction of all observations in this group, so alpha stays within [0, 1]
  ratio <- sum(temp$mpaa == mpaa[ii]) / nrow(temp)
  p <- p + stat_density2d(data = temp[temp$mpaa == mpaa[ii],],
                          aes(x = rating, y = length, fill = mpaa),
                          geom = "polygon",
                          contour = TRUE,
                          alpha = ratio,  # constant per layer; stacked contours add up
                          colour = NA)    # no contour outlines
}
print(p)
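Another way to keep alpha comparable across groups, sketched here under my own assumptions rather than taken from either answer, is to compute each group's density yourself on a shared grid and multiply it by the group size before plotting. The helper `scaled_density` and the stand-in data `pts` are hypothetical names introduced for illustration.

```r
library(ggplot2)
library(MASS)  # kde2d()

# Hypothetical helper: 2D KDE for one group on a shared grid, multiplied by
# the number of points so the value reflects counts, not just shape.
scaled_density <- function(d, group, xlim, ylim, n = 50) {
  k <- kde2d(d$x, d$y, n = n, lims = c(xlim, ylim))
  out <- expand.grid(x = k$x, y = k$y)  # x varies fastest, matching k$z
  out$weight <- as.vector(k$z) * nrow(d)
  out$group <- group
  out
}

set.seed(1)
# Made-up data: a large group and a small group.
pts <- data.frame(
  x = c(rnorm(500), rnorm(20, mean = 3)),
  y = c(rnorm(500), rnorm(20, mean = 3)),
  group = rep(c("large", "small"), times = c(500, 20))
)
xlim <- range(pts$x)
ylim <- range(pts$y)

grid <- do.call(rbind, lapply(split(pts, pts$group), function(d) {
  scaled_density(d, d$group[1], xlim, ylim)
}))

# geom_tile draws one rectangle per row, so the two groups overlay and blend.
p <- ggplot(grid, aes(x = x, y = y, fill = group, alpha = weight)) +
  geom_tile() +
  theme_minimal()
```

Because both groups share one alpha scale, the 20-point group ends up much fainter than the 500-point group instead of washing it out.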

Related

How to plot density of points in one dimension with different factors in ggplot2

I am attempting to place individual points on a plot using ggplot2, however as there are many points, it is difficult to gauge how densely packed the points are. Here, there are two factors being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
library(ggplot2)
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
# one task label per observation: 1000 for s1, then 1000 for s2
data = data.frame(task_number = as.factor(rep(1:2, each = 1000)),
                  S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
group_by(task_number) %>%
# Use approxfun to interpolate the density back to
# the original points
mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
geom_point() +
scale_colour_viridis_c()
Result:
One could, of course, come up with a measure of proximity to neighbouring values for each value. However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)
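To see what the approxfun(density(S))(S) step in the first answer does, here is a small base R sketch on simulated values. The variable names `d`, `dens_at` and `dens` are mine, chosen for illustration.

```r
set.seed(42)
S <- rnorm(1000, 1, 10)

d <- density(S)          # kernel density estimate on a regular grid
dens_at <- approxfun(d)  # linear interpolator through the (d$x, d$y) pairs
dens <- dens_at(S)       # one density estimate per original observation

# Points near the centre of the distribution get a higher value than
# points far out in the tail.
centre <- dens_at(1)
tail_value <- dens_at(30)
```

By default the grid from density() extends a few bandwidths beyond the range of the data, so every original point falls inside it and the interpolation returns no NA values.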

Probability density Matrix Subtraction for heatmaps

Pretty new to R and stuck. I am attempting to normalize the 2d probability density of a heat map by subtracting the 2d probability density of another data set. I am looking at where behaviors occur in space. To do this, I want to subtract out where the subjects just spend most of their time from where the behaviors are occurring, to get an idea of the relative density of just the behaviors. To do this I am trying to find the probability density matrices used to plot a heatmap for the following code:
# requires ggplot2, dplyr (for %>%) and colorRamps (for matlab.like)
ctrlplot <- ctrl %>% ggplot(aes(x=x, y=y)) +
  stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE) +
  scale_fill_gradientn(colours = matlab.like(15), na.value = "gray",
                       # as lowertick, uppertick, interval
                       limits = c(0, 1.3e-05)) # sets the static limit of probabilities
This works to make the heat plot for either data set plot, however I cannot find where ggplot or stat_density_2d is storing the density data to subtract the two.
Alternatively I have tried to get just the densities for both data sets using the following code and storing it as the variable dens:
library(MASS)  # provides kde2d() and bandwidth.nrd()
n <- 100
h <- c(bandwidth.nrd(ctrl$x), bandwidth.nrd(ctrl$y))
dens <- kde2d(ctrl$x, ctrl$y, n=n, h=h)
Now I am not sure how to subtract the resulting z values and get it back into a heat plot. I know there is likely an easy solution for this, but I am definitely stuck. Any advice on how to do this easier, or other suggestions on how to subtract the densities from one another would be greatly appreciated.
UPDATE:
I found a way to pull the density data from ggplot. I was able to pull the density data from two different data sets, subtract the vectors and place the densities back into the original data frame using the following code:
ctrlplot<-ctrl %>% ggplot(aes(x=x, y=y)) +
stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE)+
scale_fill_gradientn(colours=matlab.like(15), na.value = "gray")
ctxplot<-ctx %>% ggplot(aes(x=x, y=y)) +
stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE)+
scale_fill_gradientn(colours=matlab.like(15), na.value = "gray")
ctrlplot2<-ggplot_build(ctrlplot)
gbctrl<-ctrlplot2$data[[1]]
densctrl<-gbctrl$density
gbctx<-ggplot_build(ctxplot)
gbctx<-gbctx$data[[1]]
densctx<-gbctx$density
diff_ctrl_ctx<-densctrl-densctx
gbctrl$density<-diff_ctrl_ctx
ctrlplot2$data[[1]]<-gbctrl
ctrlplot2
ctrlplot
However, the last two plots, ctrlplot (original) and ctrlplot2 (subtracted densities), give the same plot. I am not sure whether I am replacing the correct parts of the data frame so that it updates for plotting, since there are different lists in the original ggplot_build.
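One likely reason the edited ggplot_build object looks unchanged is that the built layer data already carries the computed fill colours, so overwriting the density column afterwards has no effect. I have not checked this against the asker's data. A possible alternative, sketched below with made-up stand-ins for ctrl and ctx, is to evaluate both densities with kde2d on the same grid (via `lims`), subtract the z matrices, and plot the difference directly.

```r
library(MASS)     # kde2d()
library(ggplot2)

set.seed(1)
# Hypothetical stand-ins for the ctrl and ctx data frames (columns x and y).
ctrl <- data.frame(x = rnorm(300), y = rnorm(300))
ctx  <- data.frame(x = rnorm(300, mean = 1), y = rnorm(300, mean = 1))

# Shared limits so both densities are evaluated at identical grid points.
lims <- c(range(c(ctrl$x, ctx$x)), range(c(ctrl$y, ctx$y)))
n <- 100
d_ctrl <- kde2d(ctrl$x, ctrl$y, n = n, lims = lims)
d_ctx  <- kde2d(ctx$x,  ctx$y,  n = n, lims = lims)

# Flatten the difference matrix; expand.grid varies x fastest, like as.vector(z).
diff_df <- expand.grid(x = d_ctrl$x, y = d_ctrl$y)
diff_df$diff <- as.vector(d_ctrl$z - d_ctx$z)

p <- ggplot(diff_df, aes(x = x, y = y, fill = diff)) +
  geom_raster() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0)
```

A diverging fill scale centred on zero makes it easy to see where one data set is denser than the other. Note that each kde2d call here picks its own default bandwidth; pass `h` explicitly if both should use the same one.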

R - Control Histogram Y-axis Limits by second-tallest peak

I've written an R script that loops through a data.frame, making multiple complex plots that each include a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1, and it obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram, as seen in the code below, but what I really need to figure out is how to define the y-axis limits such that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in c(1:length(cols))) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
maxDensities <- sapply(df, function(i) max(hist(i)$density))
# take the second highest peak:
myYlim <- rev(sort(maxDensities))[2]
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1,breaks=10)),T)[2]
Working from the inside out
cut will bin the data (not really needed with integer data like you have, but probably needed with real data)
table then creates a table with the count of each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value
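Putting those steps together, a minimal sketch might look like the following. The helper name `second_peak` and the 10% headroom are my own choices, and the bins produced by cut may not line up exactly with the ones geom_histogram draws, so treat the limit as approximate.

```r
library(ggplot2)

# Hypothetical helper: count in the second-fullest bin of x.
second_peak <- function(x, bins = 10) {
  counts <- table(cut(x, breaks = bins))
  sort(counts, decreasing = TRUE)[[2]]
}

set.seed(5)
df <- data.frame(matrix(sample(1:10, 1000, replace = TRUE,
                               prob = c(0.8, rep(0.01, 9))), nrow = 100))

# coord_cartesian zooms the view without dropping the bars that exceed the
# limit, unlike limits set through scale_y_continuous.
y_max <- second_peak(df$X1) * 1.1
p <- ggplot(df, aes(X1)) +
  geom_histogram(bins = 10) +
  coord_cartesian(ylim = c(0, y_max))
```

Inside the loop from the question, the same call can be made per column, for example with `second_peak(df[[my_col]])`.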

ggplot2: how to overlay 2 plots when using stat_summary

I am totally new to R, so maybe the answer to this question is trivial, but I couldn't find any solution after searching the net for days.
I am using ggplot2 to create graphs containing the mean of my samples with the confidence interval in a ribbon (I can't post the pic, but something like this: S1).
I have a data frame (df) with time in the first column and the values of the variable measured in the other columns (each column is a replicate of the measurement).
I do the following:
library(reshape)  # melt() with the variable_name argument
library(ggplot2)  # "mean_cl_normal" additionally needs the Hmisc package
mdf <- melt(df, id='time', variable_name="samples")
p <- ggplot(data=mdf, aes(x=time, y=value)) +
  geom_point(size=1, colour="red")
stat_sum_df <- function(fun, geom="crossbar", ...) {
  stat_summary(fun.data=fun, geom=geom, colour="red", ...)
}
p + stat_sum_df("mean_cl_normal", geom = "smooth")
and I get the graph I have shown at the beginning.
My question is: if I have two different data frames, each one with a different variable, measured in the same sample at the same time, how can I plot the 2 graphs in the same plot? Everything I have tried ends up doing the statistics on both sets of data at once or on just one of them, but not on each separately. Is it possible just to overlay the plots?
And a second small question: is it possible to change the colour of the ribbon?
Thanks!
something like this:
library(ggplot2)
a <- data.frame(x=rep(c(1,2,3,5,7,10,15,20), 5),
y=rnorm(40, sd=2) + rep(c(4,3.5,3,2.5,2,1.5,1,0.5), 5),
g = rep(c('a', 'b'), each = 20))
ggplot(a, aes(x=x,y=y, group = g, colour = g)) +
geom_point(aes(colour = g)) +
geom_smooth(aes(fill = g))
I'd suggest reading the basics of ggplot. Check ?ggplot2 for help on ggplot, and also the available help topics here, particularly how the group aesthetic may be manipulated.
You'll find the discussion group at Google Groups useful, and may want to join it. Also, Quick-R has a lot of examples of ggplot graphs, as does, obviously, Stack Overflow.
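If the two variables really do live in separate data frames, one option is to give each stat_summary layer its own `data` argument, which also lets you pick the ribbon colour per layer through `fill`. This is only a sketch: `mdf1`, `mdf2` and the helper `mean_ci` are hypothetical names, and the `fun` argument assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
# Hypothetical long-format data: 5 replicates of two variables at 8 time points.
times <- c(1, 2, 3, 5, 7, 10, 15, 20)
mdf1 <- data.frame(time = rep(times, 5), value = rnorm(40, mean = 4))
mdf2 <- data.frame(time = rep(times, 5), value = rnorm(40, mean = 2))

# Hypothetical helper: mean with a t-based 95% confidence interval.
mean_ci <- function(v) {
  m <- mean(v)
  half <- qt(0.975, df = length(v) - 1) * sd(v) / sqrt(length(v))
  data.frame(y = m, ymin = m - half, ymax = m + half)
}

# Each layer summarises only the data frame it is given.
p <- ggplot(mapping = aes(x = time, y = value)) +
  stat_summary(data = mdf1, fun.data = mean_ci, geom = "ribbon",
               fill = "red", alpha = 0.3) +
  stat_summary(data = mdf1, fun = mean, geom = "line", colour = "red") +
  stat_summary(data = mdf2, fun.data = mean_ci, geom = "ribbon",
               fill = "blue", alpha = 0.3) +
  stat_summary(data = mdf2, fun = mean, geom = "line", colour = "blue")
```

Writing the interval function by hand avoids the Hmisc dependency of "mean_cl_normal"; swapping that back in should work the same way.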

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location),
                 f2=factor(data$Station), data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2)
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) +
  geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
                                position = position_dodge(height = 0, width = 0.75), size = 3)
plot1 + xlab("MPS Station") + ylab("Depth(m)") +
  theme(legend.title=element_blank()) + scale_y_reverse() +
  coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2 + scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
  xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to to combine those two plots, by adding "plot2" (Station depth scatterplot) to "plot1" boxplots (Detection depths). They are both looking at the same variables (depth and station), and must be the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the `col=f1` argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different plots by referencing the data from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot() +
  geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) +
  geom_point(data=df2, aes(x=f2, y=depth), colour="blue") +
  scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.
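For reference, dummy data of that kind could be generated roughly as follows. The station names, site labels and depth values here are invented for illustration and are not the data behind the original chart.

```r
library(ggplot2)

set.seed(1)
stations <- factor(paste0("S", 1:4))

# Invented detection data: 200 detections spread over 4 stations and 2 sites.
df <- data.frame(
  f1 = factor(sample(c("SiteA", "SiteB"), 200, replace = TRUE)),
  f2 = sample(stations, 200, replace = TRUE),
  depth = runif(200, min = 0, max = 100)
)

# Invented receiver depths: one value per station.
df2 <- data.frame(f2 = stations, depth = c(60, 90, 120, 150))

# Each geom gets its own data frame; both share the reversed depth axis.
p <- ggplot() +
  geom_boxplot(data = df, aes(x = f2, y = depth, col = f1)) +
  geom_point(data = df2, aes(x = f2, y = depth), colour = "blue", size = 3) +
  scale_y_reverse() +
  labs(x = "MPS Station", y = "Depth (m)")
```

Because both layers map f2 to x and depth to y, they land on the same axes without any extra alignment.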

Resources