How to spatially separate rug plots from different series - r

I'm trying to graphically evaluate distributions (bimodal vs. unimodal) of datasets in which the number of data points per dataset can vary widely. I want to indicate the number of data points using something like rug plots, while avoiding the problem of a series with many data points overwhelming a series with only a few points.
Currently I'm working in ggplot2, combining geom_density and geom_rug like so:
# Set up data: 1000 bimodal "b" points; 20 unimodal "a" points
set.seed(0); require(ggplot2)
x <- c(rnorm(500, mean=10, sd=1), rnorm(500, mean=5, sd=1), rnorm(20, mean=7, sd=1))
l <- c(rep("b", 1000), rep("a", 20))
d <- data.frame(x=x, l=l)
ggplot(d, aes(x=x, colour=l)) + geom_density() + geom_rug()
This almost does what I want - but the "a" points get overwhelmed by the "b" points.
I've hacked a solution using geom_point instead of geom_rug:
d$ypos <- NA
d$ypos[d$l=="b"] <- 0
d$ypos[d$l=="a"] <- 0.01
ggplot() +
geom_density(data=d, aes(x=x, colour=l)) +
geom_point(data=d, aes(x=x, y=ypos, colour=l), alpha=0.5)
However this is unsatisfying because the y positions must be adjusted manually. Is there a more automatic way to separate rug plots from different series, for instance using a position adjustment?

One way would be to use two geom_rug() calls - one for b, other for a. Then for one geom_rug() set sides="t" to plot them on top.
ggplot(d, aes(x=x, colour=l)) + geom_density() +
geom_rug(data=subset(d,l=="b"),aes(x=x)) +
geom_rug(data=subset(d,l=="a"),aes(x=x),sides="t")
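With more than two series, the top/bottom trick runs out of sides. A rough sketch of one alternative, building on the geom_point idea from the question: derive the y offset from the series index so nothing has to be positioned by hand. The spacing `step` is an assumed value in density units and would need tuning to the actual y range; the tick-like shape "|" is just one way to imitate a rug.

```r
library(ggplot2)

set.seed(0)
d <- data.frame(
  x = c(rnorm(500, mean = 10), rnorm(500, mean = 5), rnorm(20, mean = 7)),
  l = c(rep("b", 1000), rep("a", 20))
)

# Assumed spacing between rug rows, in density units (tune to your data)
step <- 0.01

# One row of tick marks per series, stacked below y = 0:
# series index 1 -> -step, index 2 -> -2 * step, and so on
d$ypos <- -step * as.integer(factor(d$l))

p <- ggplot(d, aes(x = x, colour = l)) +
  geom_density() +
  geom_point(aes(y = ypos), shape = "|", size = 3)
```

Because the offsets come from the factor levels, adding a third series only adds a third row of ticks.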

Related

Plotting a `geom_rug` on both sides of the plot (left and right) based on `x` value in ggplot2

Problem Statement
Suppose that I have the following function of two variables
f <- function(x, y){
return(x*y + (x^3)*sin(y))
}
I want to fix two x points, for instance at x=2 and x=3 and then, I want to get, say, 100 standard random normal samples, which I'm going to feed in as y values.
What the data looks like
This is what the data looks like
set.seed(1)
y <- rnorm(100)
df <- data.frame(
x = c(rep(2, 50), rep(3, 50)),
y=c(f(2, head(y, 50)), f(3, tail(y, 50)))
)
head(df)
x y
1 2 -5.943113
2 2 1.828189
3 2 -7.605003
4 2 11.188164
5 2 3.247634
6 2 -7.492659
Standard Scatter Plot of the data
df$x <- as.factor(df$x)
ggplot(data=df, aes(x=x, y=y)) +
geom_point()
What I am trying to do
Basically I want to have two geom_rug() layers: one on the left, corresponding to the scatter points for x=2, and one on the right, corresponding to the scatter points for x=3. I can produce a geom_rug() for all scatter points, as shown below, but I don't know how to have two different ones.
ggplot(data=df, aes(x=x, y=y)) +
geom_point(aes(color=x)) +
geom_rug()
Ideally, I'd like the rug plot on the left to have the same color as the scatter points on x=2, and the rug plot on the right to have the same color as the scatter points on x=3.
As I said, I did solve the problem by using
ggplot(data=df, aes(x=x, y=y, color=x)) +
geom_point(aes(color=x)) +
geom_rug(data=subset(df, x==2), sides="l", aes(y=y)) +
geom_rug(data=subset(df, x==3), sides="r")

Plot point on ggplot2 smoothing regression on vline intersection

I want to create a (time-series) plot out of 40 million data points in order to show two regression lines with two specific events on each of it (first occurrence of an optimum in time-series).
Currently, I draw the regression lines and add a geom_vline to it to indicate the event.
As I want to be independent from colours in the plot, it would be beneficial if I could just plot the marker geom_vline as a point on the regression line.
Do you have any idea how to solve this using ggplot2?
My current approach is this here (replaced data points with test data):
library(ggplot2)
# Generate data
m1 <- "method 1"
m2 <- "method 2"
data1 <- data.frame(Time=seq(100), Value=sample(1000, size=100), Type=rep(as.factor(m1), 100))
data2 <- data.frame(Time=seq(100), Value=sample(1000, size=100), Type=rep(as.factor(m2), 100))
df <- rbind(data1, data2)
rm(data1, data2)
# Calculate first minima for each Type
m1_intercept <- df[which(df$Type == m1), ][which.min(df[which(df$Type == m1), ]$Value),]
m2_intercept <- df[which(df$Type == m2), ][which.min(df[which(df$Type == m2), ]$Value),]
# Plot regression and vertical lines
p1 <- ggplot(df, aes(x=Time, y=Value, group=Type, colour=Type), linetype=Type) +
geom_smooth(se=F) +
geom_vline(aes(xintercept=m1_intercept$Time, linetype=m1_intercept$Type)) +
geom_vline(aes(xintercept=m2_intercept$Time, linetype=m2_intercept$Type)) +
scale_linetype_manual(name="", values=c("dotted", "dashed")) +
guides(colour=guide_legend(title="Regression"), linetype=guide_legend(title="First occurrence of optimum")) +
theme(legend.position="bottom")
ggsave("regression.png", plot=p1, height=5, width=7)
which generates this plot:
My desired plot would be something like this:
So my questions are
Does it make sense to indicate a minimum value on a regression line? The value's y-axis position would in fact be wrong, but the point would only indicate the timepoint.
If yes, how can I achieve such a behaviour?
If no, what would you think could be better?
Thank you very much in advance!
Robin
If you first run your ggplot() call with only geom_smooth(), you can access plotted values through ggplot_build(), which we then can use to plot points on the two fitted lines. Example:
# Create initial plot
p1<-ggplot(df, aes(x=Time, y=Value, colour=Type)) +
geom_smooth(se=F)
# Now we can access the fitted values
smooths <- ggplot_build(p1)$data[[1]]
smooths_1 <- smooths[smooths$group==1,] # First group (method 1)
smooths_2 <- smooths[smooths$group==2,] # Second group (method 2)
# Then we find the closest plotted values to the minima
smooth_1_x <- smooths_1$x[which.min(abs(smooths_1$x - m1_intercept$Time))]
smooth_2_x <- smooths_2$x[which.min(abs(smooths_2$x - m2_intercept$Time))]
# Subset the previously defined datasets for respective closest values
point_data1 <- smooths_1[smooths_1$x==smooth_1_x,]
point_data2 <- smooths_2[smooths_2$x==smooth_2_x,]
Now we use point_data1 and point_data2 to place the points on your plot:
ggplot(df, aes(x=Time, y=Value, colour=Type)) +
geom_smooth(se=F) +
geom_point(data=point_data1, aes(x=x, y=y), colour = "red",size = 5) +
geom_point(data=point_data2, aes(x=x, y=y), colour = "red", size = 5)
To reproduce this plot, you can use set.seed(42) for your data generation step.
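An alternative sketch that skips ggplot_build(): fit the smoother yourself and predict at the exact time of each minimum, rather than snapping to the nearest plotted x value. This assumes geom_smooth() is using loess with default settings (which I believe is what it picks for groups this small); with 40 million points it will choose a different method, so the fit below would need to be swapped accordingly. The `markers` data frame and the shape mapping are illustrative choices, not part of the original answer.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(
  Time  = rep(seq(100), 2),
  Value = c(sample(1000, size = 100), sample(1000, size = 100)),
  Type  = factor(rep(c("method 1", "method 2"), each = 100))
)

# For each Type: find the time of the first minimum, fit a loess curve,
# and evaluate the fitted curve at exactly that time
marker_list <- lapply(split(df, df$Type), function(sub) {
  t_min <- sub$Time[which.min(sub$Value)]
  fit <- loess(Value ~ Time, data = sub)
  data.frame(
    Time  = t_min,
    Value = unname(predict(fit, newdata = data.frame(Time = t_min))),
    Type  = sub$Type[1]
  )
})
markers <- do.call(rbind, marker_list)

# Distinguish the markers by shape so the plot does not rely on colour
p <- ggplot(df, aes(x = Time, y = Value, colour = Type)) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_point(data = markers, aes(shape = Type), size = 4, colour = "black")
```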

How to shade part of a density curve in ggplot (with no y axis data)

I'm trying to create a density curve in R from a set of random numbers between 0 and 1000, and shade the part that is less than or equal to a certain value. There are a lot of solutions out there involving geom_area or geom_ribbon, but they all require a yval, which I don't have (it's just a vector of 1000 numbers). Any ideas on how I could do this?
Two other related questions:
Is it possible to do the same thing for a cumulative density function (I'm currently using stat_ecdf to generate one), or shade it at all?
Is there any way to edit geom_vline so it will only go up to the height of the density curve, rather than the whole y axis?
Code: (the geom_area is a failed attempt to edit some code I found. If I set ymax manually, I just get a column taking up the whole plot, instead of just the area under the curve)
set.seed(100)
amount_spent <- rnorm(1000,500,150)
amount_spent1<- data.frame(amount_spent)
rand1 <- runif(1,0,1000)
amount_spent1$pdf <- dnorm(amount_spent1$amount_spent)
mean1 <- mean(amount_spent1$amount_spent)
#density/bell curve
ggplot(amount_spent1,aes(amount_spent)) +
geom_density( size=1.05, color="gray64", alpha=.5, fill="gray77") +
geom_vline(xintercept=mean1, alpha=.7, linetype="dashed", size=1.1, color="cadetblue4")+
geom_vline(xintercept=rand1, alpha=.7, linetype="dashed",size=1.1, color="red3")+
geom_area(mapping=aes(ifelse(amount_spent1$amount_spent > rand1,amount_spent1$amount_spent,0)), ymin=0, ymax=.03,fill="red",alpha=.3)+
ylab("")+
xlab("Amount spent on lobbying (in Millions USD)")+
scale_x_continuous(breaks=seq(0,1000,100))
A couple of existing questions show how to do this, but they calculate the density prior to plotting.
This is another way, more complicated than required I'm sure, that lets ggplot do some of the calculations for you.
# Your data
set.seed(100)
amount_spent1 <- data.frame(amount_spent=rnorm(1000, 500, 150))
mean1 <- mean(amount_spent1$amount_spent)
rand1 <- runif(1,0,1000)
Basic density plot
p <- ggplot(amount_spent1, aes(amount_spent)) +
geom_density(fill="grey") +
geom_vline(xintercept=mean1)
You can extract the x and y positions for the area to shade from the plot object using ggplot_build. Linear interpolation was used to get the y value at x=rand1
# subset region and plot
d <- ggplot_build(p)$data[[1]]
p <- p + geom_area(data = subset(d, x > rand1), aes(x=x, y=y), fill="red") +
geom_segment(x=rand1, xend=rand1,
y=0, yend=approx(x = d$x, y = d$y, xout = rand1)$y,
colour="blue", size=3)
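For comparison, the "calculate the density first" route can be sketched as follows. It uses stats::density() directly, so the curve may differ slightly from what geom_density() draws if the bandwidth settings differ. The threshold `cutoff` is a made-up value for illustration, and this version shades the region at or below it, as the question describes.

```r
library(ggplot2)

set.seed(100)
amount_spent <- rnorm(1000, 500, 150)
cutoff <- 400  # hypothetical threshold, chosen only for this example

# Estimate the density once, outside ggplot
dens <- density(amount_spent)
dens_df <- data.frame(x = dens$x, y = dens$y)

# Height of the curve at the cutoff, by linear interpolation
y_cut <- approx(x = dens$x, y = dens$y, xout = cutoff)$y

p <- ggplot(dens_df, aes(x = x, y = y)) +
  geom_line() +
  # shade only the part of the curve at or below the cutoff
  geom_area(data = subset(dens_df, x <= cutoff), fill = "red", alpha = 0.3) +
  # vertical marker that stops at the curve instead of spanning the panel
  annotate("segment", x = cutoff, xend = cutoff, y = 0, yend = y_cut,
           linetype = "dashed")
```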

making binned scatter plots for two variables in ggplot2 in R

I have a dataframe with two columns x and y that each contain values between 0 and 100 (the data are paired). I want to correlate them to each other using binned scatter plots. If I were to use a regular scatter plot, it would be easy to do:
geom_point(aes(x=x, y=y))
but I'd like to instead bin the points into N bins from 0 to 100, get the average value of x in each bin and the average value of y for the points in that bin, and show that as a scatter plot - so correlate the binned averages instead of the raw data points.
Is there a clever/quick way to do this in ggplot2, using some combination of geom_smooth() and geom_point()? Or does it have to be pre-computed manually and then plotted?
Yes, you can use stat_summary_bin.
set.seed(42)
x <- runif(1e4)
y <- x^2 + x + 4 * rnorm(1e4)
df <- data.frame(x=x, y=y)
library(ggplot2)
(ggplot(df, aes(x=x,y=y)) +
geom_point(alpha = 0.4) +
# 'fun' replaces the older 'fun.y' argument, which is deprecated in recent ggplot2
stat_summary_bin(fun='mean', bins=20,
color='orange', size=2, geom='point'))
I suggest geom_bin2d.
DF <- data.frame(x=1:100,y=1:100+rnorm(100))
library(ggplot2)
p <- ggplot(DF,aes(x=x,y=y)) + geom_bin2d()
print(p)
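Note that stat_summary_bin averages y within each x bin but plots at the bin position, not at the mean of x. If you want the mean of x as well, as the question describes, the manual pre-computation is short. A sketch using base R, assuming equal-width bins on x from 0 to 100; the simulated data and the bin count `n_bins` are illustrative:

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = runif(1e4, min = 0, max = 100))
df$y <- 0.5 * df$x + rnorm(1e4, sd = 10)

n_bins <- 20

# Assign each point to one of n_bins equal-width bins on x
df$bin <- cut(df$x,
              breaks = seq(0, 100, length.out = n_bins + 1),
              include.lowest = TRUE)

# Mean of x and mean of y within each bin
binned <- aggregate(cbind(x, y) ~ bin, data = df, FUN = mean)

p <- ggplot(binned, aes(x = x, y = y)) +
  geom_point()
```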

Plot a sample of a time series

I have a dataset that contains observations for every second of four consecutive days (roughly 340'000 data points). This is too much to display in a scatter plot. I would like to plot only a uniform sample of, say, 2000 time points.
Is it possible to achieve this with ggplot2's "grammar of graphics" approach? I haven't found any built-in "sampling" modifier, but perhaps it's easy enough to write one?
library(ggplot2)
x <- 1:100000
d <- data.frame(x=x, y=rnorm(length(x)))
ggplot(d[sample(x, 2000), ], aes(x=x, y=y)) + geom_point()
This is how it can be "hacked" by modifying the data passed to ggplot. But I don't want to modify the data, just filter it to include only a sample.
ggplot(d, aes(x=x, y=y)) + ??? + geom_point()
EDIT: I'm specifically looking for sampling, not smoothing or binning. The data I have shows the time it takes to simulate one second of a specific process. The simulation has been parallelized, and for each simulated seconds I have the run times for each of the cores involved (8 in total). I want to show sub-optimal load balancing by plotting just the raw data points. The reason for the sampling is just that 300'000 data points are way too much for a scatter plot: Plotting takes too long and the visualization is no good.
You can subset within the geom_point call using the data argument:
... + geom_point(data=d[sample(x,2000),])
This way, you are free to add other geoms using all the data, eg, using the example data:
ggplot(d, aes(x=x, y=y)) + geom_hex() + geom_point(data=d[sample(x,2000),])
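To get closer to a "sampling modifier", note that a layer's data argument also accepts a function, which ggplot2 calls with the plot's data. A small sketch of a reusable helper; the name `sample_rows` is made up for this example:

```r
library(ggplot2)

set.seed(1)
d <- data.frame(x = 1:100000, y = rnorm(100000))

# Returns a function that draws up to n rows at random from whatever
# data frame the plot supplies
sample_rows <- function(n) {
  function(data) {
    data[sample(nrow(data), min(n, nrow(data))), , drop = FALSE]
  }
}

# The full data stays attached to the plot; only this layer is sampled
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = sample_rows(2000))
```

The sample is drawn when the plot is built, so each rendering shows a different subset unless the seed is set beforehand.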
If you want to create a scatter plot for big data, here are a couple of ggplot2 options.
They come from this course by Hadley.
library(knitr)
library(ggplot2)
# knitr setup: render figures as png and use strict markdown
opts_chunk$set(fig.width = 5, fig.height = 5, dev = "png")
render_markdown(strict = TRUE)
# some autocorrelated data
set.seed(1)
x <- 1:1e+05
d <- data.frame(x = x)
d$y <- arima.sim(list(order = c(1, 1, 0), ar = 0.9), n = 1e+05 - 1)
# the basic plot
base_plot <- ggplot(d, aes(x = x, y = y))
geom_bin2d
you can set the binwidth for the x and y variables
base_plot + geom_bin2d(binwidth = c(200, 5))
geom_hex
you can set the number of bins
base_plot + geom_hex(bins = 200)
small points
Reduces overplotting by drawing each point as a single pixel
base_plot + geom_point(shape = ".")
use a smoother
This relies on having a smoothing method that will get you the detail you want without crashing or taking too long. In this case the number of knots was chosen by trial and error (and perhaps you will want more detail)
library(mgcv)
base_plot + stat_smooth(method = "gam", formula = y ~ s(x, k = 50))
