How can I generate data from a plotted line in R?

I have no dataset, just two plotted lines, and I want to generate scattered y-axis data 2 standard deviations away from the mean (the plotted line). Here is my code for the lines:
ggplot() +
lims(x = c(0,20), y = c(0,1)) +
annotate("segment",x = .1,xend = 5, yend = .25, y = .1) +
annotate("segment",x = 5,xend = 20, yend = .35,y = .25)
Sorry if this post is unclear, but I am not sure of the best way to explain it. Let me know if you have any questions or if what I am trying to do isn't possible.

Here's an example for one of the lines you have. I didn't double-check whether y = 0.09*x + 0 is consistent with what you showed; I based my answer on your comment.
library(ggplot2)
library(dplyr)
df <- tibble(x=1:20,
y1=0.09*x,
y2=0.0067*x)
# generate dots for y1
# 10 draws per x value, with mean y1 and sd = 1 (the rnorm default)
sapply(df$y1, function(tt) rnorm(10, tt)) %>%
# make it into tibble
as_tibble() %>%
# pivot into longer format
tidyr::pivot_longer(everything()) %>%
# names of the columns get assigned to V1 V2 ...
# we can clean that and get the actual x
# this works nicely because your x=1:20, will fail otherwise
mutate(X=as.numeric(stringr::str_remove(name, "V"))) %>%
# plot the thing
ggplot(aes(X, value)) +
geom_point() +
# add the "mean" values from before
geom_point(data=df, aes(x, y1), col="red", size=2)
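The column-name trick above depends on x being 1:20. A minimal sketch that avoids it, assuming the two segments from the question and reading "2 standard deviations" as "keep points within 2 sd of the line". The names `line_mean`, `noise_sd`, and `n_per_x` are illustrative and do not come from the original posts, and the value of `noise_sd` is an arbitrary assumption:

```r
# Piecewise-linear mean taken from the two annotate("segment") calls in the question
line_mean <- function(x) {
  ifelse(x <= 5,
         0.10 + (0.25 - 0.10) / (5 - 0.1) * (x - 0.1),  # segment (0.1, 0.10) to (5, 0.25)
         0.25 + (0.35 - 0.25) / (20 - 5) * (x - 5))     # segment (5, 0.25) to (20, 0.35)
}

set.seed(1)
noise_sd <- 0.02  # assumed spread around the line
n_per_x  <- 10    # assumed number of simulated points per x value

x   <- rep(seq(0.5, 20, by = 0.5), each = n_per_x)
sim <- data.frame(x = x, y = rnorm(length(x), mean = line_mean(x), sd = noise_sd))

# Keep only the points that lie within 2 standard deviations of the line
sim <- sim[abs(sim$y - line_mean(sim$x)) <= 2 * noise_sd, ]
```

The resulting `sim` data frame can be passed to ggplot(sim, aes(x, y)) + geom_point() and combined with the annotate() segments from the question.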

Related

violin_plot() with continuous axis for grouping variable?

The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.
The functionality to plot violin plots on a continuous scale is directly built into ggplot.
The key is to keep the original continuous variable (instead of transforming it into a factor) and specify how to group it within the aesthetic mapping of geom_violin(). The width of the groups can be modified with the width argument of cut_width(), depending on the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.
Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks @ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)

ggplot faceted cumulative histogram

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but I get this chart
that, obviously, is not correct.
How can I get the correct one, like this:
Thanks!
I would pre-compute the cumsum values per bin per group, and then use geom_histogram to plot.
library(tidyverse) # provides %>%, count() and complete()
mydata %>%
mutate(x = cut(x, breaks = 1:10, labels = F)) %>% # Bin x
count(gender, x) %>% # Counts per bin per gender
mutate(x = factor(x, levels = 1:10)) %>% # x as factor
complete(x, gender, fill = list(n = 0)) %>% # Fill missing bins with 0
group_by(gender) %>% # Group by gender ...
mutate(y = cumsum(n)) %>% # ... and calculate cumsum
ggplot(aes(x, y, fill = gender)) + # The rest is (gg)plotting
geom_histogram(stat = "identity", colour = "white") +
facet_grid(gender ~ .)
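For reference, the same per-bin, per-group cumulative counts can be sketched in base R. This assumes the `mydata` definition from the question; `bins`, `counts`, and `cum_counts` are illustrative names:

```r
set.seed(123)
x <- c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender <- rep(c("Male", "Female"), each = 100)
mydata <- data.frame(x = x, gender = gender)

# Bin x on the same breaks as the question; empty bins stay in as zero counts
bins <- cut(mydata$x, breaks = 1:10)

# Rows are bins, columns are genders
counts <- table(bins, mydata$gender)

# Cumulative sum down each column, i.e. within each gender
cum_counts <- apply(counts, 2, cumsum)
```

The last row of `cum_counts` holds the number of observations per gender that fall inside the breaks, which is the final bar height in each facet.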
Like @Edo, I also came here looking for exactly this. @Edo's solution was the key for me. It's great. But I post here a few additions that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale the y so that the cumulative plot always ends at 100%. Otherwise, if the groups are not the same size (like they are in the original example data) then the cumulative plots have different final heights. This obscures their relative distribution.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
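Why the rescaled curve always ends at 100% can be checked outside ggplot. A minimal sketch using base R's hist() on the same simulated x; the half-unit bin width is an arbitrary choice and not taken from the answer:

```r
set.seed(123)
x <- c(rnorm(100, 4, 1), rnorm(50, 6, 1))

# Bin without plotting; hist() returns the densities and the break points
h <- hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 0.5), plot = FALSE)

# density * bin width is the share of observations in each bin
share <- h$density * diff(h$breaks)

# The running total therefore ends at 1 (100%) whatever the group size
cum_share <- cumsum(share)
```

This mirrors the cumsum of after_stat(density) multiplied by after_stat(width) in the plot code above, computed here for a single group.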
Answering a million years later....
I was looking for a solution for the same problem and I got here..
Eventually I figured it out by myself, so I'll drop it here in case other people will ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
geom_histogram(aes(x = x, y = ave(after_stat(count), group, FUN = cumsum), # after_stat() replaces the deprecated ..count..
fill = gender, group = gender),
colour = "gray70", breaks = 1:10) +
facet_grid(rows = "gender")

Drawing several "numeric" lines using ggplot

I have a dataset which contains 200 different groups, each of which can take a value between 0 and 200. I would like to draw a line for every group, so a total of 200 lines, and have the legend be "numeric". I know how to do this with a factor, but can't get it to work. Not the best example:
library(tidyverse)
df <- data.frame(Day = 1:100)
df <- df %>% mutate(A = Day + runif(100,1,400) + rnorm(100,3,400) + 2500,
B = Day + rnorm(100,2,900) + -5000 ,
C = Day + runif(100,1,50) + rnorm(100,1,1000) -500,
D = (A+B+C)/5 - rnorm(100, 3,450) - 2500)
df <- gather(df, "Key", "Value", -Day)
df$Key1 <- apply(df, 1, function(x) which(LETTERS == x[2]))
ggplot(df, aes(Day, Value, col = Key)) + geom_line() # I would like to keep 4 lines, but would like to have the following legend
ggplot(df, aes(Day, Value, col = Key1)) + geom_line() # Not correct lines
ggplot(df, aes(Day, Value)) + geom_line(aes(col = Key1)) # Not correct lines
Likely a duplicate, but I can't find the answer and guess there is something small that is incorrect.
Is this what you mean? I'm not sure since you say you want 200 lines, but in your code you say you want 4 lines.
ggplot(df, aes(Day, Value, group = Key, col=Key1)) + geom_line()
Using group gives you the different lines, using col gives you the different colours.
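A side note on the `Key1` column: the row-wise apply() call in the question can be replaced by a vectorized lookup, which keeps the column numeric so that col = Key1 gets a continuous legend. A small sketch with a stand-in `Key` vector (the data here are made up for illustration):

```r
# Stand-in for the Key column produced by gather()
Key <- rep(c("A", "B", "C", "D"), each = 3)

# match() returns the position of each key in LETTERS as an integer vector
Key1 <- match(Key, LETTERS)
```

In the question's data frame this would read df$Key1 <- match(df$Key, LETTERS).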

Create a colour blind test with ggplot

I would like to create a colour blind test, similar to that below, using ggplot.
The basic idea is to use geom_hex (or perhaps a Voronoi diagram, or possibly even circles as in the figure above) as the starting point, and define a dataframe that, when plotted in ggplot, produces the image.
We would start by creating a dataset, such as:
df <- data.frame(x = rnorm(10000), y = rnorm(10000))
then plot this:
ggplot(df, aes(x, y)) +
geom_hex() +
coord_equal() +
scale_fill_gradient(low = "red", high = "green", guide = "none") +
theme_void()
which gives the image below:
The main missing step is to create a dataset that actually plots a meaningful symbol (letter or number), and I'm not sure how best to go about this without painstakingly mapping the coordinates. Ideally one would be able to read in the coordinates perhaps from an image file.
Finally, a bit of tidying up could round the plot edges by removing the outlying points.
All suggestions are very welcome!
EDIT
Getting a little closer to what I'm after, we can use the image below of the letter 'e':
Using the imager package, we can read this in and convert it to a dataframe:
img <- imager::load.image("e.png")
df <- as.data.frame(img)
then plot that dataframe using geom_raster:
ggplot(df, aes(x, y)) +
geom_raster(aes(fill = value)) +
coord_equal() +
scale_y_continuous(trans = scales::reverse_trans()) +
scale_fill_gradient(low = "red", high = "green", guide = "none") +
theme_void()
If we use geom_hex instead of geom_raster, we can get the following plot:
ggplot(df %>% filter(value %in% 1), aes(x, y)) +
geom_hex() +
coord_equal() +
scale_y_continuous(trans = scales::reverse_trans()) +
scale_fill_gradient(low = "red", high = "green", guide = "none") +
theme_void()
so, getting there but clearly still a long way off...
Here's an approach for creating this plot:
Packages you need:
library(tidyverse)
library(packcircles)
Get image into a 2D matrix (x and y coordinates) of values. To do this, I downloaded the .png file of the e as "e.png" and saved in my working directory. Then some processing:
img <- png::readPNG("e.png")
# From http://stackoverflow.com/questions/16496210/rotate-a-matrix-in-r
rotate <- function(x) t(apply(x, 2, rev))
# Convert to one colour layer and rotate it to be in right direction
img <- rotate(img[,,1])
# Check that matrix makes sense:
image(img)
Next, create a whole lot of circles! I did this based on this post.
# Create random "circles"
# *** THESE VALUES MAY NEED ADJUSTING
ncircles <- 1200
offset <- 100
rmax <- 80
x_limits <- c(-offset, ncol(img) + offset)
y_limits <- c(-offset, nrow(img) + offset)
xyr <- data.frame(
x = runif(ncircles, min(x_limits), max(x_limits)),
y = runif(ncircles, min(y_limits), max(y_limits)),
r = rbeta(ncircles, 1, 10) * rmax)
# Find non-overlapping arrangement
res <- circleLayout(xyr, x_limits, y_limits, maxiter = 1000)
cat(res$niter, "iterations performed")
#> 1000 iterations performed
# Convert to data for plotting (just circles for now)
plot_d <- circlePlotData(res$layout)
# Check circle arrangement
ggplot(plot_d) +
geom_polygon(aes(x, y, group=id), colour = "white", fill = "skyblue") +
coord_fixed() +
theme_minimal()
Finally, interpolate the image pixel values for the centre of each circle. This will indicate whether a circle is centered over the shape or not. Add some noise to get variance in colour and plot.
# Get x,y positions of centre of each circle
circle_positions <- plot_d %>%
group_by(id) %>%
summarise(x = min(x) + (diff(range(x)) / 2),
y = min(y) + (diff(range(y)) / 2))
# Interpolate on original image to get z value for each circle
circle_positions <- circle_positions %>%
mutate(
z = fields::interp.surface(
list(x = seq(nrow(img)), y = seq(ncol(img)), z = img),
as.matrix(.[, c("x", "y")])),
z = ifelse(is.na(z), 1, round(z)) # 1 is the "empty" area shown earlier
)
# Add a little noise to the z values
set.seed(070516)
circle_positions <- circle_positions %>%
mutate(z = z + rnorm(n(), sd = .1))
# Bind z value to data for plotting and use as fill
plot_d %>%
left_join(select(circle_positions, id, z)) %>%
ggplot(aes(x, y, group = id, fill = z)) +
geom_polygon(colour = "white", show.legend = FALSE) +
scale_fill_gradient(low = "#008000", high = "#ff4040") +
coord_fixed() +
theme_void()
#> Joining, by = "id"
To get the colours right, tweak the low and high values in scale_fill_gradient().

binning geom_boxplot in ggplot2 in R?

I want to use geom_boxplot to make boxplots that correlate two variables: for each bin of x values, plot the distribution (as boxplot) of y values for that bin. I tried:
ggplot(cars) + geom_boxplot(aes(x=dist, y=speed))
but this creates basically one large bin of x values. How can I make it so for each bin of dist, there's a boxplot representing the corresponding speed values?
Not sure what you mean by "bin", since you haven't provided any bins in your question. If you just mean that you would like a speed boxplot for each unique dist value, you can do it like this (treating dist as discrete):
ggplot(cars) + geom_boxplot(aes(factor(dist), speed))
If you were to actually create bins you could do something like:
cars$bin <- cut(cars$dist, c(1, 10, 30, 50, 200))
ggplot(cars) + geom_boxplot(aes(bin, speed))
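To see which bin a given value lands in, cut() can be tried on a few numbers first. A small sketch using the same break points; `dist_values` is an illustrative vector, not part of the cars data:

```r
# Same break points as above; intervals are right-closed by default
breaks <- c(1, 10, 30, 50, 200)
dist_values <- c(2, 10, 26, 120)  # a few values within the range of cars$dist

bins <- cut(dist_values, breaks)
as.character(bins)
# "(1,10]"   "(1,10]"   "(10,30]"  "(50,200]"
```

Note that a value equal to the lowest break (here 1) would become NA unless include.lowest = TRUE is passed to cut().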
Just to put it out there, you could also do
bin_size <- 10
cars %>%
mutate(bin_dist = factor(dist %/% bin_size * bin_size)) %>% # lower edge of each bin
ggplot(aes(x = bin_dist, y = speed)) +
geom_boxplot()
And to make the labeling better:
(cars2 <- cars %>%
mutate(bin_dist = dist %/% bin_size * bin_size)) %>%
ggplot(aes(x = factor(bin_dist), y = speed)) +
geom_boxplot() +
# sort the bin edges so the labels line up with the sorted factor levels
scale_x_discrete(labels = paste0(sort(unique(cars2$bin_dist)), "-", sort(unique(cars2$bin_dist)) + bin_size)) +
labs(x = "dist")
cars2 gets saved so it can work in paste0.
