How to add a gradient fill to a geom_density chart - r

I have a dataset where I'd like to plot the density of one column and add a gradient fill that is associated with another column.
For example, this code creates the following plot
library(datasets)
library(tidyverse)
df <- airquality
df %>%
group_by(Temp) %>%
mutate(count = n(),
avgWind = mean(Wind)) %>%
ggplot(aes(x = Temp, fill = avgWind)) +
geom_density()
What I'd like is for the plot to have a gradient fill that indicates what the average wind (avgWind) was at each temperature along the x-axis.
I've seen some examples that allow me to create a gradient fill that is associated with the values on the x-axis itself (in this case, Temp) or by percentile/quantiles, but I'd like the gradient fill to be associated with an additional variable.
It's sort of like this, but instead of a bar plot, I'd like to keep it as a smoothed density chart:
df %>%
group_by(Temp) %>%
mutate(count = n(),
avgWind = mean(Wind)) %>%
ggplot(aes(x = (Temp), fill = avgWind, group = Temp)) +
geom_bar(aes(y = (..count..)/sum(..count..)))

geom_polygon can't vary the fill across a single polygon, so the usual workaround is to draw many thin line segments, one per x value. For example, you could do something like this:
library("datasets")
library("tidyverse")
library("viridis")
df <- airquality
df <- df %>%
group_by(Temp) %>%
mutate(count = n(), avgWind = mean(Wind))
## Since we (presumably) want a continuous fill, we need to interpolate to
## get avgWind at each Temp value.
## The edges come out grey because the KDE extends past the observed Temp
## range, where we don't know the relationship between Temp and avgWind (NA).
d2fun <- approxfun(df$Temp, df$avgWind)
#> Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
#> 'x' values
dens <- density(df$Temp)
dens_df <- data.frame(x = dens$x, y = dens$y, fill = d2fun(dens[["x"]]))
ggplot(dens_df) +
geom_segment(aes(x = x, xend = x, y = 0, yend = y, color = fill)) +
scale_color_viridis()
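The same idea can be sketched with the per-temperature means computed first, so that approxfun never sees duplicate x values and the warning above does not appear. This is only a sketch, not the answerer's code: wind_by_temp and wind_at are illustrative names, and scale_colour_viridis_c (part of ggplot2) is used here in place of the viridis package.

```r
# Mean Wind for each distinct Temp (base R), so x values are unique
wind_by_temp <- aggregate(Wind ~ Temp, data = airquality, FUN = mean)

# Linear interpolation; rule = 1 returns NA outside the observed Temp range
wind_at <- approxfun(wind_by_temp$Temp, wind_by_temp$Wind, rule = 1)

# Kernel density of Temp, evaluated on density()'s default grid
dens <- density(airquality$Temp)
dens_df <- data.frame(x = dens$x, y = dens$y, fill = wind_at(dens$x))

# Build the plot only if ggplot2 is installed; print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dens_df) +
    geom_segment(aes(x = x, xend = x, y = 0, yend = y, colour = fill)) +
    scale_colour_viridis_c(name = "avgWind")
}
```

Rows of dens_df that fall outside the observed Temp range keep fill = NA, which is why the tails are drawn in the scale's NA colour.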

Related

R: ggplot, filling density plot with different colors around the mean value

Problem
I am trying to fill the density plot with different colors around the mean. For example, the left part of density plot from a vertical line of the mean will be filled with blue, and the right part with red. I tried the below method with three facets. Within each facet, by setting fill = color, it separates the plot into two density plots around the mean. I want to have only one plot filled with two colors. Can I get some help here?
Sample Data and Current Method
library(tidyverse)
library(tibble)
library(data.table)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10002)
id <- sample(1:1000, 1000, replace=F)
set.seed(10003)
group <- sample(c('A','B','C'), 1000, replace=T)
set.seed(10001)
value1 <- sample(1:300, 1000, replace=T)
set.seed(10004)
value2 <- sample(1:300, 1000, replace=T)
sample <-
data.frame(id, group, value1, value2)
mu <-
sample %>%
gather(state, value, -group, -id) %>%
plyr::ddply(c("group"), plyr::summarise, grp.mean = mean(value)) # ddply is from plyr, which is not loaded above
p <-
sample %>%
gather(state, value, -group, -id) %>%
left_join(
mu,
by = 'group'
) %>%
distinct %>%
mutate(color = ifelse(value <= grp.mean, 'leq', 'greater')) %>%
select(-grp.mean) %>%
ggplot(aes(x = value, fill = color)) +
geom_density(alpha=0.4) +
geom_vline(
data = mu,
aes(xintercept = grp.mean, color = group),
linetype = "dashed"
) +
facet_wrap(.~group)
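One possible approach, sketched here under assumptions, is to estimate the density once and then split the resulting curve at the mean, instead of splitting the data before the density is estimated. The example uses a single illustrative group (no facets), and curve_df and side are hypothetical names.

```r
# Illustrative data: one group only (the question facets by group)
set.seed(10001)
value <- sample(1:300, 1000, replace = TRUE)
mu <- mean(value)

# Estimate the density once, then label each point of the curve by side
dens <- density(value)
curve_df <- data.frame(x = dens$x, y = dens$y)
curve_df$side <- ifelse(curve_df$x <= mu, "leq", "greater")

# Build the plot only if ggplot2 is installed; print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(curve_df, aes(x = x, ymin = 0, ymax = y, fill = side)) +
    geom_ribbon(alpha = 0.4) +   # ribbons are not stacked, unlike geom_area
    geom_vline(xintercept = mu, linetype = "dashed")
}
```

For the faceted version, the same steps would presumably be repeated per group and the pieces combined with a group column. The two ribbons can leave a gap one grid step wide at the mean.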

Barplot with set x, y and fill categories R

I have the following type of data and would like to create a stacked bar plot showing the sum of Number on the y axis for different bins of Distance on the x axis. In effect this is a histogram, but with the sum of Number per bin on the y axis instead of frequencies. Each bar should be stacked by the categories in Dest, marked with different colours.
Thanks so much.
library(ggplot2)
df <- data.frame(c(rep("A",20),rep("B",25),rep("C",35)),sample(1:30, 80,replace = TRUE),
rnorm(80,45,8))
colnames(df) <- c("Dest","Number","Distance")
ggplot(data = df, aes(x = Distance, y = Number, fill = Dest)) +
geom_histogram(colour = c("red","blue","green"))
Here are 2 solutions in case you want to specify the (Distance) bins yourself instead of letting the histogram choose them:
Option 1 (using ntile)
Here's a solution that allows you to specify the number of bins using ntile, which means that those bins will have more or less the same number of observations:
library(tidyverse)
df <- data.frame(c(rep("A",20),rep("B",25),rep("C",35)),sample(1:30, 80,replace = TRUE),
rnorm(80,45,8))
colnames(df) <- c("Dest","Number","Distance")
df %>%
group_by(bin = ntile(Distance, 3)) %>% # specify number of bins you want
mutate(DistRange = paste0(round(min(Distance)), " - ", round(max(Distance)))) %>%
ungroup() %>%
group_by(Dest, bin, DistRange = fct_reorder(DistRange, bin)) %>%
summarise(sum_number = sum(Number)) %>%
ungroup() %>%
ggplot(aes(DistRange, sum_number, fill=Dest))+
geom_col()
Option 2 (using cut)
An alternative option using cut to specify ranges:
df %>%
mutate(bin = cut(Distance, breaks = c(min(Distance)-1, 40, 50, 55, max(Distance)))) %>% # specify ranges
group_by(Dest, bin) %>%
summarise(sum_number = sum(Number)) %>%
ungroup() %>%
ggplot(aes(bin, sum_number, fill=Dest))+
geom_col()
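To see what cut does with its breaks, and why the answer uses min(Distance)-1 as the lowest break, here is a small base R sketch with made-up numbers (distance and number are illustrative vectors, not the question's data):

```r
# Illustrative values only
distance <- c(31.2, 40, 44.5, 50, 52.3, 61.8)
number   <- c(1, 2, 3, 4, 5, 6)

# Intervals are right-closed by default, so 40 falls in (30,40]
bins <- cut(distance, breaks = c(30, 40, 50, 55, 62))
levels(bins)
# "(30,40]" "(40,50]" "(50,55]" "(55,62]"

# Sum of number per bin, the quantity plotted on the y axis
sums <- tapply(number, bins, sum)
as.vector(sums)
# 3 7 5 6

# A value equal to the lowest break is NOT included unless asked for
is.na(cut(30, breaks = c(30, 40)))                           # TRUE
as.character(cut(30, breaks = c(30, 40), include.lowest = TRUE))  # "[30,40]"
```

Starting the breaks at min(Distance)-1 is one way to keep the smallest observation inside the first bin; include.lowest = TRUE is another.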

How to normalize different curves drawn with geom = "step" when using stat_summary

Here is my code. The data set is artificially generated to simulate data similar to my actual problem.
Code:
library(ggplot2)
DataSet1 <- data.frame("Cat" = rep("A",10000), "Bin" = rep(c(-49:50),100),
"Value" = c(seq(0,4.9, by=0.1),
seq(4.9,0, by=-0.1)) * rep(rnorm(100,50,1),100))
DataSet2 <- data.frame("Cat" = rep("B",10000), "Bin" = rep(c(-49:50),100),
"Value" = c(seq(0,4.9, by=0.1),
seq(4.9,0, by=-0.1)) * rep(rnorm(100,75,1),100))
DataSet3 <- data.frame("Cat" = rep("C",10000), "Bin" = rep(c(-49:50),100),
"Value" = c(seq(0,4.9, by=0.1),
seq(4.9,0, by=-0.1)) * rep(rnorm(100,100,1),100))
DataSet <- rbind(DataSet1, DataSet2, DataSet3)
d <- ggplot(data = DataSet, aes(Bin, Value, color = Cat))
d + stat_summary(fun = sum, geom = 'step', size = 1) # fun.y was renamed to fun in ggplot2 3.3.0
My result:
What I want to do:
Normalize each of these plots, i.e., divide the sum at each bin width by the total Value for that curve.
As far as I am aware, stat_summary is not meant to operate over all values of x and y simultaneously, so this type of per-group summary isn't possible strictly within ggplot. In cases such as this, it's usually best to compute your summary ahead of time and then plot that. Using dplyr to make summarization easy:
library(dplyr)
DataSet <- DataSet %>%
group_by(Cat, Bin) %>%
summarize(Value = sum(Value)) %>%
group_by(Cat) %>%
mutate(Value = Value / sum(Value))
d <- ggplot(data = DataSet, aes(Bin, Value, color = Cat))
d + stat_summary(fun = mean, geom = 'step', size = 1) # fun.y was renamed to fun in ggplot2 3.3.0
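The two-stage summary (sum per Cat and Bin, then divide by the per-Cat total) can be checked on a tiny example. This sketch uses base R instead of dplyr, and toy, sums and Share are illustrative names:

```r
# Tiny made-up data set: two categories, two bins, two rows per cell
toy <- data.frame(
  Cat   = rep(c("A", "B"), each = 4),
  Bin   = rep(c(1, 1, 2, 2), times = 2),
  Value = c(1, 2, 3, 4,   5, 5, 20, 10)
)

# Stage 1: sum of Value within each Cat/Bin cell
sums <- aggregate(Value ~ Cat + Bin, data = toy, FUN = sum)

# Stage 2: divide each cell by the total of its own category
sums$Share <- sums$Value / ave(sums$Value, sums$Cat, FUN = sum)

# Each category's shares now add up to 1
tapply(sums$Share, sums$Cat, sum)
```

Here category A ends up with shares 0.3 and 0.7, and category B with 0.25 and 0.75. Once the data has one row per Cat and Bin, stat_summary has nothing left to summarise, so a plain geom_step() layer should draw the same curves.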

R - ggplot2 contour plot

I am trying to replicate the code from Andrew Ng's Machine Learning course on Coursera in R (as the course is in Octave).
Basically I have to plot a non linear decision boundary (at p = 0.5) for a polynomial regularized logistic regression.
I can easily replicate the plot with the base library:
contour(u, v, z, levels = 0)
points(x = data$Test1, y = data$Test2)
where:
u <- v <- seq(-1, 1.5, length.out = 100)
and z is a matrix 100x100 with the values of z for every point of the grid.
Dimension of data is 118x3.
I cannot do it in ggplot2. Does somebody know how to replicate the same in ggplot2? I tried with:
z = as.vector(t(z))
ggplot(data, aes(x = Test1, y = Test2)) +
geom_contour(aes(x = u, y = v, z = z))
But I get the error: Aesthetics must be either length 1 or the same as the data (118): colour, x, y, shape
Thanks.
EDIT (Adding plot created from code of missuse):
What you need is to convert the coordinates into long format. Here is an example using the volcano data set:
data(volcano)
in base R:
contour(volcano)
with ggplot2:
library(tidyverse)
as.data.frame(volcano) %>% #convert the matrix to data frame
rownames_to_column() %>% #get row coordinates
gather(key, value, -rowname) %>% #convert to long format
mutate(key = as.numeric(gsub("V", "", key)), #convert the column names to numbers
rowname = as.numeric(rowname)) %>%
ggplot() +
geom_contour(aes(x = rowname, y = key, z = value))
If you would like to label the contours directly, as in the base R plot, you can use the directlabels package.
First map the color/fill to a variable:
as.data.frame(volcano) %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
mutate(key = as.numeric(gsub("V", "", key)),
rowname = as.numeric(rowname)) %>%
ggplot() +
geom_contour(aes(x = rowname,
y = key,
z = value,
colour = ..level..)) -> some_plot
and then
library(directlabels)
direct.label(some_plot, list("far.from.others.borders", "calc.boxes", "enlarge.box",
box.color = NA, fill = "transparent", "draw.rects"))
To add markers at specific coordinates, you just need to add another layer with appropriate data.
Starting from the previous plot:
as.data.frame(volcano) %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
mutate(key = as.numeric(gsub("V", "", key)),
rowname = as.numeric(rowname)) %>%
ggplot() +
geom_contour(aes(x = rowname, y = key, z = value)) -> plot_cont
add a layer with points, for instance:
plot_cont +
geom_point(data = data.frame(x = c(35, 47, 61),
y = c(22, 37, 15)),
aes(x = x, y = y), color = "red")
You can add any type of layer this way: geom_line and geom_text, to name a few.
EDIT2: There are several options for changing the scale of the axes. One is to assign appropriate rownames and colnames to the matrix.
I will assign a sequence from 0 to 2 for the x axis and from 0 to 5 for the y axis:
rownames(volcano) <- seq(from = 0,
to = 2,
length.out = nrow(volcano)) #or some vector like u
colnames(volcano) <- seq(from = 0,
to = 5,
length.out = ncol(volcano)) #or some vector like v
as.data.frame(volcano) %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
mutate(key = as.numeric(key),
rowname = as.numeric(rowname)) %>%
ggplot() +
geom_contour(aes(x = rowname, y = key, z = value))
ggplot2 works most efficiently with data in long format. Here's an example with fake data:
library(tidyverse)
u <- v <- seq(-1, 1.5, length.out = 100)
# Generate fake data
z = outer(u, v, function(a, b) sin(2*a^3)*cos(5*b^2))
rownames(z) = u
colnames(z) = v
# Convert data to long format and plot
as.data.frame(z) %>%
rownames_to_column(var="row") %>%
gather(col, value, -row) %>%
mutate(row=as.numeric(row),
col=as.numeric(col)) %>%
ggplot(aes(col, row, z=value)) +
geom_contour(bins=20) +
theme_classic()
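When the grid vectors u and v are already known, as in the question, the long format can also be built directly with expand.grid, without going through row and column names. This is a sketch under that assumption, reusing the fake z from above; grid is an illustrative name, and breaks = 0 is meant to mirror levels = 0 in the base contour call.

```r
u <- v <- seq(-1, 1.5, length.out = 100)

# Same fake surface as above: rows of z follow u, columns follow v
z <- outer(u, v, function(a, b) sin(2 * a^3) * cos(5 * b^2))

# expand.grid varies its first argument fastest, which matches the
# column-major order of as.vector(z), so no transpose is needed
grid <- expand.grid(u = u, v = v)
grid$z <- as.vector(z)

# Build the plot only if ggplot2 is installed; print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(grid, aes(x = u, y = v, z = z)) +
    geom_contour(breaks = 0)   # single contour line at z = 0
}
```

grid has 100 x 100 = 10000 rows, one per grid point. The original attempt passed x and y vectors of length 100 alongside a z of length 10000 and a data frame of 118 rows, and that length mismatch is what the error message is complaining about.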

Changing line color in ggplot based on "several factors" slope

UPDATED:
I have the following data, and I would like to draw lines between the groups, coloured according to the slope between the 3 factor levels ("I", "II", "III").
set.seed(205)
dat = data.frame(t=rep(c("I","II","III"), each=10),
pairs=rep(1:10,3),
value=rnorm(30),
group=rep(c("A","B"), 15))
I have tried the following, but I cannot manage to change the color of the lines connecting "I" - "III" and "II" - "III":
ggplot(dat %>% group_by(pairs) %>%
mutate(slope = (value[t=="II"] - value[t=="I"])/( value[t=="II"])- value[t=="I"]),
aes(t, value, group=pairs, linetype=group, colour=slope > 0)) +
geom_point() +
geom_line()
This is a very similar issue to
Changing line color in ggplot based on slope
I hope I was able to explain my problem.
We can split apart the data, and get what you want:
#calculate slopes for I and II
dat %>%
filter(t != "III") %>%
group_by(pairs) %>%
# use diff to calculate slope
mutate(slope = diff(value)) -> dat12
#calculate slopes for II and III
dat %>%
filter(t != "I") %>%
group_by(pairs) %>%
# use diff to calculate slope
mutate(slope = diff(value)) -> dat23
ggplot()+
geom_line(data = dat12, aes(x = t, y = value, group = pairs, colour = slope > 0,
linetype = group))+
geom_line(data = dat23, aes(x = t, y = value, group = pairs, colour = slope > 0,
linetype = group))+
theme_bw()
Since the data in dat came sorted by t, I used diff to calculate the slope.
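An alternative that does not rely on splitting the data is to build one row per segment, pairing each value with the next one in the same pair, and to draw the result with geom_segment. This is a sketch, not the answerer's method; t_num, value_next, segs and rising are illustrative names.

```r
set.seed(205)
dat <- data.frame(t = rep(c("I", "II", "III"), each = 10),
                  pairs = rep(1:10, 3),
                  value = rnorm(30),
                  group = rep(c("A", "B"), 15))

# Numeric position of each level, then sort so levels are in order per pair
dat$t_num <- match(dat$t, c("I", "II", "III"))
dat <- dat[order(dat$pairs, dat$t_num), ]

# Value at the next level within the same pair (NA for the last level)
dat$value_next <- ave(dat$value, dat$pairs,
                      FUN = function(v) c(v[-1], NA))

# One row per segment: I -> II and II -> III for every pair
segs <- dat[!is.na(dat$value_next), ]
segs$rising <- segs$value_next > segs$value

# Build the plot only if ggplot2 is installed; print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(segs) +
    geom_segment(aes(x = t_num, xend = t_num + 1,
                     y = value, yend = value_next,
                     colour = rising, linetype = group)) +
    scale_x_continuous(breaks = 1:3, labels = c("I", "II", "III")) +
    theme_bw()
}
```

Because the rows are sorted explicitly, this version does not depend on dat arriving sorted by t.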
