Colour segments of density plot by bin - r

Warning, I am brand-new to R!
I have the R bug and am having a play with the possibilities, but I am getting very lost. I want to colour segments of a density plot using a '>' condition to define the bins. In my head it looks like:
...but not quartile or % change dependent.
My data shows x = duration (number of days) and y = frequency. I would like the plot's colour to split at 3-month intervals up to 12 months, with one colour after that (using working days, i.e. 63 = 3 months).
I have had a go, but I'm really not sure where to start!
ggplot(df3, aes(x=Investigation.Duration))+
geom_density(fill = W.S_CleanNA$Investigation.Duration[W.S_CleanNA$Investigation.Duration>0],
fill = W.S_CleanNA$Investigation.Duration[W.S_CleanNA$Investigation.Duration>63], color = "white",
fill = W.S_CleanNA$Investigation.Duration[W.S_CleanNA$Investigation.Duration>127], color = "light Grey",
fill = W.S_CleanNA$Investigation.Duration[W.S_CleanNA$Investigation.Duration>190], color = "medium grey",
fill = W.S_CleanNA$Investigation.Duration[W.S_CleanNA$Investigation.Duration>253], color = "dark grey",
fill = W.S_CleanNA$Investigation.Duration[W.S_CleanNA$Investigation.Duration>506], color = "black")+
ggtitle ("Investigation duration distribution in 'Wales' complexity sample")+
geom_text(aes(x=175, label=paste0("Mean, 136"), y=0.0053))+
geom_vline(xintercept = c(136.5), color = "red")+
geom_text(aes(x=80, label=paste0("Median, 129"), y=0.0053))+
geom_vline(xintercept = c(129.5), color = "blue")
Any really simple help much appreciated.

Unfortunately, you can't do this directly with geom_density, as "under the hood" it is built with a single polygon, and a polygon can only have a single fill. The only way to do this is to have multiple polygons, and you need to build them yourself.
Fortunately, this is easier than it sounds.
There was no sample data in the question, so we will create a plausible distribution with the same median and mean:
# Simulate data
set.seed(69)
df3 <- data.frame(Investigation.Duration = rgamma(1000, 5, 1/27.7))
round(median(df3$Investigation.Duration))
#> [1] 129
round(mean(df3$Investigation.Duration))
#> [1] 136
# Get the density as a data frame
dens <- density(df3$Investigation.Duration)
dens <- data.frame(x = dens$x, y = dens$y)
# Exclude the artefactual times below zero
dens <- dens[dens$x > 0, ]
# Split into bands of 3 months and group > 12 months together
dens$band <- dens$x %/% 63
dens$band[dens$band > 3] <- 4
# This is the complex bit. For each band we want to add a point on
# the x axis at the upper and lower time limits:
dens <- do.call("rbind", lapply(split(dens, dens$band), function(df) {
  df <- rbind(df[1, ], df, df[nrow(df), ])
  df$y[c(1, nrow(df))] <- 0
  df
}))
Now we have the polygons, it's just a case of drawing and labelling appropriately:
library(ggplot2)
ggplot(dens, aes(x, y)) +
  geom_polygon(aes(fill = factor(band), color = factor(band))) +
  theme_minimal() +
  scale_fill_manual(values = c("#003f5c", "#58508d", "#bc5090",
                               "#ff6361", "#ffa600"),
                    name = "Time",
                    labels = c("Less than 3 months",
                               "3 to 6 months",
                               "6 to 9 months",
                               "9 to 12 months",
                               "Over 12 months")) +
  scale_colour_manual(values = c("#003f5c", "#58508d", "#bc5090",
                                 "#ff6361", "#ffa600"),
                      guide = guide_none()) +
  labs(x = "Days since investigation started", y = "Density") +
  ggtitle("Investigation duration distribution in 'Wales' complexity sample") +
  geom_text(aes(x = 175, label = paste0("Mean, 136"), y = 0.0053),
            check_overlap = TRUE) +
  geom_vline(xintercept = c(136.5), linetype = 2) +
  geom_text(aes(x = 80, label = paste0("Median, 129"), y = 0.0053),
            check_overlap = TRUE) +
  geom_vline(xintercept = c(129.5), linetype = 2)
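As a variation on the banding step, here is a minimal sketch that uses cut() with explicit break points instead of integer division. It assumes the same simulated data and the 63-working-days-per-quarter convention from the question; the `breaks` and `labels` vectors are illustrative names, not part of the answer above. The bands come out as a labelled factor, which should let the legend pick up the labels directly.

```r
# Assumed band edges in working days (63 days = 3 months)
breaks <- c(0, 63, 126, 189, 252, Inf)
labels <- c("Less than 3 months", "3 to 6 months", "6 to 9 months",
            "9 to 12 months", "Over 12 months")

# Same simulated data as above
set.seed(69)
duration <- rgamma(1000, 5, 1 / 27.7)

# Density as a data frame, dropping the artefactual times below zero
d <- density(duration)
dens <- data.frame(x = d$x, y = d$y)
dens <- dens[dens$x > 0, ]

# right = FALSE gives intervals [0, 63), [63, 126), ... like x %/% 63
dens$band <- cut(dens$x, breaks = breaks, labels = labels, right = FALSE)

# Number of density grid points that fall in each band
table(dens$band)
```

The closing step (adding a zero-height point at each end of every band) would still be needed before passing the result to geom_polygon().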

Related

Problem with colouring a GG Plot Histogram

I've got an issue with colouring a ggplot2 histogram.
R chunk:
ggplot(Hospital, aes(x=BodyTemperature)) +
geom_histogram(aes(fill = factor(BodyTemperature))) +
scale_x_continuous(breaks = seq(0, 100, by = 10)) +
ylab("prevalence") +
xlab("BodyTemperature") +
ggtitle("Temperature vs. prevalence")
So the histogram should show (along the x-axis) that the higher the temperature gets, the worse it is. For example, a temperature of 36°C should be green, 38°C yellow and 40°C red, going from left to right on the x-axis.
The y-axis should show how often these temperatures occur in the hospital's patient data. "BodyTemperature" is a list of 200+ values like "35.3" or "37.4".
How can this chunk be fixed to provide the colour changes? For a non-ggplot version I've already written this R chunk successfully:
```{r, fig.width=8}
color1 <- rep(brewer.pal(1, "Greens"))
color2 <- rep("#57c4fa", 0)
color3 <- brewer.pal(8, "Reds")
hist(Hospital$BodyTemperature[-357],
breaks = seq(from = 0, to = 100, by = 10),
main = "Temperature vs. prevalence",
ylab = "prevalence",
xlab = "Temperature",
col = c(color1, color2, color3))
```
The key is to make sure the bin intervals used for the fill scale match those used for the x axis. You can do this by setting the binwidth argument to geom_histogram(), and using ggplot2::cut_width() to break BodyTemperature into the same bins for the fill scale:
set.seed(13)
library(ggplot2)
# example data
Hospital <- data.frame(BodyTemperature = 36.5 + rchisq(100, 2))
ggplot(Hospital, aes(BodyTemperature)) +
geom_histogram(
aes(fill = cut_width(BodyTemperature, width = 1)),
binwidth = 1,
show.legend = FALSE
) +
scale_fill_brewer(palette = "RdYlGn", direction = -1) +
labs(
title = "Temperature vs. Prevalence",
x = "Body Temperature (°C)",
y = "Prevalence"
) +
theme_minimal()
Created on 2022-10-24 with reprex v2.0.2
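If fixed colours per temperature range are wanted (green, yellow, red, as in the question) rather than one colour per bin, a sketch along the following lines may help. The thresholds 37.5 and 39.5, the `severity` column and the colour names are assumptions chosen for illustration. The point is the same as above: the histogram bin edges are aligned with the cut points (via `boundary`), so no bar straddles two colour classes.

```r
library(ggplot2)

set.seed(13)
# Example data, as in the answer above
Hospital <- data.frame(BodyTemperature = 36.5 + rchisq(100, 2))

# Hypothetical thresholds; cut() uses right-closed intervals (a, b] by default,
# which matches geom_histogram()'s default bin closure
Hospital$severity <- cut(Hospital$BodyTemperature,
                         breaks = c(-Inf, 37.5, 39.5, Inf),
                         labels = c("normal", "elevated", "high"))

# binwidth 0.5 with a bin boundary at 37.5 puts bin edges on both thresholds
p <- ggplot(Hospital, aes(BodyTemperature, fill = severity)) +
  geom_histogram(binwidth = 0.5, boundary = 37.5) +
  scale_fill_manual(values = c(normal = "green3",
                               elevated = "gold",
                               high = "red")) +
  labs(x = "Body Temperature (°C)", y = "Prevalence")
p
```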

Different objects are not showing up on my ggplot2

I'm studying the returns to college admission for marginal students, and I'm trying to make a ggplot2 plot of the following data: the average salaries of students who did or didn't finish their master's in medicine, and the average 'GPA' (foreign equivalent) distance to the 'acceptance score':
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0,2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
I have to do a Regression Discontinuity Design (RDD), so to run the regression, as far as I understand it, I have to convert DistanceGrades to numeric, so I just created a variable z
z <- -5:4
where 0 is the cutoff (ie. 0 is equal to "0.0" in DistanceGrades).
I then make a dataframe
df <- data.frame(z,SalaryAfter)
Now my attempt to create the plot gets a bit messy (I use the package 'fpp3', but I suppose it only needs the ggplot2 and maybe dplyr packages)
df %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0))) %>%
ggplot(aes(x = z, y = SalaryAfter, color = D)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
xlim(-6,5) +
xlab("Distance to acceptance score") +
labs(title = "Figur 1.1", subtitle = "Salary for every distance to the acceptance score")
Which plots:
What I'm trying to do is, firstly, split the data with a dummy variable D=1 if z>0 and D=0 if z<0. Then I plot it with a linear regression and a vertical line at z=0. Lastly, I write the title and subtitle. Now I have two problems:
The x axis displays -5, -2.5, ... but I would like it to show all the integers; the non-integer values have no relation to the z variable, which is discrete. I have tried to fix this with several different methods (theme(panel.grid...), scale_x_discrete and many more; I can't remember them all), but the outcomes have all been similar: the x-axis is removed completely, so that there are no numbers, and sometimes even the axis title disappears.
I would like the regression channel for the first part of the data to extend to z=0.
When I try to solve both of these problems I again get similar results: most of the things I try do not produce an error message when I run the code, but they either do nothing to my plot or remove some of the existing elements, which leaves me full of questions. I suppose the problem is caused by some of the elements not working together, but I have no idea.
Try this:
library(tidyverse)
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0,2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
z <- -5:4
df <- data.frame(z,SalaryAfter) %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0)))
# Fit a lm model for the left part of the panel
fit_data <- lm(SalaryAfter~z, data = filter(df, z <= -0.1)) %>%
predict(., newdata = data.frame(z = seq(-5, 0, 0.1)), interval = "confidence") %>%
as.data.frame() %>%
mutate(z = seq(-5, 0, 0.1), D = factor(0, levels = c(0, 1)))
# Plot
ggplot(mapping = aes(color = D)) +
geom_ribbon(data = filter(fit_data, z <= 0 & -1 <= z),
aes(x = z, ymin = lwr, ymax = upr),
fill = "grey70", color = "transparent", alpha = 0.5) +
geom_line(data = fit_data, aes(x = z, y = fit), size = 1) +
geom_point(data = df, aes(x = z, y = SalaryAfter), stat = "identity") +
geom_smooth(data = df, aes(x = z, y = SalaryAfter), method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
scale_x_continuous(limits = c(-6, 5), breaks = -6:5) +
xlab("Distance to acceptance score") +
labs(title = "Figure 1.1", subtitle = "Salary for every distance to the acceptance score")
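The part that extends the left-hand fit to the cutoff can be looked at on its own, without any plotting. The following base R sketch uses the data from the question; `left_fit`, `grid` and `pred` are illustrative names, and the grid step of 0.5 is an arbitrary choice. It shows that predict() with interval = "confidence" returns the fit, lwr and upr columns that the ribbon and line layers rely on.

```r
# Data from the question
SalaryAfter <- c(287.780, 305.181, 323.468, 339.082, 344.738,
                 370.475, 373.257, 372.682, 388.939, 386.994)
df <- data.frame(z = -5:4, SalaryAfter = SalaryAfter)

# Fit only on the observations left of the cutoff
left_fit <- lm(SalaryAfter ~ z, data = df[df$z < 0, ])

# Predict on a grid that runs all the way to the cutoff at z = 0
grid <- data.frame(z = seq(-5, 0, by = 0.5))
pred <- cbind(grid,
              predict(left_fit, newdata = grid, interval = "confidence"))

head(pred)           # columns: z, fit, lwr, upr
pred[pred$z == 0, ]  # extrapolated value and interval at the cutoff
```

Because z = 0 lies outside the data used for the left-hand fit, the last row is an extrapolation, and its interval is wider than those inside the fitted range.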

Divide the y axis so that the part with a score <25 occupies the majority in ggplot

I want to divide the y axis of the attached figure so that the part with a score <25 occupies the majority of the figure, while the remainder occupies a minor upper part.
I searched for this and am aware that I should use scale_y_discrete(limits. I used p <- p + scale_y_continuous(breaks = 1:20, labels = c(1:20, "//", 40:100)) but it doesn't work yet.
I used the attached data and this is my code:
p<-ggscatter(data, x = "Year" , y = "Score" ,
color = "grey", shape = 21, size = 3, # Points color, shape and size
add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
add = "loess", #reg.line
conf.int = T,
cor.coef = F, cor.method = "pearson",
xlab = "Year" , ylab= "Score")
p<-p+ coord_cartesian(xlim = c(1980, 2020));p
Here is as close as I could get to a fake axis break with a resized upper area of the plot. I still think it's a bad idea, and if this were my plot I'd much prefer a more straightforward axis transform.
First, we'd need a function that generates a transform that squeezes all values above some threshold:
library(ggplot2)
library(scales)
# Define new transform
my_transform <- function(threshold = 25, squeeze_factor = 10) {
  force(threshold)
  force(squeeze_factor)
  my_transform <- trans_new(
    name = "trans_squeeze",
    transform = function(x) {
      ifelse(x > threshold,
             ((x - threshold) * (1 / squeeze_factor)) + threshold,
             x)
    },
    inverse = function(x) {
      ifelse(x > threshold,
             ((x - threshold) * squeeze_factor) + threshold,
             x)
    }
  )
  return(my_transform)
}
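The squeeze and its inverse can be checked on plain numbers without ggplot2 or scales. This base R sketch re-implements the two formulas from my_transform() as standalone functions (`squeeze` and `unsqueeze` are illustrative names), using the same threshold of 25 and squeeze factor of 10; the `scores` vector is made up.

```r
threshold <- 25
squeeze_factor <- 10

# Forward transform: compress everything above the threshold
squeeze <- function(x) {
  ifelse(x > threshold, (x - threshold) / squeeze_factor + threshold, x)
}

# Inverse transform: stretch compressed values back out
unsqueeze <- function(x) {
  ifelse(x > threshold, (x - threshold) * squeeze_factor + threshold, x)
}

scores <- c(0, 10, 25, 35, 75, 100)
squeeze(scores)
#> [1]  0.0 10.0 25.0 26.0 30.0 32.5

# The round trip recovers the original scores
all.equal(unsqueeze(squeeze(scores)), scores)
#> [1] TRUE
```

Scores from 25 to 100 end up in the range 25 to 32.5, which is why the region below 25 takes up most of the axis.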
Next we apply that transformation to the y-axis and add a fake axis break. I've used vanilla ggplot2 code as I find the ggscatter() approach confusing.
ggplot(data, aes(Year, Score)) +
geom_point(color = "grey", shape = 21, size = 3) +
geom_smooth(method = "loess", fill = "lightgray") +
# Add fake axis lines
annotate("segment", x = -Inf, xend = -Inf,
y = c(-Inf, Inf), yend = c(24.5, 25.5)) +
# Apply transform to y-axis
scale_y_continuous(trans = my_transform(25, 10),
breaks = seq(0, 80, by = 10)) +
scale_x_continuous(limits = c(1980, 2020), oob = oob_keep) +
theme_classic() +
# Turn real y-axis line off
theme(axis.line.y = element_blank())
You might find it informative to read Hadley Wickham's view on discontinuous axes. People sometimes mock weird y-axes.

Change alpha value for certain break values in ggplot geom_point

I have made a scatter plot from 100k+ points and I would like the coloured points (break values 1 and 2, which are "green", and break value 20, which is "red") to stand out more than the "cornsilk1" points (break values 3 to 19). I have tried the code below but no luck.
Any help would be appreciated.
Thanks so much
P.S. Please excuse my juvenile code. I am sure there is a much more effective way to do this...
plotIA<-ggplot(plotintaobs,aes(x=SD13009PB,y=SD13009PB2,colour=quartile))+geom_point()+labs(x="Phillips Observeration 1", y="Phillips Observation 2") + ggtitle("Intra-observer Variation") + mytheme
plotIA+ scale_color_manual(breaks = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20"),
values=c("green","green", "cornsilk1", "cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","cornsilk1","red"))
plotIA+scale_alpha_manual(values=c(1,1,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,1))
One strategy is to use cut to split the quartiles into your three groups. Then you can use scale_colour_manual and scale_alpha_manual:
# some fake data
plotintaobs <- data.frame(SD13009PB = rnorm(20), SD13009PB2 = rnorm(20), quartile = 1:20)
#cut quartile
plotintaobs$q2 <- cut(plotintaobs$quartile, breaks = c(0, 2, 19, 20), labels = c("low", "mid", "high"))
#plot
plotIA <- ggplot(plotintaobs, aes(x = SD13009PB, y = SD13009PB2, colour = q2, alpha = q2)) +
geom_point() +
scale_colour_manual(values = c("green", "cornsilk1","red")) +
scale_alpha_manual(values = c(1, 0.8, 1))
plotIA
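To see what the cut() call produces, and to help the highlighted points stand out when there are many overlapping points, the grouping can be inspected and the rows reordered in base R. This is a sketch under the assumption that geom_point() draws rows in data order, so rows placed last end up on top; `plot_order` is an illustrative name and the data are made up.

```r
set.seed(1)
# Some fake data, as above
plotintaobs <- data.frame(SD13009PB = rnorm(20),
                          SD13009PB2 = rnorm(20),
                          quartile = 1:20)

# (0, 2] -> "low", (2, 19] -> "mid", (19, 20] -> "high"
plotintaobs$q2 <- cut(plotintaobs$quartile,
                      breaks = c(0, 2, 19, 20),
                      labels = c("low", "mid", "high"))
table(plotintaobs$q2)
#>  low  mid high
#>    2   17    1

# Put the "mid" rows first so the highlighted rows are drawn last (on top)
plot_order <- order(plotintaobs$q2 != "mid")
plotintaobs <- plotintaobs[plot_order, ]
tail(plotintaobs[, c("quartile", "q2")], 3)
```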

How to make a color scale with sharp transition in ggplot2

I am trying to create a color scale with a sharp color transition at one point. What I am currently doing is:
test <- data.frame(x = c(1:20), y = seq(0.01, 0.2, by = 0.01))
cutoff <- 0.10
ggplot(data = test,
aes(x = as.factor(x), y = y, fill = log(y), width = 1, binwidth = 0)) +
geom_bar(stat = "identity") +
scale_fill_gradientn(colours = c("red", "red", "yellow", "green"),
values = rescale(log(c(0.01, cutoff - 0.0000000000000001, cutoff, 0.2))),
breaks = c(log(cutoff)), label = c(cutoff))
It is producing the plots I want. But the position of the break in colorbar somehow varies depending on the cutoff. Sometimes below the value, sometimes above, sometimes on the line. Here are some plots with different cutoffs (0.05, 0.06, 0.1):
What am I doing wrong? Or alternatively, is there a better way to create such a color scale?
Have you looked into scale_colour_steps or scale_colour_stepsn (or the scale_fill_* equivalents, since your plot maps fill)?
Using the n.breaks option of scale_colour_stepsn you should be able to specify the number of breaks you want and get sharper transitions.
Be sure to use ggplot2 > 3.3.2
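A minimal sketch of the binned-scale idea, assuming a ggplot2 version that provides scale_fill_steps(): a single break at the cutoff splits the values into two bins, each drawn in one flat colour, so the transition is sharp by construction. The bin colours are taken from the gradient between `low` and `high`, so they will not necessarily be the pure end colours; the data are the `test` data from the question, with y mapped directly rather than log(y) to keep the example short.

```r
library(ggplot2)

test <- data.frame(x = 1:20, y = seq(0.01, 0.2, by = 0.01))
cutoff <- 0.10

p <- ggplot(test, aes(x = factor(x), y = y, fill = y)) +
  geom_col(width = 1) +
  # One break -> two bins: values up to the cutoff, and values above it
  scale_fill_steps(low = "red", high = "green", breaks = cutoff)
p
```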
In case you are still interested in a solution for this, you can add guide = guide_colourbar(nbin = <some arbitrarily large number>) to scale_fill_gradientn(). This increases the number of bins used by the colourbar legend, which makes the transition look sharper.
# illustration using nbin = 1000, & weighted colours below the cutoff
plot.cutoff <- function(cutoff){
p <- ggplot(data = test,
aes(x = as.factor(x), y = y, fill = log(y))) +
geom_col(width = 1) +
scale_fill_gradientn(colours = c("red4", "red", "yellow", "green"),
values = scales::rescale(log(c(0.01, cutoff - 0.0000000000000001,
cutoff, 0.2))),
breaks = c(log(cutoff)),
label = c(cutoff),
guide = guide_colourbar(nbin = 1000))
return(p)
}
cowplot::plot_grid(plot.cutoff(0.05),
plot.cutoff(0.06),
plot.cutoff(0.08),
plot.cutoff(0.1),
ncol = 2)
(If you find the above insufficiently sharp at very high resolutions, you can also set raster = FALSE in guide_colourbar(), which turns off interpolation & draws rectangles instead.)
I think it is slightly tricky to achieve an exact, discrete cutoff point in the continuous color scale using scale_fill_gradientn. A quick alternative would be to use scale_fill_gradient, set the cutoff with limits, and set the color of 'out-of-bounds' values with na.value.
Here's a slightly simpler example than in your question:
# some data
df <- data.frame(x = factor(1:10), y = 1, z = 1:10)
# a cutoff point
lo <- 4
ggplot(df, aes(x = x, y = y, fill = z)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "yellow", high = "green",
limits = c(lo, max(df$z)), na.value = "red")
As you see, the values below your cutpoint will not appear in the legend, but one may consider including a large chunk of red a waste of "legend band width" anyway. You might just add a verbal description of the red bars in the figure caption instead.
You may also wish to differentiate between values below a lower cutpoint and above an upper cutpoint. For example, set 'too low' values to blue and 'too high' values to red. Here I use findInterval to differentiate between low, mid and high values.
# some data
set.seed(2)
df <- data.frame(x = factor(1:10), y = 1, z = sample(1:10))
# lower and upper limits
lo <- 3
hi <- 8
# create a grouping variable based on the break points
df$grp <- findInterval(df$z, c(lo, hi), rightmost.closed = TRUE)
ggplot(df, aes(x = x, y = y, fill = z)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "yellow", high = "green", limits = c(lo, hi), na.value = "red") +
geom_bar(data = df[df$grp == 0, ], fill = "blue", stat = "identity")
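To make the grouping step concrete, here is what findInterval() returns for a few hand-picked values with the same limits. The `z` values and the `status` labels are made up for illustration. Group 0 is the set of bars overpainted in blue by the second geom_bar() layer; group 2 falls outside `limits` and keeps the red na.value colour.

```r
lo <- 3
hi <- 8
z <- c(1, 3, 5, 8, 10)

# 0 = below lo, 1 = within [lo, hi] (closed on the right because of
# rightmost.closed = TRUE), 2 = above hi
grp <- findInterval(z, c(lo, hi), rightmost.closed = TRUE)
grp
#> [1] 0 1 1 1 2

data.frame(z, grp, status = c("too low", "in range", "too high")[grp + 1])
```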
