Coloring a specific line in ggplot - r

I use ggplot to plot hundreds of simulated paths. The data has been organized by pivot_longer to look like this (200 simulated paths, each having 2520 periods; simulation 1 first, then simulation 2 etc., with ind showing the simulated values for each period):
sim   period   ind
1     0        100.0
1     1        99.66
...
1     2520     103.11
2     0        100.0
...
...
200   0        100.0
...
200   2520     195.11
I am not sure whether using pivot_longer is optimal, but at least the following ggplot looks fine:
p<-ggplot(simdata, aes(x=period, y=ind,color=sim, group=sim))+geom_line()
producing a nice graph with paths in different shades of blue.
What I would like to do is to color the mean, median and quartile paths with different colours (e.g. red and green). Median, mean and quartile paths are defined by the last period's value. I already know the sim number for those. E.g. let's assume that median path is the one where sim = 160.
I have tried the following approaches.
Add a new geom_line specifying the number (sim) of the median path:
p + geom_line(aes(y = simdata[sim == 160,], color ="red")
This fails since the additional geom_line is not of the same length (200*2520) as the simdata - even if the graph's x-axis only has 2520 periods.
Stat_summary
p + stat_summary(aes(group=sim),fun=median, geom="line",colour="red")
The outcome was that all lines became red, including the simulated ones. I also rejected this approach because it takes much longer to have ggplot find the mean, median etc. than to find them before the graphics part.
gghighlight
I experimented with this package but could not figure out if you can specify the path numbers to color.

Maybe try your first solution, but pass it to the data argument of geom_line instead:
p + geom_line(data = simdata[simdata$sim == 160,], color ="red")
As a quick example with some simulated data:
library(ggplot2)
df <- data.frame(a = rep(1:20, each = 100),
b = rep(1:100, times = 20),
c = rnorm(2000))
ggplot(df, aes(b, c, group = a)) +
geom_line(colour = "grey") +
geom_line(data = df[df$a==20,], colour = "red")
You can also pass a condition as the colour aesthetic in aes, so that the matching line is drawn in a colour specified by scale_colour_manual (tidier, and it adds a legend whose labels can be edited):
ggplot(df, aes(b, c, group = a, colour = a == 20)) +
geom_line() +
scale_colour_manual(values = c("TRUE" = "red", "FALSE" = "grey"))
Created on 2021-12-07 by the reprex package (v2.0.1)
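The same idea extends to several highlighted paths at once. The following is only a sketch under assumptions: the data frame is a small random-walk stand-in for simdata, and the sim numbers in highlight are hypothetical placeholders for the mean, median and quartile paths that the question says are already known.

```r
library(ggplot2)

# Stand-in for simdata: 20 random-walk paths with periods 0..50 (illustrative only)
set.seed(1)
n_sim <- 20
n_period <- 50
simdata <- data.frame(
  sim    = rep(seq_len(n_sim), each = n_period + 1),
  period = rep(0:n_period, times = n_sim)
)
# Every path starts at 100 in period 0 and then follows a random walk
simdata$ind <- ave(rnorm(nrow(simdata)), simdata$sim,
                   FUN = function(z) 100 + cumsum(c(0, z[-1])))

# Hypothetical sim numbers of the paths to highlight (assumed to be known beforehand)
highlight <- c(median = 7, mean = 12, q1 = 3, q3 = 18)

# Rows of the highlighted paths, with a label column that drives the legend
hl <- simdata[simdata$sim %in% highlight, ]
hl$path <- names(highlight)[match(hl$sim, highlight)]

p <- ggplot(simdata, aes(period, ind, group = sim)) +
  geom_line(colour = "grey80") +                # all paths in the background
  geom_line(data = hl, aes(colour = path)) +    # highlighted paths drawn on top
  scale_colour_manual(values = c(median = "red", mean = "darkgreen",
                                 q1 = "orange", q3 = "blue"))
p
```

Because the highlighted layer is added last, those lines are drawn over the grey ones, and only the labelled paths appear in the legend.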

Related

How to plot a heatmap with 3 continuous variables in r ggplot2?

Sample dataset is as below:
count is discrete variable, temperature and relative_humidity_percent are continuous variables.
The code to generate sample dataset:
templ = data.frame(count = c(200,225,610,233,250,210,290,255,279,250),
temperature = c(12.2,11.6,12,8.5,4,8.2,9.2,10.6,10.8,10.9),
relative_humidity_percent = c(74,78,72,65,77,84,83,74,73,75))
count   temperature   relative_humidity_percent
200     12.2          74
225     11.6          78
610     12            72
233     8.5           65
250     4             77
210     8.2           84
290     9.2           83
255     10.6          74
279     10.8          73
250     10.9          75
I tried to plot a heatmap with ggplot2::stat_contour,
plot2 <- ggplot(templ, aes(x = temperature, y = relative_humidity_percent, z = count)) +
stat_contour(geom = 'contour') +
geom_tile(aes(fill = n)) +
stat_contour(bins = 15) +
guides(fill = guide_colorbar(title = 'count'))
plot2
The result is:
Also, I tried to use ggplot2::stat_density_2d,
> ggplot(templ, aes(temperature, relative_humidity_percent, z = count)) +
+ stat_density_2d(aes(fill = count))
Warning messages:
1: In stat_density_2d(aes(fill = count)) :
Ignoring unknown aesthetics: fill
2: The following aesthetics were dropped during statistical transformation: fill, z
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
> geom_density_2d() +
+ geom_contour() +
+ metR::geom_contour_fill(na.fill=TRUE) +
+ theme_classic()
Error in `+.gg`:
! Cannot add <ggproto> objects together
ℹ Did you forget to add this object to a <ggplot> object?
Run `rlang::last_error()` to see where the error occurred.
The result:
which was not filled with colour.
What I want is:
I want to replace level with count in the graph. However, count is not a factor, so I cannot plot the heatmap with ggplot2::geom_contour...
I understand from your comment that you want to "fill the entire graph", thus having a less truthful representation of your three dimensional data, which would be more accurately represented as a scatter plot and local coding of your third variable. I understand that you intend to interpolate the observation density between the measured locations.
You can of course use geom_density_2d for this. Just do the same trick as in my other answer and uncount your data first.
NB: this of course creates bins of densities; otherwise this type of visualisation with iso-density lines does not work.
ggplot(tidyr::uncount(templ, count)) +
geom_density_2d_filled(aes(temperature, relative_humidity_percent))
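To make the "uncount first" step concrete, here is a small sketch of what tidyr::uncount does to such a table. The two-row data frame is a made-up miniature of templ, used only for illustration.

```r
library(tidyr)

# Miniature stand-in for templ (illustrative values only)
small <- data.frame(
  count = c(2, 3),
  temperature = c(12.2, 11.6),
  relative_humidity_percent = c(74, 78)
)

# Repeat each row `count` times; by default the weight column itself is dropped
expanded <- uncount(small, count)
expanded
nrow(expanded)  # one row per counted observation: 2 + 3
```

After this step each observation is its own row, which is the shape a 2D density estimate expects.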
Just use geom_point and color according to your count. You can of course make your points square.
Or, if your count is not yet actually an aggregate measure and you want to show the density of neighbouring observations, you could use ggpointdensity::geom_pointdensity for this. (in your example, I have to uncount first).
library(ggplot2)
library(dplyr)
library(tidyr)
templ = data.frame(count = c(200,225,610,233,250,210,290,255,279,250),
temperature = c(12.2,11.6,12,8.5,4,8.2,9.2,10.6,10.8,10.9),
relative_humidity_percent = c(74,78,72,65,77,84,83,74,73,75))
ggplot(templ) +
geom_point(aes(temperature, relative_humidity_percent, color = count), shape = 15, size = 5)
## first uncount
templ %>%
uncount(count) %>%
ggplot() +
ggpointdensity::geom_pointdensity(aes(temperature, relative_humidity_percent))

R, ggplot, How do I keep related points together when using jitter?

One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually, however, the "spend" amounts are divorced from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
#▼ 3 day, points centered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
#▼ 3 day, jitted
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
group_by(id) %>%
mutate(jitter = runif(1, min = -.3, max = .3)) %>%
ungroup()
Then the plot. I use a geom_blank() layer so that the categorical site axis is drawn before I add the jitter. I convert site to be numeric from a factor and add the jitter on; this only works for factors so luckily categorical axes in ggplot2 are based on factors.
Now paired IDs move together.
ggplot(ef2, aes(x = date, y = site)) +
geom_blank() +
geom_point(aes(size = amount, color = gainspend,
y = as.numeric(factor(site)) + jitter),
alpha=0.5) +
scale_color_manual(values = ccode) +
scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
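A quick way to check the key property (one jitter value per id) is sketched below in base R. The four-row data frame is a made-up miniature of ef, and using ave() here is an alternative to the dplyr pipeline above, not part of the original answer.

```r
# Miniature stand-in for ef: two ids, each with a Gain and a Spend row
ef_small <- data.frame(
  id        = c("C123", "T101", "C123", "T101"),
  site      = c("Castle", "Temple", "Castle", "Temple"),
  gainspend = c("Gain", "Gain", "Spend", "Spend")
)

set.seed(16)
# ave() calls the function once per id and recycles the single draw to all rows of that id
ef_small$jitter <- ave(numeric(nrow(ef_small)), ef_small$id,
                       FUN = function(z) runif(1, min = -0.3, max = 0.3))

# Factor levels are sorted alphabetically, so Castle = 1 and Temple = 2
ef_small$y <- as.numeric(factor(ef_small$site)) + ef_small$jitter

# Number of distinct jitter values per id; each should be 1
n_jitter <- tapply(ef_small$jitter, ef_small$id, function(z) length(unique(z)))
n_jitter
```

If any id shows more than one distinct jitter value, its Gain and Spend points will be drawn at different heights.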
You can add some jitter by id outside the ggplot() call. Draw exactly one jitter value per unique id, so that both rows of a pair receive the same offset.
ids <- unique(ef$id)
jj <- data.frame(id = ids, jtr = runif(length(ids), -0.3, 0.3))  # one draw per id
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
ggplot(ef,aes(date,sitej)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work. Gain circles that still appear without a partner are events whose spend amount is NA in the data.
Perhaps someone else has a better approach!

ggplot2 density-plot with discrete data

I want to create a density plot with the following data:
interval fr mi ab
0x 9765 3631 12985
1x 2125 2656 601
2x 1299 2493 191
3x 493 2234 78
4x 141 1559 20
5x and more 75 1325 23
On the X-Axis I want to have the Intervals and on the Y-Axis I want to have the density of "fr", "mi" and "ab" in different colors.
My imagination was something like this graph.
My problem is that I don't know how to get the density on the Y-Axis. I tried it with geom_density, but it didn't work. The best result I accomplished was using the following code:
DS29 <-as.data.frame(DS29)
DS29$interval <- factor(DS29$interval, levels = DS29$interval)
DS29 <- melt (DS29,id=c("interval"))
output$DS51<- renderPlot({
plot_tab6 <- ggplot(DS29, aes(x= interval,y = value, fill=variable, group = variable)) +
geom_col()+
geom_line()
return(plot_tab6)
})
This gives me the following plot, which is not the result I want to have. Do you have an idea how I could get to my wanted result? Thank you very much.
Seeing your sample data, I am not sure that you want to use geom_density. If you type ?geom_density, you will see some example code. If I take one example from the help page, you may see what you are missing.
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.1) +
xlim(55, 70)
On the x-axis, depth is a continuous variable, not a categorical variable. Your current data has a categorical variable on the x-axis. With geom_density, you are looking at the density of something at a value on the x-axis. The example code above shows that diamonds classified as "Ideal" have high density around 61.5-62, suggesting that the largest proportion of "Ideal" diamonds have a depth value around 61.5-62. Indeed, the mean depth of "Ideal" diamonds is 61.71. This means that you need multiple data points to calculate a density. Your data has only one data point for each interval in each group (e.g., ab, fr, mi). So I do not think your data is ready for calculating density.
If you want to draw a graphic similar to what you suggested in your question using the current data, I think you need to 1) convert interval to a numeric variable, 2) transform the data into long format, and 3) use stat_smooth.
library(tidyverse)
mydf %>%
mutate(interval = as.numeric(sub(x = as.character(interval), pattern = "x.*", replacement = ""))) %>%  # "x.*" also strips " and more" from "5x and more"
gather(key = group, value = value, - interval) -> temp
ggplot(temp, aes(x = interval, y = value, fill = group)) +
stat_smooth(geom = "area", span = 0.4, method = "loess", alpha = 0.4)
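If relative frequencies per group are an acceptable reading of "density" for this pre-aggregated table, they can be computed directly from the counts. The sketch below is an alternative under that assumption, not the answer's method; the data frame is retyped from the table in the question, and the column names count, group and share are introduced here for illustration.

```r
library(ggplot2)

# Counts retyped from the question's table
DS29 <- data.frame(
  interval = c("0x", "1x", "2x", "3x", "4x", "5x and more"),
  fr = c(9765, 2125, 1299, 493, 141, 75),
  mi = c(3631, 2656, 2493, 2234, 1559, 1325),
  ab = c(12985, 601, 191, 78, 20, 23)
)
# Keep the intervals in table order rather than alphabetical order
DS29$interval <- factor(DS29$interval, levels = DS29$interval)

# Long format: stack() returns the columns `values` and `ind`, in the order fr, mi, ab
long <- data.frame(
  interval = rep(DS29$interval, times = 3),
  stack(DS29[c("fr", "mi", "ab")])
)
names(long) <- c("interval", "count", "group")

# Share of each interval within its group, so every group sums to 1
long$share <- long$count / ave(long$count, long$group, FUN = sum)

ggplot(long, aes(interval, share, colour = group, group = group)) +
  geom_line() +
  geom_point() +
  ylab("share of group total")
```

This keeps the x-axis categorical and makes the three groups comparable even though their totals differ.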

ggplot stat_summary_bin glitch?

I was happy to discover that ggplot has binned scatter plots, which are useful for exploring and visualizing relationships in large data. Yet the top bin appears to misbehave. Here's an example: All bin averages are roughly linearly aligned, as they should be, but the top one is off on both dimensions:
the code:
library(ggplot2)
# simulate an example of linear data
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x=x, y=y)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point')
is there a simple workaround (and where should this be posted)?
stat_summary_bin is actually excluding the two rows with the largest x-values from the bins, and those two values are ending up with bin = NA. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First, I show what is going wrong in your original plot then I provide a workaround to get the desired behavior.
What's going wrong in the original plot
To see what's going wrong in your original plot, create a plot with two calls to stat_summary_bin where we calculate the mean of each bin and the number of values in each bin. Then use ggplot_build to capture all of the internal data that ggplot generated to create the plot.
p1 = ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y=mean, bins=10, size=5, geom='text',
aes(label=..y..)) +
stat_summary_bin(fun.y=length, bins=10, size=5, geom='text',
aes(label=..y.., y=0))
p1b = ggplot_build(p1)
Now let's look at the data for the mean and length layers, respectively. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin and you can see that it contains only 2 values (its label is 2 in the second table below), and that the mean of those two values is -0.1309998, as can be seen in the first table below.
p1b$data[[2]][9:11,c(1,2,4,6,7)]
label bin y x width
9 0.8158320 9 0.8158320 0.8498505 0.09998242
10 0.9235531 10 0.9235531 0.9498329 0.09998242
11 -0.1309998 11 -0.1309998 1.0498154 0.09998244
p1b$data[[3]][9:11,c(1,2,4,6,7)]
label bin y x width
9 1025 9 1025 0.8498505 0.09998242
10 1042 10 1042 0.9498329 0.09998242
11 2 11 2 1.0498154 0.09998244
Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:
mean(dt[order(-dt$x), "y"][1:2])
[1] -0.1309998
I'm not sure how stat_summary_bin is managing to bin the data such that the two highest x values are excluded.
Workaround to get the desired behavior
A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've included the dplyr package so that I can use the chaining operator (%>%) to summarize the data on the fly:
library(dplyr)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
geom_point(data=dt %>%
group_by(bins=cut(x,breaks=seq(min(x),max(x),length.out=11), include.lowest=TRUE)) %>%
summarise(x=mean(x), y=mean(y)),
aes(x,y), size=3, color="blue") +
theme_bw()
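The include.lowest = TRUE argument in the workaround matters for values that sit exactly on the outermost break. The sketch below shows this for base R's cut() on a toy vector; it illustrates how boundary values can fall out of half-open bins in general and is not a claim about what stat_summary_bin does internally.

```r
x_toy <- c(0, 0.25, 0.5, 1)
breaks <- seq(min(x_toy), max(x_toy), length.out = 3)  # 0, 0.5, 1

# Default bins are (0,0.5] and (0.5,1]: the minimum lies outside the first bin
default_bins <- cut(x_toy, breaks = breaks)
default_bins

# With include.lowest = TRUE the first bin becomes [0,0.5], so nothing is dropped
closed_bins <- cut(x_toy, breaks = breaks, include.lowest = TRUE)
closed_bins
```

When the breaks are built from min(x) and max(x), as in the workaround, leaving out include.lowest would turn the smallest observation into NA.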
@eipi10 has already explained why this is happening.
Perhaps the simplest solution is to add a scale_x_continuous with limits to your plot, so that the extra "NA" bin is excluded from the plot.
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
scale_x_continuous(limits = range(x))
This should be acceptable with large data such as in the example, where the small number of data points that were excluded from the bins will not significantly bias the stats. However, in situations where missing a couple of data points from the summary statistics matters, the solution provided by @eipi10 will be better.

Can characters be graphed in a histogram in R?

So I have a data frame which I will call DF. It looks something like this:
zep SEX AGE BMI
1 O F 3.416667 16.00000
2 O F 3.833333 14.87937
3 O G 3.416667 14.80223
4 O F 4.000000 15.09656
5 N G 3.666667 16.50000
6 O G 4.000000 16.49102
7 N G 3.916667 16.02413
With this data frame I want to plot multiple histograms comparing different aspects, like how gender affects BMI. Like so:
par(mfrow = c(1, 3))
boxplot(DF$BMI ~ DF$zep)
boxplot(DF$BMI ~ DF$SEX)
boxplot(DF$BMI ~ DF$AGE)
But for some reason the columns are stored as characters instead of factors.
Now I pose this: is there a way to plot these if they are characters? If not, what can I do?
Also is there a way maybe to change zep and sex into a vector of logical factors? Maybe like in zep if O then true (1) if not then false (0), and the same thing for SEX. If G then true (1) if not then false (0).
I have to plot categorical variables for my advanced data analysis class, so I can help you out. beed stands for border entry and employment data; don't steal my research plz.
The code I use to create factors is shown below. For example, I have a column called portname that holds dummy codes, and I create a new column that holds the names instead. This is how I would make the logical you describe; I've added that code to the larger code chunk below.
beed$portdisc <- as.numeric(beed$portname)
beed$portdisc[beed$portdisc==0] <- "Columbus Port of Entry"
beed$portdisc[beed$portdisc==1] <- "Santa Teresa Port of Entry"
beed$portdisc[beed$portdisc==2] <- "New Mexico All Ports Aggregate"
So what I've done here is take my data frame beed and use the specific column containing my portname variables. I add a new column to my data frame called beed$portdisc, and then, using [ ], I define what I want to label as what.
In your case I think this should work (think, but I've tested by using the data you provided).
I have a hard time making the labels come out right with discrete variables. My apologies but this gets you very close.
library(ggplot2)
DF$SEX.factor <- as.character(DF$SEX)
DF$SEX.factor[DF$SEX.factor== "G"] <- "0"
DF$SEX.factor[DF$SEX.factor== "F"] <- "1"
DF$SEX.factor <- as.factor(DF$SEX.factor)
bar <- ggplot(DF, aes(x = SEX.factor))  # data must be the data frame, not a single column
bar <- bar + geom_bar(width = 0.5) + xlab("Sex")
bar <- bar + scale_x_discrete(labels = c("0" = "G", "1" = "F"))  # relabel the 0/1 codes; substitute the names you want shown
bar
# DF.BMI5 = cut(DF$BMI,pretty(DF$BMI,5)) # Creates close to 5 integer ranges as factors, automatically chooses pretty scales.
# This would be good to compare, say, age and BMI; best with one discrete and one continuous variable
p <- ggplot(DF, aes(x = SEX.factor, y = BMI))
p <- p + geom_boxplot(width = 0.25, alpha = 0.4)
p <- p + geom_jitter(position = position_jitter(width = 0.1), alpha = .35, color = "blue")
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
colour = "red", alpha = 0.8)
p <- p + scale_x_discrete(labels = c("0" = "G", "1" = "F")) + xlab("Sex")  # relabel the 0/1 codes; substitute the names you want shown
p
Here is what I got when I ran this code on my own data. I think this is what you're looking to create; I've included the code above. It'll work with anything where x is a discrete variable: just use the as.factor() function and set y as a continuous type.
If you need any more help just let me know. I like to help out people on here because it helps me hone my R skills. I'm more of a Visual Studio kind of guy; VBA is my friend.
Hope this helps!
If you ever need to change a character to a factor, you can always use as.factor('A'), for instance.
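For the question's data, the conversion to factors and to logical indicators might look like the sketch below. The data frame is retyped from the seven rows shown in the question, and the column names zep_is_O and SEX_is_G are made up for this illustration.

```r
# Data retyped from the question; character columns are kept as characters on purpose
DF <- data.frame(
  zep = c("O", "O", "O", "O", "N", "O", "N"),
  SEX = c("F", "F", "G", "F", "G", "G", "G"),
  AGE = c(3.416667, 3.833333, 3.416667, 4.000000, 3.666667, 4.000000, 3.916667),
  BMI = c(16.00000, 14.87937, 14.80223, 15.09656, 16.50000, 16.49102, 16.02413),
  stringsAsFactors = FALSE
)

# Convert the character columns to factors
DF$zep <- factor(DF$zep)
DF$SEX <- factor(DF$SEX)

# Logical indicators as described in the question: TRUE for "O" and for "G"
DF$zep_is_O <- DF$zep == "O"
DF$SEX_is_G <- DF$SEX == "G"

# Side-by-side boxplots of BMI by each grouping variable
par(mfrow = c(1, 2))
boxplot(BMI ~ zep, data = DF)
boxplot(BMI ~ SEX, data = DF)
```

Using the data argument with a formula avoids repeating DF$ and gives cleaner axis labels.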
