ggplot2 density-plot with discrete data


I want to create a density plot with the following data:
interval       fr    mi     ab
0x           9765  3631  12985
1x           2125  2656    601
2x           1299  2493    191
3x            493  2234     78
4x            141  1559     20
5x and more    75  1325     23
On the X-Axis I want to have the Intervals and on the Y-Axis I want to have the density of "fr", "mi" and "ab" in different colors.
My imagination was something like this graph.
My problem is that I don't know how to get the density on the Y-Axis. I tried it with geom_density, but it didn't work. The best result I accomplished was using the following code:
DS29 <-as.data.frame(DS29)
DS29$interval <- factor(DS29$interval, levels = DS29$interval)
DS29 <- melt (DS29,id=c("interval"))
output$DS51<- renderPlot({
plot_tab6 <- ggplot(DS29, aes(x= interval,y = value, fill=variable, group = variable)) +
geom_col()+
geom_line()
return(plot_tab6)
})
This gives me the following plot, which is not the result I want to have. Do you have an idea how I could get to my wanted result? Thank you very much.

Looking at your sample data, I am not sure you want to use geom_density. If you type ?geom_density, you will see some example code. Taking one example from the help page shows what your data is missing.
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.1) +
xlim(55, 70)
On the x-axis, depth is a continuous variable, not a categorical one, whereas your current data has a categorical variable on the x-axis. geom_density estimates the density of a variable at each value along the x-axis. In the example above, the density for diamonds classified as "Ideal" peaks around 61.5-62, suggesting that the largest proportion of "Ideal" diamonds have a depth around 61.5-62. Indeed, the mean depth of "Ideal" diamonds is 61.71. This means that you need multiple data points to calculate a density. Your data has only one data point for each interval in each group (e.g., ab, fr, mi). So, I do not think your data is ready for calculating a density.
If you want to draw a graphic similar to what you suggested in your question using the current data, I think you need to 1) convert interval to a numeric variable, 2) transform the data into long format, and 3) use stat_smooth.
library(tidyverse)
mydf %>%
mutate(interval = as.numeric(sub(x = as.character(interval), pattern = "x.*", replacement = ""))) %>% # "x.*" also strips " and more", so "5x and more" becomes 5 instead of NA
gather(key = group, value = value, - interval) -> temp
ggplot(temp, aes(x = interval, y = value, fill = group)) +
stat_smooth(geom = "area", span = 0.4, method = "loess", alpha = 0.4)
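If what is wanted on the y-axis is each group's share rather than the raw counts, one option is to rescale the counts within each group before plotting. This is a minimal sketch, not a true density estimate. It assumes the question's data is stored in a data frame named mydf (the name used above) and that "5x and more" is coded as 5. The names shares, share and p_share are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Data from the question; "5x and more" is coded as 5 here (an assumption)
mydf <- data.frame(
  interval = 0:5,
  fr = c(9765, 2125, 1299, 493, 141, 75),
  mi = c(3631, 2656, 2493, 2234, 1559, 1325),
  ab = c(12985, 601, 191, 78, 20, 23)
)

# Long format, then divide each count by the total of its group
shares <- mydf %>%
  pivot_longer(-interval, names_to = "group", values_to = "value") %>%
  group_by(group) %>%
  mutate(share = value / sum(value)) %>%
  ungroup()

# One line per group; the shares within each group sum to 1
p_share <- ggplot(shares, aes(x = interval, y = share, colour = group)) +
  geom_line() +
  geom_point()
p_share
```

Because every group is scaled to sum to 1, the three curves become comparable even though the groups have very different totals.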

Related

How to plot a heatmap with 3 continuous variables in r ggplot2?

Sample dataset is as below:
count is discrete variable, temperature and relative_humidity_percent are continuous variables.
The code to generate sample dataset:
templ = data.frame(count = c(200,225,610,233,250,210,290,255,279,250),
temperature = c(12.2,11.6,12,8.5,4,8.2,9.2,10.6,10.8,10.9),
relative_humidity_percent = c(74,78,72,65,77,84,83,74,73,75))
count  temperature  relative_humidity_percent
  200         12.2                         74
  225         11.6                         78
  610         12                           72
  233          8.5                         65
  250          4                           77
  210          8.2                         84
  290          9.2                         83
  255         10.6                         74
  279         10.8                         73
  250         10.9                         75
I tried to plot a heatmap with ggplot2::stat_contour,
plot2 <- ggplot(templ, aes(x = temperature, y = relative_humidity_percent, z = count)) +
stat_contour(geom = 'contour') +
geom_tile(aes(fill = n)) +
stat_contour(bins = 15) +
guides(fill = guide_colorbar(title = 'count'))
plot2
The result is:
Also, I tried to use ggplot2::stat_density_2d,
> ggplot(templ, aes(temperature, relative_humidity_percent, z = count)) +
+ stat_density_2d(aes(fill = count))
Warning messages:
1: In stat_density_2d(aes(fill = count)) :
Ignoring unknown aesthetics: fill
2: The following aesthetics were dropped during statistical transformation: fill, z
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
> geom_density_2d() +
+ geom_contour() +
+ metR::geom_contour_fill(na.fill=TRUE) +
+ theme_classic()
Error in `+.gg`:
! Cannot add <ggproto> objects together
ℹ Did you forget to add this object to a <ggplot> object?
Run `rlang::last_error()` to see where the error occurred.
The result:
which was not filled with colour.
What I want is:
I want to replace level with count in the graph. However, since the count variable is not a factor, I cannot plot the heatmap using ggplot2::geom_contour...

I understand from your comment that you want to "fill the entire graph", thus having a less truthful representation of your three dimensional data, which would be more accurately represented as a scatter plot and local coding of your third variable. I understand that you intend to interpolate the observation density between the measured locations.
You can of course use geom_density_2d for this. Just do the same trick as in my other answer and uncount your data first.
NB: this of course creates bins of densities. Otherwise this type of visualisation with iso-density lines would not work.
ggplot(tidyr::uncount(templ, count)) +
geom_density_2d_filled(aes(temperature, relative_humidity_percent))
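To make the "uncount" step concrete, here is a small sketch of what tidyr::uncount does to a data frame. The toy data frame and its values are made up for illustration and are not the question's data.

```r
library(tidyr)

# Two rows with made-up counts (hypothetical toy data)
toy <- data.frame(count = c(2, 3), temperature = c(12.2, 11.6))

# uncount() repeats each row `count` times and, by default,
# drops the count column from the result
expanded <- uncount(toy, count)
expanded
nrow(expanded)
```

After this step every row stands for one observation, which is the shape that density-based layers such as geom_density_2d_filled expect.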

Just use geom_point and color according to your count. You can of course make your points square.
Or, if your count is not yet actually an aggregate measure and you want to show the density of neighbouring observations, you could use ggpointdensity::geom_pointdensity for this. (in your example, I have to uncount first).
library(ggplot2)
library(dplyr)
library(tidyr)
templ = data.frame(count = c(200,225,610,233,250,210,290,255,279,250),
temperature = c(12.2,11.6,12,8.5,4,8.2,9.2,10.6,10.8,10.9),
relative_humidity_percent = c(74,78,72,65,77,84,83,74,73,75))
ggplot(templ) +
geom_point(aes(temperature, relative_humidity_percent, color = count), shape = 15, size = 5)
## first uncount
templ %>%
uncount(count) %>%
ggplot() +
ggpointdensity::geom_pointdensity(aes(temperature, relative_humidity_percent))

Extend line length with geom_line

I want to represent three lines on a graph overlain with the data points that I used in a discriminant function analysis. From my analysis, I have two points that fall on each line, and I want to draw these three lines. The lines represent the probability contours of the classification scheme; exactly how I got the points on each line is not relevant to my question here. However, I want the lines to extend further than the points that define them.
df <-
data.frame(Prob = rep(c("5", "50", "95"), each=2),
Wing = rep(c(107,116), 3),
Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008))
ggplot()+
geom_line(data=df, aes(x=Bill, y=Wing, group=Prob, color=Prob))
The above df is a dataframe for my points from which the three lines are constructed. I want the lines to extend from y=105 to y=125.
Thanks!

There are probably more idiomatic ways of doing it but this is one way to get it done.
In short, you calculate the equation of the straight line through each pair of points, i.e. y = mx + c.
library(dplyr)
library(ggplot2)
df_withFormula <- df |>
group_by(Prob) |>
# This mutate call creates the slope and intercept that geom_abline needs at the plotting stage.
mutate(increaseBill = Bill - lag(Bill),
increaseWing = Wing - lag(Wing),
slope = increaseWing/increaseBill,
intercept = Wing - slope*Bill)
# The increaseBill, increaseWing and slope could all be combined into one calculation but I thought it was easier to understand this way.
ggplot(df_withFormula, aes(Bill, Wing, color = Prob)) +
# Added just so the lines have something to plot on top of. You could remove this and instead define the limits manually (expand_limits would work).
geom_point() +
# This plots the three lines. Rows with NA are ignored automatically; the NAs could be handled more explicitly in the data preparation stage.
geom_abline(aes(slope = slope, intercept = intercept, color = Prob)) +
# This is the crucial part: it lets you define the range of the plot window. As ablines are infinite, you can set whatever limits you want.
expand_limits(y = c(105,125))
Hope this helps you get the graph you want.
This depends very much on the structure of your data, though it could be adapted to fit different shapes.

Similar to the approach by @James in that I compute the slopes and intercepts from the given data and use geom_abline to plot the lines, but this version uses
summarise instead of mutate to get rid of the NA values
and geom_blank instead of geom_point so that only the lines are displayed, not the points (note: having another geom is crucial to set the scale, i.e. the range of the data, so that the lines show up).
library(dplyr)
library(ggplot2)
df_line <- df |>
group_by(Prob) |>
summarise(slope = diff(Wing) / diff(Bill),
intercept = first(Wing) - slope * first(Bill))
ggplot(df, aes(x = Bill, y = Wing)) +
geom_blank() +
geom_abline(data = df_line, aes(slope = slope, intercept = intercept, color = Prob)) +
scale_y_continuous(limits = c(105, 125))
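A different route, shown here only as a sketch and not taken from either answer, is to extrapolate each line to the two target Wing values and keep using geom_line, so the lines end exactly at y = 105 and y = 125. The names ext, wing_new and bill_per_wing are introduced for illustration.

```r
library(ggplot2)

# Data from the question
df <- data.frame(Prob = rep(c("5", "50", "95"), each = 2),
                 Wing = rep(c(107, 116), 3),
                 Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008))

# For each line, compute Bill at Wing = 105 and Wing = 125 by linear extrapolation
ext <- do.call(rbind, lapply(split(df, df$Prob), function(d) {
  wing_new <- c(105, 125)
  bill_per_wing <- diff(d$Bill) / diff(d$Wing)  # change in Bill per unit of Wing
  data.frame(Prob = d$Prob[1],
             Wing = wing_new,
             Bill = d$Bill[1] + (wing_new - d$Wing[1]) * bill_per_wing)
}))

# The extended end points are drawn like any other line
p_ext <- ggplot(ext, aes(x = Bill, y = Wing, colour = Prob, group = Prob)) +
  geom_line()
p_ext
```

Unlike geom_abline, the resulting segments are ordinary data, so they set the axis ranges themselves and stop at the chosen Wing values instead of running to the panel edge.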

R, ggplot, How do I keep related points together when using jitter?

One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually, however, the "spend" amounts are divorced from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
#▼ 3 day, points centered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
#▼ 3 day, jittered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))

My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
group_by(id) %>%
mutate(jitter = runif(1, min = -.3, max = .3)) %>%
ungroup()
Then the plot. I use a geom_blank() layer so that the categorical site axis is drawn before I add the jitter. I convert site to be numeric from a factor and add the jitter on; this only works for factors so luckily categorical axes in ggplot2 are based on factors.
Now paired ID's move together.
ggplot(ef2, aes(x = date, y = site)) +
geom_blank() +
geom_point(aes(size = amount, color = gainspend,
y = as.numeric(factor(site)) + jitter),
alpha=0.5) +
scale_color_manual(values = ccode) +
scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
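The key property of the grouped mutate above is that every row sharing an id receives the same offset. A minimal sketch on made-up toy data (not the question's ef) shows how this can be checked; the names toy, toy_j and n_offsets are hypothetical.

```r
library(dplyr)

set.seed(16)

# Toy data (made up): two ids, each with a Gain row and a Spend row
toy <- data.frame(id = c("A", "B", "A", "B"),
                  gainspend = c("Gain", "Gain", "Spend", "Spend"))

# runif(1, ...) is evaluated once per group and recycled to all rows of that group
toy_j <- toy %>%
  group_by(id) %>%
  mutate(jitter = runif(1, min = -.3, max = .3)) %>%
  ungroup()

# Each id should map to exactly one jitter value
toy_j %>%
  group_by(id) %>%
  summarise(n_offsets = n_distinct(jitter))
```

If n_offsets is 1 for every id, the gain and spend points of a pair will be shifted by the same amount and stay concentric.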

You can add some jitter by id outside the ggplot() call.
jj <- data.frame(id = unique(ef$id), jtr = runif(length(unique(ef$id)), -0.3, 0.3)) # one offset per unique id
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
ggplot(ef,aes(date,sitej)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work, but there are still a few gain/spend circles without a partner--perhaps there is a problem with the id variable.
Perhaps someone else has a better approach!

ggplot with stat_summary for mean along time represented by days

I have this data representing the value of a variable Q1 along time.
The time is not represented by dates, it is represented by the number of days since one event.
https://www.mediafire.com/file/yfzbx67yivvvkgv/dat.xlsx/file
I'm trying to plot the mean value of Q1 along time, like in this question:
Plotting average of multiple variables in time-series using ggplot
I'm using this code
library(Hmisc)
ggplot(dat,aes(x=days,y=Q1,colour=type,group=type)) +
stat_summary(fun.data = "mean_cl_boot", geom = "smooth")

Besides the code, which does not appear to work with the new ggplot2 version, you also have the problem that your data is not really suited for that kind of plot. This code achieves what you wanted to do:
dat <- rio::import("dat.xlsx")
library(ggplot2)
library(dplyr)
# mean_cl_boot needs the Hmisc package to be installed
dat %>%
ggplot(aes(x = days, y = Q1, colour = type, group = type)) +
geom_smooth(stat = 'summary', fun.data = mean_cl_boot)
But the plot doesn't really tell you anything, simply because there aren't enough values in your data. Most often there seems to be only one value per day, the values jump quickly up and down, and the gaps between days are sometimes quite big.
You can see this when you group the values into timespans instead. Here I used round(days, -2) which will round to the nearest 100 (e.g., 756 is turned into 800, 301 becomes 300, 49 becomes 0):
dat %>%
mutate(days = round(days, -2)) %>%
ggplot(aes(x = days, y = Q1, colour = type, group = type)) +
geom_smooth(stat = 'summary', fun.data = mean_cl_boot)
This should be the same plot as the linked one, but with huge confidence intervals, which is not surprising since, as mentioned, the values alternate quickly between 1 and 5. I hope that helps.
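As a quick check of the rounding used above, round() with a negative digits argument rounds to the left of the decimal point. The three input values are the ones quoted in the text; the variable names are only for illustration.

```r
# Example day counts taken from the explanation above
days_example <- c(756, 301, 49)

# digits = -2 rounds to the nearest hundred
rounded <- round(days_example, -2)
rounded  # 800 300 0
```

A coarser or finer grouping can be tried by changing the digits argument, for example -1 for the nearest ten.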

ggplot stat_summary_bin glitch?

I was happy to discover that ggplot has binned scatter plots, which are useful for exploring and visualizing relationships in large data. Yet the top bin appears to misbehave. Here's an example: All bin averages are roughly linearly aligned, as they should be, but the top one is off on both dimensions:
the code:
library(ggplot2)
# simulate an example of linear data
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x=x, y=y)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point')
Is there a simple workaround (and where should this be reported)?

stat_summary_bin is actually excluding the two rows with the largest x-values from the bins, and those two values are ending up with bin = NA. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First, I show what is going wrong in your original plot then I provide a workaround to get the desired behavior.
What's going wrong in the original plot
To see what's going wrong in your original plot, create a plot with two calls to stat_summary_bin where we calculate the mean of each bin and the number of values in each bin. Then use ggplot_build to capture all of the internal data that ggplot generated to create the plot.
p1 = ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y=mean, bins=10, size=5, geom='text',
aes(label=..y..)) +
stat_summary_bin(fun.y=length, bins=10, size=5, geom='text',
aes(label=..y.., y=0))
p1b = ggplot_build(p1)
Now let's look at the data for the mean and length layers, respectively. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin and you can see that it contains only 2 values (its label is 2 in the second table below), and that the mean of those two values is -0.1309998, as can be seen in the first table below.
p1b$data[[2]][9:11,c(1,2,4,6,7)]
label bin y x width
9 0.8158320 9 0.8158320 0.8498505 0.09998242
10 0.9235531 10 0.9235531 0.9498329 0.09998242
11 -0.1309998 11 -0.1309998 1.0498154 0.09998244
p1b$data[[3]][9:11,c(1,2,4,6,7)]
label bin y x width
9 1025 9 1025 0.8498505 0.09998242
10 1042 10 1042 0.9498329 0.09998242
11 2 11 2 1.0498154 0.09998244
Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:
mean(dt[order(-dt$x), "y"][1:2])
[1] -0.1309998
I'm not sure how stat_summary_bin is managing to bin the data such that the two highest x values are excluded.
Workaround to get the desired behavior
A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've included the dplyr package so that I can use the chaining operator (%>%) to summarize the data on the fly:
library(dplyr)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
geom_point(data=dt %>%
group_by(bins=cut(x,breaks=seq(min(x),max(x),length.out=11), include.lowest=TRUE)) %>%
summarise(x=mean(x), y=mean(y)),
aes(x,y), size=3, color="blue") +
theme_bw()
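To confirm that binning by hand leaves no observation unassigned, the summary can first be computed as a standalone data frame and inspected. This sketch reuses the simulated data from the question; the names binned, bin and n are introduced here for illustration.

```r
library(dplyr)

# Simulated data, as in the question
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x = x, y = y)

# Ten equal-width bins from min(x) to max(x); include.lowest = TRUE keeps
# the smallest x value, which would otherwise fall outside the first bin
binned <- dt %>%
  group_by(bin = cut(x, breaks = seq(min(x), max(x), length.out = 11),
                     include.lowest = TRUE)) %>%
  summarise(n = n(), x = mean(x), y = mean(y))

binned
sum(binned$n)  # should equal N if every row landed in a bin
```

Having the counts next to the means also makes it easy to spot bins that rest on very few observations.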

@eipi10 has already explained why this is happening.
Perhaps the simplest solution is to add a scale_x_continuous with limits to your plot, so that the extra "NA" bin is excluded from the plot.
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
scale_x_continuous(limits = range(x))
This should be acceptable with large data such as in the example, where the small number of data points excluded from the bins will not significantly bias the statistics. However, if missing a couple of data points from the summary statistics matters, then the solution provided by @eipi10 will be better.
