Creating Stacked Density Plot with Weightings - r

I am attempting to use ggplot2 to create a weighted density plot showing the distribution of two groups that each account for a fraction of a certain distribution. The difficulty that I am encountering stems from the fact that although both groups have the same number of observations in the data, they have different weightings, and I would like for each group's area in the graph to reflect this difference in weightings.
My data look something like this.
library(tibble)
var <- sort(rnorm(1000, mean = 5, sd = 2))
df <- tibble(id = c(rep(1, 1000), rep(2, 1000)),
             var = c(var, var),
             weight = c(rep(.1, 500), rep(.2, 500), rep(.9, 500), rep(.8, 500)))
Observe that group 1 is given low weightings (.1 or .2) while group 2 is given high weightings (.9 or .8). Also observe that, for any given value of var, the weightings add up to 1. In the real data, the shares accounted for by each group differ in a more complex manner across the distribution of var.
I have tried plotting this data as follows, and although using weight captures the way that the distributions vary within each group, it does not capture the way that the distribution varies between groups.
library(ggplot2)
library(dplyr)  # provides %>%
df %>%
  ggplot(aes(x = var, group = id, fill = factor(id), weight = weight)) +
  geom_density(position = 'stack')
The resulting plot looks something like this.
It is clear that the groups do not account for around 15% and 85% of the area under the density curve respectively, but the issue is clearer to see when we use position = 'fill'.
Each group seems to take up a similar area, apparently because the weighting is applied before grouping is accounted for. I would like to see a solution that results in the area associated with group 1 being commensurate with its weight (i.e. much smaller than the area associated with group 2).
To clarify, it is the height associated with each group that should differ. In the above plot, the line of demarcation between group 1 and group 2 should be significantly higher, making the area taken up by group 1 significantly smaller.

Dealing with the relative density of the two groups is a bit ambiguous. Clearly, each group's density needs to have an integral of 1 for it to be a true density. The closest you can come is probably to have the integral of both curves sum to 1, which I think requires you to do the density calculation yourself then plot as a stacked geom_area:
library(tidyverse)
df %>%
  nest(data = -id) %>%
  summarize(id = factor(id),
            weight = unlist(map(data, ~sum(.x$weight))),
            dens = map(data, function(.x) {
              # Weighted density within the group (weights normalised to sum to 1)
              x <- density(.x$var, weights = .x$weight/sum(.x$weight))
              data.frame(x = x$x, y = x$y)
            })) %>%
  # Each group's share of the total weight
  mutate(weight = weight / sum(weight)) %>%
  unnest(dens) %>%
  # Scale each density so that the two areas sum to 1
  mutate(y = y * weight) %>%
  ggplot(aes(x, y, fill = id)) +
  geom_area(position = 'stack', color = 'black') +
  labs(y = 'density', x = 'var')
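To check that scaling each group's density by its share of the total weight gives the intended areas, here is a minimal base R sketch. It rebuilds the question's data as a plain data.frame and, as assumptions of mine rather than part of the answer above, evaluates both densities on a shared grid with a shared bandwidth (grid_from, grid_to and bw are illustrative names) and integrates them with a small trapezoid helper.

```r
set.seed(1)
var <- sort(rnorm(1000, mean = 5, sd = 2))
df <- data.frame(
  id     = rep(1:2, each = 1000),
  var    = c(var, var),
  weight = c(rep(.1, 500), rep(.2, 500), rep(.9, 500), rep(.8, 500))
)

# Assumption: a common grid and bandwidth so the two curves are comparable
grid_from <- min(df$var) - 3
grid_to   <- max(df$var) + 3
bw        <- bw.nrd0(df$var)

scaled <- lapply(split(df, df$id), function(g) {
  # Within-group weighted density; density() needs weights that sum to 1
  d <- density(g$var, weights = g$weight / sum(g$weight),
               bw = bw, from = grid_from, to = grid_to, n = 512)
  # Scale by the group's share of the total weight
  share <- sum(g$weight) / sum(df$weight)
  data.frame(id = g$id[1], x = d$x, y = d$y * share)
})
scaled <- do.call(rbind, scaled)

# Trapezoid rule for the area under each scaled curve
trapezoid <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
areas <- sapply(split(scaled, scaled$id), function(g) trapezoid(g$x, g$y))
print(round(areas, 3))
```

The two areas should come out close to the groups' weight shares (about 0.15 and 0.85 for this data) and sum to roughly 1. The scaled data frame has the same x, y and id columns that the geom_area call above expects.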

I am not completely sure if I understand you correctly, but maybe you can calculate the value beforehand based on the weight and then stack it like this:
library(ggplot2)
library(dplyr)
# Stacked
df %>%
mutate(weighted_var = var*weight) %>%
ggplot(aes(x = weighted_var, fill = factor(id), group = id)) +
geom_density(position = 'stack')
And check the groups with fill like this:
# Fill
df %>%
mutate(weighted_var = var*weight) %>%
ggplot(aes(x = weighted_var, fill = factor(id), group = id)) +
geom_density(position = 'fill')
Created on 2022-11-01 with reprex v2.0.2

Related

How to specify unique geom assignments to facets?

Below I have simulated a dataset where an assignment was given to 5 groups of individuals on 5 different days (a new group with 200 new individuals each day). TrialStartDate denotes the date on which the assignment was given to each individual (ID), and TrialFinishDate denotes when each individual finished the assignment.
set.seed(123)
data <-
data.frame(
TrialStartDate = rep(c(sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by="day"), 5)), each = 200),
TrialFinishDate = sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by = "day"), 1000,replace = T),
ID = seq(1,1000, 1)
)
I am interested in comparing how long individuals took to complete the trial depending on when they started the trial (i.e., assuming TrialStartDate has an effect on the length of time it takes to complete the trial).
To visualize this, I want to make a barplot showing counts of IDs on each TrialFinishDate where bars are colored by TrialStartDate (since each TrialStartDate acts as a grouping variable). The best I have come up with so far is by faceting like this:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
facet_wrap(~TrialStartDate, ncol = 1)
However, I also want to add a vertical line to each facet showing when the TrialStartDate was for each group (preferably colored the same as the bars). When attempting to add vertical lines with geom_vline, it adds all the lines to each facet:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(xintercept = unique(data$TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
How can we make the vertical lines unique to the respective group in each facet?
You're specifying xintercept outside of aes, so the faceting is not respected.
This should do the trick:
data %>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(aes(xintercept = TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
Note geom_vline(aes(xintercept = TrialStartDate))
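An alternative that may be easier to reason about is to give geom_vline its own small data frame with one row per facet. The sketch below uses a reduced, made-up dataset (trials and start_lines are hypothetical names) and maps colour explicitly inside the geom_vline call, since as far as I know geom_vline does not inherit aesthetics from the main ggplot() call.

```r
library(ggplot2)

set.seed(123)
starts <- as.Date("2019-02-01") + c(0, 3, 7)
trials <- data.frame(
  TrialStartDate  = rep(starts, each = 50),
  TrialFinishDate = as.Date("2019-02-01") + sample(0:14, 150, replace = TRUE)
)

# One row per facet: the faceting variable doubles as the line position
start_lines <- data.frame(TrialStartDate = starts)

p <- ggplot(trials) +
  geom_bar(aes(x = TrialFinishDate, fill = factor(TrialStartDate))) +
  geom_vline(data = start_lines,
             aes(xintercept = TrialStartDate,
                 colour = factor(TrialStartDate))) +
  facet_wrap(~TrialStartDate, ncol = 1)
```

Because start_lines contains the faceting column, each line is drawn only in the panel whose TrialStartDate matches.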

Bins for fixed interval in Longitudinal data and plotting it over the period of time by categories

This is longitudinal data: the values for each ID repeat 4 times within every tick of 20 steps, and then the experiment repeats. For the data frame below, I want to bin the values of x at every time step, separately for each category of Land.
There can be 3 bins per time interval for each land type (small, medium and large).
I want to see a timeline of the bins of x by category of Land.
Any help will be appreciated. I have added a picture of how the data may look for ggplot and how a plot of the bins or dots may look.
set.seed(123)
ID = 1:5
Time = rep (c(1,2,3,4,5), each = 20)
Type = 1:25
data <- data.frame( IDn = rep(ID,20), Time, Land = rep(Type, 40), y = rnorm(100,0,1), x = runif(100,0,1))
data$Land= ifelse (data$Land > 15,"large farmers", ifelse(data$Land <=5, "small farmers", "medium-farmers"))
Edit: Question for labeling the faceting variable and dot plots.
Maybe something like this would help -
library(dplyr)
library(ggplot2)
data %>%
group_by(Time, Land) %>%
mutate(x = cut(x, c(0, 0.25, 0.75, 1))) %>%
ungroup %>%
count(Time, Land, x) %>%
ggplot() + aes(Time, n, fill = Land) + geom_col(position = 'dodge')
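The binning step can also be checked on its own in base R. This is only a sketch on made-up data: obs, x_bin and the labels "low", "mid" and "high" are illustrative names of mine, and the breaks are the ones used in the answer above.

```r
set.seed(123)
obs <- data.frame(
  Time = rep(1:5, each = 20),
  Land = sample(c("small farmers", "medium-farmers", "large farmers"),
                100, replace = TRUE),
  x    = runif(100)
)

# Three bins over [0, 1]; include.lowest keeps a value of exactly 0 in the first bin
obs$x_bin <- cut(obs$x, breaks = c(0, 0.25, 0.75, 1),
                 labels = c("low", "mid", "high"), include.lowest = TRUE)

# Count observations per time step, land type and bin (empty cells get n = 0)
counts <- as.data.frame(
  table(Time = obs$Time, Land = obs$Land, x_bin = obs$x_bin),
  responseName = "n"
)
head(counts)
```

Unlike count(), table() keeps combinations with zero observations, which can be convenient when every time step should show the same set of bars.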

How to plot % positive cases (y-axis) by collection date (x-axis) and by other factors (R)?

Please help!
I have case data I need to prepare for a report soon and just cannot get the graphs to display properly.
From a dataset with CollectionDate as the "record" of cases (i.e. multiple rows with the same date means more cases that day), I want to display Number of positive cases/total (positive + negative) cases for that day as a percent on the y-axis, with collection dates along the x-axis. Then I want to break down by region. Goal is to look like this but in terms of daily positives/# of tests rather than just positives vs negatives. I also want to add a horizontal line on every graph at 20%.
I have tried manipulating it before, in and after ggplot:
ggplot(df_final, aes(x =CollectionDate, fill = TestResult)) +
geom_bar(aes(y=..prop..)) +
scale_y_continuous(labels=percent_format())
Which is, again, close. But the percents are wrong because they are just taking the proportion of that day against counts of all days instead of per day.
Then I tried using tally() in the following command to try and count per region and aggregate:
df_final %>%
group_by(CollectionDate, Region, as.factor(TestResult)) %>%
filter(TestResult == "Positive") %>%
tally()
and I still cannot get the graphs right.
Suggestions?
A quick look at my data:
head(df_final)
Well, I have to say that I am not 100% sure that I got what you want, but anyway, this can be helpful.
The data: Since you are new here, I should let you know that a simple, reproducible version of your data makes it easier for the rest of us to answer. To provide one, you can simulate a data frame or any other object, or call dput() on your data.
library(ggplot2)
library(dplyr)
data <- data.frame(
# date
CollectionDate = sample(
seq(as.Date("2020-01-01"), by = "day", length.out = 15),
size = 120, replace = TRUE),
# result
TestResult = sample(c("Positive", "Negative"), size = 120, replace = TRUE),
# region
Region = sample(c("Region 1", "Region2"), size = 120, replace = TRUE)
)
With this data, you can do as follows to get the plots you want.
# General plot, positive cases proportion
data %>%
count(CollectionDate, TestResult, name = "cases") %>%
group_by(CollectionDate) %>%
summarise(positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
geom_hline(yintercept = 0.2)
# positive proportion by day within region
data %>%
count(CollectionDate, TestResult, Region, name = "cases") %>%
group_by(CollectionDate, Region) %>%
summarise(
positive_pro = sum(cases[TestResult == "Positive"])/sum(cases)
) %>%
ggplot(aes(x = CollectionDate, y = positive_pro)) +
geom_col() +
# horizontal line at 20%
geom_hline(yintercept = 0.2) +
facet_wrap(~Region)
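The same daily proportion can be computed without a separate counting step, because the mean of a logical vector is the share of TRUE values. Below is a small base R sketch on simulated data; tests, is_positive and above_20 are names I made up for illustration.

```r
set.seed(1)
tests <- data.frame(
  CollectionDate = sample(
    seq(as.Date("2020-01-01"), by = "day", length.out = 5),
    size = 200, replace = TRUE),
  TestResult = sample(c("Positive", "Negative"), size = 200, replace = TRUE),
  Region = sample(c("Region 1", "Region 2"), size = 200, replace = TRUE)
)

# TRUE for positive tests; averaging it gives positives / all tests
tests$is_positive <- tests$TestResult == "Positive"

# One row per day and region with the share of positive tests
daily <- aggregate(is_positive ~ CollectionDate + Region,
                   data = tests, FUN = mean)
names(daily)[3] <- "positive_pro"

# Flag days above the 20% reference line
daily$above_20 <- daily$positive_pro > 0.2
head(daily)
```

The resulting daily data frame has the same shape as the summarised data piped into ggplot above, so it could be plotted with the same geom_col and geom_hline calls.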
I can get you halfway there (refer to the comments in the code for clarifications). This code is for the counts per day per region (plotted separately for each region). I think you can tweak things further to calculate the counts per day per county too; and whole state should be a cakewalk. I wish you good luck with your report.
rm(list = ls())
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(tidyr) #Needed for the spread() function
#Dummy data
set.seed(1984)
sdate <- as.Date('2000-03-09')
edate <- as.Date('2000-05-18')
dateslist <- as.Date(sample(as.numeric(sdate): as.numeric(edate), 10000, replace = TRUE), origin = '1970-01-01')
df_final <- data.frame(Region = rep_len(1:9, 10000),
CollectionDate = dateslist,
TestResult = sample(c("Positive", "Negative"), 10000, replace = TRUE))
#First tally the positive and negative cases
#by Region, CollectionDate, TestResult in that order
df_final %<>%
group_by(Region, CollectionDate, TestResult) %>%
tally()
#Then
#First spread the counts (in n)
#That is, create separate columns for Negative and Positive cases
#for each Region-CollectionDate combination
#Then calculate their proportions (as shown)
#Now you have Negative and Positive
#percentages by CollectionDate by Region
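df_final %<>%
  spread(key = TestResult, value = n) %>%
  # Compute the total first: mutate() evaluates sequentially, so Negative
  # must not be overwritten before Positive is calculated
  mutate(Total = Negative + Positive,
         Negative = Negative/Total,
         Positive = Positive/Total)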
#Plotting this now
#Since the percentages are available already
#Use geom_col() instead of geom_bar()
df_final %>% ggplot() +
geom_col(aes(x = CollectionDate, y = Positive, fill = "Positive"),
position = "identity", alpha = 0.4) +
geom_col(aes(x = CollectionDate, y = Negative, fill = "Negative"),
position = "identity", alpha = 0.4) +
facet_wrap(~ Region, nrow = 3, ncol = 3)
This yields:

Plotting multiple box plots as a single graph in R

I am trying to plot multiple box plots as a single graph. The data come from a Wilcoxon test that I have run. It should be like this
I have four/five questions and I want to plot the respondent score for two sets as a box plot. This should be done for all questions (Two groups for each question).
I am thinking of using ggplot2. My data is like
q1o <- c(4,4,5,4,4,4,4,5,4,5,4,4,5,4,4,4,5,5,5,5,5,5,5,5,5,3,4,4,3,4)
q1s <- c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,5,4,4)
q2o <- c(3,3,3,4,3,4,4,3,3,3,4,4,3,4,3,3,4,3,3,3,3,4,4,4,4,3,3,3,3,4)
q2s <- c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,3,4,4)
....
....
q1 means question 1 and q2 means question 2. I also want to know how to align these stacked box plots based on my need. Like one row or two rows.
This should get you started:
Unfortunately you don't provide a minimal example with sample data, so I will generate some random sample data.
# Generate sample data
set.seed(2017);
df <- cbind.data.frame(
value = rnorm(1000),
Label = sample(c("Good", "Bad"), 1000, replace = T),
variable = sample(paste0("F", 5:11), 1000, replace = T));
# ggplot
library(tidyverse);
df %>%
mutate(variable = factor(variable, levels = paste0("F", 5:11))) %>%
ggplot(aes(variable, value, fill = Label)) +
geom_boxplot(position=position_dodge()) +
facet_wrap(~ variable, ncol = 3, scales = "free");
You can specify the number of columns and rows in your 2d panel layout through arguments ncol and nrow, respectively, of facet_wrap. Many more details and examples can be found if you follow ?geom_boxplot and ?facet_wrap.
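As a quick illustration of the layout arguments, the sketch below builds the same boxplot twice and changes only nrow. The data are random and the names scores, base_plot, one_row and two_rows are made up for this example.

```r
library(ggplot2)

set.seed(2017)
scores <- data.frame(
  value    = rnorm(400),
  Label    = sample(c("Good", "Bad"), 400, replace = TRUE),
  variable = sample(paste0("F", 1:4), 400, replace = TRUE)
)

base_plot <- ggplot(scores, aes(Label, value, fill = Label)) +
  geom_boxplot()

# All four panels side by side
one_row <- base_plot + facet_wrap(~ variable, nrow = 1)

# The same panels arranged in a 2 x 2 grid
two_rows <- base_plot + facet_wrap(~ variable, nrow = 2)
```

Printing one_row or two_rows draws the corresponding layout; giving only nrow (or only ncol) lets facet_wrap work out the other dimension.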
Update 1
A boxplot based on your sample data doesn't make too much sense, because your data are not continuous. But ignoring that, you could do the following:
df <- data.frame(
q1o = c(4,4,5,4,4,4,4,5,4,5,4,4,5,4,4,4,5,5,5,5,5,5,5,5,5,3,4,4,3,4),
q1s = c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,5,4,4),
q2o = c(3,3,3,4,3,4,4,3,3,3,4,4,3,4,3,3,4,3,3,3,3,4,4,4,4,3,3,3,3,4),
q2s = c(5,4,4,5,5,5,5,5,4,5,4,4,5,4,5,5,5,5,5,5,5,5,5,5,5,5,4,3,4,4));
df %>%
gather(key, value, 1:4) %>%
mutate(
variable = ifelse(grepl("q1", key), "F1", "F2"),
Label = ifelse(grepl("o$", key), "Bad", "Good")) %>%
ggplot(aes(variable, value, fill = Label)) +
geom_boxplot(position = position_dodge()) +
facet_wrap(~ variable, ncol = 3, scales = "free");
Update 2
One way of visualising discrete data would be in a mosaicplot.
mosaicplot(table(df2));
The plot shows the count of value (as filled rectangles) per Variable per Label. See ?mosaicplot for details.
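The answer does not show how the table passed to mosaicplot was built, so the following is only a guess at one way to do it in base R. It uses a shortened, made-up version of the scores, and long, question, group and tab are hypothetical names.

```r
# Shortened, made-up scores in the same wide layout as the question
scores <- data.frame(
  q1o = c(4, 4, 5, 4, 3, 4),
  q1s = c(5, 4, 4, 5, 5, 5),
  q2o = c(3, 3, 3, 4, 3, 4),
  q2s = c(5, 4, 4, 5, 5, 3)
)

# Wide to long: stack() returns the columns 'values' and 'ind'
long <- stack(scores)
long$question <- substr(as.character(long$ind), 1, 2)  # "q1" or "q2"
long$group    <- substr(as.character(long$ind), 3, 3)  # "o" or "s"

# Three-way contingency table of counts
tab <- table(question = long$question,
             group    = long$group,
             score    = long$values)

mosaicplot(tab, color = TRUE, main = "Scores by question and group")
```

Each tile's area is proportional to the number of responses with that combination of question, group and score.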

How to change the colour of bins in ggplot (geom_bin2d) to reflect difference between density in that area and the average density across a dataset?

Say I have some data that looks a bit like this
library(ggplot2)
library(dplyr)
employee <- c('John','Dave','Paul','Ringo','George','Tom','Jim','Harry','Jamie','Adrian')
quality <- c('good', 'bad')
x = runif(4000,0,100)
y = runif(4000,0,100)
employ.data <- data.frame(employee, quality, x, y)
And I'm working with a geom_bin2d plot that looks like this
ggplot(employ.data, aes(x, y)) +
geom_bin2d(binwidth = c(20, 20)) +
scale_fill_gradient2(low="darkred", high = "darkgreen")
How can I change the colour of the bins to reflect the percentage of the x/y points that are 'bad' compared to the overall average in that area across the dataset? I.e, if the average of 'bad' points in the bottom left bin is x number, and the average for John in that area is y lower number, how can I make the bin colour darker to show that his count is lower?
I figured this could work to create the averages:
df2 <- employ.data
df2$xbin <- cut(df2$x, breaks = seq(0, 100, by = 20))
df2$ybin <- cut(df2$y, breaks = seq(0, 100, by = 20))
df2 <- df2 %>% group_by(xbin, ybin) %>% mutate(ave_pct = mean(quality == "bad"))
df2 <- df2 %>% group_by(employee, xbin, ybin) %>% mutate(person_pct = mean(quality == "bad"))
But then I have no idea how to plot that.
So if I am understanding you correctly, you would like to have the bins colored by how each bin's percentage of bad employees compares to the overall percentage of bad employees. To accomplish this, I changed the calculation to this:
df <- employ.data %>%
mutate(xbin = cut(x, breaks = seq(0, 100, by = 20)),
ybin = cut(y, breaks = seq(0, 100, by = 20)),
overall_ave = mean(quality == "bad")) %>%
group_by(xbin, ybin) %>%
mutate(bin_ave = mean(quality == "bad")) %>%
ungroup() %>%
mutate(bin_quality = bin_ave - overall_ave)
This creates the bins, then finds the overall percentage of "bad" quality employees. Then it groups by the respective bins, and finds the percentage of "bad" employees per bin. Then it compares each bin average to the overall average. Because bin_quality is bin_ave minus overall_ave, it is positive for bins with a higher-than-average percentage of "bad" employees and negative for bins with a lower-than-average percentage. Flip the subtraction if you want positive values to mean a better-than-average bin.
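To make the sign convention concrete, here is a base R sketch of the same calculation on a smaller simulated dataset. The names pts, is_bad, bin_bad and bin_diff are mine, and ave() stands in for the group_by() and mutate() steps.

```r
set.seed(42)
pts <- data.frame(
  quality = sample(c("good", "bad"), 400, replace = TRUE),
  x = runif(400, 0, 100),
  y = runif(400, 0, 100)
)

# 20 x 20 bins on both axes
pts$xbin <- cut(pts$x, breaks = seq(0, 100, by = 20), include.lowest = TRUE)
pts$ybin <- cut(pts$y, breaks = seq(0, 100, by = 20), include.lowest = TRUE)

is_bad  <- pts$quality == "bad"
overall <- mean(is_bad)

# Share of "bad" points within each (xbin, ybin) cell, repeated for every row
pts$bin_bad <- ave(as.numeric(is_bad), pts$xbin, pts$ybin, FUN = mean)

# Positive: the bin has more "bad" points than average; negative: fewer
pts$bin_diff <- pts$bin_bad - overall
summary(pts$bin_diff)
```

Averaged over all points the differences cancel out to zero, which is a handy sanity check that the per-bin shares were computed against the right baseline.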
You can then graph it by adding a fill = bin_quality and group = bin_quality argument to your aes() call inside of ggplot. You also need to add aes(group = bin_quality) to your geom_bin2d call. It looks like this:
ggplot(df, aes(x, y, fill = bin_quality, group = bin_quality)) +
geom_bin2d(aes(group = bin_quality), binwidth = c(20, 20)) +
scale_fill_gradient2(low="darkred", high = "darkgreen")
This gives you this graph:

Resources