Plot a line on a barchart in ggplot2 - r

I have built a stacked bar chart showing the relative proportions of responses to different questions. Now I want to show a particular response on top of that bar chart, to show how an individual's response relates to the overall proportions of responses.
I created a toy example here:
library(ggplot2)
n = 1000
n_groups = 5
overall_df = data.frame(
state = sample(letters[1:8], n, replace = TRUE),
frequency = runif(n, min = 0, max = 1),
var_id = rep(LETTERS[1:n_groups], each = 1000 / n_groups)
)
row = data.frame(
A = "a", B = "b", C = "c", D = "h", E = "b"
)
ggplot(overall_df,
aes(fill=state, y=frequency, x=var_id)) +
geom_bar(position="fill", stat="identity")
The goal here is to have the responses in the object row plotted as a point in the corresponding barchart box, with a line connecting the points.
Here is a (poorly drawn) example of the desired result. Thanks for your help.

This was trickier than I thought. I'm not sure there's any way round manually calculating the x/y co-ordinates of the line.
library(dplyr)
library(ggplot2)
df <- overall_df %>% group_by(state, var_id) %>%
summarize(frequency = sum(frequency))
freq <- unlist(Map(function(d, val) {
(sum(d$frequency[d$state > val]) + 0.5 * d$frequency[d$state == val]) /
sum(d$frequency)
}, d = split(df, df$var_id), val = row))
line_df <- data.frame(state = unlist(row),
frequency = freq,
var_id = names(row))
ggplot(df, aes(fill=state, y=frequency, x=var_id)) +
geom_col(position="fill") +
geom_line(data = line_df, aes(group = 1)) +
geom_point(data = line_df, aes(group = 1))
Created on 2022-03-08 by the reprex package (v2.0.1)
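To make the arithmetic concrete, here is a minimal sketch of the midpoint calculation for a single bar. The helper name `segment_midpoint` and the toy numbers are hypothetical. The only thing it assumes about ggplot2 is the default stacking order, in which the first factor level ends up at the top of the bar, so every state that sorts after the chosen one lies below it.

```r
# Hypothetical helper: y position of the centre of one segment in a
# bar drawn with position = "fill", assuming ggplot2's default stacking
# order (first level on top, last level at the bottom).
segment_midpoint <- function(states, freqs, chosen) {
  below <- sum(freqs[states > chosen])   # segments stacked underneath
  own   <- sum(freqs[states == chosen])  # the chosen segment itself
  (below + 0.5 * own) / sum(freqs)       # rescale to the 0-1 fill axis
}

# Toy bar with three states of height 2, 3 and 5 (total 10)
states <- c("a", "b", "c")
freqs  <- c(2, 3, 5)

segment_midpoint(states, freqs, "c")  # bottom segment: 2.5 / 10 = 0.25
segment_midpoint(states, freqs, "b")  # middle segment: (5 + 1.5) / 10 = 0.65
segment_midpoint(states, freqs, "a")  # top segment: (8 + 1) / 10 = 0.9
```

The answer's `Map()` call applies the same formula once per `var_id`.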

Here's an automated approach using dplyr. I prepare the summary by joining the label data to the original data, and then using group_by + summarize to get the y position of each point.
library(dplyr)
row_df <- data.frame(state = letters[1:n_groups], var_id = LETTERS[1:n_groups])
line_df <- row_df %>%
left_join(overall_df, by = "var_id") %>%
group_by(var_id) %>%
summarize(state = last(state.x),
frequency = (sum(frequency[state.x < state.y]) +
sum(frequency[state.x == state.y])/2) / sum(frequency))
ggplot(overall_df, aes(fill=state, y=frequency, x=var_id)) +
geom_bar(position="fill", stat="identity") +
geom_point(data = line_df) +
geom_line(data = line_df, aes(group = 1))

Related

How to plot line graph of normalized differences from binned data with ggplot?

I have several sets of data that I calculate binned normalized differences for. The results I want to plot within a single line plot using ggplot. The lines representing different combinations of the paired differences are supposed to be distinguished by colors and line types.
I am stuck on taking the computed values from the bins (would be y-axis values now), and plotting these onto an x-axis.
Below is the code I use for importing the data and calculating the normalized differences.
library(data.table)  # provides fread()
# Read data from column 3 as data table for different number of rows
# you could use replicate here for test
# dat1 <- data.frame(replicate(1,sample(25:50,10000,rep=TRUE)))
# dat2 <- data.frame(replicate(1,sample(25:50,9500,rep=TRUE)))
dat1 <- fread("/dir01/a/dat01.txt", header = FALSE, data.table=FALSE, select=c(3))
dat2 <- fread("/dir02/c/dat02.txt", header = FALSE, data.table=FALSE, select=c(3))
# Change column names
colnames(dat1) <- c("Dat1")
colnames(dat2) <- c("Dat2")
# Perhaps there is a better way to compute the following as all-in-one? I have broken these down step by step.
# 1) Sum for each bin
bin1 = cut(dat1$Dat1, breaks = seq(25, 50, by = 2))
sum1 = tapply(dat1$Dat1, bin1, sum)
bin2 = cut(dat2$Dat2, breaks = seq(25, 50, by = 2))
sum2 = tapply(dat2$Dat2, bin2, sum)
# 2) Total sum of all bins
sumt1 = sum(sum1)
sumt2 = sum(sum2)
# 3) Divide each bin by total sum of all bins
sumn1 = lapply(sum1, `/`, sumt1)
sumn2 = lapply(sum2, `/`, sumt2)
# 4) Convert to data frame as I'm not sure how to difference otherwise
df_sumn1 = data.frame(sumn1)
df_sumn2 = data.frame(sumn2)
# 5) Difference between the two as percentage
dbin = (df_sumn1 - df_sumn2)*100
How can I plot those results using ggplot() and geom_line()?
I want
dbin values on the x-axis ranging from 25-50
different colors and line types for the lines
Here is what I tried:
p1 <- ggplot(dbin, aes(x = ?, color=Data, linetype=Data)) +
geom_line() +
scale_linetype_manual(values=c("solid")) +
scale_x_continuous(limits = c(25, 50)) +
scale_color_manual(values = c("#000000"))
dput(dbin) outputs:
structure(list(X.25.27. = -0.0729132928804117, X.27.29. = -0.119044772581772,
X.29.31. = 0.316016473225017, X.31.33. = -0.292812782147632,
X.33.35. = 0.0776336591308158, X.35.37. = 0.0205584754637611,
X.37.39. = -0.300768421159599, X.39.41. = -0.403235174844081,
X.41.43. = 0.392510458816457, X.43.45. = 0.686758883448307,
X.45.47. = -0.25387105113263, X.47.49. = -0.0508324553382303), class = "data.frame", row.names = c(NA,
-1L))
Edit
The final piece of code that works, using only the dbin and plots multiple dbins:
library(tidyverse)
library(reshape2)  # for melt()
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100)))
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100)))
dat3 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 12:37/100)))
dat4 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 37:12/100)))
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinA = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
diff_data2 <-
full_join(
calc_bin_props(data = dat3),
calc_bin_props(dat4),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinB = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
# Combine two differences, and remove sum.x and sum.y
full_data <- cbind(diff_data, diff_data2[,4])
full_data <- full_data[,-c(2:3)]
# Melt the data to plot more than 1 variable on a plot
m <- melt(full_data, id.vars="bin")
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(m, aes(as.numeric(bin), value, col=variable, linetype = variable)) +
geom_line() +
scale_linetype_manual(values=c("solid", "longdash")) +
scale_color_manual(values = c("black", "black"))
dev.off()
library(tidyverse)
Creating example data as shown in the question, but adding different probabilities to the two sample() calls to create a visible difference between the two sets of randomized data.
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100))) %>% as_tibble()
Using dplyr we can handle this within data.frames (tibbles) without
the need to switch to other datatypes.
Let’s define a function that can be applied to both datasets to get
the preprocessing done.
We use base::cut() to create
a new column that pairs each value with its bin. We then group the data
by bin, calculate the sum for each bin and finally divide the bin sums
by the total sum.
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
Now we call calc_bin_props() on both datasets and join them by bin.
This gives us a dataframe with the columns bin, sum.x and sum.y.
The latter two correspond to the bin sums derived from dat1 and
dat2. With the mutate() line we calculate the differences between the
two columns.
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
mutate(dbin = (sum.x - sum.y),
bin = as.numeric(as.character(bin)))
Before we feed the data into ggplot() we convert it to the long
format using pivot_longer(). This allows us to instruct ggplot() to
plot the results for sum.x, sum.y and dbin as separate lines.
diff_data %>%
pivot_longer(-bin) %>%
ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
geom_line() +
scale_linetype_manual(values=c("longdash", "solid", "solid")) +
scale_color_manual(values = c("black", "purple", "green"))
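If only the one-row dbin data frame from the question is available, it can also be reshaped directly. The following is a rough sketch and not part of the answer above. It assumes column names of the form `X.25.27.` as in the `dput()` output, uses the lower edge of each bin as the x value, and works on a shortened, rounded copy of the data.

```r
library(ggplot2)

# Shortened, rounded copy of the one-row data frame from the dput() output
dbin <- data.frame(X.25.27. = -0.0729, X.27.29. = -0.1190, X.29.31. = 0.3160)

# Pull the lower bin edge out of names such as "X.25.27."
lower <- as.numeric(sub("^X\\.([0-9]+)\\..*$", "\\1", names(dbin)))

# One row per bin: numeric x position plus the difference value
long <- data.frame(bin = lower, value = unlist(dbin, use.names = FALSE))

ggplot(long, aes(x = bin, y = value)) +
  geom_line() +
  scale_x_continuous(limits = c(25, 50))
```

Several such long data frames could be stacked with `rbind()` after adding an identifying column, which would then be mapped to `color` and `linetype`.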

ggplot in R - geom_tile with color-splitted tiles

Suppose I have a data where group1 and group2 both assign an integer value from 0 to 4 to the entities a,b,c,d,e, so:
data <- data.frame(data_id = c(letters[1:5], letters[1:5]), data_group = c(replicate(5, "Group1"), replicate(5, "Group2")), data_value = c(0:4, replicate(5,2)))
I want to plot these values using geom_tile() from the ggplot package in R:
ggplot(data, aes(x=data_value, y=data_id)) +
geom_tile(aes(fill = data_group), width = 0.4, height = 0.8)
The graph looks like this:
My problem is that for entity c, Group1 and Group2 both assign the same value 2, but the red tile is overlaid by the blue one. Ideally, I would like to have a split tile in this case, that is, half red and half blue. Does anyone have an idea of how to do this?
Many thanks in advance!
I feel like this would be best approached by splitting the data into overlapping and non-overlapping sets, then plotting them with separate geom_tile commands:
library(dplyr)
library(ggplot2)
data <- data.frame(data_id = c(letters[1:5],
letters[1:5]),
data_group = c(replicate(5, "Group1"),
replicate(5, "Group2")),
data_value = c(0:4, replicate(5,2)))
data_unique <- data %>% ## non-overlapping data
group_by(data_id, data_value) %>%
filter(n() == 1)
data_shared <- data %>% ## overlapping data
group_by(data_id, data_value) %>%
filter(n() != 1)
ggplot(data,
aes(x = data_value, y = data_id)) +
geom_tile(data = data_unique, aes(fill = data_group, group = data_group),
width = 0.4, height = 0.8) + ## non-overlapping data
geom_tile(data = data_shared, aes(fill = data_group, group = data_group),
width = 0.4, height = 0.8,
position = "dodge") ## overlapping data
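To see which rows end up in each set before plotting, one could flag them instead of filtering. This is a small sketch using `dplyr::add_count()`; the column names `n_same` and `shared` are made up for illustration.

```r
library(dplyr)

data <- data.frame(data_id    = rep(letters[1:5], 2),
                   data_group = rep(c("Group1", "Group2"), each = 5),
                   data_value = c(0:4, rep(2, 5)))

# Count how many groups gave each entity the same value
flagged <- data %>%
  add_count(data_id, data_value, name = "n_same") %>%
  mutate(shared = n_same > 1)

# Only entity "c" (value 2 from both groups) is flagged as shared
filter(flagged, shared)
```

Rows with `shared == TRUE` correspond to `data_shared` in the answer, the rest to `data_unique`.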

How to plot a(n unknown) number of data series as geom_line in same chart

My first Q here, so please go lightly if I'm out of step anywhere.
I'm trying to write R code that produces a single chart containing a number of data series as lines. The number of data series may vary but will be provided in the data frame. I have tried to adapt the code from another thread to draw the lines with geom_line, but without success.
The logic is:
#desire to replace loop of 1:5 with ncol(df)
print(ggplot(df,aes(x=time))
for (i in 1:5) {
print (+ geom_line(aes(y=df[,i]))
}
#functioning geom point loops ggplot production:
for (i in 1:5) {
print(ggplot(df,aes(x=time,y=df[,i]))+geom_point())
}
#functioning multi-line ggplot where n is explicit:
ggplot(data=df, aes(x=time), group=1) +
geom_line(aes(y=df$`3`))+
geom_line(aes(y=df$`4`))
The functioning example code produces n number of point charts, 5 in this case. I would like just one chart to contain n line series.
This may be similar to How to plot n dimensional matrix? for which there are currently no relevant answers
Any contributions much appreciated, thanks
You can use gather from the tidyverse to do that.
As you didn't supply sample data, I used mtcars.
I created two data.frames, one with 3 columns besides mpg and one with 9. For each of them I plotted all of the variables against the variable mpg.
library(tidyverse)
df3Columns <- mtcars[, 1:4]
df9Columns <- mtcars[, 1:10]
df3Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
df9Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
Edit - using the sample data in comments.
library(tidyverse)
df %>%
rownames_to_column("time") %>%
gather(var, value, -time) %>%
ggplot(aes(time, value, group = var, color = var)) +
geom_line()
Sample data:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
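In current tidyr, `gather()` is superseded by `pivot_longer()`. Here is a sketch of the same reshaping with the sample data, assuming tidyr 1.0.0 or later. Converting `time` to a number for the x axis is an extra step that is not in the answer above.

```r
library(tidyr)
library(tibble)
library(ggplot2)

df <- structure(list("39083" = c(96, 100, 100),
                     "39090" = c(99, 100, 100),
                     "39097" = c(99, 100, 100)),
                row.names = 3:5, class = "data.frame")

# Row names become the time column; every other column becomes a var/value pair
long <- pivot_longer(rownames_to_column(df, "time"),
                     -time, names_to = "var", values_to = "value")

ggplot(long, aes(as.numeric(time), value, group = var, color = var)) +
  geom_line()
```

Because the reshaping works on whatever columns are present, the same code handles any number of series.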
To strictly answer your question, you can simply store your ggplot in a variable and add the geom_line one by one:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
g <- ggplot(df, aes(x = 1:nrow(df)))
for (i in colnames(df))
{
g <- g + geom_line(y = df[,i])
}
g <- g + scale_y_continuous(limits = c(min(df), max(df)))
print(g)
However, this is not a very convenient solution. I would highly recommend restructuring your data frame into the long format that ggplot expects.
df.ultimate <- data.frame(time = numeric(), value = numeric(), group = character())
for (i in colnames(df))
{
df.ultimate <- rbind(df.ultimate, data.frame(time = 1:nrow(df), value = df[, i], group = i))
}
g <- ggplot(df.ultimate, aes(x = time, y = value, color = group))
g <- g + geom_line()
print(g)
A one-line solution:
ggplot(data.frame(time = rep(1:nrow(df), ncol(df)),
value = as.vector(as.matrix(df)),
group = rep(colnames(df), each = nrow(df))),
aes(x = time, y = value, color = group)) + geom_line()

How to specify two curves with different colors and shapes than all the others on geom_density

So I have a dataframe with 2 columns: "ID" and "Score".
ID contains the name of a simulation, and each simulation has 58 different scores that are listed in the column Score.
There are 10 simulations.
I am doing a geom_density plot:
my_dataframe %>%
ggplot(aes(x=`Score`), xlim = c(0, 1)) +
geom_density(aes(color = ID)) +
theme_bw() +
labs(title = "Scores")
https://imgur.com/a/9DUTmWw
How can I tell ggplot that I want the curves of Simulation1 and Simulation2 to look different from the others? I want them to be red and drawn with a thicker line than all the other ones.
Thank you for your help,
Best,
Maxime
Something like this?
my_dataframe %>% mutate(group = ifelse(ID %in% c(1,2), 'special', 'NonSpecial')) %>%
ggplot(aes(x=`Score`, lty = group), xlim = c(0, 1)) +
geom_density(aes(color = ID)) +
theme_bw() +
labs(title = "Scores")
I used this data:
my_dataframe <- data.frame(ID = factor(sample(1:4, 100, T)), Score = sin(1:100))
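The snippet above separates the two groups by line type. To get closer to the red, thicker curves asked for in the question, one possible sketch is to draw the highlighted simulations in a layer of their own. It reuses the toy data, treats IDs 1 and 2 as stand-ins for Simulation1 and Simulation2, and assumes ggplot2 3.4.0 or later for the `linewidth` argument.

```r
library(ggplot2)

set.seed(1)
my_dataframe <- data.frame(ID = factor(sample(1:4, 100, TRUE)),
                           Score = sin(1:100))

# IDs to emphasise; these stand in for Simulation1 and Simulation2
special <- c("1", "2")
is_special <- my_dataframe$ID %in% special

p <- ggplot(mapping = aes(x = Score)) +
  # ordinary curves keep the default colour scale
  geom_density(data = my_dataframe[!is_special, ], aes(color = ID)) +
  # highlighted curves: fixed red colour and a thicker line
  # (linewidth needs ggplot2 >= 3.4.0; older versions use size instead)
  geom_density(data = my_dataframe[is_special, ], aes(group = ID),
               color = "red", linewidth = 1.5) +
  theme_bw() +
  labs(title = "Scores")
p
```

Setting `color` and `linewidth` outside `aes()` keeps the highlighted curves out of the legend. If they should appear there, mapping them inside `aes()` and using a manual colour scale would be the alternative.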

Plot NA counts in a histogram

I have a question about histograms in R using ggplot2. I have been trying to show the values of two different variables in one histogram. After looking for solutions on Stack Overflow I got it working, but does anybody know how to show the NA count as an additional bar, so that the missing values in the two variables can be compared?
Here is the R code:
i<-"ADL_1_bathing"
j<-"ADL_1_T2_bathing"
t1<-data.frame(datosMedicos[,i])
colnames(t1)<-"datos"
t2<-data.frame(datosMedicos[,j])
colnames(t2)<-"datos"
t1$time<-"t1"
t2$time<-"t2"
juntarParaGrafico<-rbind(t1,t2)
ggplot(juntarParaGrafico, aes(datos, fill = time) ) +
geom_histogram(col="darkblue",alpha = 0.5, aes(y = ..count..), binwidth = 0.2, position = 'dodge', na.rm = F) +
theme(legend.justification = c(1, 1), legend.position=c(1, 1))+
labs(title=paste0("Distribution of ",i), x=i, y="Count")
And this is the output (image: the values of the two variables, without bars for the missing values).
You could try to summarise the number of NAs before plotting. How about this?
library(ggplot2)
library(dplyr)
df1 = data.frame(a = rnorm(20))
df1[sample(1:20, 5),] = NA
df2 = data.frame(a = rnorm(20))
df2[sample(1:20, 3),] = NA
df2$time = "t2"
df1$time = "t1"
df = rbind(df1, df2)
df %>% group_by(time) %>% summarise(numNAs = sum(is.na(a)))
histogramDF= df %>% group_by(time) %>% summarise(numNAs = sum(is.na(a)))
ggplot(histogramDF, aes(x = time, y = numNAs, fill = time)) + geom_col()
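If the missing values should appear as an extra bar next to the observed values, one option is to turn NA into an explicit category before plotting. This is only a sketch: it assumes the variable takes a small number of discrete scores, and the data and the column name `datos_label` are made up.

```r
library(ggplot2)

set.seed(42)
# Made-up stand-in for the two ADL columns: scores 0-3 with some missing values
scores <- data.frame(
  datos = c(sample(c(0:3, NA), 30, replace = TRUE),
            sample(c(0:3, NA), 30, replace = TRUE)),
  time  = rep(c("t1", "t2"), each = 30)
)

# Turn NA into an explicit category so that it gets its own bar
scores$datos_label <- ifelse(is.na(scores$datos), "Missing",
                             as.character(scores$datos))

ggplot(scores, aes(datos_label, fill = time)) +
  geom_bar(position = "dodge") +
  labs(x = "Score", y = "Count")
```

For a continuous variable this would not work directly; there the separate NA summary shown above is the simpler route.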

Resources