Add text labels to a ggplot2 mosaic plot - r

Using the following data:
Category <- c("Bankpass", "Bankpass", "Bankpass", "Moving", "Moving")
Subcategory <- c("Stolen", "Lost", "Login", "Address", "New contract")
Weight <- c(10,20,13,40,20)
Duration <- as.character(c(0.2,0.4,0.5,0.44,0.66))
Silence <- as.character(c(0.1,0.3,0.25,0.74,0.26))
df <- data.frame(Category, Subcategory, Weight, Duration, Silence)
Which I use to create the following mosaic plot:
library (ggplot2)
library (ggmosaic)
g <- ggplot(data = df) +
geom_mosaic(aes(weight = Weight, x = product(Category), fill = Duration),
offset = 0, na.rm = TRUE) +
theme(axis.text.x = element_text(angle = -25, hjust = .1)) +
theme(axis.title.x = element_blank()) +
scale_fill_manual(values = c("#e8f5e9", "#c8e6c9", "#a5d6a7", "#81c784", "#66bb6a"))
This works; however, I would like to add text labels to the elements of the graph (showing, e.g., "Stolen", "Lost", etc.).
However, when I do:
g + geom_text(x = Category, y = Subcategory, label = Weight)
I get the following error:
Error in UseMethod("rescale") : no applicable method for 'rescale' applied to an object of class "character"
Any thoughts on what goes wrong here?

Here is my attempt. The x-axis is mapped to a discrete variable (i.e., Category), so you cannot use it in geom_text() this way. You need numeric positions on the x-axis and, similarly, on the y-axis for the labels. To get numeric values for both dimensions, I decided to access the data frame behind your graphic. With the ggmosaic package there is one such data frame in this case, and you can get it using ggplot_build(). You can calculate x and y values from the information in that data frame (e.g., xmin and xmax). This is good news. But there is bad news too: when you look at the data, you realize that it contains no information about Subcategory, which you need for the labels.
We can overcome this challenge by joining the data frame above with the original data. For the join, I calculated the proportion in both the original data and the plot data. The values are purposely converted to character so that they can serve as the join key. temp is the data set you need in order to add labels.
library(dplyr)
library(ggplot2)
library(ggmosaic)
# Add proportion for each and convert to character for join
df <- group_by(df, Category) %>%
mutate(prop = as.character(round(Weight / sum(Weight),3)))
# Add proportion for each and convert to character.
# Get x and y values for positions
# Use prop for join
temp <- ggplot_build(g)$data %>%
as.data.frame %>%
transmute(prop = as.character(round(ymax - ymin, 3)),
x.position = (xmax + xmin) / 2,
y.position = (ymax + ymin) / 2) %>%
right_join(df)
g + geom_text(x = temp$x.position, y = temp$y.position, label = temp$Subcategory)
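The join above relies on rounded proportions being unique, so it could mismatch if two tiles happened to share the same proportion. As a hedged alternative sketch (plain dplyr and ggplot2, not the ggmosaic API), the tile coordinates can be computed by hand and drawn with geom_rect(), which keeps Subcategory attached to every rectangle. The names cols and rects are illustrative, and the layout assumes columns sized by Category totals with Subcategory stacked inside each column.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  Category    = c("Bankpass", "Bankpass", "Bankpass", "Moving", "Moving"),
  Subcategory = c("Stolen", "Lost", "Login", "Address", "New contract"),
  Weight      = c(10, 20, 13, 40, 20)
)

# Column widths: each Category's share of the total weight
cols <- df %>%
  group_by(Category) %>%
  summarise(total = sum(Weight), .groups = "drop") %>%
  mutate(xmax = cumsum(total) / sum(total),
         xmin = xmax - total / sum(total))

# Tile heights: each Subcategory's share within its Category,
# plus the midpoints used to place the labels
rects <- df %>%
  left_join(cols, by = "Category") %>%
  group_by(Category) %>%
  mutate(ymax = cumsum(Weight) / sum(Weight),
         ymin = ymax - Weight / sum(Weight)) %>%
  ungroup() %>%
  mutate(x.position = (xmin + xmax) / 2,
         y.position = (ymin + ymax) / 2)

p <- ggplot(rects) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
                fill = Subcategory), colour = "white") +
  geom_text(aes(x = x.position, y = y.position, label = Subcategory))
p
```

Because the label positions live in the same data frame as the rectangles, no join is needed and the labels are mapped inside aes() rather than passed as raw vectors.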

I think you are looking for something like this
library(ggplot2)
library(ggmosaic)
Your data:
Category <- c("Bankpass", "Bankpass", "Bankpass", "Moving", "Moving")
Subcategory <- c("Stolen", "Lost", "Login", "Address", "New contract")
Weight <- c(10,20,13,40,20)
Duration <- as.character(c(0.2,0.4,0.5,0.44,0.66))
Silence <- as.character(c(0.1,0.3,0.25,0.74,0.26))
mydf <- data.frame(Category, Subcategory, Weight, Duration, Silence)
ggplot(data = mydf) +
geom_mosaic(aes( x = product(Duration, Subcategory), fill=factor(Duration)), na.rm=TRUE) +
theme(axis.text.x=element_text(angle=-25, hjust= .1)) +
labs(x="Subcategory", title='f(Duration, Subcategory | Category)') +
facet_grid(Category~.) +
guides(fill=guide_legend(title = "Duration", reverse = TRUE))
The output is:
This is about the best you can do with the ggmosaic package. You may want to try other packages.
Good luck for your project work ;-)

Related

Plotting geom_segment with position_dodge

I have a data set with information of where individuals work at over time. More specifically, I have information on the interval at which individuals work in a given workplace.
library('tidyverse')
library('lubridate')
# individual A
a_id <- c(rep('A',1))
a_start <- c(201201)
a_end <- c(201212)
a_workplace <-c(1)
# individual B
b_id <- c(rep('B',2))
b_start <- c(201201, 201207)
b_end <- c(201206, 201211)
b_workplace <-c(1, 2)
# individual C
c_id <- c(rep('C',2))
c_start <- c(201201, 201202)
c_end <- c(201204, 201206)
c_workplace <-c(1, 2)
# individual D
d_id <- c(rep('D',1))
d_start <- c(201201)
d_end <- c(201201)
d_workplace <-c(1)
# final data frame
id <- c(a_id, b_id, c_id, d_id)
start <- c(a_start, b_start, c_start, d_start)
end <- c(a_end, b_end, c_end, d_end)
workplace <- as.factor(c(a_workplace, b_workplace, c_workplace, d_workplace))
mydata <- data.frame(id, start, end, workplace)
mydata_ym <- mydata %>%
mutate(ymd_start = as.Date(paste0(start, "01"), format = "%Y%m%d"),
ymd_end0 = as.Date(paste0(end, "01"), format = "%Y%m%d"),
day_end = as.numeric(format(ymd_end0 + months(1) - days(1), format = "%d")),
ymd_end = as.Date(paste0(end, day_end), format = "%Y%m%d")) %>%
select(-ymd_end0, -day_end)
I would like a plot that shows how long each individual works at each workplace as well as how they move around. I tried plotting with geom_segment, since I have the start and end dates for each individual at each workplace. Also, because the same individual may work in more than one place during the same month, I would like to use position_dodge to make overlaps of different workplaces for the same id and time visible. This was suggested in this post: Ggplot (geom_line) with overlaps
ggplot(mydata_ym) +
geom_segment(aes(x = id, xend = id, y = ymd_start, yend = ymd_end),
position = position_dodge(width = 0.1), size = 2) +
scale_x_discrete(limits = rev) +
coord_flip() +
theme(panel.background = element_rect(fill = "grey97")) +
labs(y = "time", title = "Work affiliation")
The problem I am having is that: (i) the position_dodge doesn't seem to be working, (ii) I don't know why all the segments are being colored in black. I would expect each workplace to have a different color and a legend to show up.
If you include colour = workplace in the aes() mapping for geom_segment you get colours and a legend and some dodging, but it doesn't work quite right (it looks like position_dodge only applies to x and not xend; this seems like a bug, or at least an "infelicity", in position_dodge).
However, replacing geom_segment with an appropriate use of geom_linerange does seem to work:
ggplot(mydata_ym) +
geom_linerange(aes(x = id, ymin = ymd_start, ymax = ymd_end, colour = workplace),
position = position_dodge(width = 0.1), size = 2) +
scale_x_discrete(limits = rev) +
coord_flip()
(some tangential components omitted).
A similar approach was previously documented here; it is a near-duplicate of your question once the colour = mapping is taken care of.
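To see the dodging at work, here is a minimal sketch on a small hand-made data frame (jobs is a hypothetical stand-in for mydata_ym, with the dates typed in directly). It draws overlapping spells with geom_linerange and then inspects the built layer data to confirm that the two workplaces of one id end up at different x positions. The linewidth argument assumes ggplot2 3.4.0 or later; older versions use size instead.

```r
library(ggplot2)

# Hypothetical stand-in for mydata_ym: two people, two workplaces each
jobs <- data.frame(
  id        = c("B", "B", "C", "C"),
  start     = as.Date(c("2012-01-01", "2012-07-01", "2012-01-01", "2012-02-01")),
  end       = as.Date(c("2012-06-30", "2012-11-30", "2012-04-30", "2012-06-30")),
  workplace = factor(c(1, 2, 1, 2))
)

p <- ggplot(jobs) +
  geom_linerange(aes(x = id, ymin = start, ymax = end, colour = workplace),
                 position = position_dodge(width = 0.3), linewidth = 2) +
  scale_x_discrete(limits = rev) +
  coord_flip() +
  labs(y = "time", title = "Work affiliation")

# The x column of the built data holds the dodged positions
built <- ggplot_build(p)$data[[1]]
built[, c("x", "ymin", "ymax", "group")]
p
```

Mapping colour = workplace is what creates one group per id and workplace; without it there is nothing for position_dodge to separate.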

Is there a way I could plot t = 300, 350, 450, and 500 lines in one graph?

I wanted to plot multiple lines in one graph but I couldn't figure out which code to use. Also, is there a way I could assign colors to each of the lines? I'm new to RStudio and was assigned to pick up someone's work, so I've been doing a lot of trial and error, but I haven't had any luck for the past few days. Hope someone could help me with this! Thank you so much.
ecdf.shift <- function(OUR_threshold, des_cap = 40, nint = 10000){
#create some empty vectors for later use in the loop
ecdf_med = c()
ecdf_obs = c()
for (i in 1:length(OUR_threshold)){
# filter out the OUR threshold data, then select only the capture column and create a ecdf function
ecdf_fun <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
ecdf()
# extract the ecdf data and put in tibble dataframe, then create a linear interpolation of the curve.
ecdf_data <- tibble(TSS_con = environment(ecdf_fun)$x, prob = environment(ecdf_fun)$y)
ecdf_interpol <- approx(x = ecdf_data$TSS_con, y = ecdf_data$prob, n = nint)
# find the positions in x that correspond to the desired capture, then look up the probabilities at those positions in y. Take the median in case of multiple hits and store it at index i, as dictated by the loop counter.
ecdf_med[i] <- median(ecdf_interpol$y[(round(ecdf_interpol$x,1) == des_cap)])
# calculate the number of observations when the filtering takes place.
ecdf_obs[i] <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
length()
# Flush the ecdf data. The ecdf is encoded as a function with global parameters, so you want to reset them every time the loop is done to keep pesky bugs from appearing.
rm(ecdf_data)
}
#create a tibble dataframe with all the loop data.
ecdf_out <- tibble(OUR_ratio_cutoff = OUR_threshold, prob = (ecdf_med)*100, nobs = ecdf_obs)
return(ecdf_out)
}
ratio_threshold <- seq(0,115, by = 5)
t = ecdf_MLSS_target <- 400 %>%
ecdf.shift(ratio_threshold, .) %>%
filter(nobs > 2) %>%
ggplot(aes( x = OUR_ratio_cutoff, y = prob)) +
geom_line() +
geom_point() +
theme_bw(base_size = 12) +
theme(panel.grid = element_blank()) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,300, by = 5),
expand = c(0,0)) +
scale_x_continuous(limits = c(0,120),
breaks = seq(0,110, by = 10),
expand = c(0,0)) +
labs(x = "ESS mg TSS/L",
y = "Probability of contactor MLSS > 400 mg TSS/L ")
plot(t)
The easiest approach would be to loop over your different t values first, combine the resulting data frames into one big data frame, and use that for your plot. Your code is not fully reproducible (it requires data that we do not have, i.e. HRP_rESS_no), so I have stripped the function down to its core: creating a data frame that produces different "lines" depending on your t value. I just used t as the slope.
I hope the idea is clear.
library(tidyverse)
ecdf.shift <- function(OUR_threshold, t) {
data.frame(x = OUR_threshold, y = t * OUR_threshold)
}
ratio_threshold <- seq(0, 115, by = 5)
t_df <-
map(1:5, function(t) ecdf.shift(ratio_threshold, t)) %>%
bind_rows(.id = "t")
ggplot(t_df, aes(x, y, color = t)) +
geom_line() +
geom_point()
Created on 2020-05-07 by the reprex package (v0.3.0)
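The same pattern can be sketched closer to the original task, with the caveat that the data are simulated: sim is a hypothetical stand-in for HRP_rESS_no, and ecdf_at is a simplified helper that evaluates the ECDF directly instead of interpolating as the original function does. The four targets become one data frame with a t column, and scale_colour_manual() assigns a colour to each line.

```r
library(dplyr)
library(purrr)
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for HRP_rESS_no, which is not available here
sim <- data.frame(ESS     = runif(500, 0, 120),
                  TSS_con = rnorm(500, mean = 400, sd = 80))

# ECDF of TSS_con evaluated at des_cap, after filtering on each ESS threshold
ecdf_at <- function(thresholds, des_cap, data) {
  map(thresholds, function(th) {
    kept <- data$TSS_con[data$ESS > th]
    data.frame(OUR_ratio_cutoff = th,
               prob = if (length(kept) > 0) ecdf(kept)(des_cap) * 100 else NA_real_,
               nobs = length(kept))
  }) %>%
    bind_rows()
}

targets <- c(300, 350, 450, 500)

# One data frame per target, stacked with the target stored in column t
curves <- setNames(targets, targets) %>%
  map(function(t) ecdf_at(seq(0, 115, by = 5), t, sim)) %>%
  bind_rows(.id = "t") %>%
  filter(nobs > 2)

p <- ggplot(curves, aes(x = OUR_ratio_cutoff, y = prob, colour = t)) +
  geom_line() +
  geom_point() +
  scale_colour_manual(values = c("300" = "steelblue", "350" = "darkorange",
                                 "450" = "forestgreen", "500" = "firebrick")) +
  theme_bw(base_size = 12)
p
```

Because t is stored as a character column, ggplot2 treats it as discrete, which is what allows one manually chosen colour per line.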

ggplot2 - Two color series in area chart

I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
I'm building an area chart covering approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending or outsourcing/supply spending) in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse) # provides tibble(), %>%, the dplyr verbs, forcats and ggplot2
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # trust me
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
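A simpler variant of the same idea, sketched on the question's toy data: order input_desc so that items of one input_type are adjacent in the stack, then build a named colour vector with a green ramp for labor and a blue ramp for supply. The hex endpoints and the lookup and fill_values names are arbitrary choices for illustration, not part of the answer above.

```r
library(ggplot2)

spec_trend_pie <- data.frame(
  year       = rep(c(2006, 2007, 2008), each = 4),
  input_type = rep(c("labor", "labor", "supply", "supply"), times = 3),
  input_desc = rep(c("plumber", "manager", "pipe", "truck"), times = 3),
  value      = c(1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# One row per description, sorted so each input_type forms a contiguous block
lookup <- unique(spec_trend_pie[, c("input_type", "input_desc")])
lookup <- lookup[order(lookup$input_type, lookup$input_desc), ]

# Factor levels control the stacking order and the legend order
spec_trend_pie$input_desc <- factor(spec_trend_pie$input_desc,
                                    levels = lookup$input_desc)

# Green shades for labor, blue shades for supply (arbitrary endpoints)
n_labor  <- sum(lookup$input_type == "labor")
n_supply <- sum(lookup$input_type == "supply")
fill_values <- c(colorRampPalette(c("#a5d6a7", "#1b5e20"))(n_labor),
                 colorRampPalette(c("#90caf9", "#0d47a1"))(n_supply))
names(fill_values) <- lookup$input_desc

p <- ggplot(spec_trend_pie, aes(x = year, y = value, fill = input_desc)) +
  geom_area() +
  scale_fill_manual(values = fill_values)
p
```

This keeps a single fill legend; the two families are distinguished only by hue, so no second legend is required.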

Add multiple ggplot2 geom_segment() based on mean() and sd() data

I have a data frame mydataAll with columns DESWC, journal, and highlight. To calculate the average and standard deviation of DESWC for each journal, I do
avg <- aggregate(DESWC ~ journal, data = mydataAll, mean)
stddev <- aggregate(DESWC ~ journal, data = mydataAll, sd)
Now I plot a horizontal stripchart with the values of DESWC along the x-axis and each journal along the y-axis. But for each journal, I want to indicate the standard deviation and average with a simple line. Here is my current code and the results.
stripchart2 <-
ggplot(data=mydataAll, aes(x=mydataAll$DESWC, y=mydataAll$journal, color=highlight)) +
geom_segment(aes(x=avg[1,2] - stddev[1,2],
y = avg[1,1],
xend=avg[1,2] + stddev[1,2],
yend = avg[1,1]), color="gray78") +
geom_segment(aes(x=avg[2,2] - stddev[2,2],
y = avg[2,1],
xend=avg[2,2] + stddev[2,2],
yend = avg[2,1]), color="gray78") +
geom_segment(aes(x=avg[3,2] - stddev[3,2],
y = avg[3,1],
xend=avg[3,2] + stddev[3,2],
yend = avg[3,1]), color="gray78") +
geom_point(size=3, aes(alpha=highlight)) +
scale_x_continuous(limit=x_axis_range) +
scale_y_discrete(limits=mydataAll$journal) +
scale_alpha_discrete(range = c(1.0, 0.5), guide='none')
show(stripchart2)
See the three horizontal geom_segments at the bottom of the image indicating the spread? I want to do that for all journals, but without handcrafting each one. I tried using the solution from this question, but when I put everything in a loop and remove the aes(), it gives me an error that says:
Error in x - from[1] : non-numeric argument to binary operator
Can anyone help me condense the geom_segment() statements?
I generated some dummy data to demonstrate. First, we use aggregate like you have done, then we combine those results to create a data.frame in which we create upper and lower columns. Then, we pass these to the geom_segment specifying our new dataset. Also, I specify x as the character variable and y as the numeric variable, and then use coord_flip():
library(ggplot2)
set.seed(123)
df <- data.frame(lets = sample(letters[1:8], 100, replace = TRUE),
vals = rnorm(100),
stringsAsFactors = FALSE)
means <- aggregate(vals~lets, data = df, FUN = mean)
sds <- aggregate(vals~lets, data = df, FUN = sd)
df2 <- data.frame(means, sds)
df2$upper = df2$vals + df2$vals.1
df2$lower = df2$vals - df2$vals.1
ggplot(df, aes(x = lets, y = vals))+geom_point()+
geom_segment(data = df2, aes(x = lets, xend = lets, y = lower, yend = upper))+
coord_flip()+theme_bw()
Here, the lets column would resemble your character variable.
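For reference, the same approach can be sketched with the question's column names. The data here are simulated (this mydataAll is a hypothetical stand-in), and merge() with explicit suffixes is used in place of data.frame(means, sds) so that the mean and sd columns get readable names. The segments are drawn horizontally without coord_flip().

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in for mydataAll
mydataAll <- data.frame(
  journal = sample(paste("Journal", LETTERS[1:5]), 100, replace = TRUE),
  DESWC   = rnorm(100, mean = 5000, sd = 1000),
  stringsAsFactors = FALSE
)

avg    <- aggregate(DESWC ~ journal, data = mydataAll, FUN = mean)
stddev <- aggregate(DESWC ~ journal, data = mydataAll, FUN = sd)

# One row per journal with mean, sd and the segment end points
spread <- merge(avg, stddev, by = "journal", suffixes = c("_mean", "_sd"))
spread$lower <- spread$DESWC_mean - spread$DESWC_sd
spread$upper <- spread$DESWC_mean + spread$DESWC_sd

p <- ggplot(mydataAll, aes(x = DESWC, y = journal)) +
  geom_segment(data = spread,
               aes(x = lower, xend = upper, y = journal, yend = journal),
               colour = "gray78") +
  geom_point(size = 3)
p
```

A single geom_segment() layer with its own data argument replaces the hand-written segment per journal.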

How to make error bars for multiple variables in bar chart

I was hoping someone could help me with the following problem:
I am attempting to make a combined barplot showing the mean and standard errors for 3 different continuous variables (body temp, length, mass) recorded for a binary variable (gender).
I have been able to plot the mean values for each variable but I can't seem to successfully calculate the standard error for these 3 variables using any of the codes I've tried.
I tried many things, but I think I was on the right track with this:
View(test4)
test4 <- aggregate(test4,
by = list(Sex = test4$Sex),
FUN = function(x) c(mean = mean(x), sd = sd(x),
n = length(x)))
test4
#this produced mean, sd, length for ALL variables (including sex)
test4<-do.call(test4)
test4$se<-test4$x.sd / sqrt(test4$x.n)
Then I kept getting the error:
Error in sqrt(test4$x.n) : non-numeric argument to mathematical function
I tried to recode to target my 3 variables after aggregate(test4...) but I couldn't get it to work. Then I subsetted my resulting dataframe to exclude sex, but that didn't work. I then tried to define it as a matrix or vector, but that still didn't work.
I would like my final graph to have y axis = mean values and x axis = variable (3 sub-groups: Tb, Mass, Length), with two bars side by side showing male and female values for comparison.
Any help or direction anyone could provide would be greatly appreciated!!
Many thanks in advance! :)
aggregate does give some crazy output when you are trying to output more than one column.
If you wish to use aggregate I would do mean and SE as separate calls to aggregate.
However, here is a solution using tidyr and dplyr that I don't think is too bad.
I've created some data. I hope it looks like yours. It is so useful to include a simulated dataset with your question.
library(tidyr)
library(dplyr)
library(ggplot2)
# Create some data
test4 <- data.frame(Sex = rep(c('M', 'F'), 50),
bodytemp = rnorm(100),
length = rnorm(100),
mass = rnorm(100))
# Gather the data to 'long' format so the bodytemp, length and mass are all in one column
longdata <- gather(test4, variable, value, -Sex)
head(longdata)
# Create the summary statistics separately for sex and variable (i.e. bodytemp, length and mass)
# Standard error = sd / sqrt(n)
summary <- longdata %>%
group_by(Sex, variable) %>%
summarise(mean = mean(value), se = sd(value) / sqrt(length(value)))
# Plot
ggplot(summary, aes(x = variable, y = mean, fill = Sex)) +
geom_bar(stat = 'identity', position = 'dodge') +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
width = 0.2,
position = position_dodge(0.9))
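As a small worked check of the summary step, the sketch below computes the standard error as sd / sqrt(n) with a helper function and uses tidyr::pivot_longer(), the newer counterpart of gather(). The helper name std_error and the simulated data are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(42)
# Simulated data in the same shape as the example above
test4 <- data.frame(Sex      = rep(c("M", "F"), 50),
                    bodytemp = rnorm(100),
                    length   = rnorm(100),
                    mass     = rnorm(100))

# Standard error of the mean: sample sd divided by the square root of n
std_error <- function(x) sd(x) / sqrt(length(x))

stats <- test4 %>%
  pivot_longer(-Sex, names_to = "variable", values_to = "value") %>%
  group_by(Sex, variable) %>%
  summarise(mean = mean(value),
            se   = std_error(value),
            n    = n(),
            .groups = "drop")

stats
```

With 50 observations per group, each se is the group's sd divided by sqrt(50), so the error bars come out noticeably wider than if sd were divided by n itself.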
My final plot
Update: I was able to answer my question by combining the initial part of timcdlucas script along with another one I had used when plotting just one output. For anyone else who may be seeking an answer to a similar question, I have posted my script and the resulting graph (see link above):
View(test3) #this dataframe was organized as 'sex', 'tb', 'mass', 'svl'
newtest<-test3
View(newtest)
#transform data to 'long' combining all variables in one column
longdata<-gather(newtest, variable, value, -Sex)
View(longdata)
#set up table in correct format
longdata2 <- aggregate(longdata$value,
by = list(Sex = longdata$Sex, Variable = longdata$variable),
FUN = function(x) c(mean = mean(x), sd = sd(x),
n = length(x)))
longdata2 <- do.call(data.frame, longdata2)
longdata2$se<-longdata2$x.sd / sqrt(longdata2$x.n)
colnames(longdata2)<-c("Sex", "Variable", "mean", "sd", "n", "se")
longdata2$names<-c(paste(longdata2$Variable, "Variable /", longdata2$Sex, "Sex"))
View(longdata2)
dodge <- position_dodge(width = 0.9)
limits <- aes(ymax = mean + se,
ymin = mean - se)
#To order the bars in the way I desire *might not be necessary for future scripts*
positions<-c("Tb", "SVL", "Mass")
#To plot new table (longdata2 is the summary table built above):
bfinal <- ggplot(data = longdata2, aes(x = factor(Variable), y = mean,
fill = factor(Sex)))+
geom_bar(stat = "identity",
position = position_dodge(0.9))+
geom_errorbar(limits, position = position_dodge(0.9),
width = (0.25)) +
labs(x = "Variable", y = "Mean") +
ggtitle("")+
scale_fill_discrete(name = "",
labels=c("Male", "Female"))+
scale_x_discrete(breaks=c("Mass", "SVL", "Tb"),
labels=c("Mass", "SVL", "Tb"),
limits=(positions))
bfinal
:)

Resources