I am trying to make a certain alluvial plot with different widths specified in different columns. Let me try to explain it by drawing it, as I am not sure how to do this in ggalluvial.
Notice that the width of the flow from the Male box represents 3 units, while it represents 10 in box 3. Is it possible to create such graphs in ggalluvial? Or how can one construct such a graph in R?
I haven't drawn the other flows just to focus on the flow from male to 3.
Here is some data that could be used to create such a graph:
library(data.table)
test_data <- data.table(`2018 - Gender` = c("Male", "Female", "Female", "Male"),
`2018 - Value` = c(10, 20, 30, 20),
`2019 - Gender` = c("Male", "Female", "Male", "Female"),
`2019 - Value` = c(20, 30, 10, 10)
)
Notice that the column names determine the "columns" in the graph (i.e. the x-axis), while the Gender variable determines the blocks. The value from 2018 is the starting width of a flow, and the value from 2019 is its ending width.
Some have pointed out that I need to focus my question more, so here it is: how can I make flow graphs in which a flow has a different starting and ending width?
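To make the request concrete, here is a minimal base-R sketch of the geometry being asked for: a single flow whose width changes between two axes. `flow_ribbon()` is a hypothetical helper written for this illustration and is not part of ggalluvial; the smoothstep curve and the baseline positions are assumptions.

```r
# Hypothetical helper (not a ggalluvial function): outline of one flow whose
# width goes from w_start at x_start to w_end at x_end.
flow_ribbon <- function(x_start, x_end, y_start, y_end, w_start, w_end, n = 50) {
  u <- seq(0, 1, length.out = n)
  s <- 3 * u^2 - 2 * u^3                 # smoothstep: 0 at u = 0, 1 at u = 1
  ymin  <- y_start + (y_end - y_start) * s
  width <- w_start + (w_end - w_start) * s
  data.frame(x = x_start + (x_end - x_start) * u,
             ymin = ymin,
             ymax = ymin + width)
}

# First row of test_data: Male in 2018 (value 10) to Male in 2019 (value 20).
# The baselines (y_start, y_end) are set to 0 purely for illustration.
ribbon <- flow_ribbon(x_start = 1, x_end = 2, y_start = 0, y_end = 0,
                      w_start = 10, w_end = 20)
head(ribbon, 3)
```

The resulting data frame has the columns that ggplot2's `geom_ribbon()` expects (`x`, `ymin`, `ymax`), so one such ribbon per row of the data could be stacked by hand.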
Perhaps the following dummy example gives you a better idea. Please check if your data is in alluvial form with is_alluvia_form(), before you plot it.
library(dplyr)
library(ggplot2)
library(ggalluvial)
c <- c(LETTERS[1:4], LETTERS[2:6], LETTERS[3:7], LETTERS[3:8])
t <- c(rep("Fortnight 1",4), rep("Fortnight 2",5), rep("Fortnight 3",5), rep("Fortnight 4",6))
s <- c(rep(c("Female","Male"),10))
ag <- c(2,3,4,6,11,13)
f <- rnorm(20,20,99)
df <- data.frame(Timeframe=t,Code=c,Sex=s,Freq=round(abs(f))) %>% mutate(Organization=ifelse((row_number() %in% ag), "Agencia2","Agencia1" ))
alluvial_data <- as.data.frame(df %>%select(Organization, Timeframe, Code, Freq, Sex))
alluvial_data <- alluvial_data %>% mutate(id = row_number())
#Remove duplicates
alluvial_data <- alluvial_data %>%
distinct(Organization, Timeframe, Code, Sex, .keep_all = TRUE)
#levels(alluvial_data$Timeframe)
# Convert Timeframe to Factor - Categorical Variable
alluvial_data$Timeframe <-as.factor(alluvial_data$Timeframe)
# Convert Code to String
alluvial_data$Code <-as.character(alluvial_data$Code)
library(RColorBrewer)
# Define the number of colors you want
nb.cols <- 10
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols)
# brewer.pal() needs n >= 3; asking for 2 only works with a warning
mycolor2 <- colorRampPalette(brewer.pal(3, "Set2"))(nb.cols)
# Chart
ggplot(alluvial_data,
aes(y = Freq, axis1 = Organization, axis2 = Timeframe, axis3 = Code,fill=Sex)) +
#scale_fill_brewer(type = "qual", palette = "Set2") +
scale_x_discrete(limits=c("Organization","Timeframe","Code"), expand=c(0.05,0.05)) +
scale_fill_manual(values = mycolors) +
geom_flow(stat = "alluvium", lode.guidance = "frontback" #, color="grey"
) +
geom_stratum(width = 1/4, fill = "cyan", color = "grey") +
geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
theme(legend.position = "bottom") +
ggtitle("Organizations") +
guides(fill=guide_legend(override.aes = list(color=mycolors[1:2])))+
labs(fill=NULL)
I have four variables as columns in my data set:
whether the person had free school meals when they were younger
whether the person's parents attended university
whether the person studied A-level drama at school
whether their school offered A-level drama
Each value in the column is either "yes", "no" or "not applicable".
I want to put four sets of bar charts on one single plot (which I can then save as a .png), with each of the bar charts having a yes bar and a no bar.
I have used the below to create a frequency table for each of the variables. Here I've used the example of whether the person received free school meals (FSM) when they were younger:
FSM_df <- champions %>% count(FSM, sort = TRUE) %>% mutate(pct = prop.table(n))
percentage = label_percent()(FSM_df$pct)
FSM_df$percentage = percentage
I can use the code below to create a single bar chart, but I'm not sure how to do this for multiple plots:
ggplot(FSM_df, aes(x = FSM, y = n)) + geom_bar(stat = "identity", fill = "#fe8080", show.legend = FALSE) + coord_flip() + labs(x = "FSM", y = "Number of Champions") + geom_text(aes(label = percentage), color = "#662483")
Generating Random Data
library(tidyverse)
lunch <- sample(0:1, 100, replace = TRUE, prob = c(0.7, 0.3))
parents <- sample(0:1, 100, replace = TRUE, prob = c(0.5, 0.5))
drama_major <- sample(0:1, 100, replace = TRUE, prob = c(0.9, 0.1))
# probabilities should sum to 1
drama_offered <- sample(0:1, 100, replace = TRUE, prob = c(0.8, 0.2))
Creating the Tibble
df <- tibble(lunch = lunch,
parents = parents,
drama_major = drama_major,
drama_offered = drama_offered)
pivot_longer
df %>%
pivot_longer(cols = 1:4,
names_to = "measure",
values_to = "measure_is_true_1") %>%
mutate(is_true = if_else(measure_is_true_1 == 0, "no", "yes")) %>%
ggplot(aes(x = measure)) +
geom_bar(aes(fill = is_true), position = "dodge", alpha = 0.7) +
coord_flip() +
theme_bw()
In the example above, the data is first converted to long format, and the yes/no grouping is then set through the fill aesthetic. The ggplot logic is: put the measures along the x axis and, for each measure, count how often the response is 0 or 1 (whether or not the person was on free lunch, took drama, and so on). This puts all four sets of bars on the same plot.
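For reference, the counts that the dodged bars display can be checked without plotting. This is a base-R sketch on made-up data (two of the four measures, with an arbitrary seed); `stack()` plays the role of `pivot_longer()` here.

```r
# Made-up 0/1 responses for two measures
set.seed(42)
wide <- data.frame(
  lunch   = sample(0:1, 100, replace = TRUE, prob = c(0.7, 0.3)),
  parents = sample(0:1, 100, replace = TRUE, prob = c(0.5, 0.5))
)

# Long format: one row per person and measure (columns `values` and `ind`)
long <- stack(wide)
long$is_true <- ifelse(long$values == 0, "no", "yes")

# One row per measure, one column per answer: these are the bar heights
counts <- table(measure = long$ind, is_true = long$is_true)
pct <- round(100 * prop.table(counts, margin = 1), 1)
counts
pct
```

The `pct` table holds the within-measure percentages, which could be used as bar labels in the same way as the `percentage` column in the question.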
Simple Bar Chart Plot
Reproduced from this code:
library(haven)
library(survey)
library(dplyr)
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
nhanesAnalysis <- nhanesDemo %>%
mutate(LowIncome = case_when(
INDFMIN2 < 40 ~ TRUE,
TRUE ~ FALSE
)) %>%
# Select the necessary columns
select(INDFMIN2, LowIncome, persWeight, psu, strata)
# Set up the design
nhanesDesign <- svydesign(id = ~psu,
strata = ~strata,
weights = ~persWeight,
nest = TRUE,
data = nhanesAnalysis)
svyhist(~log10(INDFMIN2), design=nhanesDesign, main = '')
How do I color the histogram by independent variable, say, LowIncome? I want to have two separate histograms, one for each value of LowIncome. Unfortunately I picked a bad example, but I want them to be see-through in case their values overlap.
If you want to plot a histogram from your model, you can get its data from model.frame (this is what svyhist does under the hood). To get the histogram filled by group, you could use this data frame inside ggplot:
library(ggplot2)
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(alpha = 0.5, color = "gray60", breaks = 0:20 / 10) +
theme_classic()
Edit
As Thomas Lumley points out, this does not incorporate sampling weights, so if you wanted this you could do:
ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
geom_histogram(aes(weight = persWeight), alpha = 0.5,
color = "gray60", breaks = 0:20 / 10) +
theme_classic()
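As a small illustration of what the weight aesthetic changes, the following base-R sketch compares plain bin counts with weighted bin totals. The values, weights and breaks are made up for the example and are not taken from NHANES.

```r
x <- c(0.2, 0.4, 0.7, 1.1, 1.3, 1.8)   # made-up observations
w <- c(100, 250, 50, 400, 150, 50)     # made-up sampling weights
breaks <- c(0, 0.5, 1, 1.5, 2)

bin <- cut(x, breaks = breaks, include.lowest = TRUE)

# Unweighted histogram: number of observations per bin
unweighted <- as.vector(table(bin))        # 2 1 2 1

# Weighted histogram: sum of the weights per bin
weighted <- as.vector(tapply(w, bin, sum)) # 350 50 550 50

data.frame(bin = levels(bin), unweighted, weighted)
```

With weights, each bar height estimates how many people in the population fall in the bin rather than how many respondents do.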
To demonstrate this approach works, we can replicate Thomas's approach in ggplot using the data example from svyhist. To get the uneven bin sizes (if this is desired), we need two histogram layers, though I'm guessing this would not be required for most use-cases.
ggplot(model.frame(dstrat), aes(enroll)) +
geom_histogram(aes(fill = "E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype == "E"),
breaks = 0:35 * 100,
position = "identity", col = "gray50") +
geom_histogram(aes(fill = "Not E", weight = pw, y = after_stat(density)),
data = subset(model.frame(dstrat), stype != "E"),
position = "identity", col = "gray50",
breaks = 0:7 * 500) +
scale_fill_manual(NULL, values = c("#00880020", "#88000020")) +
theme_classic()
You can't just extract the data and use ggplot, because that won't use the weights and so misses the whole point of svyhist. You can use the add=TRUE argument, though. You do need to set the x and y axis ranges correctly to make sure the whole plot is visible.
Using the data example from ?svyhist
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
svyhist(~enroll, subset(dstrat,stype=="E"), col="#00880020",ylim=c(0,0.003),xlim=c(0,3500))
svyhist(~enroll, subset(dstrat,stype!="E"), col="#88000020",add=TRUE)
I am plotting max_temperature (mean_tmax) against rainfall (mean_rain) in a mirrored barplot: max temp displayed upwards, rain values downwards on the negative scale. These two are stored in the "name" variable.
To highlight the highest values out of the 32 years plotted, I created two vectors colVecTmax, colVecRain. They return a color vector of length 32 each, with the index of max values marked differently.
But when adding these two vectors to fill within geom_bar(), it turns out that ggplot stops counting the top after 16 bars, and moves down to the negative scale to continue. So it does not count by the name (mean_tmax, or mean_rain) variable.
This messes up the plot, and I am not sure how to get ggplot count through on the top bars for max_temperature first, coloring by colVecTmax, and then move down to do the same for rain on the negative scale with colVecRain.
Can anyone give a hint on how to solve this?
colVecTmax <- rep("orange",32)
colVecTmax[which.max(as.numeric(unlist(df.long[df.long$place=="sheffield" & df.long$name == "mean_tmax",4])))] <- "blue"
colVecRain <- rep("grey",32)
colVecRain[which.max(as.numeric(unlist(df.long[df.long$place=="sheffield" & df.long$name == "mean_rain",4])))] <- "blue"
ggplot(df.long[df.long$name %in% c('mean_rain', 'mean_tmax'), ] %>% filter(place== "sheffield")%>%
group_by(name) %>% mutate(value = case_when(
name == 'mean_rain' ~ value/10 * -1,
TRUE ~ value)) %>% mutate(place==str_to_sentence(placenames)) %>%
mutate(name = recode(name,'mean_rain' = "rainfall" , "mean_tmax" = "max temp"))
, aes(x = yyyy, y = value, fill=name))+
geom_bar(stat="identity", position="identity", fill=c(colVecTmax,colVecRain))+
labs(x="Year", y=expression("Rain in cm, temperature in ("*~degree *C*")"))+
geom_smooth(colour="black", lwd=0.5,se=F)+
scale_y_continuous(breaks = seq(-30, 30 , 5))+
scale_x_continuous(breaks = seq(1990, 2025, 5))+
guides(fill= guide_legend(title=NULL))+
scale_fill_discrete(labels=c("Max temperature", "Rainfall"))+
guides(fill=guide_legend(reverse=T), res=96)
With ggplot2 there are easier and less error-prone ways to assign colors. Instead of building color vectors and passing them to the color or fill argument, map the variable to an aesthetic (which you have basically already done) and assign the desired colors with a manual scale, e.g. scale_fill_manual. The same approach works when you want to highlight some values: create additional categories for them. In the code below I append "_max" to the name for the observations with the maximum temperature or rainfall and assign "blue" to these categories. Because this adds extra categories, I use the breaks argument of scale_fill_manual so that the max categories do not show up in the legend.
Using some fake random example data:
# Create example data
set.seed(123)
df.long <- data.frame(
name = rep(c("mean_rain", "mean_tmax"), each = 30),
place = "sheffield",
yyyy = rep(1991:2020, 2),
value = c(runif(30, 40, 100), runif(30, 12, 16))
)
library(ggplot2)
library(dplyr)
df_plot <- df.long %>%
filter(name %in% c("mean_rain", "mean_tmax")) |>
filter(place == "sheffield") %>%
mutate(value = case_when(
name == "mean_rain" ~ -value / 10,
TRUE ~ value
)) |>
# Maximum values
group_by(name) |>
mutate(name = ifelse(abs(value) >= max(abs(value)), paste(name, "max", sep = "_"), name))
ggplot(df_plot, aes(x = yyyy, y = value, fill = name)) +
geom_col(position = "identity") +
geom_smooth(colour = "black", lwd = 0.5, se = FALSE) +
scale_y_continuous(breaks = seq(-30, 30, 5), labels = abs) +
scale_x_continuous(breaks = seq(1990, 2025, 5)) +
scale_fill_manual(
values = c(
mean_rain = "grey", mean_tmax = "orange",
mean_rain_max = "blue", mean_tmax_max = "blue"
),
labels = c(mean_tmax = "Max temperature", mean_rain = "Rainfall"),
breaks = c("mean_rain", "mean_tmax")
) +
labs(x = "Year", y = expression("Rain in cm, temperature in (" * ~ degree * C * ")"), fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))
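The key step above is the group-wise flagging of the maximum. A minimal base-R sketch of just that step, on a made-up toy data frame (the column name `fill_key` is an assumption for this illustration):

```r
# Toy data: rainfall already flipped to the negative scale
toy <- data.frame(
  name  = rep(c("mean_rain", "mean_tmax"), each = 4),
  yyyy  = rep(2001:2004, 2),
  value = c(-6.1, -9.8, -4.2, -7.5, 13.0, 15.4, 12.2, 14.1)
)

# ave() returns the per-group maximum aligned with the rows of `toy`
is_max <- abs(toy$value) == ave(abs(toy$value), toy$name, FUN = max)

# Rows holding the group maximum get their own fill category
toy$fill_key <- ifelse(is_max, paste0(toy$name, "_max"), toy$name)
toy
```

Because the highlight lives in the data rather than in a positional color vector, it stays attached to the right bar however ggplot orders the rows.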
I have a dataframe with 2 columns: "ID" and "Score".
ID contains the name of a simulation, and each simulation has 58 different scores, which are listed in the Score column.
There are 10 simulations.
I am doing a geom_density plot:
my_dataframe %>%
ggplot(aes(x=`Score`), xlim = c(0, 1)) +
geom_density(aes(color = ID)) +
theme_bw() +
labs(title = "Scores")
https://imgur.com/a/9DUTmWw
How can I tell ggplot that I want the curves of Simulation1 and Simulation2 to look different from the others? I want them to be red and drawn with a thicker line than all the other ones.
Thank you for your help,
Best,
Maxime
Something like this?
my_dataframe %>% mutate(group = ifelse(ID %in% c(1,2), 'special', 'NonSpecial')) %>%
ggplot(aes(x=`Score`, lty = group), xlim = c(0, 1)) +
geom_density(aes(color = ID)) +
theme_bw() +
labs(title = "Scores")
I used this data:
my_dataframe <- data.frame(ID = factor(sample(1:4, 100, T)), Score = sin(1:100))
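If the two curves should literally be red and thicker rather than just a different line type, one possible sketch is to build named colour and width vectors and hand them to manual scales. The data here is made up, the grey colour and the widths 1.5 and 0.5 are arbitrary choices, and the plotting part assumes ggplot2 3.4.0 or later, where the `linewidth` aesthetic and `scale_linewidth_manual()` are available.

```r
# Made-up data with simulation names as IDs
set.seed(1)
my_dataframe <- data.frame(
  ID    = factor(rep(paste0("Simulation", 1:4), each = 25)),
  Score = runif(100)
)

special <- c("Simulation1", "Simulation2")
ids <- levels(my_dataframe$ID)

# Named vectors: one colour and one line width per ID
line_cols   <- setNames(ifelse(ids %in% special, "red", "grey50"), ids)
line_widths <- setNames(ifelse(ids %in% special, 1.5, 0.5), ids)

# Plot only if a recent enough ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    utils::packageVersion("ggplot2") >= "3.4.0") {
  library(ggplot2)
  p <- ggplot(my_dataframe, aes(x = Score, color = ID, linewidth = ID)) +
    geom_density() +
    scale_color_manual(values = line_cols) +
    scale_linewidth_manual(values = line_widths) +
    theme_bw() +
    labs(title = "Scores")
  # print(p) draws the plot
}
```

Keeping the colours in a named vector means the mapping does not depend on the order in which the IDs appear in the data.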
I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
I'm building an area chart covering approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending or outsourcing/supply spending) in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse)
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # lightness fraction: decreases with factor rank, stays strictly between 0 and 1
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
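The same idea can be sketched on the question's own `spec_trend_pie` data with base R only: one hue per `input_type` and a lightness step per `input_desc` within it. The hue values (120 for green, 240 for blue), the chroma of 60 and the lightness range 35 to 80 are assumptions chosen for the illustration; the plot is only built if ggplot2 is installed.

```r
spec_trend_pie <- data.frame(
  year       = rep(2006:2008, each = 4),
  input_type = rep(c("labor", "labor", "supply", "supply"), 3),
  input_desc = rep(c("plumber", "manager", "pipe", "truck"), 3),
  value      = c(1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4)
)

# One row per description, ordered so each type's members sit together
lookup <- unique(spec_trend_pie[c("input_type", "input_desc")])
lookup <- lookup[order(lookup$input_type, lookup$input_desc), ]

# Hue per type, lightness per member within the type
hue    <- c(labor = 120, supply = 240)[as.character(lookup$input_type)]
idx    <- seq_len(nrow(lookup))
rank_i <- ave(idx, lookup$input_type, FUN = seq_along)
n_i    <- ave(idx, lookup$input_type, FUN = length)
light  <- 35 + 45 * (rank_i - 1) / pmax(n_i - 1, 1)

lookup$hex  <- hcl(h = hue, c = 60, l = light)
fill_values <- setNames(lookup$hex, lookup$input_desc)

# Factor levels follow the lookup order so each type stacks contiguously
spec_trend_pie$input_desc <- factor(spec_trend_pie$input_desc,
                                    levels = lookup$input_desc)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(spec_trend_pie, aes(x = year, y = value, fill = input_desc)) +
    geom_area(position = "stack") +
    scale_fill_manual(values = fill_values)
  # print(p) draws the plot
}
```

This keeps a single legend, with the green and blue families standing in for the two input types.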