grouped by factor level in ggplot2() - r

I've got a data frame with four three-level categorical variables: before_weight, after_weight, before_pain, and after_pain.
I'd like to make a bar plot showing the proportion for each level of each variable. My current code achieves that.
The problem's the presentation of the data. I'd like the respective before and after bars to be grouped together, so that the bar representing the people that answered 1 in the before_weight variable is grouped next to the bar representing the people that answered 1 in the after_weight variable, and so forth for both the weight and pain variables.
I've been trying to use dplyr, mutate() with numerous ifelse() statements, to make a new variable pairing up the groups in question, but can't seem to get it to work.
Any help would be much appreciated.
starting point (df):
df <- data.frame(before_weight=c(1,2,3,2,1),before_pain=c(2,2,1,3,1),after_weight=c(1,3,3,2,3),after_pain=c(1,1,2,3,1))
current code:
library(tidyr)
dflong <- gather(df, varname, score, before_weight:after_pain, factor_key=TRUE)
dflong$score <- as.factor(dflong$score) # score only exists in the long data frame
library(ggplot2)
library(dplyr)
dflong %>%
group_by(varname) %>%
count(score) %>%
mutate(prop = 100*(n / sum(n))) %>%
ggplot(aes(x = varname, y = prop, fill = factor(score))) + scale_fill_brewer() + geom_col(position = 'dodge', colour = 'black')
UPDATE:
I'd like proportions rather than counts, so I've attempted to tweak Nate's code. Since I'm using the question variable to group the data to get the proportions, I can't seem to use gsub() to change the content of that variable. Instead I added question2 and passed it into facet_wrap(). It seems to work:
df %>% gather("question", "val") %>%
count(question, val) %>%
group_by(question) %>%
mutate(percent = 100*(n / sum(n))) %>%
mutate(time= factor(ifelse(grepl("before", question), "before", "after"), c("before", "after"))) %>%
mutate(question2= ifelse(grepl("weight", question), "weight", "pain")) %>%
ggplot(aes(x=val, y=percent, fill = time)) + geom_col(position = "dodge") + facet_wrap(~question2)

Does this code make the visual comparisons you are after? One ifelse and a gsub will help make variables we can use for faceting and filling in ggplot.
df %>% gather("question", "val") %>% # go long
mutate(time = factor(ifelse(grepl("before", question), "before", "after"),
c("before", "after")), # use factor with levels to control order
question = gsub(".*_", "", question)) %>% # clean for facets
ggplot(aes(x = val, fill = time)) + # use fill not color for whole bar
geom_bar(position = "dodge") + # stacking is the default option
facet_wrap(~question) # two panels
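For the proportions asked about in the update, here is a minimal sketch of the same idea using pivot_longer(), which has superseded gather() in tidyr. It assumes tidyr 1.0 or later; the names props and p are illustrative and not from the original posts. Splitting the column names on "_" gives the time and question variables directly, so no ifelse() or gsub() is needed, and the percentages are computed within each question and time point.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(before_weight = c(1, 2, 3, 2, 1), before_pain = c(2, 2, 1, 3, 1),
                 after_weight  = c(1, 3, 3, 2, 3), after_pain  = c(1, 1, 2, 3, 1))

props <- df %>%
  # "before_weight" becomes time = "before", question = "weight"
  pivot_longer(everything(),
               names_to = c("time", "question"),
               names_sep = "_",
               values_to = "val") %>%
  # fix the level order so "before" is drawn to the left of "after"
  mutate(time = factor(time, levels = c("before", "after"))) %>%
  count(question, time, val) %>%
  group_by(question, time) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

p <- ggplot(props, aes(x = factor(val), y = percent, fill = time)) +
  geom_col(position = "dodge") +
  facet_wrap(~question)
```

Printing p draws one panel per question, with the before and after bars side by side for each answer level.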

Related

Color/fill bars in geom_col based on another variable?

I have an uncolored geom_col and would like it to display information about another (continuous) variable by displaying different shades of color in the bars.
Example
Starting with a geom_col
library(dplyr)
library(ggplot2)
set.seed(124)
iris[sample(1:150, 50), ] %>%
group_by(Species) %>%
summarise(n=n()) %>%
ggplot(aes(Species, n)) +
geom_col()
Suppose we want to color the bars according to how low/high mean(Sepal.Width) is in each group
(note: I don't know if there's a way to provide 'continuous' colors to a ggplot, but, if not, the following colors would be fine to use)
library(RColorBrewer)
display.brewer.pal(n = 3, name= "PuBu")
brewer.pal(n = 3, name = "PuBu")
[1] "#ECE7F2" "#A6BDDB" "#2B8CBE"
The end result should be the same geom_col as above but with the bars colored according to how low/high mean(Sepal.Width) is.
Notes
This answer shows something similar but is highly manual: it is okay for 3 bars, but not sustainable for many plots with a high number of bars (since it would require too many case_when conditions to be set manually)
This is similar but the coloring is based on a variable already displayed in the plot, rather than another variable
Note also that in the example above there are 3 bars and I provide 3 colors. This is somewhat manual, and if there's a better (i.e. less manual) way to designate colors I would be glad to learn it
What I've tried
I thought this would work, but it seems to ignore the colors I provide
library(RColorBrewer)
# fill info from: https://stackoverflow.com/questions/38788357/change-bar-plot-colour-in-geom-bar-with-ggplot2-in-r
set.seed(124)
iris[sample(1:150, 50), ] %>%
group_by(Species) %>%
summarise(n=n(), sep_mean = mean(Sepal.Width)) %>%
arrange(desc(n)) %>%
mutate(colors = brewer.pal(n = 3, name = "PuBu")) %>%
mutate(Species=factor(Species, levels=Species)) %>%
ggplot(aes(Species, n, fill = colors)) +
geom_col()
Do the following
add fill = sep_mean to aes()
add + scale_fill_gradient()
remove mutate(colors = brewer.pal(n = 3, name = "PuBu")) since the previous step takes care of colors for you
set.seed(124)
iris[sample(1:150, 50), ] %>%
group_by(Species) %>%
summarise(n=n(), sep_mean = mean(Sepal.Width)) %>%
arrange(desc(n)) %>%
mutate(Species=factor(Species, levels=Species)) %>%
ggplot(aes(Species, n, fill = sep_mean, label=sprintf("%.2f", sep_mean))) +
geom_col() +
scale_fill_gradient() +
labs(fill="Sepal Width\n(mean cm)") +
geom_text()
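If the continuous fill should follow the "PuBu" Brewer palette mentioned in the question, one option is scale_fill_distiller(), which interpolates a Brewer palette into a continuous gradient. This is a sketch rather than part of the original answer; the names summary_df and p are illustrative.

```r
library(dplyr)
library(ggplot2)

set.seed(124)
summary_df <- iris[sample(1:150, 50), ] %>%
  group_by(Species) %>%
  summarise(n = n(), sep_mean = mean(Sepal.Width))

p <- ggplot(summary_df, aes(Species, n, fill = sep_mean)) +
  geom_col() +
  # direction = 1 maps low values to the light end of the palette
  scale_fill_distiller(palette = "PuBu", direction = 1) +
  labs(fill = "Sepal Width\n(mean cm)")
```

Because the fill is mapped to a numeric column, the same code works unchanged for any number of bars.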

bicolor heatmap with factor levels

I have this dataframe:
library(dplyr) # for %>% and arrange()
library(ggplot2)
set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace=TRUE), levels=1:100),
year = factor(sample(1950:2019, 10000, replace=TRUE), levels=1950:2019)) %>% unique() %>% arrange(id, year)
I'm looking to plot a heatmap with the ids on the X-axis and the years on the Y-axis, where a tile is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to change the fill argument to get the two colors:
ggplot(df, aes(id, year, fill= year)) +
geom_tile()
The reason for plotting both variables as factors is to show every level even when some year doesn't have any id (in which case its whole row should be red).
EDIT:
Two things I forgot to add (hope it's not too late):
How to add alpha transparency to geom_tile() without messing it up?
I need to sort the ids from maximum missings to minimum missings.
The complete() function from the tidyr package is useful for filling in missing combinations. First, you need to set a flag variable to indicate that the data is present, and then expand the data frame with the missing combinations and fill the new flag variable with FALSE:
df <- df %>%
mutate(flag = TRUE) %>%
complete(id, year, fill = list(flag = FALSE))
ggplot(df, aes(id, year, fill = flag)) +
geom_tile()
EDIT1: To add transparency, add alpha = 0.x within geom_tile(), where x is a value indicating the transparency. The lower the value, the more transparent.
EDIT2: To sort by missingness add the following code prior to the ggplot code:
# Determine the order of the IDs
df_order <- df %>%
group_by(id) %>%
summarize(sum = sum(flag)) %>% # number of years present per id
arrange(sum) %>% # fewest present years (most missing) first
mutate(order = row_number()) %>%
select(id, order)
# Set the IDs in order on the chart
library(forcats) # for fct_reorder()
df <- df %>%
left_join(df_order, by = "id") %>%
mutate(id = fct_reorder(id, order))
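To get the blue/red colors the question asks for, the flag can be mapped to explicit fill values with scale_fill_manual(). The following is a sketch under a few assumptions: the name full and the alpha value 0.6 are illustrative choices, and distinct() stands in for unique().

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace = TRUE), levels = 1:100),
                 year = factor(sample(1950:2019, 10000, replace = TRUE), levels = 1950:2019)) %>%
  distinct()

# complete() expands over every factor level, including levels that never occur
full <- df %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

p <- ggplot(full, aes(id, year, fill = flag)) +
  geom_tile(alpha = 0.6) +  # lower alpha means more transparent tiles
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"))
```

Since id and year are factors with all levels declared, the completed data has one row per id and year pair, so a year with no ids still gets a full red row.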
I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) which denotes data is present for that id and year. Then use complete to fill the missing years for each id and plot it.
library(tidyverse)
df %>%
mutate_all(~as.integer(as.character(.))) %>%
mutate(data_exist = 1) %>%
complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
mutate(data_exist = factor(data_exist)) %>%
ggplot() + aes(id, year, fill= data_exist) + geom_tile()
With expand.grid you can create a data frame with all combinations of ids and years, then left join df onto these combinations to see which of them are present in df
all <- expand.grid(id=levels(df$id),year=levels(df$year)) %>%
left_join(mutate(df, present='1'), by=c('id','year')) %>% # flag the rows that exist in df
mutate(present=ifelse(is.na(present),'0','1'))
ggplot(all, aes(as.numeric(id), as.numeric(year), fill= present)) +
geom_tile() +
scale_fill_manual(values=c('0'='red','1'='blue')) + # change default colors
theme(legend.position="None") # hide legend

using facets on every column with color grouping

I've seen a lot of people use facets to visualize data. I want to be able to run this on every column in my dataset and then have it grouped by some categorical value within each individual plot.
I've seen others use gather() to plot histogram or densities. I can do that ok, but I guess I fundamentally misunderstand how to use this technique.
I want to be able to do just what I have below - but when I have it grouped by a category. For example, histogram of every column but stacked by the value color. Or dual density plots of every column with these two lines of different colors.
I'd like something like the first plot below, but with one facet for every single column instead of one facet per clarity level...
library(tidyverse)
# what I want but clarity should be replaced with every column except FILL
ggplot(diamonds, aes(x = price, fill = color)) +
geom_histogram(position = 'stack') +
facet_wrap(clarity~.)
# it would look exactly like this, except it would have the fill value by a group.
gathered_data = gather(diamonds %>% select_if(is.numeric))
ggplot(gathered_data , aes(value)) +
geom_histogram() +
theme_classic() +
facet_wrap(~key, scales='free')
tidyr::gather needs four pieces:
1) data (in this case diamonds, passed through the pipe into the first parameter of gather below)
2) key
3) value
4) names of the columns that will be converted to key / value pairs.
gathered_data <- diamonds %>%
gather(key, value,
select_if(diamonds, is.numeric) %>% names())
It's not entirely clear what you are looking for. A picture of your expected output would have been much more illuminating than a description (not all of us are native English speakers...), but perhaps something like this?
diamonds %>%
rename(group = color) %>% # change this line to use another categorical
# column as the grouping variable
group_by(group) %>% # select grouping variable + all numeric variables
select_if(is.numeric) %>%
ungroup() %>%
tidyr::gather(key, value, -group) %>% # gather all numeric variables
ggplot(aes(x = value, fill = group)) +
geom_histogram(position = "stack") +
theme_classic() +
facet_wrap(~ key, scales = 'free')
# alternate example using geom density
diamonds %>%
rename(group = cut) %>%
group_by(group) %>%
select_if(is.numeric) %>%
ungroup() %>%
tidyr::gather(key, value, -group) %>%
ggplot(aes(x = value, color = group)) +
geom_density() +
theme_classic() +
facet_wrap(~ key, scales = 'free')
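The same reshaping can also be sketched with pivot_longer(), which has superseded gather() in tidyr. This assumes dplyr 1.0 or later for where(); the names long and p and the choice of 30 bins are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

long <- diamonds %>%
  # keep the grouping column plus every numeric column
  select(color, where(is.numeric)) %>%
  # one row per (original row, numeric column) pair
  pivot_longer(-color, names_to = "key", values_to = "value")

p <- ggplot(long, aes(x = value, fill = color)) +
  geom_histogram(bins = 30, position = "stack") +
  theme_classic() +
  facet_wrap(~key, scales = "free")
```

Swapping color for cut or clarity in both places changes the grouping variable; replacing geom_histogram() with geom_density() and fill with color gives the overlaid density version.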

Set ggplot title to reflect dplyr grouping

I've got a grouped dataframe generated in dplyr where each group reflects a unique combination of factor variable levels. I'd like to plot the different groups using code similar to this post. However, I can't figure out how to include two (or more) variables in the title of my plots, which is a hassle since I've got a bunch of different combinations.
Fake data and plotting code:
library(dplyr)
library(ggplot2)
spiris<-iris
spiris$site<-as.factor(rep(c("A","B","C")))
spiris$year<-as.factor(rep(2012:2016))
spiris$treatment<-as.factor(rep(1:2))
g<-spiris %>%
group_by(site, Species) %>%
do(plots=ggplot(data=.) +
aes(x=Petal.Width)+geom_histogram()+
facet_grid(treatment~year))
##Need code for title here
g[[3]] ##view plots
I need the title of each plot to reflect both "site" and "Species". Any ideas?
Use split() %>% purrr::map2() instead of group_by() %>% do() like this:
spiris %>%
split(list(.$site, .$Species)) %>%
purrr::map2(.y = names(.),
~ ggplot(data=., aes(x=Petal.Width)) +
geom_histogram()+
facet_grid(treatment~year) +
labs(title = .y) )
You just need to set the title with ggtitle():
g <- spiris %>% group_by(site, Species) %>%
do(plots = ggplot(data = .) + aes(x = Petal.Width) + geom_histogram() +
facet_grid(treatment ~ year) +
ggtitle(paste(.$Species, .$site, sep = " - ")))
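A variant of the split() approach, sketched here as one possibility: base split() accepts a sep argument when given a list of factors, so the list names can double as readable titles, and purrr::imap() passes each name alongside its data. The name plots and the " - " separator are illustrative choices.

```r
library(ggplot2)
library(purrr)

spiris <- iris
spiris$site <- factor(rep(c("A", "B", "C"), length.out = nrow(spiris)))
spiris$year <- factor(rep(2012:2016, length.out = nrow(spiris)))
spiris$treatment <- factor(rep(1:2, length.out = nrow(spiris)))

plots <- spiris %>%
  # names become "<site> - <Species>", e.g. "A - setosa"
  split(list(.$site, .$Species), sep = " - ") %>%
  # in imap(), .x is the data for one group and .y is its name
  imap(~ ggplot(.x, aes(x = Petal.Width)) +
         geom_histogram(bins = 30) +
         facet_grid(treatment ~ year) +
         labs(title = .y))
```

Individual plots can then be pulled out by name, for example plots[["A - setosa"]].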

How to reorder factor levels in a tidy way?

Hi, I usually use code like the following to reorder bars in ggplot or other types of plots.
Normal plot (unordered)
library(tidyverse)
iris.tr <-iris %>% group_by(Species) %>% mutate(mSW = mean(Sepal.Width)) %>%
select(mSW,Species) %>%
distinct()
ggplot(iris.tr,aes(x = Species,y = mSW, color = Species)) +
geom_point(stat = "identity")
Ordering the factor + ordered plot
iris.tr$Species <- factor(iris.tr$Species,
levels = iris.tr[order(iris.tr$mSW),]$Species,
ordered = TRUE)
ggplot(iris.tr,aes(x = Species,y = mSW, color = Species)) +
geom_point(stat = "identity")
The factor line is extremely unpleasant to me and I wonder why arrange() or some other function can't simplify this. Am I missing something?
Note:
This does not work, but I would like to know if something like this exists in the tidyverse.
iris.tr <-iris %>% group_by(Species) %>% mutate(mSW = mean(Sepal.Width)) %>%
select(mSW,Species) %>%
distinct() %>%
arrange(mSW)
ggplot(iris.tr,aes(x = Species,y = mSW, color = Species)) +
geom_point(stat = "identity")
Using ‹forcats›:
iris.tr %>%
ungroup() %>% # iris.tr is still grouped by Species, and a grouping column cannot be modified
mutate(Species = fct_reorder(Species, mSW)) %>%
ggplot() +
aes(Species, mSW, color = Species) +
geom_point()
Reordering the factor using base:
iris.ba = iris
iris.ba$Species = with(iris.ba, reorder(Species, Sepal.Width, mean))
Translating to dplyr:
iris.tr = iris %>% mutate(Species = reorder(Species, Sepal.Width, mean))
After that, you can continue on to summarize and plot as in your question.
A couple of comments: reordering a factor is modifying a data column. The dplyr command to modify a data column is mutate. All arrange does is re-order rows; this has no effect on the levels of the factor and hence no effect on the order of a legend or axis in ggplot.
All factors have an order for their levels. The difference between an ordered = TRUE factor and a regular factor is how the contrasts are set up in a model. ordered = TRUE should only be used if your factor levels have a meaningful rank order, like "Low", "Medium", "High", and even then it only matters if you are building a model and don't want the default contrasts comparing everything to a reference level.
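The point about arrange() versus mutate() can be checked directly by inspecting the factor levels. A small sketch follows; the names means, arranged, and reordered are illustrative.

```r
library(dplyr)
library(forcats)

means <- iris %>%
  group_by(Species) %>%
  summarise(mSW = mean(Sepal.Width))

# arrange() only changes the row order; the levels stay alphabetical
arranged <- means %>% arrange(mSW)
levels(arranged$Species)
# → "setosa" "versicolor" "virginica"

# fct_reorder() inside mutate() changes the levels themselves
reordered <- means %>% mutate(Species = fct_reorder(Species, mSW))
levels(reordered$Species)
# → "versicolor" "virginica" "setosa"
```

ggplot positions discrete values by level order, which is why only the second version changes the axis.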
If you happen to have a character vector to order, for example:
iris2 <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
mutate(mean_sepal_width = mean(Sepal.Width)) %>%
ungroup()
You can also order the factor levels using the behavior of the forcats::as_factor function:
"Compared to base R, this function creates levels in the order in which they appear"
library(forcats)
iris2 %>%
# Change the order
arrange(mean_sepal_width) %>%
# Create factor levels in the order in which they appear
mutate(Species = as_factor(Species)) %>%
ggplot() +
aes(Species, Sepal.Width, color = Species) +
geom_point()
Notice how the species names on the x axis are not ordered alphabetically but by increasing value of their mean_sepal_width. Remove the line containing as_factor to see the difference.
In case you'd like to order levels manually: You can do so also with forcats using https://forcats.tidyverse.org/reference/fct_relevel.html
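As a brief sketch of manual ordering, fct_relevel() moves the named levels to the front in the order given and leaves the remaining levels in their existing order. The vector f is a made-up example.

```r
library(forcats)

f <- factor(c("setosa", "versicolor", "virginica"))

# the listed levels come first; unlisted levels keep their relative order
relevelled <- fct_relevel(f, "virginica", "setosa")
levels(relevelled)
# → "virginica" "setosa" "versicolor"
```

The same call can be used inside mutate() on a data frame column before plotting.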
