How to reorder factor levels in a tidy way? - r

Hi, I usually use code like the following to reorder bars in ggplot
or other types of plots.
Normal plot (unordered)
library(tidyverse)
iris.tr <-iris %>% group_by(Species) %>% mutate(mSW = mean(Sepal.Width)) %>%
select(mSW,Species) %>%
distinct()
ggplot(iris.tr,aes(x = Species,y = mSW, color = Species)) +
geom_point(stat = "identity")
Ordering the factor + ordered plot
iris.tr$Species <- factor(iris.tr$Species,
levels = iris.tr[order(iris.tr$mSW),]$Species,
ordered = TRUE)
ggplot(iris.tr,aes(x = Species,y = mSW, color = Species)) +
geom_point(stat = "identity")
The factor line is extremely unpleasant to me, and I wonder why arrange() or some other function can't simplify this. Am I missing something?
Note:
This does not work, but I would like to know if something like this exists in the tidyverse.
iris.tr <-iris %>% group_by(Species) %>% mutate(mSW = mean(Sepal.Width)) %>%
select(mSW,Species) %>%
distinct() %>%
arrange(mSW)
ggplot(iris.tr,aes(x = Species,y = mSW, color = Species)) +
geom_point(stat = "identity")

Using forcats:
iris.tr %>%
ungroup() %>% # Species is a grouping column and cannot be modified while grouped
mutate(Species = fct_reorder(Species, mSW)) %>%
ggplot() +
aes(Species, mSW, color = Species) +
geom_point()

Reordering the factor using base:
iris.ba = iris
iris.ba$Species = with(iris.ba, reorder(Species, Sepal.Width, mean))
Translating to dplyr:
iris.tr = iris %>% mutate(Species = reorder(Species, Sepal.Width, mean))
After that, you can continue on to summarize and plot as in your question.
A couple of comments: reordering a factor modifies a data column, and the dplyr command to modify a data column is mutate. All arrange does is reorder rows; this has no effect on the levels of the factor and hence no effect on the order of a legend or axis in ggplot.
All factors have an order for their levels. The difference between an ordered = TRUE factor and a regular factor is how the contrasts are set up in a model. ordered = TRUE should only be used if your factor levels have a meaningful rank order, like "Low", "Medium", "High", and even then it only matters if you are building a model and don't want the default contrasts comparing everything to a reference level.
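The difference between row order and level order can be checked directly. A minimal base R sketch using only the built-in iris data (the object names sorted_rows and reordered are made up for illustration):

```r
# Sorting rows (which is all arrange() does) leaves the factor levels untouched.
sorted_rows <- iris[order(iris$Sepal.Width), ]
levels(sorted_rows$Species)

# reorder() rewrites the levels, here by mean Sepal.Width per species.
reordered <- reorder(iris$Species, iris$Sepal.Width, FUN = mean)
levels(reordered)

# The result is still a regular (unordered) factor.
is.ordered(reordered)
```

ggplot reads the axis and legend order from levels(), which is why only the second approach changes the plot.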

If you happen to have a character vector to order, for example:
iris2 <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
mutate(mean_sepal_width = mean(Sepal.Width)) %>%
ungroup()
You can also order the factor levels using the behavior of the forcats::as_factor function:
"Compared to base R, this function creates levels in the order in which they appear"
library(forcats)
iris2 %>%
# Change the order
arrange(mean_sepal_width) %>%
# Create factor levels in the order in which they appear
mutate(Species = as_factor(Species)) %>%
ggplot() +
aes(Species, Sepal.Width, color = Species) +
geom_point()
Notice how the species names on the x axis are not ordered alphabetically but by increasing value of their mean_sepal_width. Remove the line containing as_factor to see the difference.
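To see the level-creation behavior in isolation, here is a small sketch on a made-up character vector (x is purely illustrative, not from the question):

```r
library(forcats)

x <- c("medium", "low", "high", "low")

# Base factor() sorts the levels alphabetically.
levels(factor(x))

# as_factor() keeps the order of first appearance.
levels(as_factor(x))

# Base R equivalent of as_factor() for a character vector.
levels(factor(x, levels = unique(x)))
```

This is why the arrange() call has to come before as_factor(): the row order at that point becomes the level order.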

In case you'd like to order levels manually: You can do so also with forcats using https://forcats.tidyverse.org/reference/fct_relevel.html
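A short sketch of manual ordering with fct_relevel, assuming a toy factor f that is not part of the question:

```r
library(forcats)

f <- factor(c("low", "medium", "high", "low"))  # default levels are alphabetical

# List the levels in the order you want them.
manual <- fct_relevel(f, "low", "medium", "high")
levels(manual)

# Or move a single level, here to the end, leaving the rest as they were.
last <- fct_relevel(f, "high", after = Inf)
levels(last)
```

Levels that are not mentioned keep their existing relative order, so you only need to name the ones you want to move.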

Related

How do I use a dynamically declared variable in R ggplot when using count() and factor() functions?

I would like to plot some relative frequency data using ggplot in a more efficient manner.
I have many variables of interest, and want to plot a separate bar chart for each. The following is my current code for one variable of interest, Gender:
chart.gender <- data %>%
count(Gender = factor(Gender)) %>%
mutate(Gender = fct_reorder(Gender,desc(n))) %>%
mutate(pct = prop.table(n)) %>%
ggplot(aes(x=Gender, y=n, fill=Gender)) +
geom_col()
This works, but the variable Gender is repeated many times. Since I need to repeat plots for many variables of interest (Gender, Age, Location, etc.) with similar code, I would like to simplify this by declaring the variable of interest once at the top and using that declared variable for the rest of the code. Intuitively, something like:
var <- "Gender"
chart.gender <- data %>%
count(var = factor(var)) %>%
mutate(var = fct_reorder(var,desc(n))) %>%
mutate(pct = prop.table(n)) %>%
ggplot(aes(x=var, y=n, fill=var)) +
geom_col()
Which does not result in a plot of three-level factor count of gender frequencies, but merely a single column named 'Gender'. I believe I see why it's not working, but I do not know the solution for it: I want R to retrieve the variable name I stored in var, and then use that to retrieve the data for that variable in 'data'.
With some research I've found suggestions like using as.name(var), but there seems to (at the least) be a problem with declaring the variable var as a factor within the count() function.
Some reproducible data:
library(tidyverse)
library(ggplot2)
set.seed(1)
data <- data.frame(sample(c("Male", "Female", "Prefer not to say"),20,replace=TRUE))
colnames(data) <- c("Gender")
I'm using the following packages in R: tidyverse, ggplot2
Use the .data pronoun to subset the column whose name is stored in var.
library(tidyverse)
var <- "Gender"
data %>%
count(var = factor(.data[[var]])) %>%
mutate(var = fct_reorder(var,desc(n))) %>%
mutate(pct = prop.table(n)) %>%
ggplot(aes(x=var, y=n, fill=var)) +
geom_col()
Another way is to use sym() and !!:
data %>%
count(var = factor(!!sym(var))) %>%
mutate(var = fct_reorder(var,desc(n))) %>%
mutate(pct = prop.table(n)) %>%
ggplot(aes(x=var, y=n, fill=var)) +
geom_col()
If you use as.name() when you set the variable initially, you can use !! ("bang-bang") to unquote the variable for the count() step.
var <- as.name("Gender")
chart.gender <- data %>%
count(var = factor(!! var)) %>%
mutate(var = fct_reorder(var,desc(n))) %>%
mutate(pct = prop.table(n)) %>%
ggplot(aes(x=var, y=n, fill=var)) +
geom_col()
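Since the goal is to repeat the chart for many variables, the .data approach can be wrapped in a function. This is only a sketch: count_levels, plot_levels, the column name value, and the survey data are all hypothetical names introduced here, not something from the question.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical helper: count one column (given as a string), most frequent level first.
count_levels <- function(data, var) {
  data %>%
    count(value = factor(.data[[var]])) %>%
    mutate(value = fct_reorder(value, desc(n)),
           pct = prop.table(n))
}

# Hypothetical helper: bar chart of those counts, labelled with the column name.
plot_levels <- function(data, var) {
  count_levels(data, var) %>%
    ggplot(aes(x = value, y = n, fill = value)) +
    geom_col() +
    labs(x = var, fill = var)
}

# Made-up survey data for illustration.
survey <- data.frame(
  Gender = c("Male", "Female", "Female", "Prefer not to say", "Female", "Male"),
  Age    = c("18-30", "31-50", "31-50", "18-30", "31-50", "51+")
)

# One chart per variable of interest.
vars <- c("Gender", "Age")
charts <- lapply(vars, function(v) plot_levels(survey, v))
names(charts) <- vars
```

Using a fixed internal column name (value) and setting the axis label with labs() avoids having to inject the variable name into aes().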

bicolor heatmap with factor levels

I have this dataframe:
set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace=TRUE), levels=1:100),
year = factor(sample(1950:2019, 10000, replace=TRUE), levels=1950:2019)) %>% unique() %>% arrange(id, year)
And I'm looking to plot a heatmap where the ids are on the X-axis, years on the Y-axis, and the color is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to change the fill argument for the two colors:
ggplot(df, aes(id, year, fill= year)) +
geom_tile()
The objective to plot both variables as factors is to plot them even when some year doesn't have any id (and plotting its whole row as red).
EDIT:
Two things I forgot to add (hope it's not too late):
How can I add alpha transparency to geom_tile() without breaking the plot?
I need to sort the ids from most missing values to fewest.
The complete() function from the tidyr package is useful for filling in missing combinations. First, set a flag variable to indicate that the data is present, then expand the data frame with the missing combinations and fill the new flag variable with FALSE:
df <- df %>%
mutate(flag = TRUE) %>%
complete(id, year, fill = list(flag = FALSE))
ggplot(df, aes(id, year, fill = flag)) +
geom_tile()
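To see what complete() does to factor columns, here is a tiny sketch with made-up data (obs and full are illustrative names). Because both columns are factors, unobserved levels such as year 2002 are included in the expansion:

```r
library(dplyr)
library(tidyr)

# Three observed id/year pairs; year 2002 is a level but never observed.
obs <- data.frame(
  id   = factor(c(1, 1, 2), levels = 1:2),
  year = factor(c(2000, 2001, 2000), levels = 2000:2002)
)

full <- obs %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

nrow(full)      # every id/year combination, observed or not
sum(full$flag)  # only the originally observed rows are TRUE
```

The logical flag can then be mapped to fill, and scale_fill_manual(values = c(`TRUE` = "blue", `FALSE` = "red")) would give the two colors asked for.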
EDIT1: To add transparency, add alpha = 0.x within geom_tile(), where x is a value indicating the transparency. The lower the value, the more transparent.
EDIT2: To sort by missingness, add the following code prior to the ggplot code:
# Determine the order of the IDs (fewest present rows, i.e. most missing, first)
df_order <- df %>%
group_by(id) %>%
summarize(sum = sum(flag)) %>%
arrange(sum) %>%
mutate(order = row_number()) %>%
select(id, order)
# Set the IDs in order on the chart
df <- df %>%
left_join(df_order, by = "id") %>%
mutate(id = fct_reorder(id, order))
I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) which denotes data is present for that id and year. Then use complete to fill the missing years for each id and plot it.
library(tidyverse)
df %>%
mutate_all(~as.integer(as.character(.))) %>%
mutate(data_exist = 1) %>%
complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
mutate(data_exist = factor(data_exist)) %>%
ggplot() + aes(id, year, fill= data_exist) + geom_tile()
With expand.grid you can create a data frame with all combinations of ids and years, then left join df (flagged as present) onto these combinations to see which ones you had:
all <- expand.grid(id=levels(df$id),year=levels(df$year)) %>%
left_join(df %>% mutate(present='1'), by=c('id','year')) %>% # flag observed rows before joining
mutate(present=ifelse(is.na(present),'0','1'))
ggplot(all, aes(as.numeric(id), as.numeric(year), fill= present)) +
geom_tile() +
scale_fill_manual(values=c('0'='red','1'='blue')) + # change default colors
theme(legend.position="None") # hide legend

Set ggplot title to reflect dplyr grouping

I've got a grouped dataframe generated in dplyr where each group reflects a unique combination of factor variable levels. I'd like to plot the different groups using code similar to this post. However, I can't figure out how to include two (or more) variables in the title of my plots, which is a hassle since I've got a bunch of different combinations.
Fake data and plotting code:
library(dplyr)
library(ggplot2)
spiris<-iris
spiris$site<-as.factor(rep(c("A","B","C")))
spiris$year<-as.factor(rep(2012:2016))
spiris$treatment<-as.factor(rep(1:2))
g<-spiris %>%
group_by(site, Species) %>%
do(plots=ggplot(data=.) +
aes(x=Petal.Width)+geom_histogram()+
facet_grid(treatment~year))
##Need code for title here
g[[3]] ##view plots
I need the title of each plot to reflect both "site" and "Species". Any ideas?
Use split() %>% purrr::map2() instead of group_by() %>% do() like this:
spiris %>%
split(list(.$site, .$Species)) %>%
purrr::map2(.y = names(.),
~ ggplot(data=., aes(x=Petal.Width)) +
geom_histogram()+
facet_grid(treatment~year) +
labs(title = .y) )
You just need to set the title with ggtitle():
g <- spiris %>%
group_by(site, Species) %>%
do(plots = ggplot(data = .) +
aes(x = Petal.Width) + geom_histogram() +
facet_grid(treatment ~ year) +
ggtitle(unique(paste(.$Species, .$site, sep = " - ")))) # one title per group
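A base R variant of the split-and-title idea, sketched under assumptions: plot_data, group_key, and the site assignment below are made up to mimic the question's data, and bins = 10 is an arbitrary choice.

```r
library(ggplot2)

# Made-up data resembling the question: iris with a recycled site column.
plot_data <- iris
plot_data$site <- rep(c("A", "B", "C"), length.out = nrow(plot_data))

# Build one grouping key per site/Species combination; it doubles as the title.
group_key <- interaction(plot_data$site, plot_data$Species, sep = " - ", drop = TRUE)
groups <- split(plot_data, group_key)

# One histogram per group, titled with the group's name.
plots <- Map(function(d, title) {
  ggplot(d, aes(x = Petal.Width)) +
    geom_histogram(bins = 10) +
    ggtitle(title)
}, groups, names(groups))

names(plots)
```

The list is named by group, so a single plot can be pulled out with plots[["A - setosa"]].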

grouped by factor level in ggplot2()

I've got a data frame with four three-level categorical variables: before_weight, after_weight, before_pain, and after_pain.
I'd like to make a bar plot featuring the proportion for each level of the variables. My current code achieves that.
The problem's the presentation of the data. I'd like the respective before and after bars to be grouped together, so that the bar representing the people that answered 1 in the before_weight variable is grouped next to the bar representing the people that answered 1 in the after_weight variable, and so forth for both the weight and pain variables.
I've been trying to use dplyr, mutate() with numerous ifelse() statements, to make a new variable pairing up the groups in question, but can't seem to get it to work.
Any help would be much appreciated.
starting point (df):
df <- data.frame(before_weight=c(1,2,3,2,1),before_pain=c(2,2,1,3,1),after_weight=c(1,3,3,2,3),after_pain=c(1,1,2,3,1))
current code:
library(tidyr)
dflong <- gather(df, varname, score, before_weight:after_pain, factor_key=TRUE)
dflong$score <- as.factor(dflong$score) # score lives in the long data frame, not in df
library(ggplot2)
library(dplyr)
dflong %>%
group_by(varname) %>%
count(score) %>%
mutate(prop = 100*(n / sum(n))) %>%
ggplot(aes(x = varname, y = prop, fill = factor(score))) + scale_fill_brewer() + geom_col(position = 'dodge', colour = 'black')
UPDATE:
I'd like proportions rather than counts, so I've attempted to tweak Nate's code. Since I'm using the question variable to group the data to get the proportions, I can't seem to use gsub() to change the content of that variable. Instead I added question2 and passed it into facet_wrap(). It seems to work:
df %>% gather("question", "val") %>%
count(question, val) %>%
group_by(question) %>%
mutate(percent = 100*(n / sum(n))) %>%
mutate(time= factor(ifelse(grepl("before", question), "before", "after"), c("before", "after"))) %>%
mutate(question2= ifelse(grepl("weight", question), "weight", "pain")) %>%
ggplot(aes(x=val, y=percent, fill = time)) + geom_col(position = "dodge") + facet_wrap(~question2)
Does this code make the visual comparisons you are after? One ifelse and a gsub will help make variables we can use for facetting and filling in ggplot.
df %>% gather("question", "val") %>% # go long
mutate(time = factor(ifelse(grepl("before", question), "before", "after"),
c("before", "after")), # use factor with levels to control order
question = gsub(".*_", "", question)) %>% # clean for facets
ggplot(aes(x = val, fill = time)) + # use fill not color for whole bar
geom_bar(position = "dodge") + # stacking is the default option
facet_wrap(~question) # two panels
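As an alternative to gather() plus grepl()/gsub(), tidyr's pivot_longer() can split names like before_weight into two columns in one step. This is a sketch, not the answer's method; the names long and props are illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(before_weight = c(1, 2, 3, 2, 1), before_pain = c(2, 2, 1, 3, 1),
                 after_weight  = c(1, 3, 3, 2, 3), after_pain  = c(1, 1, 2, 3, 1))

# Split each column name at "_" into a time part and a question part.
long <- df %>%
  pivot_longer(everything(),
               names_to = c("time", "question"),
               names_sep = "_",
               values_to = "score") %>%
  mutate(time = factor(time, levels = c("before", "after")))

# Percentages within each question/time combination.
props <- long %>%
  count(question, time, score) %>%
  group_by(question, time) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()
```

props can then be plotted with aes(x = score, y = percent, fill = time), geom_col(position = "dodge"), and facet_wrap(~question).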

How to Plot Every Column in Descending Order in R

I intend to plot every categorical column in the data frame in descending order of the frequency of the levels in each variable.
I have already found out how to plot every column and reorder the levels, but I cannot figure out how to combine them together. Could you please give me some suggestions?
Code for plotting every column:
require(purrr)
library(tidyr)
library(ggplot2)
diamonds %>%
keep(is.factor) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_bar()
Code for reordering the levels of one variable:
tb <- table(x)
factor(x, levels = names(tb[order(tb, decreasing = TRUE)]))
BTW, if you feel there is a better way of writing this code, please let me know.
Thanks.
Alternative 1
No need to use gridExtra to emulate facet_wrap, just include the function reorder_size inside aes:
reorder_size <- function(x) {
factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
diamonds %>%
keep(is.factor) %>%
gather() %>%
ggplot(aes(x = reorder_size(value))) +
facet_wrap(~ key, scales = "free") +
geom_bar()
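A quick check of what reorder_size does, using a made-up vector (x and f are illustrative names):

```r
# Same helper as above: levels sorted by how often each value occurs.
reorder_size <- function(x) {
  factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}

x <- c("b", "a", "b", "c", "b", "a")
f <- reorder_size(x)

levels(f)  # most frequent value first
table(f)   # counts now appear in descending order
```

Note that inside aes() the function sees the whole gathered value column, so the ordering is computed once across all facets. That works here as long as the columns being faceted do not share level names.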
Alternative 2
Use dplyr to calculate the counts, grouping by key and value. Then reorder value in descending order of count inside aes.
library(dplyr)
diamonds %>%
keep(is.factor) %>%
gather() %>%
group_by(key,value) %>%
summarise(n = n()) %>%
ggplot(aes(x = reorder(value, -n), y = n)) +
facet_wrap(~ key, scales = "free") +
geom_bar(stat='identity')
The problem with your approach is that the long form of your data frame introduces a lot of factor levels that would be plotted as 0 by geom_bar().
Instead of relying on facet_wrap and dealing with the long data-form, here's an alternative.
Reordering by size function:
reorder_size <- function(x) {
factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
Using gridExtra::grid.arrange function to deliver similar facet_wrap style figure:
library(gridExtra)
a <- ggplot(diamonds, aes(x=reorder_size(cut))) + geom_bar()
b <- ggplot(diamonds, aes(x=reorder_size(color))) + geom_bar()
c <- ggplot(diamonds, aes(x=reorder_size(clarity))) + geom_bar()
grid.arrange(a,b,c, nrow=1)

Resources