I have a DF like this:
df_test <- data.frame (ID = c(88888, 99999),
Cluster1 = c(0, 1),Cluster2 = c(0, 2),Cluster3 = c(1, 3)
)
ID Cluster1 Cluster2 Cluster3
1 88888 0 0 1
2 99999 1 2 3
Now I want a horizontal bar graph with the clusters on the y axis. All bars should go from 0-3 (min - max), because this is the range of the clusters. As color I want to have three gradations, 0-1 red, 1-2 yellow and 2-3 green. The values from the DF should then be shown as an arrow or line on the overall bar. Is this somehow possible with ggplot2?
You can use geom_col, following the example at https://ggplot2.tidyverse.org/reference/geom_bar.html
library(dplyr)
library(tidyr)
library(ggplot2)
First, make the data tidy:
df <- df_test %>% pivot_longer(cols = 2:4,
names_to = "Cluster",
values_to = "value")
Keep the largest of each cluster for making a bar chart:
df <- df %>% group_by(Cluster) %>%
filter(value == max(value)) %>%
ungroup() %>%
# identify color scheme:
mutate(cols = case_when(value <=1 ~ "red",
value > 1 & value <= 2 ~ "yellow",
value > 2 ~ "green"))
ggplot(df) + geom_col(aes(x = value, y = Cluster, fill = Cluster)) +
  scale_colour_manual(
    # name the colours so each one is matched to its own cluster
    values = setNames(df$cols, df$Cluster),
    aesthetics = c("colour", "fill")
  )
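The code above colours each bar by its largest value. For the layout described in the question (every bar spanning 0 to 3 in three colour bands, with each value drawn as a marker on top), one possible sketch follows. It is not taken from the answer: the names `long`, `bands` and `p`, the marker shape, and the choice of one panel per ID are all assumptions made for illustration.

```r
library(ggplot2)
library(tidyr)

df_test <- data.frame(ID = c(88888, 99999),
                      Cluster1 = c(0, 1), Cluster2 = c(0, 2), Cluster3 = c(1, 3))

# Long format: one row per ID and cluster
long <- pivot_longer(df_test, cols = -ID, names_to = "Cluster", values_to = "value")

# One row per cluster and colour band: 0-1 red, 1-2 yellow, 2-3 green
bands <- expand.grid(Cluster = unique(long$Cluster), xmin = 0:2,
                     stringsAsFactors = FALSE)
bands$xmax <- bands$xmin + 1
bands$band <- c("red", "yellow", "green")[bands$xmin + 1]

p <- ggplot() +
  # background bar: three tiles of width 1, centred at 0.5, 1.5 and 2.5
  geom_tile(data = bands,
            aes(x = (xmin + xmax) / 2, y = Cluster, fill = band),
            width = 1, height = 0.6) +
  # the observed value, drawn as a downward-pointing triangle
  geom_point(data = long, aes(x = value, y = Cluster),
             shape = 25, size = 4, fill = "black") +
  scale_fill_identity() +
  facet_wrap(~ ID, ncol = 1) +  # one panel per ID (an assumption)
  labs(x = "value", y = NULL)

p
```

Because `bands` has no ID column, ggplot2 repeats the background tiles in every panel, so only the markers differ between IDs.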
Suppose I wish to make a range plot with ggplot, using the following dummy data:
set.seed(1)
test.dat <- data.frame(
yval = sample(1:100, 40),
xcat = rep(LETTERS[1:4], 10),
base = sample(c(1, 0), 40, replace = TRUE),
col = rep(c("red", "blue"), 20) # 20 repeats of 2 colours = 40 rows, matching the other columns
)
> head(test.dat)
yval xcat base col
1 68 A 0 red
2 39 B 0 blue
3 1 C 0 red
4 34 D 1 blue
5 87 A 0 red
6 43 B 0 blue
The gray portion shows the range of the data where base == 1 and the whisker-like line (that resembles errorbar) shows the range of the data where base == 0 using the respective color designed for each xcat.
So using this dummy data, I would expect:
minmax <- function(x){
return(
c(min(x),max(x))
)
}
> minmax(test.dat[test.dat$xcat == "D" & test.dat$base == 1,]$yval)
[1] 24 99
> minmax(test.dat[test.dat$xcat == "D" & test.dat$base == 0,]$yval)
[1] 21 82
> unique(test.dat[test.dat$xcat == "D",]$col)
[1] "blue"
for xcat == "D", a gray bar to range from 24 to 99, and a blue whisker line to range from 21 to 82.
How can I achieve this? It looks like there is no straightforward ggplot function to create a range plot.
My idea was to adjust geom_boxplot's hinge and whisker definitions for the gray part, and to use geom_line or geom_linerange for the whisker-line part, but I am unsure how to do that.
Thank you.
You first create a dataframe where you have min and max for each combination of (xcat, base and col)
data2 <- test.dat %>% group_by(xcat, base, col) %>% summarise(min = min(yval), max=max(yval))
Then you use geom_linerange for the gray "bars" and geom_errorbar for the whisker line:
ggplot()+
geom_linerange(data= data2 %>% filter(base==1), aes(x= xcat, ymin=min, ymax=max), size=12, alpha=0.5)+
# take the whisker colours from the same base == 0 rows that are being drawn
geom_errorbar(data= data2 %>% filter(base==0), aes(x= xcat, ymin=min, ymax=max), colour=data2[data2$base==0,]$col, width=.2)
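A variant of the same idea, as a sketch rather than the answerer's exact code: instead of passing a vector of colours by position, the `col` column can be mapped inside `aes()` and drawn literally with `scale_colour_identity()`. The name `ranges` is illustrative, and `linewidth` assumes ggplot2 3.4.0 or later (older versions use `size` for line thickness).

```r
library(dplyr)
library(ggplot2)

set.seed(1)
test.dat <- data.frame(
  yval = sample(1:100, 40),
  xcat = rep(LETTERS[1:4], 10),
  base = sample(c(1, 0), 40, replace = TRUE),
  col  = rep(c("red", "blue"), 20)
)

# min and max of yval for every (xcat, base, col) combination
ranges <- test.dat %>%
  group_by(xcat, base, col) %>%
  summarise(min = min(yval), max = max(yval), .groups = "drop")

p <- ggplot(mapping = aes(x = xcat, ymin = min, ymax = max)) +
  # thick gray line: range of the rows with base == 1
  geom_linerange(data = filter(ranges, base == 1),
                 linewidth = 12, colour = "grey70") +
  # whisker: range of the rows with base == 0, coloured by the col column
  geom_errorbar(data = filter(ranges, base == 0),
                aes(colour = col), width = 0.2) +
  scale_colour_identity()

p
```

Mapping the colour through `aes()` keeps each whisker tied to its own row, so the plot stays correct even if some xcat has no base == 0 rows.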
I would suggest doing some reshaping first using dplyr/tidyr, and then geom_tile:
library(tidyverse)
test.dat %>%
group_by(xcat, base, col) %>%
summarize(mid = mean(range(yval)),
range = diff(range(yval)), .groups = "drop") %>%
pivot_wider(names_from = base, values_from = mid:range) %>%
ggplot(aes(x = xcat)) +
# gray tile: range where base == 1; coloured tile: range where base == 0, as in the question
geom_tile(aes(y = mid_1, height = range_1), fill = "gray70", color = "black") +
geom_tile(aes(y = mid_0, height = range_0, fill = col), color = "black") +
scale_fill_identity()
I'm trying to understand what fct_reorder2() does, but the R documentation is not clear on this. The example below is copied from the help page for this function:
df <- tibble::tribble(
~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)
df$color <- factor(df$color)
#> Levels: blue green purple red yellow
fct_reorder(df$color, df$a, min)
#> Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
#> Levels: purple red blue green yellow
Can someone help me understand why fct_reorder2() sorts the data this way?
The example in the docs is poor, because there is only one row of data for each factor level.
What helped me was a chapter of R4DS: tidyverse and beyond, which is based on an example in the original R4DS, which I reproduce below.
fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.
library(tidyverse)
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")
Created on 2021-03-03 by the reprex package (v1.0.0)
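To see the rule on a small scale, here is a minimal sketch with several rows per level. The data frame `d` and its columns are made up for illustration. Each group ends at a different y value at its largest x, and `fct_reorder2()` orders the levels by that final y value, largest first.

```r
library(forcats)

# Three groups, each observed at x = 1, 2, 3
d <- data.frame(
  grp = factor(rep(c("a", "b", "c"), each = 3)),
  x   = rep(1:3, times = 3),
  y   = c(9, 5, 1,   # a: y = 1 at the largest x
          2, 2, 2,   # b: y = 2 at the largest x
          1, 4, 8)   # c: y = 8 at the largest x
)

# Default .fun = last2: order by y at the largest x, descending
levels(fct_reorder2(d$grp, d$x, d$y))
#> [1] "c" "b" "a"

# .fun = first2: order by y at the smallest x, descending
levels(fct_reorder2(d$grp, d$x, d$y, .fun = first2))
#> [1] "a" "b" "c"
```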
fct_reorder2() uses last2 as its default .fun. If we check the source code:
last2
function (.x, .y)
{
.y[order(.x, na.last = FALSE)][length(.y)]
}
and now using last2
last2(df$a, df$b)
#[1] 2
df$b[order(df$a, na.last = FALSE)][length(df$b)]
#[1] 2
Now, if we check the source code of fct_reorder2, it calls lvls_reorder, which in turn calls refactor. The same steps can be reproduced by hand:
s1 <- tapply(seq_along(df$a), df$color, function(i) last2(df$a[i], df$b[i]))
forcats:::refactor(df$color, levels(df$color)[order(s1, decreasing = TRUE)])
#[1] blue green purple red yellow
#Levels: purple red blue green yellow
Code lineage
fct_reorder2
function (.f, .x, .y, .fun = last2, ..., .desc = TRUE)
{
f <- check_factor(.f)
stopifnot(length(f) == length(.x), length(.x) == length(.y))
ellipsis::check_dots_used()
summary <- tapply(seq_along(.x), f, function(i) .fun(.x[i],
.y[i], ...))
if (is.list(summary)) {
stop("`fun` must return a single value per group", call. = FALSE)
}
lvls_reorder(.f, order(summary, decreasing = .desc))
}
lvls_reorder
function (f, idx, ordered = NA)
{
f <- check_factor(f)
if (!is.numeric(idx)) {
stop("`idx` must be numeric", call. = FALSE)
}
if (!setequal(idx, lvls_seq(f)) || length(idx) != nlevels(f)) {
stop("`idx` must contain one integer for each level of `f`",
call. = FALSE)
}
refactor(f, levels(f)[idx], ordered = ordered)
}
I was also confused by this function, and it took me a long time to understand it. Below is my understanding. I hope it helps.
Imagine that you are trying to draw a line plot with multiple lines, distinguished by their colors. Allow me to borrow the code from @nacnudus for clarity.
library(tidyverse)
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
# here we use a customized sequential palette
my_palette = c("#7a0177", "#c51b8a", "#f768a1", "#fa9fb5", "#fcc5c0", "#feebe2")
scales::show_col(my_palette)
my_palette
ggplot(by_age, aes(age, prop, colour = marital)) +
scale_color_manual(values = my_palette) +
geom_line(na.rm = TRUE)
line plot without reorder
Without reordering the marital variable, the colors look messy. To make the colors match the data, we usually want them to follow the value at the right-hand end of each line: the larger the value at the rightmost point, the darker the color.
So the next step is to reorder marital by its value at the rightmost point (i.e., the value of prop at each marital's largest age).
new_order = by_age %>%
group_by(marital) %>% # for each marital
arrange(age, prop) %>% # sort first by age, and then by prop
dplyr::filter(age == max(age)) %>% # find the largest age in each marital
ungroup() %>%
arrange(desc(prop)) %>% # reorder the data by prop
pull(marital) %>% # get the new order
as.character() # convert the new order to general character
# Then we plot the reordered line plot
ggplot(by_age, aes(age, prop, colour = fct_relevel(marital, new_order))) + # now we reorder the marital by new orders
geom_line() +
scale_color_manual(values = my_palette) +
labs(colour = "marital")
reordered line plot 1
Now the colors look reasonable.
And that is EXACTLY what fct_reorder2() does for you.
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) + # now we use fct_reorder2 to reorder marital
geom_line() +
scale_color_manual(values = my_palette) +
labs(colour = "marital")
reordered line plot 2
For each marital level, it orders the data by age and picks the last prop value, i.e. the prop at the largest age (in this case there is only one such value). The picked values are then used to reorder the factor levels in descending order (by default, .fun = last2). If you set .fun = first2, the function reorders marital by the prop value at each marital's smallest age:
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop, .fun = first2))) +
geom_line() +
scale_color_manual(values = my_palette) +
labs(colour = "marital")
reordered line plot 3
I hope it helps.
I have a dataset with multiple columns. I am visually summarizing several columns using simple bar plots. A simple example:
set.seed(123)
df <-
data.frame(
a = sample(1:2, 20, replace = T),
b = sample(0:1, 20, replace = T)
)
ggplot(gather(df,, factor_key = TRUE), aes(x = factor(value))) +
geom_bar() +
facet_wrap(~ key, scales = "free_x", as.table = TRUE) +
xlab("")
Now, I want to add percentages above each of the 4 columns, saying what percent of rows in the dataframe each column represents. I.e., here, the following numbers would appear right above the four columns, from left to right in this order: 55%, 45%, 60%, 40%.
How can I automate this---given that I have a large number of columns I have to do this for? (Note I want to keep the raw count of responses on the Y axis and just have percentages appear in the plots.)
In addition to the answer proposed by @BappaDas: in your case you want to keep the count on the y axis and add the percentages as labels, whereas that answer shows percentages both on the y axis and in the text labels.
A modified solution is to compute the count for each variable and then calculate the percentage. One way to do this is with tidyr (to reshape the data into long form) and dplyr:
library(tidyr)
library(dplyr)
df %>% pivot_longer(everything(), names_to = "var", values_to = "val") %>%
group_by(var) %>% count(val) %>%
mutate(Label = n/sum(n))
# A tibble: 4 x 4
# Groups: var [2]
var val n Label
<chr> <int> <int> <dbl>
1 a 1 11 0.55
2 a 2 9 0.45
3 b 0 12 0.6
4 b 1 8 0.4
Now at the end of this pipe sequence, you can add ggplot plotting code in order to obtain the desired output by passing the count as y argument and the percentage as label argument:
library(tidyr)
library(dplyr)
library(ggplot2)
df %>% pivot_longer(everything(), names_to = "var", values_to = "val") %>%
group_by(var) %>% count(val) %>%
mutate(Label = n/sum(n)) %>%
ggplot(aes(x = factor(val), y = n))+
geom_col()+
facet_wrap(~var, scales = "free", as.table = TRUE)+
xlab("")+
geom_text(aes(label = scales::percent(Label)), vjust = -0.5)
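An alternative sketch, offered as an assumption rather than part of the answer above, lets ggplot2 do the counting. `geom_text()` with `stat = "count"` and `after_stat()` (available from ggplot2 3.3.0) can divide each bar's count by the total of its facet panel, using the internal `PANEL` column of the computed data. The names `long` and `p` are illustrative.

```r
library(tidyr)
library(ggplot2)

set.seed(123)
df <- data.frame(
  a = sample(1:2, 20, replace = TRUE),
  b = sample(0:1, 20, replace = TRUE)
)

# Long format: one row per original cell, whatever the number of columns
long <- pivot_longer(df, everything(), names_to = "var", values_to = "val")

p <- ggplot(long, aes(x = factor(val))) +
  geom_bar() +  # raw counts stay on the y axis
  geom_text(
    stat = "count",
    # share of each bar within its own facet panel, formatted as a percentage
    aes(label = after_stat(scales::percent(count / ave(count, PANEL, FUN = sum)))),
    vjust = -0.5
  ) +
  facet_wrap(~ var, scales = "free_x") +
  xlab("")

p
```

This avoids the separate `count()` and `mutate()` step, at the cost of relying on `PANEL`, which is an internal column name rather than a documented interface.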
For data called df that reads:
car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1
total = apply(df,1,sum)
barplot(total,col= rainbow(5))
What I have done so far is draw a bar plot of the total number of vehicles, which is the sum of each row. What I want now is to present it as a stacked bar plot of that sum.
For now, it just shows "total", without any segments indicating that 1 car, 2 suv and 1 pickup add up to a "total" of 4.
Note: this is different from barplot(matrix(df)), because that just divides the data by car, suv and pickup and disregards the total.
You can achieve this easily using ggplot2 and reshape2.
You will need an ID column to track the rows, so I have added that in. I melt the data to long type so that the different groups can be managed and plotted accordingly.
Then plot using geom_bar, specifying the row ids as the x axis and the groupings (fill and colour) for the stack plot and legend.
library(reshape2)
library(ggplot2)
library(dplyr) # provides the %>% pipe used below
df <- data.frame("ID" = c(1,2,3,4,5), "car" = c(1,2,4,5,3), "suv" = c(2,3,1,4,1), "pickup" = c(1, 4, 2, 2, 1))
long_df <- df %>% melt(id.vars = c("ID") ,value.name = "Number", variable.name = "Type")
ggplot(data = long_df, aes(x = ID, y = Number)) +
geom_bar(aes(fill = Type, colour = Type),
stat = "identity",
position = "stack")
With base R
# barplot() stacks the rows of a numeric matrix within each bar,
# so put one vehicle type per row and one ID per column
barplot(t(as.matrix(df[, c("car", "suv", "pickup")])),
        names.arg = df$ID, legend.text = TRUE)
Are you after something like this?
library(tidyverse)
df %>%
rowid_to_column("row") %>%
gather(k, v, -row) %>%
ggplot(aes(row, v, fill = k)) +
geom_col()
We use a stacked barplot here, so there is no need to manually calculate the sum. The key here is to transform data from wide to long and keep track of the row.
Sample data
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = T)
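As a small extension, sketched under the assumption that the row total should also be visible, the total can be printed above each stacked bar. This version uses `pivot_longer()` instead of `gather()`. The names `long`, `totals` and `p` are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = TRUE)

# Wide to long, keeping track of the original row
long <- df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "type", values_to = "n")

# Height of each stacked bar = sum of its segments = row total
totals <- long %>%
  group_by(row) %>%
  summarise(total = sum(n))

p <- ggplot(long, aes(x = row, y = n)) +
  geom_col(aes(fill = type)) +
  geom_text(data = totals, aes(y = total, label = total), vjust = -0.5)

p
```

The totals computed from the long data equal `rowSums(df)`, which confirms that the stacked bars reproduce the original "total" bar plot.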
I want to plot the rolling mean of data of different time series with ggplot2. My data have the following structure:
library(dplyr)
library(ggplot2)
library(zoo)
library(tidyr)
df <- data.frame(episode=seq(1:1000),
t_0 = runif(1000),
t_1 = 1 + runif(1000),
t_2 = 2 + runif(1000))
df.tidy <- gather(df, "time", "value", -episode) %>%
separate("time", c("t", "time"), sep = "_") %>%
subset(select = -t)
> head(df.tidy)
# episode time value
#1 1 0 0.7466480
#2 2 0 0.7238865
#3 3 0 0.9024454
#4 4 0 0.7274303
#5 5 0 0.1932375
#6 6 0 0.1826925
Now, the code below creates a plot where the lines for time = 1 and time = 2 towards the beginning of the episodes do not represent the data because value is filled with NAs and the first numeric entry in value is for time = 0.
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
geom_point(alpha = 0.2) +
geom_line(aes(y = rollmean(value, 10, align = "right", fill = NA)))
How do I have to adapt my code such that the rolling-mean lines are representative of my data?
Your issue is you are applying a moving average over the whole column, which makes data "leak" from one value of time to another.
You could group_by first to apply the rollmean to each time separately:
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
geom_point(alpha = 0.2) +
geom_line(data = df.tidy %>%
group_by(time) %>%
mutate(value = rollmean(value, 10, align = "right", fill = NA)))
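The leak is easy to see on a toy example. The data frame `toy` below is made up for illustration: two series with constant values 0 and 10, and a window of 3. Without grouping, the first windows of the second series still contain values from the first one. With `group_by()`, they are NA until the window is filled with values from the same series.

```r
library(dplyr)
library(zoo)

toy <- data.frame(
  time  = rep(c("0", "1"), each = 5),
  value = c(rep(0, 5), rep(10, 5))
)

res <- toy %>%
  # ungrouped: the window slides across the boundary between the two series
  mutate(leaky = rollmean(value, 3, align = "right", fill = NA)) %>%
  group_by(time) %>%
  # grouped: the window restarts for every value of time
  mutate(grouped = rollmean(value, 3, align = "right", fill = NA)) %>%
  ungroup()

res[6:8, ]
```

In row 6, the first row of the second series, `leaky` is the mean of 0, 0 and 10, while `grouped` is NA. From row 8 onwards both columns are 10.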