How to subset in ggplot based on two character variables?

I am trying to subset my data in ggplot based on two character variables: model and letter. I want to keep only the rows where model is "m1" and letter is "a". In the original data, I have multiple rows with "m1" and "a"; below is just a small reproducible example. Can someone show me how to do the subsetting inside the ggplot call?
model value letter
m1 5 a
m2 11 b
m3 2 c
m1 4 d
m2 22 e
m3 6 f
structure(list(model = structure(c("m1", "m2", "m3", "m1", "m2",
"m3"), format.stata = "%9s"), value = structure(c(5, 11, 2, 4,
22, 6), format.stata = "%9.0g"), letter = structure(c("a", "b",
"c", "d", "e", "f"), format.stata = "%9s")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

We could do a grouped filter, which keeps every row of any model that has an 'a' among its letters, and pipe the result into ggplot:
library(dplyr)
library(ggplot2)
df1 %>%
group_by(model) %>%
filter('a' %in% letter) %>%
ggplot(aes(x = letter, y = value)) +
geom_col()
Or, if only the rows with model 'm1' and letter 'a' are needed, filter on both conditions at once:
df1 %>%
filter(model == 'm1', letter == 'a') %>%
ggplot(aes(x = letter, y = value)) +
geom_col()
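The two filters above do not keep the same rows, which may be easier to see without the plot. The following is a minimal sketch on the example data, rebuilt here as a plain data frame; the names by_group and direct are made up for illustration.

```r
library(dplyr)

# Example data from the question, rebuilt as a plain data frame
df1 <- data.frame(
  model  = c("m1", "m2", "m3", "m1", "m2", "m3"),
  value  = c(5, 11, 2, 4, 22, 6),
  letter = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Grouped filter: keeps ALL rows of every model that has an "a" somewhere
by_group <- df1 %>%
  group_by(model) %>%
  filter("a" %in% letter) %>%
  ungroup()

# Direct filter: keeps only rows where both conditions hold
direct <- df1 %>%
  filter(model == "m1", letter == "a")

print(by_group)  # both m1 rows (letters "a" and "d")
print(direct)    # only the m1 / "a" row
```

Which one is appropriate depends on whether the other "m1" rows should appear in the plot.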

Does this work?
ggplot(subset(df,model=='m1' & letter=='a'),aes(x=letter,y=value))+
geom_point()
Explanation:
The data argument of ggplot() accepts any expression that returns a data frame, so you can call subset() directly inside it.

Related

R ggplot legend with Waffle chart

library(tidyverse)
library(waffle)
df_2 <- structure(list(group = c(2, 2, 2, 1, 1, 1),
parts = c("A", "B", "C", "A", "B", "C"),
values = c(1, 39, 60, 14, 15, 71)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
df_2 %>% ggplot(aes(label = parts)) +
geom_pictogram(
n_rows = 10, aes(color = parts, values = values),
family = "fontawesome-webfont",
flip = TRUE
) +
scale_label_pictogram(
name = "Case",
values = c("male"),
breaks = c("A", "B", "C"),
labels = c("A", "B", "C")
) +
scale_color_manual(
name = "Case",
values = c("A" = "red", "B" = "green", "C" = "grey85"),
breaks = c("A", "B", "C"),
labels = c("A", "B", "C")
) +
facet_grid(~group)
With the above code, I got the legend I expected:
However, when I replaced df_2 with the following df_1 dataframe, I was unable to combine two legends.
df_1 <- structure(list(group = c(2, 2, 2, 1, 1, 1),
parts = c("A", "B", "C", "A", "B", "C"),
values = c(0, 0, 100, 0, 0, 100)),
row.names = c(NA,-6L), class = c("tbl_df", "tbl", "data.frame"))
I kind of know the cause of the problem (0 values) but I would like to keep the legend the same as the graph above. Any suggestions would be appreciated.
To make it clear, the package "waffle" referred to here is not the CRAN package "waffle", but the GitHub-only package:
remotes::install_github("hrbrmstr/waffle")
library(waffle)
You will also need a way of displaying the pictograms, such as:
library(emojifont)
load.fontawesome()
Now, as with any other discrete scale, if you want to add values that are not present in the (post-stat) data, you need to use the limits argument:
df_1 %>% ggplot(aes(label = parts)) +
geom_pictogram(
n_rows = 10, aes(color = parts, values = values),
family = "fontawesome-webfont",
flip = TRUE
) +
scale_label_pictogram(
name = "Case",
values = c("male"),
breaks = c("A", "B", "C"),
labels = c("A", "B", "C"),
limits = c("A", "B", "C")
) +
scale_color_manual(
name = "Case",
values = c("A" = "red", "B" = "green", "C" = "grey85"),
breaks = c("A", "B", "C"),
labels = c("A", "B", "C")
) +
facet_grid(~group)
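The effect of limits on a discrete scale can also be tried with plain ggplot2, without the waffle package. This is only a sketch under assumptions: the data frame d and the column names part and value are hypothetical, and show.legend = TRUE is added because newer ggplot2 versions may otherwise leave the keys of absent levels empty.

```r
library(ggplot2)

# Hypothetical data in which only level "C" is present
d <- data.frame(part = "C", value = 100)

p <- ggplot(d, aes(x = part, y = value, fill = part)) +
  geom_col(show.legend = TRUE) +
  scale_fill_manual(
    name   = "Case",
    values = c("A" = "red", "B" = "green", "C" = "grey85"),
    # limits lists every level the legend should contain,
    # including levels that do not occur in the data
    limits = c("A", "B", "C")
  )

print(p)
```

Without the limits argument, the legend would list only "C".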
A trickier workaround is to add 1 to every value, so that each case is drawn (and therefore shows up in the legend) as before, and then use ggplot_build() to drop the first three rows of each panel, i.e. the one extra pictogram added per case, before drawing:
library(tidyverse)
library(waffle)
library(emojifont)
library(extrafont)
p <- df_1 %>% ggplot(aes(label = parts)) +
geom_pictogram(
n_rows = 10, aes(color = parts, values = values+1),
family = "fontawesome-webfont",
flip = TRUE
) +
scale_label_pictogram(
name = "Case",
values = c("male"),
breaks = c("A", "B", "C"),
labels = c("A", "B", "C")
) +
scale_color_manual(
name = "Case",
values = c("A" = "red", "B" = "green", "C" = "grey85"),
breaks = c("A", "B", "C"),
labels = c("A", "B", "C")
) +
facet_grid(~group)
q <- ggplot_build(p)
q$data[[1]] <- q$data[[1]] %>%
group_by(PANEL) %>%
slice(4:n())
q <- ggplot_gtable(q)
plot(q)
Created on 2022-10-20 with reprex v2.0.2

How do you create a grouped barplot in R from only certain columns?

I have a data frame that looks like
Role <- letters[1:3]
df <- data.frame(Role,
Female1 = c(1,4,2),
Male1 = c(3,0,0),
Female2 = c(3,5,3),
Male2 = c(1,3,0))
df$FemaleTotal <- df$Female1 + df$Female2
df$MaleTotal <- df$Male1 + df$Male2
I want to create a barplot grouped by Male and Female for each column category (in this example, 1 and 2), stacked by Role, and also another plot with just the totals. To do just the totals I could use melt() and subset the data frame to only those columns, but that seems messy and doesn't help with the main plot I want to make.
An option would be to reshape to 'long' format
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
pivot_longer(cols = -Role, names_to = c( "group", '.value'),
names_sep="(?<=[a-z])(?=(\\d+|Total))") %>%
pivot_longer(-c(Role, group)) %>%
ggplot(aes(x = Role, y = value, fill = group)) +
geom_col() +
facet_wrap(~ name)
Data:
df <- structure(list(Role = c("a", "b", "c"), Female1 = c(1, 4, 2),
Male1 = c(3, 0, 0), Female2 = c(3, 5, 3), Male2 = c(1, 3,
0), FemaleTotal = c(4, 9, 5), MaleTotal = c(4, 3, 0)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))

Plotting based on occurrence in group

I would like to make a bar chart where each bar shows the proportion of groups in which a value occurs, rather than the usual percentage of rows. For a var to "count", it only needs to occur once in a group. For example, in this df, where id is the grouping variable:
df <-
tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
vars = c("a", NA, "b", "c", "d", "e", "a", "a", "a"))
The bars would be:
a = 2/3 # since a occurs in 2 out of 3 groups
b = 1/3
c = 1/3
d = 1/3
e = 1/3
If I understand you correctly, a one-liner would suffice:
ggplot(distinct(df)) + geom_bar(aes(vars, after_stat(count) / n_distinct(df$id)))
Working answer:
tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
vars = c("a", "a", "b", "c", "d", "e", "a", "a", "a")) %>%
group_by(id) %>%
distinct(vars) %>%
ungroup() %>%
add_count(vars) %>%
mutate(prop = n / n_distinct(id)) %>%
distinct(vars, .keep_all = TRUE) %>%
ggplot(aes(vars, prop)) +
geom_col()
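The expected proportions can be checked without plotting. Below is a base R sketch on the data from the question, with the NA entry dropped; the names pairs and prop are made up for illustration.

```r
# Data from the question, as a plain data frame
df <- data.frame(
  id   = rep(1:3, each = 3),
  vars = c("a", NA, "b", "c", "d", "e", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# Keep one row per (id, vars) pair, ignoring missing values
pairs <- unique(df[!is.na(df$vars), ])

# Number of groups containing each value, divided by the number of groups
prop <- table(pairs$vars) / length(unique(df$id))

print(prop)  # a is 2/3; b, c, d and e are 1/3 each
```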

Conditionally fill empty cells

I have a named vector with some missing values:
x = c(99, 88, 1, 2, 3, NA, NA)
names(x) = c("A", "C", "AA", "AB", "AC", "AD", "CA")
And a second dataframe which reflects the hierarchical naming structure (e.g. A is a superordinate to AA, AB, & AC)
filler = data.frame(super = c("A", "A", "A", "A", "C"), sub = c("AA", "AB", "AC", "AD", "CA"))
If a value is missing in x, I want to fill it with the superordinate from filler. So that the outcome would be
x = c(99, 88, 1, 2, 3, 99, 88)
Does anyone have any clever way to do this without looping through each possibility?
We can create a logical vector ('i1') that flags the NA elements, look up the matching superordinate names in 'filler' with match, and then do the assignment:
i1 <- is.na(x)
x[i1] <- x[match(filler$super[match(names(x[i1]), filler$sub)], names(x))]
as.vector(x)
#[1] 99 88 1 2 3 99 88
As x is a named vector, we could convert it to a data frame (enframe), do a join, replace each NA with the value of its superordinate and, if needed, convert the result back into a vector (deframe).
library(dplyr)
library(tibble)
enframe(x) %>%
left_join(filler, by = c("name" = "sub")) %>%
mutate(value = if_else(is.na(value), value[match(super, name)], value)) %>%
select(-super) %>%
deframe()
# A C AA AB AC AD CA
#99 88 1 2 3 99 88
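Another way to read the same idea is as a named lookup from each sub name to its superordinate. The sketch below uses only base R; the names parent and missing_names are made up for illustration, and as.character() guards against filler holding factors in older R versions.

```r
x <- c(99, 88, 1, 2, 3, NA, NA)
names(x) <- c("A", "C", "AA", "AB", "AC", "AD", "CA")
filler <- data.frame(super = c("A", "A", "A", "A", "C"),
                     sub   = c("AA", "AB", "AC", "AD", "CA"))

# Lookup vector: the name is the sub level, the value is its superordinate
parent <- setNames(as.character(filler$super), as.character(filler$sub))

# Names of the elements that are missing
missing_names <- names(x)[is.na(x)]

# Replace each missing element with the value of its superordinate
x[missing_names] <- x[parent[missing_names]]

print(x)
```

This assumes every missing name appears in filler$sub and that the superordinate itself is not missing.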

How to plot number of valid rows with ggplot2

With a dataframe as
df <- data.frame(name = c("a", "b", "c", "d", "e"),
class = c("a1", "a1", "a1", "b1", "b1"),
var1 = c("S", "S", "R", "S", "S"),
var2 = c("S", "R", NA, NA, "R"),
var3 = c(NA, "R", "R", "S", "S"))
I would like to plot the number of rows without NAs for each of var1 through var3.
One way I found is to generate another dataframe as
df_count <- matrix(nrow=3, ncol=2)
df_count <- as.data.frame(df_count)
names(df_count) <- c("var_num", "count")
df_count$var_num <- as.factor(names(df)[3:5])
for (i in 1:3) {
df_count[i,2] <- sum(!is.na(df[,i+2]))
}
and then plot as
ggplot(df_count, aes(x=var_num, y=count)) + geom_bar(stat="identity")
Is there an easier way to choose var1 through var3 and count the valid rows without generating a new dataframe?
library('ggplot2')
library('reshape2')
df <- melt(df, id.vars = c('name', 'class')) # melt data
df <- df[!is.na(df$value), ] # remove NA
df <- with(df, aggregate(df, by = list(variable), FUN = length )) # compute length by grouping variable
ggplot(df, aes( x = Group.1, y = value, fill = Group.1 )) +
geom_bar(stat="identity")
Stacked bar (the block above overwrites df, so start again from the original df shown under Data):
df <- melt(df, id.vars = c('name', 'class')) # melt data
df <- df[!is.na(df$value), ] # remove NA
df <- with(df, aggregate(df, by = list(variable, value), FUN = length )) # compute length by grouping variable and value
ggplot(df, aes( x = Group.1, y = value, fill = Group.2 )) +
geom_bar(stat="identity")
Data:
df <- data.frame(name = c("a", "b", "c", "d", "e"),
class = c("a1", "a1", "a1", "b1", "b1"),
var1 = c("S", "S", "R", "S", "S"),
var2 = c("S", "R", NA, NA, "R"),
var3 = c(NA, "R", "R", "S", "S"))
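The counts themselves can also be obtained without melting or looping, since colSums() works on the logical matrix returned by !is.na(). This is a base R sketch rather than part of the answer above; it still builds a small summary data frame, named df_count as in the question, so that ggplot() has something to plot.

```r
df <- data.frame(name  = c("a", "b", "c", "d", "e"),
                 class = c("a1", "a1", "a1", "b1", "b1"),
                 var1  = c("S", "S", "R", "S", "S"),
                 var2  = c("S", "R", NA, NA, "R"),
                 var3  = c(NA, "R", "R", "S", "S"))

# Count the non-missing entries in each of var1 through var3
counts <- colSums(!is.na(df[, c("var1", "var2", "var3")]))

# Two-column summary that can be passed to ggplot()
df_count <- data.frame(var_num = names(counts), count = as.vector(counts))

print(df_count)
```

The result could then be plotted with ggplot(df_count, aes(x = var_num, y = count)) + geom_col().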
