How to loop through columns in R to create plots?

I have three columns in a dataframe: age, gender and income.
I want to loop through these columns and create plots based on the data in them.
I know that in Stata you can loop through variables and run commands on each of them. However, the code below does not work. Is there an equivalent way to do this in R?
groups <- c(df$age, df$gender, df$income)
for (i in groups){
df %>% group_by(i) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(y = prop, x = i)) +
geom_col()
}

You can also use the tidyverse. Loop through a vector of grouping variable names with map. On each iteration, convert the name to a symbol and unquote it with !!sym(.x), so that group_by receives a column rather than a string. Alternatively, use across(all_of()), which takes strings directly as column names. The rest of the code is much the same as yours.
library(dplyr)
library(purrr)
library(ggplot2)
groups <- c('age', 'gender', 'income')
## with !!sym(.x)
map(groups, ~
df %>% group_by(!!sym(.x)) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
# map the current column to x (the loop variable i does not exist inside map)
ggplot(aes(y = prop, x = !!sym(.x))) +
geom_col()
)
## with across(all_of())
map(groups, ~
df %>% group_by(across(all_of(.x))) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(y = prop, x = !!sym(.x))) +
geom_col()
)
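If you want to use a for loop, print each plot explicitly, because ggplot objects are not auto-printed inside a loop:
groups <- c('age', 'gender', 'income')
for (i in groups){
p <- df %>% group_by(!!sym(i)) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
# aes(x = i) would map the constant string, so unquote the column instead
ggplot(aes(y = prop, x = !!sym(i))) +
geom_col()
print(p)
}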
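The loop above can also be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: plot_prop() is a hypothetical function name, the toy df stands in for the asker's data, and the .data pronoun is used in aes() in place of !!sym().

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for the asker's data frame (an assumption)
set.seed(42)
df <- data.frame(
  age    = sample(c("26-30", "31-35", "36-40"), 30, replace = TRUE),
  gender = sample(c("M", "F"), 30, replace = TRUE),
  income = sample(c("High", "Medium", "Low"), 30, replace = TRUE)
)

# Hypothetical helper: bar chart of the share of rows in each level of `var`,
# where `var` is a column name given as a string
plot_prop <- function(data, var) {
  data %>%
    group_by(across(all_of(var))) %>%
    summarise(n = n(), .groups = "drop") %>%
    mutate(prop = n / sum(n)) %>%
    ggplot(aes(x = .data[[var]], y = prop)) +
    geom_col() +
    labs(x = var)
}

groups <- c("age", "gender", "income")
# Named list with one plot per column
plots <- lapply(setNames(groups, groups), function(v) plot_prop(df, v))
```

Keeping the plots in a named list means a single one can be drawn later with print(plots$age), or all of them with a loop over the list.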

You can use lapply over the column names:
library(ggplot2)
df <- data.frame(age = sample(c("26-30", "31-35", "36-40", "41-45"), 20, replace = TRUE),
gender = sample(c("M", "F"), 20, replace = TRUE),
income = sample(c("High", "Medium", "Low"), 20, replace = TRUE),
prop = runif(20))
# .data[[v]] looks up the column named by the string v, so the axis is labelled correctly
lapply(names(df)[1:3], function(v) ggplot(data = df, aes(y = prop, x = .data[[v]])) + geom_col())
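If one figure with a panel per column is acceptable, the loop can be avoided altogether. This is a different technique from the answers above (reshape to long format, then facet), sketched here under the assumption that the columns can all be treated as categorical. The toy data and the names variable, level and props are illustrative only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data (an assumption, mirroring the sample used above)
set.seed(1)
df <- data.frame(
  age    = sample(c("26-30", "31-35", "36-40", "41-45"), 20, replace = TRUE),
  gender = sample(c("M", "F"), 20, replace = TRUE),
  income = sample(c("High", "Medium", "Low"), 20, replace = TRUE)
)

props <- df %>%
  # make every column character so they can share one value column
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
  count(variable, level) %>%
  group_by(variable) %>%
  mutate(prop = n / sum(n)) %>%   # share within each original column
  ungroup()

p <- ggplot(props, aes(x = level, y = prop)) +
  geom_col() +
  facet_wrap(~ variable, scales = "free_x")
```

Because the proportions are computed within each variable, every panel sums to 1.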

Related

Error in is.finite(x); need to add an additional line to a line chart (dplyr)

I have already searched for this problem without success. I have managed to reproduce the error below.
The problem: I'm trying to add a fourth line which represents the aggregate mean of all letters for each year. So far, I'm only able to generate the mean values for each letter. Everything runs fine until the last geom_line(), which is meant to draw the aggregate mean. I've also tried inserting abline(). One other option is adding the "Mean" values under Letters so that they are plotted anyway, but I believe there is a simpler method.
library(tidyverse)
Letters <- rep(c("A","B","C"),20)
Years <- rep(c(1990:1999),6)
Numbers <- runif(60, min = 0, max = 20)
df <- data.frame(Letters, Years, Numbers) %>%
group_by(Letters,Years) %>%
summarise(Letter_Mean= mean(Numbers),.groups = 'drop')
meanallletters <- df %>%
group_by(Years) %>%
summarise(all_mean = mean(Numbers),.groups = 'drop') %>%
select(-Years)
lineplotsample <- df %>%
ggplot(aes(x=Years, y=Letter_Mean, color = Letters))
## this doesn't work
lineplotsample + geom_line() + geom_point() + geom_line(aes(Years, y= meanallletters))
## this works, but missing the line representing aggregate mean
lineplotsample + geom_line() + geom_point()
I would summarize the data and then bind it to the bottom of the original data, like this:
library(tidyverse)
Letters <- rep(c("A","B","C"),20)
Years <- rep(c(1990:1999),6)
Numbers <- runif(60, min = 0, max = 20)
df <- data.frame(Letters, Years, Numbers) %>%
group_by(Letters,Years) %>%
summarise(Letter_Mean= mean(Numbers),.groups = 'drop')
meanallletters <- df %>%
group_by(Years) %>%
summarise(Letters = "All",
Letter_Mean = mean(Letter_Mean)) %>%
bind_rows(df,.) %>%
ungroup %>%
mutate(Letters = factor(Letters, levels=c("A", "B", "C", "All")))
meanallletters %>%
ggplot(aes(x=Years, y=Letter_Mean, color = Letters)) +
geom_line() +
geom_point()
Created on 2023-02-12 by the reprex package (v2.0.1)
Here's a more general way of specifying the levels. It also deals with the situation where Letters is initially a factor.
library(tidyverse)
Letters <- rep(LETTERS,20)
Years <- rep(c(1990:1999),26)
Numbers <- runif(26*10, min = 0, max = 20)
df <- data.frame(Letters, Years, Numbers) %>%
group_by(Letters,Years) %>%
summarise(Letter_Mean= mean(Numbers),.groups = 'drop')
meanallletters <- df %>%
mutate(Letters = as.character(Letters)) %>%
group_by(Years) %>%
summarise(Letters = "All",
Letter_Mean = mean(Letter_Mean)) %>%
bind_rows(df,.) %>%
ungroup %>%
mutate(Letters = factor(Letters, levels=c(levels(as.factor(df$Letters)), "All")))
meanallletters %>%
ggplot(aes(x=Years, y=Letter_Mean, color = Letters)) +
geom_line() +
geom_point()
Created on 2023-02-12 by the reprex package (v2.0.1)
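An alternative to binding rows is to give the extra geom_line() its own data frame. The sketch below is not from the answers above. It assumes the yearly mean is taken over the per-letter means, and the names raw, by_letter and overall are illustrative. The original attempt failed because aes(y = meanallletters) maps a whole data frame rather than a column.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
raw <- data.frame(
  Letters = rep(c("A", "B", "C"), 20),
  Years   = rep(1990:1999, 6),
  Numbers = runif(60, min = 0, max = 20)
)

# One mean per letter and year
by_letter <- raw %>%
  group_by(Letters, Years) %>%
  summarise(Letter_Mean = mean(Numbers), .groups = "drop")

# One mean per year across letters; keep Years so it can be mapped to x
overall <- by_letter %>%
  group_by(Years) %>%
  summarise(all_mean = mean(Letter_Mean), .groups = "drop")

p <- ggplot(by_letter, aes(x = Years, y = Letter_Mean, color = Letters)) +
  geom_line() +
  geom_point() +
  # separate layer with its own data; inherit.aes = FALSE drops color = Letters
  geom_line(data = overall, aes(x = Years, y = all_mean),
            inherit.aes = FALSE, color = "black", linetype = "dashed")
```

With this approach the aggregate line does not appear in the colour legend, which is one reason to prefer the bind_rows() approach when a legend entry is wanted.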

Take the sum of all columns and create a frequency plot of top higher frequencies

From a dataframe like this
data.frame(id = c(1,2,3), google = c(1,1,0), amazon = c(1,1,0), yahoo = c(0,0,1))
how is it possible to ignore the first column id and take the sum of all columns and create a frequency plot of top higher frequencies (top 2)?
We can use transmute to get the rowSums of the columns except the 'id', then with ggplot/geom_col, get the bar plot of the top_n (n = 2) elements
library(dplyr)
library(ggplot2)
df1 %>%
transmute(id = factor(id), Sum = select(., -id) %>%
rowSums) %>%
top_n(2) %>%
ggplot(aes(id, Sum)) +
geom_col()
The above ranks by 'id'. To rank by company instead, reshape to long format first:
library(tidyr)
df1 %>%
pivot_longer(cols = -id) %>%
group_by(name) %>%
summarise(Sum = sum(value)) %>%
top_n(2) %>%
ggplot(aes(name, Sum)) +
geom_col()
In the newer version of dplyr, instead of top_n, we can use slice_max
df1 %>%
transmute(id = factor(id), Sum = select(., -id) %>%
rowSums) %>%
slice_max(Sum, n = 2) %>%
ggplot(aes(id, Sum)) +
geom_col()
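The slice_max variant can be applied to the per-company totals in the same way. This is a sketch under the question's sample data. The column name company is an illustrative choice, and note that slice_max keeps ties by default, so it can return more than n rows.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df1 <- data.frame(id = c(1, 2, 3),
                  google = c(1, 1, 0),
                  amazon = c(1, 1, 0),
                  yahoo = c(0, 0, 1))

top2 <- df1 %>%
  pivot_longer(cols = -id, names_to = "company", values_to = "value") %>%
  group_by(company) %>%
  summarise(Sum = sum(value), .groups = "drop") %>%
  slice_max(Sum, n = 2)   # with_ties = TRUE by default

p <- ggplot(top2, aes(company, Sum)) +
  geom_col()
```

Setting with_ties = FALSE in slice_max would force exactly two rows when several companies share the same total.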
Use colSums to get the frequencies by column, sort them in ascending order, and keep the last 2 (the two highest) with tail. Pass the result to barplot. All in one line of code.
barplot(tail(sort(colSums(df1[-1])), 2))
Data
df1 <- data.frame(id = c(1,2,3),
google = c(1,1,0),
amazon = c(1,1,0),
yahoo = c(0,0,1))

How can we wrangle the data to obtain the ratio/proportion chart shown?

The goal is to produce a visualization indicating ratios.
How can we produce such a ratio chart (highlighted) in R?
library(tidyverse)
# Dataset creation
df <- data.frame(cls = c(rep("A",4),rep("B",4)),
grd = c("A1",rep("A2",3),rep(c("B1","B2"), 2)),
typ = c(rep("m",2),rep("o",2),"m","n",rep("p",2)),
pnts = c(rep(1:4,2)))
df
#### Data wrangling
df1 <- df %>%
group_by(cls) %>%
summarise(cls_pct = sum(pnts))
df1
df2 <- df %>%
group_by(cls,grd) %>%
summarize(grd_pct = sum(pnts))
df2
df3 <- df %>%
group_by(cls,grd,typ) %>%
summarise(typ_pct = sum(pnts))
df3
#### Attempt to combine all df1,df2,df3
# but mutate and summarise are mixing up leading to wrong results
df3 %>%
group_by(cls,grd) %>%
mutate(grd_pct = sum(typ_pct)) %>%
group_by(cls) %>%
mutate(cls_pct = sum(grd_pct))
Attempt to visualize all the ratios in 1 chart
data %>%
pivot_longer(cols = -c(cls:pnts),
names_to = "per_cat",
values_to = "percent") %>%
ggplot(aes(cls,percent, col = typ, fill = grd)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw()
plot of the same.
EDIT -- added formula version with more useful output for visualization.
ORIG: At this point it may be worth making a function to reduce copying and pasting, but this may get you what you need:
library(tidyverse)
df %>%
group_by(cls) %>%
mutate(per1 = sum(pnts),
per1_pct = per1 / sum(per1)) %>%
group_by(cls, grd) %>%
mutate(per2 = sum(pnts),
per2_pct = per2 / sum(per2)) %>%
group_by(cls, grd, typ) %>%
mutate(per3 = sum(pnts),
per3_pct = per3 / sum(per3)) %>%
ungroup()
EDIT: Here's a general function to calculate the stats for a given grouping, making it easier to combine a few groupings together in long format better suited for visualization.
df_sum <- function(df, level, ...) {
df %>%
group_by(...) %>%
summarize(grp_ttl = sum(pnts)) %>%
mutate(ttl = sum(grp_ttl),
pct = grp_ttl / ttl) %>%
ungroup() %>%
mutate(level = {{ level }} )
}
df_sum(df, level = 1, cls) %>%
bind_rows(df_sum(df, level = 2, cls, grd)) %>%
bind_rows(df_sum(df, level = 3, cls, grd, typ)) %>%
mutate(label = coalesce(as.character(typ), # This grabs the first non-NA
as.character(grd),
as.character(cls))) -> df_summed
df_summed %>%
ggplot(aes(level, grp_ttl)) +
geom_col(color = "white") +
geom_text(aes(label = paste0(label, "\n", grp_ttl, "/", ttl)),
color = "white",
position = position_stack(vjust = 0.5)) +
scale_x_reverse() + # To make level 1 at the top
coord_flip() # To switch from vertical to horizontal orientation
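The df_sum function relies on summarize dropping the last grouping variable, so that the following sum() runs within the parent group. A minimal sketch of that one step for the cls/grd level, using the question's data, makes this explicit (the name lvl2 is illustrative):

```r
library(dplyr)

df <- data.frame(cls = c(rep("A", 4), rep("B", 4)),
                 grd = c("A1", rep("A2", 3), rep(c("B1", "B2"), 2)),
                 typ = c(rep("m", 2), rep("o", 2), "m", "n", rep("p", 2)),
                 pnts = c(rep(1:4, 2)))

lvl2 <- df %>%
  group_by(cls, grd) %>%
  # "drop_last" leaves the result grouped by cls only
  summarise(grp_ttl = sum(pnts), .groups = "drop_last") %>%
  # so this sum is the total of the parent class, not of the whole table
  mutate(ttl = sum(grp_ttl),
         pct = grp_ttl / ttl) %>%
  ungroup()

lvl2$pct
# 0.1 0.9 0.4 0.6
```

Each grade's share is therefore relative to its own class (A1 is 1 of 10 points in class A), which is what the labels in the stacked chart display.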

How to add labels and percentages to printable R sunburst

I'm trying to add labels and percentages to each layer within a sunburst chart using R - so it looks like this Sunburst.
I can create a sunburst chart (using this guide) but I can't figure out how to add the labels or percentages. I also want to be able to print the chart with all labels and percentages.
Here's my code so far.
# libraries
library(dplyr)
library(treemap)
library(sunburstR)
library(readxl)
library(vcd)
## Load Arthritis as example
Data <- data.frame(Arthritis)
Data <- Data %>% select(-ID) %>%
mutate(Age=ifelse(Age<50,"Young","Old")) %>% group_by(Treatment,Sex,Improved,Age) %>%
summarise(Count=n()) %>%
mutate(Path=paste(Treatment,Sex,Improved,Age,sep="-")) %>%
ungroup() %>%
select(Path,Count)
sunburst(Data)
Any help would be great.
Thanks.
I suggest the ggsunburst package https://github.com/didacs/ggsunburst
library(ggsunburst)
library(dplyr)
library(vcd) # just for the Arthritis dataset
Data <- data.frame(Arthritis)
# compute percentage using tally
# add column leaf, with format "name->attribute:value"
# ggsunburst considers everything after "->" as attributes
# the attribute "size" is used as the size of the arc
df <- Data %>%
mutate(Age=ifelse(Age<50,"Young","Old")) %>%
group_by(Treatment,Sex,Improved,Age) %>%
tally() %>%
mutate(percentage = n/nrow(Data)*100,
size=paste("->size:",round(percentage,2),sep=""),
leaf=paste(Improved,size,sep = "")) %>%
ungroup() %>%
select(Treatment,Sex,Age,leaf)
# sunburst_data reads from a file so you need to create one
write.table(df, file = 'data.csv', row.names = F, col.names = F, sep = ",")
# specify node_attributes = "size" to add labels with percentages in terminal nodes
sb <- sunburst_data('data.csv', type = "lineage", sep = ',', node_attributes = "size")
# compute percentages for internal nodes
tre <- Data %>%
group_by(Treatment) %>%
tally() %>%
mutate(percent=n/nrow(Data)*100,
name=Treatment) %>%
ungroup() %>%
select(name,percent)
sex <- Data %>%
group_by(Treatment,Sex) %>%
tally() %>%
mutate(percent=n/nrow(Data)*100,
name=Sex) %>%
ungroup() %>%
select(name,percent)
age <- Data %>%
mutate(Age=ifelse(Age<50,"Young","Old")) %>%
group_by(Treatment,Sex,Age) %>%
tally() %>%
mutate(percent=n/nrow(Data)*100,
name=Age) %>%
ungroup() %>%
select(name,percent)
x <- rbind(tre, sex, age)
# the rows in x are in the same order as sb$node_labels, cbind works here only because of that
x <- cbind(sb$node_labels, round(x[,"percent"],2))
percent <- x %>% mutate(name_percent = paste(label,percent,"%"))
sunburst(sb, node_labels.min = 0) +
geom_text(data = sb$leaf_labels, aes(x=x, y=0.1, label=paste(size,"%"), angle=angle, hjust=hjust), size = 2) +
geom_text(data = percent, aes(x=x, y=y, label=name_percent, angle=pangle), size=2)

How to combine ggplot and dplyr into a function?

Consider this simple example
library(dplyr)
library(ggplot2)
dataframe <- tibble(id = c(1,2,3,4),
group = c('a','b','c','c'),
value = c(200,400,120,300))
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
Here I want to write a function that takes the dataframe and the grouping variable as input. Ideally, after grouping and aggregating I would like to print a ggplot chart.
This works:
get_charts2 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
> get_charts2(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
Unfortunately, adding ggplot into the function above FAILS
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean, color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
> get_charts1(dataframe, group)
Error in !quo_var : invalid argument type
I don't understand what is wrong here. Any ideas?
Thanks!
EDIT: interesting follow-up here how to create factor variables from quosures in functions using ggplot and dplyr?
At the time of writing (before ggplot2 3.0.0), ggplot did not support tidy eval syntax, so you could not use !! inside aes(). You need to use more traditional standard evaluation calls. You can use aes_q in ggplot to help with this.
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes_q(x = quote(count), y = quote(mean), color = quo_var, group = quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)
ggplot2 v3.0.0 released in July 2018 supports !! (bang bang), !!!, and :=. aes_()/aes_q() and aes_string() are soft-deprecated.
OP's original code should work
library(tidyverse)
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)
Edit: using the tidy evaluation pronoun .data[[ ]] to pick the chosen variable from the data frame also works. In this version the column name is passed as a string:
get_charts2 <- function(data, mygroup){
df_agg <- data %>%
group_by(.data[[mygroup]]) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = .data[[mygroup]], group = .data[[mygroup]])) +
geom_point() +
geom_line()
}
get_charts2(dataframe, "group")
Created on 2018-04-04 by the reprex package (v0.2.0).
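With more recent versions of rlang (0.4.0 or later), the enquo() / !! pair can be written with the embrace operator {{ }}. The following is a sketch under that assumption rather than part of the answers above. agg_by() and chart_by() are hypothetical names, and the aggregation is split out only so the intermediate table can be inspected.

```r
library(dplyr)
library(ggplot2)

dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

# Hypothetical helper: aggregate `value` by an unquoted grouping column
agg_by <- function(data, mygroup) {
  data %>%
    group_by({{ mygroup }}) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n(),
              .groups = "drop")
}

# Hypothetical helper: plot the aggregated table, reusing the same column
chart_by <- function(data, mygroup) {
  agg_by(data, {{ mygroup }}) %>%
    ggplot(aes(x = count, y = mean,
               color = {{ mygroup }}, group = {{ mygroup }})) +
    geom_point() +
    geom_line()
}

df_agg <- agg_by(dataframe, group)
p <- chart_by(dataframe, group)
```

As with the !! version, the column is passed unquoted (group, not "group"). The .data[[ ]] form remains the one to use when the name arrives as a string.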

Resources