Plot the sum of all but one variable - r

I have a dataframe with data on the number of TVs and radios owned by survey respondents in three different countries (Canada, Mexico, US) at two different points in time (now and before):
DF <- data.frame(TV_now = as.numeric(c(4, 9, 1, 0, 4, NA)),
TV_before = as.numeric(c(4, 1, 2, 4, 5, 2)),
Radio_now = as.numeric(c(4, 5, 1, 5, 6, 9)),
Radio_before = as.numeric(c(6, 5, 3, 6, 7, 10)),
Country = as.factor(c("Mexico", "Canada", "US", "US", "Canada", "US")))
I want to sum the total value of each variable and then create a barplot that shows the number of TVs and radios owned by survey respondents now and before per country.
Now, if my dataframe didn't contain the Country factor, I could generate the plot in this way:
library(tidyverse)
library(ggplot2)
DF %>% mutate_all(funs(sum), na.rm = TRUE) %>%
gather(key=Device, value=Number) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity")
However, the variation
DF %>% mutate_all(funs(sum), na.rm = TRUE) %>%
gather(key=Device, value=Number, -Country) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity") +
facet_wrap(~Country)
results in the error:
Error in mutate_impl(.data, dots) :
Evaluation error: ‘sum’ not meaningful for factors.
Is there a way to exclude the factor from sum, or another way to generate the intended plot?

You can use the summarise function to sum up the different columns. Below I have summed up the numeric columns using dplyr's summarise_if() function.
DF <- data.frame(TV_now = as.numeric(c(4, 9, 1, 0, 4, NA)),
TV_before = as.numeric(c(4, 1, 2, 4, 5, 2)),
Radio_now = as.numeric(c(4, 5, 1, 5, 6, 9)),
Radio_before = as.numeric(c(6, 5, 3, 6, 7, 10)),
Country = as.factor(c("Mexico", "Canada", "US", "US", "Canada", "US")))
DF %>%
group_by(Country) %>%
summarise_if(is.numeric,sum,na.rm=TRUE) %>%
gather(key=Device, value=Number, -Country) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number),position = "dodge", stat = "identity") +
facet_wrap(~Country)
The result is:

Related

Get row columns by group for geom_col in ggplot

I am trying to calculate row percentages by demographics of various score levels--in my data, that would be what % of white people (or % of black people, or % male, or % who have education level 2, and so on) have a score of 0 (or 1, 2, or 3)--and then use that to create a big plot.
So in my example data below, 8.33% of race == 1 (which is white) have a score of 0, 25% have a score of 1, 25% have a score of 2, and 41.67% have a score of 3.
Then the ultimate end goal would be to get some type of bar plot where the 4 levels of 'score' are across the x axis, and the various comparisons of demographics run down the y axis. Something that looks visually sort of like this, but with the levels of 'score' across the top instead of education levels: .
I already have some code to make the actual figure, which I've done in other instances but with externally/already-calculated percentages:
ggplot(data, aes(x = percent, y = category, fill = group)) +
geom_col(orientation = "y", width = .9) +
facet_grid(group~score_var,
scales = "free_y", space = "free_y") +
labs(title = "Demographic breakdown of 'Score'") +
theme_bw()
I am just struggling to figure out the best way to calculate these row percentages, presumably using group_by() and summarize and then storing or configuring them in a way that they can be plotted. Thank you.
d <- structure(list(race = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1,
1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2,
3, 3), gender = c(0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1,
0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1
), education = c(1, 3, 3, 2, 1, 3, 2, 3, 4, 4, 2, 3, 3, 2, 3,
4, 1, 3, 1, 3, 3, 2, 1, 3, 2, 3, 4, 4, 2, 3, 3, 2, 3, 4, 1, 3
), score = c(1, 2, 2, 1, 2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 0, 0,
3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 2, 3, 1, 3, 3, 0, 1, 2, 2, 0)), row.names = c(NA,
-36L), spec = structure(list(cols = list(race = structure(list(), class = c("collector_double",
"collector")), gender = structure(list(), class = c("collector_double",
"collector")), education = structure(list(), class = c("collector_double",
"collector")), score = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001bd978b0df0>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
This may get you started:
library(dplyr)
library(ggplot2)
prop <- data %>%
mutate(race = factor(race, levels = c(1, 2, 3), labels = c("White", "Black", "Others"))) %>%
group_by(race) %>%
mutate(race_n = n()) %>%
group_by(race, score) %>%
summarise(percent = round(100*n()/race_n[1], 1))
prop %>%
ggplot(aes(x = percent, y = score, fill = race)) +
geom_col(orientation = "y", width = .9) +
geom_text(aes(label = percent), hjust = 1)+
facet_grid(~race) +
labs(title = "Demographic breakdown of 'Score'") +
theme_bw()
Edit
To put all characters together, you can get a bigger graph:
df <- data %>% mutate(
gender = factor(2-gender),
race = factor(race),
education = factor(education)) %>%
pivot_longer(!score, names_to = "character", values_to = "levels")
df %>% group_by(character, levels) %>%
mutate(group_n = n()) %>%
group_by(character, levels, score) %>%
summarise(percent = round(100*n()/group_n[1], 1)) %>%
ggplot(aes(x = percent, y = score, fill = character)) +
geom_col(orientation = "y", width = .9) +
geom_text(aes(label = percent), hjust = 1)+
facet_grid(character ~ levels) +
labs(title = "Demographic breakdown of 'Score'") +
theme_bw()
please note: I have changed the code for gender.
Taking inspiration from #Zhiqiang Wang's excellent first pass at this, I finally figured out a solution. I still need to change the order of the labels (to put the education levels in order, and move the race variables to the top of the figure) but this is basically what I was envisioning.
d_test <- d %>% mutate(
gender = factor(2-gender),
race = factor(race),
education = factor(education)) %>%
pivot_longer(!score, names_to = "group", values_to = "levels")
d_test <- d_test %>% group_by(group, levels) %>%
mutate(group_n = n()) %>%
group_by(group, levels, score) %>%
summarise(percent = round(100*n()/group_n[1], 1))
d_test <- d_test %>%
mutate(var = case_when(group == "gender" & levels == 1 ~ "female",
group == "gender" & levels == 2 ~ "male",
group == "race" & levels == 1 ~ "white",
group == "race" & levels == 2 ~ "black",
group == "race" & levels == 3 ~ "hispanic",
group == "education" & levels == 1 ~ "dropout HS",
group == "education" & levels == 2 ~ "grad HS",
group == "education" & levels == 3 ~ "some coll",
group == "education" & levels == 4 ~ "grad coll"))
ggplot(d_test, aes(x = percent, y = var, fill = group)) +
geom_col(orientation = "y", width = .9) +
facet_grid(group ~ score,
scales = "free_y", space = "free_y") +
labs(title = "Demographic breakdown of 'Score'",
y = "",
x = "Percent") +
theme_minimal() +
theme(legend.position = "none",
strip.text.y = element_blank())

Alluvial diagram with varying aesthetic / colors over one flow in R based on ggplot2

I have created the following alluvial diagram in R as follows:
df <- data.frame(Variable = c("X1", "X2", "X3", "X4", "X5", "X6"),
Pearson1 = c(6, 3, 2, 5, 4, 1),
Spearman1 = c(6, 5, 1, 2, 3, 4),
Kendall1 = c(6, 5, 1, 2, 3, 4),
Pearson2 = c(6, 5, 1, 2, 3, 4),
Spearman2 = c(6, 5, 1, 2, 4, 3),
Kendall2 = c(6, 5, 1, 2, 3, 4))
df$freq<-1
alluvial(df[1:7], freq=df$freq, cex = 0.7,col= "red")
which results in
How can I set some specific lines to have different col than red? e.g. X1 from Variables to Pearson1, and then again from Kendall1 to Spearman2 and X3 in all states? I see I can't do that based on alluvial(). How can I recreate the above alluvial based on another function??
ggalluvial allows for varying aesthetics over one "flow" (or alluvium). The documentation provides a trick to use geom_flow with stat = "alluvium" and to specify "lode.guidance = "frontback".
The actual aesthetic (color) will need to be added to the data. geom_flow and geom_stratum will require different columns for the aesthetic, (try what happens when you use the same for both). I am passing the color directly and using scale_identity, but you can of course also use random values and then define your colors with scale_manual.
library(ggalluvial)
#> Loading required package: ggplot2
library(tidyverse)
df <- data.frame(Variable = c("X1", "X2", "X3", "X4", "X5", "X6"),
Pearson1 = c(6, 3, 2, 5, 4, 1),
Spearman1 = c(6, 5, 1, 2, 3, 4),
Kendall1 = c(6, 5, 1, 2, 3, 4),
Pearson2 = c(6, 5, 1, 2, 3, 4),
Spearman2 = c(6, 5, 1, 2, 4, 3),
Kendall2 = c(6, 5, 1, 2, 3, 4))
df_long <-
df %>%
## reshape your data in order to bring it to the right shape
mutate(across(everything(), as.character)) %>%
rownames_to_column("ID") %>%
pivot_longer(-ID) %>%
## correct order of your x
mutate(
name = factor(name, levels = names(df)),
## now hard code where you want to change the color.
## lodes need a different highlighting then your strata
## there are of course many ways to add this information, I am using case_when here
## you could also create separate vectors and add them to your data frame
highlight_lode = case_when(
ID == 3 ~ "blue",
ID == 1 & name %in% c("Variable", "Kendall1", "Pearson2") ~ "orange",
TRUE ~ "red"
),
highlight_stratum = case_when(
ID == 3 ~ "blue",
ID == 1 & name %in% c(
"Variable", "Pearson1", "Kendall1", "Pearson2",
"Spearman2"
) ~ "orange",
TRUE ~ "red"
)
)
ggplot(df_long,
## now use different color aesthetics in geom_flow and geom_stratum
aes(x = name, stratum = value, alluvium = ID, label = value)) +
## I took this trick with lode guidance from the documentation - this allows varying aesthetics over one flow.
geom_flow(aes(fill = highlight_lode), stat = "alluvium", lode.guidance = "frontback", color = "darkgray") +
geom_stratum(aes(fill = highlight_stratum)) +
geom_text(stat = "stratum") +
## as I have named the colors directly, it is appropriate to use scale_identity
scale_fill_identity()
#> Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Created on 2023-01-29 with reprex v2.0.2

R: Creating new column to represent hi/mid/low bins by mean and standard deviation

I've got a batch of survey data that I'd like to be able to subset on a few specific columns which have 0-10 scale data (e.g. Rank your attitude towards x as 0 to 10) so that I can plot using using ggplot() + facet_grid. Faceting will be using 3 hi/med/low bins calculated as +1 / -1 standard deviation above the mean. I have working code, which splits the overall dataframe into 3 parts like so:
# Generate sample data:
structure(list(Q4 = c(2, 3, 3, 5, 4, 3), Q5 = c(1, 3, 3, 3, 2,
2), Q6 = c(4, 3, 3, 3, 4, 4), Q7 = c(4, 2, 3, 5, 5, 5), Q53_1 = c(5,
8, 4, 5, 4, 5)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
# Aquire Q53_1 data as factors
political_scale <- factor(climate_experience_data$Q53_1, levels = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
# Generate thresholds based on mean and standard deviation thresholds
low_threshold <- round(mean(as.numeric(political_scale, na.rm = T)) - sd(as.numeric(political_scale)), digits = 0)
high_threshold <- round(mean(as.numeric(political_scale, na.rm = T)) + sd(as.numeric(political_scale)), digits = 0)
# Generate low/med/high bins based on Mean and SD
political_lr_low <- filter(climate_experience_data, Q53_1 <= low_threshold)
political_lr_mid <- filter(climate_experience_data, Q53_1 < high_threshold & Q53_1 > low_threshold)
political_lr_high <- filter(climate_experience_data, Q53_1 >= high_threshold)
What I've realised is that this approach really doesn't lend itself to faceting. What I suspect is that I need to use a combination of mutate() across() where() and group_by() to add data to a new column Q53_scale with "hi" "med" "low" based on where Q53_1 values fall in relation to those low/high thresholds (e.g. SD +1 over mean and -1 under mean). My first few dozen attempts have fallen short - has anyone managed to use sd() to bin data for faceting in this way?
library(tidyverse)
climate_experience_data <- structure(list(Q4 = c(2, 3, 3, 5, 4, 3), Q5 = c(
1, 3, 3, 3, 2,
2
), Q6 = c(4, 3, 3, 3, 4, 4), Q7 = c(4, 2, 3, 5, 5, 5), Q53_1 = c(
5,
8, 4, 5, 4, 5
)), row.names = c(NA, -6L), class = c(
"tbl_df",
"tbl", "data.frame"
))
climate_experience_data %>%
mutate(
bin = case_when(
Q53_1 > mean(Q53_1) + sd(Q53_1) ~ "high",
Q53_1 < mean(Q53_1) - sd(Q53_1) ~ "low",
TRUE ~ "medium"
) %>% factor(levels = c("low", "medium", "high"))
) %>%
ggplot(aes(Q4, Q5)) +
geom_point() +
facet_grid(~bin)
Created on 2022-03-10 by the reprex package (v2.0.0)

How to define score of popularity for list of tags in R?

There is a dataset where each object has a list of tags of categories comma separated. I would like to have aggregated categories score per object based on categories' popularities. I can define the sum, min, and max of popularities but it's not clear to me how an aggregated score can be calculated.
library(tidyverse)
library(tibble)
library(stringr)
# 1. Data
df <- tribble(
~object, ~category,
1, "Software, Model, Cloud",
2, "Model",
3, "Cloud, Software",
4, "Train, Test, Model",
5, "Test, Model"
)
# 2. List of categories
list_category <- trimws(unlist(str_split(df$category, ",")))
# 3. Categories popularity
data.frame(category = list_category) %>%
group_by(category) %>%
summarise(n_count = n()) %>%
arrange(-n_count) %>%
ungroup()
# 4. Outcome with undefined 'score_category' feature that I'd like to know how to score
tribble(
~object, ~sum_category, ~min_category, ~max_category, ~score_category,
1, sum(c(2, 4, 2)), min(c(2, 4, 2)), max(c(2, 4, 2)), NA,
2, sum(c(4)), min(c(4)), max(c(4)), NA,
3, sum(c(2, 2)), min(c(2, 2)), max(c(2, 2)), NA,
4, sum(c(1, 2, 4)), min(c(1, 2, 4)), max(c(1, 2, 4)), NA,
5, sum(c(2, 4)), min(c(2, 4)), max(c(2, 4)), NA
)
Any ideas and code are welcome!

How can I calculate mean and sd by group and format as dataframe?

My data is currently in the format of df1:
outcome <- c("success", "failure", "success", "failure", "success", "failure")
basketball <- c(10, 7, 7, 8, 9, 10)
soccer <- c(8, 21, 30, 21, 6, 10)
football <- c(9, 2, 1, 3, 1, 5)
df1 <- data.frame(outcome, basketball, soccer, football)
And I would like it to be in the format of df2, so I can more easily create a bar graph with ggplot2.
symptom <- c("basketball", "basketball", "soccer", "soccer", "football", "football")
mean <- c(10, 6, 9, 7, 3, 1)
sd <- c(1, 2, 1, 3, 0.5, 0.2)
df2 <- data.frame(outcome, symptom, mean, sd)
Currently I have a lot of code that can get me there in a roundabout way, but I feel like there must be a streamlined way to do this in a few lines of code. Is there a way to use this using dplyr or tidyr verbs?
Thanks!
We can reshape to 'long' format with pivot_longer and then do a group by operation
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = basketball:football, names_to = 'symptom') %>%
group_by(outcome, symptom) %>%
summarise(mean = mean(value), sd = sd(value), .groups = 'drop')
If we also need to plot
library(ggplot2)
df1 %>%
pivot_longer(cols = basketball:football, names_to = 'symptom') %>%
group_by(outcome, symptom) %>%
summarise(mean = mean(value), sd = sd(value), .groups = 'drop') %>%
ggplot(aes(x = outcome, y = mean, fill = symptom)) +
geom_bar(position = position_dodge(), stat = 'identity') +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
width = .2, position = position_dodge(.9))

Resources