Order bars by difference between variables - r

My intention is to plot a bar chart with two variables visible,
"HH_FIN_EX" and "ACT_IND_CON_EXP", but with the bars ordered by the variable diff, in ascending order. diff itself should not be included in the chart.
library(eurostat)
library(tidyverse)
#getting the data
data1 <- get_eurostat("nama_10_gdp",time_format = "num")
#filtering
data_1_4 <- data1 %>%
filter(time=="2016",
na_item %in% c("B1GQ", "P31_S14_S15", "P41"),
geo %in% c("BE","BG","CZ","DK","DE","EE","IE","EL","ES","FR","HR","IT","CY","LV","LT","LU","HU","MT","NL","AT","PL","PT","RO","SI","SK","FI","SE","UK"),
unit=="CP_MEUR")%>% select(-unit, -time)
#transformations and calculations
data_1_4 <- data_1_4 %>%
spread(na_item, values)%>%
na.omit() %>%
mutate(HH_FIN_EX = P31_S14_S15/B1GQ, ACT_IND_CON_EXP=P41/B1GQ, diff=ACT_IND_CON_EXP-HH_FIN_EX) %>%
gather(na_item, values, 2:7)%>%
filter(na_item %in% c("HH_FIN_EX", "ACT_IND_CON_EXP", "diff"))
#plotting
ggplot(data=data_1_4, aes(x=reorder(geo, values), y=values, fill=na_item))+
geom_bar(stat="identity", position=position_dodge(), colour="black")+
labs(title="", x="Countries", y="As percentage of GDP")
I'd appreciate any suggestions on how to do this, as aes(x=reorder(geo, values[values=="diff"])) results in an error.

First of all, you shouldn't include diff (your result column) when using gather; it complicates things.
Change the line gather(na_item, values, 2:7) to gather(na_item, values, 2:6), so that diff stays in its own column.
You can then use this code to calculate the difference and order the rows (using dplyr::arrange) in descending order:
plotData <- data_1_4 %>%
spread(na_item, values) %>%
na.omit() %>%
mutate(HH_FIN_EX = P31_S14_S15 / B1GQ,
ACT_IND_CON_EXP = P41 / B1GQ,
diff = ACT_IND_CON_EXP - HH_FIN_EX) %>%
gather(na_item, values, 2:6) %>%
filter(na_item %in% c("HH_FIN_EX", "ACT_IND_CON_EXP")) %>%
arrange(desc(diff))
And plot it with:
ggplot(plotData, aes(geo, values, fill = na_item))+
geom_bar(stat = "identity", position = "dodge", color = "black") +
labs(x = "Countries",
y = "As percentage of GDP") +
scale_x_discrete(limits = unique(plotData$geo))
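As an alternative sketch of the same idea, the ordering can be attached to geo while the data is still wide, before reshaping. The toy data frame below is made up for illustration (the country codes and shares are not Eurostat values), and it uses pivot_longer rather than gather; treat it as one possible approach under those assumptions.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data standing in for the Eurostat shares
toy <- tibble(
  geo             = c("AA", "BB", "CC"),
  HH_FIN_EX       = c(0.50, 0.60, 0.55),
  ACT_IND_CON_EXP = c(0.70, 0.65, 0.80)
)

toy_long <- toy %>%
  mutate(diff = ACT_IND_CON_EXP - HH_FIN_EX,
         geo  = reorder(geo, diff)) %>%   # factor levels sorted by ascending diff
  select(-diff) %>%                       # diff is only needed for the ordering
  pivot_longer(c(HH_FIN_EX, ACT_IND_CON_EXP),
               names_to = "na_item", values_to = "values")

levels(toy_long$geo)

p <- ggplot(toy_long, aes(geo, values, fill = na_item)) +
  geom_col(position = "dodge", colour = "black") +
  labs(x = "Countries", y = "As percentage of GDP")
```

Because the order lives in the factor levels of geo, no scale_x_discrete(limits = ...) call is needed; use reorder(geo, -diff) for descending order.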

You can explicitly figure out the order that you want -- this is stored in country_order below -- and force the factor geo to have its levels in that order. Then run ggplot after filtering out the diff variable. So replace your call to ggplot with the following:
country_order = (data_1_4 %>% filter(na_item == 'diff') %>% arrange(values))$geo
data_1_4$geo = factor(data_1_4$geo, country_order)
ggplot(data=filter(data_1_4, na_item != 'diff'), aes(x=geo, y=values, fill=na_item))+
geom_bar(stat="identity", position=position_dodge(), colour="black")+
labs(title="", x="Countries", y="As percentage of GDP")
Doing this, I get the plot below:

Is this what you are looking for?
data_1_4 %>% mutate(Val = fct_reorder(geo, values, .desc = TRUE)) %>%
filter(na_item %in% c("HH_FIN_EX", "ACT_IND_CON_EXP")) %>%
ggplot(aes(x=Val, y=values, fill=na_item)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
labs(title="", x="Countries", y="As percentage of GDP")

Related

Difference between geom_col() and geom_point() for same value

So, I'm trying to plot missing values here over time (longitudinal data).
I would prefer placing them in a geom_col() to fill up with colours of certain treatments afterwards. But for some weird reason, geom_col() gives me weird values, while geom_point() gives me the correct values using the same function. I'm trying to wrap my head around why this is happening. Take a look at the y-axis.
Disclaimer:
I know the missing values disappear on day 19-20. This is why I'm making the plot.
Sorry about the layout of the plot. Not polished yet.
For the geom_point:
gaussian_transformed %>%
  group_by(factor(time)) %>%
  mutate(missing = sum(is.na(Rose_width))) %>%
  ggplot(aes(x = factor(time), y = missing)) +
  geom_point()
Picture: geom_point
For the geom_col:
gaussian_transformed %>%
  group_by(factor(time)) %>%
  mutate(missing = sum(is.na(Rose_width))) %>%
  ggplot(aes(x = factor(time), y = missing)) +
  geom_col()
Picture: geom_col
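The problem is that mutate keeps every row of each group, so the group total is repeated on each of those rows. You cannot see this in the geom_point plot because the identical points overlap exactly, but geom_col draws one bar segment per row and stacks them, so the repeated values add up.
To get a single row per group, either use summarise instead of mutate, or keep mutate and add distinct.
Compare: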
library(tidyverse)
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_point()
The points look ugly because there is a lot of overplotting.
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
distinct(order, .keep_all = TRUE) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
Created on 2021-06-02 by the reprex package (v2.0.0)
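The difference can also be checked without plotting. The sketch below reuses the msleep example and compares the row counts after mutate and after summarise; the object names (with_mutate, with_summarise, stacked) are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # only needed here for the msleep data set

# mutate() keeps one row per animal and repeats the group total on each row
with_mutate <- msleep %>%
  group_by(order) %>%
  mutate(missing = sum(is.na(sleep_cycle))) %>%
  ungroup()

# summarise() collapses each group to a single row
with_summarise <- msleep %>%
  group_by(order) %>%
  summarise(missing = sum(is.na(sleep_cycle)))

nrow(with_mutate)     # same as nrow(msleep)
nrow(with_summarise)  # one row per order

# Height geom_col() ends up with when fed the mutate() version:
# the repeated total summed over all rows of the group
stacked <- with_mutate %>%
  group_by(order) %>%
  summarise(bar_height = sum(missing), n_rows = n())
```

Each bar_height equals the true count multiplied by the number of rows in the group, which is why the y-axis of the geom_col plot looks inflated.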
So after some digging:
What happened was that geom_col() sums up all the missing values, while geom_point() does not; hence the large values for y. Why this happens, I do not know. However, the following worked fine for me:
gaussian_transformed$time <- as.factor(gaussian_transformed$time)
gaussian_transformed <- gaussian_transformed %>%
  group_by(time) %>%
  summarise(missing = sum(is.na(Rose_width)))
gaussian_transformed %>%
  ggplot(aes(x = time, y = missing)) +
  geom_col(fill = "blue", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Missing values in Gaussian Outcome over the days",
       x = "Time (in days)",
       y = "Amount of missing values") +
  scale_y_continuous(breaks = seq(0, 10, 1))
With the plot: GaussianMissing

Plot variable with column chart with ggplot with data from read.csv2

I have the heights of males and females in my data, grouped into 10 cm intervals. I want to plot them together, side by side.
My graph looks somewhat like what I want, but the x-axis says factor(Male). It should be the height in cm.
Also, I get three bars, but there should be two: one for male and one for female.
# Library
library(ggplot2)
library(tidyverse) # function "%>%"
# 1. Define data
data = read.csv2(text = "Height;Male;Female
160-170;5;2
170-180;5;5
180-190;6;5
190-200;2;2")
# 2. Print table
df <- as.data.frame(data)
df
# 3. Plot Variable with column chart
ggplot(df, aes(factor(Male),
fill = factor(Male))) +
geom_bar(position = position_dodge(preserve = "single")) +
theme_classic()
Use pivot_longer to bring the data into long format.
Then use geom_bar with fill:
library(tidyverse)
df1 <- df %>% pivot_longer(
cols = c(Male, Female),
names_to = "Gender",
values_to = "N"
)
# 3. Plot Variable with column chart
ggplot(df1, aes(x=Height, y=N)) +
geom_bar(aes(fill = Gender), position = "dodge", stat="identity") +
theme_classic()
One solution would be:
df %>%
pivot_longer(cols = 2:3, names_to = "gender", values_to = "count") %>%
ggplot(aes(x = Height, y = count, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
theme_classic()
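To see why the reshape fixes both problems, it can help to look at the long table itself. The sketch below rebuilds the data from the question and prints the result of pivot_longer; the name df_long is chosen here for illustration.

```r
library(tidyr)

df <- read.csv2(text = "Height;Male;Female
160-170;5;2
170-180;5;5
180-190;6;5
190-200;2;2")

# One row per height interval and gender, with the count in N
df_long <- pivot_longer(df,
                        cols = c(Male, Female),
                        names_to = "Gender",
                        values_to = "N")
print(df_long)
```

In this shape Height can be mapped to x, N to y, and Gender to fill, which gives two dodged bars per height interval.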

Plot number of people (observations) in a data set?

I want to use a barplot to display the number of male and female participants for a program in the month of July. However, I keep getting a percentage instead. I know there's something wrong with my code but I'm not sure what.
july_all%>%
filter(month == 0)%>%
group_by(sex)%>%
summarize(id=round(100*n()/nrow(july_all)))%>%
ggplot(aes(x =sex,y =id)) +
geom_bar(stat ="identity") +
labs(y ="Number of participants")
You haven't provided any data, but I'm guessing the following is a reasonable approximation:
set.seed(1)
july_all <- data.frame(sex = sample(c("Male", "Female"), 500, TRUE), month = 0)
Now, running your code, we get:
july_all %>%
filter(month == 0) %>%
group_by(sex) %>%
summarize(id = round(100*n()/nrow(july_all))) %>%
ggplot(aes(x =sex, y = id)) +
geom_bar(stat ="identity") +
labs(y ="Number of participants")
Which shows percentages. Why? Because that's what you have calculated with summarize(id = round(100*n()/nrow(july_all))). Here, id is the percentage of each sex present in the data.
If you want raw counts, your code can be much simpler, since geom_bar plots counts by default, so you need only do:
july_all %>%
filter(month == 0) %>%
ggplot(aes(x = sex)) +
geom_bar() +
labs(y ="Number of participants")
And you'll see we have the raw counts plotted.
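If the counts are also needed as numbers, for example for labels or a table, a possible sketch is to compute them with dplyr::count and plot them with geom_col. It uses the same simulated july_all as above, which is only an approximation of the real data; the column name participants is made up here.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the real data, as in the answer above
set.seed(1)
july_all <- data.frame(sex = sample(c("Male", "Female"), 500, TRUE), month = 0)

# One row per sex with the raw number of participants
counts <- july_all %>%
  filter(month == 0) %>%
  count(sex, name = "participants")
print(counts)

p <- ggplot(counts, aes(x = sex, y = participants)) +
  geom_col() +
  labs(y = "Number of participants")
```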

Ordering geom_bar() without a y defined variable

Is there a way to order the bars in geom_bar() when y is just the count of x?
Example:
ggplot(dat) +
geom_bar(aes(x = feature_1))
I tried using reorder() but it requires a defined y variable within aes().
Made up data:
dfexmpl <- data.frame(stringsAsFactors = FALSE,
group = c("a","a","a","a","a","a",
"a","a","a","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b","b"))
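Plot code; reorder does the work of arranging by count. Note the use of summarise rather than mutate, so that there is one row per group and stat = "identity" does not stack repeated counts:
dfexmpl %>%
group_by(group) %>%
summarise(count = n()) %>%
ggplot(aes(x = reorder(group, -count), y = count)) +
geom_bar(stat = "identity")
This results in a bar chart with the bars sorted from the largest to the smallest count.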
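Another option that avoids defining y at all is to order the factor levels by frequency before plotting. The sketch below uses forcats::fct_infreq on the same made-up data, written more compactly with rep.

```r
library(ggplot2)
library(forcats)

# Same made-up data as above: 9 "a" values and 15 "b" values
dfexmpl <- data.frame(group = c(rep("a", 9), rep("b", 15)))

# fct_infreq() orders the levels by how often each value occurs,
# most frequent first
dfexmpl$group <- fct_infreq(dfexmpl$group)
levels(dfexmpl$group)

# geom_bar() still does the counting; the bars follow the level order
p <- ggplot(dfexmpl, aes(x = group)) +
  geom_bar()
```

Wrapping the factor in fct_rev would give ascending order instead.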

Reorder vertical axis alphabetically and change position of binary variable of stacked percent bar graph (ggplot2)

I have a dataset with two variables: 1) ID, 2) Infection Status (Binary:1/0).
I would like to use ggplot2 to
Create a stacked percentage bar graph with the various IDs on the vertical axis (arranged alphabetically, with A at the top) and the percent on the horizontal axis. I can't seem to find code that will sort the IDs alphabetically automatically; my original dataset has quite a number of categories, so arranging them manually would be difficult.
I also hope to have the infected category (1) to be red and towards the left of the blue non-infected category (0). Is it also possible to change the sub-heading of the legend box from "Non_infected" to "Non-infected"?
I hope that the displayed ID in the plot will include the count of the number of times the ID appeared in the dataset. E.g. "A (n=6)", "B (n=3)"
My sample code is as follows:
ID <- c("A","A","A","A","A","A",
"B","B","B",
"C","C","C","C","C","C","C",
"D","D","D","D","D","D","D","D","D")
Infection <- sample(c(1, 0), size = length(ID), replace = T)
df <- data.frame(ID, Infection)
library(ggplot2)
library(dplyr)
library(reshape2)
df.plot <- df %>%
group_by(ID) %>%
summarize(Infected = sum(Infection)/n(),
Non_Infected = 1-Infected)
df.plot %>%
melt() %>%
ggplot(aes(x = ID, y = value, fill = variable)) + geom_bar(stat = "identity", position = "stack") +
xlab("ID") +
ylab("Percent Infection") +
scale_fill_discrete(guide = guide_legend(title = "Infection Status")) +
coord_flip()
Right now I managed to get this output:
I hope to get this:
Thank you so much!
First, we need to add a count to your original data.frame.
df.plot <- df %>%
group_by(ID) %>%
summarize(Infected = sum(Infection)/n(),
Non_Infected = 1-Infected,
count = n())
Then, we augment our ID column, turn the Infection Status into a factor variable, use forcats::fct_rev to reverse the ID ordering, and use scale_fill_manual to control your legend.
df.plot %>%
mutate(ID = paste0(ID, " (n=", count, ")")) %>%
select(-count) %>%
melt() %>%
mutate(variable = factor(variable, levels = c("Non_Infected", "Infected"))) %>%
ggplot(aes(x = forcats::fct_rev(ID), y = value, fill = variable)) +
geom_bar(stat = "identity", position = "stack") +
xlab("ID") +
ylab("Percent Infection") +
scale_fill_manual("Infection Status",
values = c("Infected" = "#F8766D", "Non_Infected" = "#00BFC4"),
labels = c("Non-Infected", "Infected"))+
coord_flip()
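A possible variation that skips the summary table and reshape2 is to let geom_bar compute the proportions from the raw rows with position = "fill". This is only a sketch: it assumes ggplot2 3.3.0 or later (for mapping the categories to y directly), and the seed, the column names n_id, label and status, and the colour names are choices made here for illustration.

```r
library(dplyr)
library(ggplot2)
library(forcats)

set.seed(42)  # arbitrary seed so the simulated data is reproducible
ID <- c(rep("A", 6), rep("B", 3), rep("C", 7), rep("D", 9))
df <- data.frame(ID, Infection = sample(c(1, 0), size = length(ID), replace = TRUE))

df_lab <- df %>%
  add_count(ID, name = "n_id") %>%                      # rows per ID
  mutate(label  = paste0(ID, " (n=", n_id, ")"),       # e.g. "A (n=6)"
         status = factor(Infection, levels = c(0, 1),
                         labels = c("Non-infected", "Infected")))

# fct_rev() puts A at the top; position = "fill" scales each bar to 1
p <- ggplot(df_lab, aes(y = fct_rev(label), fill = status)) +
  geom_bar(position = "fill") +
  scale_fill_manual("Infection Status",
                    values = c("Infected" = "red", "Non-infected" = "blue")) +
  labs(x = "Percent Infection", y = "ID")
```

Since "Infected" is the last factor level, its segment is stacked first and therefore sits on the left of each bar.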

Resources