R ggplot geom_area data perhaps unintentionally overlapping

I am working on the Tidy Tuesday data this week and ran into geom_area doing what I think is overlapping the data. If I facet_wrap the data, no values are missing in any year, but as soon as I make an area plot and fill it by category, the healthcare/education data seems to disappear.
Below are example plots of what I mean.
library(tidyverse)
chain_investment <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-10/chain_investment.csv')
plottable_investment <- chain_investment %>%
filter(group_num == c(12,17)) %>%
mutate(small_cat = case_when(
group_num == 12 ~ "Transportation",
group_num == 17 ~ "Education/Health"
)) %>%
group_by(small_cat, year, category) %>%
summarise(sum(gross_inv_chain)) %>%
ungroup %>%
rename(gross_inv_chain = 4)
# This plot shows that there is NO missing education, health, or highway data
# Goal is to combine the data on one plot and fill based on the category
plottable_investment %>%
ggplot(aes(year, gross_inv_chain)) +
geom_area() +
facet_wrap(~category)
# Some of the data in the health category gets lost? disappears? unknown
plottable_investment %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_area()
# Something is going wrong here?
plottable_investment %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_area(position = "identity")
# The data is definitely there
plottable_investment %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain)) +
geom_area() +
facet_wrap(~category)

The issue is that you filtered your data using == instead of %in%.
Because == recycles c(12, 17) along the column, the first row is compared with 12, the second with 17, and so on. In your case this has the subtle side effect that for some categories (e.g. Health) the filtered data contains observations only for even years, while for others (e.g. Education) it contains observations only for odd years. As a result you end up with "two" area charts that overlap each other.
This is easy to see by switching to geom_col, which gives you what looks like a "dodged" bar plot, because there is only one category per year.
plottable_investment %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_col()
Using %in% instead gives the desired stacked area chart with all observations per category:
plottable_investment1 <- chain_investment %>%
filter(group_num %in% c(12,17)) %>%
mutate(small_cat = case_when(
group_num == 12 ~ "Transportation",
group_num == 17 ~ "Education/Health"
)) %>%
group_by(small_cat, year, category) %>%
summarise(gross_inv_chain = sum(gross_inv_chain)) %>%
ungroup()
#> `summarise()` has grouped output by 'small_cat', 'year'. You can override using the `.groups` argument.
plottable_investment1 %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_area()
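The recycling behind the == pitfall can be reproduced on a toy vector. This is a minimal sketch with made-up values (the vector group_num below is hypothetical, not the Tidy Tuesday column):

```r
# Hypothetical group numbers, not taken from chain_investment
group_num <- c(12, 17, 12, 17, 17, 12)

# `==` recycles c(12, 17) along the vector, so element i is compared
# with 12 when i is odd and with 17 when i is even
eq_match <- group_num == c(12, 17)

# `%in%` tests every element against the whole set of values
in_match <- group_num %in% c(12, 17)

print(eq_match)
print(in_match)
```

With ==, a row is kept only when its position happens to line up with the recycled value, which is consistent with the alternating-year pattern described above. With %in%, every row whose value is in the set is kept.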

Related

RStudio: Using filter, group_by, and summarise as input for ggplot2

I'm trying to create a column chart showing the number of rides by gender, for only non-members. It would have two columns: one showing total number of rides for non-member males, and one showing total number of rides for non-member females.
This code works fine, but it includes both members and non-members:
FullData %>%
group_by(CustomerType, Gender) %>%
summarise(number_of_rides = n()) %>%
arrange(CustomerType, Gender) %>%
ggplot(aes(x = Gender, y = number_of_rides, fill = CustomerType)) +
geom_col(position = "dodge")
Below is my latest attempt at creating the chart to only show non-members, but it doesn't work - the plot pane sets up with just axis labels of Gender and number_of_rides, the rest is a gray empty plot area where the columns should be.
Plus, I get this message: summarise() has grouped output by 'CustomerType'. You can override using the .groups argument.
FullData %>%
dplyr::filter(CustomerType == 'non-member') %>%
group_by(CustomerType, Gender) %>%
summarise(number_of_rides = n()) %>%
arrange(CustomerType, Gender) %>%
ggplot(aes(x = Gender, y = number_of_rides, fill = CustomerType)) +
geom_col(position = "dodge")
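One possible cause of an empty plot like this is a filter() that matches no rows, for example because the stored label differs in case or spelling from 'non-member'. The sketch below uses a small made-up FullData (the labels are assumptions, not the asker's data) to show how to check which labels actually occur, and how .groups = "drop" silences the summarise() message:

```r
library(dplyr)

# Made-up stand-in for the asker's data; the labels are assumptions
FullData <- tibble(
  CustomerType = c("Member", "Non-member", "Non-member", "Member"),
  Gender       = c("Male", "Female", "Male", "Female")
)

# List the labels that actually occur before filtering on one of them
labels_present <- FullData %>% count(CustomerType)

# A label that is not present exactly as typed yields zero rows
no_rows <- FullData %>% filter(CustomerType == "non-member")

# .groups = "drop" returns an ungrouped result and suppresses the message
rides <- FullData %>%
  filter(CustomerType == "Non-member") %>%
  group_by(CustomerType, Gender) %>%
  summarise(number_of_rides = n(), .groups = "drop")
```

If the filtered data has zero rows, ggplot draws only the axes, which would match the empty gray panel described in the question. The summarise() message itself is informational and not an error.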

Wanting Top ten to plot in ggplot

I have roughly 16,000 missing-persons records that I am trying to order by Count and then plot on a graph. This is the code I am using. I want to plot only the top ten.
mp.city <-mp.All %>%
group_by(State, City, Sex) %>%
summarise(Count = n())
mp.city %>%
arrange(desc(Count)) %>%
slice(1:10) %>%
ggplot(aes(y = City)) +
geom_bar()
The code runs, but the plot is garbage. Any help would be amazing, thank you!
I think you can manage it with head():
url<-'https://raw.githubusercontent.com/kitapplegate/fall2020/master/mpAll.csv'
mp.All<-read.csv(url)
library(ggplot2)
library(dplyr)
mp.city <-mp.All %>%
group_by(State, City, Sex) %>%
summarise(Count = n())
mp.city %>%
# sort
arrange(desc(Count)) %>%
# top 10 overall
head(10) %>%
# plot ordered
ggplot(aes(x = reorder(City,Count), y = Count))+
geom_bar( stat = "identity") +
# flipped
coord_flip() +
# label for x axis (flipped)
xlab("City")
P.S.
Next time, try to share your data by posting the output of dput(head(yourdata)); it makes the question much easier to answer.
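A likely reason slice(1:10) did not give an overall top ten is that summarise() leaves the result grouped by State and City, and slice() works within each group, whereas head() ignores the grouping. A small sketch with made-up data (the values in toy are illustrative only):

```r
library(dplyr)

# Made-up counts; state and city names are illustrative only
toy <- tibble(
  State = c("A", "A", "B", "B", "C", "C"),
  City  = c("a1", "a1", "b1", "b1", "c1", "c1"),
  Sex   = c("F", "M", "F", "M", "F", "M"),
  Count = c(5, 9, 2, 7, 4, 1)
)

# After summarise(), the result is still grouped by State and City
grouped <- toy %>%
  group_by(State, City, Sex) %>%
  summarise(Count = sum(Count))

# slice() keeps the first 2 rows of every group, so nothing is dropped here
per_group <- grouped %>% arrange(desc(Count)) %>% slice(1:2)

# head() keeps the first 2 rows overall, regardless of grouping
overall <- grouped %>% arrange(desc(Count)) %>% head(2)
```

Calling ungroup() before slice(1:10) would be another way to get the overall top rows.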

How to Arrange Stacked geom_bar by Ascending Proportion

I am looking at an R Tidy Tuesday dataset (European Energy). I have wrangled the Imports and Exports as proportions and want to arrange the ggplot in ascending order of the Imports values. I just want it to look tidy, but I can't seem to control the order so that each subsequent country has the next biggest import value.
I have left a couple of attempts in the code, commented out. Thanks in advance.
library(tidyverse)
country_totals <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-04/country_totals.csv')
country_totals %>%
filter(!is.na(country_name)) %>%
filter(type %in% c("Imports","Exports")) %>%
group_by(country_name) %>%
mutate(country_type_ttl = sum(`2018`)) %>%
mutate(country_type_pct = `2018`/country_type_ttl) %>%
ungroup() %>%
mutate(type_hold = type) %>%
pivot_wider(names_from = type_hold, values_from = `2018`) %>%
# ggplot(aes(country_name, country_type_pct, fill = type)) +
# ggplot(aes(reorder(country_name, Imports), country_type_pct, fill = type)) +
ggplot(aes(fct_reorder(country_name, Imports), country_type_pct, fill = type)) +
geom_bar(stat = "identity") +
coord_flip()
This can be achieved by adding a column with the value by which you want to reorder, i.e. the percentage share of imports in 2018, using e.g. imports_2018 = country_type_pct[type == "Imports"]. Then reorder the countries according to this column:
library(tidyverse)
country_totals <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-04/country_totals.csv')
country_totals %>%
filter(!is.na(country_name)) %>%
filter(type %in% c("Imports","Exports")) %>%
group_by(country_name) %>%
mutate(country_type_ttl = sum(`2018`)) %>%
mutate(country_type_pct = `2018`/country_type_ttl,
imports_2018 = country_type_pct[type == "Imports"]) %>%
ungroup() %>%
mutate(type_hold = type) %>%
ggplot(aes(fct_reorder(country_name, imports_2018), country_type_pct, fill = type)) +
geom_bar(stat = "identity") +
coord_flip()
#> Warning: Removed 2 rows containing missing values (position_stack).
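The indexing trick used above, picking one group's value with a logical subset inside a grouped mutate(), can be illustrated on toy data. This is a sketch with made-up numbers; the column names value, pct and imports_pct are hypothetical:

```r
library(dplyr)

# Made-up import/export values for two hypothetical countries
toy <- tibble(
  country_name = c("X", "X", "Y", "Y"),
  type         = c("Imports", "Exports", "Imports", "Exports"),
  value        = c(30, 70, 60, 40)
)

shares <- toy %>%
  group_by(country_name) %>%
  mutate(
    pct         = value / sum(value),
    # within each country, pick the single Imports share;
    # the length-one result is recycled to every row of that country
    imports_pct = pct[type == "Imports"]
  ) %>%
  ungroup()

print(shares)
```

This relies on each country having exactly one Imports row; a group with zero or several such rows would make mutate() fail because the result cannot be recycled to the group size.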

Plotting in ggplot using cumsum

I am trying to use ggplot2 to plot a date column vs. a numeric column.
I have a dataframe that I am trying to manipulate with country as either china or not china, and successfully created the dataframe linked below with:
is_china <- confirmed_cases_worldwide %>%
filter(country == "China", type=='confirmed') %>%
group_by(country) %>%
mutate(cumu_cases = cumsum(cases))
is_not_china <- confirmed_cases_worldwide %>%
filter(country != "China", type=='confirmed') %>%
mutate(cumu_cases = cumsum(cases))
is_not_china$country <- "Not China"
china_vs_world <- rbind(is_china,is_not_china)
Now essentially I am trying to plot a line graph with cumu_cases and date between "china" and "not china"
I am trying to execute this code:
plt_china_vs_world <- ggplot(china_vs_world) +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases")
Now I keep getting a graph looking like this:
I don't understand why this is happening; I have been trying to convert data types and other methods.
Any help is appreciated, I linked both csv below
https://github.com/king-sules/Covid
The dates repeat within the 'Not China' group because every other country was relabelled as 'Not China', so each date now has one row per original country. Either aggregate by date in the OP's is_not_china step, or do it on china_vs_world:
library(ggplot2)
library(dplyr)
china_vs_world %>%
group_by(country, date) %>%
summarise(cumu_cases = sum(cases)) %>%
# cumulate within each country before dropping the grouping
mutate(cumu_cases = cumsum(cumu_cases)) %>%
ungroup() %>%
ggplot() +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases")
NOTE: It is the scale that makes the China numbers look small.
As @Edward mentioned, a log scale would make it easier to understand
china_vs_world %>%
group_by(country, date) %>%
summarise(cumu_cases = sum(cases)) %>%
# cumulate within each country before dropping the grouping
mutate(cumu_cases = cumsum(cumu_cases)) %>%
ungroup() %>%
ggplot() +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases") +
scale_y_continuous(trans='log')
Or with a facet_wrap
china_vs_world %>%
group_by(country, date) %>%
summarise(cumu_cases = sum(cases)) %>%
# cumulate within each country before dropping the grouping
mutate(cumu_cases = cumsum(cumu_cases)) %>%
ungroup() %>%
ggplot() +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases") +
facet_wrap(~ country, scales = 'free_y')
data
china_vs_world <- read.csv("https://raw.githubusercontent.com/king-sules/Covid/master/china_vs_world.csv", stringsAsFactors = FALSE)
china_vs_world$date <- as.Date(china_vs_world$date)
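Whether cumsum() runs inside or outside a group changes the result. A minimal sketch on made-up numbers (toy is hypothetical, not the linked CSV):

```r
library(dplyr)

# Made-up daily case counts for two groups
toy <- tibble(
  country = c("China", "China", "Not China", "Not China"),
  cases   = c(1, 2, 10, 20)
)

# Inside group_by(), cumsum() restarts for each country
per_country <- toy %>%
  group_by(country) %>%
  mutate(cumu_cases = cumsum(cases)) %>%
  ungroup()

# Without grouping, the running total carries over from one country to the next
overall <- toy %>%
  mutate(cumu_cases = cumsum(cases))
```

Here per_country gives 1, 3, 10, 30, while overall gives 1, 3, 13, 33, so the second group's line would start from the first group's final total.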

grouped by factor level in ggplot2()

I've got a data frame with four three-level categorical variables: before_weight, after_weight, before_pain, and after_pain.
I'd like to make a bar plot featuring the proportion for each level of the variables. My current code achieves that.
The problem's the presentation of the data. I'd like the respective before and after bars to be grouped together, so that the bar representing the people that answered 1 in the before_weight variable is grouped next to the bar representing the people that answered 1 in the after_weight variable, and so forth for both the weight and pain variables.
I've been trying to use dplyr, mutate() with numerous ifelse() statements, to make a new variable pairing up the groups in question, but can't seem to get it to work.
Any help would be much appreciated.
starting point (df):
df <- data.frame(before_weight=c(1,2,3,2,1),before_pain=c(2,2,1,3,1),after_weight=c(1,3,3,2,3),after_pain=c(1,1,2,3,1))
current code:
library(tidyr)
dflong <- gather(df, varname, score, before_weight:after_pain, factor_key=TRUE)
dflong$score <- as.factor(dflong$score)
library(ggplot2)
library(dplyr)
dflong %>%
group_by(varname) %>%
count(score) %>%
mutate(prop = 100*(n / sum(n))) %>%
ggplot(aes(x = varname, y = prop, fill = factor(score))) +
scale_fill_brewer() +
geom_col(position = 'dodge', colour = 'black')
UPDATE:
I'd like proportions rather than counts, so I've attempted to tweak Nate's code. Since I'm using the question variable to group the data to get the proportions, I can't seem to use gsub() to change the content of that variable. Instead I added question2 and passed it into facet_wrap(). It seems to work:
df %>% gather("question", "val") %>%
count(question, val) %>%
group_by(question) %>%
mutate(percent = 100*(n / sum(n))) %>%
mutate(time= factor(ifelse(grepl("before", question), "before", "after"), c("before", "after"))) %>%
mutate(question2= ifelse(grepl("weight", question), "weight", "pain")) %>%
ggplot(aes(x=val, y=percent, fill = time)) + geom_col(position = "dodge") + facet_wrap(~question2)
Does this code make the visual comparisons you are after? One ifelse and a gsub will help make variables we can use for facetting and filling in ggplot.
df %>% gather("question", "val") %>% # go long
mutate(time = factor(ifelse(grepl("before", question), "before", "after"),
c("before", "after")), # use factor with levels to control order
question = gsub(".*_", "", question)) %>% # clean for facets
ggplot(aes(x = val, fill = time)) + # use fill not color for whole bar
geom_bar(position = "dodge") + # stacking is the default option
facet_wrap(~question) # two panels
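The two string helpers used above can be checked on their own. This sketch applies them to the four column names from the question; the variable names time and topic are just for illustration:

```r
# Column names from the question's data frame
question <- c("before_weight", "before_pain", "after_weight", "after_pain")

# grepl() flags names containing "before"; explicit levels fix the bar order
time <- factor(ifelse(grepl("before", question), "before", "after"),
               levels = c("before", "after"))

# gsub() removes everything up to and including the underscore
topic <- gsub(".*_", "", question)

print(data.frame(question, time, topic))
```

The time column drives the fill and topic drives the facets, so each before bar sits next to its after counterpart within the weight and pain panels.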

Resources