Plotting in ggplot using cumsum - r

I am trying to use ggplot2 to plot a date column vs. a numeric column.
I have a data frame that I am trying to split by country into China and not China, and I successfully created the data frame linked below with:
is_china <- confirmed_cases_worldwide %>%
filter(country == "China", type=='confirmed') %>%
group_by(country) %>%
mutate(cumu_cases = cumsum(cases))
is_not_china <- confirmed_cases_worldwide %>%
filter(country != "China", type=='confirmed') %>%
mutate(cumu_cases = cumsum(cases))
is_not_china$country <- "Not China"
china_vs_world <- rbind(is_china,is_not_china)
Now essentially I am trying to plot a line graph with cumu_cases and date between "china" and "not china"
I am trying to execute this code:
plt_china_vs_world <- ggplot(china_vs_world) +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases")
Now I keep getting a graph looking like this:
Don't understand why this is happening, been trying to convert data types and other methods.
Any help is appreciated, I linked both csv below
https://github.com/king-sules/Covid

The 'date' values for the other countries are repeated because their 'country' is now changed to 'Not China', so that group has many rows per date. The cases need to be summed per date, either in the OP's 'is_not_china' step or afterwards in 'china_vs_world':
library(ggplot2)
library(dplyr)
china_vs_world %>%
group_by(country, date) %>%
summarise(cumu_cases = sum(cases)) %>%
group_by(country) %>% # running total within each country
mutate(cumu_cases = cumsum(cumu_cases)) %>%
ggplot() +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases")
NOTE: China's numbers look small only because of the scale of the y axis.
As @Edward mentioned, a log scale would make the plot easier to read
china_vs_world %>%
group_by(country, date) %>%
summarise(cumu_cases = sum(cases)) %>%
group_by(country) %>% # running total within each country
mutate(cumu_cases = cumsum(cumu_cases)) %>%
ggplot() +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases") +
scale_y_continuous(trans='log')
Or with a facet_wrap
china_vs_world %>%
group_by(country, date) %>%
summarise(cumu_cases = sum(cases)) %>%
group_by(country) %>% # running total within each country
mutate(cumu_cases = cumsum(cumu_cases)) %>%
ggplot() +
geom_line(aes(x=date,y=cumu_cases,group=country,color=country)) +
ylab("Cumulative confirmed cases") +
facet_wrap(~ country, scales = 'free_y')
data
china_vs_world <- read.csv("https://raw.githubusercontent.com/king-sules/Covid/master/china_vs_world.csv", stringsAsFactors = FALSE)
china_vs_world$date <- as.Date(china_vs_world$date)
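To make the aggregation step concrete, here is a minimal sketch on made-up numbers. The data frame `toy` and its values are hypothetical, not taken from the linked CSV. It collapses the data to one row per country and date, then takes the running total within each country:

```r
library(dplyr)

# Hypothetical data: "Not China" has two rows per date, as happens once
# several countries are relabelled to a single group.
toy <- data.frame(
  country = c("Not China", "Not China", "Not China", "Not China",
              "China", "China"),
  date = as.Date(c("2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23",
                   "2020-01-22", "2020-01-23")),
  cases = c(1, 2, 3, 4, 10, 20)
)

daily <- toy %>%
  group_by(country, date) %>%
  summarise(cases = sum(cases), .groups = "drop") %>%  # one row per country/date
  group_by(country) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(cumu_cases = cumsum(cases)) %>%               # running total per country
  ungroup()

daily$cumu_cases
# China: 10, 30; Not China: 3, 10
```

With one row per country and date, geom_line has a single point to draw for each date in each group.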

Related

Problem understanding the dot operator in dplyr

I'm trying the edX Harvard R Basics and Data Visualization courses, but I'm having quite a hard time trying to understand the functionality of the dot (.) operator.
I tried the code below:
gapminder %>%
filter(year %in% c(1970, 2010) & !is.na(gdp)) %>%
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
Here's where I get stuck. I'm trying to intersect both lists: if I put "%>% .$country" in both lists, intersect them, and then go to the histogram, everything runs well.
country_list_1 <- gapminder %>%
filter(year == 1970 & !is.na(dollars_per_day)) %>% .$country
country_list_2 <- gapminder %>%
filter(year == 2010 & !is.na(dollars_per_day)) %>% .$country
country_list <- intersect(country_list_1, country_list_2)
gapminder %>%
filter(year %in% c(1970, 2010) & country %in% country_list) %>%
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
But if I do this (skip the %>% .$country) it returns the error "Faceting variables must have at least one value":
country_list_1 <- gapminder %>%
filter(year == 1970 & !is.na(dollars_per_day))
country_list_2 <- gapminder %>%
filter(year == 2010 & !is.na(dollars_per_day))
country_list <- intersect(country_list_1, country_list_2)
gapminder %>%
filter(year %in% c(1970, 2010) & country %in% country_list) %>%
mutate(group = ifelse(region %in% west, "West", "Developing")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
I don't quite get the logic of that, nor the function of the dot per se.
Section 3, 3.2 Using the Gapminder Dataset, 5th video "comparing distributions" of the Data Science: Visualization in R course HarvardX
In addition to the comments above, you may need to learn about subsetting with $. When you run
my_new_df <- gapminder %>%
filter(year == 1970 & !is.na(dollars_per_day))
you get back a data set (a tibble) with fewer rows than before, but you still have all of the columns. The $ lets you pick out a single column, so the countries are in the vector
my_new_df$country
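As mentioned in the comments to your question, the dot (.) is a placeholder: it stands for whatever comes in from the left side of the pipe %>%, so .$country extracts the country column from the filtered data. Without it, country_list_1 and country_list_2 are whole data frames rather than vectors of country names, so country %in% country_list matches nothing and the plot has no rows to facet.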
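The difference between intersecting vectors and intersecting data frames can be seen in a small sketch. The data frame `toy` is made up for illustration and stands in for gapminder:

```r
library(dplyr)

# Hypothetical stand-in for gapminder
toy <- data.frame(
  country = c("A", "B", "A", "C"),
  year = c(1970, 1970, 2010, 2010),
  stringsAsFactors = FALSE
)

# With the dot: each result is a character vector of country names
list_1 <- toy %>% filter(year == 1970) %>% .$country
list_2 <- toy %>% filter(year == 2010) %>% pull(country)  # pull() does the same job
common <- intersect(list_1, list_2)                       # countries present in both years

# Without the dot: each result is still a data frame
df_1 <- toy %>% filter(year == 1970)
df_2 <- toy %>% filter(year == 2010)
rows_in_both <- intersect(df_1, df_2)  # identical rows in both; none here, the years differ

# A zero-row data frame is useless as a lookup table for %in%
any_match <- any(toy$country %in% rows_in_both)
```

Here `common` contains only "A", while `rows_in_both` has zero rows, so a later `filter(country %in% ...)` keeps nothing.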

Issue with filter inside of geom in ggplot. "comparison (1) is possible only for atomic and list types"

I have a simple two-column time-series dataset that looks like this:
Date Signups
22-Feb-18 601
23-Feb-18 500
24-Feb-18 6000
...
27-Apr-22 999
28-Apr-22 998
29-Apr-22 123
30-Apr-22 321
And I'm trying to make a simple line chart that shows the monthly total over time and then a point at the most recent month. But the filter within the geom_point is giving me a hard time. Here's what I have:
library(tidyverse)
library(scales)
library(lubridate)
signups %>%
mutate(Date = dmy(Date)) %>%
group_by(month(Date), year(Date)) %>%
mutate(month = paste0(month(Date),"-",year(Date))) %>%
mutate(month = my(month)) %>%
mutate(monthly_total = sum(signups)) %>%
ungroup() %>%
dplyr::filter(month >= "2018-03-01") %>%
ggplot(aes(month, monthly_total)) +
geom_line() +
geom_point(data = signups %>% dplyr::filter(month == "2022-03-01")) +
expand_limits(y = 0, x = as.Date(c("2018-03-01", "2024-03-01"))) +
scale_y_continuous(labels = comma)
If I comment out the geom_point it gives me the line chart that I'm looking for. But when the geom_point is included here it throws this error:
Error in dplyr::filter(., month == "2022-03-01") :
Caused by error in `month == "2022-03-01"`:
! comparison (1) is possible only for atomic and list types
I've tried using subset instead of filter and it didn't help. Let me know if you have any suggestions. Thanks!
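The comment from Limey got us there. Inside geom_point I was filtering the original signups data frame, which has no month column, so month resolved to lubridate's month() function and the comparison failed. Saving the mutated data back to signups first fixes it. Here's what I needed to do: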
signups <- signups %>%
mutate(Date = dmy(Date)) %>%
mutate(just_month = paste0(month(Date),"-",year(Date))) %>%
mutate(just_month = my(just_month)) %>%
group_by(month(Date), year(Date)) %>%
mutate(monthly_total = sum(signups)) %>%
ungroup()
signups %>%
dplyr::filter(just_month >= "2018-03-01") %>%
ggplot(aes(just_month, monthly_total)) +
geom_line(aes(just_month, monthly_total)) +
geom_point(data = dplyr::filter(signups, just_month == "2022-04-01")) +
expand_limits(y = 0, x = as.Date(c("2018-03-01", "2024-03-01"))) +
scale_y_continuous(labels = comma)
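A minimal sketch of why the original filter failed, using a made-up two-row data frame `raw` (a hypothetical name). It uses floor_date() as an alternative to the paste0()/my() round trip above, which is a substitution for illustration and not part of the original fix. Parsing "Feb" and "Apr" assumes an English locale:

```r
library(dplyr)
library(lubridate)

# Hypothetical data in the same shape as the question's table
raw <- data.frame(
  Date = c("22-Feb-18", "30-Apr-22"),
  Signups = c(601, 321)
)

# `raw` has no `month` column, so `month` is looked up outside the data
# and found as lubridate::month(), a function. Comparing a function with
# a string is an error.
res <- try(filter(raw, month == "2022-03-01"), silent = TRUE)
failed <- inherits(res, "try-error")

# Once the month column exists in the data being filtered, it works
with_month <- raw %>%
  mutate(Date = dmy(Date),
         just_month = floor_date(Date, "month"))  # first day of each month

latest <- filter(with_month, just_month == as.Date("2022-04-01"))
```

The same rule applies to any data argument passed to a geom: it is evaluated on its own and does not see columns created earlier in the pipe.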

How to display variable and value labels in ggplot bar chart?

I'm trying to get the variable labels and value labels to be displayed on a stacked bar chart.
library(tidyverse)
data <- haven::read_spss("http://staff.bath.ac.uk/pssiw/stats2/SAQ.sav")
data %>%
select(Q01:Q04) %>%
gather %>%
group_by(key, value) %>%
tally %>%
mutate(n = n/sum(n)*100, round = 1) %>%
mutate(n = round(n, 2)) %>%
ggplot(aes(x=key, y=n, fill=factor(value))) +
geom_col() +
geom_text(aes(label=as_factor(n)), position=position_stack(.5)) +
coord_flip() +
theme(aspect.ratio = 1/3) + scale_fill_brewer(palette = "Set2")
Instead of Q01, Q02, Q03, Q04, I would like to use the variable labels.
library(labelled)
var_label(data$Q01)
Statistics makes me cry
var_label(data$Q02)
My friends will think Im stupid for not being able to cope with SPSS
var_label(data$Q03)
Standard deviations excite me
var_label(data$Q04)
I dream that . . .
along with associated value labels
val_labels(data$Q01)
Strongly agree Agree Neither Disagree Strongly disagree Not answered
1 2 3 4 5 9
I tried using label = as_factor(n) but that didn't work.
We may extract the labels and then do a join
library(forcats)
library(haven)
library(dplyr)
library(tidyr)
library(labelled)
subdat <- data %>%
select(Q01:Q04)
d1 <- subdat %>%
summarise(across(everything(), var_label)) %>%
pivot_longer(everything())
subdat %>%
pivot_longer(everything(), values_to = 'val') %>%
left_join(d1, by = 'name') %>%
mutate(name = value, value = NULL) %>%
count(name, val) %>%
group_by(name) %>% # percentages within each question
mutate(n = round(n/sum(n)*100, 2)) %>%
ungroup %>%
mutate(labels = names(val_labels(val)[val])) %>%
ggplot(aes(x=name, y=n, fill=labels)) +
geom_col() +
geom_text(aes(label=as_factor(n)),
position=position_stack(.5)) +
coord_flip() +
theme(aspect.ratio = 1/3) +
scale_fill_brewer(palette = "Set2")
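As an alternative way to get at the value labels, haven's as_factor() converts a labelled vector to a factor whose levels are the labels. The sketch below builds a small labelled vector by hand; `q01` and its values are made up and only mimic the structure of data$Q01:

```r
library(haven)

# Hypothetical labelled vector in the style of an SPSS import
q01 <- labelled(
  c(1, 2, 5, 2),
  labels = c("Strongly agree" = 1, "Agree" = 2, "Strongly disagree" = 5),
  label = "Statistics makes me cry"
)

# The variable label is stored as the "label" attribute
question_text <- attr(q01, "label")

# as_factor() looks labels up by value, not by position
answers <- as.character(as_factor(q01))
answers
# "Strongly agree" "Agree" "Strongly disagree" "Agree"
```

Because the lookup is by value, a code such as 9 ("Not answered") would still map to its label even though it is not the ninth entry in the label vector.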

R GGplot geom_area data perhaps unintentionally overlapping

I am working on the Tidy Tuesday data this week and ran into my geom_area doing what I think is overlapping the data. If I facet_wrap the data then there are no missing values in any year, but as soon as I make an area plot and fill it the healthcare/education data seems to disappear.
Below are example plots of what I mean.
library(tidyverse)
chain_investment <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-10/chain_investment.csv')
plottable_investment <- chain_investment %>%
filter(group_num == c(12,17)) %>%
mutate(small_cat = case_when(
group_num == 12 ~ "Transportation",
group_num == 17 ~ "Education/Health"
)) %>%
group_by(small_cat, year, category) %>%
summarise(sum(gross_inv_chain)) %>%
ungroup %>%
rename(gross_inv_chain = 4)
# This plot shows that there is NO missing education, health, or highway data
# Goal is to combine the data on one plot and fill based on the category
plottable_investment %>%
ggplot(aes(year, gross_inv_chain)) +
geom_area() +
facet_wrap(~category)
# Some of the data in the health category gets lost? disappears? unknown
plottable_investment %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_area()
# Something is going wrong here?
plottable_investment %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_area(position = "identity")
# The data is definitely there
plottable_investment %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain)) +
geom_area() +
facet_wrap(~category)
The issue is that you filtered your data using == instead of %in%.
With ==, the vector c(12, 17) is recycled along group_num, so each row is compared with only one of the two values, depending on its position. In your case this has the subtle side effect that for some categories (e.g. Health) the filtered data contains observations only for even years, while for others (e.g. Education) it contains observations only for odd years. As a result you end up with "two" area charts which overlap each other.
This is easy to see by switching to geom_col, which gives what looks like a "dodged" bar plot because there is only one category per year.
plottable_investment %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_col()
Using %in% instead gives the desired stacked area chart with all observations per category:
plottable_investment1 <- chain_investment %>%
filter(group_num %in% c(12,17)) %>%
mutate(small_cat = case_when(
group_num == 12 ~ "Transportation",
group_num == 17 ~ "Education/Health"
)) %>%
group_by(small_cat, year, category) %>%
summarise(gross_inv_chain = sum(gross_inv_chain)) %>%
ungroup()
#> `summarise()` has grouped output by 'small_cat', 'year'. You can override using the `.groups` argument.
plottable_investment1 %>%
filter(category %in% c("Education","Health")) %>%
ggplot(aes(year, gross_inv_chain, fill = category)) +
geom_area()
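The recycling behaviour of == can be reproduced in base R on a tiny made-up data frame (`toy` is hypothetical and not the Tidy Tuesday data):

```r
# Hypothetical data: two groups, each observed in four consecutive years
toy <- data.frame(
  year = rep(2000:2003, times = 2),
  group_num = rep(c(12, 17), each = 4)
)

# == recycles c(12, 17) to 12, 17, 12, 17, ... and compares row by row,
# so a row is kept only if it happens to line up with its own value
with_eq <- toy[toy$group_num == c(12, 17), ]

# %in% checks every row against the whole set
with_in <- toy[toy$group_num %in% c(12, 17), ]

with_eq$year
# 2000 2002 2001 2003: group 12 keeps even years, group 17 keeps odd years
nrow(with_in)
# 8: all rows are kept
```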

Can't change x scale from Date to number

I have a coronavirus data frame and I need to compare the Israel and UK data from the time both countries had more than 10 confirmed patients. This is my code:
library(ggplot2)
library(dplyr)
#Data frame
df.raw <- read.csv(url('https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv'))
str(df.raw)
df <- df.raw
df$Date <- as.Date(df$Date)
str(df)
df.israel <- df %>% filter(Country == 'Israel', Confirmed>10)
df.uk <- df %>% filter(Country == 'United Kingdom', Confirmed>10)
if(df.israel$Date[1] > df.uk$Date[1]){
df.uk <- df.uk %>% filter(Date >= df.israel$Date[1])
} else {
df.israel <- df.israel %>% filter(Date >= df.uk$Date[1])
}
ggplot() +
geom_point(data = df.israel, aes(Date, Confirmed), color = 'blue') +
geom_point(data = df.uk, aes(Date,Confirmed), color = 'red')
Now I need my x axis to be numeric (1, 2, 3, etc.), but I don't know how (I tried xlim and scale_x_continuous). Does anyone know how to do this?
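You can use match to get numbers instead of Date: match(Date, unique(Date)) returns each row's position among the distinct dates, so after arrange(Date) the earliest date becomes 1, the next distinct date 2, and so on. It is also better to keep the data in long format instead of creating two separate data frames.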
library(dplyr)
library(ggplot2)
df %>%
filter(Country %in% c('Israel', 'United Kingdom') & Confirmed>10) %>%
tidyr::pivot_longer(cols = Country) %>%
arrange(Date) %>%
mutate(day = match(Date, unique(Date))) %>%
ggplot() + aes(day, Confirmed, color = value) + geom_point() +
scale_color_manual(values = c('blue', 'red'))
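What match() returns here can be checked on a few made-up dates. The sketch also shows a date-difference variant for comparison, which is an alternative not used in the answer above:

```r
# Hypothetical dates, already sorted, with a repeat and a one-day gap
dates <- as.Date(c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-04"))

# Position of each date among the distinct dates
day_rank <- match(dates, unique(dates))
day_rank
# 1 1 2 3

# Alternative: calendar days elapsed since the first date, counted from 1
day_elapsed <- as.integer(dates - min(dates)) + 1L
day_elapsed
# 1 1 2 4
```

The two agree when every calendar day is present. If days are missing, match() skips over the gap while the date difference preserves it.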
