I am using the dataset who (available in the tidyr package), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 columns hold the year, country name and similar identifiers.
I want to calculate the sum of new cases per country per year, but I just can't make it work.
I was expecting something like
group_by(who, country) %>% summarise(count = rowsum(.[5:60]))
would work, but it doesn't.
Can anyone help me understand why it doesn't work, and what to do instead?
You're missing a first step, which is to gather the data into a 'tidy' format. Try this:
who%>%
gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
filter(!is.na(cases))%>%
group_by(country,year)%>%
summarise(sum(cases))
Which gives output:
# A tibble: 3,484 × 3
# Groups: country [219]
country year `sum(cases)`
<chr> <int> <int>
1 Afghanistan 1997 128
2 Afghanistan 1998 1778
3 Afghanistan 1999 745
4 Afghanistan 2000 2666
5 Afghanistan 2001 4639
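On the "why" part of the question: base R has two similarly named functions. rowSums() adds across columns and returns one total per row, while rowsum() adds rows together within the levels of a grouping vector that has to be passed as its second argument. In the attempted code, the . in .[5:60] also refers to the whole piped data frame rather than to the current group. A minimal base R sketch of the difference, using a made-up toy data frame (the column names are illustrative and not taken from who):

```r
# Toy data: two value columns, one row per country and year (made-up numbers)
toy <- data.frame(
  country = c("A", "A", "B"),
  year    = c(2000L, 2001L, 2000L),
  new_x   = c(1, NA, 3),
  new_y   = c(10, 20, 5)
)
value_cols <- c("new_x", "new_y")

# rowSums(): adds across columns, giving one total per row
toy$total <- rowSums(toy[, value_cols], na.rm = TRUE)

# rowsum(): adds rows within each group, giving one row per country
by_country <- rowsum(toy[, value_cols], group = toy$country, na.rm = TRUE)

toy
by_country
```

Since who already has one row per country per year, the per-row total from rowSums() is the quantity asked for.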
library(tidyverse)
(long_who <- who |> pivot_longer(cols = -(1:4)))
long_who |> filter(startsWith(name, "new")) |> # don't want things like "Population"
group_by(country, year) |> # one total per country per year, as asked
summarise(sum_of_new_ = sum(value, na.rm = TRUE))
A base R approach:
data.frame(who[,c("country", "year")],
cnt = rowSums(who[5:60], na.rm = TRUE))
#> country year cnt
#> 1 Afghanistan 1980 0
#> 2 Afghanistan 1981 0
#> 3 Afghanistan 1982 0
#> 4 Afghanistan 1983 0
#> 5 Afghanistan 1984 0
#> 6 Afghanistan 1985 0
You could also do without the long format by using rowSums and across:
library(dplyr)
who |>
group_by(country, year) |>
summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
ungroup()
Alternatives to across(5:58):
across(starts_with("new"))
across(-(1:4))
Output:
# A tibble: 20 × 3
# Groups: country [1]
country year count
<chr> <int> <dbl>
1 Afghanistan 1980 0
2 Afghanistan 1981 0
3 Afghanistan 1982 0
4 Afghanistan 1983 0
5 Afghanistan 1984 0
6 Afghanistan 1985 0
7 Afghanistan 1986 0
8 Afghanistan 1987 0
9 Afghanistan 1988 0
10 Afghanistan 1989 0
11 Afghanistan 1990 0
12 Afghanistan 1991 0
13 Afghanistan 1992 0
14 Afghanistan 1993 0
15 Afghanistan 1994 0
16 Afghanistan 1995 0
17 Afghanistan 1996 0
18 Afghanistan 1997 128
19 Afghanistan 1998 1778
20 Afghanistan 1999 745
I have a data frame which looks roughly like this:
Location Date code total_cases total_vaccinations
Afghanistan 2022-04-23 NA 5.00 NA
Afghanistan 2022-04-22 3 3.00 2
Afghanistan 2022-04-21 2 3.00 NA
Albania 2022-04-24 3 9.00 NA
Albania 2022-04-23 NA 9.00 NA
Albania 2022-04-22 5 7.00 NA
Albania 2022-04-21 7 3.00 NA
Bolivia 2022-04-24 2 NA 1
Bolivia 2022-04-23 3 3.00 0
........
My problem is that I am trying to make a new data frame which contains each country once, where each row holds the most recent value of each column that isn't NA, if one is available. For the above table the result should look like this:
Location Date code total_cases total_vaccinations
Afghanistan 2022-04-23 3 5.00 2
Albania 2022-04-24 3 9.00 NA
Bolivia 2022-04-24 2 3.00 1
So far I tried:
new_data <- main_data %>%
group_by(Location) %>%
arrange(desc(Date)) %>%
filter(date==max(Date))
But that doesn't work. I would appreciate any help.
A possible solution, based on tidyverse:
library(tidyverse)
df %>%
group_by(Location) %>%
arrange(Date) %>%
fill(-Date, .direction="down") %>%
slice_max(Date) %>%
ungroup
#> # A tibble: 3 × 5
#> Location Date code total_cases total_vaccinations
#> <chr> <chr> <int> <dbl> <int>
#> 1 Afghanistan 2022-04-23 3 5 2
#> 2 Albania 2022-04-24 3 9 NA
#> 3 Bolivia 2022-04-24 2 3 1
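For reference, the same idea can be sketched without fill(): sort the rows newest first, then take the first non-missing entry of every column within each country. This is a base R sketch on a cut-down copy of the sample data; first_non_na is a hypothetical helper name, not a library function.

```r
# Cut-down copy of the sample data from the question
df <- data.frame(
  Location = c("Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Albania"),
  Date = as.Date(c("2022-04-23", "2022-04-22", "2022-04-21",
                   "2022-04-24", "2022-04-23")),
  code = c(NA, 3, 2, 3, NA),
  total_cases = c(5, 3, 3, 9, 9),
  total_vaccinations = c(NA, 2, NA, NA, NA)
)

# Hypothetical helper: first non-NA element, or a typed NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Newest rows first, so "first non-NA" means "most recent non-NA"
df <- df[order(df$Date, decreasing = TRUE), ]

# One summary row per country, then stack the rows back together
latest <- do.call(rbind, lapply(split(df, df$Location), function(g) {
  data.frame(
    Location = g$Location[1],
    Date = g$Date[1],
    code = first_non_na(g$code),
    total_cases = first_non_na(g$total_cases),
    total_vaccinations = first_non_na(g$total_vaccinations)
  )
}))
latest
```

A column with no recorded value at all for a country (total_vaccinations for Albania here) stays NA.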
My data looks like this:
Country GDP Year
A 10 1972
A 15 1973
A 20 1973
A 18 1975
B 25 1950
B 30 1951
B 35 1951
B 36 1953
I have many observations like the ones presented above. I want to fix the duplicated years: for each year that appears twice within a country, I want to change the year in the first of the duplicated rows. I want my data to look like this:
Country GDP Year
A 10 1972
A 20 1973
A 15 1974
A 18 1975
B 25 1950
B 35 1951
B 30 1952
B 36 1953
Thank you for your time!
Here is one possible option with tidyverse:
library(tidyverse)
df %>%
group_by(Country, Year) %>%
mutate(dup = case_when(n() == 1 ~ FALSE,
min(GDP) == GDP ~ TRUE,
TRUE ~ FALSE)) %>%
mutate(Year = ifelse(dup == TRUE, Year + 1, Year)) %>%
arrange(Country, Year) %>%
ungroup %>%
select(-dup)
Output
Country GDP Year
<chr> <int> <dbl>
1 A 10 1972
2 A 20 1973
3 A 15 1974
4 A 18 1975
5 B 25 1950
6 B 35 1951
7 B 30 1952
8 B 36 1953
How about this?
library(dplyr)
df %>%
arrange(Country, Year) %>%
group_by(Country) %>%
mutate(Year = min(Year) + row_number() - 1) %>%
ungroup
# Country GDP Year
# <chr> <int> <dbl>
#1 A 10 1972
#2 A 15 1973
#3 A 20 1974
#4 A 18 1975
#5 B 25 1950
#6 B 30 1951
#7 B 35 1952
#8 B 36 1953
This increments every Year by 1 starting from minimum value in each Country.
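The two answers above order the rows differently, and only the first matches the layout requested in the question. A base R sketch that targets the requested layout directly is to flag the first row of each duplicated (Country, Year) pair with duplicated(..., fromLast = TRUE) and move it one year forward. It assumes that a year appears at most twice per country and that the following year is free, as in the sample data.

```r
# Sample data from the question
df <- data.frame(
  Country = rep(c("A", "B"), each = 4),
  GDP     = c(10, 15, 20, 18, 25, 30, 35, 36),
  Year    = c(1972, 1973, 1973, 1975, 1950, 1951, 1951, 1953)
)

# TRUE for rows that have a later row with the same Country and Year,
# i.e. the first row of each duplicated pair
first_dup <- duplicated(df[c("Country", "Year")], fromLast = TRUE)

# Move those rows one year forward, then restore chronological order
df$Year[first_dup] <- df$Year[first_dup] + 1
df <- df[order(df$Country, df$Year), ]
df
```

Unlike the min(GDP) rule in the first answer, this picks the row by position, so it does not depend on which of the two GDP values is smaller.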
I have some cumulative data on covid-19 cases for countries and I am trying to calculate the difference in a new column called Diff. I can't remove the NA values because then the dates on which no tests were carried out would not be shown. So I have made it so that if there is an NA value, the Diff value is set to 0 to indicate there was no difference, hence no tests conducted that day.
I am also trying to write a condition which says that if the lagged difference is NA, indicating that no tests were conducted the day before, then the difference is set to the confirmed cases value for that day.
As you can see from my results at the bottom, I am almost there but I am creating a new column called ifelse. I tried to fix this but I think there is a simple error I am making somewhere. If anyone could point it out to me I would really appreciate it, thank you.
Edit: I realised I made a logical error in setting the daily cases to the confirmed cases whenever the lag calculation is NA, because this gives a misleading answer.
I used the code below on the large dataset to fill down and repeat the previous values when NAs appear. I grouped by country so as not to propagate values forward across countries.
I then calculated the lag and used Ronak Shah's code to get the daily values.
data <- data %>%
group_by(CountryName) %>%
fill(ConfirmedCases, .direction = "down")
data <- data %>%
mutate(lag1 = ConfirmedCases - lag(ConfirmedCases))
data <- data %>% mutate(DailyCases = replace_na(coalesce(lag1, ConfirmedCases), 0))
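To see what these three steps produce, here is a base R sketch of the same sequence (fill down, lagged difference, fall back to the cumulative value, then 0) on the first ten values of the sample data for a single country. The vector names are illustrative only.

```r
# First ten cumulative values from the sample data (one country only)
cases <- c(NA, 7, NA, NA, NA, 10, 16, 21, 22, 22)

# Step 1: carry the last observed value forward (what fill(.direction = "down") does)
idx    <- cumsum(!is.na(cases))               # number of observations seen so far
filled <- c(NA, cases[!is.na(cases)])[idx + 1]

# Step 2: difference to the previous day (NA where there is no previous value)
lag1 <- c(NA, diff(filled))

# Step 3: use the cumulative value where the difference is NA, then 0 if still NA
daily <- ifelse(is.na(lag1), filled, lag1)
daily[is.na(daily)] <- 0

daily
sum(daily) == cases[length(cases)]  # daily values add up to the last cumulative count
```

With the filled series, the jump from 7 to 10 is reported as 3 new cases rather than 10, which is the misleading result the edit refers to.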
library(tidyverse)
data <- data.frame(
stringsAsFactors = FALSE,
CountryName = c("Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan"),
ConfirmedCases = c(NA,7L,NA,NA,NA,10L,16L,21L,
22L,22L,22L,24L,24L,34L,40L,42L,
75L,75L,91L,106L,114L,141L,166L,
192L,235L,235L,270L,299L,337L,367L,
423L),
Diff = c(NA,NA,NA,NA,NA,NA,6L,5L,1L,
0L,0L,2L,0L,10L,6L,2L,33L,0L,16L,
15L,8L,27L,25L,26L,43L,0L,35L,
29L,38L,30L,56L)
)
data2 <- data %>%
mutate(Diff = ifelse(is.na(ConfirmedCases) == TRUE, 0, ConfirmedCases - lag(ConfirmedCases)),
ifelse(is.na((ConfirmedCases - lag(ConfirmedCases))) == TRUE, ConfirmedCases, ConfirmedCases - lag(ConfirmedCases)))
head(data2, 10)
#> CountryName ConfirmedCases Diff ifelse(...)
#> 1 Afghanistan NA 0 NA
#> 2 Afghanistan 7 NA 7
#> 3 Afghanistan NA 0 NA
#> 4 Afghanistan NA 0 NA
#> 5 Afghanistan NA 0 NA
#> 6 Afghanistan 10 NA 10
#> 7 Afghanistan 16 6 6
#> 8 Afghanistan 21 5 5
#> 9 Afghanistan 22 1 1
#> 10 Afghanistan 22 0 0
Created on 2020-08-15 by the reprex package (v0.3.0)
Maybe this can help by creating a duplicate of your target column:
library(tidyverse)
data %>% mutate(D=ConfirmedCases,D=ifelse(is.na(D),0,D),
Diff2 = c(0,diff(D)),Diff2=ifelse(Diff2<0,0,Diff2)) %>% select(-D)
Output:
CountryName ConfirmedCases Diff Diff2
1 Afghanistan NA NA 0
2 Afghanistan 7 NA 7
3 Afghanistan NA NA 0
4 Afghanistan NA NA 0
5 Afghanistan NA NA 0
6 Afghanistan 10 NA 10
7 Afghanistan 16 6 6
8 Afghanistan 21 5 5
9 Afghanistan 22 1 1
10 Afghanistan 22 0 0
11 Afghanistan 22 0 0
12 Afghanistan 24 2 2
13 Afghanistan 24 0 0
14 Afghanistan 34 10 10
15 Afghanistan 40 6 6
16 Afghanistan 42 2 2
17 Afghanistan 75 33 33
18 Afghanistan 75 0 0
19 Afghanistan 91 16 16
20 Afghanistan 106 15 15
21 Afghanistan 114 8 8
22 Afghanistan 141 27 27
23 Afghanistan 166 25 25
24 Afghanistan 192 26 26
25 Afghanistan 235 43 43
26 Afghanistan 235 0 0
27 Afghanistan 270 35 35
28 Afghanistan 299 29 29
29 Afghanistan 337 38 38
30 Afghanistan 367 30 30
31 Afghanistan 423 56 56
I think you can use coalesce to get the first non-NA value from Diff and ConfirmedCases, and if both of them are NA, replace it with 0.
library(dplyr)
data %>%
mutate(Diff2 = tidyr::replace_na(coalesce(Diff, ConfirmedCases), 0))
# CountryName ConfirmedCases Diff Diff2
#1 Afghanistan NA NA 0
#2 Afghanistan 7 NA 7
#3 Afghanistan NA NA 0
#4 Afghanistan NA NA 0
#5 Afghanistan NA NA 0
#6 Afghanistan 10 NA 10
#7 Afghanistan 16 6 6
#8 Afghanistan 21 5 5
#9 Afghanistan 22 1 1
#10 Afghanistan 22 0 0
#11 Afghanistan 22 0 0
#12 Afghanistan 24 2 2
#...
#...
Within RStudio, I have this code:
library(ggplot2)
library(dplyr)
The data frame is called gapminder_data.csv:
str(gapminder_data.csv)
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
When I execute the following code:
gapminder_data.csv %>%
group_by(country) %>%
summarize(min(gdpPercap), max(gdpPercap))
it works:
# A tibble: 142 x 3
country `min(gdpPercap)` `max(gdpPercap)`
<fct> <dbl> <dbl>
1 Afghanistan 635. 978.
2 Albania 1601. 5937.
3 Algeria 2449. 6223.
4 Angola 2277. 5523.
5 Argentina 5911. 12779.
6 Australia 10040. 34435.
7 Austria 6137. 36126.
8 Bahrain 9867. 29796.
9 Bangladesh 630. 1391.
10 Belgium 8343. 33693.
But the years corresponding to the min(gdpPercap) and max(gdpPercap) values are missing.
How can I solve it?
Thanks for your help.
Does this give you what you need?
mins <- gapminder_data.csv %>%
arrange(gdpPercap) %>%
group_by(country) %>%
slice(1) %>%
ungroup()
maxs <- gapminder_data.csv %>%
arrange(desc(gdpPercap)) %>%
group_by(country) %>%
slice(1) %>%
ungroup()
left_join(
select(mins, country, minyear=year, mingdp=gdpPercap),
select(maxs, country, maxyear=year, maxgdp=gdpPercap),
by = "country")
# # A tibble: 142 x 5
# country minyear mingdp maxyear maxgdp
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Afghanistan 1997 635. 1982 978.
# 2 Albania 1952 1601. 2007 5937.
# 3 Algeria 1952 2449. 2007 6223.
# 4 Angola 1997 2277. 1967 5523.
# 5 Argentina 1952 5911. 2007 12779.
# 6 Australia 1952 10040. 2007 34435.
# 7 Austria 1952 6137. 2007 36126.
# 8 Bahrain 1952 9867. 2007 29796.
# 9 Bangladesh 1972 630. 2007 1391.
# 10 Belgium 1952 8343. 2007 33693.
# # ... with 132 more rows
We can do this pretty easily with a pivot. Since you didn't post a structure we could copy and paste (always helpful!), I've made a small sample tibble, but it should work on your larger set. After grouping by country, make a column to designate the max and min rows. We don't want the ones that aren't either, so drop those and finally spread the values to make a wide tibble with the max and min for each country. Generally, it's best to work in tidy (long-form) tibbles in R (what it is before the pivot), but you can easily get back there by using pivot_longer if need be.
tibble(
country = rep("Afghanistan",4),
year = seq(from = 1952, to = 1955),
gdpPercap = c(779, 821, 853, 836)
) %>%
group_by(country) %>%
mutate(
type = case_when(
gdpPercap == max(gdpPercap) ~ "max",
gdpPercap == min(gdpPercap) ~ "min"
)
) %>%
drop_na() %>%
pivot_wider(
id_cols = country,
names_from = type,
values_from = c(year, gdpPercap)
)
which produces:
# A tibble: 1 x 5
# Groups: country [1]
country year_min year_max gdpPercap_min gdpPercap_max
<chr> <int> <int> <dbl> <dbl>
1 Afghanistan 1952 1954 779 853
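If only the years are needed, which.min() and which.max() give the row positions of the extremes directly, so no join or pivot is required. A base R sketch on made-up toy numbers (not the real gapminder values); on ties both functions return the first matching row:

```r
# Toy data with the same column names as the question (values are made up)
gap <- data.frame(
  country   = rep(c("Afghanistan", "Albania"), each = 3),
  year      = rep(c(1952L, 1957L, 1962L), times = 2),
  gdpPercap = c(800, 600, 900, 1600, 1900, 2300)
)

# For each country, look up the rows holding the smallest and largest gdpPercap
extremes <- do.call(rbind, lapply(split(gap, gap$country), function(g) {
  lo <- which.min(g$gdpPercap)
  hi <- which.max(g$gdpPercap)
  data.frame(
    country = g$country[1],
    minyear = g$year[lo], mingdp = g$gdpPercap[lo],
    maxyear = g$year[hi], maxgdp = g$gdpPercap[hi]
  )
}))
extremes
```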