mutate function with nested ifelse statements creating two columns instead of one - r

I have some cumulative data on COVID-19 cases for countries and I am trying to calculate the daily difference in a new column called Diff. I can't remove the NA values because then the dates on which no tests were carried out would disappear. So where ConfirmedCases is NA, I set Diff to 0 to indicate there was no difference, hence no tests conducted that day.
I am also trying to add a condition which says that if the lagged difference is NA, indicating that no tests were conducted the day before, then Diff should be set to the confirmed cases value for that day.
As you can see from my results at the bottom, I am almost there, but I am creating an extra column called ifelse(...). I tried to fix this, but I think I am making a simple error somewhere. If anyone could point it out to me I would really appreciate it, thank you.
Edit: I realised I made a logical error in setting the daily cases to the confirmed cases when the lag calculation is NA, because this gives a misleading answer.
I used the code below on the large dataset to fill down and repeat the previous values when NAs appear. I grouped by country so as not to propagate values forward across countries.
I then calculated the lag and used Ronak Shah's code to get the daily values:
data <- data %>%
group_by(CountryName) %>%
fill(ConfirmedCases, .direction = "down")
data <- data %>%
mutate(lag1 = ConfirmedCases - lag(ConfirmedCases))
data <- data %>% mutate(DailyCases = replace_na(coalesce(lag1, ConfirmedCases), 0))
library(tidyverse)
data <- data.frame(
stringsAsFactors = FALSE,
CountryName = c("Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan"),
ConfirmedCases = c(NA,7L,NA,NA,NA,10L,16L,21L,
22L,22L,22L,24L,24L,34L,40L,42L,
75L,75L,91L,106L,114L,141L,166L,
192L,235L,235L,270L,299L,337L,367L,
423L),
Diff = c(NA,NA,NA,NA,NA,NA,6L,5L,1L,
0L,0L,2L,0L,10L,6L,2L,33L,0L,16L,
15L,8L,27L,25L,26L,43L,0L,35L,
29L,38L,30L,56L)
)
data2 <- data %>%
mutate(Diff = ifelse(is.na(ConfirmedCases) == TRUE, 0, ConfirmedCases - lag(ConfirmedCases)),
ifelse(is.na((ConfirmedCases - lag(ConfirmedCases))) == TRUE, ConfirmedCases, ConfirmedCases - lag(ConfirmedCases)))
head(data2, 10)
#> CountryName ConfirmedCases Diff ifelse(...)
#> 1 Afghanistan NA 0 NA
#> 2 Afghanistan 7 NA 7
#> 3 Afghanistan NA 0 NA
#> 4 Afghanistan NA 0 NA
#> 5 Afghanistan NA 0 NA
#> 6 Afghanistan 10 NA 10
#> 7 Afghanistan 16 6 6
#> 8 Afghanistan 21 5 5
#> 9 Afghanistan 22 1 1
#> 10 Afghanistan 22 0 0
Created on 2020-08-15 by the reprex package (v0.3.0)

Maybe this can help by creating a duplicate of your target column:
library(tidyverse)
data %>% mutate(D=ConfirmedCases,D=ifelse(is.na(D),0,D),
Diff2 = c(0,diff(D)),Diff2=ifelse(Diff2<0,0,Diff2)) %>% select(-D)
Output:
CountryName ConfirmedCases Diff Diff2
1 Afghanistan NA NA 0
2 Afghanistan 7 NA 7
3 Afghanistan NA NA 0
4 Afghanistan NA NA 0
5 Afghanistan NA NA 0
6 Afghanistan 10 NA 10
7 Afghanistan 16 6 6
8 Afghanistan 21 5 5
9 Afghanistan 22 1 1
10 Afghanistan 22 0 0
11 Afghanistan 22 0 0
12 Afghanistan 24 2 2
13 Afghanistan 24 0 0
14 Afghanistan 34 10 10
15 Afghanistan 40 6 6
16 Afghanistan 42 2 2
17 Afghanistan 75 33 33
18 Afghanistan 75 0 0
19 Afghanistan 91 16 16
20 Afghanistan 106 15 15
21 Afghanistan 114 8 8
22 Afghanistan 141 27 27
23 Afghanistan 166 25 25
24 Afghanistan 192 26 26
25 Afghanistan 235 43 43
26 Afghanistan 235 0 0
27 Afghanistan 270 35 35
28 Afghanistan 299 29 29
29 Afghanistan 337 38 38
30 Afghanistan 367 30 30
31 Afghanistan 423 56 56

I think you can use coalesce to get the first non-NA value from Diff and ConfirmedCases and, if both of them are NA, replace it with 0.
library(dplyr)
data %>%
mutate(Diff2 = tidyr::replace_na(coalesce(Diff, ConfirmedCases), 0))
# CountryName ConfirmedCases Diff Diff2
#1 Afghanistan NA NA 0
#2 Afghanistan 7 NA 7
#3 Afghanistan NA NA 0
#4 Afghanistan NA NA 0
#5 Afghanistan NA NA 0
#6 Afghanistan 10 NA 10
#7 Afghanistan 16 6 6
#8 Afghanistan 21 5 5
#9 Afghanistan 22 1 1
#10 Afghanistan 22 0 0
#11 Afghanistan 22 0 0
#12 Afghanistan 24 2 2
#...
#...
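On why the original attempt produced an extra column: the second ifelse() in the question is passed to mutate() as a separate, unnamed argument, so dplyr creates a second column named after the expression. Below is a minimal sketch of nesting the two conditions into one named column. It uses a small made-up vector rather than the full data, and toy, step and toy2 are illustrative names, not from the question.

```r
library(dplyr)

# Small illustrative data, not the asker's full data set
toy <- data.frame(ConfirmedCases = c(NA, 7L, NA, 10L, 16L))

toy2 <- toy %>%
  mutate(
    # day-on-day change; NA whenever today or yesterday is NA
    step = ConfirmedCases - lag(ConfirmedCases),
    # both conditions nested inside ONE named argument
    Diff = ifelse(is.na(ConfirmedCases), 0L,
                  ifelse(is.na(step), ConfirmedCases, step))
  ) %>%
  select(-step)

toy2
```

As the edit to the question notes, falling back to ConfirmedCases when the lag is NA can be misleading for cumulative data, so this only illustrates the syntax fix.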

Related

Calculating rowsums of grouped data

I am using the dataset who (available in the tidyr package), which for 34 years counts the number of TB cases registered for 56 groups (combinations of gender, age and method of testing) for a number of countries. There is one row per country per year, and the first 4 columns hold the year, country name and similar identifiers.
I want to calculate the sum of new cases per country per year, but I just can't make it work.
I was expecting something like
group_by(who, country) %>% summarise(count = rowsum(.[5:60]))
would work, but it doesn't.
Can anyone help me understand why it doesn't work, and what to do instead?
You're missing a first step, which is to gather the data into a 'tidy' format. Try this:
who%>%
gather(key=type,value=cases,-country,-iso2,-iso3,-year)%>%
filter(!is.na(cases))%>%
group_by(country,year)%>%
summarise(sum(cases))
Which gives output:
# A tibble: 3,484 × 3
# Groups: country [219]
country year `sum(cases)`
<chr> <int> <int>
1 Afghanistan 1997 128
2 Afghanistan 1998 1778
3 Afghanistan 1999 745
4 Afghanistan 2000 2666
5 Afghanistan 2001 4639
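As for why the original attempt fails: part of the problem is that rowsum() (lowercase s) and rowSums() are different base R functions. rowsum() adds up rows within groups and needs a group argument, whereas rowSums() adds across the columns of each row. A small sketch on a made-up matrix (m, per_row and per_group are illustrative names):

```r
# Toy matrix: 3 rows, columns "a" and "b"
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# rowSums(): one total per row, adding across columns
per_row <- rowSums(m)

# rowsum(): column totals within each group of rows
per_group <- rowsum(m, group = c("x", "x", "y"))

per_row
per_group
```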
library(tidyverse)
(long_who <- who |> pivot_longer(cols = -(1:4)))
long_who |> filter(startsWith(name,"new")) |> # dont want things like "Population"
group_by(country) |>
summarise(sum_of_new_ = sum(value,na.rm=TRUE))
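The question asks for totals per country per year, so a variant of the long-format approach that keeps year in the grouping might look like the sketch below. The names type, cases and who_totals are my own choices, and .groups = "drop" is used only to return an ungrouped result.

```r
library(dplyr)
library(tidyr)

who_totals <- who %>%
  # every column except the four identifier columns becomes a (type, cases) pair
  pivot_longer(cols = -(country:year), names_to = "type", values_to = "cases") %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases, na.rm = TRUE), .groups = "drop")

head(who_totals)
```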
A base R approach:
data.frame(who[,c("country", "year")],
cnt = rowSums(who[5:60], na.rm = TRUE))
#> country year cnt
#> 1 Afghanistan 1980 0
#> 2 Afghanistan 1981 0
#> 3 Afghanistan 1982 0
#> 4 Afghanistan 1983 0
#> 5 Afghanistan 1984 0
#> 6 Afghanistan 1985 0
You could also do without the long format by using rowSums and across:
library(dplyr)
who |>
group_by(country, year) |>
summarise(count = rowSums(across(5:58), na.rm = TRUE)) |>
ungroup()
Alternatives to across(5:58):
across(starts_with("new"))
across(-(1:4))
Output:
# A tibble: 20 × 3
# Groups: country [1]
country year count
<chr> <int> <dbl>
1 Afghanistan 1980 0
2 Afghanistan 1981 0
3 Afghanistan 1982 0
4 Afghanistan 1983 0
5 Afghanistan 1984 0
6 Afghanistan 1985 0
7 Afghanistan 1986 0
8 Afghanistan 1987 0
9 Afghanistan 1988 0
10 Afghanistan 1989 0
11 Afghanistan 1990 0
12 Afghanistan 1991 0
13 Afghanistan 1992 0
14 Afghanistan 1993 0
15 Afghanistan 1994 0
16 Afghanistan 1995 0
17 Afghanistan 1996 0
18 Afghanistan 1997 128
19 Afghanistan 1998 1778
20 Afghanistan 1999 745

Make time-period observations into annual observations in R

I have a dataset (df1) on hundreds of national crises, where each observation is a crisis event at the country level with a start and an end date. I also have the date when the crisis was announced (yyyy-mm-dd format), and a bunch of other crisis characteristics.
df1 <- data.frame(cbind(eventID=c(1,2,3,4), country=c("ALB","ALB","ARG","ARG"), start=c(1994, 1998, 1998, 1991), end=c(1996,1999,1999,1993), announcement=c("1994-11-01","1998-03-01","1998-07-01","1992-01-01"), x1=c(6,2,8,7), x2=c("a","q","k","b")))
eventID country start end announcement x1 x2
1 ALB 1994 1996 1994-11-01 6 a
2 ALB 1998 1999 1998-03-01 2 q
3 ARG 1998 1999 1998-07-01 8 k
4 ARG 1991 1993 1992-01-01 7 b
I need to make df2, a panel of countries with annual observations from the earliest "start" year to the latest "end" year. I want to have a dummy variable, "crisis", that equals 1 for the years between "start" and "end" in df1, and 0 otherwise. I want "announcement" to contain the announcement date in df1 for the year with an announcement, and "NA" otherwise. I would like the extra crisis characteristics, x1 and x2, to show up for crisis years to which they correspond, and "NA" otherwise.
I also need observations for each country for years in which no country has a crisis (in df2: 1997).
df2 <- data.frame(cbind(year=c(1991,1992,1993,1994,1995,1996,1997,1998,1999,1991,1992,1993,1994,1995,1996,1997,1998,1999), country=c("ALB","ALB","ALB","ALB","ALB","ALB","ALB","ALB","ALB","ARG","ARG","ARG","ARG","ARG","ARG","ARG","ARG","ARG"),crisis=c(0,0,0,1,1,1,0,1,1,1,1,1,0,0,0,0,1,1), announcement=c(NA, NA,NA,"1994-11-01",NA,NA,NA,"1998-03-01",NA,NA,"1992-01-01",NA,NA,NA,NA,NA,"1998-07-01",NA), x1=c(NA,NA,NA,6,6,6,NA,2,2,8,8,8,NA,NA,NA,NA,7,7), x2=c(NA,NA,NA,"a","a","a",NA,"q","q","k","k","k",NA,NA,NA,NA,"b","b")))
year country crisis announcement x1 x2
1991 ALB 0 NA NA NA
1992 ALB 0 NA NA NA
1993 ALB 0 NA NA NA
1994 ALB 1 1994-11-01 6 a
1995 ALB 1 NA 6 a
1996 ALB 1 NA 6 a
1997 ALB 0 NA NA NA
1998 ALB 1 1998-03-01 2 q
1999 ALB 1 NA 2 q
1991 ARG 1 NA 8 k
1992 ARG 1 1992-01-01 8 k
1993 ARG 1 NA 8 k
1994 ARG 0 NA NA NA
1995 ARG 0 NA NA NA
1996 ARG 0 NA NA NA
1997 ARG 0 NA NA NA
1998 ARG 1 1998-07-01 7 b
1999 ARG 1 NA 7 b
I would love any suggestions! I'm stumped as to how to replicate the observations for each year, but only include x1 and x2 values when my new "crisis" dummy = 1
Thanks!
Making use of dplyr and tidyr this could be achieved like so:
library(dplyr)
library(tidyr)
df1 <- data.frame(cbind(eventID=c(1,2,3,4), country=c("ALB","ALB","ARG","ARG"), start=c(1994, 1998, 1998, 1991), end=c(1996,1999,1999,1993), announcement=c("1994-11-01","1998-03-01","1998-07-01","1992-01-01"), x1=c(6,2,8,7), x2=c("a","q","k","b")))
df1 %>%
mutate(year = factor(start, levels = min(start):max(end))) %>%
complete(year, country) %>%
mutate(year = as.numeric(as.character(year))) %>%
arrange(country, year) %>%
group_by(country) %>%
fill(eventID, end, x1, x2) %>%
ungroup() %>%
mutate(across(c(eventID, end, x1, x2), ~ ifelse(end < year, NA, .)),
crisis = as.numeric(!is.na(eventID)))
#> # A tibble: 18 x 9
#> year country eventID start end announcement x1 x2 crisis
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 1991 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 2 1992 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 3 1993 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 4 1994 ALB 1 1994 1996 1994-11-01 6 a 1
#> 5 1995 ALB 1 <NA> 1996 <NA> 6 a 1
#> 6 1996 ALB 1 <NA> 1996 <NA> 6 a 1
#> 7 1997 ALB <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 8 1998 ALB 2 1998 1999 1998-03-01 2 q 1
#> 9 1999 ALB 2 <NA> 1999 <NA> 2 q 1
#> 10 1991 ARG 4 1991 1993 1992-01-01 7 b 1
#> 11 1992 ARG 4 <NA> 1993 <NA> 7 b 1
#> 12 1993 ARG 4 <NA> 1993 <NA> 7 b 1
#> 13 1994 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 14 1995 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 15 1996 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 16 1997 ARG <NA> <NA> <NA> <NA> <NA> <NA> 0
#> 17 1998 ARG 3 1998 1999 1998-07-01 8 k 1
#> 18 1999 ARG 3 <NA> 1999 <NA> 8 k 1
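An alternative sketch, not the method above: expand each event into one row per crisis year, then join that onto a full country-by-year grid. Building df1 with data.frame() directly, without cbind(), keeps the numeric columns numeric. The names event_years and panel are illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  eventID = 1:4,
  country = c("ALB", "ALB", "ARG", "ARG"),
  start = c(1994, 1998, 1998, 1991),
  end = c(1996, 1999, 1999, 1993),
  announcement = c("1994-11-01", "1998-03-01", "1998-07-01", "1992-01-01"),
  x1 = c(6, 2, 8, 7),
  x2 = c("a", "q", "k", "b")
)

# One row per event and crisis year; keep the announcement date
# only in the year in which it falls
event_years <- df1 %>%
  mutate(year = Map(seq, start, end)) %>%
  unnest(year) %>%
  mutate(announcement = ifelse(substr(announcement, 1, 4) == as.character(year),
                               announcement, NA_character_))

# Full grid of countries and years, with crisis rows joined on
panel <- expand_grid(country = unique(df1$country),
                     year = min(df1$start):max(df1$end)) %>%
  left_join(event_years, by = c("country", "year")) %>%
  mutate(crisis = as.integer(!is.na(eventID))) %>%
  select(year, country, crisis, announcement, x1, x2)

panel
```

This assumes crises within a country do not overlap; overlapping events would produce more than one row for a country-year.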

R - calculate annual population conditional on survival in every year

I have a data frame with three columns: birth_year, death_year, gender.
I have to calculate total alive male and female population for every year in a given range (1950:1980).
The data frame looks like this:
birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female
The person is alive in year x if death_year > x & birth_year <= x
The output I am looking for is something like this:
year male female
1950 3 4
1951 2 3
1952 4 3
1953 4 5
.
.
1980 6 3
Thanks!
Does this work:
library(tidyr)
library(purrr)
library(dplyr)
df %>% mutate(year = map2(1950,1980, seq)) %>% unnest(year) %>%
mutate(isalive = case_when(year >= birth_year & year < death_year ~ 1, TRUE ~ 0)) %>%
group_by(year, gender) %>% summarise(alive = sum(isalive)) %>%
pivot_wider(names_from = gender, values_from = alive) %>% print( n = 50)
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 31 x 3
# Groups: year [31]
year female male
<int> <dbl> <dbl>
1 1950 4 3
2 1951 4 3
3 1952 4 3
4 1953 4 3
5 1954 4 3
6 1955 4 3
7 1956 4 2
8 1957 4 2
9 1958 4 2
10 1959 4 2
11 1960 4 2
12 1961 4 2
13 1962 4 2
14 1963 4 2
15 1964 4 2
16 1965 4 2
17 1966 4 1
18 1967 4 1
19 1968 4 1
20 1969 4 1
21 1970 4 1
22 1971 4 1
23 1972 4 1
24 1973 4 1
25 1974 4 1
26 1975 4 1
27 1976 3 1
28 1977 3 1
29 1978 3 1
30 1979 3 1
31 1980 3 1
Data used:
df
# A tibble: 9 x 3
birth_year death_year gender
<dbl> <dbl> <chr>
1 1934 1988 male
2 1922 1993 female
3 1890 1966 male
4 1901 1956 male
5 1946 2009 female
6 1909 1976 female
7 1899 1945 male
8 1887 1949 male
9 1902 1984 female
Here's a simple base R solution. Summing a logical vector will get you your count of alive or dead because TRUE is 1 and FALSE is 0.
number_alive <- function(range, df){
sapply(range, function(x) sum((df$death_year > x) & (df$birth_year <= x)))
}
output <- data.frame('year' = 1950:1980,
'female' = number_alive(1950:1980, df[df$gender == 'female', ]),
'male' = number_alive(1950:1980, df[df$gender == 'male', ]))
# year female male
# 1 1950 4 3
# 2 1951 4 3
# 3 1952 4 3
# 4 1953 4 3
# 5 1954 4 3
# 6 1955 4 3
# 7 1956 4 2
# 8 1957 4 2
# 9 1958 4 2
# 10 1959 4 2
# 11 1960 4 2
# 12 1961 4 2
# 13 1962 4 2
# 14 1963 4 2
# 15 1964 4 2
# 16 1965 4 2
# 17 1966 4 1
# 18 1967 4 1
# 19 1968 4 1
# 20 1969 4 1
# 21 1970 4 1
# 22 1971 4 1
# 23 1972 4 1
# 24 1973 4 1
# 25 1974 4 1
# 26 1975 4 1
# 27 1976 3 1
# 28 1977 3 1
# 29 1978 3 1
# 30 1979 3 1
# 31 1980 3 1
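The same count can also be sketched with outer(), which builds a year-by-person logical matrix in one step using the alive condition from the question (birth_year <= x and death_year > x). The names years, alive and counts are illustrative.

```r
df <- read.table(text = "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female", header = TRUE)

years <- 1950:1980

# alive[i, j] is TRUE if person j is alive in years[i]
alive <- outer(years, seq_len(nrow(df)),
               function(y, j) df$birth_year[j] <= y & df$death_year[j] > y)

# Summing logicals counts the TRUE values per year within each gender
counts <- data.frame(
  year = years,
  male = rowSums(alive[, df$gender == "male", drop = FALSE]),
  female = rowSums(alive[, df$gender == "female", drop = FALSE])
)

head(counts)
```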
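This approach uses ifelse to mark each person as alive (1) or dead (0) in each year. Note that it tests year <= death_year, so a person is still counted in the year of death, which is why the male count here only drops in 1957 rather than 1956.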
Data:
df <- "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female"
df <- read.table(text = df, header = TRUE)
Code:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
df %>%
mutate(year = map2(1950,1980, seq)) %>%
unnest(year) %>%
select(year, birth_year, death_year, gender) %>%
mutate(
alive = ifelse(year >= birth_year & year <= death_year, 1, 0)
) %>%
group_by(year, gender) %>%
summarise(
is_alive = sum(alive)
) %>%
pivot_wider(
names_from = gender,
values_from = is_alive
) %>%
select(year, male, female)
Output:
#> # A tibble: 31 x 3
#> # Groups: year [31]
#> year male female
#> <int> <dbl> <dbl>
#> 1 1950 3 4
#> 2 1951 3 4
#> 3 1952 3 4
#> 4 1953 3 4
#> 5 1954 3 4
#> 6 1955 3 4
#> 7 1956 3 4
#> 8 1957 2 4
#> 9 1958 2 4
#> 10 1959 2 4
#> # … with 21 more rows
Created on 2020-11-11 by the reprex package (v0.3.0)

Trying to convert data long format to wide format

My data frame currently looks like
country_txt Year nkill_yr Countrycode Population deathsPer100k
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1973 0 4 12028 0.000000e+00
2 Afghanistan 1979 53 4 13307 3.982866e-05
3 Afghanistan 1987 0 4 11503 0.000000e+00
4 Afghanistan 1988 128 4 11541 1.109089e-04
5 Afghanistan 1989 10 4 11778 8.490406e-06
6 Afghanistan 1990 12 4 12249 9.796718e-06
It contains a list of all countries and the deaths from terrorism per 100,000 population.
Ideally I would like a data frame in wide format that has the structure of:
country_txt 1970 1971 1972 1973 1974 1975
Afghanistan 3.98 1.1 0 4.3 0.8 0.09
Albania 0 0.4 0.5 0 0 0
Algeria 0 0 0 0.1 0.2 0
Angola 0 0.3 0 0 0 0
Except my call currently repeats each country like this:
YearCountryRatio<- spread(data = YearCountryRatio, Year, deathsPer100k )
country_txt 1970 1971 1972 1973
Afghanistan 3.98 NA NA NA
Afghanistan NA 1.1 NA NA
Afghanistan NA NA 0 NA
Afghanistan NA NA NA 4.3
And similarly for other countries,
Is there any way to either:
Collapse all of the NA values to show only one country or
Put it directly into wide format?
I've assumed you want each country_txt value reduced to a single row and are happy to drop the unused variables. (Note: I added a dummy country_txt value of "XYZ" to the sample data to show how multiple countries spread)
library(dplyr)
library(tidyr)
df <- read.table(text = "country_txt Year nkill_yr Countrycode Population deathsPer100k
1 Afghanistan 1973 0 4 12028 0.000000e+00
2 Afghanistan 1979 53 4 13307 3.982866e-05
3 Afghanistan 1987 0 4 11503 0.000000e+00
4 XYZ 1988 128 4 11541 1.109089e-04
5 XYZ 1989 10 4 11778 8.490406e-06
6 XYZ 1990 12 4 12249 9.796718e-06", header = TRUE)
df <- mutate(df, deathsPer100k = round(deathsPer100k*100000, 2))
select(df, country_txt, Year, deathsPer100k) %>% spread(Year, deathsPer100k, fill = 0)
#> country_txt 1973 1979 1987 1988 1989 1990
#> 1 Afghanistan 0 3.98 0 0.00 0.00 0.00
#> 2 XYZ 0 0.00 0 11.09 0.85 0.98
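The repeated rows in the question come from the extra columns (nkill_yr, Population and so on): spread() treats every column other than the key and value as identifying a row. spread() has since been superseded in tidyr by pivot_wider(), where id_cols states explicitly which column identifies a row. A sketch on the same sample data, with wide as an illustrative name:

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "country_txt Year nkill_yr Countrycode Population deathsPer100k
1 Afghanistan 1973 0 4 12028 0.000000e+00
2 Afghanistan 1979 53 4 13307 3.982866e-05
3 Afghanistan 1987 0 4 11503 0.000000e+00
4 XYZ 1988 128 4 11541 1.109089e-04
5 XYZ 1989 10 4 11778 8.490406e-06
6 XYZ 1990 12 4 12249 9.796718e-06", header = TRUE)

wide <- df %>%
  mutate(deathsPer100k = round(deathsPer100k * 1e5, 2)) %>%
  # only country_txt identifies a row; the other columns are dropped
  pivot_wider(id_cols = country_txt,
              names_from = Year,
              values_from = deathsPer100k,
              values_fill = 0)

wide
```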

Summing in R with multiple conditions

I'm trying to sum columns 4 (child), 5 (adult) and 6 (elderly) and return values for each country by year, disregarding column 3 (sex). Despite reading through various forums I cannot work out how to combine these:
country year sex child adult elderly
1 Afghanistan 1995 male -1 -1 -1
2 Afghanistan 1996 female -1 -1 -1
3 Afghanistan 1996 male -1 -1 -1
4 Afghanistan 1997 female 5 96 1
5 Afghanistan 1997 male 0 26 0
6 Afghanistan 1998 female 45 1142 20
I was able to sum the 3 columns by row and create a separate column with the following but still need to combine the male and female rows for each country:
tuberculosiscases <-tuberculosis$child + tuberculosis$adult + tuberculosis$elderly
names(tuberculosiscases) <- c("tuberculosiscases")
tuberculosis <- data.frame(tuberculosis,tuberculosiscases)
head(tuberculosis)
country year sex child adult elderly tuberculosiscases
1 Afghanistan 1995 male -1 -1 -1 -3
2 Afghanistan 1996 female -1 -1 -1 -3
3 Afghanistan 1996 male -1 -1 -1 -3
4 Afghanistan 1997 female 5 96 1 102
5 Afghanistan 1997 male 0 26 0 26
6 Afghanistan 1998 female 45 1142 20 1207
If you want to add the sum to your data frame, you have several options:
# with base R (1)
transform(dat, tuber.sum = ave(tuberculosiscases, country, year, FUN = sum))
# with base R (2)
dat$tuber.sum <- ave(dat$tuberculosiscases, dat$country, dat$year, FUN = sum)
# with the data.table package
library(data.table)
setDT(dat)[, tuber.sum:=sum(tuberculosiscases), by= .(country, year)]
# with the plyr package
library(plyr)
dat <- ddply(dat, .(country, year), transform, tuber.sum=sum(tuberculosiscases))
# with the dplyr package
library(dplyr)
dat <- dat %>%
group_by(country, year) %>%
mutate(tuber.sum=sum(tuberculosiscases))
all give:
> dat
country year sex child adult elderly tuberculosiscases tuber.sum
1: Afghanistan 1995 male -1 -1 -1 -3 -3
2: Afghanistan 1996 female -1 -1 -1 -3 -6
3: Afghanistan 1996 male -1 -1 -1 -3 -6
4: Afghanistan 1997 female 5 96 1 102 128
5: Afghanistan 1997 male 0 26 0 26 128
6: Afghanistan 1998 female 45 1142 20 1207 1207
If I correctly understand your question and assuming that the name of the initial data.frame is my_df I would use aggregate:
aggdata <-aggregate(my_df[,c("child", "adult", "elderly")],
by=list(my_df$country,my_df$year), FUN=sum, na.rm=TRUE)
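The call above returns separate sums for child, adult and elderly. If a single total per country and year is wanted, one option is to add the three columns first and then use the formula interface of aggregate(). A sketch on the sample rows from the question, with total and by_country_year as illustrative names:

```r
tuberculosis <- read.table(text = "country year sex child adult elderly
1 Afghanistan 1995 male -1 -1 -1
2 Afghanistan 1996 female -1 -1 -1
3 Afghanistan 1996 male -1 -1 -1
4 Afghanistan 1997 female 5 96 1
5 Afghanistan 1997 male 0 26 0
6 Afghanistan 1998 female 45 1142 20", header = TRUE)

# Row-wise total across the three age groups
tuberculosis$total <- with(tuberculosis, child + adult + elderly)

# One row per country and year, summing over sex
by_country_year <- aggregate(total ~ country + year, data = tuberculosis, FUN = sum)

by_country_year
```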
