I have a table of cash flows for various projects over time (years) and want to calculate the IRR for each project. I can't seem to select the appropriate columns, which vary, for each project. The table structure is as follows:
structure(list(`Portfolio Company` = c("Ventures II", "Pal III",
"River Fund II", "Ventures III"),
minc = c(2007, 2008, 2008, 2012),
maxc = c(2021, 2021, 2021, 2020),
num_pers = c(14, 13, 13, 8),
`2007` = c(-660000, NA, NA, NA),
`2008` = c(-525000, -954219, -1427182.55, NA),
`2009` = c(-351991.03, -626798, -1694353.41, NA),
`2010` = c(-299717.06, -243248, -1193954, NA),
`2011` = c(-239257.08, 465738, -288309, NA),
`2012` = c(-9057.31000000001, -369011, 128509.63, -480000),
`2013` = c(-237233.9, -131111, 53718, -411734.58),
`2014` = c(-106181.76, -271181, 887640, -600000),
`2015` = c(-84760.51, 441808, 906289, -900000),
`2016` = c(2770719.21, -377799, 166110, -150000),
`2017` = c(157820.08, -12147, 1425198, -255000),
`2018` = c(204424.36,-1626110, 361270, -180000),
`2019` = c(563463.62, 119577, 531555, 3300402.62),
`2020` = c(96247.29, 7057926, 2247027, 36111.6),
`2021` = c(614848.68, 1277996, 258289, NA)),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L),
groups = structure(list(`Portfolio Company` =c("Ventures II","Ventures III","Pal III", "River Fund II"),
.rows = structure(list(1L, 4L, 2L, 3L),
ptype = integer(0),
class = c("vctrs_list_of", "vctrs_vctr", "list"))),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L), .drop = TRUE))
Each project (Portfolio Company) has a different start and end date, captured in the minc and maxc columns. I would like to use the values in minc and maxc to select the columns minc:maxc for each project and perform the IRR calculation. I get a variety of errors, including "object maxc not found", incorrect arg ... I have tried about 20 combinations of !!sym, as.String (from the NLP package) ... none works.
This is the code that created the table and the problematic select code:
sum_fund_CF <- funds %>% group_by(`TX_YR`, `Portfolio Company`) %>%
summarise(CF=sum(if_else(is.na(Proceeds),0,Proceeds)-if_else(is.na(Investment),0,Investment))) %>% ungroup() #organizes source data and calculates cash flows
sum_fund_CF <- sum_fund_CF %>%
group_by(`Portfolio Company`) %>% mutate(minc=min(`TX_YR`),maxc=max(`TX_YR`),num_pers=maxc-minc) %>%
pivot_wider(names_from = TX_YR, values_from = `CF`) #creates the table and finds first year and last year of cash flow, and num of periods between them
sum_fund_CF %>% group_by(`Portfolio Company`)%>% select(!!sym(as.String(maxc))):!!sym(as.String(max))) #want to select appropriate columns for each record to do the IRR analysis ... IRR() ... need a string of cash flows and no NA.
I'm sure it's something simple, but this has me perplexed. Thanks !
I followed this article on how to calculate IRR using the jrvFinance package; you can modify your definition of IRR accordingly.
After group_by, the filter function from the dplyr package is used to select the years indicated by the minc and maxc columns.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
library(jrvFinance)
data <- structure(list(`Portfolio Company` = c("Ventures II", "Pal III",
"River Fund II", "Ventures III"),
minc = c(2007, 2008, 2008, 2012),
maxc = c(2021, 2021, 2021, 2020),
num_pers = c(14, 13, 13, 8),
`2007` = c(-660000, NA, NA, NA),
`2008` = c(-525000, -954219, -1427182.55, NA),
`2009` = c(-351991.03, -626798, -1694353.41, NA),
`2010` = c(-299717.06, -243248, -1193954, NA),
`2011` = c(-239257.08, 465738, -288309, NA),
`2012` = c(-9057.31000000001, -369011, 128509.63, -480000),
`2013` = c(-237233.9, -131111, 53718, -411734.58),
`2014` = c(-106181.76, -271181, 887640, -600000),
`2015` = c(-84760.51, 441808, 906289, -900000),
`2016` = c(2770719.21, -377799, 166110, -150000),
`2017` = c(157820.08, -12147, 1425198, -255000),
`2018` = c(204424.36,-1626110, 361270, -180000),
`2019` = c(563463.62, 119577, 531555, 3300402.62),
`2020` = c(96247.29, 7057926, 2247027, 36111.6),
`2021` = c(614848.68, 1277996, 258289, NA)),
class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L),
groups = structure(list(`Portfolio Company` =c("Ventures II","Ventures III","Pal III", "River Fund II"),
.rows = structure(list(1L, 4L, 2L, 3L),
ptype = integer(0),
class = c("vctrs_list_of", "vctrs_vctr", "list"))),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L), .drop = TRUE))
clean_data <- data %>%
clean_names() %>%
ungroup() %>%
pivot_longer(cols = -1:-4,
names_to = "year",
values_to = "cashflow") %>%
mutate(year = str_replace(year, "x", ""),
year = as.numeric(year))
clean_data %>%
print(n = 20)
#> # A tibble: 60 x 6
#> portfolio_company minc maxc num_pers year cashflow
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Ventures II 2007 2021 14 2007 -660000
#> 2 Ventures II 2007 2021 14 2008 -525000
#> 3 Ventures II 2007 2021 14 2009 -351991.
#> 4 Ventures II 2007 2021 14 2010 -299717.
#> 5 Ventures II 2007 2021 14 2011 -239257.
#> 6 Ventures II 2007 2021 14 2012 -9057.
#> 7 Ventures II 2007 2021 14 2013 -237234.
#> 8 Ventures II 2007 2021 14 2014 -106182.
#> 9 Ventures II 2007 2021 14 2015 -84761.
#> 10 Ventures II 2007 2021 14 2016 2770719.
#> 11 Ventures II 2007 2021 14 2017 157820.
#> 12 Ventures II 2007 2021 14 2018 204424.
#> 13 Ventures II 2007 2021 14 2019 563464.
#> 14 Ventures II 2007 2021 14 2020 96247.
#> 15 Ventures II 2007 2021 14 2021 614849.
#> 16 Pal III 2008 2021 13 2007 NA
#> 17 Pal III 2008 2021 13 2008 -954219
#> 18 Pal III 2008 2021 13 2009 -626798
#> 19 Pal III 2008 2021 13 2010 -243248
#> 20 Pal III 2008 2021 13 2011 465738
#> # ... with 40 more rows
clean_data %>%
group_by(portfolio_company) %>%
filter(between(year, min(minc), max(maxc))) %>%
summarise(irr = irr(cashflow,
cf.freq = 1))
#> # A tibble: 4 x 2
#> portfolio_company irr
#> <chr> <dbl>
#> 1 Pal III 0.111
#> 2 River Fund II 0.0510
#> 3 Ventures II 0.0729
#> 4 Ventures III 0.0251
Created on 2022-01-04 by the reprex package (v2.0.1)
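As a cross-check on what irr() returns here: the IRR is the discount rate at which the net present value of the annual flows is zero, so the Ventures II figure above can be reproduced with base R's uniroot(). A sketch (the flows are copied from the dput in the question):

```r
# Ventures II annual cash flows, 2007-2021 (from the question's dput)
cf <- c(-660000, -525000, -351991.03, -299717.06, -239257.08, -9057.31,
        -237233.9, -106181.76, -84760.51, 2770719.21, 157820.08,
        204424.36, 563463.62, 96247.29, 614848.68)

# Net present value at rate r, with the first flow at t = 0
npv <- function(r, cf) sum(cf / (1 + r)^(seq_along(cf) - 1))

# The flows change sign exactly once (outflows then inflows), so there
# is a single IRR; search for it between 0% and 100%
irr_check <- uniroot(npv, interval = c(0, 1), cf = cf, tol = 1e-9)$root
round(irr_check, 4)  # ~0.0729, matching the jrvFinance result above
```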
Another way to do it, using jrvFinance::irr().
library(jrvFinance)
library(tidyverse)
df %>%
rowwise() %>%
summarise(irr = irr(na.omit(c_across(matches('^\\d')))), .groups = 'drop')
#> # A tibble: 4 × 2
#> `Portfolio Company` irr
#> <chr> <dbl>
#> 1 Ventures II 0.0729
#> 2 Pal III 0.111
#> 3 River Fund II 0.0510
#> 4 Ventures III 0.0251
Created on 2022-01-04 by the reprex package (v2.0.1)
I would like to convert the dot (decimal separator) to comma as decimal separator.
I tried using format(decimal.mark=",") but got an error.
df<-structure(list(ponto = c("F01", "F02", "F03", "F04", "F05", "F06"
), `Vegetação Nativa` = c(0.09, 3.12, 8.22, 5.92, 1.95, 4.7),
Agricultura = c(91.78, 91.87, 100, 100, 91.5, 99.38), Pastagem = c(-16.99,
-33.16, -22.73, -24.12, -38, -47.3), `Área Urbana` = c(27.32,
27.32, 27.57, 27.57, 19.18, NaN), `Solo Exposto` = c(10.04,
2.13, 8.5, 6.64, -29.35, -442.86), `Corpo Hídrico` = c(-15.62,
-15.62, NaN, NaN, -17.11, -25.93)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
ponto = c("F01", "F02", "F03", "F04", "F05", "F06"), .rows = structure(list(
1L, 2L, 3L, 4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), .drop = TRUE))
I tried this, but got an error:
df%>%
format(decimal.mark=",")
One way is to use mutate() and across() from dplyr, though this will still change the columns' type to character.
library(dplyr)
df %>%
  mutate(across(everything(), ~ format(.x, decimal.mark = ",")))
Output
# A tibble: 6 × 7
# Groups: ponto [6]
ponto `Vegetação Nativa` Agricultura Pastagem `Área Urbana` `Solo Exposto` `Corpo Hídrico`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 F01 0,09 91,78 -16,99 27,32 10,04 -15,62
2 F02 3,12 91,87 -33,16 27,32 2,13 -15,62
3 F03 8,22 100 -22,73 27,57 8,5 NaN
4 F04 5,92 100 -24,12 27,57 6,64 NaN
5 F05 1,95 91,5 -38 19,18 -29,35 -17,11
6 F06 4,7 99,38 -47,3 NaN -442,86 -25,93
Additionally, if you simply want to change how you see the data when printing, plotting, etc., you can change the default options. You can also read more about it here (this post has a lot of discussion directly related to your question).
options(OutDec= ",")
Examples (after changing options):
c(1.5, 3.456, 40000.89)
# [1] 1,500 3,456 40000,890
However, the caveat is that the data must be character. So with your data, we can convert the columns to character, and they will then display with a comma rather than a period.
df %>% mutate(across(everything(), as.character))
# A tibble: 6 × 7
# Groups: ponto [6]
ponto `Vegetação Nativa` Agricultura Pastagem `Área Urbana` `Solo Exposto` `Corpo Hídrico`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 F01 0,09 91,78 -16,99 27,32 10,04 -15,62
2 F02 3,12 91,87 -33,16 27,32 2,13 -15,62
3 F03 8,22 100 -22,73 27,57 8,5 NaN
4 F04 5,92 100 -24,12 27,57 6,64 NaN
5 F05 1,95 91,5 -38 19,18 -29,35 -17,11
6 F06 4,7 99,38 -47,3 NaN -442,86 -25,93
I have two datasets on the same 2 patients. With the second dataset I want to add new information to the first, but I can't seem to get the code right.
My first (incomplete) dataset has a patient ID, measurement time (either T0 or FU1), year of birth, date of the CT scan, and two outcomes (legs_mass and total_mass):
library(tidyverse)
library(dplyr)
library(magrittr)
library(lubridate)
df1 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, NA, NA, NA), total_mass = c(14.5, NA,
NA, NA)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# Which gives the following dataframe
df1
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 NA NA
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 NA NA
The second dataset adds to the legs_mass and total_mass columns:
df2 <- structure(list(ID = c(115, 370), date_ct = structure(c(17842,
18535), class = "Date"), ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
"PXE370_CT_20200930_xxxxx-403.tif"), legs_mass = c(956.1, 21.3
), total_mass = c(1015.9, 21.3)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
# Which gives the following dataframe:
df2
# A tibble: 2 x 5
ID date_ct ctscan_label legs_mass total_mass
<dbl> <date> <chr> <dbl> <dbl>
1 115 2018-11-07 PXE115_CT_20181107_xxxxx-3.tif 956. 1016.
2 370 2020-09-30 PXE370_CT_20200930_xxxxx-403.tif 21.3 21.3
What I am trying to do is:
Add the legs_mass and total_mass column values from df2 to df1, based on ID number and date_ct.
Add the new columns of df2 (the one that is not in df1; ctscan_label) to df1, also based on the date of the ct and patient ID.
So that the final dataset df3 looks as follows:
df3 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, 956.1, NA, 21.3), total_mass = c(14.5,
1015.9, NA, 21.3)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
# Corresponding to the following tibble:
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 956. 1016.
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 21.3 21.3
I have tried the merge function and rbind from base R, and bind_rows from dplyr, but can't seem to get it right.
Any help?
You can join the two datasets and use coalesce to keep one non-NA value from the two datasets.
library(dplyr)
left_join(df1, df2, by = c("ID", "date_ct")) %>%
mutate(leg_mass = coalesce(legs_mass.x , legs_mass.y),
total_mass = coalesce(total_mass.x, total_mass.y)) %>%
select(-matches('\\.x|\\.y'), -ctscan_label)
# ID time year_of_birth date_ct leg_mass total_mass
# <dbl> <fct> <dbl> <date> <dbl> <dbl>
#1 115 T0 1970 2015-08-04 9.1 14.5
#2 115 FU1 1970 2018-11-07 956. 1016.
#3 370 T0 1961 2015-08-04 NA NA
#4 370 FU1 1961 2020-09-30 21.3 21.3
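Another option, if you are on dplyr >= 1.0.0: rows_patch() was designed for exactly this pattern and overwrites only the NA values of df1 with matching values from df2. A sketch (df1 and df2 are rebuilt here so the chunk is self-contained; df2's extra ctscan_label column is dropped first, since rows_patch() requires every column of y to exist in x):

```r
library(dplyr)

# df1 / df2 as in the question (time kept as plain character here)
df1 <- tibble(ID = c(115, 115, 370, 370),
              time = c("T0", "FU1", "T0", "FU1"),
              year_of_birth = c(1970, 1970, 1961, 1961),
              date_ct = as.Date(c("2015-08-04", "2018-11-07",
                                  "2015-08-04", "2020-09-30")),
              legs_mass = c(9.1, NA, NA, NA),
              total_mass = c(14.5, NA, NA, NA))
df2 <- tibble(ID = c(115, 370),
              date_ct = as.Date(c("2018-11-07", "2020-09-30")),
              ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
                               "PXE370_CT_20200930_xxxxx-403.tif"),
              legs_mass = c(956.1, 21.3),
              total_mass = c(1015.9, 21.3))

# rows_patch() fills only the NA cells of df1 from matching rows of df2
df3 <- rows_patch(df1, select(df2, -ctscan_label), by = c("ID", "date_ct"))
df3
```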
We can use data.table methods
library(data.table)
setDT(df1)[setDT(df2), c("legs_mass", "total_mass") :=
.(fcoalesce(legs_mass, i.legs_mass),
fcoalesce(total_mass, i.total_mass)), on = .(ID, date_ct)]
Output (note that := modifies df1 by reference, so no assignment is needed):
df1
ID time year_of_birth date_ct legs_mass total_mass
1: 115 T0 1970 2015-08-04 9.1 14.5
2: 115 FU1 1970 2018-11-07 956.1 1015.9
3: 370 T0 1961 2015-08-04 NA NA
4: 370 FU1 1961 2020-09-30 21.3 21.3
I am trying to use a pipe to filter and calculate Cohen's d statistic, but for some reason R will not recognise the column. I've tried this so many ways and can't get it to run.
df2 %>% filter(`Spreadsheet Row` == "Self-driving cars", `Zone Name` != "slider") %>% cohen.d(., Question,alpha=.05, data = Response)
Throws error: Error in cohen.d(., Question, alpha = 0.05, data = Response) : object 'Question' not found
This is the dataframe:
> df2 %>% filter(`Spreadsheet Row` == "Self-driving cars", `Zone Name` != "slider")
# A tibble: 96 x 6
`Event Index` PID `Spreadsheet Row` Question `Zone Name` Response
<chr> <dbl> <fct> <ord> <chr> <dbl>
1 17 3799252 Self-driving cars Pre-stim core_belief 3
2 18 3799252 Self-driving cars Pre-stim right_wrong 2
3 19 3799252 Self-driving cars Pre-stim moral_issue 4
4 20 3799252 Self-driving cars Pre-stim just_know 3
5 25 3799252 Self-driving cars Post-stim core_belief 4
6 26 3799252 Self-driving cars Post-stim right_wrong 5
7 27 3799252 Self-driving cars Post-stim moral_issue 3
8 28 3799252 Self-driving cars Post-stim just_know 4
9 65 3799288 Self-driving cars Pre-stim core_belief 4
10 66 3799288 Self-driving cars Pre-stim right_wrong 4
And it is clearly recognised if I use select():
df2 %>% filter(`Spreadsheet Row` == "Self-driving cars", `Zone Name` != "slider") %>% select(Question)
# A tibble: 96 x 1
Question
<ord>
1 Pre-stim
2 Pre-stim
3 Pre-stim
4 Pre-stim
5 Post-stim
6 Post-stim
7 Post-stim
8 Post-stim
9 Pre-stim
10 Pre-stim
# ... with 86 more rows
But as soon as I try and use the column in any way it throws the object not found error. Driving me nuts!
dput(head(df2))
structure(list(`Event Index` = c(2, 3, 4, 5, 6, 11), PID = c(3800586,
3800586, 3800586, 3800586, 3800586, 3800586), `Spreadsheet Row` = structure(c(4L,
4L, 4L, 4L, 4L, 4L), .Label = c("E-waste", "Meat", "Plastic",
"Self-driving cars"), class = "factor"), Question = c("Familiarisation",
"Pre-stim", "Pre-stim", "Pre-stim", "Pre-stim", "Post-stim"),
`Zone Name` = c("slider", "core_belief", "right_wrong", "moral_issue",
"just_know", "core_belief"), Response = c(6, 5, 4, 5, 3,
7)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
), problems = structure(list(row = c(1543L, 1543L), col = c("Event Index",
NA), expected = c("a double", "62 columns"), actual = c("END OF FILE",
"1 columns"), file = c("'data_exp_44331-v24_task-4xn8.csv'",
"'data_exp_44331-v24_task-4xn8.csv'")), row.names = c(NA, -2L
), class = c("tbl_df", "tbl", "data.frame")))
The problem is that cohen.d() does not use tidy evaluation, so the bare column names are not looked up inside the piped data frame. Using the rstatix package version, which takes a formula and a data frame, works fine:
df2 %>% filter(`Spreadsheet Row` == "Self-driving cars", `Zone Name` != "slider") %>% rstatix::cohens_d(., Response ~ Question)
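The underlying issue is data masking: dplyr verbs evaluate bare column names inside the piped data frame, but an ordinary function such as cohen.d() does not, so Question is looked up in the global environment and not found. A toy illustration of the difference (the data and names here are made up):

```r
library(dplyr)

d <- tibble(g = rep(c("a", "b"), each = 3),
            y = c(1, 2, 3, 4, 5, 6))

# dplyr verbs provide a data mask, so bare column names resolve:
d %>% summarise(m = mean(y))   # m = 3.5

# An ordinary function piped the same way receives the data frame as its
# first argument but has no idea that y names a column; wrap it in with()
# (or use a formula + data interface, as rstatix::cohens_d does):
with(d, tapply(y, g, mean))    # a = 2, b = 5
```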
I'm trying to graph excess deaths for 2020 against confirmed covid-19 deaths.
I have 2 dataframes, one x_worldwide_weekly_deaths (covid-19) and the other containing excess deaths. I want to add an excess deaths column to x_worldwide_weekly_deaths, matched on both ISO3 country code and week number;
Not every country tracks excess deaths so I want those not within the original excess df to have an NA value
Likewise, not every country who track excess deaths are as up to date, some have 37 weeks of data, others might only have 24, so I want the NA values for the missing weeks also
Using the below, I've gotten halfway there: countries not in the original list have NA and those that are have a value. However, it only uses the first value rather than the changing total per week.
x_worldwide_weekly_death_values["excess_2020"] <- excess_death_2020$DTotal[match(x_worldwide_weekly_death_values$ISO3,
excess_death_2020$ISO3)]
Example of the data not in the original excess_death_2020 file which have had NA's added successfully
ISO3 administrative_~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AFG Afghanistan 37172386 56.937760009803 1 0 2020-01-06 NA
2 AFG Afghanistan 37172386 56.937760009803 2 0 2020-01-13 NA
3 AFG Afghanistan 37172386 56.937760009803 3 0 2020-01-20 NA
dput() for the above:
dput(x_worldwide_weekly_death_values[1:3,])
structure(list(ISO3 = c("AFG", "AFG", "AFG"), administrative_area_level_1 = c("Afghanistan",
"Afghanistan", "Afghanistan"), population = c(37172386L, 37172386L,
37172386L), pop_density_km2 = c("56.937760009803", "56.937760009803",
"56.937760009803"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(NA_real_, NA_real_, NA_real_)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
Compared to Austria, where the week 1 value has been added to all cells
ISO3 administrative_a~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AUT Austria 8840521 107.1279668605~ 1 0 2020-01-06 1610
2 AUT Austria 8840521 107.1279668605~ 2 0 2020-01-13 1610
3 AUT Austria 8840521 107.1279668605~ 3 0 2020-01-20 1610
dput() for the above:
dput(x_worldwide_weekly_death_values[371:373,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), administrative_area_level_1 = c("Austria",
"Austria", "Austria"), population = c(8840521L, 8840521L, 8840521L
), pop_density_km2 = c("107.127966860564", "107.127966860564",
"107.127966860564"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(1610, 1610, 1610)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Expected output for the excess_2020 column would be the DTotal figures associated with the week number: Week 1 = 1610, Week 2 = 1702, Week 3 = 1797.
ISO3 Year Week Sex D0_14 D15_64 D65_74 D75_84 D85p DTotal R0_14 R15_64 R65_74 R75_84 R85p
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUT 2020 1 b 1 220 221 481 687 1610 4.07e-5 0.00196 0.0134 0.0399 0.157
2 AUT 2020 2 b 8 231 261 490 712 1702 3.26e-4 0.00206 0.0158 0.0407 0.163
3 AUT 2020 3 b 12 223 272 537 753 1797 4.89e-4 0.00198 0.0165 0.0446 0.173
dput() for the above
dput(excess_death_2020[1:3,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), Year = c(2020,
2020, 2020), Week = c(1, 2, 3), Sex = c("b", "b", "b"), D0_14 = c(1,
8, 12), D15_64 = c(220, 231, 223), D65_74 = c(221, 261, 272),
D75_84 = c(481, 490, 537), D85p = c(687, 712, 753), DTotal = c(1610,
1702, 1797), R0_14 = c(4.07296256273503e-05, 0.000325837005018803,
0.000488755507528204), R15_64 = c(0.00195783568851069, 0.00205572747293622,
0.00198453344789947), R65_74 = c(0.0133964529296798, 0.0158211502925177,
0.0164879420672982), R75_84 = c(0.0399495248686277, 0.0406970211759409,
0.044600613003021), R85p = c(0.157436284517545, 0.163165406952681,
0.172561167746305), RTotal = c(0.00948052042945739, 0.0100222644539978,
0.0105816740445559), Split = c(0, 0, 0), SplitSex = c(0,
0, 0), Forecast = c(1, 1, 1), date = structure(c(18267, 18274,
18281), class = "Date")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
I tried a few variations of the below with little success
x_worldwide_weekly_deaths["excess_2020"] <- excess_death_2020$DTotal[excess_death_2020$Week[match(x_worldwide_weekly_death_values$week_number
[x_worldwide_weekly_death_values$ISO3],
excess_death_2020$Week[excess_death_2020$CountryCode])]]
Should I not be using match() on multiple criteria or am I not formatting it correctly?
Really appreciate any help and suggestions!
dplyr is really good for this kind of thing. Here's a simplified example that achieves both of your goals (adding NA for countries that are not in the excess death data, and adding NA for weeks that are not in the excess death data)...
library(dplyr)
x_worldwide_weekly_death_values <-
tribble(
~iso3c, ~week, ~covid_deaths,
"AFG", 1, 0,
"AFG", 2, 10,
"AFG", 3, 30,
"AFG", 4, 50,
"AUT", 1, 120,
"AUT", 2, 200,
"AUT", 3, 320,
"AUT", 4, 465,
"XXX", 1, 10,
"XXX", 2, 20,
"XXX", 3, 30,
"XXX", 4, 40,
)
excess_death_2020 <-
tribble(
~iso3c, ~week, ~DTotal,
"AFG", 1, 0,
"AFG", 2, 0,
"AFG", 3, 0,
"AUT", 1, 1610,
"AUT", 2, 1702,
"AUT", 3, 1797,
)
x_worldwide_weekly_death_values %>%
left_join(excess_death_2020, by = c("iso3c", "week"))
#> # A tibble: 12 x 4
#> iso3c week covid_deaths DTotal
#> <chr> <dbl> <dbl> <dbl>
#> 1 AFG 1 0 0
#> 2 AFG 2 10 0
#> 3 AFG 3 30 0
#> 4 AFG 4 50 NA
#> 5 AUT 1 120 1610
#> 6 AUT 2 200 1702
#> 7 AUT 3 320 1797
#> 8 AUT 4 465 NA
#> 9 XXX 1 10 NA
#> 10 XXX 2 20 NA
#> 11 XXX 3 30 NA
#> 12 XXX 4 40 NA
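For completeness, the reason the original match() attempt only picked up one value is that match() compares a single vector; to match on two criteria in base R you would need to combine them into one composite key first. A sketch with small made-up data frames:

```r
x <- data.frame(iso3c = rep(c("AFG", "AUT", "XXX"), each = 2),
                week = rep(1:2, 3),
                covid_deaths = c(0, 10, 120, 200, 10, 20))
y <- data.frame(iso3c = c("AFG", "AUT", "AUT"),
                week = c(1, 1, 2),
                DTotal = c(0, 1610, 1702))

# Combine the two criteria into one key; unmatched rows get NA
x$DTotal <- y$DTotal[match(paste(x$iso3c, x$week), paste(y$iso3c, y$week))]
x
```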
Give a dataframe df as follows:
df <- structure(list(year = c(2001, 2002, 2003, 2004), `1` = c(22.0775,
24.2460714285714, 29.4039285714286, 27.7110714285714), `2` = c(27.2535714285714,
35.9996428571429, 26.39, 27.8557142857143), `3` = c(24.7710714285714,
25.4428571428571, 15.1142857142857, 19.9657142857143)), row.names = c(NA,
-4L), groups = structure(list(year = c(2001, 2002, 2003, 2004
), .rows = structure(list(1L, 2L, 3L, 4L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Output:
year 1 2 3
0 2001 22.07750 27.25357 24.77107
1 2002 24.24607 35.99964 25.44286
2 2003 29.40393 26.39000 15.11429
3 2004 27.71107 27.85571 19.96571
For columns 1, 2 and 3, how could I calculate the year-to-year absolute change?
The expected result will like this:
year 1 2 3
0 2002 2.16857 8.74607 0.67179
1 2003 5.15786 9.60964 10.32857
2 2004 1.69286 1.46571 4.85142
The final objective is to compare the values of columns 1, 2 and 3 across all years and find the year and column with the largest change; in this example, it should be 2003 and column 3.
How could I do that in R? Thanks.
You can use:
library(dplyr)
data <- df %>% ungroup %>% summarise(across(-1, ~abs(diff(.))))
data
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 2.17 8.75 0.672
#2 5.16 9.61 10.3
#3 1.69 1.47 4.85
To get the max change:
mat <- which(data == max(data), arr.ind = TRUE)
mat
# row col
#[1,] 2 3
#Year name
df$year[mat[, 1] + 1]
#[1] 2003
#Column name
mat[, 2]
#col
# 3
You can try:
library(reshape2)
library(dplyr)
#Melt
Melted <- reshape2::melt(df,id.vars = 'year')
#Group
Melted %>% group_by(variable) %>% mutate(Diff=c(0,abs(diff(value)))) %>% ungroup() %>%
filter(Diff==max(Diff))
# A tibble: 1 x 4
year variable value Diff
<dbl> <fct> <dbl> <dbl>
1 2003 3 15.1 10.3
We can apply diff() to the entire dataset by converting the numeric columns of interest to a matrix in base R:
cbind(year = df$year[-1], abs(diff(as.matrix(df[-1]))))
# year 1 2 3
#[1,] 2002 2.168571 8.746071 0.6717857
#[2,] 2003 5.157857 9.609643 10.3285714
#[3,] 2004 1.692857 1.465714 4.8514286
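To also pull out the year and column of the largest change from this base-R result (a sketch; df is rebuilt here as a plain data.frame so the chunk is self-contained, and m is just a local name):

```r
df <- data.frame(year = c(2001, 2002, 2003, 2004),
                 `1` = c(22.0775, 24.24607, 29.40393, 27.71107),
                 `2` = c(27.25357, 35.99964, 26.39, 27.85571),
                 `3` = c(24.77107, 25.44286, 15.11429, 19.96571),
                 check.names = FALSE)

m <- cbind(year = df$year[-1], abs(diff(as.matrix(df[-1]))))

# Position of the maximum absolute change among the value columns
idx <- which(m[, -1] == max(m[, -1]), arr.ind = TRUE)

m[idx[1, "row"], "year"]        # 2003
colnames(m)[-1][idx[1, "col"]]  # "3"
```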