for loop in R to compute yearly evolution of a variable - r

I repost here what I posted on stats exchange, having been told it was better suited for stack overflow. Here is the structure of my dataset for reproducibility :
structure(list(numero = c("133", "62", "75", "76", "86", "281"
), tranche_age = c("20-30", "20-30", "20-30", "20-30", "20-30",
"20-30"), tranche_anciennete = c("5 ans et moins", "5 à 10 ans",
"5 ans et moins", "5 ans et moins", "5 à 10 ans", "5 à 10 ans"
), code_statut = c("C", "E", "E", "E", "E", "E"), code_contrat = c("A",
"A", "A", "A", "A", "A"), taux_demploi_mois = c(100, 100, 100,
100, 100, 100), echelon = c("E1", NA, NA, NA, NA, NA), niveau = c("N7",
NA, NA, NA, NA, NA), brut_mensuel = c(NA, 786.13, 1156.95, 1156.95,
904.79, 904.79), estimation_annuelle = c(NA, 10219.69, 15040.35,
15040.35, 11762.27, 11762.27), annee = c(2017, 2017, 2017, 2017,
2017, 2017), primes_en_montant = c(0, 0, 0, 0, 0, 0), primes_en_pourcentage =
c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), brut_mensuel_ETP = c(NA,
786.13, 1156.95, 1156.95, 904.79, 904.79)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Each worker is identified by a number ("numero") that does not change from year to year. I would like to add a new variable to this dataframe representing the year-to-year evolution of "estimation_annuelle" (the yearly wage) for each worker, from 2017 to 2021, and then the average annual growth rate over the 5 years. Then I would like to view the workers who got less than a 2% raise in a given year (2017-2018, for example) and see whether it was caught up in the following years or not (that is, if a worker's wage increased by less than 2% between 2017 and 2018, whether the increase between 2018 and 2019 compensated for the insufficient raise of the previous period, and by how much).
I tried the following code to compute the year-to-year evolution, but it doesn't work:
test <- liste_complete %>%
group_by(annee, numero) %>%
select(numero, annee, estimation_annuelle)%>%
data.frame()
for(i in 1:length(test$estimation_annuelle)) {
print((test[i+1,] - test[i,])/test[i,])
}
I have also not found a way to compute the average annual growth rate (here is the formula: https://investinganswers.com/dictionary/a/average-annual-growth-rate-aagr), nor to check whether the insufficient increase, for those concerned, has been made up for in the following years.
Could anyone help?

We can use a summarise then a match.
df$annee <- c(2017, 2017, 2018,2018, 2019,2019)
df$brut_mensuel[1] <- 11000
# first, summarise
summary <- df %>% select(numero, annee, estimation_annuelle, brut_mensuel) %>%
group_by(annee) %>% summarise(estimation_annuelle=mean(brut_mensuel)) %>% arrange(annee) %>%
mutate(salaire_annee_prec = lag(estimation_annuelle),
variation_annee_precedente=(estimation_annuelle-salaire_annee_prec)/salaire_annee_prec)
# matching
df$variation_annee_prec <- summary$variation_annee_precedente[match(df$annee,summary$annee)]
df
# A tibble: 6 x 15
numero tranche_age tranche_anciennete code_statut code_contrat taux_demploi_mois echelon niveau brut_mensuel estimation_annuelle annee primes_en_montant
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 133 20-30 5 ans et moins C A 100 E1 N7 11000 NA 2017 0
2 62 20-30 5 à 10 ans E A 100 NA NA 786. 10220. 2017 0
3 75 20-30 5 ans et moins E A 100 NA NA 1157. 15040. 2018 0
4 76 20-30 5 ans et moins E A 100 NA NA 1157. 15040. 2018 0
5 86 20-30 5 à 10 ans E A 100 NA NA 905. 11762. 2019 0
6 281 20-30 5 à 10 ans E A 100 NA NA 905. 11762. 2019 0
primes_en_pourcentage brut_mensuel_ETP variation_annee_prec
<dbl> <dbl> <dbl>
1 NA NA NA
2 NA 786. NA
3 NA 1157. -0.804
4 NA 1157. -0.804
5 NA 905. -0.218
6 NA 905. -0.218
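The answer above compares yearly means across all workers. If the goal is the evolution per worker, as the question describes, one possible base R sketch follows. The toy data frame `wages` and its values are invented for illustration, and the column names `salaire_prec`, `evolution` and `sous_2pct` are hypothetical. The average annual growth rate is taken here as the arithmetic mean of the yearly growth rates.

```r
# Hypothetical toy panel: two workers, three years each (values are made up)
wages <- data.frame(
  numero = c("62", "62", "62", "75", "75", "75"),
  annee = c(2017, 2018, 2019, 2017, 2018, 2019),
  estimation_annuelle = c(10000, 10100, 10500, 15000, 15600, 15990)
)

# Sort so that each worker's rows are in chronological order
wages <- wages[order(wages$numero, wages$annee), ]

# Previous year's wage within each worker (NA for the worker's first year)
wages$salaire_prec <- ave(
  wages$estimation_annuelle, wages$numero,
  FUN = function(x) c(NA, head(x, -1))
)

# Year-over-year growth rate per worker
wages$evolution <- (wages$estimation_annuelle - wages$salaire_prec) /
  wages$salaire_prec

# Average annual growth rate per worker (rows with NA growth are dropped)
aagr <- aggregate(evolution ~ numero, data = wages, FUN = mean)

# Flag the worker-years with a raise below 2%
wages$sous_2pct <- !is.na(wages$evolution) & wages$evolution < 0.02
```

Filtering `wages` on `sous_2pct` and then looking at the same worker's `evolution` in the following year is one way to check whether a small raise was later compensated.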

Related

replacing rowwise() operations in grouped data

Anonymised example subset of a much larger dataset (now edited to show an option with multiple competing types):
structure(list(`Sample File` = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"),
Marker = c("X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Z",
"Z", "Z", "Z", "Z", "q", "q", "q", "q"), Allele = c(19, 20,
22, 23, 18, 18.2, 19, 19.2, 20, 12, 13, 14, 15, 16, 10, 10.2,
11, 12), Size = c(249.15, 253.13, 260.64, 264.68, 366, 367.81,
369.97, 372.02, 373.95, 91.65, 95.86, 100, 104.24, 108.38,
177.51, 179.4, 181.42, 185.49), Height = c(173L, 1976L, 145L,
1078L, 137L, 62L, 1381L, 45L, 1005L, 38L, 482L, 5766L, 4893L,
19L, 287L, 36L, 5001L, 50L), Type = c("minusone", "allele",
"minusone", "allele", "ambiguous", "minushalf", "allele",
"minushalf", "allele", "minustwo", "ambiguous", "allele",
"allele", "plusone", "minusone", "minushalf", "allele", "plusone"
), LUS = c(11.75, 11.286, 13.375, 13.5, 18, 9, 19, 10, 20,
12, 11, 14, 15, 16, 9.5, NA, 11, 11.5)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L), groups = structure(list(
`Sample File` = c("A", "A", "B", "C"), Marker = c("X", "Y",
"Z", "q"), .rows = structure(list(1:4, 5:9, 10:14, 15:18), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE))
I want to look up values based on the classification $Type.
"minustwo" means I want to look up the "Allele", "Height" and "LUS"
values for the row with "Allele" equal to the current row plus two,
with the same Sample File and Marker.
"minusone" means the same but for "Allele" equal to the current row plus one.
"minushalf" means the same but for "Allele" equal to the current row plus 0.2, except that the dot values here are 25% each, so 12.1, 12.2, 12.3, 13, 13.1 etc. I have a helper function plusTwoBP() for this.
"plusone" means the same for "Allele" equal to the current row -1
"allele" or "ambiguous" don't need to do anything.
Ideal output:
# A tibble: 18 × 10
# Rowwise: Sample File, Marker
`Sample File` Marker Allele Size Height Type LUS ParentHeight ParentAllele ParentLUS
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 A X 19 249. 173 minusone 11.8 1976 20 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 1078 23 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 1381 19 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 1005 20 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 5766 14 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 4893 15 15
15 C q 10 178. 287 minusone 9.5 5001 11 11
16 C q 10.2 179. 36 minushalf NA 5001 11 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 5001 11 11
I have a rather belaboured way of doing it:
# eg for minustwo
sampleData %>%
filter(Type == "minustwo") %>%
rowwise() %>%
mutate(ParentHeight = sampleData$Height[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentAllele = sampleData$Allele[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentLUS = sampleData$LUS[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)]) %>%
right_join(sampleData)
I then have to redo that for each of my Types
My real dataset is thousands of rows so this ends up being a little slow but manageable, but more to the point I want to learn a better way to do it, in particular the sampleData$'Sample File' == 'Sample File' & sampleData$Marker == Marker seems like it should be doable with grouping so I must be missing a trick there.
I have tried using group_map() but I've clearly not understood it correctly:
sampleData$ParentHeight <- sampleData %>%
group_by(`Sample File`, `Marker`) %>%
group_map(.f = \(.x, .y) {
pmap_dbl(.l = .x, .f = \(Allele, Height, Type, ...){
if(Type == "allele" | Type == "ambiguous") { return(0)
} else if (Type == "plusone") {
return(.x$Height[.x$Allele == round(Allele - 1, 1)])
} else if (Type == "minushalf") {
return(.x$Height[.x$Allele == round(plustwoBP(Allele), 1)])
} else if (Type == "minusone") {
return(.x$Height[.x$Allele == round(Allele + 1, 1)])
} else if (Type == "minustwo") {
return(.x$Height[.x$Allele == round(Allele + 2, 1)])
} else { stop("unexpected peak type") }
})}) %>% unlist()
Initially seems to work, but on investigation it's not respecting both layers of grouping, so brings matches from the wrong Marker. Additionally, here I'm assigning the output to a new column in the data frame, but if I try to instead wrap a mutate() around this so that I can create all three new columns in one go then the group_map() no longer works at all.
I also considered using complete() to hugely extend the data frame with all possible values of Allele (including x.0, x.1, x.2, x.3 variants), then using lag() to select the corresponding rows, then dropping the spare rows. This seems like it'd make the data frame enormous in the interim.
To summarise
This works, but it feels ugly and like I'm missing a more elegant and obvious solution. How would you approach this?
You can create two versions of Allele: one identical to the original Allele, and one that is equal to an adjustment based on minusone, minustwo, etc
Then do a self left join, based on that adjusted version of Allele (and Sample File and Marker)
sampleData = sampleData %>% group_by(`Sample File`,Marker) %>% mutate(id = Allele) %>% ungroup()
left_join(
sampleData %>%
mutate(id = case_when(
Type=="minusone"~id+1,
Type=="minustwo"~id+2,
Type=="plusone"~id-1,
Type=="minushalf"~ceiling(id))),
sampleData %>% select(-c(Size,Type)),
by=c("Sample File", "Marker", "id"),
suffix = c("", ".parent")
) %>% select(-id)
Output:
# A tibble: 18 × 10
`Sample File` Marker Allele Size Height Type LUS Allele.parent Height.parent LUS.parent
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <dbl> <int> <dbl>
1 A X 19 249. 173 minusone 11.8 20 1976 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 23 1078 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 19 1381 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 20 1005 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 14 5766 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 15 4893 15
15 C q 10 178. 287 minusone 9.5 11 5001 11
16 C q 10.2 179. 36 minushalf NA 11 5001 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 11 5001 11
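For comparison, the same self-join idea can be sketched in base R with merge(). The toy data frame `peaks`, its values, and the names `offset`, `target` and `parent_height` are invented for illustration. The "minushalf" case is left out because it depends on the asker's plusTwoBP() helper, which is not shown.

```r
# Hypothetical toy data: two samples sharing one marker
peaks <- data.frame(
  sample = c("A", "A", "A", "B", "B"),
  marker = c("X", "X", "X", "X", "X"),
  allele = c(19, 20, 21, 19, 20),
  height = c(173, 1976, 50, 40, 900),
  type   = c("minusone", "allele", "plusone", "minusone", "allele")
)

# Offset to add to the current allele to reach the parent allele;
# types missing from this vector ("allele", "ambiguous") give NA
offset <- c(minusone = 1, minustwo = 2, plusone = -1)
peaks$target <- peaks$allele + unname(offset[peaks$type])

# Lookup table: every row is a potential parent, keyed by its own allele
parents <- data.frame(
  sample = peaks$sample,
  marker = peaks$marker,
  target = peaks$allele,
  parent_height = peaks$height
)

# Left join on sample, marker and target allele (all.x keeps unmatched rows)
out <- merge(peaks, parents, by = c("sample", "marker", "target"), all.x = TRUE)

# merge() reorders rows, so restore a readable order
out <- out[order(out$sample, out$marker, out$allele), ]
```

Because sample and marker are part of the join key, a parent can never be picked up from a different sample or marker, which was the problem with the group_map() attempt.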

Can you use multiple conditions in match() function - R

I'm trying to graph excess deaths for 2020 against confirmed covid-19 deaths.
I have 2 dataframes, one x_worldwide_weekly_deaths (covid-19) and the other containing excess deaths, I want to add an excess deaths column to x_worldwide_weekly_deaths and match by both ISO3 country code, and week number;
Not every country tracks excess deaths so I want those not within the original excess df to have an NA value
Likewise, not every country that tracks excess deaths is equally up to date: some have 37 weeks of data, others might only have 24, so I want NA values for the missing weeks as well
Using the below, I've gotten halfway there, countries not on the original list have NA and those who are have a value, however it only uses the first value rather than changing total per week
x_worldwide_weekly_death_values["excess_2020"] <- excess_death_2020$DTotal[match(x_worldwide_weekly_death_values$ISO3,
excess_death_2020$ISO3)]
Example of the data not in the original excess_death_2020 file which have had NA's added successfully
ISO3 administrative_~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AFG Afghanistan 37172386 56.937760009803 1 0 2020-01-06 NA
2 AFG Afghanistan 37172386 56.937760009803 2 0 2020-01-13 NA
3 AFG Afghanistan 37172386 56.937760009803 3 0 2020-01-20 NA
dput() for the above:
dput(x_worldwide_weekly_death_values[1:3,])
structure(list(ISO3 = c("AFG", "AFG", "AFG"), administrative_area_level_1 = c("Afghanistan",
"Afghanistan", "Afghanistan"), population = c(37172386L, 37172386L,
37172386L), pop_density_km2 = c("56.937760009803", "56.937760009803",
"56.937760009803"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(NA_real_, NA_real_, NA_real_)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
Compared to Austria, where the week 1 value has been added to all cells
ISO3 administrative_a~ population pop_density_km2 week_number weekly_deaths date excess_2020
<chr> <chr> <int> <chr> <dbl> <dbl> <date> <dbl>
1 AUT Austria 8840521 107.1279668605~ 1 0 2020-01-06 1610
2 AUT Austria 8840521 107.1279668605~ 2 0 2020-01-13 1610
3 AUT Austria 8840521 107.1279668605~ 3 0 2020-01-20 1610
dput() for the above:
dput(x_worldwide_weekly_death_values[371:373,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), administrative_area_level_1 = c("Austria",
"Austria", "Austria"), population = c(8840521L, 8840521L, 8840521L
), pop_density_km2 = c("107.127966860564", "107.127966860564",
"107.127966860564"), week_number = c(1, 2, 3), weekly_deaths = c(0,
0, 0), date = structure(c(18267, 18274, 18281), class = "Date"),
excess_2020 = c(1610, 1610, 1610)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Expected output for excess_2020 column would be the DTotal column figures associated to the Week number; Week 1 = 1610, Week 2 = 1702, Week 3 = 1797
ISO3 Year Week Sex D0_14 D15_64 D65_74 D75_84 D85p DTotal R0_14 R15_64 R65_74 R75_84 R85p
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUT 2020 1 b 1 220 221 481 687 1610 4.07e-5 0.00196 0.0134 0.0399 0.157
2 AUT 2020 2 b 8 231 261 490 712 1702 3.26e-4 0.00206 0.0158 0.0407 0.163
3 AUT 2020 3 b 12 223 272 537 753 1797 4.89e-4 0.00198 0.0165 0.0446 0.173
dput() for the above
dput(excess_death_2020[1:3,])
structure(list(ISO3 = c("AUT", "AUT", "AUT"), Year = c(2020,
2020, 2020), Week = c(1, 2, 3), Sex = c("b", "b", "b"), D0_14 = c(1,
8, 12), D15_64 = c(220, 231, 223), D65_74 = c(221, 261, 272),
D75_84 = c(481, 490, 537), D85p = c(687, 712, 753), DTotal = c(1610,
1702, 1797), R0_14 = c(4.07296256273503e-05, 0.000325837005018803,
0.000488755507528204), R15_64 = c(0.00195783568851069, 0.00205572747293622,
0.00198453344789947), R65_74 = c(0.0133964529296798, 0.0158211502925177,
0.0164879420672982), R75_84 = c(0.0399495248686277, 0.0406970211759409,
0.044600613003021), R85p = c(0.157436284517545, 0.163165406952681,
0.172561167746305), RTotal = c(0.00948052042945739, 0.0100222644539978,
0.0105816740445559), Split = c(0, 0, 0), SplitSex = c(0,
0, 0), Forecast = c(1, 1, 1), date = structure(c(18267, 18274,
18281), class = "Date")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
I tried a few variations of the below with little success
x_worldwide_weekly_deaths["excess_2020"] <- excess_death_2020$DTotal[excess_death_2020$Week[match(x_worldwide_weekly_death_values$week_number
[x_worldwide_weekly_death_values$ISO3],
excess_death_2020$Week[excess_death_2020$CountryCode])]]
Should I not be using match() on multiple criteria or am I not formatting it correctly?
Really appreciate any help and suggestions!
dplyr is really good/easy for this kind of thing. Here's a simplified example that achieves both of your goals (adding NA for countries that are not in the excess death data, and adding NA for weeks that are not in the excess death data)...
library(dplyr)
x_worldwide_weekly_death_values <-
tribble(
~iso3c, ~week, ~covid_deaths,
"AFG", 1, 0,
"AFG", 2, 10,
"AFG", 3, 30,
"AFG", 4, 50,
"AUT", 1, 120,
"AUT", 2, 200,
"AUT", 3, 320,
"AUT", 4, 465,
"XXX", 1, 10,
"XXX", 2, 20,
"XXX", 3, 30,
"XXX", 4, 40,
)
excess_death_2020 <-
tribble(
~iso3c, ~week, ~DTotal,
"AFG", 1, 0,
"AFG", 2, 0,
"AFG", 3, 0,
"AUT", 1, 1610,
"AUT", 2, 1702,
"AUT", 3, 1797,
)
x_worldwide_weekly_death_values %>%
left_join(excess_death_2020, by = c("iso3c", "week"))
#> # A tibble: 12 x 4
#> iso3c week covid_deaths DTotal
#> <chr> <dbl> <dbl> <dbl>
#> 1 AFG 1 0 0
#> 2 AFG 2 10 0
#> 3 AFG 3 30 0
#> 4 AFG 4 50 NA
#> 5 AUT 1 120 1610
#> 6 AUT 2 200 1702
#> 7 AUT 3 320 1797
#> 8 AUT 4 465 NA
#> 9 XXX 1 10 NA
#> 10 XXX 2 20 NA
#> 11 XXX 3 30 NA
#> 12 XXX 4 40 NA
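To answer the title question more directly: match() compares one vector against another, so it cannot take two conditions as such, but a composite key built with paste() has the same effect. The sketch below is a minimal base R illustration. The data frames `covid` and `excess` are cut-down stand-ins for the asker's tables, with the Austrian DTotal values taken from the question and everything else invented.

```r
# Cut-down stand-ins for the two tables (only the columns needed here)
covid <- data.frame(
  ISO3 = c("AFG", "AFG", "AUT", "AUT", "AUT", "AUT"),
  week_number = c(1, 2, 1, 2, 3, 4),
  weekly_deaths = c(0, 0, 0, 0, 0, 0)
)
excess <- data.frame(
  ISO3 = c("AUT", "AUT", "AUT"),
  Week = c(1, 2, 3),
  DTotal = c(1610, 1702, 1797)
)

# Build one key per row from country code and week number
key_covid  <- paste(covid$ISO3, covid$week_number, sep = "_")
key_excess <- paste(excess$ISO3, excess$Week, sep = "_")

# match() returns NA where a country/week pair is absent from excess,
# and indexing with NA yields NA in the new column
covid$excess_2020 <- excess$DTotal[match(key_covid, key_excess)]
```

With real data the key columns should have the same type in both tables, since paste() turns 1 and "01" into different keys.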

Concat two different Data frames horizontally [duplicate]

I have two lists named h and g.
They each contain 244 dataframes and they look like the following:
h[[1]]
year avg hr sal
1 2010 0.300 31 2000
2 2011 0.290 30 4000
3 2012 0.275 14 600
4 2013 0.280 24 800
5 2014 0.295 18 1000
6 2015 0.330 26 7000
7 2016 0.315 40 9000
g[[1]]
year pos fld
1 2010 A 0.990
2 2011 B 0.995
3 2013 C 0.970
4 2014 B 0.980
5 2015 D 0.990
I want to cbind these two dataframes.
But as you see, they have different number of rows.
I want to combine these dataframes so that the rows with the same year will be combined in one row. And I want the empty spaces to be filled with NA.
The result I expect looks like this:
year avg hr sal pos fld
1 2010 0.300 31 2000 A 0.990
2 2011 0.290 30 4000 B 0.995
3 2012 0.275 14 600 NA NA
4 2013 0.280 24 800 C 0.970
5 2014 0.295 18 1000 B 0.980
6 2015 0.330 26 7000 D 0.990
7 2016 0.315 40 9000 NA NA
Also, I want to repeat this for all the 244 dataframes in each list, h and g.
I'd like to make a new list named final which contains the 244 combined dataframes.
How can I do this...?
All answers will be greatly appreciated :)
I think you should instead use merge:
merge(df1, df2, by="year", all = T)
For your data:
df1 = data.frame(matrix(0, 7, 4))
names(df1) = c("year", "avg", "hr", "sal")
df1$year = 2010:2016
df1$avg = c(.3, .29, .275, .280, .295, .33, .315)
df1$hr = c(31, 30, 14, 24, 18, 26, 40)
df1$sal = c(2000, 4000, 600, 800, 1000, 7000, 9000)
df2 = data.frame(matrix(0, 5, 3))
names(df2) = c("year", "pos", "fld")
df2$year = c(2010, 2011, 2013, 2014, 2015)
df2$pos = c('A', 'B', 'C', 'B', 'D')
df2$fld = c(.99,.995,.97,.98,.99)
cbind is meant to column-bind two data frames that are compatible in every sense, in particular with the same number of rows. What you are aiming for is an actual merge, where no rows from either data frame are discarded and missing values are filled with NA.
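The difference can be seen on a small example. This is only a sketch with made-up data frames `small1` and `small2`, not the asker's lists:

```r
# Two hypothetical data frames with different numbers of rows
small1 <- data.frame(year = 2010:2012, avg = c(0.300, 0.290, 0.275))
small2 <- data.frame(year = c(2010, 2012), pos = c("A", "C"))

# cbind() refuses to combine them because the row counts differ;
# the error message is captured instead of stopping the script
cbind_result <- tryCatch(
  cbind(small1, small2),
  error = function(e) conditionMessage(e)
)

# merge() aligns rows on the key and fills the gaps with NA
merged <- merge(small1, small2, by = "year", all = TRUE)
```

Here `merged` has one row per year from either data frame, and the `pos` value for 2011 is NA.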
We can use Map with cbind.fill (from rowr) to cbind the corresponding 'data.frame' from 'h' and 'g'.
library(rowr)
Map(cbind.fill, h, g, MoreArgs = list(fill=NA))
Update
Based on the expected output shown, it seems the OP wanted a merge instead of a cbind
f1 <- function(...) merge(..., all = TRUE, by = 'year')
Map(f1, h, g)
#[[1]]
# year avg hr sal pos fld
#1 2010 0.300 31 2000 A 0.990
#2 2011 0.290 30 4000 B 0.995
#3 2012 0.275 14 600 <NA> NA
#4 2013 0.280 24 800 C 0.970
#5 2014 0.295 18 1000 B 0.980
#6 2015 0.330 26 7000 D 0.990
#7 2016 0.315 40 9000 <NA> NA
Or, as @Colonel Beauvel mentioned, this can be made more compact
Map(merge, h, g, by='year', all=TRUE)
data
h <- list(structure(list(year = 2010:2016, avg = c(0.3, 0.29, 0.275,
0.28, 0.295, 0.33, 0.315), hr = c(31L, 30L, 14L, 24L, 18L, 26L,
40L), sal = c(2000L, 4000L, 600L, 800L, 1000L, 7000L, 9000L)), .Names = c("year",
"avg", "hr", "sal"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7")))
g <- list(structure(list(year = c(2010L, 2011L, 2013L, 2014L, 2015L
), pos = c("A", "B", "C", "B", "D"), fld = c(0.99, 0.995, 0.97,
0.98, 0.99)), .Names = c("year", "pos", "fld"), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5")))
Here is how you could do this with tidyverse tools:
library(tidyverse)
h <- list()
g <- list()
h[[1]] <- tribble(
~year, ~avg, ~hr, ~sal,
2010, 0.300, 31, 2000,
2011, 0.290, 30, 4000,
2012, 0.275, 14, 600,
2013, 0.280, 24, 800,
2014, 0.295, 18, 1000,
2015, 0.330, 26, 7000,
2016, 0.315, 40, 9000
)
g[[1]] <- tribble(
~year, ~pos, ~fld,
2010, "A", 0.990,
2011, "B", 0.995,
2013, "C", 0.970,
2014, "B", 0.980,
2015, "D", 0.990
)
map2(h, g, left_join)
Which produces:
[[1]]
# A tibble: 7 x 6
year avg hr sal pos fld
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 2010 0.3 31 2000 A 0.99
2 2011 0.290 30 4000 B 0.995
3 2012 0.275 14 600 NA NA
4 2013 0.28 24 800 C 0.97
5 2014 0.295 18 1000 B 0.98
6 2015 0.33 26 7000 D 0.99
7 2016 0.315 40 9000 NA NA

R Aggregate over multiple columns

I'm currently working with a large dataframe of 75 columns and roughly 9500 rows. This dataframe contains observations for every day from 1995-2019 for several observation points.
Edit: The print from dput(head(df))
> dput(head(df))
structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
The dataframe looks like this sample from it:
date x1 x2 x3 x4 x5 xn year month day
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995-01-01 50.8 62.2 90.2 60 NA 53.2 1995 1 1
2 1999-08-02 62.6 58.7 NA 72 NA 61.1 1999 8 2
3 2001-09-03 57.2 49.9 70.1 68.4 NA 56.6 2001 9 3
4 2008-05-04 56.6 56.4 75.8 65.5 NA 58.6 2008 5 4
5 2012-04-05 36.8 43.2 83.3 63.2 NA 36.2 2012 4 5
6 2019-12-31 39.1 41.6 98.5 55.9 NA 44.4 2019 12 31
str(df)
tibble [9,131 x 75] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:9131], format: "1995-01-01" "1995-01-02" ...
$ x1 : num [1:9131] 50.8 62.6 57.2 56.6 36.8 ...
$ x2 : num [1:9131] 62.2 58.7 49.9 56.4 43.2 ...
xn
$ year : num [1:9131] 1995 1995 1995 1995 1995 ...
$ month : num [1:9131] 1 1 1 1 1 1 1 1 1 1 ...
$ day : num [1:9131] 1 2 3 4 5 6 7 8 9 10 ...
My goal is to get, for every observation point xn, the count per year of all observations that exceed a certain limit.
So far I have tried to do this with the aggregate() function.
To get the mean for every year I used the following command:
aggregate(list(df), by=list(year=df$year), mean, na.rm=TRUE)
This works perfectly: I get the mean for every year for every observation point.
To get the count for a single station I used the following code:
aggregate(list(x1=df$x1), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
which results in this print:
year x1
1 1995 52
2 1996 43
3 1997 44
4 1998 42
5 1999 38
6 2000 76
7 2001 52
8 2002 58
9 2003 110
10 2004 34
11 2005 64
12 2006 46
13 2007 46
14 2008 17
15 2009 41
16 2010 30
17 2011 40
18 2012 47
19 2013 40
20 2014 21
21 2015 56
22 2016 27
23 2017 45
24 2018 22
25 2019 45
So far, so good. I know I could expand the code by adding (.., x2=data$x2, x3=data$x3, .., xn) to the list argument in the code above; I tried that and it works.
But how do I get them all at once?
I tried the following codes:
aggregate(.~(date, year, month, day), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error: unexpected ',' in "aggregate(.~(date,"
aggregate(.~date+year+month+day, by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class ‘"function"’ to a data.frame
aggregate(. ~ date + year + month + day, data = df,by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) :
arguments must have same length
But unfortunately none of them works. Could someone please give me a hint where my mistake is?
Here is an answer that uses base R, and since none of the data in the example data is above 120, we set a criterion of above 70.
data <- structure(
list(
date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"),
x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125),
x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625),
x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375),
year = c(1995, 1995, 1995, 1995,
1995, 1995),
month = c(1, 1, 1, 1, 1, 1),
day = c(1, 2, 3,
4, 5, 6)
),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl",
"data.frame"
))
First, we create a subset of the data that contains all columns whose names contain x.
theCols <- data[,colnames(data)[grepl("x",colnames(data))]]
Second, we convert these columns to TRUE or FALSE based on whether each value is greater than 70, and cbind() the year onto the resulting logical values.
x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))
Finally, we use aggregate across all columns other than year and sum the columns.
aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)
...and the output:
Group.1 x1 x2 x3 x4 x5 xn
1 1995 0 0 5 1 0 0
>
Note that by using colnames() to select the columns whose names contain x and ncol() in the aggregate() call, we make this a general solution that will handle a varying number of x locations.
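The same idea can be condensed into a single aggregate() call by passing all x columns at once. The sketch below is a minimal illustration with an invented data frame `obs`; the helper `count_over` is hypothetical and counts days above the limit, which differs from the rle()-based count of runs in the question.

```r
# Hypothetical toy data: two stations observed over two years
obs <- data.frame(
  year = c(1995, 1995, 1995, 1996, 1996, 1996),
  x1 = c(50, 130, 125, 121, 60, NA),
  x2 = c(140, 140, 10, 10, 10, 150)
)

# Count the values above a limit, ignoring NA
count_over <- function(x, limit = 120) sum(x > limit, na.rm = TRUE)

# Select every column whose name starts with "x"
x_cols <- grep("^x", names(obs), value = TRUE)

# One aggregate() call over all selected columns, grouped by year;
# extra arguments such as limit are passed on to FUN
res <- aggregate(obs[x_cols], by = list(year = obs$year),
                 FUN = count_over, limit = 120)
```

The result has one row per year and one count column per station, so adding more x columns to the data needs no change to the code.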
Two tidyverse solutions
A tidyverse solution to the same problem is as follows. It includes the following steps.
Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0, which is currently in development but due for production release the week of May 25th.
Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.
Use pivot_wider() to convert the data back to one column for each x measurement.
...and the code is:
devtools::install_github("tidyverse/dplyr") # needed for across()
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
mutate(.,across(starts_with("x"),~if_else(. > 70,TRUE,FALSE))) %>%
select(-year,-month,-day) %>% group_by(date) %>%
pivot_longer(starts_with("x"),names_to = "measure",values_to = "value") %>%
mutate(year = year(date)) %>% group_by(year,measure) %>%
select(-date) %>%
summarise(value = sum(value,na.rm=TRUE)) %>%
pivot_wider(id_cols = year,names_from = "measure",
values_from = value)
...and the output, which matches the Base R solution that I originally posted:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: year [1]
year x1 x2 x3 x4 x5 xn
<dbl> <int> <int> <int> <int> <int> <int>
1 1995 0 0 5 1 0 0
>
...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year,name) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)
...and the output:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups: year [1]
year x1 x2 x3 x4 x5
<dbl> <int> <int> <int> <int> <int>
1 1995 0 0 0 0 0
>
Conclusions
As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.
That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.
This should solve your problem
library(tidyverse)
library(lubridate)
df_example <- structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year(date)) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE))

How to change an ID's NA to character value, based on other ID's cell values/characteristics in R?

I have a problem with missing values in my dataset. For some reason, several IDs are missing a value in the 'Names' column. This is strange, because other IDs with the same CODE and the same Year do have a value in that column (there are more codes in my whole dataset, over 10K, and 6 possible years).
Can somebody help me figure out the code so that IDs with a missing value in the 'Names' column get the same character value as other IDs with the same CODE and Year that do have a value in that column?
For example, the NA at row 4 should change to 'Hospital', based on another ID with the same code and year. (In my original dataframe there is an ID with year 2013 and code 01 with the name 'Hospital'; if there were not, it should stay NA.)
Sidenote: it is panel data, so each ID can be in the dataset for multiple years (and rows; each year is one row) and not everybody is in for every year. There are also more variables in my dataframe.
> dput(Dataframe[1:7, ])
structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
A tibble: 7 x 8
ID Gender CODE Year Quarter Size Names
<dbl> <dbl> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 1 2 01 2018 3 24.5 Hospital
2 2 1 01 2014 4 23.2 Hospital
3 2 1 01 2018 4 24.5 Hospital
4 2 1 01 2013 4 30 NA
5 2 1 01 2013 3 30 Hospital
6 2 1 01 2015 4 19.2 NA
7 2 1 01 2015 3 19.2 Hospital
Selecting and checking individual rows is too much work; I have over 1.1 million rows.
Edit: it is also possible to transform the 'Names' column to 1 if it has a (character) value, and 0 if it is NA.
Thank you!
I'm not exactly sure because in your example all the names are the same but I think this might do what you are looking for.
I changed the example below to have the last Names be "Not Hospital".
df <- structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Not Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame") )
Original
# A tibble: 7 x 6
ID CODE Year Quarter Size Names
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 01 2018 3 24.5 Hospital
2 2 01 2014 4 23.2 Hospital
3 2 01 2018 4 24.5 Hospital
4 2 01 2013 4 30 NA
5 2 01 2013 3 30 Hospital
6 2 01 2015 4 19.2 NA
7 2 01 2015 3 19.2 Not Hospital
Here's the code to update the names.
df %>%
filter(!is.na(Names)) %>%
select(CODE, Year, Names) %>%
group_by_all() %>%
summarise() %>%
right_join(df, by = c("CODE", "Year")) %>%
rename(Names = Names.x) %>%
select(-Names.y)
Output:
# A tibble: 7 x 6
# Groups: CODE, Year [4]
CODE Year Names ID Quarter Size
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 01 2018 Hospital 1 3 24.5
2 01 2014 Hospital 2 4 23.2
3 01 2018 Hospital 2 4 24.5
4 01 2013 Hospital 2 4 30
5 01 2013 Hospital 2 3 30
6 01 2015 Not Hospital 2 4 19.2
7 01 2015 Not Hospital 2 3 19.2
There are several ways to approach this problem, as far as I can see. However, I prefer the following solution.
The first step is to split the data frame in two: one data frame contains only the rows without NAs in the Names column, while the other contains only the rows with NAs in the Names column. Then, for each row with an NA, you search the former for the same CODE and Year combination and return the name of the matching row.
# Your data frame is assumed to be stored in df
# Split df
df.with.nas <- df[ is.na(df$Names) ,]
df.without.nas <- df[ !is.na(df$Names) ,]
# Define a function to separate the lookup logic
get.name <- function(row) {
# apply() passes each row as a character vector, hence row["<SELECTOR>"]
result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
# Return the first matching name as a plain value (NA if there is no match)
return(result$Names[1])
}
# Finally, search and return.
row.axis <- 1
df.with.nas$Names <- apply(df.with.nas, row.axis, get.name)
# Combine the dfs
df <- rbind(
df.with.nas, df.without.nas)
This solution has a shortcoming: what should happen when we find duplicates?
I hope this is useful!
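One way to settle the duplicates question is to decide on a rule up front, for example "take the first known name in the group". The sketch below applies that rule with ave() in base R. The data frame `panel` and the helper `fill_group` are invented for illustration, and picking the first known name is an assumption, not something the question specifies.

```r
# Hypothetical toy panel: Names is missing for some CODE/Year combinations
panel <- data.frame(
  CODE = c("01", "01", "01", "02", "02"),
  Year = c(2013, 2013, 2015, 2013, 2013),
  Names = c(NA, "Hospital", NA, "School", NA),
  stringsAsFactors = FALSE
)

# Replace NA by the first known name in the group, if there is one
fill_group <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) x[is.na(x)] <- known[1]
  x
}

# Apply the rule within each CODE/Year group; groups without any known
# name keep their NA
panel$Names <- ave(panel$Names, panel$CODE, panel$Year, FUN = fill_group)
```

Rows that already have a name are left untouched, so conflicting names within a group stay visible and can be inspected afterwards.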
