Calculate a complex difference score with tidyverse?

Calculate a complex difference score with tidyverse? - r

I have a large dataset of 70 000 rows that I want to perform some operations on, but I can't find an appropriate solution.
bib sta run course finish comment day
1 42 9 1 SG 19.88 99 1
2 42 17 2 A 19.96 11 1
3 42 27 3 B 20.92 22 1
4 42 39 4 A 19.60 11 1
5 42 48 5 SG 20.24 99 1
6 42 61 6 C 22.90 33 1
7 42 76 7 B 20.70 22 1
8 42 86 8 C 22.74 33 1
9 42 93 9 C 22.75 33 1
10 42 103 10 A 19.79 11 1
11 42 114 11 B 20.67 22 1
12 42 120 12 SG 20.10 99 1
I want to end up with a tibble that:
calculates the mean finish time in SG course for each bib number on one particular day. For example, 19.88 + 20.24 + 20.10 / 3
calculate a difference score for each observation in the dataset by subtracting finish from this mean SG score. For example, 19.88 - mean(SG), 19.96 - mean(SG).
I have tried the following approach:
First group by day, bib and course. Then filter by SG and calculate the mean:
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
mutate(avg.sg = mean(finish))
Resulting in the following tibble
bib sta run course finish comment day avg.sg
<int> <int> <int> <chr> <dbl> <int> <chr> <dbl>
1 42 9 1 SG 19.9 99 1 20.1
2 42 48 5 SG 20.2 99 1 20.1
3 42 120 12 SG 20.1 99 1 20.1
4 42 6 1 SG 20.0 99 2 19.9
5 42 42 5 SG 19.8 77 2 19.9
6 42 130 15 SG 19.9 99 2 19.9
7 42 6 1 SG 20.6 99 3 20.5
8 42 68 12 SG 20.6 77 3 20.5
9 42 90 15 SG 20.4 77 3 20.5
Finally I join the two tibbles together using the following syntax:
df %>% full_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
However, this doesn't work. It only works for the SG course but not for course A, B and C. Is there a way to fix this or is there a better solution to the problem?
bib sta run course finish comment day avg.sg diff
1 42 9 1 SG 19.88 99 1 20.07333 -0.193333333
2 42 17 2 A 19.96 11 1 NA NA
3 42 27 3 B 20.92 22 1 NA NA
4 42 39 4 A 19.60 11 1 NA NA
5 42 48 5 SG 20.24 99 1 20.07333 0.166666667

You can filter your values for finish within the mutate() and calculate the mean based on those:
df %>%
group_by(day,bib) %>%
mutate(
avg.sg = mean(finish[course=="SG"]),
diff = finish - avg.sg)

Is the following what you are aiming for?
(note that I added a few random values for a second bib just to make sure the join is done properly)
The difference to your attempt is using summarise() instead of mutate() to consolidate the avg.sgtime data frame, and also dropping a few columns so that the join is not populated with NAs. Instead of dropping you can also set the relevant columns to join by passing the by argument to the left_join() function.
library(dplyr)
library(tidyr) # for join
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
summarise(avg.sg = mean(finish), .groups = "drop") %>%
select(c(bib, day, avg.sg))
avg.sgtime
#> # A tibble: 3 x 3
#> bib day avg.sg
#> <dbl> <dbl> <dbl>
#> 1 42 1 20.1
#> 2 43 1 19.1
#> 3 44 2 19.3
df %>% left_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
#> Joining, by = c("bib", "day")
#> # A tibble: 36 x 9
#> bib sta run course finish comment day avg.sg diff
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 42 9 1 SG 19.9 99 1 20.1 -0.193
#> 2 42 17 2 A 20.0 11 1 20.1 -0.113
#> 3 42 27 3 B 20.9 22 1 20.1 0.847
#> 4 42 39 4 A 19.6 11 1 20.1 -0.473
#> 5 42 48 5 SG 20.2 99 1 20.1 0.167
#> 6 42 61 6 C 22.9 33 1 20.1 2.83
#> 7 42 76 7 B 20.7 22 1 20.1 0.627
#> 8 42 86 8 C 22.7 33 1 20.1 2.67
#> 9 42 93 9 C 22.8 33 1 20.1 2.68
#> 10 42 103 10 A 19.8 11 1 20.1 -0.283
#> # … with 26 more rows
Created on 2021-07-04 by the reprex package (v2.0.0)
data
df <- tribble(~bib, ~sta, ~run, ~course, ~finish, ~comment, ~day,
42, 9, 1, "SG", 19.88, 99, 1,
42, 17, 2, "A", 19.96, 11, 1,
42, 27, 3, "B", 20.92, 22, 1,
42, 39, 4, "A", 19.60, 11, 1,
42, 48, 5, "SG", 20.24, 99, 1,
42, 61, 6, "C", 22.90, 33, 1,
42, 76, 7, "B", 20.70, 22, 1,
42, 86, 8, "C", 22.74, 33, 1,
42, 93, 9, "C", 22.75, 33, 1,
42, 103, 10, "A", 19.79, 11, 1,
42, 114, 11, "B", 20.67, 22, 1,
42, 120, 12, "SG", 20.10, 99, 1,
43, 9, 1, "SG", 19.12, 99, 1,
43, 17, 2, "A", 19.64, 11, 1,
43, 27, 3, "B", 20.62, 22, 1,
43, 39, 4, "A", 19.23, 11, 1,
43, 48, 5, "SG", 20.11, 99, 1,
43, 61, 6, "C", 22.22, 33, 1,
43, 76, 7, "B", 20.33, 22, 1,
43, 86, 8, "C", 22.51, 33, 1,
43, 93, 9, "C", 22.78, 33, 1,
43, 103, 10, "A", 19.98, 11, 1,
43, 114, 11, "B", 20.11, 22, 1,
43, 120, 12, "SG", 18.21, 99, 1,
44, 9, 1, "SG", 19.18, 99, 2,
44, 17, 2, "A", 19.56, 11, 2,
44, 27, 3, "B", 20.62, 22, 2,
44, 39, 4, "A", 19.20, 11, 2,
44, 48, 5, "SG", 20.74, 99, 2,
44, 61, 6, "C", 22.50, 33, 2,
44, 76, 7, "B", 20.60, 22, 2,
44, 86, 8, "C", 22.74, 33, 2,
44, 93, 9, "C", 22.85, 33, 2,
44, 103, 10, "A", 19.59, 11, 2,
44, 114, 11, "B", 20.27, 22, 2,
44, 120, 12, "SG", 18.10, 99, 2,
)

Thanks #Marcelo Avila for providing me with a very good hint:
I hope this is what you are looking for:
library(dplyr)
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[course == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 12 x 9
# Groups: bib, day [1]
bib sta run course finish comment day avg_finish diff
<int> <int> <int> <chr> <dbl> <int> <int> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
11 42 114 11 B 20.7 22 1 NA 0.597
12 42 120 12 SG 20.1 99 1 20.1 0.0267
I also added another alternative solution with a minor change, using dear #Marcelo Avila's data set:
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[select(cur_data(), course) == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 36 x 9
# Groups: bib, day [3]
bib sta run course finish comment day avg_finish diff
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
# ... with 26 more rows

Related

Error message with missForest package (imputation using Random Forest)

My dataframe is below. All variables are numeric, one of them (Total) has about 20 NAs. I would like the missForest package to create imputed values for the NAs in Total. I am running
R version 4.2.1 (2022-06-23 ucrt) on Windows.
imp <- structure(list(Years = c(21, 5, 5, 25, 4, 4, 4, 1, 12, 17, 5.5,
4, 13, 1, 1, 5, 1, 12, 8, 1, 14, 0.8, 6, 5, 4, 7, 4, 21, 3, 2,
20, 1, 2, 2, 20, 2, 1, 9, 12, 22, 1, 27, 5, 3, 1, 8, 5, 25, 1,
0.4, 4, 1, 1.5, 1, 1, 21, 5, 0.5, 3, 12, 3, 28, 7, 5, 22, 3.25,
4, 4, 12, 1, 3, 25, 17, 12, 40, 12, 6, 3, 8, 7, 17, 1, 3, 3,
6, 4, 7, 1, 7, 6, 4, 11, 1, 5, 2, 15, 1, 3, 7.5, 21, 4, 1.5,
7, 13, 5, 6, 9, 12.5, 2.5, 1, 17, 8, 5, 22, 25, 13, 5.5, 19,
9, 3.3, 14, 3, 22, 5, 6, 2.8, 9, 1, 8, 11, 8, 4, 2, 10, 1, 19,
13, 5, 1, 1.5, 7, 12, 2, 2.5, 1.5, 1, 2, 8, 5, 4, 3, 2, 2.5,
7, 11, 3, 8, 22, 5, 5, 8, 3.5, 1, 8, 11, 1, 5, 7, 9, 7, 4, 1,
14, 4, 20, 4, 5, 15.5, 9, 2, 7.5, 1, 13.5, 14, 1, 7, 4, 20, 9.5,
0, 10, 3, 8, 1, 3, 1, 19, 1, 20, 8, 25, 16, 14, 10, 24, 1, 2,
4, 0, 11, 2, 1.5, 2, 1, 21, 1, 20, 1.75, 5, 22, 5, 3), Staff = c(7,
8, 6, 10, 15, 6, 7, 17, 9, 5, 7, 12, 15, 8, 7, 5, 8, 8, 2, 8,
7, 8, 7, 7, 12, 8, 8, 7, 12, 10, 5, 7, 3, 6, 11, 4, 8, 8, 9,
6, 9, 9, 18, 10, 9, 5, 7, 20, 9, 4, 9, 6, 5, 4, 3, 5, 11, 8,
4, 7, 6, 16, 5, 5, 8, 8, 7, 4, 9, 9, 9, 14, 8, 5, 6, 6, 4, 3,
6, 7, 10, 7, 7, 3, 7, 13, 12, 4, 10, 8, 9, 5, 15, 7, 9, 9, 6,
5, 15, 7, 6, 5, 7, 8, 7, 7, 5, 9, 15, 12, 15, 5, 8, 7, 7, 5,
8, 12, 6, 6, 12, 9, 5, 4, 6, 7, 15, 5, 20, 6, 6, 11, 6, 8, 6,
2, 7, 4, 4, 2, 6, 15, 5, 15, 6, 3, 8, 15, 12, 7, 6, 9, 7, 1,
10, 5, 7, 4, 5, 1, 6, 5, 20, 8, 10, 1, 11, 9, 9, 5, 3, 8, 6,
5, 5, 5, 6, 8, 4, 7, 5, 4, 10, 8, 13, 5, 13, 3, 0, 15, 20, 5,
15, 14, 19, 20, 5, 7, 5, 9, 6, 6, 7, 20, 10, 25, 7, 5, 6, 10,
45, 10, 6, 5, 6, 8, 13, 12, 15, 7, 4, 1), JDs = c(64, 64, 120,
200, 30, 70, 370, 75, 300, 20, 68, 170, 77, 275, 132, 81, 875,
135, 75, 84, 74, 110, 120, 60, 1800, 94, 54, 125, 140, 150, 52,
190, 53, 170, 325, 18, 300, 86, 130, 375, 140, 200, 104, 50,
100, 95, 360, 40, 45, 52, 165, 20, 150, 58, 230, 95, 150, 95,
85, 120, 100, 265, 18, 90, 130, 77, 80, 75, 133, 73, 302, 500,
70, 50, 55, 72, 35, 60, 100, 90, 130, 41, 200, 29, 90, 35, 68,
30, 115, 51, 40, 125, 460, 400, 125, 400, 250, 51, 190, 200,
235, 150, 250, 137, 760, 90, 70, 100, 325, 200, 350, 150, 325,
23, 17, 50, 415, 650, 120, 96, 200, 4, 71, 700, 60, 224, 203,
16, 40, 62, 105, 41, 340, 22, 60, 11, 60, 30, 95, 27, 300, 120,
70, 96, 100, 6, 750, 14, 80, 60, 51, 90, 350, 250, 31, 78, 95,
32, 185, 65, 65, 30, 24, 65, 550, 100, 200, 80, 47, 45, 37, 250,
55, 25, 27, 90, 190, 65, 27, 80, 68, 110, 220, 325, 25, 43, 14,
5, 7, 17, 15, 135, 20, 26, 26, 29, 75, 93, 50, 127, 14, 75, 90,
50, 105, 190, 8, 45, 150, 300, 15, 25, 150, 60, 32, 85, 15, 144,
190, 155, 10, 20), Total = c(325000, 250000, 275000, 340000,
165000, 3e+05, 420000, 8e+05, 5e+05, 100776, 440000, 440000,
191500, NA, 4e+05, 145000, 6e+05, 4e+05, 125000, 155000, 230000,
250000, 240000, 2e+05, NA, 250000, 188000, 375000, 190000, 450000,
290558, 725000, 355000, 350000, 8e+05, 125000, 450000, 255000,
212500, 6e+05, 342000, 450000, 250000, 228000, 325000, 325000,
425000, 175000, NA, 240000, NA, 250000, 237000, 330000, 345000,
195000, 295000, 208000, 225000, NA, 445000, 253000, 75000, 285000,
4e+05, 2e+05, 308000, 236000, 470000, 190000, 1250000, 480000,
2e+05, 285000, 232000, 240000, 2e+05, 209000, 250000, 309000,
NA, 170000, 1e+06, 115200, 565000, 182500, 175000, 250000, 250000,
265000, 120000, 345000, 425000, 630000, 165000, 650000, 3e+05,
265000, 345000, 425000, 4e+05, 230000, 425000, 161500, 6e+05,
251000, 265000, 190000, 420000, 6e+05, 510000, 340000, 650000,
275000, 120000, 185000, 480000, 550000, 185000, 240000, 560000,
114000, 150000, 1050000, 230000, NA, 335000, 225000, 260000,
410000, 315000, 206000, 650000, 160000, 210000, 180000, 275000,
2e+05, 2e+05, 201094, 395000, 297000, 265000, 3e+05, 275000,
80000, 134000, 180000, 195000, 850000, 4e+05, 385000, 420000,
NA, 187000, 180000, 182700, 96597.28, 380000, 2e+05, 260000,
257500, 185000, 220000, 550000, 315000, 360000, 380000, 185000,
280000, 225000, 375000, 310000, 170000, 165000, 260000, 350000,
208000, 110000, 192500, 187500, 216000, 495000, 550000, 114500,
215000, 185000, NA, 114500, 110000, 250000, 350000, 180000, 118000,
191500, 1e+05, 230000, 350000, 240000, NA, 180000, 215000, 203000,
99800, 389900, NA, NA, NA, 4e+05, 6e+05, NA, NA, NA, 220000,
217500, NA, NA, 210000, 337000, 275000, NA, NA)), row.names = c(NA,
-222L), class = c("tbl_df", "tbl", "data.frame"))
library(missForest) # installed with dependencies = TRUE
impFor <- missForest(imp)
The statement above returns the following warnings and error.
Warning: argument is not numeric or logical: returning NAWarning: argument is not numeric or logical: returning NAWarning: argument is not numeric or logical: returning NAWarning: argument is not numeric or logical: returning NA
Warning: The response has five or fewer unique values. Are you sure you want to do regression?
Error in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, :
length of response must be the same as predictors
The first four warnings appear to say that my four variables are neither numeric nor logical, but they are all numeric. The warning regarding regression and "five or fewer unique values" puzzles me because the package's manual makes no reference to a minimum number of unique values. Finally, the error confounds me completely.
I have searched StackOverflow, but the two questions that came up are not relevant.
Thank you for setting me right.

Your data should be in a data.frame format instead of tibble. You could use as.data.frame like this:
library(missForest)
class(imp)
#> [1] "tbl_df" "tbl" "data.frame"
imp <- as.data.frame(imp)
class(imp)
#> [1] "data.frame"
imp <- missForest(imp)
imp
#> $ximp
#> Years Staff JDs Total
#> 1 21.00 7 64 325000.00
#> 2 5.00 8 64 250000.00
#> 3 5.00 6 120 275000.00
#> 4 25.00 10 200 340000.00
#> 5 4.00 15 30 165000.00
#> 6 4.00 6 70 300000.00
#> 7 4.00 7 370 420000.00
#> 8 1.00 17 75 800000.00
#> 9 12.00 9 300 500000.00
#> 10 17.00 5 20 100776.00
#> 11 5.50 7 68 440000.00
#> 12 4.00 12 170 440000.00
#> 13 13.00 15 77 191500.00
#> 14 1.00 8 275 422030.00
#> 15 1.00 7 132 400000.00
#> 16 5.00 5 81 145000.00
#> 17 1.00 8 875 600000.00
#> 18 12.00 8 135 400000.00
#> 19 8.00 2 75 125000.00
#> 20 1.00 8 84 155000.00
#> 21 14.00 7 74 230000.00
#> 22 0.80 8 110 250000.00
#> 23 6.00 7 120 240000.00
#> 24 5.00 7 60 200000.00
#> 25 4.00 12 1800 564720.00
#> 26 7.00 8 94 250000.00
#> 27 4.00 8 54 188000.00
#> 28 21.00 7 125 375000.00
#> 29 3.00 12 140 190000.00
#> 30 2.00 10 150 450000.00
#> 31 20.00 5 52 290558.00
#> 32 1.00 7 190 725000.00
#> 33 2.00 3 53 355000.00
#> 34 2.00 6 170 350000.00
#> 35 20.00 11 325 800000.00
#> 36 2.00 4 18 125000.00
#> 37 1.00 8 300 450000.00
#> 38 9.00 8 86 255000.00
#> 39 12.00 9 130 212500.00
#> 40 22.00 6 375 600000.00
#> 41 1.00 9 140 342000.00
#> 42 27.00 9 200 450000.00
#> 43 5.00 18 104 250000.00
#> 44 3.00 10 50 228000.00
#> 45 1.00 9 100 325000.00
#> 46 8.00 5 95 325000.00
#> 47 5.00 7 360 425000.00
#> 48 25.00 20 40 175000.00
#> 49 1.00 9 45 185352.00
#> 50 0.40 4 52 240000.00
#> 51 4.00 9 165 403167.00
#> 52 1.00 6 20 250000.00
#> 53 1.50 5 150 237000.00
#> 54 1.00 4 58 330000.00
#> 55 1.00 3 230 345000.00
#> 56 21.00 5 95 195000.00
#> 57 5.00 11 150 295000.00
#> 58 0.50 8 95 208000.00
#> 59 3.00 4 85 225000.00
#> 60 12.00 7 120 261252.00
#> 61 3.00 6 100 445000.00
#> 62 28.00 16 265 253000.00
#> 63 7.00 5 18 75000.00
#> 64 5.00 5 90 285000.00
#> 65 22.00 8 130 400000.00
#> 66 3.25 8 77 200000.00
#> 67 4.00 7 80 308000.00
#> 68 4.00 4 75 236000.00
#> 69 12.00 9 133 470000.00
#> 70 1.00 9 73 190000.00
#> 71 3.00 9 302 1250000.00
#> 72 25.00 14 500 480000.00
#> 73 17.00 8 70 200000.00
#> 74 12.00 5 50 285000.00
#> 75 40.00 6 55 232000.00
#> 76 12.00 6 72 240000.00
#> 77 6.00 4 35 200000.00
#> 78 3.00 3 60 209000.00
#> 79 8.00 6 100 250000.00
#> 80 7.00 7 90 309000.00
#> 81 17.00 10 130 279905.83
#> 82 1.00 7 41 170000.00
#> 83 3.00 7 200 1000000.00
#> 84 3.00 3 29 115200.00
#> 85 6.00 7 90 565000.00
#> 86 4.00 13 35 182500.00
#> 87 7.00 12 68 175000.00
#> 88 1.00 4 30 250000.00
#> 89 7.00 10 115 250000.00
#> 90 6.00 8 51 265000.00
#> 91 4.00 9 40 120000.00
#> 92 11.00 5 125 345000.00
#> 93 1.00 15 460 425000.00
#> 94 5.00 7 400 630000.00
#> 95 2.00 9 125 165000.00
#> 96 15.00 9 400 650000.00
#> 97 1.00 6 250 300000.00
#> 98 3.00 5 51 265000.00
#> 99 7.50 15 190 345000.00
#> 100 21.00 7 200 425000.00
#> 101 4.00 6 235 400000.00
#> 102 1.50 5 150 230000.00
#> 103 7.00 7 250 425000.00
#> 104 13.00 8 137 161500.00
#> 105 5.00 7 760 600000.00
#> 106 6.00 7 90 251000.00
#> 107 9.00 5 70 265000.00
#> 108 12.50 9 100 190000.00
#> 109 2.50 15 325 420000.00
#> 110 1.00 12 200 600000.00
#> 111 17.00 15 350 510000.00
#> 112 8.00 5 150 340000.00
#> 113 5.00 8 325 650000.00
#> 114 22.00 7 23 275000.00
#> 115 25.00 7 17 120000.00
#> 116 13.00 5 50 185000.00
#> 117 5.50 8 415 480000.00
#> 118 19.00 12 650 550000.00
#> 119 9.00 6 120 185000.00
#> 120 3.30 6 96 240000.00
#> 121 14.00 12 200 560000.00
#> 122 3.00 9 4 114000.00
#> 123 22.00 5 71 150000.00
#> 124 5.00 4 700 1050000.00
#> 125 6.00 6 60 230000.00
#> 126 2.80 7 224 756680.00
#> 127 9.00 15 203 335000.00
#> 128 1.00 5 16 225000.00
#> 129 8.00 20 40 260000.00
#> 130 11.00 6 62 410000.00
#> 131 8.00 6 105 315000.00
#> 132 4.00 11 41 206000.00
#> 133 2.00 6 340 650000.00
#> 134 10.00 8 22 160000.00
#> 135 1.00 6 60 210000.00
#> 136 19.00 2 11 180000.00
#> 137 13.00 7 60 275000.00
#> 138 5.00 4 30 200000.00
#> 139 1.00 4 95 200000.00
#> 140 1.50 2 27 201094.00
#> 141 7.00 6 300 395000.00
#> 142 12.00 15 120 297000.00
#> 143 2.00 5 70 265000.00
#> 144 2.50 15 96 300000.00
#> 145 1.50 6 100 275000.00
#> 146 1.00 3 6 80000.00
#> 147 2.00 8 750 134000.00
#> 148 8.00 15 14 180000.00
#> 149 5.00 12 80 195000.00
#> 150 4.00 7 60 850000.00
#> 151 3.00 6 51 400000.00
#> 152 2.00 9 90 385000.00
#> 153 2.50 7 350 420000.00
#> 154 7.00 1 250 434900.00
#> 155 11.00 10 31 187000.00
#> 156 3.00 5 78 180000.00
#> 157 8.00 7 95 182700.00
#> 158 22.00 4 32 96597.28
#> 159 5.00 5 185 380000.00
#> 160 5.00 1 65 200000.00
#> 161 8.00 6 65 260000.00
#> 162 3.50 5 30 257500.00
#> 163 1.00 20 24 185000.00
#> 164 8.00 8 65 220000.00
#> 165 11.00 10 550 550000.00
#> 166 1.00 1 100 315000.00
#> 167 5.00 11 200 360000.00
#> 168 7.00 9 80 380000.00
#> 169 9.00 9 47 185000.00
#> 170 7.00 5 45 280000.00
#> 171 4.00 3 37 225000.00
#> 172 1.00 8 250 375000.00
#> 173 14.00 6 55 310000.00
#> 174 4.00 5 25 170000.00
#> 175 20.00 5 27 165000.00
#> 176 4.00 5 90 260000.00
#> 177 5.00 6 190 350000.00
#> 178 15.50 8 65 208000.00
#> 179 9.00 4 27 110000.00
#> 180 2.00 7 80 192500.00
#> 181 7.50 5 68 187500.00
#> 182 1.00 4 110 216000.00
#> 183 13.50 10 220 495000.00
#> 184 14.00 8 325 550000.00
#> 185 1.00 13 25 114500.00
#> 186 7.00 5 43 215000.00
#> 187 4.00 13 14 185000.00
#> 188 20.00 3 5 132532.25
#> 189 9.50 0 7 114500.00
#> 190 0.00 15 17 110000.00
#> 191 10.00 20 15 250000.00
#> 192 3.00 5 135 350000.00
#> 193 8.00 15 20 180000.00
#> 194 1.00 14 26 118000.00
#> 195 3.00 19 26 191500.00
#> 196 1.00 20 29 100000.00
#> 197 19.00 5 75 230000.00
#> 198 1.00 7 93 350000.00
#> 199 20.00 5 50 240000.00
#> 200 8.00 9 127 259289.83
#> 201 25.00 6 14 180000.00
#> 202 16.00 6 75 215000.00
#> 203 14.00 7 90 203000.00
#> 204 10.00 20 50 99800.00
#> 205 24.00 10 105 389900.00
#> 206 1.00 25 190 466223.67
#> 207 2.00 7 8 153912.76
#> 208 4.00 5 45 249760.00
#> 209 0.00 6 150 400000.00
#> 210 11.00 10 300 600000.00
#> 211 2.00 45 15 190321.17
#> 212 1.50 10 25 143960.88
#> 213 2.00 6 150 350892.50
#> 214 1.00 5 60 220000.00
#> 215 21.00 6 32 217500.00
#> 216 1.00 8 85 193365.00
#> 217 20.00 13 15 193093.52
#> 218 1.75 12 144 210000.00
#> 219 5.00 15 190 337000.00
#> 220 22.00 7 155 275000.00
#> 221 5.00 4 10 143128.61
#> 222 3.00 1 20 149726.72
#>
#> $OOBerror
#> NRMSE
#> 0.4584988
#>
#> attr(,"class")
#> [1] "missForest"
Created on 2023-02-11 with reprex v2.0.2

Sample part of a dataset while keeping subgroups intact

I have a dataframe which I would like to split into one 75% and one 25% parts of the original.
I thought a good first step would be to create the 25% dataset from the original dataset, by randomly sampling a quarter of the data.
However sampling shouldn't be entirely random, I want to preserve groups of a certain variable.
So with the example below, I want to randomly sample 1/4 of the data frame, but data needs to remain grouped via the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.
Data example (dput below)
team points assists
1 1 99 33
2 1 90 28
3 1 86 31
4 1 88 39
5 2 95 34
6 2 92 30
7 2 91 32
8 2 79 35
9 3 85 36
10 3 90 29
11 3 91 24
12 3 97 26
13 4 96 28
14 4 94 18
15 4 95 19
16 4 98 25
17 5 78 36
18 5 80 34
19 5 85 39
20 5 89 33
21 6 94 34
22 6 85 39
23 6 99 28
24 6 79 31
25 7 78 35
26 7 99 29
27 7 98 36
28 7 75 39
29 8 97 33
30 8 68 26
31 8 86 38
32 8 76 31
I've tried this using the slice_sample code from dplyr, but this does the exact opposite of what I want (it splits all teams)
testdata <- df %>% group_by(team) %>% slice_sample(n = 2)
My code results in
team points assists
<dbl> <dbl> <dbl>
1 1 90 28
2 1 99 33
3 2 95 34
4 2 92 30
5 3 91 24
6 3 85 36
7 4 95 19
8 4 98 25
9 5 80 34
10 5 78 36
11 6 85 39
12 6 94 34
13 7 78 35
14 7 98 36
15 8 76 31
16 8 86 38
Example of the dataframe:
structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99,
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78,
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26,
28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36,
39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA,
-32L))

With dplyr, if you group_by(team) and then sample, that's sampling within each team--the opposite of what you want. Here's a direct approach:
test_teams = sample(unique(dataset$team), size = 2)
test = dataset %>% filter(team %in% test_teams)
train = dataset %>% filter(!team %in% test_teams)

library(caTools)
split <- sample.split(dataset$team, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

How to summarize the top n values across multiple columns row wise?

In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values).
One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36).
Data
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
I know that I can use pivot_longer to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with tidyverse, rather than having to pivot back and forth.
What I Tried
library(tidyverse)
df %>%
pivot_longer(-c(Student, ID)) %>%
group_by(Student, ID) %>%
arrange(desc(value), .by_group = TRUE) %>%
slice_head(n = 2) %>%
pivot_wider(names_from = name, values_from = value) %>%
ungroup() %>%
mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))
I also know that if I wanted to sum all the columns on each row, then I could use rowSums, as I made use of above. However, I am unsure how to do rowSums of just the top 2 values in the 4 quiz columns.
Expected Output
# A tibble: 10 × 7
Student ID Quiz2 Quiz3 Quiz1 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 42 36 NA NA 78
2 Charlotte 61755 NA NA 41 39 80
3 Chloe 98432 49 NA 42 NA 91
4 Grant 34096 48 44 NA NA 92
5 James 87311 NA 36 NA 43 79
6 Judy 19487 NA 42 NA 41 83
7 Justin 94029 NA 37 NA 41 78
8 Katie 55323 NA 32 NA 46 78
9 Olivia 94839 NA 43 NA 40 83
10 Timothy 38209 38 38 NA NA 76

Based on this StackOverflow answer.
library(tidyverse)
df <-
structure(
list(
Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
"Timothy", "Grant", "Chloe", "Judy", "Justin"),
ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
98432, 19487, 94029),
Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame")
)
df %>%
rowwise() %>%
mutate(Quiz_Total = sum(sort(c(Quiz1,Quiz2,Quiz3,Quiz4), decreasing = TRUE)[1:2])) %>%
ungroup()
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78

with base R - select just the quiz result columns and you can treat it like a matrix. apply sort in decreasing order, subsetting first two elements, and then use colSums.
df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))
df
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78

You do not have to do pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:
df %>%
left_join(pivot_longer(., -c(Student, ID)) %>%
group_by(Student, ID) %>%
summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))
# A tibble: 10 x 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 31 42 36 36 78
2 James 87311 25 33 36 43 79
3 Charlotte 61755 41 34 34 39 80
4 Katie 55323 10 22 32 46 78
5 Olivia 94839 35 23 43 40 83
6 Timothy 38209 19 38 38 38 76
7 Grant 34096 27 48 44 43 92
8 Chloe 98432 42 49 42 35 91
9 Judy 19487 15 23 42 41 83
10 Justin 94029 20 30 37 41 78

Yet another solution, based on tidyverse:
library(tidyverse)
df %>%
rowwise %>%
mutate(Quiz = list(c_across(starts_with("Quiz")) *
if_else(rank(c_across(starts_with("Quiz")),ties.method="last")>=3,1,NA_real_)),
across(matches("\\d$"), ~ NULL), total = sum(Quiz, na.rm = T)) %>%
unnest_wider(Quiz, names_sep = "")
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 NA 42 36 NA 78
#> 2 James 87311 NA NA 36 43 79
#> 3 Charlotte 61755 41 NA NA 39 80
#> 4 Katie 55323 NA NA 32 46 78
#> 5 Olivia 94839 NA NA 43 40 83
#> 6 Timothy 38209 NA 38 38 NA 76
#> 7 Grant 34096 NA 48 44 NA 92
#> 8 Chloe 98432 42 49 NA NA 91
#> 9 Judy 19487 NA NA 42 41 83
#> 10 Justin 94029 NA NA 37 41 78

(A bit messy) Base R Solution:
# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)
# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]
# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))
# Split-Apply-Combine to calculate the top 2 quizes: res => data.frame
res <- data.frame(
do.call(
rbind,
lapply(
with(
df,
split(
df,
Idx
)
),
function(x){
# Extract the top 2 quiz vectors: top_2_quizes => named integer vector
top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
# Calculate the quiz columns not used: remainder_quiz_cols => character vector
remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
# Nullify the remaining quizes: x => data.frame
x[, remainder_quiz_cols] <- NA_integer_
# Calculate the resulting data.frame: data.frame => env
transform(
cbind(
x[,non_quiz_names],
x[,names(top_2_quizes)],
x[,remainder_quiz_cols]
),
Total = sum(top_2_quizes)
)[,c(non_quiz_names, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
}
)
),
row.names = NULL,
stringsAsFactors = FALSE
)

Try this base R to also get the NAs
cbind( df[,1:2], t( sapply( seq_along( 1:nrow( df ) ), function(x){
ord <- order( unlist( df[x,3:6] ) )[1:2]; arow <- df[x,3:6];
arow[ord] <- NA; ttl <- rowSums( arow[-ord], na.rm=T );
cbind( arow,Total=ttl ) } ) ) )
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78

As #akrun provided above, collapse is another efficient possibility. radixorder provides an integer ordering vector, and only the top 2 values in each row are kept, while the others are replaced with NA. Then, rowSums is used to get the totals for each row.
library(collapse)
ftransform(gvr(df, "Student|ID"),
dapply(
gvr(df, "^Quiz"),
MARGIN = 1,
FUN = function(x)
replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
)) %>%
ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))
Output
# A tibble: 10 × 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78

Descriptive statistics for R-data in a long format

I have data on repeated measurements (5 or less) which include blood pressure measurements. I formatted the data in a long format, but seeing as I've done this for the first time, I now no longer see how to get descriptive statistics of my variables.
My example data:
questiondata <- structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,
3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 8, 8),
time = c("time1", "time2", "time3", "time5", "time1", "time2", "time3", "time5",
"time1", "time2", "time3", "time5", "time4", "time5", "time4", "time5",
"time4", "time5", "time4", "time5", "time4", "time5", "time4", "time5"),
cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
systolicBP = c(102, 137, 132, 150, 152, 146, 160.5, 159.5, 144, 138, 137.5, 163,
137, 147, 125, 141, 147, 150, 148, 167.5, 153.5, 164.5, 159, 123),
diastolicBP = c(56, 99, 78, 90, 77, 78, 80.5, 82, 72, 70, 67.5, 61, 86, 90, 80.5,
84, 75, 81, 91, 96, 80, 87.5, 87, 79),
egfr = c(78.2, NA, 55.8, NA, NA, NA, 87.6, NA, NA, NA, 75.6, 70.9, 71.9, 71.8,
47.9, 36.6, 93.7, 81.5, 93.2, 82.1, 92.9, 79.1, 66.6, 55.2)),
row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"))
Which corresponds to the following tibble:
# A tibble: 24 x 6
id time cohort systolicBP diastolicBP egfr
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 time1 1 102 56 78.2
2 1 time2 1 137 99 NA
3 1 time3 1 132 78 55.8
4 1 time5 1 150 90 NA
5 2 time1 1 152 77 NA
6 2 time2 1 146 78 NA
7 2 time3 1 160. 80.5 87.6
8 2 time5 1 160. 82 NA
9 3 time1 1 144 72 NA
10 3 time2 1 138 70 NA
11 3 time3 1 138. 67.5 75.6
12 3 time5 1 163 61 70.9
13 4 time4 2 137 86 71.9
14 4 time5 2 147 90 71.8
15 5 time4 2 125 80.5 47.9
16 5 time5 2 141 84 36.6
17 6 time4 2 147 75 93.7
18 6 time5 2 150 81 81.5
19 7 time4 2 148 91 93.2
20 7 time5 2 168. 96 82.1
21 8 time4 2 154. 80 92.9
22 8 time5 2 164. 87.5 79.1
23 9 time4 2 159 87 66.6
24 9 time5 2 123 79 55.2
So for example now I want to
Get how many persons are in each cohort
What is the mean of the systolicBP, in total, for each cohort, and at each timepoint.
I'm sure this is very easy, but I can't seem to get any viable results.

library(tidyverse)
questiondata %>%
group_by(cohort) %>%
summarise(n = n_distinct(id),
mean_systolic = mean(systolicBP, na.rm = TRUE), .groups = "drop")
#> # A tibble: 2 x 3
#> cohort n mean_systolic
#> <dbl> <int> <dbl>
#> 1 1 3 143.
#> 2 2 6 147.
questiondata %>%
group_by(cohort, time) %>%
summarise(mean_systolic_time = mean(systolicBP, na.rm = TRUE), .groups = "drop")
#> # A tibble: 6 x 3
#> cohort time mean_systolic_time
#> <dbl> <chr> <dbl>
#> 1 1 time1 133.
#> 2 1 time2 140.
#> 3 1 time3 143.
#> 4 1 time5 158.
#> 5 2 time4 145.
#> 6 2 time5 149.
Created on 2021-06-25 by the reprex package (v2.0.0)
questiondata %>%
group_by(cohort, time) %>%
summarise(mean_systolic_time = str_c(round(mean(systolicBP, na.rm = TRUE), digits = 0), " mmHg"),
.groups = "drop")
# A tibble: 6 x 3
cohort time mean_systolic_time
<dbl> <chr> <chr>
1 1 time1 133 mmHg
2 1 time2 140 mmHg
3 1 time3 143 mmHg
4 1 time5 158 mmHg
5 2 time4 145 mmHg
6 2 time5 149 mmHg

R Impute NA's by Linear Increase Depending on Time Interval

PROBLEM
I neeed to impute the NA's in my data frame that comes from a repeated measures study. On this particular outcome, I need to impute the NA's with the last observed non-NA value +1 by each +52 week interval starting from the last observed value.
EXAMPLE
An example data frame with the target imputation goal included.
df <- data.frame(
subject = rep(1:3, each = 12),
week = rep(c(8, 10, 12, 16, 20, 26, 32, 44, 52, 64, 78, 104),3),
value = c(112, 97, 130, 104, NA, NA, NA, NA, NA, NA, NA, NA,
89, 86, 94, 96, 88,107, 110, 102, 107, NA, NA, NA,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, NA, NA),
goal = c(112, 97, 130, 104, 104, 104, 104, 104, 104, 104, 105, 105,
89, 86, 94, 96, 88,107, 110, 102, 107, 107,107, 108,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, 106, 106)
)

I left the intermediate columns in to make what's happening more obvious, but you can remove them with a simple select.
df = df %>%
group_by(subject) %>%
mutate(last_obs_week = max(week[!is.na(value)]),
since_last_week = pmax(0, week - last_obs_week),
inc_52 = since_last_week %/% 52,
result = zoo::na.locf(value) + inc_52
)
all(df$goal == df$result)
# [1] TRUE
print.data.frame(df)
# subject week value goal last_obs_week since_last_week inc_52 result
# 1 1 8 112 112 16 0 0 112
# 2 1 10 97 97 16 0 0 97
# 3 1 12 130 130 16 0 0 130
# 4 1 16 104 104 16 0 0 104
# 5 1 20 NA 104 16 4 0 104
# 6 1 26 NA 104 16 10 0 104
# 7 1 32 NA 104 16 16 0 104
# 8 1 44 NA 104 16 28 0 104
# 9 1 52 NA 104 16 36 0 104
# 10 1 64 NA 104 16 48 0 104
# 11 1 78 NA 105 16 62 1 105
# 12 1 104 NA 105 16 88 1 105
# 13 2 8 89 89 52 0 0 89
# ...

One can use dplyr and tidyr::fill to get the desired result. The logic will be to add a column to track the week which had the non-NA value. Use tidyr::fill to populate last non-NA value and then check if difference of current week with last non-NA week is more than 52 then increase the value by 1.
library(dplyr)
library(tidyr)
df %>% group_by(subject) %>%
mutate(weekWithLastNonNaValue = ifelse(is.na(value), NA, week)) %>%
fill(value, weekWithLastNonNaValue) %>%
mutate(value = value + (week-weekWithLastNonNaValue) %/% 52) %>%
select(-weekWithLastNonNaValue) %>%
as.data.frame()
# subject week value goal
# 1 1 8 112 112
# 2 1 10 97 97
# 3 1 12 130 130
# 4 1 16 104 104
# 5 1 20 104 104
# 6 1 26 104 104
# 7 1 32 104 104
# 8 1 44 104 104
# 9 1 52 104 104
# 10 1 64 104 104
# 11 1 78 105 105
# 12 1 104 105 105
# 13 2 8 89 89
# 14 2 10 86 86
# 15 2 12 94 94
# 16 2 16 96 96
# 17 2 20 88 88
# 18 2 26 107 107
# 19 2 32 110 110
# 20 2 44 102 102
#
# so on
#

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Calculate a complex difference score with tidyverse? - r

You can filter your values for finish within the mutate() and calculate the mean based on those: df %>% group_by(day,bib) %>% mutate( avg.sg = mean(finish[course=="SG"]), diff = finish - avg.sg)

Related

Error message with missForest package (imputation using Random Forest)

Sample part of a dataset while keeping subgroups intact

How to summarize the top n values across multiple columns row wise?

Descriptive statistics for R-data in a long format

R Impute NA's by Linear Increase Depending on Time Interval

Categories

Resources