Related
I have a dataframe which I would like to split into one 75% and one 25% parts of the original.
I thought a good first step would be to create the 25% dataset from the original dataset, by randomly sampling a quarter of the data.
However sampling shouldn't be entirely random, I want to preserve groups of a certain variable.
So with the example below, I want to randomly sample 1/4 of the data frame, but data needs to remain grouped via the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.
Data example (dput below)
team points assists
1 1 99 33
2 1 90 28
3 1 86 31
4 1 88 39
5 2 95 34
6 2 92 30
7 2 91 32
8 2 79 35
9 3 85 36
10 3 90 29
11 3 91 24
12 3 97 26
13 4 96 28
14 4 94 18
15 4 95 19
16 4 98 25
17 5 78 36
18 5 80 34
19 5 85 39
20 5 89 33
21 6 94 34
22 6 85 39
23 6 99 28
24 6 79 31
25 7 78 35
26 7 99 29
27 7 98 36
28 7 75 39
29 8 97 33
30 8 68 26
31 8 86 38
32 8 76 31
I've tried this using the slice_sample code from dplyr, but this does the exact opposite of what I want (it splits all teams)
testdata <- df %>% group_by(team) %>% slice_sample(n = 2)
My code results in
team points assists
<dbl> <dbl> <dbl>
1 1 90 28
2 1 99 33
3 2 95 34
4 2 92 30
5 3 91 24
6 3 85 36
7 4 95 19
8 4 98 25
9 5 80 34
10 5 78 36
11 6 85 39
12 6 94 34
13 7 78 35
14 7 98 36
15 8 76 31
16 8 86 38
Example of the dataframe:
structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99,
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78,
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26,
28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36,
39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA,
-32L))
With dplyr, if you group_by(team) and then sample, that's sampling within each team--the opposite of what you want. Here's a direct approach:
test_teams = sample(unique(dataset$team), size = 2)
test = dataset %>% filter(team %in% test_teams)
train = dataset %>% filter(!team %in% test_teams)
library(caTools)
split <- sample.split(dataset$team, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
I try to create a weekly cumulative result from this daily data with detail below
date A B C D E F G H
16-Jan-22 227 441 3593 2467 9 6 31 2196
17-Jan-22 224 353 3555 2162 31 5 39 2388
18-Jan-22 181 144 2734 2916 0 0 14 1753
19-Jan-22 95 433 3610 3084 42 19 10 2862
20-Jan-22 141 222 3693 3149 183 19 23 2176
21-Jan-22 247 426 3455 4016 68 0 1 2759
22-Jan-22 413 931 4435 4922 184 2 39 3993
23-Jan-22 389 1340 5433 5071 200 48 27 4495
24-Jan-22 281 940 6875 5009 343 47 71 3713
25-Jan-22 314 454 5167 4555 127 1 68 3554
26-Jan-22 315 973 5789 3809 203 1 105 4456
27-Jan-22 269 1217 6776 4578 227 91 17 5373
28-Jan-22 248 1320 5942 3569 271 91 156 4260
29-Jan-22 155 1406 6771 4328 426 44 109 4566
Solution using data.table and lubridate
library(lubridate)
library(data.table)
setDT(df)
df[, lapply(.SD, sum), by = isoweek(dmy(date))]
# isoweek A B C D E F G H
# 1: 2 227 441 3593 2467 9 6 31 2196
# 2: 3 1690 3849 26915 25320 708 93 153 20426
# 3: 4 1582 6310 37320 25848 1597 275 526 25922
I wanted to provide a solution using the tidyverse principles.
You will need to use the group_by and summarize formulas and this very useful across() function.
#recreate data in tribble
df <- tribble(
~"date", ~"A", ~"B", ~"C", ~"D", ~"E", ~"F", ~"G", ~H,
"16-Jan-22",227, 441, 3593, 2467, 9, 6, 31, 2196,
"17-Jan-22",224, 353, 3555, 2162, 31, 5, 39, 2388,
"18-Jan-22",181, 144, 2734, 2916, 0, 0, 14, 1753,
"19-Jan-22",95, 433, 3610, 3084, 42, 19, 10, 2862,
"20-Jan-22",141, 222, 3693, 3149, 183, 19, 23, 2176,
"21-Jan-22",247, 426, 3455, 4016, 68, 0, 1, 2759,
"22-Jan-22",413, 931, 4435, 4922, 184, 2, 39, 3993,
"23-Jan-22",389, 1340, 5433, 5071, 200, 48, 27, 4495,
"24-Jan-22",281, 940, 6875, 5009, 343, 47, 71, 3713,
"25-Jan-22",314, 454, 5167, 4555, 127, 1, 68, 3554,
"26-Jan-22",315, 973, 5789, 3809, 203, 1, 105, 4456,
"27-Jan-22",269, 1217, 6776, 4578, 227, 91, 17, 5373,
"28-Jan-22",248, 1320, 5942, 3569, 271, 91, 156, 4260,
"29-Jan-22",155, 1406, 6771, 4328, 426, 44, 109, 4566)
#change date to format date
df$date <- ymd(df$date)
# I both create a new column "week_num" and group it by this variables
## Then I summarize that for each column except for "date", take each sum
df %>%
group_by(week_num=lubridate::isoweek(date)) %>%
summarize(across(-c("date"),sum))
# I get this results
week_num A B C D E F G H
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 2017 6028 33189 26785 990 243 310 25464
2 4 1482 4572 34639 26850 1324 131 400 23080
Group_by() and summarize() are relatively straight forward. Across() is a fairly new verb that is super powerful. It allows you to reference columns using tidy selection principles (eg. starts_with(), c(1:9), etc) and if you apply it a formula it will allow that formula to each of the selected columns. Less typing!
Alternatively you would have to individually sum each column A=sum(A) which is more typing.
I have a large dataset of 70 000 rows that I want to perform some operations on, but I can't find an appropriate solution.
bib sta run course finish comment day
1 42 9 1 SG 19.88 99 1
2 42 17 2 A 19.96 11 1
3 42 27 3 B 20.92 22 1
4 42 39 4 A 19.60 11 1
5 42 48 5 SG 20.24 99 1
6 42 61 6 C 22.90 33 1
7 42 76 7 B 20.70 22 1
8 42 86 8 C 22.74 33 1
9 42 93 9 C 22.75 33 1
10 42 103 10 A 19.79 11 1
11 42 114 11 B 20.67 22 1
12 42 120 12 SG 20.10 99 1
I want to end up with a tibble that:
calculates the mean finish time in SG course for each bib number on one particular day. For example, 19.88 + 20.24 + 20.10 / 3
calculate a difference score for each observation in the dataset by subtracting finish from this mean SG score. For example, 19.88 - mean(SG), 19.96 - mean(SG).
I have tried the following approach:
First group by day, bib and course. Then filter by SG and calculate the mean:
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
mutate(avg.sg = mean(finish))
Resulting in the following tibble
bib sta run course finish comment day avg.sg
<int> <int> <int> <chr> <dbl> <int> <chr> <dbl>
1 42 9 1 SG 19.9 99 1 20.1
2 42 48 5 SG 20.2 99 1 20.1
3 42 120 12 SG 20.1 99 1 20.1
4 42 6 1 SG 20.0 99 2 19.9
5 42 42 5 SG 19.8 77 2 19.9
6 42 130 15 SG 19.9 99 2 19.9
7 42 6 1 SG 20.6 99 3 20.5
8 42 68 12 SG 20.6 77 3 20.5
9 42 90 15 SG 20.4 77 3 20.5
Finally I join the two tibbles together using the following syntax:
df %>% full_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
However, this doesn't work. It only works for the SG course but not for course A, B and C. Is there a way to fix this or is there a better solution to the problem?
bib sta run course finish comment day avg.sg diff
1 42 9 1 SG 19.88 99 1 20.07333 -0.193333333
2 42 17 2 A 19.96 11 1 NA NA
3 42 27 3 B 20.92 22 1 NA NA
4 42 39 4 A 19.60 11 1 NA NA
5 42 48 5 SG 20.24 99 1 20.07333 0.166666667
You can filter your values for finish within the mutate() and calculate the mean based on those:
df %>%
group_by(day,bib) %>%
mutate(
avg.sg = mean(finish[course=="SG"]),
diff = finish - avg.sg)
Is the following what you are aiming for?
(note that I added a few random values for a second bib just to make sure the join is done properly)
The difference to your attempt is using summarise() instead of mutate() to consolidate the avg.sgtime data frame, and also dropping a few columns so that the join is not populated with NAs. Instead of dropping you can also set the relevant columns to join by passing the by argument to the left_join() function.
library(dplyr)
library(tidyr) # for join
avg.sgtime <- df %>%
group_by(day, bib, course) %>%
filter(course == 'SG') %>%
summarise(avg.sg = mean(finish), .groups = "drop") %>%
select(c(bib, day, avg.sg))
avg.sgtime
#> # A tibble: 3 x 3
#> bib day avg.sg
#> <dbl> <dbl> <dbl>
#> 1 42 1 20.1
#> 2 43 1 19.1
#> 3 44 2 19.3
df %>% left_join(avg.sgtime) %>%
mutate(diff = finish - avg.sg)
#> Joining, by = c("bib", "day")
#> # A tibble: 36 x 9
#> bib sta run course finish comment day avg.sg diff
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 42 9 1 SG 19.9 99 1 20.1 -0.193
#> 2 42 17 2 A 20.0 11 1 20.1 -0.113
#> 3 42 27 3 B 20.9 22 1 20.1 0.847
#> 4 42 39 4 A 19.6 11 1 20.1 -0.473
#> 5 42 48 5 SG 20.2 99 1 20.1 0.167
#> 6 42 61 6 C 22.9 33 1 20.1 2.83
#> 7 42 76 7 B 20.7 22 1 20.1 0.627
#> 8 42 86 8 C 22.7 33 1 20.1 2.67
#> 9 42 93 9 C 22.8 33 1 20.1 2.68
#> 10 42 103 10 A 19.8 11 1 20.1 -0.283
#> # … with 26 more rows
Created on 2021-07-04 by the reprex package (v2.0.0)
data
df <- tribble(~bib, ~sta, ~run, ~course, ~finish, ~comment, ~day,
42, 9, 1, "SG", 19.88, 99, 1,
42, 17, 2, "A", 19.96, 11, 1,
42, 27, 3, "B", 20.92, 22, 1,
42, 39, 4, "A", 19.60, 11, 1,
42, 48, 5, "SG", 20.24, 99, 1,
42, 61, 6, "C", 22.90, 33, 1,
42, 76, 7, "B", 20.70, 22, 1,
42, 86, 8, "C", 22.74, 33, 1,
42, 93, 9, "C", 22.75, 33, 1,
42, 103, 10, "A", 19.79, 11, 1,
42, 114, 11, "B", 20.67, 22, 1,
42, 120, 12, "SG", 20.10, 99, 1,
43, 9, 1, "SG", 19.12, 99, 1,
43, 17, 2, "A", 19.64, 11, 1,
43, 27, 3, "B", 20.62, 22, 1,
43, 39, 4, "A", 19.23, 11, 1,
43, 48, 5, "SG", 20.11, 99, 1,
43, 61, 6, "C", 22.22, 33, 1,
43, 76, 7, "B", 20.33, 22, 1,
43, 86, 8, "C", 22.51, 33, 1,
43, 93, 9, "C", 22.78, 33, 1,
43, 103, 10, "A", 19.98, 11, 1,
43, 114, 11, "B", 20.11, 22, 1,
43, 120, 12, "SG", 18.21, 99, 1,
44, 9, 1, "SG", 19.18, 99, 2,
44, 17, 2, "A", 19.56, 11, 2,
44, 27, 3, "B", 20.62, 22, 2,
44, 39, 4, "A", 19.20, 11, 2,
44, 48, 5, "SG", 20.74, 99, 2,
44, 61, 6, "C", 22.50, 33, 2,
44, 76, 7, "B", 20.60, 22, 2,
44, 86, 8, "C", 22.74, 33, 2,
44, 93, 9, "C", 22.85, 33, 2,
44, 103, 10, "A", 19.59, 11, 2,
44, 114, 11, "B", 20.27, 22, 2,
44, 120, 12, "SG", 18.10, 99, 2,
)
Thanks #Marcelo Avila for providing me with a very good hint:
I hope this is what you are looking for:
library(dplyr)
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[course == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 12 x 9
# Groups: bib, day [1]
bib sta run course finish comment day avg_finish diff
<int> <int> <int> <chr> <dbl> <int> <int> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
11 42 114 11 B 20.7 22 1 NA 0.597
12 42 120 12 SG 20.1 99 1 20.1 0.0267
I also added another alternative solution with a minor change, using dear #Marcelo Avila's data set:
df %>%
group_by(bib, day) %>%
mutate(across(finish, ~ mean(.x[select(cur_data(), course) == "SG"]), .names = "avg_{.col}"),
diff = finish - avg_finish,
avg_finish = ifelse(course == "SG", avg_finish, NA))
# A tibble: 36 x 9
# Groups: bib, day [3]
bib sta run course finish comment day avg_finish diff
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 42 9 1 SG 19.9 99 1 20.1 -0.193
2 42 17 2 A 20.0 11 1 NA -0.113
3 42 27 3 B 20.9 22 1 NA 0.847
4 42 39 4 A 19.6 11 1 NA -0.473
5 42 48 5 SG 20.2 99 1 20.1 0.167
6 42 61 6 C 22.9 33 1 NA 2.83
7 42 76 7 B 20.7 22 1 NA 0.627
8 42 86 8 C 22.7 33 1 NA 2.67
9 42 93 9 C 22.8 33 1 NA 2.68
10 42 103 10 A 19.8 11 1 NA -0.283
# ... with 26 more rows
PROBLEM
I neeed to impute the NA's in my data frame that comes from a repeated measures study. On this particular outcome, I need to impute the NA's with the last observed non-NA value +1 by each +52 week interval starting from the last observed value.
EXAMPLE
An example data frame with the target imputation goal included.
df <- data.frame(
subject = rep(1:3, each = 12),
week = rep(c(8, 10, 12, 16, 20, 26, 32, 44, 52, 64, 78, 104),3),
value = c(112, 97, 130, 104, NA, NA, NA, NA, NA, NA, NA, NA,
89, 86, 94, 96, 88,107, 110, 102, 107, NA, NA, NA,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, NA, NA),
goal = c(112, 97, 130, 104, 104, 104, 104, 104, 104, 104, 105, 105,
89, 86, 94, 96, 88,107, 110, 102, 107, 107,107, 108,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, 106, 106)
)
I left the intermediate columns in to make what's happening more obvious, but you can remove them with a simple select.
df = df %>%
group_by(subject) %>%
mutate(last_obs_week = max(week[!is.na(value)]),
since_last_week = pmax(0, week - last_obs_week),
inc_52 = since_last_week %/% 52,
result = zoo::na.locf(value) + inc_52
)
all(df$goal == df$result)
# [1] TRUE
print.data.frame(df)
# subject week value goal last_obs_week since_last_week inc_52 result
# 1 1 8 112 112 16 0 0 112
# 2 1 10 97 97 16 0 0 97
# 3 1 12 130 130 16 0 0 130
# 4 1 16 104 104 16 0 0 104
# 5 1 20 NA 104 16 4 0 104
# 6 1 26 NA 104 16 10 0 104
# 7 1 32 NA 104 16 16 0 104
# 8 1 44 NA 104 16 28 0 104
# 9 1 52 NA 104 16 36 0 104
# 10 1 64 NA 104 16 48 0 104
# 11 1 78 NA 105 16 62 1 105
# 12 1 104 NA 105 16 88 1 105
# 13 2 8 89 89 52 0 0 89
# ...
One can use dplyr and tidyr::fill to get the desired result. The logic will be to add a column to track the week which had the non-NA value. Use tidyr::fill to populate last non-NA value and then check if difference of current week with last non-NA week is more than 52 then increase the value by 1.
library(dplyr)
library(tidyr)
df %>% group_by(subject) %>%
mutate(weekWithLastNonNaValue = ifelse(is.na(value), NA, week)) %>%
fill(value, weekWithLastNonNaValue) %>%
mutate(value = value + (week-weekWithLastNonNaValue) %/% 52) %>%
select(-weekWithLastNonNaValue) %>%
as.data.frame()
# subject week value goal
# 1 1 8 112 112
# 2 1 10 97 97
# 3 1 12 130 130
# 4 1 16 104 104
# 5 1 20 104 104
# 6 1 26 104 104
# 7 1 32 104 104
# 8 1 44 104 104
# 9 1 52 104 104
# 10 1 64 104 104
# 11 1 78 105 105
# 12 1 104 105 105
# 13 2 8 89 89
# 14 2 10 86 86
# 15 2 12 94 94
# 16 2 16 96 96
# 17 2 20 88 88
# 18 2 26 107 107
# 19 2 32 110 110
# 20 2 44 102 102
#
# so on
#
Hi I have a dataframe as below (example)..
time <-c( 8/11/2017 10:21, 8/10/2017 22:34, 8/16/2017 2:28, 8/14/2017 6:17, 8/11/2017 6:33, 8/15/2017 23:46, 8/10/2017 20:10, 8/14/2017 3:35, 8/11/2017 4:09, 8/15/2017 21:05, 8/11/2017 2:16, 8/10/2017 18:17, 8/13/2017 10:02, 8/13/2017 9:08, 8/13/2017 8:32, 8/13/2017 8:20, 8/13/2017 7:56)
code <- c( 1, 3, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2)
var1 <- c( 5, 11, 16, 22, 27, 33, 38, 44, 49, 55, 60, 66, 71, 77, 66, 71, 77)
var2 <- c( 115, 66, 71, 33, 38, 110, 115, 121, 126, 132, 104, 66, 71, 77, 115, 121, 66)
var3 <- c( 38, 44, 49, 55, 60, 66, 71, 77, 66, 71, 77, 132, 104, 66, 71, 77, 115)
time code var1 var2 var3
8/11/2017 10:21 1 5 115 38
8/10/2017 22:34 3 11 66 44
8/16/2017 2:28 2 16 71 49
8/14/2017 6:17 2 22 33 55
8/11/2017 6:33 1 27 38 60
8/15/2017 23:46 2 33 110 66
8/10/2017 20:10 2 38 115 71
8/14/2017 3:35 1 44 121 77
8/11/2017 4:09 1 49 126 66
8/15/2017 21:05 2 55 132 71
8/11/2017 2:16 2 60 104 77
8/10/2017 18:17 2 66 66 132
8/13/2017 10:02 2 71 71 104
8/13/2017 9:08 2 77 77 66
8/13/2017 8:32 1 66 115 71
8/13/2017 8:20 1 71 121 77
8/13/2017 7:56 2 77 66 115
I want to recast this dataframe using the column "code". The output I'm expecting should be as below.
time code1_var1 code1_var2 code1_var3 code2_var1 code2_var2 code2_var3 code3_var1 code3_var2 code3_var3
8/11/2017 10:21
8/10/2017 22:34
8/16/2017 2:28
8/14/2017 6:17
8/11/2017 6:33
8/15/2017 23:46
8/10/2017 20:10
8/14/2017 3:35
8/11/2017 4:09
8/15/2017 21:05
8/11/2017 2:16
8/10/2017 18:17
8/13/2017 10:02
8/13/2017 9:08
8/13/2017 8:32
8/13/2017 8:20
8/13/2017 7:56
But when I tried dcast funtion in R It is giving me an error for time variable.
Please help me with this reshaping objective
note: The result should have many NA because of reshaping and missing data.
The easiest way to do this would be with dcast from "data.table" or even reshape from base R.
Assuming your vectors are collected in a data.frame named "d", try the following:
library(data.table)
setDT(d)
x <- dcast(d, time ~ code, value.var = paste0("var", 1:3))
head(x)
# time var1_1 var1_2 var1_3 var2_1 var2_2 var2_3 var3_1 var3_2 var3_3
# 1: 8/10/2017 18:17 NA 66 NA NA 66 NA NA 132 NA
# 2: 8/10/2017 20:10 NA 38 NA NA 115 NA NA 71 NA
# 3: 8/10/2017 22:34 NA NA 11 NA NA 66 NA NA 44
# 4: 8/11/2017 10:21 5 NA NA 115 NA NA 38 NA NA
# 5: 8/11/2017 2:16 NA 60 NA NA 104 NA NA 77 NA
# 6: 8/11/2017 4:09 49 NA NA 126 NA NA 66 NA NA
OR
reshape(d, direction = "wide", idvar = "time", timevar = "code")
If you wanted to use the tidyverse, you would need to first gather, then create a new "times" variable, and then reshape to the wide format:
library(tidyverse)
d %>%
gather(variable, value, starts_with("var")) %>%
unite(key, code, variable) %>%
spread(key, value)