dplyr: using a column created by mutate() in the mutation itself

I have a data frame that looks something like this:
> df
# A tibble: 5,427 x 3
   cond  desired   inc
   <chr>   <dbl> <dbl>
 1 <NA>        0     0
 2 <NA>        5     5
 3 X          10     5
 4 X           7     7
 5 <NA>       16    16
 6 <NA>       21     5
 7 <NA>       26     5
 8 <NA>       31     5
 9 X          37     6
10 <NA>        5     5
This already includes my desired output. What I want to do is sum up the values of inc, but reset the sum whenever there is an X in the cond column of the previous row. For example, in row 9 I'd take the desired value from the previous row (31) and add the inc value from row 9 (6), which gives 37. And in row 5 I'd just take the inc value, because the cond column of the previous row was X. I solved this with a loop, but I'd like a vectorized solution. So far I have this:
df$test <- 0
df <- df %>% mutate(test = ifelse(is.na(lag(df$cond)), lag(test) + inc, inc))
If I run the second line once I get this:
> df
# A tibble: 5,427 x 4
   cond  desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1 <NA>        0     0    NA
 2 <NA>        5     5     5
 3 X          10     5     5
 4 X           7     7     7
 5 <NA>       16    16    16
 6 <NA>       21     5     5
 7 <NA>       26     5     5
 8 <NA>       31     5     5
 9 X          37     6     6
10 <NA>        5     5     5
After the second run it looks like this:
> df
# A tibble: 5,427 x 4
   cond  desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1 <NA>        0     0    NA
 2 <NA>        5     5    NA
 3 X          10     5    10
 4 X           7     7     7
 5 <NA>       16    16    16
 6 <NA>       21     5    21
 7 <NA>       26     5    10
 8 <NA>       31     5    10
 9 X          37     6    11
10 <NA>        5     5     5
# ... with 5,417 more rows
Third time:
> df
# A tibble: 5,427 x 4
   cond  desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1 <NA>        0     0    NA
 2 <NA>        5     5    NA
 3 X          10     5    NA
 4 X           7     7     7
 5 <NA>       16    16    16
 6 <NA>       21     5    21
 7 <NA>       26     5    26
 8 <NA>       31     5    15
 9 X          37     6    16
10 <NA>        5     5     5
Then, after the fifth time:
> df
# A tibble: 5,427 x 4
   cond  desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1 <NA>        0     0    NA
 2 <NA>        5     5    NA
 3 X          10     5    NA
 4 X           7     7     7
 5 <NA>       16    16    16
 6 <NA>       21     5    21
 7 <NA>       26     5    26
 8 <NA>       31     5    31
 9 X          37     6    37
10 <NA>        5     5     5
I'm using the column I'm creating with mutate() inside the mutate() call itself, and I guess that is what causes this behaviour. Is there any way to get to my desired result? Thanks in advance!
The data frame:
structure(list(cond = c(NA, NA, "X", "X", NA, NA, NA, NA, "X",
NA, NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, "X",
NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, "X",
NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, "X", NA,
NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA,
NA, "X", NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA,
NA, NA, NA, "X", NA, NA, NA, "X", NA, NA, NA, NA, "X", NA, NA,
NA, NA, NA, NA, NA, NA, "X", NA, NA, "X", NA, NA, NA, NA, "X",
NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA,
NA, "X", NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, NA,
NA, NA, NA, NA, NA, "X", NA, NA, NA, "X", "X", NA, NA, NA, NA,
NA, NA, NA, NA, "X", "X", NA, "X", NA, NA, NA, NA, NA, NA, NA,
NA, "X", NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, "X",
NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, "X", NA, NA, NA, NA,
"X", NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA,
"X", NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, "X", NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, "X",
NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, "X", NA, NA, NA), desired = c(0, 5, 10, 7, 16, 21, 26,
31, 37, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 5, 10, 15, 20,
30, 7, 15, 21, 25, 40, 45, 55, 12, 20, 25, 30, 35, 40, 45, 50,
55, 60, 65, 70, 75, 5, 10, 15, 20, 22, 30, 35, 45, 50, 55, 60,
65, 70, 75, 9, 14, 19, 24, 29, 34, 39, 44, 5, 7, 10, 2, 7, 12,
17, 22, 27, 5, 10, 15, 20, 25, 30, 35, 38, 4, 7, 12, 17, 22,
27, 32, 37, 39, 13, 18, 23, 28, 33, 38, 43, 48, 53, 5, 10, 15,
20, 25, 30, 35, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 5, 10,
15, 20, 2, 10, 15, 20, 25, 5, 10, 15, 20, 25, 30, 35, 40, 45,
5, 8, 12, 5, 10, 14, 19, 24, 5, 10, 15, 20, 25, 30, 35, 40, 45,
5, 10, 15, 20, 25, 28, 33, 38, 5, 11, 5, 10, 15, 20, 25, 30,
35, 40, 45, 12, 17, 22, 27, 32, 37, 42, 47, 5, 10, 15, 20, 5,
5, 10, 15, 20, 25, 30, 35, 40, 45, 5, 5, 10, 5, 10, 15, 20, 25,
30, 35, 40, 45, 5, 10, 15, 20, 5, 10, 15, 20, 25, 30, 34, 39,
44, 5, 10, 15, 20, 25, 30, 5, 10, 15, 20, 25, 5, 10, 15, 20,
25, 5, 10, 15, 20, 25, 29, 5, 10, 15, 20, 23, 25, 30, 35, 40,
5, 15, 20, 25, 30, 35, 40, 5, 10, 15, 20, 25, 5, 10, 15, 20,
25, 28, 33, 38, 43, 48, 53, 58, 71, 76, 81, 5, 10, 5, 10, 5,
10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 5,
10, 15), inc = c(0, 5, 5, 7, 16, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 10, 7, 8, 6, 4, 15, 5, 10, 12, 8, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 8, 5, 10, 5, 5,
5, 5, 5, 5, 9, 5, 5, 5, 5, 5, 5, 5, 5, 2, 3, 2, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 3, 4, 3, 5, 5, 5, 5, 5, 5, 2, 13, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 2, 8, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
3, 4, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
3, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 12, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4,
5, 5, 5, 5, 3, 2, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5, 5, 13, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5)), .Names = c("cond",
"desired", "inc"), row.names = c(NA, -300L), class = c("tbl_df",
"tbl", "data.frame"))

Here's an example using the ave() function and the df structure from above. I'm showing all the steps for clarity but these could be reduced if needed.
library(dplyr)
df %>%
  mutate(prevcond = lag(cond)) %>%
  mutate(flag = ifelse(is.na(prevcond) | prevcond != 'X', 0, 1)) %>%
  mutate(counter = cumsum(flag)) %>%
  mutate(desired2 = ave(inc, counter, FUN = cumsum))
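Since the helper columns only exist to feed the final cumulative sum, the steps can also be collapsed into a single group_by()/mutate() pair. A minimal sketch of that shorter form (same logic as above, plain dplyr):
library(dplyr)

df %>%
  # start a new group on every row that follows an "X"
  group_by(counter = cumsum(ifelse(is.na(lag(cond)) | lag(cond) != "X", 0, 1))) %>%
  mutate(desired2 = cumsum(inc)) %>%
  ungroup()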

To arrive at your desired output, we must first create a grouping column that resets every time the previous row is equal to X. For this we use row_number() in combination with zoo::na.locf(). Then we can simply use cumsum():
library(dplyr)
library(zoo)
df %>%
  group_by(grp = na.locf(row_number(cond),
                         fromLast = TRUE,
                         na.rm = FALSE)) %>%
  mutate(test = cumsum(inc))
#    cond  desired   inc   grp  test
#    <chr>   <dbl> <dbl> <int> <dbl>
# 1  <NA>        0     0     1     0
# 2  <NA>        5     5     1     5
# 3  X          10     5     1    10
# 4  X           7     7     2     7
# 5  <NA>       16    16     3    16
# 6  <NA>       21     5     3    21
# 7  <NA>       26     5     3    26
# 8  <NA>       31     5     3    31
# 9  X          37     6     3    37
#10  <NA>        5     5     4     5
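If you would rather avoid the zoo dependency, the same grouping column can also be built with cumsum() over a lagged flag. A sketch, equivalent in spirit to the two answers above:
library(dplyr)

df %>%
  # increment the group id on every row whose previous row is an "X"
  group_by(grp = cumsum(lag(!is.na(cond) & cond == "X", default = FALSE))) %>%
  mutate(test = cumsum(inc)) %>%
  ungroup()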

Related

R Pivot Longer With Multiple Columns

HAVE = data.frame( COURSE =c( 1, 1, 1, 2, 2, 2, 3, 3, 3 ),
STUDENT =c( 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C' ),
FISH =c( 4, 8, 9, 1, 7, 1, 10, 10, 10 ),
CAT =c( 9, 8, 10, 7, 1, 2, 8, 0, 2 ),
FOX =c( 7, NA, 9, 0, NA, 10, 5, NA, 10 ),
BUNNIE =c( 6, NA, 0, 5, NA, 6, 4, NA, 1 ),
RABBIT =c( 2, NA, 0, 6, NA, 8, 3, NA, 0 ))
WANT = data.frame( COURSE =c( 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3 ),
TEST =c( 'FISH', 'CAT', 'FOX', 'BUNNIE', 'RABBIT', 'FISH', 'CAT', 'FOX', 'BUNNIE', 'RABBIT', 'FISH', 'CAT', 'FOX', 'BUNNIE', 'RABBIT' ),
A =c( 4, 9, 7, 6, 2, 1, 7, 0, 5, 6, 10, 8, 5, 4, 3 ),
B =c( 8, 8, NA, NA, NA, 7, 1, NA, NA, NA, 10, 0, NA, NA, NA ),
C =c( 9, 10, 9, 0, 0, 1, 2, 10, 6, 8, 10, 2, 10, 1, 0 ))
I tried:
WANT = HAVE %>% pivot_longer(FISH:RABBIT, names_to = "TEST", values_to = A:C)
with no success.
Basically you want to gather the animal names into a single column named "TEST", and then spread the student names into several columns. So you need two steps:
pivot_longer(), where you gather the animal names
pivot_wider(), where you spread the student names into columns
library(tidyr)
HAVE = data.frame( COURSE =c( 1, 1, 1, 2, 2, 2, 3, 3, 3 ),
STUDENT =c( 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C' ),
FISH =c( 4, 8, 9, 1, 7, 1, 10, 10, 10 ),
CAT =c( 9, 8, 10, 7, 1, 2, 8, 0, 2 ),
FOX =c( 7, NA, 9, 0, NA, 10, 5, NA, 10 ),
BUNNIE =c( 6, NA, 0, 5, NA, 6, 4, NA, 1 ),
RABBIT =c( 2, NA, 0, 6, NA, 8, 3, NA, 0 ))
out <- HAVE |>
  pivot_longer(
    cols = c("FISH", "CAT", "FOX", "BUNNIE", "RABBIT"),
    names_to = "TEST"
  ) |>
  pivot_wider(
    names_from = "STUDENT",
    values_from = "value"
  )
out
#> # A tibble: 15 × 5
#> COURSE TEST A B C
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 FISH 4 8 9
#> 2 1 CAT 9 8 10
#> 3 1 FOX 7 NA 9
#> 4 1 BUNNIE 6 NA 0
#> 5 1 RABBIT 2 NA 0
#> 6 2 FISH 1 7 1
#> 7 2 CAT 7 1 2
#> 8 2 FOX 0 NA 10
#> 9 2 BUNNIE 5 NA 6
#> 10 2 RABBIT 6 NA 8
#> 11 3 FISH 10 10 10
#> 12 3 CAT 8 0 2
#> 13 3 FOX 5 NA 10
#> 14 3 BUNNIE 4 NA 1
#> 15 3 RABBIT 3 NA 0
Check that the result is what is expected:
WANT = data.frame( COURSE =c( 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3 ),
TEST =c( 'FISH', 'CAT', 'FOX', 'BUNNIE', 'RABBIT', 'FISH', 'CAT', 'FOX', 'BUNNIE', 'RABBIT', 'FISH', 'CAT', 'FOX', 'BUNNIE', 'RABBIT' ),
A =c( 4, 9, 7, 6, 2, 1, 7, 0, 5, 6, 10, 8, 5, 4, 3 ),
B =c( 8, 8, NA, NA, NA, 7, 1, NA, NA, NA, 10, 0, NA, NA, NA ),
C =c( 9, 10, 9, 0, 0, 1, 2, 10, 6, 8, 10, 2, 10, 1, 0 ))
identical(out, as_tibble(WANT))
#> [1] TRUE
Created on 2022-10-05 with reprex v2.0.2

Wide to long dataframe with many within and between subject variables

I'm trying to build a data set in long format, with 2 between-subject variables and 2 within-subject variables, from an Excel table.
The current dataset structure is the following:
> str(Subset_0)
'data.frame': 54 obs. of 11 variables:
$ Subject : num 1 2 3 4 5 6 7 8 9 10 ...
$ BETWEEN1: num 1 1 1 2 2 2 2 1 1 2 ...
$ BETWEEN2: num 1 1 2 2 2 2 1 1 1 1 ...
$ A_x1 : num 5 1 3 1 0 6 1 2 7 1 ...
$ B_x2 : num 5 1 3 0 3 0 0 2 6 1 ...
$ C_y1 : num 6 9 9 2 2 4 2 2 6 0 ...
$ D_y2 : num 6 15 4 1 2 4 3 1 3 0 ...
$ K_x1 : num 5 1 3 1 0 6 1 2 7 1 ...
$ L_x2 : num 5 1 3 0 3 0 0 2 6 1 ...
$ M_y1 : num 6 9 9 2 2 4 2 2 6 14 ...
$ N_y2 : num 3 1 0 4 0 5 6 5 17 21 ...
data file from dput:
structure(list(Subject = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 55), BETWEEN1 = c(1,
1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1,
2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), BETWEEN2 = c(1,
1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A_x1 = c(5,
1, 3, 1, 0, 6, 1, 2, 7, 1, 1, 0, 0, 2, 0, 8, NA, NA, NA, NA,
14, 23, 19, 10, 9, 10, 11, 14, 16, 8, 24, 17, 8, 22, 14, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), B_x2 = c(5, 1, 3, 0, 3, 0, 0, 2, 6, 1, 0, 0, 0, 0, 1,
7, 14, 23, 19, 10, 14, 29, 15, 7, 13, 16, 7, 9, 17, 6, 7, 16,
6, 11, 13, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), C_y1 = c(6, 9, 9, 2, 2, 4, 2, 2, 6,
0, 6, 0, 1, 10, 3, 8, 14, 29, 15, 7, 17, 21, 24, 7, 32, 31, 31,
21, 27, 29, 18, 27, 33, 23, 28, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), D_y2 = c(6, 15,
4, 1, 2, 4, 3, 1, 3, 0, 0, 0, 2, 2, 2, 5, 17, 21, 24, 7, 24,
16, 28, 7, 28, 23, 25, 25, 24, 28, 33, 27, 31, 33, 21, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), K_x1 = c(5, 1, 3, 1, 0, 6, 1, 2, 7, 1, 1, 0, 0, 2, 0, 8,
24, 16, 28, 7, 24, 31, 31, 13, 32, 35, 32, 22, 29, 32, 32, 29,
34, 32, 34, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), L_x2 = c(5, 1, 3, 0, 3, 0, 0, 2, 6,
1, 0, 0, 0, 0, 1, 7, 24, 31, 31, 13, 30, 30, 34, 12, 31, 27,
23, 25, 33, 28, 31, 29, 30, 36, 24, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), M_y1 = c(6,
9, 9, 2, 2, 4, 2, 2, 6, 14, 23, 19, 10, 9, 10, 11, 14, 16, 8,
24, 17, 8, 22, 14, 33, 28, 31, 14, 23, 19, 10, 9, 10, 11, 14,
16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), N_y2 = c(3, 1, 0, 4, 0, 5, 6, 5, 17, 21, 24, 7,
32, 31, 31, 21, NA, NA, NA, NA, 27, 29, 18, 27, NA, NA, 17, 21,
24, 7, 32, 31, 31, 21, 27, 17, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-54L))
I need to separate it by subject and by condition: the values of A, B, C, and D should go into one column called 'First', and K, L, M, and N into another called 'Second'. Moreover, the x, y, 1 and 2 present in these variable names represent within-subject factors, which I also need to capture in another two columns: 'Within1' for x and y, and 'Within2' for 1 and 2. And finally, two columns 'Between1' and 'Between2' for the between-subject factors.
I need it to look like this:
Subject First Second Within2 Within1 Between1 Between2
      1    Ai     Ki       1       x        1        1
      1    Bi     Li       2       x        1        1
      1    Ci     Mi       1       y        1        1
      1    Di     Ni       2       y        1        1
      2    Ai     Ki       1       x        1        1
      2    Bi     Li       2       x        1        1
      2    Ci     Mi       1       y        1        1
      2    Di     Ni       2       y        1        1
...
I have used the reshape function twice. First, for gathering A, B, C, D into one column and separating the within-subject variables from it, which worked:
Subset_1 <- reshape(Subset_0,
                    varying = c("A_x1", "B_x2", "C_y1", "D_y2"),
                    v.names = "First",
                    timevar = "Within1",
                    times = c("A_x1", "B_x2", "C_y1", "D_y2"),
                    direction = "long")
# Next_Trial_Choice column
Subset_1$Within1[Subset_1$Within1 == "A_x1"] <- "x"
Subset_1$Within1[Subset_1$Within1 == "B_x2"] <- "x"
Subset_1$Within1[Subset_1$Within1 == "C_y1"] <- "y"
Subset_1$Within1[Subset_1$Within1 == "D_y2"] <- "y"
# cleaning the names - opponent column
Subset_1$Within2[Subset_1$Within2 == "A_x1"] <- "1"
Subset_1$Within2[Subset_1$Within2 == "B_x2"] <- "2"
Subset_1$Within2[Subset_1$Within2 == "C_y1"] <- "1"
Subset_1$Within2[Subset_1$Within2 == "D_y2"] <- "2"
The problem is that I need to do the same for another column ('Second'), and I tried to use reshape again as I did before, applied to Subset_1 this time, but it doesn't do what I need.
Is there a way to do this?
This looks like it gets your given example result:
# pipe
library(magrittr)
# input data
dxyz <- structure(list(Subject = c(
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 55
), BETWEEN1 = c(
1,
1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1,
2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), BETWEEN2 = c(
1,
1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), A_x1 = c(
5,
1, 3, 1, 0, 6, 1, 2, 7, 1, 1, 0, 0, 2, 0, 8, NA, NA, NA, NA,
14, 23, 19, 10, 9, 10, 11, 14, 16, 8, 24, 17, 8, 22, 14, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA
), B_x2 = c(
5, 1, 3, 0, 3, 0, 0, 2, 6, 1, 0, 0, 0, 0, 1,
7, 14, 23, 19, 10, 14, 29, 15, 7, 13, 16, 7, 9, 17, 6, 7, 16,
6, 11, 13, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA
), C_y1 = c(
6, 9, 9, 2, 2, 4, 2, 2, 6,
0, 6, 0, 1, 10, 3, 8, 14, 29, 15, 7, 17, 21, 24, 7, 32, 31, 31,
21, 27, 29, 18, 27, 33, 23, 28, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), D_y2 = c(
6, 15,
4, 1, 2, 4, 3, 1, 3, 0, 0, 0, 2, 2, 2, 5, 17, 21, 24, 7, 24,
16, 28, 7, 28, 23, 25, 25, 24, 28, 33, 27, 31, 33, 21, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA
), K_x1 = c(
5, 1, 3, 1, 0, 6, 1, 2, 7, 1, 1, 0, 0, 2, 0, 8,
24, 16, 28, 7, 24, 31, 31, 13, 32, 35, 32, 22, 29, 32, 32, 29,
34, 32, 34, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA
), L_x2 = c(
5, 1, 3, 0, 3, 0, 0, 2, 6,
1, 0, 0, 0, 0, 1, 7, 24, 31, 31, 13, 30, 30, 34, 12, 31, 27,
23, 25, 33, 28, 31, 29, 30, 36, 24, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), M_y1 = c(
6,
9, 9, 2, 2, 4, 2, 2, 6, 14, 23, 19, 10, 9, 10, 11, 14, 16, 8,
24, 17, 8, 22, 14, 33, 28, 31, 14, 23, 19, 10, 9, 10, 11, 14,
16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA
), N_y2 = c(
3, 1, 0, 4, 0, 5, 6, 5, 17, 21, 24, 7,
32, 31, 31, 21, NA, NA, NA, NA, 27, 29, 18, 27, NA, NA, 17, 21,
24, 7, 32, 31, 31, 21, 27, 17, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)), class = "data.frame", row.names = c(
NA,
-54L
))
# extract all abcd in long format with Within separated
abcd <- dxyz %>%
  tidyr::pivot_longer(-c(Subject, BETWEEN1, BETWEEN2)) %>%
  tidyr::separate(col = name, sep = "_", into = c("First", "Within")) %>%
  dplyr::filter(First %in% c("A", "B", "C", "D")) %>%
  dplyr::mutate(
    Within21 = stringr::str_extract_all(Within, "[:digit:]") %>% unlist(),
    Within22 = stringr::str_extract_all(Within, "[:alpha:]") %>% unlist()
  ) %>%
  dplyr::select(-Within)
# extract all klmn in long format with Within separated
klmn <- dxyz %>%
  tidyr::pivot_longer(-c(Subject, BETWEEN1, BETWEEN2)) %>%
  tidyr::separate(col = name, sep = "_", into = c("Second", "Within")) %>%
  dplyr::filter(Second %in% c("K", "L", "M", "N")) %>%
  dplyr::mutate(
    Within21 = stringr::str_extract_all(Within, "[:digit:]") %>% unlist(),
    Within22 = stringr::str_extract_all(Within, "[:alpha:]") %>% unlist()
  ) %>%
  dplyr::select(-Within)
# join both data sets together
abcd %>%
  dplyr::left_join(
    klmn,
    by = c("Subject", "BETWEEN1", "BETWEEN2", "Within21", "Within22")
  ) %>%
  dplyr::select(
    Subject, First, Second, Within21, Within22, BETWEEN1, BETWEEN2, value.x, value.y
  )
I separated the reshaping into two pieces, one for A, B, C, D and one for K, L, M, N, and then joined the data together.
Here is one option with pivot_longer(). I know I separated a bit too much, but that is just to avoid confusion with the names. You can adjust them according to your desired output.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = c("A_x1", "B_x2", "C_y1", "D_y2"), names_to = "first") %>%
  pivot_longer(cols = c("K_x1", "L_x2", "M_y1", "N_y2"), names_to = "second", values_to = "value2") %>%
  separate(first, into = c("first", "Within1"), sep = "_") %>%
  separate(Within1, into = c("Within1", "Within1_2"), sep = "(?<=[A-Za-z])(?=[0-9])") %>%
  separate(second, into = c("second", "Within2"), sep = "_") %>%
  separate(Within2, into = c("Within2", "Within2_2"), sep = "(?<=[A-Za-z])(?=[0-9])") %>%
  select(-c(value, value2)) %>%
  distinct()
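For what it's worth, if hard-coding the letter-to-column mapping (A/B/C/D into First, K/L/M/N into Second) is acceptable, a more compact sketch uses names_pattern to split the names in one pivot_longer() call and then a single pivot_wider(). This is only a sketch against the dxyz data frame defined in the first answer above, not tested beyond it:
library(dplyr)
library(tidyr)

dxyz %>%
  pivot_longer(
    cols = -c(Subject, BETWEEN1, BETWEEN2),
    names_to = c("set", "Within1", "Within2"),
    names_pattern = "([A-N])_([xy])([12])"
  ) %>%
  # map the original letter to the requested output column
  mutate(set = if_else(set %in% c("A", "B", "C", "D"), "First", "Second")) %>%
  pivot_wider(names_from = set, values_from = value)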

How do I filter a first event per day per group, select a variable in another column based on a condition, and run calculations on values in between?

In R, I have a data frame with the columns id (representing study participants), phase, time, glucose, steps, and kiloCalories. id and phase are factors; time is POSIXct and includes date + time; glucose (sampled every ~15 minutes), steps (sampled every minute), and kiloCalories (sampled irregularly, representing an eaten meal) are numeric.
Glucose and kiloCalories are sampled much less frequently than steps, so those columns contain lots of NAs.
I would like to filter this data frame in the following ways:
1. Retrieve the rows with the first meal of the day of each participant (id), and their glucose reading 2 hours (+-15 minutes) before that meal.
2. Retrieve the rows with each meal (i.e. each kiloCalories entry) of each participant (id), along with the glucose reading 2 hours (+-15 minutes) after the meal.
3. From task 2, take the subset of data between the meal and the glucose reading, and calculate the sum of steps within that time.
The reason I specify 2 hours (+-15 minutes) is that it is very unlikely that the data frame has a glucose reading exactly 2 hours after a meal is eaten, so I want to widen the time frame.
I've tried this StackOverflow thread on how to subset based on time and condition, but to no avail, leaving me stuck at my first task. And that thread does not talk about the complex subsetting I'd like to perform.
Edit - Here is some sample data which meets the criteria of the tasks:
sampleData <- structure(list(id = c(13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13), phase = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), time = structure(c(1450881900,
1450881960, 1450882020, 1450882080, 1450882140, 1450882200, 1450882260,
1450882320, 1450882380, 1450882440, 1450882500, 1450882560, 1450882620,
1450882680, 1450882740, 1450882800, 1450882860, 1450882920, 1450882980,
1450883040, 1450883100, 1450883160, 1450883220, 1450883280, 1450883340,
1450883400, 1450883460, 1450883520, 1450883580, 1450883640, 1450883700,
1450883760, 1450883820, 1450883880, 1450883940, 1450884000, 1450884060,
1450884120, 1450884180, 1450884240, 1450884300, 1450884360, 1450884420,
1450884480, 1450884540, 1450884600, 1450884660, 1450884720, 1450884780,
1450884840, 1450884900, 1450884960, 1450885020, 1450885080, 1450885140,
1450885200, 1450885260, 1450885320, 1450885380, 1450885440, 1450885500,
1450885560, 1450885620, 1450885680, 1450885740, 1450885800, 1450885860,
1450885920, 1450885980, 1450886040, 1450886100, 1450886160, 1450886220,
1450886280, 1450886340, 1450886400, 1450886460, 1450886520, 1450886580,
1450886640, 1450886700, 1450886760, 1450886820, 1450886880, 1450886940,
1450887000, 1450887060, 1450887120, 1450887180, 1450887240, 1450887300,
1450887360, 1450887420, 1450887480, 1450887540, 1450887600, 1450887660,
1450887720, 1450887780, 1450887840, 1450887900, 1450887960, 1450888020,
1450888080, 1450888140, 1450888200, 1450888260, 1450888320, 1450888380,
1450888440, 1450888500, 1450888560, 1450888620, 1450888680, 1450888740,
1450888800, 1450888860, 1450888920, 1450888980, 1450889040, 1450889100,
1450889160, 1450889220, 1450889280, 1450889340, 1450889400, 1450889460,
1450889520, 1450889580, 1450889640, 1450889700, 1450889760, 1450889820,
1450889880, 1450889940, 1450890000, 1450890060, 1450890120, 1450890180,
1450890240, 1450890300, 1450890360, 1450890420, 1450890480, 1450890540,
1450890600, 1450890660, 1450890720, 1450890780, 1450890840, 1450890900,
1450890960, 1450891020, 1450891080, 1450891140, 1450891200, 1450891260,
1450891320, 1450891380, 1450891440, 1450891500, 1450891560, 1450891620,
1450891680, 1450891740, 1450891800, 1450891860, 1450891920, 1450891980,
1450892040, 1450892100, 1450892160, 1450892220, 1450892280, 1450892340,
1450892400, 1450892460, 1450892520, 1450892580, 1450892640, 1450892700,
1450892760, 1450892820, 1450892880, 1450892940, 1450893000, 1450893060,
1450893120, 1450893180, 1450893240, 1450893300, 1450893360, 1450893420,
1450893480, 1450893540, 1450893600, 1450893660, 1450893720, 1450893780,
1450893840, 1450893900, 1450893960, 1450894020, 1450894080, 1450894140,
1450894140, 1450894200, 1450894260, 1450894320, 1450894380, 1450894440,
1450894500, 1450894560, 1450894620, 1450894680, 1450894740, 1450894800,
1450894860, 1450894920, 1450894980, 1450895040, 1450895100, 1450895160,
1450895220, 1450895280, 1450895340, 1450895400, 1450895460, 1450895520,
1450895580, 1450895640, 1450895700, 1450895760, 1450895820, 1450895880,
1450895940, 1450896000, 1450896060, 1450896120, 1450896180, 1450896240,
1450896300, 1450896360, 1450896420, 1450896480, 1450896540, 1450896600,
1450896660, 1450896720, 1450896780, 1450896840, 1450896900, 1450896960,
1450897020, 1450897080, 1450897140, 1450897200, 1450897260, 1450897320,
1450897380, 1450897440, 1450897500, 1450897560, 1450897620, 1450897680,
1450897740, 1450897800, 1450897860, 1450897920, 1450897980, 1450898040,
1450898100, 1450898160, 1450898220, 1450898280, 1450898340, 1450898400,
1450898460, 1450898520, 1450898580, 1450898640, 1450898700, 1450898760,
1450898820, 1450898880, 1450898940, 1450899000, 1450899060, 1450899120,
1450899180, 1450899240, 1450899300, 1450899360, 1450899420, 1450899480,
1450899540, 1450899600, 1450899660, 1450899720, 1450899780, 1450899840,
1450899900), class = c("POSIXct", "POSIXt")), gl = c(NA, NA,
NA, NA, NA, NA, NA, NA, 84, NA, NA, NA, NA, 83, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 81, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 82, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 84, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 83, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 79, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
76, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 78,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 93, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 116, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 128, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 141, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 142, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 146,
143, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
136, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
129, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
134, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
139, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
134, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
128, NA, NA, NA, NA, NA, NA), steps = c(24, 39, 28, 19, 29, 6,
12, 3, 13, 1, 6, 2, 1, 13, 10, 1, 1, 1, 1, 0, 0, 1, 1, 3, 1,
0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 0, 3, 33, 27, 17, 27,
30, 19, 23, 34, 38, 25, 30, 42, 31, 31, 16, 52, 91, 39, 23, 7,
6, 27, 64, 20, 53, 22, 14, 14, 5, 4, 13, 7, 13, 7, 8, 10, 14,
26, 25, 19, 23, 35, 23, 15, 13, 12, 11, 27, 21, 25, 27, 4, 8,
18, 15, 22, 30, 16, 15, 15, 5, 3, 4, 6, 0, 12, 10, 4, 3, 5, 2,
5, 10, 13, 7, 2, 6, 2, 1, 15, 23, 25, 18, 27, 5, 11, 22, 31,
17, 27, 19, 2, 0, 12, 3, 0, 5, 5, 0, 0, 1, 0, 2, 2, 2, 5, 4,
4, 1, 7, 2, 5, 4, 8, 2, 4, 0, 4, 6, 8, 11, 10, 22, 2, 1, 0, 4,
4, 2, 2, 9, 19, 8, 11, 7, 7, 4, 0, 1, 0, 2, 3, 13, 9, 0, 3, 4,
5, 5, 7, 5, 5, 8, 8, 26, 23, 26, 27, 24, 24, 13, 25, 17, 24,
24, 11, 16, 15, 25, 21, 18, 11, 16, 19, 2, 0, 7, 6, 6, 3, 1,
13, 13, 0, 1, 10, 12, 10, 9, 7, 1, 1, 12, 4, 0, 0, 0, 5, 2, 5,
2, 1, 2, 0, 1, 2, 5, 11, 0, 0, 2, 1, 0, 2, 0, 7, 1, 0, 0, 0,
0, 1, 0, 3, 1, 0, 1, 0, 0, 3, 10, 13, 1, 8, 4, 1, 0, 0, 1, 0,
23, 22, 11, 16, 16, 5, 5, 5, 3, 14, 2, 0, 0, 0, 1, 2, 0, 1, 2,
3, 1), kiloCalories = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 603, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 143, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -302L), class = c("tbl_df",
"tbl", "data.frame"))
I believe there may be a number of considerations on how you want to organize your data, depending on how you intend to analyze further. However, here are some ideas that may be helpful for you.
This solution uses tidyverse and fuzzyjoin as you tagged with dplyr - but you may want to consider a data.table or sqldf solution as alternatives, depending on size of data, speed needed, and other factors.
First, I would create a table that includes the meals based on kiloCalories values that are not missing. We will create a meal column and enumerate meals for each date. In addition, we can calculate your windows for preprandial and postprandial glucose levels.
library(tidyverse)
library(lubridate)   # for date(); only attached automatically by newer tidyverse versions
library(fuzzyjoin)
mealsData <- sampleData %>%
  filter(!is.na(kiloCalories)) %>%
  group_by(id, date = date(time)) %>%
  mutate(meal = 1:n(),
         preprandial_1 = time - (60 * 60 * 2) - (15 * 60),
         preprandial_2 = time - (60 * 60 * 2) + (15 * 60),
         postprandial_1 = time + (60 * 60 * 2) - (15 * 60),
         postprandial_2 = time + (60 * 60 * 2) + (15 * 60)) %>%
  select(-gl, -steps, -kiloCalories)
The result is this for mealsData:
id phase time date meal preprandial_1 preprandial_2 postprandial_1 postprandial_2
<dbl> <dbl> <dttm> <date> <int> <dttm> <dttm> <dttm> <dttm>
1 13 1 2015-12-23 12:00:00 2015-12-23 1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00
2 13 1 2015-12-23 13:30:00 2015-12-23 2 2015-12-23 11:15:00 2015-12-23 11:45:00 2015-12-23 15:15:00 2015-12-23 15:45:00
I have found such tables to be very useful as reference.
Next, you can merge this table with your sampleData. For task 1, you want preprandial first meal glucose levels. So, you can use fuzzy_join and ensure the times are between the calculated preprandial times determined.
fuzzy_inner_join(
  mealsData %>% filter(meal == 1),
  sampleData %>% filter(!is.na(gl)),
  by = c("id", "phase", "preprandial_1" = "time", "preprandial_2" = "time"),
  match_fun = c(`==`, `==`, `<=`, `>=`)
)
The result is:
id.x phase.x time.x date meal preprandial_1 preprandial_2 postprandial_1 postprandial_2 id.y phase.y time.y gl steps kiloCalories
<dbl> <dbl> <dttm> <date> <int> <dttm> <dttm> <dttm> <dttm> <dbl> <dbl> <dttm> <dbl> <dbl> <dbl>
1 13 1 2015-12-23 12:00:00 2015-12-23 1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00 13 1 2015-12-23 09:53:00 84 13 NA
2 13 1 2015-12-23 12:00:00 2015-12-23 1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00 13 1 2015-12-23 09:58:00 83 13 NA
3 13 1 2015-12-23 12:00:00 2015-12-23 1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00 13 1 2015-12-23 10:08:00 81 3 NA
It appears there are 3 glucose levels that fall within that window from the sample data.
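If you only want the single reading closest to the 2-hour mark rather than every reading inside the window, a possible follow-up sketch (slice_min() requires dplyr >= 1.0; the grouping columns are my assumption based on the join output above):
fuzzy_inner_join(
  mealsData %>% filter(meal == 1),
  sampleData %>% filter(!is.na(gl)),
  by = c("id", "phase", "preprandial_1" = "time", "preprandial_2" = "time"),
  match_fun = c(`==`, `==`, `<=`, `>=`)
) %>%
  group_by(id.x, date, meal) %>%
  # keep the glucose reading closest to exactly 2 hours before the meal
  slice_min(abs(as.numeric(difftime(time.y, time.x - 2 * 60 * 60, units = "mins"))), n = 1) %>%
  ungroup()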
Next, you can do something similar for postprandial data, for all meals:
fuzzy_inner_join(
  mealsData,
  sampleData %>% filter(!is.na(gl)),
  by = c("id", "phase", "postprandial_1" = "time", "postprandial_2" = "time"),
  match_fun = c(`==`, `==`, `<=`, `>=`)
)
The result is:
id.x phase.x time.x date meal preprandial_1 preprandial_2 postprandial_1 postprandial_2 id.y phase.y time.y gl steps kiloCalories
<dbl> <dbl> <dttm> <date> <int> <dttm> <dttm> <dttm> <dttm> <dbl> <dbl> <dttm> <dbl> <dbl> <dbl>
1 13 1 2015-12-23 12:00:00 2015-12-23 1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00 13 1 2015-12-23 13:54:00 134 0 NA
2 13 1 2015-12-23 12:00:00 2015-12-23 1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00 13 1 2015-12-23 14:09:00 139 1 NA
Here there are two glucose levels postprandial found.
Finally, you can merge the data.frames and then group by the id (id.x used since the join created a duplicate), the meal and the date. Then you can sum up the steps:
fuzzy_inner_join(
  mealsData,
  sampleData,
  by = c("id", "phase", "time" = "time", "postprandial_2" = "time"),
  match_fun = c(`==`, `==`, `<=`, `>=`)
) %>%
  group_by(id.x, meal, date) %>%
  summarise(step_sum = sum(steps))
The result is:
id.x meal date step_sum
<dbl> <int> <date> <dbl>
1 13 1 2015-12-23 876
2 13 2 2015-12-23 294
Edit 1: You might also try using data.table for a faster solution. Using setDT will make the data.frame a data.table:
library(data.table)
setDT(mealsData)
setDT(sampleData)
Then, you can do a nonequi join between your sampleData and mealsData. This statement includes which columns you want to include in the result, and merging based on times. The nomatch will leave out results where there is no match (for example, no post-prandial glucose levels for second meal).
sampleData[!is.na(gl)][
  mealsData,
  .(id, phase, gl, x.time),
  on = .(id, phase, time >= postprandial_1, time <= postprandial_2),
  nomatch = 0]
To get the sum of steps, you can try:
sampleData[mealsData,
           .(id, phase, meal, date, steps),
           on = .(id, phase, time >= time, time <= postprandial_2),
           nomatch = 0][
  , .(step_sum = sum(steps)),
  by = .(id, meal, date)]
The results should be the same as above.
Edit 2: You can merge both the second and third outcomes (average glucose and sum of steps). Make sure both have id, phase, meal and date to merge on. The first dt1 now includes the mean glucose and stores the associated meal. Store both dt1 and dt2 in intermediate data.tables:
dt1 <- sampleData[!is.na(gl)][
  mealsData,
  .(id, phase, gl, x.time, meal, date),
  on = .(id, phase, time >= postprandial_1, time <= postprandial_2),
  nomatch = 0][
  , .(gl_ave = mean(gl)),
  by = .(id, phase, meal, date)]
dt2 <- sampleData[mealsData,
                  .(id, phase, meal, date, steps),
                  on = .(id, phase, time >= time, time <= postprandial_2),
                  nomatch = 0][
  , .(step_sum = sum(steps)),
  by = .(id, phase, meal, date)]
Then merge:
merge(dt1, dt2, by = c("id", "phase", "meal", "date"))
Since the data frame sampleData is sorted and includes one observation per minute, this can be used to your advantage:
library(dplyr)
library(zoo)
1) Retrieve the rows with the first meal of the day of each participant (id), and their glucose reading 2 hours (+-15 minutes) before that meal:
sampleData$gl <- na.locf(sampleData$gl, na.rm = FALSE)
df1 <- sampleData %>%
  mutate(previousGl = lag(gl, 120), glTime = lag(time, 120)) %>%
  filter(!is.na(kiloCalories))
2) Retrieve the rows with each meal (i.e. each kiloCalories entry) of each participant (id), along with the glucose reading 2 hours (+-15 minutes) after the meal.
sampleData$gl <- na.locf(sampleData$gl, fromLast = TRUE, na.rm = FALSE)
df2 <- sampleData %>%
  mutate(glAfter = lead(gl, 120), glTime = lead(time, 120)) %>%
  filter(!is.na(kiloCalories))
3) From task 2, take the subset of data in between meal and glucose reading, and calculate the sum of steps within that time.
lapply(1:NROW(df2), function(i) {
  sampleData %>%
    filter(time >= df2$time[i],
           time <= df2$glTime[i]) %>%
    summarize(steps = sum(steps))
})
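The lapply() call above returns a list of one-row tibbles; if you would rather attach the sums to df2 directly, a small follow-up sketch (the column name step_sum is my own choice):
df2$step_sum <- sapply(seq_len(NROW(df2)), function(i) {
  # sum the steps between each meal and its matched glucose reading time
  sum(sampleData$steps[sampleData$time >= df2$time[i] &
                         sampleData$time <= df2$glTime[i]])
})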

Tried code in R with mutate_at and max() functions with own data. Warning messages come up: no non-missing arguments to max

I'm currently learning R with a book and was trying the mutate_at function from dplyr. In this example I want to standardize the survey items to a scale from 0 to 1. To do this, we can divide each value by the (theoretical) maximum value of the scale.
The book example stats_test from the package "pradadata" works perfectly fine:
data(stats_test, package = "pradadata")
stats_test %>%
  drop_na() %>%
  mutate_at(.vars = vars(study_time, self_eval, interest),
            .funs = funs(prop = ./max(.))) %>%
  select(contains("_prop"))
Output:
study_time_prop self_eval_prop interest_prop
<dbl> <dbl> <dbl>
1 0.6 0.7 0.667
2 0.8 0.8 0.833
3 0.6 0.4 0.167
4 0.8 0.7 0.833
5 0.4 0.6 0.5
6 0.4 0.6 0.667
7 0.8 0.6 0.5
8 0.2 0.7 0.667
9 0.6 0.8 0.833
10 0.6 0.7 0.833
# ... with 1,617 more rows
I tried the same code with my own data but it doesn't work and I can't figure out why. The variable RG04 from my data has a range of 1-5. I tried to transform the variable from numeric to integer, because the variables from the stats_test data are integers too:
df_literacy_2 <- transform(df_literacy, RG04 = as.integer(RG04))
df_literacy_2 <- tibble(df_literacy_2)
df_literacy_2 %>%
  drop_na() %>%
  mutate_at(.vars = vars(RG04),
            .funs = funs(prop = ./max(.))) %>%
  select(contains("_prop"))
Output:
# A tibble: 0 x 0
Warning messages:
1: Problem with `mutate()` input `prop`.
i no non-missing arguments to max; returning -Inf
i Input `prop` is `RG04/max(RG04)`.
2: In base::max(x, ..., na.rm = na.rm) :
no non-missing arguments to max; returning -Inf
str(df_literacy_2$RG04)
int [1:630] 2 4 2 1 2 2 1 3 1 3 ...
Why doesn't it work on my data?
Thank you for your help.
Edit with sample of df_literacy:
> dput(head(df_literacy,20))
structure(list(CASE = c(40, 41, 44, 45, 48, 49, 54, 55, 56, 57,
58, 61, 62, 63, 64, 65, 66, 67, 68, 69), SERIAL = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), REF = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), QUESTNNR = c("base", "base",
"base", "base", "base", "base", "base", "base", "base", "base",
"base", "base", "base", "base", "base", "base", "base", "base",
"base", "base"), MODE = c("interview", "interview", "interview",
"interview", "interview", "interview", "interview", "interview",
"interview", "interview", "interview", "interview", "interview",
"interview", "interview", "interview", "interview", "interview",
"interview", "interview"), STARTED = structure(c(1607290462,
1607290608, 1607291086, 1607291118, 1607291265, 1607291793, 1607294071,
1607294336, 1607294337, 1607294419, 1607294814, 1607296474, 1607301809,
1607329348, 1607333933, 1607335996, 1607336207, 1607336378, 1607343194,
1607343414), tzone = "UTC", class = c("POSIXct", "POSIXt")),
EI01 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Ja",
"Nein", "Nicht beantwortet"), class = "factor"), EI02 = c(2,
2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 3),
RF01 = c(4, 2, 4, 3, 4, 4, 1, 3, 2, 3, 4, 3, 2, 3, 2, 2,
4, 2, 5, 3), RF02 = c(1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1,
1, 1, 2, 2, 2, 2, 2, 2), RF03 = c(1, 2, 2, 2, 1, 2, 1, 1,
1, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2), RG01 = c(2, 2, 2, 2,
2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2), RG02 = c(3,
3, 3, 3, 4, 3, 4, 2, 4, 2, 3, 4, 4, 2, 4, 3, 4, 3, 4, 4),
RG03 = c(3, 2, 2, 3, 3, 3, 1, 3, 1, 2, 3, 1, 2, 2, 1, 3,
2, 3, 2, 2), RG04 = c(2, 4, 2, 1, 2, 2, 1, 3, 1, 3, 2, 4,
1, 1, 1, 1, 1, 2, 4, 1), RG05 = c(1, 1, 1, 1, 1, 1, 1, 2,
1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1), SD01 = structure(c(2L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L), .Label = c("weiblich", "männlich", "divers",
"nicht beantwortet"), class = "factor"), SD03 = c(4, 3, 2,
2, 1, 2, 4, 4, 1, 4, 3, 1, 2, 3, 2, 4, 2, 3, 1, 3), SD05_01 = c(23,
22, 22, 21, 18, 22, 21, 27, 17, 22, 17, 21, 21, 22, 50, 25,
23, 20, 23, 23), TIME001 = c(2, 3, 23, 73, 29, 2, 3, 3, 29, 7,
50, 55, 3, 2, 10, 2, 1, 5, 7, 35), TIME002 = c(2, 2, 16,
34, 12, 14, 2, 2, 21, 2, 30, 24, 21, 3, 3, 2, 3, 2, 3, 22
), TIME003 = c(34, 8, 12, 15, 13, 12, 12, 7, 13, 11, 16,
10, 11, 16, 8, 8, 7, 8, 11, 14), TIME004 = c(60, 33, 25,
31, 45, 25, 14, 13, 38, 35, 50, 50, 37, 32, 32, 25, 72, 55,
28, 29), TIME005 = c(84, 21, 29, 41, 54, 33, 30, 22, 32,
42, 44, 23, 65, 30, 28, 32, 51, 31, 27, 44), TIME006 = c(14,
9, 27, 11, 24, 8, 8, 9, 18, 12, 35, 33, 27, 46, 11, 15, 8,
14, 12, 14), TIME007 = c(3, 18, 3, 5, 6, 2, 9, 2, 3, 3, 6,
7, 3, 13, 4, 4, 378, 3, 4, 10), TIME_SUM = c(199, 94, 135,
142, 183, 96, 78, 58, 154, 112, 186, 152, 167, 142, 96, 88,
146, 118, 92, 168), MAILSENT = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
LASTDATA = structure(c(1607290661, 1607290702, 1607291221,
1607291328, 1607291448, 1607291889, 1607294149, 1607294394,
1607294491, 1607294531, 1607295045, 1607296676, 1607301976,
1607329490, 1607334030, 1607336084, 1607336727, 1607336496,
1607343286, 1607343582), tzone = "UTC", class = c("POSIXct",
"POSIXt")), FINISHED = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1), Q_VIEWER = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), LASTPAGE = c(7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7),
MAXPAGE = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7), MISSING = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 0, 7, 7, 7), MISSREL = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1), TIME_RSI = c("46023",
"14246", "0.75", "0.63", "0.54", "12055", "17533", "30682",
"0.7", "44197", "0.45", "0.58", "0.83", "44378", "44501",
"18629", "46753", "46388", "44197", "0.57"), DEG_TIME = c(27,
27, 3, 1, 0, 23, 30, 42, 2, 17, 0, 2, 7, 18, 10, 27, 43,
18, 8, 0)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Edit with TRUE/FALSE counts of NAs per column:
> sapply(df_literacy, function(a) table(c(T,F,is.na(a)))-1)
CASE SERIAL REF QUESTNNR MODE STARTED EI01 EI02 RF01 RF02 RF03 RG01 RG02 RG03 RG04 RG05 SD01 SD03 SD05_01 TE03_01 TIME001 TIME002 TIME003
FALSE 630 0 0 630 630 630 630 630 630 630 630 630 630 630 630 630 629 629 615 99 630 630 630
TRUE 0 630 630 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 15 531 0 0 0
TIME004 TIME005 TIME006 TIME007 TIME_SUM MAILSENT LASTDATA FINISHED Q_VIEWER LASTPAGE MAXPAGE MISSING MISSREL TIME_RSI DEG_TIME
FALSE 630 630 629 625 630 0 630 630 630 630 630 630 630 630 630
TRUE 0 0 1 5 0 630 0 0 0 0 0 0 0 0 0
There are a few things to correct here.
drop_na() is removing all of your data.
drop_na(df_literacy)
# # A tibble: 0 x 37
# # ... with 37 variables: CASE <dbl>, SERIAL <lgl>, REF <lgl>, QUESTNNR <chr>,
# # MODE <chr>, STARTED <dttm>, EI01 <fct>, EI02 <dbl>, RF01 <dbl>, RF02 <dbl>,
# # RF03 <dbl>, RG01 <dbl>, RG02 <dbl>, RG03 <dbl>, RG04 <dbl>, RG05 <dbl>,
# # SD01 <fct>, SD03 <dbl>, SD05_01 <dbl>, TIME001 <dbl>, TIME002 <dbl>,
# # TIME003 <dbl>, TIME004 <dbl>, TIME005 <dbl>, TIME006 <dbl>, TIME007 <dbl>,
# # TIME_SUM <dbl>, MAILSENT <lgl>, LASTDATA <dttm>, FINISHED <dbl>,
# # Q_VIEWER <dbl>, LASTPAGE <dbl>, MAXPAGE <dbl>, MISSING <dbl>,
# # MISSREL <dbl>, TIME_RSI <chr>, DEG_TIME <dbl>
The problem is that you have several columns that are completely NA, namely SERIAL, REF, and MAILSENT.
sapply(df_literacy, function(a) table(c(T,F,is.na(a)))-1)
# CASE SERIAL REF QUESTNNR MODE STARTED EI01 EI02 RF01 RF02 RF03 RG01 RG02
# FALSE 20 0 0 20 20 20 20 20 20 20 20 20 20
# TRUE 0 20 20 0 0 0 0 0 0 0 0 0 0
# RG03 RG04 RG05 SD01 SD03 SD05_01 TIME001 TIME002 TIME003 TIME004 TIME005
# FALSE 20 20 20 20 20 20 20 20 20 20 20
# TRUE 0 0 0 0 0 0 0 0 0 0 0
# TIME006 TIME007 TIME_SUM MAILSENT LASTDATA FINISHED Q_VIEWER LASTPAGE
# FALSE 20 20 20 0 20 20 20 20
# TRUE 0 0 0 20 0 0 0 0
# MAXPAGE MISSING MISSREL TIME_RSI DEG_TIME
# FALSE 20 20 20 20 20
# TRUE 0 0 0 0 0
Drop the drop_na(), or at least drop_na(-SERIAL, -REF, -MAILSENT).
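Alternatively, the all-NA columns can be dropped programmatically before calling drop_na(); a sketch using where(), available in dplyr >= 1.0:
library(dplyr)
library(tidyr)

df_literacy %>%
  select(where(~ !all(is.na(.x)))) %>%  # drop columns that are entirely NA
  drop_na()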
Your code is using funs, which has been deprecated since dplyr-0.8.0.
# Warning: `funs()` is deprecated as of dplyr 0.8.0.
# Please use a list of either functions or lambdas:
# # Simple named list:
# list(mean = mean, median = median)
# # Auto named with `tibble::lst()`:
# tibble::lst(mean, median)
# # Using lambdas
# list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
While this isn't causing an error, it is causing a warning (and will likely stop working at some point). Change your mutate_at to be:
mutate_at(.vars = vars(RG04, RF02),
          .funs = list(prop = ~ . / max(.)))
You are using a single variable within .vars and a single function within .funs, so the column names are preserved as-is (and you will not see a _prop column). From ?mutate_at:
The names of the new columns are derived from the names of the
input variables and the names of the functions.
• if there is only one unnamed function (i.e. if '.funs' is an
unnamed list of length one), the names of the input variables
are used to name the new columns;
• for _at functions, if there is only one unnamed variable
(i.e., if '.vars' is of the form 'vars(a_single_column)') and
'.funs' has length greater than one, the names of the
functions are used to name the new columns;
• otherwise, the new names are created by concatenating the
names of the input variables and the names of the functions,
separated with an underscore '"_"'.
If you aren't going to add more variables and functions, then you need to self-name it in the call, as in mutate_at(.vars = vars(RG04 = RG04), ...). Oddly enough, this causes it to produce RG04_prop.
If we fix all of those, then it works.
df_literacy %>%
  drop_na(-SERIAL, -REF, -MAILSENT) %>%
  mutate_at(.vars = vars(RG04 = RG04),
            .funs = list(prop = ~ ./max(.))) %>%
  select(contains("_prop")) %>%
  head(3)
# A tibble: 3 x 1
# RG04_prop
# <dbl>
# 1 0.5
# 2 1
# 3 0.5
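As a side note, mutate_at() itself is superseded in current dplyr; the same computation can be written with across(), which also sidesteps the NA issue via na.rm = TRUE. A sketch (the .names pattern produces the _prop suffix):
library(dplyr)

df_literacy %>%
  mutate(across(c(RG04, RF02),
                ~ .x / max(.x, na.rm = TRUE),
                .names = "{.col}_prop")) %>%
  select(ends_with("_prop"))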

dplyr::starts_with and ends_with not subsetting based on arguments

I want to select a number of variables based on their names to transform them. The variable names all start with inq and end with 7, 8, 10, or 13:15. This is not working for me... Apologies if this is obvious, but I cannot get it to work. Am I using the wrong functions, putting my functions and arguments together wrong, or something else?
A reproducible example:
structure(list(inq1_1 = c(NA, 7, 5, 1, 1, 6, 5, 2, NA, NA), inq1_2 = c(NA,
7, 5, 1, 1, 6, 5, 5, NA, NA), inq1_3 = c(NA, 6, 4, 2, 1, 5, 2,
1, NA, NA), inq1_4 = c(NA, 6, 6, 1, 1, 6, 5, 1, NA, NA), inq1_5 = c(NA,
7, 3, 1, 1, 6, 2, 1, NA, NA), inq1_6 = c(NA, 7, 4, 4, 2, 7, 2,
4, NA, NA), inq1_7 = c(NA, 2, 4, 6, 7, 3, 1, 7, NA, NA), inq1_8 = c(NA,
1, NA, 2, 7, 2, 1, 4, NA, NA), inq1_9 = c(NA, 4, 6, 3, 1, 3,
7, 1, NA, NA), inq1_10 = c(NA, 3, 5, 7, 4, 4, 2, 7, NA, NA),
inq1_11 = c(NA, 5, 4, 7, 1, 6, 7, 6, NA, NA), inq1_12 = c(NA,
7, 5, 7, 4, 6, 7, 2, NA, NA), inq1_13 = c(NA, 3, 4, 6, 4,
3, 4, 4, NA, NA), inq1_14 = c(NA, 3, 2, 4, 4, 2, 1, 4, NA,
NA), inq1_15 = c(NA, 2, 2, 3, 5, 2, 4, 4, NA, NA), inqfinal_1 = c(5,
NA, 3, NA, NA, NA, NA, NA, NA, NA), inqfinal_2 = c(5, NA,
3, NA, NA, NA, NA, NA, NA, NA), inqfinal_3 = c(6, NA, 3,
NA, NA, NA, NA, NA, NA, NA), inqfinal_4 = c(5, NA, 3, NA,
NA, NA, NA, NA, NA, NA), inqfinal_5 = c(5, NA, 3, NA, NA,
NA, NA, NA, NA, NA), inqfinal_6 = c(6, NA, 3, NA, NA, NA,
NA, NA, NA, NA), inqfinal_7 = c(4, NA, 3, NA, NA, NA, NA,
NA, NA, NA), inqfinal_8 = c(2, NA, 3, NA, NA, NA, NA, NA,
NA, NA), inqfinal_9 = c(5, NA, 3, NA, NA, NA, NA, NA, NA,
NA), inqfinal_10 = c(4, NA, 3, NA, NA, NA, NA, NA, NA, NA
), inqfinal_11 = c(6, NA, 4, NA, NA, NA, NA, NA, NA, NA),
inqfinal_12 = c(6, NA, 4, NA, NA, NA, NA, NA, NA, NA), inqfinal_13 = c(4,
NA, 3, NA, NA, NA, NA, NA, NA, NA), inqfinal_14 = c(2, NA,
2, NA, NA, NA, NA, NA, NA, NA), inqfinal_15 = c(2, NA, 2,
NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I am trying to adopt the tidyverse and utilise dplyr, as per the code below:
# select specific columns
sf_df %>%
  select(starts_with("inq"),
         ends_with(7, 8, 10, 13:15)) %>%
  view(title = "test")
Alas, I get the following error:
Error in ends_with(7, 8, 10, 13:15) : unused argument (13:15)
14. .f(.x[[i]], ...)
13. map(.x[sel], .f, ...)
12. map_if(ind_list, is_helper, eval_tidy)
11. vars_select_eval(.vars, quos)
10. tidyselect::vars_select(names(.data), !!!quos(...))
9. select.data.frame(., starts_with("inq"), ends_with(7, 8, 10, 13:15))
8. select(., starts_with("inq"), ends_with(7, 8, 10, 13:15))
7. function_list[[i]](value)
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. sf_df %>% select(starts_with("inq"), ends_with(7, 8, 10, 13:15)) %>% view(title = "test")
Any help would be greatly appreciated! Thank you in advance.
Cheers,
Atanas.
A better option would be matches() to match a regex pattern against the column names. Here, it matches the pattern 'inq' at the beginning (^) of the column name and the listed numbers at the end ($) of the column name:
sf_df %>%
  select(matches('^inq.*(7|8|10|13|14|15)$'))
# A tibble: 10 x 12
# inq1_7 inq1_8 inq1_10 inq1_13 inq1_14 inq1_15 inqfinal_7 inqfinal_8 inqfinal_10 inqfinal_13 inqfinal_14 inqfinal_15
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA 4 2 4 4 2 2
# 2 2 1 3 3 3 2 NA NA NA NA NA NA
# 3 4 NA 5 4 2 2 3 3 3 3 2 2
# 4 6 2 7 6 4 3 NA NA NA NA NA NA
# 5 7 7 4 4 4 5 NA NA NA NA NA NA
# 6 3 2 4 3 2 2 NA NA NA NA NA NA
# 7 1 1 2 4 1 4 NA NA NA NA NA NA
# 8 7 4 7 4 4 4 NA NA NA NA NA NA
# 9 NA NA NA NA NA NA NA NA NA NA NA NA
#10 NA NA NA NA NA NA NA NA NA NA NA NA
Note that by using both starts_with and ends_with, the result may not be the one you expect. The OP's dataset has 30 columns, and all of the column names start with 'inq'. So starts_with returns all columns, and adding ends_with checks an OR (union) match, e.g.
sf_df %>%
  select(starts_with("inq"), ends_with("5")) %>%
  ncol
#[1] 30 # returns 30 columns
It is not removing the columns whose names have no match for 5 at the end of the string.
Nor is this a matter of the order of the arguments, as
sf_df %>%
  select(ends_with("5"), starts_with("inq")) %>%
  ncol
#[1] 30
Now, if we use only ends_with
sf_df %>%
  select(ends_with("5")) %>%
  ncol
#[1] 4
Based on the example, all columns start with 'inq', so ends_with alone would be sufficient for a single string match, as the documentation for ?ends_with specifies
match - A string.
and not multiple strings
where the Usage is
starts_with(match, ignore.case = TRUE, vars = peek_vars())
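In more recent tidyselect/dplyr releases, match may in fact be a character vector, and selection helpers can be combined with & to take an intersection, so something like the sketch below should work there (an assumption about the installed versions):
library(dplyr)

sf_df %>%
  select(starts_with("inq") &
           ends_with(c("_7", "_8", "_10", "_13", "_14", "_15")))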
