How to find average time points difference in longitudinal data - r

I have longitudinal data of body weights for over 100K participants. The time points of the weight measurements differ between participants. What I want to know is the average time difference between the 1st and 2nd measurements, between the 2nd and 3rd measurements, and so on. I also want to know the number (or percentage) of people who have 3 body weight measurements, and likewise for 4, 5, 6, 7, 8, etc. How can I find these answers in R?

Perhaps something like this:
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
# generate some sample data
dates <- seq(as.Date("2000-01-01"), by = "day", length.out = 500)
sample_data <- tibble(
  participant_id = sample(1:1000, size = 5000, replace = TRUE),
  meas_date = sample(dates, size = 5000, replace = TRUE)) %>%
  arrange(participant_id, meas_date)
sample_data
#> # A tibble: 5,000 × 2
#>    participant_id meas_date
#>             <int> <date>
#>  1              1 2000-01-18
#>  2              1 2000-02-28
#>  3              1 2000-05-15
#>  4              1 2001-02-01
#>  5              2 2000-05-11
#>  6              3 2000-01-22
#>  7              3 2000-03-27
#>  8              3 2000-04-17
#>  9              3 2000-09-23
#> 10              3 2000-12-13
#> # … with 4,990 more rows
# periods between each measurement for each participant
meas_periods <- sample_data %>%
  group_by(participant_id) %>%
  mutate(meas_n = row_number(),
         date_diff = meas_date - lag(meas_date)) %>%
  ungroup()
meas_periods
#> # A tibble: 5,000 × 4
#>    participant_id meas_date  meas_n date_diff
#>             <int> <date>      <int> <drtn>
#>  1              1 2000-01-18      1  NA days
#>  2              1 2000-02-28      2  41 days
#>  3              1 2000-05-15      3  77 days
#>  4              1 2001-02-01      4 262 days
#>  5              2 2000-05-11      1  NA days
#>  6              3 2000-01-22      1  NA days
#>  7              3 2000-03-27      2  65 days
#>  8              3 2000-04-17      3  21 days
#>  9              3 2000-09-23      4 159 days
#> 10              3 2000-12-13      5  81 days
#> # … with 4,990 more rows
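The key step is lag() inside the grouped mutate(): within each participant, lag(meas_date) is the previous visit's date, so subtracting it yields the gap between consecutive measurements (and NA for the first visit, which has no predecessor). A minimal illustration on a single participant:
tibble(meas_date = as.Date(c("2000-01-01", "2000-01-10", "2000-02-01"))) %>%
  mutate(date_diff = meas_date - lag(meas_date))
#> date_diff: NA days, 9 days, 22 days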
# average period between meas_n-1 and meas_n
meas_periods %>%
  group_by(meas_n) %>%
  summarise(mean_duration = mean(date_diff))
#> # A tibble: 13 × 2
#>    meas_n mean_duration
#>     <int> <drtn>
#>  1      1       NA days
#>  2      2 88.54102 days
#>  3      3 86.16762 days
#>  4      4 76.21154 days
#>  5      5 69.11392 days
#>  6      6 67.16798 days
#>  7      7 50.67089 days
#>  8      8 50.91111 days
#>  9      9 49.89873 days
#> 10     10 48.70588 days
#> 11     11 51.00000 days
#> 12     12 26.25000 days
#> 13     13 66.00000 days
# number and percentage of participants gone through meas_n measurements
meas_periods %>%
  count(meas_n, name = "participant_n") %>%
  mutate(percent = participant_n/max(participant_n))
#> # A tibble: 13 × 3
#>    meas_n participant_n percent
#>     <int>         <int>   <dbl>
#>  1      1           996 1
#>  2      2           963 0.967
#>  3      3           877 0.881
#>  4      4           728 0.731
#>  5      5           553 0.555
#>  6      6           381 0.383
#>  7      7           237 0.238
#>  8      8           135 0.136
#>  9      9            79 0.0793
#> 10     10            34 0.0341
#> 11     11            12 0.0120
#> 12     12             4 0.00402
#> 13     13             1 0.00100
Created on 2022-11-02 with reprex v2.0.2
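Note that participant_n above counts participants who reached at least meas_n measurements. If you instead want the number and share of participants with exactly k measurements, a minimal sketch on the same sample_data:
sample_data %>%
  count(participant_id, name = "n_meas") %>%
  count(n_meas, name = "participant_n") %>%
  mutate(percent = participant_n / sum(participant_n))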

Related

R: Random Sampling of Longitudinal Data

I have the following dataset in R (e.g. the same students take an exam each year and their results are recorded):
student_id = c(1,1,1,1,1, 2,2,2, 3,3,3,3)
exam_number = c(1,2,3,4,5,1,2,3,1,2,3,4)
exam_result = rnorm(12, 80,10)
my_data = data.frame(student_id, exam_number, exam_result)
  student_id exam_number exam_result
1          1           1    72.79595
2          1           2    81.12950
3          1           3    93.29906
4          1           4    79.33229
5          1           5    76.64106
6          2           1    95.14271
Suppose I take a random sample from this data:
library(dplyr)
random_sample = sample_n(my_data, 5, replace = TRUE)
  student_id exam_number exam_result
1          3           1    76.19691
2          3           3    87.52431
3          2           2    91.89661
4          2           3    80.05088
5          2           2    91.89661
Now, I can take the highest "exam_number" per student from this random sample:
max_value = random_sample %>%
  group_by(student_id) %>%
  summarize(max = max(exam_number))
# A tibble: 2 x 2
  student_id   max
       <dbl> <dbl>
1          2     3
2          3     3
Based on these results - I want to accomplish the following. For the students that were selected in "random_sample":
Create a dataset that contains all rows occurring AFTER the "max exam number" (e.g. call this dataset "data_after")
Create a dataset that contains all rows occurring BEFORE (and equal to) the "max exam number" (e.g. call this dataset "data_before")
In the example I have created, this would look something like this:
# after
  student_id exam_number exam_result
1          3           4    105.5805
# before
  student_id exam_number exam_result
1          2           1    95.14000
2          2           2    91.89000
3          2           3    80.05000
4          3           1    76.19691
5          3           2   102.00875
6          3           3    87.52431
Currently, I am trying to do this in a very indirect way using JOINS and ANTI_JOINS:
max_3 = as.numeric(max_value[2,2])
max_s3 = max_3 - 1
student_3 = seq(1, max_s3, by = 1)
before_student_3 = my_data[is.element(my_data$exam_number, student_3) & my_data$student_id == 3,]
remainder_student_3 = my_data[my_data$student_id == 3,]
after_student_3 = anti_join(remainder_student_3, before_student_3)
But I don't think I am doing this correctly - can someone please show me how to do this?
Thanks!
The code below also uses a join, as suggested in the question. The wanted data sets are then created by filtering the join result.
student_id = c(1,1,1,1,1, 2,2,2, 3,3,3,3)
exam_number = c(1,2,3,4,5,1,2,3,1,2,3,4)
exam_result = rnorm(12, 80,10)
my_data = data.frame(student_id, exam_number, exam_result)
suppressPackageStartupMessages({
  library(dplyr)
})
set.seed(2022)
(random_sample = sample_n(my_data, 5, replace = TRUE))
#>   student_id exam_number exam_result
#> 1          1           4    73.97148
#> 2          1           3    84.77151
#> 3          2           2    78.76927
#> 4          3           3    69.35063
#> 5          1           4    73.97148
max_value = random_sample %>%
  group_by(student_id) %>%
  summarize(max = max(exam_number))
# join only once
max_value %>%
  left_join(my_data, by = "student_id") -> join_data
join_data
#> # A tibble: 12 × 4
#>    student_id   max exam_number exam_result
#>         <dbl> <dbl>       <dbl>       <dbl>
#>  1          1     4           1        71.0
#>  2          1     4           2        69.1
#>  3          1     4           3        84.8
#>  4          1     4           4        74.0
#>  5          1     4           5        80.7
#>  6          2     2           1        77.4
#>  7          2     2           2        78.8
#>  8          2     2           3        69.5
#>  9          3     3           1        83.9
#> 10          3     3           2        62.7
#> 11          3     3           3        69.4
#> 12          3     3           4       102.
data_before <- join_data %>%
  group_by(student_id) %>%
  filter(exam_number <= max) %>%
  ungroup() %>%
  select(-max)
data_after <- join_data %>%
  group_by(student_id) %>%
  filter(exam_number > max) %>%
  ungroup() %>%
  select(-max)
data_before
#> # A tibble: 9 × 3
#>   student_id exam_number exam_result
#>        <dbl>       <dbl>       <dbl>
#> 1          1           1        71.0
#> 2          1           2        69.1
#> 3          1           3        84.8
#> 4          1           4        74.0
#> 5          2           1        77.4
#> 6          2           2        78.8
#> 7          3           1        83.9
#> 8          3           2        62.7
#> 9          3           3        69.4
data_after
#> # A tibble: 3 × 3
#>   student_id exam_number exam_result
#>        <dbl>       <dbl>       <dbl>
#> 1          1           5        80.7
#> 2          2           3        69.5
#> 3          3           4       102.
# final clean-up
rm(join_data)
Created on 2022-12-10 with reprex v2.0.2
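As a side note: with dplyr 1.1.0 or later, the join and the filtering can be collapsed into one step with an inequality join via join_by(). A sketch, assuming the same max_value as above:
# requires dplyr >= 1.1.0
data_before <- my_data %>%
  inner_join(max_value, by = join_by(student_id, exam_number <= max)) %>%
  select(-max)
data_after <- my_data %>%
  inner_join(max_value, by = join_by(student_id, exam_number > max)) %>%
  select(-max)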

How do I find three or more consecutive dates with the same value in R?

I'm trying to find periods of 3 days or more where the values are the same. As an example, if January 1st, 2nd, and 3rd all have a value of 2, they should be included - but if January 2nd has a value of 3, then none of them should be.
I've tried a few ways so far but no luck! Any help would be greatly appreciated!
Reprex:
library("dplyr")
#Goal: include all values with values of 2 or less for 5 consecutive days and allow for a "cushion" period of values of 2 to 5 for up to 3 days
data <- data.frame(
  Date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", "2000-01-05",
           "2000-01-06", "2000-01-07", "2000-01-08", "2000-01-09", "2000-01-10",
           "2000-01-11", "2000-01-12", "2000-01-13", "2000-01-14", "2000-01-15",
           "2000-01-16", "2000-01-17", "2000-01-18", "2000-01-19", "2000-01-20",
           "2000-01-21", "2000-01-22", "2000-01-23", "2000-01-24", "2000-01-25",
           "2000-01-26", "2000-01-27", "2000-01-28", "2000-01-29", "2000-01-30"),
  Value = c(2,2,2,5,2,2,1,0,1,8,7,7,7,5,2,3,4,5,7,2,6,6,6,6,2,0,3,4,0,1))
head(data)
#Goal: values should include dates from 2000-01-01 to 2000-01-03, 2000-01-11 to 2000-01-13, and 2000-01-21 to 2000-01-24
#My attempt so far but it doesn't work
attempt1 <- data %>%
  group_by(group_id = as.integer(gl(n(), 3, n()))) %>% # 3-day chunks
  filter(Value == Value) %>% # looking for the values being the same in between, but this doesn't work for that
  ungroup() %>%
  select(-group_id)
head(attempt1)
With rle:
rl <- rle(data$Value)
data[rep(rl$lengths>=3,rl$lengths),]
         Date Value
1  2000-01-01     2
2  2000-01-02     2
3  2000-01-03     2
11 2000-01-11     7
12 2000-01-12     7
13 2000-01-13     7
21 2000-01-21     6
22 2000-01-22     6
23 2000-01-23     6
24 2000-01-24     6
or with dplyr:
library(dplyr)
data %>% filter(rep(rle(Value)$lengths >= 3, rle(Value)$lengths))
         Date Value
1  2000-01-01     2
2  2000-01-02     2
3  2000-01-03     2
4  2000-01-11     7
5  2000-01-12     7
6  2000-01-13     7
7  2000-01-21     6
8  2000-01-22     6
9  2000-01-23     6
10 2000-01-24     6
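To see why this works: rle() compresses the vector into runs (a lengths and a values component), and rep() expands a per-run keep/drop flag back into one flag per row:
rl <- rle(c(2, 2, 2, 5, 1, 1))
rl$lengths                        #> 3 1 2
rl$values                         #> 2 5 1
rep(rl$lengths >= 3, rl$lengths)  #> TRUE TRUE TRUE FALSE FALSE FALSE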
You can create a temporary variable using rleid from the data.table package.
data %>%
  group_by(data.table::rleid(Value)) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(Date, Value)
#> # A tibble: 10 x 2
#>    Date       Value
#>    <chr>      <dbl>
#>  1 2000-01-01     2
#>  2 2000-01-02     2
#>  3 2000-01-03     2
#>  4 2000-01-11     7
#>  5 2000-01-12     7
#>  6 2000-01-13     7
#>  7 2000-01-21     6
#>  8 2000-01-22     6
#>  9 2000-01-23     6
#> 10 2000-01-24     6
Or, if you want to avoid using another package, you could equivalently do
data %>%
  group_by(temp = cumsum(c(1, diff(Value) != 0))) %>%
  filter(n() > 2) %>%
  ungroup() %>%
  select(-temp)
#> # A tibble: 10 x 2
#>    Date       Value
#>    <chr>      <dbl>
#>  1 2000-01-01     2
#>  2 2000-01-02     2
#>  3 2000-01-03     2
#>  4 2000-01-11     7
#>  5 2000-01-12     7
#>  6 2000-01-13     7
#>  7 2000-01-21     6
#>  8 2000-01-22     6
#>  9 2000-01-23     6
#> 10 2000-01-24     6
Created on 2022-09-12 with reprex v2.0.2
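The cumsum(c(1, diff(Value) != 0)) expression builds its own run identifier without an extra package: the counter increments whenever the value changes. A minimal illustration:
v <- c(2, 2, 2, 5, 5, 1)
cumsum(c(1, diff(v) != 0))
#> [1] 1 1 1 2 2 3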

How can I add a row vector to a tibble in R?

I have a tibble in R with 11 observations, one for each month except June, which should have a value of 0.
My data frame (tibble) looks like this:
library(tidyverse)
A = c(1,2,3,4,5,7,8,9,10,11,12)
B = rnorm(11,0,1)
Data = tibble(A,B);Data
But I want to add the 0 observation for June to this time series.
Something like:
d = c(6,0);d
newdata = rbind(Data,d)
order(newdata$A)
but the 12 (December) appears. Any help?
Two approaches:
(1) We can use add_row for this. However, d must be named, and we need to splice it into add_row with the triple-bang operator !!!. Then we can arrange the data so that the months are sorted from 1 to 12. Of course you can specify add_row directly, as in @Chris's answer, without the need for an external vector.
library(dplyr)
A = c(1,2,3,4,5,7,8,9,10,11,12)
B = rnorm(11,0,1)
Data = tibble(A,B)
d = c(A = 6, B = 0)
newdata <- Data %>%
  add_row(!!!d) %>%
  arrange(A)
# check
newdata
#> # A tibble: 12 x 2
#>        A        B
#>    <dbl>    <dbl>
#>  1     1  1.22
#>  2     2  0.0729
#>  3     3  0.597
#>  4     4 -1.26
#>  5     5  0.928
#>  6     6  0
#>  7     7 -1.08
#>  8     8  0.704
#>  9     9 -0.119
#> 10    10 -0.462
#> 11    11 -0.00388
#> 12    12  1.56
order(newdata$A)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12
(2) We can use tidyr::complete, as suggested by @Ronak in the comments, although we use a slightly different specification with full_seq:
library(tidyr)
Data %>%
  complete(A = full_seq(A, 1), fill = list(B = 0))
#> # A tibble: 12 x 2
#>        A      B
#>    <dbl>  <dbl>
#>  1     1 -0.258
#>  2     2 -1.18
#>  3     3 -0.165
#>  4     4  0.775
#>  5     5  0.926
#>  6     6  0
#>  7     7  0.343
#>  8     8  1.10
#>  9     9  0.359
#> 10    10  0.934
#> 11    11 -0.444
#> 12    12  0.184
Created on 2021-09-21 by the reprex package (v2.0.1)
You can define the additional row in add_row:
library(dplyr)
Data %>%
  add_row(A = 6, B = 0) %>%
  arrange(A)
# A tibble: 12 x 2
       A       B
   <dbl>   <dbl>
 1     1 -0.547
 2     2 -0.564
 3     3 -0.890
 4     4 -0.477
 5     5 -0.998
 6     6  0
 7     7 -0.776
 8     8  0.0645
 9     9  0.959
10    10 -0.110
11    11 -0.511
12    12 -0.911
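For completeness, the rbind() attempt in the question was nearly there; the catch is that order() only returns a permutation of row indices, so it has to be used to index the rows (or be replaced by arrange(), as above). A base-R sketch with the d from the question:
newdata <- rbind(Data, d)
newdata[order(newdata$A), ]  # use the permutation to reorder the rows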

Sample within a group multiple times in R using dplyr

I am trying to pick samples within each group:
df <- data.frame(ID=c(1,1,1,2,2,2), score=c(10,20,30,40,50,60))
  ID score
1  1    10
2  1    20
3  1    30
4  2    40
5  2    50
6  2    60
df %>% group_by(ID) %>% sample_n(2)
  ID score
1  1    20
2  1    30
3  2    50
4  2    40
But I want to do it n multiple times for each ID, for example 2 times to get something like this:
  ID score sample_num
1  1    20          1
2  1    30          1
3  1    20          2
4  1    10          2
5  2    50          1
6  2    40          1
7  2    60          2
8  2    40          2
Each sample set should be done without replacement.
Is there a way to do this in dplyr? The long way I can think of is to do a for loop, create a df each iteration and then combine all the dfs together at the end.
If you have to do it N times, do this:
create a variable N for the number of repetitions;
map_dfr will iterate over its first argument, i.e. seq_len(N), do what you were doing manually, and mutate one more variable that stores the respective value of seq_len(N) (i.e. .x in the lambda formula) for each iteration;
the final results are compiled into a single data frame because we use the map_dfr variant of map.
df <- data.frame(ID=c(1,1,1,2,2,2), score=c(10,20,30,40,50,60))
library(tidyverse)
N <- 7
map_dfr(seq_len(N), ~ df %>%
          group_by(ID) %>%
          sample_n(2) %>%
          mutate(sample_no = .x))
#> # A tibble: 28 x 3
#> # Groups:   ID [2]
#>       ID score sample_no
#>    <dbl> <dbl>     <int>
#>  1     1    20         1
#>  2     1    10         1
#>  3     2    60         1
#>  4     2    50         1
#>  5     1    30         2
#>  6     1    10         2
#>  7     2    60         2
#>  8     2    40         2
#>  9     1    10         3
#> 10     1    20         3
#> # ... with 18 more rows
Created on 2021-06-11 by the reprex package (v2.0.0)
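A small modernization note: current dplyr supersedes sample_n() with slice_sample(), so the same idea can be written as the following sketch:
map_dfr(seq_len(N), ~ df %>%
          group_by(ID) %>%
          slice_sample(n = 2) %>%
          mutate(sample_no = .x))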
library(tidyverse)
df <- data.frame(ID=c(1,1,1,2,2,2), score=c(10,20,30,40,50,60))
set.seed(123)
#option 1
rerun(2, df %>% group_by(ID) %>% sample_n(2, replace = FALSE)) %>%
  map2(1:length(.), ~ mutate(.x, sample_n = .y)) %>%
  reduce(bind_rows) %>%
  arrange(ID)
#> # A tibble: 8 x 3
#> # Groups:   ID [2]
#>      ID score sample_n
#>   <dbl> <dbl>    <int>
#> 1     1    30        1
#> 2     1    10        1
#> 3     1    30        2
#> 4     1    20        2
#> 5     2    60        1
#> 6     2    50        1
#> 7     2    50        2
#> 8     2    60        2
#option 2
map(1:2, ~ df %>%
      group_by(ID) %>%
      sample_n(2, replace = FALSE) %>%
      mutate(sample_num = .x)) %>%
  reduce(bind_rows) %>%
  arrange(ID)
#> # A tibble: 8 x 3
#> # Groups:   ID [2]
#>      ID score sample_num
#>   <dbl> <dbl>      <int>
#> 1     1    30          1
#> 2     1    10          1
#> 3     1    10          2
#> 4     1    20          2
#> 5     2    50          1
#> 6     2    60          1
#> 7     2    60          2
#> 8     2    50          2
Created on 2021-06-11 by the reprex package (v2.0.0)
library(tidyverse)
set.seed(1)
n_repeat <- 2
n_sample <- 2
df <- data.frame(ID=c(1,1,1,2,2,2), score=c(10,20,30,40,50,60))
df %>%
  group_nest(ID) %>%
  transmute(ID,
            # use n_sample rather than a hard-coded 2, so the two settings above stay in sync
            Score = map(data, ~ as.vector(replicate(n_repeat, sample(.x$score, n_sample))))) %>%
  unnest(Score) %>%
  group_by(ID) %>%
  mutate(sample_no = rep(seq(n_repeat), each = n_sample)) %>%
  ungroup()
#> # A tibble: 8 x 3
#>      ID Score sample_no
#>   <dbl> <dbl>     <int>
#> 1     1    10         1
#> 2     1    20         1
#> 3     1    30         2
#> 4     1    10         2
#> 5     2    50         1
#> 6     2    40         1
#> 7     2    60         2
#> 8     2    40         2
Created on 2021-06-11 by the reprex package (v2.0.0)
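Yet another compact route, not shown in the answers above, is to cross the data with the repetition index first and then sample within each (sample_num, ID) cell. A sketch using tidyr::crossing():
library(dplyr)
library(tidyr)
crossing(sample_num = 1:2, df) %>%  # every row of df repeated for each sample_num
  group_by(sample_num, ID) %>%
  slice_sample(n = 2) %>%
  ungroup() %>%
  arrange(ID, sample_num)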

Dplyr across + mutate + condition to select the columns

I am sure the solution is a one liner, but I am banging my head against the wall.
See the very short reprex at the end of the post; how do I tell dplyr that I want to double only the columns without NA?
Many thanks
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
df <- tibble(x = 1:10, y = 101:110,
             w = c(6, NA, 4, NA, 5, 0, NA, 4, 8, 17),
             z = c(2, 3, 4, NA, 5, 10, 22, 34, 58, 7),
             k = rep("A", 10))
df
#> # A tibble: 10 x 5
#>        x     y     w     z k
#>    <int> <int> <dbl> <dbl> <chr>
#>  1     1   101     6     2 A
#>  2     2   102    NA     3 A
#>  3     3   103     4     4 A
#>  4     4   104    NA    NA A
#>  5     5   105     5     5 A
#>  6     6   106     0    10 A
#>  7     7   107    NA    22 A
#>  8     8   108     4    34 A
#>  9     9   109     8    58 A
#> 10    10   110    17     7 A
df %>% mutate(across(where(is.numeric), ~.x*2))
#> # A tibble: 10 x 5
#>        x     y     w     z k
#>    <dbl> <dbl> <dbl> <dbl> <chr>
#>  1     2   202    12     4 A
#>  2     4   204    NA     6 A
#>  3     6   206     8     8 A
#>  4     8   208    NA    NA A
#>  5    10   210    10    10 A
#>  6    12   212     0    20 A
#>  7    14   214    NA    44 A
#>  8    16   216     8    68 A
#>  9    18   218    16   116 A
#> 10    20   220    34    14 A
##now double the value of all the columns without NA. How to fix this...
df %>% mutate(across(where(sum(is.na(.x))==0), ~.x*2))
#> Error: Problem with `mutate()` input `..1`.
#> ✖ object '.x' not found
#> ℹ Input `..1` is `across(where(sum(is.na(.x)) == 0), ~.x * 2)`.
Created on 2020-10-27 by the reprex package (v0.3.0.9001)
Here is the one-liner you are looking for:
df %>% mutate(across(where(~is.numeric(.) && all(!is.na(.))), ~.x*2))
Output
# A tibble: 10 x 5
       x     y     w     z k
   <dbl> <dbl> <dbl> <dbl> <chr>
 1     2   202     6     2 A
 2     4   204    NA     3 A
 3     6   206     4     4 A
 4     8   208    NA    NA A
 5    10   210     5     5 A
 6    12   212     0    10 A
 7    14   214    NA    22 A
 8    16   216     4    34 A
 9    18   218     8    58 A
10    20   220    17     7 A
Note that the aim is to select columns that don't have any NA and that are numeric. Recall that the input to where() must be a function. In your case just do:
df %>% mutate(across(where(~ is.numeric(.x) & sum(is.na(.x)) == 0), ~ .x * 2))
Well to give you other ways:
df %>% mutate(across(where(~!anyNA(.) & is.numeric(.)), ~.*2))
# A tibble: 10 x 5
       x     y     w     z k
   <dbl> <dbl> <dbl> <dbl> <chr>
 1     2   202     6     2 A
 2     4   204    NA     3 A
 3     6   206     4     4 A
 4     8   208    NA    NA A
 5    10   210     5     5 A
 6    12   212     0    10 A
 7    14   214    NA    22 A
 8    16   216     4    34 A
 9    18   218     8    58 A
10    20   220    17     7 A
If you know how to use the Negate function:
df %>% mutate(across(where(~Negate(anyNA)(.) & is.numeric(.)), ~.*2))
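If the predicate comes up often, it can also be factored out into a named helper (no_na_numeric is just an illustrative name), which keeps the across() call readable:
no_na_numeric <- function(x) is.numeric(x) && !anyNA(x)
df %>% mutate(across(where(no_na_numeric), ~ .x * 2))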
