I am looking to generate a longitudinal dataset. I have generated my patient numbers and treatment groups:
library(dplyr)
set.seed(420)
Pat_TNO <- 1001:1618
data.frame(Pat_TNO = Pat_TNO) %>%
  rowwise() %>%
  mutate(
    trt = rbinom(1, 1, 0.5)
  )
My timepoints are defined in weeks and converted to days:
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
How can I pivot this dataset using the vector timepoint_days, so that I have 10 rows per participant and the columns Pat_TNO, trt, and timepoint_days?
You can use the unnest() function from tidyr to achieve what you want.
Here is the code:
library(dplyr)
library(tidyr)
set.seed(420)
Pat_TNO <- 1001:1618
x <- data.frame(Pat_TNO = Pat_TNO) %>%
  rowwise() %>%
  mutate(
    trt = rbinom(1, 1, 0.5)
  )
timepoint_weeks <- c(seq(2, 12, 2), 16, 20, 24, 52)
timepoint_days <- 7 * timepoint_weeks
x %>%
  mutate(timepoint_days = list(timepoint_days)) %>%
  unnest(timepoint_days)
Output
# A tibble: 6,180 × 3
Pat_TNO trt timepoint_days
<int> <int> <dbl>
1 1001 1 14
2 1001 1 28
3 1001 1 42
4 1001 1 56
5 1001 1 70
6 1001 1 84
7 1001 1 112
8 1001 1 140
9 1001 1 168
10 1001 1 364
# … with 6,170 more rows
Here I used mutate() to add a column holding the full timepoint_days vector as a list in every row, and unnest() then expands each list element into its own row, giving 10 rows per participant.
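If you prefer to avoid the list-column, the same expansion can be written as a cross join. A minimal sketch, assuming dplyr >= 1.1.0 (which introduced cross_join()):
library(dplyr)
x %>%
  ungroup() %>%  # drop the rowwise grouping before joining
  cross_join(tibble(timepoint_days = timepoint_days))
Every row of x is paired with every element of timepoint_days, giving the same 6,180 rows.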
I have a dataset which I want to loop over with an if function:
id <- c(1,2,3,4,5)
value1 <- c(25, 100, 15, 20, 30)
value2 <- c(130, 25, 10, 30, 20)
value3 <- c(50, 60, 20, 120, 10)
month <- c(2, 3, 4, 2, 3)
df1 <- data.frame(id, value1, value2, value3, month)
I need an if function that would calculate the differences between value1 and value2 and between value1 and value3, and check if one OR the other is higher than 35%. This should not be the change (increase or decrease) but just the difference in percentages. And when the difference is equal to or higher than 35%, it should add all the variables for that specific id to a separate dataframe/table.
However, I need the function to loop over all the ids instead of just the first one.
This is what i have so far:
library(threadr)
if (percentage_difference(value1, value2) >= 35 | percentage_difference(value1, value3) >= 35) { print("bad") }
If I understand you correctly, this should do the trick, though I am not sure how you would want to flag the rows where the difference is >= 35.
Here I used a boolean TRUE/FALSE column:
df2 <- df1
df2$perc_diff <- pmax(df2$value1 - df2$value2, df2$value1 - df2$value3)
df2$over35 <- df2$perc_diff >= 35
df2
# id value1 value2 value3 month perc_diff over35
# 1 1 25 130 50 2 -25 FALSE
# 2 2 100 25 60 3 75 TRUE
# 3 3 15 10 20 4 5 FALSE
# 4 4 20 30 120 2 -10 FALSE
# 5 5 30 20 10 3 20 FALSE
Here I removed those < 35:
df3 <- df1
df3$perc_diff <- pmax(df3$value1 - df3$value2, df3$value1 - df3$value3)
df3[df3$perc_diff >= 35, ]
#   id value1 value2 value3 month perc_diff
# 2  2    100     25     60     3        75
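Note that value1 - value2 is a raw difference rather than a percentage. If you need an actual percentage difference, here is a minimal sketch assuming the common definition (absolute difference relative to the mean of the two values); pct_diff is a hypothetical helper, so check how threadr::percentage_difference defines it if you need to match that package exactly:
# assumed definition: |a - b| as a percentage of the mean of a and b
pct_diff <- function(a, b) abs(a - b) / ((a + b) / 2) * 100
df_pct <- df1
df_pct$flag <- pct_diff(df_pct$value1, df_pct$value2) >= 35 |
  pct_diff(df_pct$value1, df_pct$value3) >= 35
df_pct[df_pct$flag, ]  # rows where either difference is at least 35%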
Here is a reproducible example of the situation I need help with. I have a database (db1) in which weekly ratings of behavioral outcomes are recorded. The variable "Week" corresponds to the number of the week from the beginning of the year (e.g., Week = 1 indicates the week between January 1st and 7th, and so on) and the variable "Score" to the value obtained by the subject on the criterion measure. In the real data set, I have several participants and a different number of ratings for each subject; however, in this example there is only one subject to make things easier.
library(magrittr)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1
# Week Score
#1 14 34
#2 18 21
#3 19 45
#4 20 32
#5 21 56
#6 23 45
#7 24 23
#8 25 48
What I need to do is to identify the highest number of ratings that occurred in consecutive weeks in the database. In the example, the highest number is 4 because the ratings were consecutive from week 18 to 21. Here I added a column for demonstration, but it might not be necessary for the solution.
x2 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y2 <- c(34, 21, 45, 32, 56, 45, 23, 48)
z2 <- c(1, 1, 2, 3, 4, 1, 2, 3)
db2 <- cbind(x2, y2, z2) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db2
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 23 45 1
#7 24 23 2
#8 25 48 3
Finally, because every subject has to have a total of five consecutive ratings, I need to add a row with a missing datum where the highest number of consecutive weeks is below five (so that I can impute the missing data later on). However, there might be ratings before and after the sequence. If that is the case, I want to add the row based on the minimal distance between the first or last week of the longest series of consecutive weeks and the nearest other existing rating. In the example, that means the row with the missing datum will be added after week 21, because there are 4 missing weeks between weeks 14 and 18 but only 1 between weeks 21 and 23.
x3 <- c(14, 18, 19, 20, 21, 22, 23, 24, 25)
y3 <- c(34, 21, 45, 32, 56, NA, 45, 23, 48)
z3 <- c(1, 1, 2, 3, 4, 5, 1, 2, 3)
db3 <- cbind(x3, y3, z3) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db3
# Week Score Consecutive
#1 14 34 1
#2 18 21 1
#3 19 45 2
#4 20 32 3
#5 21 56 4
#6 22 NA 5
#7 23 45 1
#8 24 23 2
#9 25 48 3
For your information, this is not going to be part of the main statistical analyses but rather one of several ways I want to use to test the sensitivity of my model. So do not worry about whether it makes sense from a methodological point of view. In addition, if possible, a tidyverse solution would be greatly appreciated.
Thanks so much to anyone who will take the time.
The code is relatively simpler if you want to do it just for the longest run and, if there is more than one, just for the first of them:
library(tidyverse)
db1 %>% mutate(consecutive = accumulate(diff(Week), .init = 1, ~ if (.y == 1) .x + 1 else 1),
               dummy = max(consecutive) == consecutive & max(consecutive) < 5) %>%
  group_by(grp = cumsum(consecutive == 1)) %>%
  filter(sum(dummy) > 0) %>%   # keep only the group(s) containing the max consecutive run
  ungroup() %>% select(-dummy) %>%
  filter(grp == min(grp)) %>%  # keep the first such group, if there is more than one
  complete(consecutive = 1:5) %>%
  select(-grp) %>%
  mutate(Week = first(Week) + consecutive - 1)
# A tibble: 5 x 3
consecutive Week Score
<dbl> <dbl> <dbl>
1 1 18 21
2 2 19 45
3 3 20 32
4 4 21 56
5 5 22 NA
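For clarity, the accumulate() call is what builds the run counter: it walks along the week-to-week gaps, adding 1 while the gap is exactly one week and resetting to 1 otherwise. On the example weeks:
library(purrr)
accumulate(diff(c(14, 18, 19, 20, 21, 23, 24, 25)), .init = 1,
           ~ if (.y == 1) .x + 1 else 1)
#> [1] 1 1 2 3 4 1 2 3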
OLD ANSWER: another tidyverse strategy (this can be modified to suit additional column requirements, which you have not given in the sample):
library(tidyverse)
db1
#> Week Score
#> 1 14 34
#> 2 18 21
#> 3 19 45
#> 4 20 32
#> 5 21 56
#> 6 23 45
#> 7 24 23
#> 8 25 48
library(data.table)  # for rleid()
db1 %>% mutate(consecutive = accumulate(diff(Week), .init = 1, ~ if (.y == 1) .x + 1 else 1),
               dummy = max(consecutive) == consecutive & max(consecutive) < 5,
               dummy2 = rleid(dummy)) %>%
  group_split(dummy2, .keep = FALSE) %>%
  map_if(~ .x$dummy[[1]],
         ~ .x %>% complete(consecutive = seq(max(consecutive), 5, 1), fill = list(Week = 1)) %>%
           mutate(Week = cumsum(Week))) %>%
  map_dfr(~ .x %>% select(-dummy))
#> # A tibble: 9 x 3
#> Week Score consecutive
#> <dbl> <dbl> <dbl>
#> 1 14 34 1
#> 2 18 21 1
#> 3 19 45 2
#> 4 20 32 3
#> 5 21 56 4
#> 6 22 NA 5
#> 7 23 45 1
#> 8 24 23 2
#> 9 25 48 3
Created on 2021-06-10 by the reprex package (v2.0.0)
If I understand correctly:
library(data.table)
library(tidyverse)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1 %>%
  mutate(grp = cumsum(c(0, diff(Week)) > 1)) %>%
  group_by(grp) %>%
  mutate(n_grp = n()) %>%
  ungroup() %>%
  filter(n_grp == max(n_grp, na.rm = TRUE)) %>%
  complete(grp,
           n_grp,
           nesting(Week = seq(from = first(Week), length.out = 5))) %>%
  select(-c(grp, n_grp)) %>%
  rows_upsert(db1, by = c("Week", "Score"))
#> # A tibble: 9 x 2
#> Week Score
#> <dbl> <dbl>
#> 1 18 21
#> 2 19 45
#> 3 20 32
#> 4 21 56
#> 5 22 NA
#> 6 14 34
#> 7 23 45
#> 8 24 23
#> 9 25 48
Created on 2021-06-10 by the reprex package (v2.0.0)
You can also use the following solution. Midway through, before we use add_row to add the additional rows, we filter the whole data set after group_split: I kept only the group(s) with the maximum number of observations, meaning the longest run of consecutive Weeks. After splitting by the grouping variable we may end up with 2 or more groups of equally long consecutive Weeks, in which case you can choose whichever you like based on your preference:
library(dplyr)
library(purrr)
library(tibble)
db1 %>%
  mutate(Consecutive = +(Week - lag(Week, default = first(Week)) == 1),
         grp = cumsum(Consecutive == 0)) %>%
  group_by(grp) %>%
  mutate(Consecutive = row_number()) %>%
  group_by(grp, .drop = TRUE) %>%
  add_count() %>%
  ungroup() -> db2  # we create our grouping variable `grp` here
db2 %>%
  filter(n == max(n)) %>%
  group_split(grp) %>%
  map_dfr(~ add_row(.x,
                    Week = .x$Week[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
                    Consecutive = .x$Consecutive[.x$n[1]] + seq(1, 5 - .x$n[1], 1),
                    grp = .x$grp[1])) %>%
  bind_rows(db2 %>%
              filter(n != max(n))) %>%
  select(-c(grp, n)) %>%
  arrange(Week)
# A tibble: 9 x 3
Week Score Consecutive
<dbl> <dbl> <dbl>
1 14 34 1
2 18 21 1
3 19 45 2
4 20 32 3
5 21 56 4
6 22 NA 5
7 23 45 1
8 24 23 2
9 25 48 3
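For reference, if all you need is the length of the longest run of consecutive weeks (without the row-completion step), a compact base R sketch:
# a new group starts whenever the gap to the previous week is not exactly 1
grp <- cumsum(c(1, diff(db1$Week) != 1))
max(rle(grp)$lengths)
#> [1] 4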
I have a data frame with values and I need a new column with the values shifted some rows down, but the data frame has to gain rows to accommodate the shifted data.
What I've got so far:
df <- data.frame(day = 1:5, value = floor(runif(5, min = 0, max = 101)))
df %>% dplyr::mutate(value2 = dplyr::lag(value, n = 2, default = 0))
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
Expected result:
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
6 6 0 14
7 7 0 10
Stuck on making the data frame grow the needed rows.
Here's a way with dplyr:
df %>%
  bind_rows(
    tail(df, 2) %>%
      mutate(day = day + 2, value = 0)
  ) %>%
  mutate(value2 = lag(value, 2, default = 0))
day value value2
1 1 19 0
2 2 78 0
3 3 18 19
4 4 14 78
5 5 10 18
6 6 0 14
7 7 0 10
Use a merge: create the "target" dataset with however many rows you want, fill in NA values with 0, then remap the lagged values onto "value2". It's useful to store the lag as a variable, at the risk of being more verbose.
have <- data.frame(
day= 1:5,
value = c(19, 78, 18, 14, 10),
value2 = c(0, 0, 19, 78, 18)
)
target <- data.frame(
day=1:7
)
want <- merge(have, target, by = 'day', all = TRUE)
want[is.na(want)] <- 0
lag <- 2  # note: this masks stats::lag within this session
## just one way of mapping a lagged response
want$value2 <- c(rep(0, lag), rev(rev(want$value)[-(1:lag)]))
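For completeness, with the hard-coded have values the result matches the expected output:
want
#   day value value2
# 1   1    19      0
# 2   2    78      0
# 3   3    18     19
# 4   4    14     78
# 5   5    10     18
# 6   6     0     14
# 7   7     0     10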
I'm trying to extract the rows from a data frame containing the lowest value in a specific column:
income = c(2, 3, 5, 5, -15, 2, 1)
balance = c(15, 17, 20, 25, 30, 15, 17)
date = as.Date(c("2016/02/11", "2016/02/14", "2017/02/16", "2016/03/01", "2017/03/12", "2016/04/11", "2017/04/24"))
df = data.frame(income, balance, date)
Now I want to get the rows containing the minimum "balance" value from each month, so that the outcome would be a data frame looking like this:
  income balance       date
1      2      15 2016-02-11
2      5      25 2016-03-01
3      2      15 2016-04-11
I have tried the aggregate function:
bymonth = aggregate(balance ~ months(date), data = df, FUN = min)
print(bymonth)
But this gives me the following output (month names are in my locale):
months(date) balance
1 April 15
2 Februar 15
3 Marts 25
Help!
We can do this with dplyr. After grouping by the months of 'date', we slice() the row which has the minimum 'balance' and remove the 'mth' column using select():
library(dplyr)
df %>%
  group_by(mth = months(date)) %>%
  slice(which.min(balance)) %>%
  ungroup() %>%
  select(-mth)
# A tibble: 3 x 3
# income balance date
# <dbl> <dbl> <date>
#1 2 15 2016-04-11
#2 2 15 2016-02-11
#3 5 25 2016-03-01
Note that if there are ties for the minimum 'balance', use filter(balance == min(balance)) in place of slice(), as sketched below.
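A minimal sketch of that variant, which keeps all tied rows (reusing the mth grouping from above):
df %>%
  group_by(mth = months(date)) %>%
  filter(balance == min(balance)) %>%
  ungroup() %>%
  select(-mth)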
Or with ave from base R to create a logical vector and use that to subset the rows of 'df':
df[with(df, ave(balance, months(date), FUN = min) == balance), ]
# income balance date
#1 2 15 2016-02-11
#4 5 25 2016-03-01
#6 2 15 2016-04-11
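One caveat with both approaches: months() ignores the year, so February 2016 and February 2017 fall into the same group (harmless in this sample, where the 2016 rows happen to hold each month's minimum). If you want true year-month groups, a sketch of the base R version grouping on format(date, "%Y-%m") instead:
# group by year-month instead of month name, so years are not mixed
df[with(df, ave(balance, format(date, "%Y-%m"), FUN = min) == balance), ]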