Computing Change from Baseline in R - r

I have a dataset in R, which contains observations by time. For each subject, I have up to 4 rows, and a variable of ID along with a variable of Time and a variable called X, which is numerical (but can also be categorical for the sake of the question). I wish to compute the change from baseline for each row, by ID. Until now, I did this in SAS, and this was my SAS code:
data want;
retain baseline;
set have;
if (first.ID) then baseline = .;
if (first.ID) then baseline = X;
else baseline = baseline;
by ID;
Change = X-baseline;
run;
My question is: How do I do this in R ?
Thank you in advance.
Dataset Example (in SAS, I don't know how to do it in R).
data have;
input ID, Time, X;
datalines;
1 1 5
1 2 6
1 3 8
1 4 9
2 1 2
2 2 2
2 3 7
2 4 0
3 1 1
3 2 4
3 3 5
;
run;

Generate some example data:
dta <- data.frame(id = rep(1:3, each=4), time = rep(1:4, 3), x = rnorm(12))
# > dta
# id time x
# 1 1 1 -0.232313499
# 2 1 2 1.116983376
# 3 1 3 -0.682125947
# 4 1 4 -0.398029820
# 5 2 1 0.440525082
# 6 2 2 0.952058966
# 7 2 3 0.690180586
# 8 2 4 -0.995872696
# 9 3 1 0.009735667
# 10 3 2 0.556254340
# 11 3 3 -0.064571775
# 12 3 4 -1.003582676
I use the package dplyr for this. This package is not installed by default, so, you'll have to install it first if it isn't already.
The steps are: group the data by id (following operations are done per group), sort the data to make sure it is ordered on time (that the first record is the baseline), then calculate a new column which is the difference between x and the first value of x. The result is stored in a new data.frame, but can of course also be assigned back to dta.
library(dplyr)
dta_new <- dta %>% group_by(id) %>% arrange(id, time) %>%
mutate(change = x - first(x))
# > dta_new
# Source: local data frame [12 x 4]
# Groups: id [3]
#
# id time x change
# <int> <int> <dbl> <dbl>
# 1 1 1 -0.232313499 0.00000000
# 2 1 2 1.116983376 1.34929688
# 3 1 3 -0.682125947 -0.44981245
# 4 1 4 -0.398029820 -0.16571632
# 5 2 1 0.440525082 0.00000000
# 6 2 2 0.952058966 0.51153388
# 7 2 3 0.690180586 0.24965550
# 8 2 4 -0.995872696 -1.43639778
# 9 3 1 0.009735667 0.00000000
# 10 3 2 0.556254340 0.54651867
# 11 3 3 -0.064571775 -0.07430744
# 12 3 4 -1.003582676 -1.01331834

Related

Select Random Consecutive Rows Per Group

I have data which is grouped by 'student_id':
my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
result = rnorm(15,60,10))
my_data
student_id exam_no result
1 1 1 56.60374
2 1 2 55.76655
3 1 3 53.81728
4 1 4 74.82202
5 1 5 34.91834
6 2 1 58.32422
7 2 2 60.38213
8 2 3 49.40390
9 2 4 63.85426
10 2 5 40.32912
11 3 1 69.54969
12 3 2 43.36639
13 3 3 37.97265
14 3 4 52.36436
15 3 5 61.62080
My Question:
For each student, I want to select a set of consecutive rows, with random start and end rows.
For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.
I thought of the following way to do this:
Create a data frame that contains the max number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could be different)
library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())
# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)
From here, I was then going to write a loop that would select rows between "min" and "max" index for each student (e.g. my_data[min:max]), but the results from the previous code are giving me warnings and illogical results:
Warning message:
In 1:counts$counts :
numerical expression has 3 elements: only the first used
Warning messages:
1: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
# A tibble: 3 x 4
student_id counts min max
<dbl> <int> <int> <int>
1 1 5 4 5
2 2 5 4 5
3 3 5 4 5
I am not sure how to continue this - can someone please show me how to continue?
Thanks!
A base R option using cumsum to label the in-between consecutive rows
subset(
my_data,
ave(
exam_no,
student_id,
FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
) == 1
)
which gives, for example
student_id exam_no result
2 1 2 61.83643
3 1 3 51.64371
4 1 4 75.95281
6 2 1 51.79532
7 2 2 64.87429
8 2 3 67.38325
11 3 1 75.11781
12 3 2 63.89843
13 3 3 53.78759
A more compact version by data.table with a similar idea as above is
library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
Using data.table, within each group, sample two values from .I (without replacement), and create a sequence of indices.
library(data.table)
setDT(my_data)
set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]
# student_id exam_no result
# <num> <num> <num>
# 1: 1 5 74.05672
# 2: 1 4 49.37525
# 3: 1 3 67.41662
# 4: 1 2 67.64935
# 5: 2 4 55.15337
# 6: 2 3 58.95694
# 7: 3 4 50.79859
# 8: 3 3 53.66886
# 9: 3 2 47.01089

R: splitting dataframe into distinct subgroups containing sequence of groups

This question is similar to one already answered: R: Splitting dataframe into subgroups consisting of every consecutive 2 groups
However, rather than splitting into subgroups that have a type in common, I need to split into subgroups that contain two consecutive types and are distinct. The groups in my actual data have differing numbers of rows as well.
df <- data.frame(ID=c('1','1','1','1','1','1','1'), Type=c('a','a','b','c','c','d','d'), value=c(10,2,5,3,7,3,9))
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
So subgroup 1 would be Type a and b:
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
And subgroup 2 would be Type c and d:
ID Type value
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
I have tried manipulating the code from this previous example, but I can't figure out how to make this happen without having overlapping Types in each group. Any help would be greatly appreciated - thanks!
EDIT: thanks for pointing out I didn't actually include the correct link.
We can do a little manipulation of a dense_rank of the Type variable to make an appropriate grouping variable:
library(dplyr)
df %>%
group_by(g = (dense_rank(match(Type, Type)) - 1) %/% 2) %>%
group_split()
# [[1]]
# # A tibble: 3 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 a 10 0
# 2 1 a 2 0
# 3 1 b 5 0
#
# [[2]]
# # A tibble: 4 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 c 3 1
# 2 1 c 7 1
# 3 1 d 3 1
# 4 1 d 9 1
Explanation: match(Type, Type) converts Type into integers ordered by number of appearance - but not dense. dense_rank() makes that dense (no gaps). We then subtract 1 to make it start at 0 and %/% 2 to see how many 2s go into it, effectively grouping by pairs.
Here is a rle way, written as a function. Pass the data.frame and the split column name as a character string.
df <- data.frame(ID=c('1','1','1','1','1','1','1'),
Type=c('a','a','b','c','c','d','d'),
value=c(10,2,5,3,7,3,9))
split_two <- function(x, col) {
r <- rle(x[[col]])
r$values[c(FALSE, TRUE)] <- r$values[c(TRUE, FALSE)]
split(x, inverse.rle(r))
}
split_two(df, "Type")
#> $a
#> ID Type value
#> 1 1 a 10
#> 2 1 a 2
#> 3 1 b 5
#>
#> $c
#> ID Type value
#> 4 1 c 3
#> 5 1 c 7
#> 6 1 d 3
#> 7 1 d 9
Created on 2023-02-09 with reprex v2.0.2

reshaping data with time represented as spells

I have a dataset in which time is represented as spells (i.e. from time 1 to time 2), like this:
d <- data.frame(id = c("A","A","B","B","C","C"),
t1 = c(1,3,1,3,1,3),
t2 = c(2,4,2,4,2,4),
value = 1:6)
I want to reshape this into a panel dataset, i.e. one row for each unit and time period, like this:
result <- data.frame(id = c("A","A","A","A","B","B","B","B","C","C","C","C"),
t= c(1:4,1:4,1:4),
value = c(1,1,2,2,3,3,4,4,5,5,6,6))
I am attempting to do this with tidyr and gather but not getting the desired result. I am trying something like this which is clearly wrong:
gather(d, 't1', 't2', key=t)
In the actual dataset the spells are irregular.
You were almost there.
Code
d %>%
# Gather the needed variables. Explanation:
# t_type: How will the call the column where we will put the former
# variable names under?
# t: How will we call the column where we will put the
# values of above variables?
# -id,
# -value: Which columns should stay the same and NOT be gathered
# under t_type (key) and t (value)?
#
gather(t_type, t, -id, -value) %>%
# Select the right columns in the right order.
# Watch out: We did not select t_type, so it gets dropped.
select(id, t, value) %>%
# Arrange / sort the data by the following columns.
# For a descending order put a "-" in front of the column name.
arrange(id, t)
Result
id t value
1 A 1 1
2 A 2 1
3 A 3 2
4 A 4 2
5 B 1 3
6 B 2 3
7 B 3 4
8 B 4 4
9 C 1 5
10 C 2 5
11 C 3 6
12 C 4 6
So, the goal is to melt t1 and t2 columns and to drop the key column that will appear as a result. There are a couple of options. Base R's reshape seems to be tedious. We may, however, use melt:
library(reshape2)
melt(d, measure.vars = c("t1", "t2"), value.name = "t")[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
where -3 drop the key column. We may indeed also use gather as in
gather(d, "key", "t", t1, t2)[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4

how to subset a data frame up until a point R

i want to subset a data frame and take all observations for each id until the first observation that didn't meet my condition. Something like this:
goodDaysAfterTreatMent <- subset(Patientdays, treatmentDate < date & goodThings > badThings)
Except that this returns all observations that meet the condition. I want something that stops with the first observation that didn't meet the condition, moves on to the next id, and returns all observations for this id that meets the condition, and so on.
the only way i can see is to use a lot of loops but loops and that's usually not a god thing.
Hope you guys have an idea
Assume that your condition is to return rows where v < 5 :
# example dataset
df = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3),
v = c(2,4,3,5,4,5,6,7,5,4,1))
df
# id v
# 1 1 2
# 2 1 4
# 3 1 3
# 4 1 5
# 5 2 4
# 6 2 5
# 7 2 6
# 8 2 7
# 9 3 5
# 10 3 4
# 11 3 1
library(tidyverse)
df %>%
group_by(id) %>% # for each id
mutate(flag = cumsum(ifelse(v < 5, 1, NA))) %>% # check if v < 5 and fill with NA all rows when condition is FALSE and after that
filter(!is.na(flag)) %>% # keep only rows with no NA flags
ungroup() %>% # forget the grouping
select(-flag) # remove flag column
# # A tibble: 4 x 2
# id v
# <dbl> <dbl>
# 1 1 2
# 2 1 4
# 3 1 3
# 4 2 4
Easy way:
Find First FALSE by (min(which(condition == F)):
Patientdays<-cbind.data.frame(treatmentDate=c(1:5,4,6:10),date=c(2:5,3,6:10,10),goodThings=c(1:11),badThings=c(0:10))
attach(Patientdays)# Just due to ease of use (optional)
condition<-treatmentDate < date & goodThings > badThings
Patientdays[1:(min(which(condition == F))-1),]
Edit: Adding result.
treatmentDate date goodThings badThings
1 1 2 1 0
2 2 3 2 1
3 3 4 3 2
4 4 5 4 3

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see syntax/logic is quite similar between these two solutions:
Find event's that are equal to x
Extract their Sequence
Subset table according to specified Sequence
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

Resources