Add group to data frame using monotonically increasing numbers - r

I've got a data frame that looks like this (the real data is much larger and more complicated):
df.test = data.frame(
sample = c("a","a","a","a","a","a","b","b"),
day = c(0,1,2,0,1,3,0,2),
value = rnorm(8)
)
sample day value
1 a 0 -1.11182146
2 a 1 0.65679637
3 a 2 0.03652325
4 a 0 -0.95351736
5 a 1 0.16094840
6 a 3 0.06829702
7 b 0 0.33705141
8 b 2 0.24579603
The data frame is organized by experiment, but the experiment ids are missing. The same sample can be used in different experiments, but I know that within a single experiment the days start from 0 and are monotonically increasing.
How can I add experiment ids, which can simply be numbers {1, 2, ...}?
The resulting data frame would be:
sample day value exp
1 a 0 -1.11182146 1
2 a 1 0.65679637 1
3 a 2 0.03652325 1
4 a 0 -0.95351736 2
5 a 1 0.16094840 2
6 a 3 0.06829702 2
7 b 0 0.33705141 3
8 b 2 0.24579603 3
I would appreciate any help, especially with a tidy/dplyr solution.

As indicated in the comments, you can do this with cumsum:
library(dplyr)
df.test %>% mutate(exp = cumsum(day == 0))
## sample day value exp
## 1 a 0 0.09300394 1
## 2 a 1 0.85322925 1
## 3 a 2 -0.25167313 1
## 4 a 0 -0.14811243 2
## 5 a 1 -1.86789014 2
## 6 a 3 0.45983987 2
## 7 b 0 2.81199150 3
## 8 b 2 0.31951634 3
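To see why this works, here is a minimal base R sketch of the intermediate steps, using only the day column from the question. The names starts and exp_id are just illustrative, and it relies on the question's statement that every experiment begins at day 0.

```r
day <- c(0, 1, 2, 0, 1, 3, 0, 2)

# TRUE on the first row of each experiment (assumes every experiment starts at day 0)
starts <- day == 0

# TRUE counts as 1, so the running total goes up by one at each start
exp_id <- cumsum(starts)
exp_id
```

If an experiment could start on a day other than 0, this flag would miss it, so the approach depends on that assumption holding in the real data.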

You can use diff to start a new group wherever day decreases:
library(dplyr)
df.test %>% mutate(exp = cumsum(c(TRUE, diff(day) < 0)))
# sample day value exp
#1 a 0 -0.3382010 1
#2 a 1 2.2241041 1
#3 a 2 2.2202612 1
#4 a 0 1.0359635 2
#5 a 1 0.4134727 2
#6 a 3 1.0144439 2
#7 b 0 -0.1292119 3
#8 b 2 -0.1191505 3
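A possible variation, as a sketch only: the diff version above would not split two back-to-back experiments that each contain a single day-0 row (the difference is 0, not negative), and it ignores the sample column. The base R snippet below assumes that either a non-increasing day or a change of sample marks a new experiment. That rule is my assumption, not something stated in the question, and sample_id, new_exp and exp_id are illustrative names.

```r
sample_id <- c("a", "a", "a", "a", "a", "a", "b", "b")
day       <- c(0, 1, 2, 0, 1, 3, 0, 2)
n <- length(day)

# Assumed rule: a new experiment starts on the first row, whenever day does
# not increase, or whenever the sample differs from the previous row
new_exp <- c(TRUE, diff(day) <= 0 | sample_id[-1] != sample_id[-n])

exp_id <- cumsum(new_exp)
exp_id
```

On the example data this gives the same ids as the two answers above.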

Related

Assign sequential group ID given a group start indicator

I need to assign subgroup IDs given a group ID and an indicator showing the beginning of the new subgroup. Here's a test dataset:
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group=group, indic=c(x1, x2))
Here is the resulting data frame:
df
group indic
1 A 0
2 A 0
3 A 0
4 A 1
5 A 1
6 A 1
7 A 0
8 A 0
9 B 0
10 B 1
11 B 0
12 B 1
13 B 0
14 B 1
15 B 0
16 B 1
indic==1 means that row is the beginning of a new subgroup, and the subgroup should be numbered 1 higher than the previous subgroup. Where indic==0 the subgroup should be the same as the previous subgroup. The subgroup numbering starts at 1. When the group variable changes, the subgroup numbering resets to 1. I would like to use the tidyverse framework.
Here is the result that I want:
df
group indic subgroup
1 A 0 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 1 3
6 A 1 4
7 A 0 4
8 A 0 4
9 B 0 1
10 B 1 2
11 B 0 2
12 B 1 3
13 B 0 3
14 B 1 4
15 B 0 4
16 B 1 5
I would like to be able to give some methods that I've tried already but didn't work, but I haven't been able to find anything even close. Any help will be appreciated.
You can just use
library(dplyr)
df %>% group_by(group) %>%
mutate(subgroup=cumsum(indic)+1)
# group indic subgroup
# <fct> <dbl> <dbl>
# 1 A 0 1
# 2 A 0 1
# 3 A 0 1
# 4 A 1 2
# 5 A 1 3
# 6 A 1 4
# 7 A 0 4
# 8 A 0 4
# 9 B 0 1
# 10 B 1 2
# 11 B 0 2
# 12 B 1 3
# 13 B 0 3
# 14 B 1 4
# 15 B 0 4
# 16 B 1 5
We use dplyr to do the grouping, and then cumsum takes the cumulative sum of the indic column within each group, so the count increases by one each time it sees a 1. Adding 1 makes the numbering start at 1.
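For comparison, the same grouped cumulative sum can be sketched in base R with ave, which applies a function within each level of a grouping variable. This is an alternative to the dplyr answer, not a replacement for it; it rebuilds the test data from the question.

```r
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group = group, indic = c(x1, x2))

# Running count of indic within each group, shifted so numbering starts at 1
df$subgroup <- ave(df$indic, df$group, FUN = cumsum) + 1
df$subgroup
```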

Create equal length vectors from time series based upon factor in R

I have a data frame that is something like this:
time type count
1 -2 a 1
2 -1 a 4
3 0 a 6
4 1 a 2
5 2 a 5
6 0 b 3
7 1 b 7
8 2 b 2
I want to create a new data frame that takes type 'b' and creates the full time series by filling in zeroes for count. It should look like this:
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
I can certainly use subset(df, type == 'b') and then hack the beginning with rbind, but I want it to be more dynamic in case the time vector changes.
We can use complete from tidyr to get the full 'time' for all the unique values of 'type' and filter the value of interest in 'type'.
library(tidyr)
library(dplyr)
val <- "b"
df1 %>%
complete(time, type, fill=list(count=0)) %>%
filter(type== val)
# time type count
# <int> <chr> <dbl>
#1 -2 b 0
#2 -1 b 0
#3 0 b 3
#4 1 b 7
#5 2 b 2
With base R:
df1 <- data.frame(time=df[df$type == 'a',]$time, type='b', count=0)
df1[match(df[df$type=='b',]$time, df1$time),]$count <- df[df$type=='b',]$count
df1
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
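A slightly more dynamic base R variant, as a sketch: instead of borrowing the time values from type 'a', build the grid from every time value present in the data and left-join the rows for the type of interest. The names val, grid and out are illustrative, and the data frame is rebuilt from the question.

```r
df <- data.frame(
  time  = c(-2, -1, 0, 1, 2, 0, 1, 2),
  type  = c(rep("a", 5), rep("b", 3)),
  count = c(1, 4, 6, 2, 5, 3, 7, 2)
)
val <- "b"

# One row per time value seen anywhere in the data, labelled with the target type
grid <- data.frame(time = sort(unique(df$time)), type = val)

# Left join: keep every grid row, pull in count where the target type has one
out <- merge(grid, df, by = c("time", "type"), all.x = TRUE)

# Times with no observation for this type come back as NA; fill them with 0
out$count[is.na(out$count)] <- 0
out
```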

Carry Forward First Observation for a Variable For Each Patient

My dataset has 3 variables:
Patient ID Outcome Duration
1 1 3
1 0 4
1 0 5
2 0 2
3 1 1
3 1 2
What I want is for the first observation of "Duration" for each patient ID to be carried forward.
That is, for patient #1 I want Duration to read 3, 3, 3, and for patient #3 I want Duration to read 1, 1.
Here is one way with data.table. You take the first number in Duration and ask R to repeat it for each PatientID.
mydf <- read.table(text = "PatientID Outcome Duration
1 1 3
1 0 4
1 0 5
2 0 2
3 1 1
3 1 2", header = T)
library(data.table)
setDT(mydf)[, Duration := Duration[1L], by = PatientID]
print(mydf)
# PatientID Outcome Duration
#1: 1 1 3
#2: 1 0 3
#3: 1 0 3
#4: 2 0 2
#5: 3 1 1
#6: 3 1 1
This is a good job for dplyr (a data.frame wicked-better successor to plyr with far better syntax than data.table):
library(dplyr)
dat %>%
group_by(`Patient ID`) %>%
mutate(Duration=first(Duration))
## Source: local data frame [6 x 3]
## Groups: Patient ID
##
## Patient ID Outcome Duration
## 1 1 1 3
## 2 1 0 3
## 3 1 0 3
## 4 2 0 2
## 5 3 1 1
## 6 3 1 1
Another alternative using plyr (if you will be doing lots of operations on your dataframe though, and particularly if it's big, I recommend data.table. It has a steeper learning curve but well worth it).
library(plyr)
ddply(mydf, .(PatientID), transform, Duration=Duration[1])
# PatientID Outcome Duration
# 1 1 1 3
# 2 1 0 3
# 3 1 0 3
# 4 2 0 2
# 5 3 1 1
# 6 3 1 1
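If you would rather avoid extra packages, the same idea can be sketched in base R with ave, which applies a function within each PatientID and returns a vector aligned with the original rows. The data frame is rebuilt from the question.

```r
mydf <- data.frame(
  PatientID = c(1, 1, 1, 2, 3, 3),
  Outcome   = c(1, 0, 0, 0, 1, 1),
  Duration  = c(3, 4, 5, 2, 1, 2)
)

# Replace each patient's Duration values with that patient's first value
mydf$Duration <- ave(mydf$Duration, mydf$PatientID, FUN = function(x) x[1])
mydf
```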

Applying a function for calculating AUC for each subject

I want to calculate the area under the curve (AUC) of concentration-time profiles for many subjects (~200). I am using the MESS package, where:
AUC = auc(data$TIME,data$CONC, type = "spline")
How can I apply it to each unique ID in the data set and keep the results in R by adding a new "AUC" column to the original data set?
The data has the following columns:
ID TIME CONC
1 0 0
1 2 4
1 3 7
2 0 0
2 1 NA
2 3 5
2 4 10
One way would be like this. foo is your data.
library(MESS)
library(dplyr)
foo %>%
group_by(ID) %>%
summarize(AUC = auc(TIME,CONC, type = "spline"))
# ID AUC
#1 1 9.12500
#2 2 12.08335
If you want to keep all data, you could do this.
foo %>%
group_by(ID) %>%
mutate(AUC = auc(TIME,CONC, type = "spline"))
# ID TIME CONC AUC
#1 1 0 0 9.12500
#2 1 2 4 9.12500
#3 1 3 7 9.12500
#4 2 0 0 12.08335
#5 2 1 NA 12.08335
#6 2 3 5 12.08335
#7 2 4 10 12.08335
In my opinion, the dplyr solution provided by @jazzurro is the way to go, but here's a base approach for good measure.
d <- read.table(text='ID TIME CONC
1 0 0
1 2 4
1 3 7
2 0 0
2 1 NA
2 3 5
2 4 10', header=TRUE)
library(MESS)
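# Compute one AUC per ID, then bind the one-row data frames together.
# The result is named auc_by_id so it does not mask the auc() function.
auc_by_id <- do.call(rbind, lapply(split(d, d$ID), function(x) {
data.frame(ID=x$ID[1], auc=auc(x$TIME, x$CONC, type='spline'))
}))
merge(d, auc_by_id)
# ID TIME CONC auc
# 1 1 0 0 9.12500
# 2 1 2 4 9.12500
# 3 1 3 7 9.12500
# 4 2 0 0 12.08335
# 5 2 1 NA 12.08335
# 6 2 3 5 12.08335
# 7 2 4 10 12.08335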
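The per-ID pattern itself does not depend on MESS. As an illustration only, here is a base R sketch that swaps in a plain trapezoid rule for the spline AUC, so its numbers will differ from the spline results above. trapz_auc is a hypothetical helper written for this example; it drops rows with a missing CONC and assumes the rows are already sorted by TIME within each ID.

```r
d <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2),
  TIME = c(0, 2, 3, 0, 1, 3, 4),
  CONC = c(0, 4, 7, 0, NA, 5, 10)
)

# Hypothetical helper: trapezoid-rule area, ignoring points with missing values
trapz_auc <- function(time, conc) {
  ok <- !is.na(time) & !is.na(conc)
  t <- time[ok]
  y <- conc[ok]
  sum(diff(t) * (head(y, -1) + tail(y, -1)) / 2)
}

# One value per ID, then looked up by ID so every row keeps its subject's AUC
auc_by_id <- sapply(split(d, d$ID), function(x) trapz_auc(x$TIME, x$CONC))
d$AUC <- unname(auc_by_id[as.character(d$ID)])
d
```

Any function that returns one number per subject, including MESS::auc, can be dropped in where trapz_auc is called.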

Subsequent row summing in dataframe object

I would like to compute a running (cumulative) sum over the rows of one column and put the result into a new column, without deleting any rows, restarting the sum according to the value of another column.
Below is some R code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do this, since the for loop will be time-consuming on my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The purpose of this is so that I can write code that calculates Y2 in MyDf.2 above for each ID, separately.
This is what I came up with, and it does the trick (calculating a TEST column in MyDf that has to be equal to Y2 in MyDf.2):
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum))
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
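If there were no ID column and only the FIRST flag marked where each block begins, one possible sketch is to build the grouping from the flag with cumsum and then take the running sum within those groups. This assumes FIRST is 1 exactly on the first row of each block, as in the example; the name grp is illustrative.

```r
MyDf <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Y = 1:6)
MyDf$FIRST <- c(1, 0, 0, 1, 0, 0)

# Each 1 in FIRST opens a new block, so the running total of FIRST labels the blocks
grp <- cumsum(MyDf$FIRST)

# Cumulative sum of Y that restarts at the start of every block
MyDf$TEST <- ave(MyDf$Y, grp, FUN = cumsum)
MyDf
```

On the example data grp matches ID, so TEST equals the Y2 column in MyDf.2.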
