Creating Event Onset Variable - r

I have clinical data that records each patient at three time points, with a disease outcome indicated by a binary variable. It looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome)
Data
I want to create an onset variable: for each patient it should be 1 at the time point where the patient first has the disease and 0 at every other time point, both before and after (even if the patient still has the disease). For the example data, the result should look like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
outcome_onset <- c(0,1,0,0,0,1,1,0,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome,
outcome_onset=outcome_onset)
Data
I would therefore like some code, or some help, to automate the creation of the outcome_onset variable.

Here is an option that uses cumsum to build a logical vector after grouping by patientid; the unary + converts the logical result to integer:
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = +(cumsum(outcome) == 1))
Or use match and %in%:
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = +(row_number() %in% match(1, outcome)))
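One caveat with a test such as cumsum(outcome) == 1 is that it stays TRUE on later rows where the outcome has dropped back to 0, so a patient with the pattern 1, 0, 1 would be flagged twice. The following is a minimal base R sketch, assuming that onset means the first row with outcome == 1 within each patient and that rows are already sorted by time. Patient 103 and the helper first_onset are hypothetical and are not part of the original example.

```r
# Hypothetical data: patient 103 (not in the original example) recovers and relapses
Data <- data.frame(
  patientid = c(100, 100, 100, 103, 103, 103),
  time      = c(1, 2, 3, 1, 2, 3),
  outcome   = c(0, 1, 1, 1, 0, 1)
)

# Flag a row only if it is positive AND it is the first positive in its group
first_onset <- function(x) as.integer(x == 1 & cumsum(x == 1) == 1)

# ave() applies the helper within each patient and returns a full-length vector
Data$outcome_onset <- ave(Data$outcome, Data$patientid, FUN = first_onset)
Data
```

The extra x == 1 condition is what prevents the second flag for patient 103; missing values in outcome are not handled in this sketch.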

We can use which.max to get the index of the first 1 in the outcome variable, set that row to 1 and set the rest to 0. The second mutate step restores NA wherever outcome is missing.
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = as.integer(row_number() %in% which.max(outcome)),
outcome_onset = replace(outcome_onset, is.na(outcome), NA))
# patientid time outcome outcome_onset
# <dbl> <dbl> <dbl> <int>
#1 100 1 0 0
#2 100 2 1 1
#3 100 3 1 0
#4 101 1 0 0
#5 101 2 0 0
#6 101 3 1 1
#7 102 1 1 1
#8 102 2 1 0
#9 102 3 0 0
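One detail worth checking before relying on which.max: it returns the position of the first maximum, so for a patient whose outcome is 0 throughout it returns 1, and the first row would be flagged as an onset. A small base R sketch of the difference, using a made-up vector (never) and a hypothetical helper (onset_flag) built on match:

```r
never <- c(0, 0, 0)   # hypothetical patient who never develops the disease

which.max(never)      # 1: position of the first maximum, and the maximum is 0 here
match(1, never)       # NA: there is no 1 to find

# Flag the first 1 if there is one; otherwise flag nothing
onset_flag <- function(outcome) {
  as.integer(seq_along(outcome) %in% match(1, outcome))
}

onset_flag(never)        # 0 0 0
onset_flag(c(0, 0, 1))   # 0 0 1
```

The example data in the question has at least one positive outcome per patient, so both approaches agree there.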

Related

How to sum a set of columns grouped by one column

I have a dataframe like so
ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)
cluster <- c(1,2,3,1)
test <- data.frame(ID, Type1, Type2, cluster)
I want to group by cluster and sum the values in all the other columns apart from ID that should be dropped.
I achieved it through
test.sum <- test %>%
group_by(cluster)%>%
summarise(sum(Type1), sum(Type2))
However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?
This is where across() and contains() come in incredibly useful for selecting the columns you want to summarise:
test %>%
group_by(cluster) %>%
summarise(across(contains("Type"), sum))
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:
library(dplyr)
library(tidyr)
test %>%
pivot_longer(-c(ID, cluster)) %>%
group_by(cluster, name) %>%
summarise(sum_value = sum(value)) %>%
pivot_wider(names_from = "name", values_from = "sum_value")
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
Base R
You can exploit split, which plays a role similar to group_by(). This should give you what you are looking for, regardless of how many Types you have.
my_split <- split(subset(test, select = grep('^Ty', names(test))), test[, -1]$cluster)
my_sums <- sapply(my_split, \(x) colSums(x))
my_sums <- data.frame( cluster = as.numeric(gsub("\\D", '', colnames(my_sums))),
t(my_sums) )
Output
> my_sums
cluster Type1 Type2
1 1 2 0
2 2 1 1
3 3 0 1
Note: use function(x) instead of \(x) if your version of R is older than 4.1.0.
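Another base R route, offered here as a sketch rather than as part of the answers above, is aggregate with a formula. The dot on the left-hand side stands for every column not named on the right, so the number of Type columns does not matter once ID has been dropped.

```r
test <- data.frame(
  ID      = c("John", "Bill", "Alice", "Paulina"),
  Type1   = c(1, 1, 0, 1),
  Type2   = c(0, 1, 1, 0),
  cluster = c(1, 2, 3, 1)
)

# Drop ID, then sum every remaining column within each cluster
test.sum <- aggregate(. ~ cluster, data = test[, names(test) != "ID"], FUN = sum)
test.sum
```

Note that the formula interface drops rows containing NA by default, which may or may not be what you want on real data.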

How to visualize the data if one participant has multiple entries in different rows?

I am currently working on a dataset which consists of multiple participants. Some participants have participated in all follow-ups, whereas others have skipped some.
For example, in the dataset below, participant 2 only took part in the 3rd follow-up, and participant 3 only took part in the 2nd and 3rd follow-ups. Some participants also have more than one row because they attended several follow-ups.
The original dataset only has the 1st and the 2nd column. Since I am aiming to create a progress chart like this
I have tried to create extra columns for each visit by using the code below:
participant <- c(1,1,1,2,3,3,4,5,5,5 )
visit <- c(1,2,3,3,2,3,1,1,2,3)
df <- data.frame(participant, visit)
df[,3] <- as.integer(df$visit=="1")
df[,4] <- as.integer(df$visit=="2")
df[,5] <- as.integer(df$visit=="3")
colnames(df)[colnames(df) %in% c("V3","V4","V5")] <- c(
"Visit1","Visit2","Visit3")
However, I am still having a hard time combining the rows of the same participant, so I could not proceed to making the chart (which I also have no clue about). I tried the reshape function, but it did not work. The group_by function did not work either and still showed the original dataset:
df1 <- df[,-2]
df1 %>%
group_by(participant)
What functions should I use in this case for:
combining rows of the same participant?
producing the progress chart?
Thank you in advance for your help!
Based on your df you could produce the chart with
library(ggplot2)
library(dplyr)
df %>%
ggplot(aes(x = as.factor(visit),
y = as.factor(participant),
fill = as.factor(visit))) +
geom_tile(aes(width = 0.7, height = 0.7), color = "black") +
scale_fill_grey() +
xlab("Visit") +
ylab("Participants") +
guides(fill = "none")
If you need your data.frame in a wide format (similar to the image shown but with only one row per participant), use
library(tidyr)
library(dplyr)
df %>%
mutate(value = 1) %>%
pivot_wider(
names_from = visit,
values_from = value,
names_glue = "Visit{visit}",
values_fill = 0)
to get
# A tibble: 5 x 4
participant Visit1 Visit2 Visit3
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
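If you would rather stay in base R, a cross-tabulation gives the same one-row-per-participant layout. This is only a sketch: the names attendance and wide are illustrative, and the cells are counts, so a participant recorded twice for the same visit would show 2 rather than 1.

```r
participant <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5)
visit <- c(1, 2, 3, 3, 2, 3, 1, 1, 2, 3)
df <- data.frame(participant, visit)

# Count how often each participant appears at each visit
attendance <- table(participant = df$participant, visit = df$visit)

# Turn the table into a plain data frame with Visit1..Visit3 columns;
# the row names hold the participant ids
wide <- as.data.frame(unclass(attendance))
names(wide) <- paste0("Visit", colnames(attendance))
wide
```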
I think you are looking for a way to dummify a variable.
There are several ways to do that.
I like the fastDummies package. You can use dummy_cols, with remove_selected_columns=TRUE.
df %>% fastDummies::dummy_cols(select_columns = 'visit',
remove_selected_columns = TRUE)
participant visit_1 visit_2 visit_3
1 1 1 0 0
2 1 0 1 0
3 1 0 0 1
4 2 0 0 1
5 3 0 1 0
6 3 0 0 1
7 4 1 0 0
8 5 1 0 0
9 5 0 1 0
10 5 0 0 1
You may want to pipe in a summarise operation to make the table even cleaner, as in:
df %>% fastDummies::dummy_cols(select_columns = 'visit', remove_selected_columns = TRUE)%>%
group_by(participant)%>%
summarise(across(starts_with('visit'), max))
# A tibble: 5 x 4
participant visit_1 visit_2 visit_3
<dbl> <int> <int> <int>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
In a certain way, this looks a bit like a pivoting operation too.
You may be interested in using tidyr::pivot_wider here too.
EDIT: @MartinGal had just given a similar answer, so I removed my very similar pivot_wider version.

Subset data based on conditional statement

I would like to know if there is a way of combining ifelse statement and the filter function (in dplyr package) to subset a data frame. Consider the data
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
A=c(3,6,2,5,4,3,8,9,8),
D1=c(0,0,0,1,1,0,0,0,0),
D2=c(1,0,0,0,0,1,1,0,1))
For each id, I want to delete the rows that follow the first row where D2=1 or D1=D2=0. The expected output would look like
df<-data.frame(id=c(1,2,2,2,3),
A=c(3,5,4,3,9),
D1=c(0,1,1,0,0),
D2=c(1,0,0,1,0))
I have made several attempts using group_by and the filter function. It appears conditional statements are needed, but I'm finding it difficult to combine these with the filter function. I have come across several Q&As on subsetting data (e.g. How to subset data by filtering and grouping efficiently in R), but these do not answer my question. I greatly appreciate any help on this.
In dplyr, you can find the first index where the condition is met and, for each group, keep only the rows up to and including that index.
library(dplyr)
df %>%
group_by(id) %>%
filter(row_number() <= which(D1 == 0 & D2 == 0 | D2 == 1)[1])
# id A D1 D2
# <dbl> <dbl> <dbl> <dbl>
#1 1 3 0 1
#2 2 5 1 0
#3 2 4 1 0
#4 2 3 0 1
#5 3 9 0 0
The above works assuming that at least one row in each group satisfies the condition. For the general case, where some groups may have no row that satisfies the condition and we want to keep all rows of such groups, we can use:
df %>%
group_by(id) %>%
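slice({
inds <- which(D1 == 0 & D2 == 0 | D2 == 1)[1]
# keep rows 1..inds, or the whole group when no row matches;
# this also avoids dropping the last row when inds equals n()
seq_len(if (is.na(inds)) n() else inds)})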
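The same rule, keep each group's rows up to and including the first row that meets the condition and keep everything when no row does, can be sketched in base R with ave. The rows for id 4 are hypothetical and were added only to show a group without any match; the names hit, hits_before and result are illustrative.

```r
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4),   # id 4 is a made-up group with no match
  A  = c(3, 6, 2, 5, 4, 3, 8, 9, 8, 7, 1),
  D1 = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1),
  D2 = c(1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0)
)

# TRUE on rows that meet the stopping condition
hit <- df$D2 == 1 | (df$D1 == 0 & df$D2 == 0)

# Number of stopping rows seen strictly before each row, within id
hits_before <- ave(as.integer(hit), df$id, FUN = function(h) cumsum(h) - h)

# Keep a row only if no earlier row in its group met the condition
result <- df[hits_before == 0, ]
result
```

Because the test counts earlier matches rather than indexing from the first match, a group with no match needs no special case.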
It doesn't seem like you need to use dplyr here (unless I'm missing something). Try this:
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
A=c(3,6,2,5,4,3,8,9,8),
D1=c(0,0,0,1,1,0,0,0,0),
D2=c(1,0,0,0,0,1,1,0,1))
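keep <- logical(nrow(df))  # rows to retain
done <- c()                # ids whose stopping row has already been seen
for (i in seq_len(nrow(df))) {
  id <- df$id[i]
  if (!(id %in% done)) {
    keep[i] <- TRUE        # still at or before the first stopping row
    if (df$D2[i] == 1 | (df$D1[i] == 0 & df$D2[i] == 0)) {
      done <- c(done, id)  # drop every later row of this id
    }
  }
}
df <- df[keep, ]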
Pure dplyr:
df %>%
group_by(id) %>%
filter(!lag(cumany(D2 == 1 | (D1 == 0 & D2 == 0)), default = FALSE))
# # A tibble: 5 x 4
# # Groups: id [3]
# id A D1 D2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 0 1
# 2 2 5 1 0
# 3 2 4 1 0
# 4 2 3 0 1
# 5 3 9 0 0

How to generate a dummy treatment variable based on values from two different variables

I would like to generate a dummy treatment variable "treatment" based on country variable "iso" and earthquakes dummy variable "quake" (for dataset "data").
I would basically like to get a dummy variable "treatment" such that, if quake==1 at least once in my entire timeframe (let's say 2000-2018) for a given "iso", all observations for that "iso" have "treatment"==1, and all other countries have "treatment"==0. So countries that are affected by earthquakes have all observations 1, others 0.
I have tried using dplyr but since I'm still very green at R, it has taken me multiple tries and I haven't found a solution yet. I've looked on this website and google.
I suspect the solution should be something along these lines, but I can't finish it myself:
data %>%
filter(quake==1) %>%
group_by(iso) %>%
mutate(treatment)
Welcome to Stack Overflow! You should really consider Sotos's links for your next questions on SO :)
Here is a dplyr solution (following what you started) :
## data
set.seed(123)
data <- data.frame(year = rep(2000:2002, each = 26),
iso = rep(LETTERS, times = 3),
quake = sample(0:1, 26*3, replace = TRUE))
## solution (dplyr option)
library(dplyr)
data2 <- data %>% arrange(iso) %>%
group_by(iso) %>%
mutate(treatment = if_else(sum(quake) == 0, 0, 1))
data2
# A tibble: 78 x 4
# Groups: iso [26]
year iso quake treatment
<int> <fct> <int> <dbl>
1 2000 A 0 1
2 2001 A 1 1
3 2002 A 1 1
4 2000 B 1 1
5 2001 B 1 1
6 2002 B 0 1
7 2000 C 0 1
8 2001 C 0 1
9 2002 C 1 1
10 2000 D 1 1
# ... with 68 more rows
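For comparison, here is a base R sketch of the same idea on a small made-up panel (three countries, three years, values chosen by hand). Since quake is coded 0/1, the maximum within each iso is 1 exactly when that country had at least one earthquake, and ave repeats that value on every row of the country.

```r
# Hypothetical panel: only country A has an earthquake (in 2001)
data <- data.frame(
  year  = rep(2000:2002, each = 3),
  iso   = rep(c("A", "B", "C"), times = 3),
  quake = c(0, 0, 0,
            1, 0, 0,
            0, 0, 0)
)

# Group-wise maximum of a 0/1 variable = "ever affected" indicator
data$treatment <- ave(data$quake, data$iso, FUN = max)
data
```

This assumes quake has no missing values; with NAs present, max would return NA for the affected country unless na.rm is handled explicitly.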

Create group variable based on common dates

I have a large data set containing animal ID's and dates. There are two groups within this dataset but there is no grouping variable, so I have to extrapolate who belongs to which group based on the dates they appear to have in common.
Dummy data.
mydf<-data.frame(
Date=sort(rep(seq(as.Date("2012/1/1"),as.Date("2012/1/4"), length.out = 4),5)),
ID = c(1,2,3,4,5,5,6,7,8,9,1,2,3,4,5,6,7,8,9,10))
The other issue I have is that every now and then an ID belonging to group 1 might appear with a date associated with group 2, which is what has thrown off every attempt I've made so far at grouping.
What I need is a output with ID's and a new Group ID like this
ID Group
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 2
10 2
1:5 all appear together on the 1st and the 3rd so they are likely to be one group.
6:10 appear on the 2nd and 4th and are likely to be the 2nd group.
ID 5 belongs to group 1, because even though it was observed once on the 2nd with ID's 6:9, it was observed twice, on the 1st and 3rd, with ID's 1:4, so it's most likely to belong to group 1.
All my attempts have fallen flat. Can anyone offer a solution to this?
Thanks in advance.
EDIT:
I thought we had nailed a solution using Jon's kmeans solution (in the comments below):
mydf_wide <- mydf %>%
select(ID, Date) %>%
distinct(ID, Date) %>%
mutate(x = 1) %>%
spread(Date, x, fill = 0)
mydf_wide$clusters <- mydf_wide %>%
kmeans(centers = 2) %>%
pluck("cluster")
but I'm actually finding the kmeans method not quite getting it right every time. See below:
The groups where certain tags (ID) appear on the same day as each other are fairly easy to spot by eye. There are two groups, one is in the center, and the other group appears on either side. The clustering should be vertical by common dates as in Jon's answer below, but it is clustering across the entire date range. (Apologies for the messy axis labels)
The k-means method has worked on other groups, but it's not consistently able to group by common dates. I think the clustering approach is sensible, but I was wondering if there may be other clustering methods that may cope better than kmeans?
Alternatively, could a filtering method help reduce any background noise and make the kmeans approach more reliable?
Again, very grateful for any and all advice.
Cheers.
My thinking here is that you just assign each Date to a group, then take the average of group for each ID. You could then round to the nearest whole number from there. In this case, the average group of ID == 5 would be 1.33, which rounds to 1.
library(dplyr)
mydf %>%
mutate(group = case_when(
Date %in% as.Date(c("2012-01-01", "2012-01-03")) ~ 1,
Date %in% as.Date(c("2012-01-02", "2012-01-04")) ~ 2,
TRUE ~ NA_real_
)) %>%
group_by(ID) %>%
summarise(likely_group = mean(group) %>% round)
Which gives you the following:
# A tibble: 10 x 2
ID likely_group
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
This works as long as there isn't an even split between groups for a single ID; the information provided gives no way to resolve such a tie.
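To make ties visible instead of leaving them to rounding, one option is to count the observations per ID and group and compare the counts directly. The sketch below uses base R and the same assumed date-to-group mapping (1st and 3rd for group 1, 2nd and 4th for group 2); the names votes and likely_group are illustrative. A tied ID gets NA rather than an arbitrary group.

```r
mydf <- data.frame(
  Date = sort(rep(seq(as.Date("2012/1/1"), as.Date("2012/1/4"), length.out = 4), 5)),
  ID   = c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Assumed mapping taken from the question: 1st and 3rd -> group 1, otherwise group 2
group <- ifelse(mydf$Date %in% as.Date(c("2012-01-01", "2012-01-03")), 1, 2)

# Rows are IDs, columns are groups, cells are observation counts
votes <- table(ID = mydf$ID, group = group)

# Majority vote per ID, with NA when both groups have the same count
likely_group <- ifelse(votes[, "1"] > votes[, "2"], 1,
                       ifelse(votes[, "1"] < votes[, "2"], 2, NA))
likely_group
```

For ID 5 the counts are 2 against 1, so it is assigned to group 1, in line with the rounded mean.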
As a general solution, you might consider using k-means as an automatic way to split the data into groups based on similarity to other IDs.
First, I converted the data into wide format so that each ID gets one row. Then I fed that into the base kmeans function to get the clustering output as a list, and used purrr::pluck to extract just the assignment part of that list.
library(tidyverse)
mydf_wide <- mydf %>%
mutate(x = 1) %>%
spread(Date, x, fill = 0)
mydf_wide
# ID 2012-01-01 2012-01-02 2012-01-03 2012-01-04
#1 1 1 0 1 0
#2 2 1 0 1 0
#3 3 1 0 1 0
#4 4 1 0 1 0
#5 5 1 1 1 0
#6 6 0 1 0 1
#7 7 0 1 0 1
#8 8 0 1 0 1
#9 9 0 1 0 1
#10 10 0 0 0 1
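clusters <- mydf_wide %>%
select(-ID) %>% # cluster on the date columns only; ID is a label, not a feature
kmeans(centers = 2) %>%
pluck("cluster")
clusters
# [1] 2 2 2 2 2 1 1 1 1 1
# (cluster labels are arbitrary and can differ between runs)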
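Two details can make k-means look unreliable on this kind of data. If a numeric ID column is left in the table passed to kmeans(), it is treated as a feature and can dominate the distances. A single random start can also settle on a poor solution. The following base R sketch builds an ID-by-date presence matrix without the ID as a feature, fixes the seed and uses several random starts. The seed value and nstart = 25 are arbitrary choices, not recommendations from the original answer.

```r
mydf <- data.frame(
  Date = sort(rep(seq(as.Date("2012/1/1"), as.Date("2012/1/4"), length.out = 4), 5)),
  ID   = c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# 0/1 matrix: one row per ID, one column per date; IDs live in the row names only
presence <- 1 * (unclass(table(mydf$ID, mydf$Date)) > 0)

set.seed(1)                                         # arbitrary seed for reproducibility
fit <- kmeans(presence, centers = 2, nstart = 25)   # keep the best of 25 random starts

# List the IDs that fall into each cluster
split(rownames(presence), fit$cluster)
```

The cluster numbers themselves carry no meaning; only which IDs end up together does.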
Here's what that looks like if you add those to the original data and plot.
mydf_wide %>%
mutate(cluster = clusters) %>%
# ggplot works better with long (tidy) data...
gather(date, val, -ID, -cluster) %>%
filter(val != 0) %>%
arrange(cluster) %>%
ggplot(aes(date, ID, color = as.factor(cluster))) +
geom_point(size = 5) +
scale_y_continuous(breaks = 1:10, minor_breaks = NULL) +
scale_color_discrete(name = "cluster")
