Change from baseline for repeated ids with missing baseline points - r

A similar question has been asked and answered below:
Change from baseline for repeated ids
My question differs from the original question in that I have missing baseline values. I am including a small reproducible example below:
df1 <- data.frame( probeID = c( rep("A", 19), rep("B",19), rep("C",19)),
Subject_ID = c( rep( c( rep(1,5), rep(2,4), rep(3,5), rep(4,5)),3)),
time = c(rep( c( c(1:5), c(2:5), rep( 1:5,2)),3)))
df1$measure <- df1$Subject_ID*c( 1:nrow(df1))
df2 <- subset( df1, Subject_ID != 2)
df2 %>%
group_by(probeID, Subject_ID) %>%
mutate(change = measure - measure[time==1])
However, when I replace df2 with df1 in the pipe above, it fails because the time = 1 data point is missing for Subject_ID = 2. My desired output in the df1 case should be identical to the output from df2. I would appreciate any help.
Thanks
JJ

I had some trouble figuring out what your question was asking for. Does this work?
df1 %>%
group_by(probeID, Subject_ID) %>%
mutate(change = measure - first(measure))
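Note that first(measure) takes the first row of each group, so for Subject_ID 2 it would treat the time = 2 value as the baseline. If the change should instead be missing whenever there is no time = 1 row, a minimal sketch (assuming at most one baseline row per probeID and Subject_ID; the name res is only illustrative) could look like this:

```r
library(dplyr)

# Same example data as in the question
df1 <- data.frame(
  probeID = rep(c("A", "B", "C"), each = 19),
  Subject_ID = rep(c(rep(1, 5), rep(2, 4), rep(3, 5), rep(4, 5)), 3),
  time = rep(c(1:5, 2:5, rep(1:5, 2)), 3)
)
df1$measure <- df1$Subject_ID * seq_len(nrow(df1))

res <- df1 %>%
  group_by(probeID, Subject_ID) %>%
  # measure[time == 1] is empty when the baseline row is missing;
  # indexing with [1] turns the empty vector into a single NA
  mutate(change = measure - measure[time == 1][1]) %>%
  ungroup()
```

Subjects without a baseline then get NA in change, and filter(res, !is.na(change)) would drop them if the result should contain the same rows as the df2 case.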

Related

Using dplyr to remove duplicates conditionally

I have a dataset in longformat that contains both visit and measure dates for each ID. What I want is to remove the duplicate visit dates for each ID conditionally, namely:
IF visit date - measure date does not equal 0, then I want to include the first visit date.
IF visit date - measure date is a draw, however, then I want to include the latest visit date.
I already wrote part of the code using dplyr. However, I cannot seem to figure out how to code the second part of the condition.
Any help would be very much appreciated.
library(dplyr)
df <- data.frame(ID = c(1, 1),
VISIT = c(as.Date("2020-01-01"), as.Date("2020-01-01")),
MEASURE = c(as.Date("2020-01-01"), as.Date("2020-01-01")),
VALUE = c(5, 10))
df2 <- df %>%
mutate(DIFF = abs(VISIT - MEASURE)) %>%
arrange(DIFF) %>%
group_by(ID, VISIT) %>%
# If DIFF dates is != 0, I want the first value
# If DIFF dates is a draw, I want the latest value
slice(1) %>%
ungroup()
I am not sure what exactly you are trying to achieve, but maybe this helps. I adjusted the example data frame a bit; you may need to edit the one in your question so that it makes sense, because in your example data DIFF is always 0.
library(dplyr)
df <- data.frame(
ID = c(1, 1, 2, 2),
VISIT = c(
as.Date("2020-01-01"),
as.Date("2020-01-01"),
as.Date("2020-01-01"),
as.Date("2020-01-02")
),
MEASURE = c(
as.Date("2020-01-01"),
as.Date("2020-01-01"),
as.Date("2020-01-01"),
as.Date("2020-01-03")
),
VALUE = c(5, 10, 15, 20)
)
df2 <- df %>%
group_by(ID) %>%
mutate(
# difference in days as a plain number
DIFF = as.numeric(abs(VISIT - MEASURE), units = "days"),
# your if conditions
DIFF_filter = case_when(
DIFF != 0 ~ min(VISIT),
DIFF == 0 ~ max(VISIT)
)
)

Using a loop to create columns based on two data frames

I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
library(dplyr)
library(tidyr) # for replace_na()
patid <- seq(1,10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1,10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2 which basically generates a binary (1/0) column for each comorbidity based on whether the earliest "eventdate" (diagnosis) came before "date of session" OR "date of referral" (if "date of session" is NA) for each patient.
df_comorb <- df2 %>%
filter(comorbidity == "hypertension") %>%
group_by(patid1) %>%
filter(eventdate == min(eventdate)) %>%
ungroup() %>%
# keep only the join key (renamed to match df1) and the earliest event date
select(patid = patid1, eventdate)
df1 <- left_join(df1, df_comorb, by = "patid")
df1 <- df1 %>%
mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
replace_na(list(hypertension_baseline = 0)) %>%
select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity but I don't know how to approach writing one for this problem.
Any help would be appreciated.
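One possible way to approach this is sketched below. It is only a sketch and has not been tested on the full dataset. It computes the earliest event date per patient and comorbidity once, then loops over the comorbidity names and adds one 0/1 column per name. Several parts are assumptions: the seed, the names days, first_events and result, the column naming scheme (name followed by _baseline), and the reading of the rule as "compare against date_of_session, or against date_of_referral when date_of_session is NA", which is implemented with coalesce().

```r
library(dplyr)

set.seed(42)  # hypothetical seed, only so the sketch is reproducible
days <- seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day")
df1 <- data.frame(patid = 1:10,
                  date_of_session = sample(days, 10),
                  date_of_referral = sample(days, 10))
df2 <- data.frame(patid1 = sample(1:10, 50, replace = TRUE),
                  eventdate = sample(days, 50),
                  comorbidity = sample(c("hypertension", "stroke", "AF"),
                                       50, replace = TRUE))

# Earliest event date for every patient / comorbidity combination
first_events <- df2 %>%
  group_by(patid1, comorbidity) %>%
  summarise(eventdate = min(eventdate), .groups = "drop")

result <- df1
for (cm in unique(df2$comorbidity)) {
  first_cm <- first_events %>%
    filter(comorbidity == cm) %>%
    select(patid = patid1, eventdate)
  # left_join keeps the row order of df1, so the flag lines up with result
  joined <- left_join(df1, first_cm, by = "patid")
  # Assumed rule: use the session date, or the referral date if it is NA
  ref_date <- coalesce(joined$date_of_session, joined$date_of_referral)
  # %in% TRUE maps NA (no event recorded) to FALSE, i.e. 0
  result[[paste0(cm, "_baseline")]] <- as.integer((joined$eventdate < ref_date) %in% TRUE)
}
```

Because the loop runs over unique(df2$comorbidity), the same code should cover all 27 conditions without listing them by hand.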

Merge data based upon id and date in R

I want to merge several fields from 2 dataframes into a new dataframe. The merge is based on ID and Date, and the Date must equal or fall between the start and end dates in the second dataframe.
The following answer to a similar question almost works for me; however, if the Date in the first dataframe equals the start date in the second dataframe, I get NA instead of the matching colour. Any help on ways to include the colour when the Date falls on the start date would be very much appreciated.
library(tidyverse)
library(lubridate)
df1 <- data.frame(ID=c(1, 2, 2, 3),
actual.date=mdy('3/31/2017', '2/11/2016','4/10/2016','5/15/2015'))
df2 <- data.frame(ID = c(1, 1, 1, 2, 3),
start = mdy('1/1/2000', '4/1/2011', '3/31/2017', '2/11/2016', '1/12/2012'),
end = mdy('3/31/2011', '6/4/2012', '04/04/2017', '3/31/2017', '2/12/2014'),
colour = c("blue", "purple", "blue", "red", "purple"))
df <- full_join(df1, df2, by = "ID") %>%
mutate(test = ifelse(actual.date <= end & actual.date > start,
TRUE,
FALSE)) %>%
filter(test) %>%
left_join(df1, ., by = c("ID", "actual.date")) %>%
select(ID, actual.date, colour)
If you could show us a dataframe of the output you're looking for, that would be useful, but I think this may achieve what you're trying to do. I don't think you need to join twice as in the code above. Rows whose Date falls on the start date are dropped by the filter() because the comparison uses > rather than >=, so when you join back onto df1 they have no match and show up as NA. Using >= for the start date keeps them:
full_join(df1, df2, by = "ID") %>%
filter(actual.date <= end & actual.date >= start) %>%
select(ID, actual.date, colour)
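If rows of df1 that fall in no interval should still appear in the result with NA as the colour, the second join can be kept. The following is a minimal sketch of that variant, assuming the same example data (dates written with as.Date() so that lubridate is not needed; the names matched and out are only illustrative):

```r
library(dplyr)

df1 <- data.frame(
  ID = c(1, 2, 2, 3),
  actual.date = as.Date(c("2017-03-31", "2016-02-11", "2016-04-10", "2015-05-15"))
)
df2 <- data.frame(
  ID = c(1, 1, 1, 2, 3),
  start = as.Date(c("2000-01-01", "2011-04-01", "2017-03-31", "2016-02-11", "2012-01-12")),
  end = as.Date(c("2011-03-31", "2012-06-04", "2017-04-04", "2017-03-31", "2014-02-12")),
  colour = c("blue", "purple", "blue", "red", "purple")
)

# Keep only the ID/date combinations that fall inside an interval,
# with both ends inclusive. Recent dplyr versions may warn that this
# join is many-to-many.
matched <- inner_join(df1, df2, by = "ID") %>%
  filter(actual.date >= start, actual.date <= end) %>%
  select(ID, actual.date, colour)

# Join back so that every row of df1 is kept; unmatched rows get NA
out <- left_join(df1, matched, by = c("ID", "actual.date"))
```

With this data, the row for ID 3 has no matching interval and keeps an NA colour instead of being dropped.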

Time series function in dplyr

I am working with data that stops in a specific year and is NA afterwards, and I need to calculate a lot of variables based on lagged values of other variables. I would like to find a way to calculate a whole series at once instead of one year at a time when one of the variables is NA. I was looking at dplyr given that I am working with panel data and thus need to group it by ID.
I provide the example below:
set.seed(1)
df <- data.frame( year = c(seq(2000, 2018), seq(2000, 2018)) , id = c(rep(1, 19),rep(2, 19)), varA = floor(rnorm(38)*100), varB= floor(rnorm(38)*100), varC= floor(rnorm(38)*100))
df <- df %>% mutate(varA = if_else(year>2010, as.double(NA) , varA) ,
varB = if_else(year>2010, as.double(NA) , varB),
varC = if_else(year>2010, as.double(NA) , varC)) %>% group_by(id) %>% arrange(year)
What I would like is to find a way to calculate a variable that is equal to variable C when it is available, but afterwards is equal to a formula based on lagged values of variables C, B and A. When executing the code below, varRESULT and varD are only calculated for one year, given that the lags are only available for one year:
df <- df %>% mutate( varD = lag(varA)*lag(varB),
varRESULT = if_else(is.na(varC), lag(varC, 1)/lag(varD, 2)*lag(varD, 1), varC))
But I would like to find a way to calculate the whole series immediately (taking into account the panel dimension of the data) instead of having to repeat the code 7 times. Preferably a solution where varD can be calculated separately from varRESULT, given that in the final application I have multiple variables that are linked to each other.
Proposed solution:
Starting with the first NA, the "recursive" lags of vars varA, varB, and varC are equal to the last value of these variables.
Thus, starting from these initial variables, we can create new variables: varA1, varB1, and varC1 where we fill the NAs with the last value, by id:
library(dplyr)
library(tidyr) # for the function `fill`
df <- df %>%
mutate(varA1 = varA, varB1 = varB, varC1 = varC) %>%
group_by(id) %>%
arrange(year) %>%
fill(varA1, varB1, varC1) # fills with last value
Then, we apply the formula:
df <- df %>%
mutate( varD = lag(varA1)*lag(varB1),
varRESULT = if_else(is.na(varC), lag(varC1, 1)/lag(varD, 2)*lag(varD, 1), varC)) %>%
select(-varA1, -varB1, -varC1)

How to restrict full_join() duplicates? - R

I am a novice R programmer. Below is the dataframe I am using.
I am currently running into a filtering problem with the full_join() from tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
trial = rep(0:1, each = 8),
sex = rep(c('M','F'), 4),
participant = rep(1:4, 4),
x = runif(16, 1, 10),
y = runif(16, 1, 10))
df
I am currently doing the following operation to do the full_join()
df <- df %>% mutate(k = 1)
df <- df %>%
full_join(df, by = "k")
I am restricting the results to obtain the combination of points for the same participant between the trials
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on the columns you are considering, use the duplicated() function. The first line below weeds out duplicates based on the first 5 columns; the second weeds out duplicates based on columns 7 to 11.
df3[!duplicated(df3[,1:5]),]
df3[!duplicated(df3[,7:11]),]
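If the aim is to keep each unordered pair of points exactly once, another option is to fix the order of the pair inside the filter instead of removing duplicates afterwards. The sketch below assumes the same example data and replaces trial.x != trial.y with trial.x < trial.y, so that the trial 0 point is always on the .x side (the names pairs and pairs_p1 are only illustrative):

```r
library(dplyr)

set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c("M", "F"), 8),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10)
) %>%
  mutate(k = 1)

# Cross join as in the question; recent dplyr versions may warn that the
# join is many-to-many, which is intended here
pairs <- full_join(df, df, by = "k") %>%
  filter(sex.x == sex.y,
         participant.x == participant.y,
         trial.x < trial.y)  # keeps each unordered pair once

pairs_p1 <- filter(pairs, participant.x == 1)
```

Each participant has two points per trial in this data, so pairs_p1 should contain four rows, one per combination of a trial 0 point with a trial 1 point.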
