Using a loop to create columns based on two data frames - r

I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1, 10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1, 10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2. It generates a binary (1/0) column for each comorbidity based on whether the earliest "eventdate" (diagnosis) came before "date_of_session", or before "date_of_referral" if "date_of_session" is NA, for each patient.
df_comorb <- df2 %>%
  filter(comorbidity == "hypertension") %>%
  group_by(patid1) %>%
  filter(eventdate == min(eventdate)) %>%
  ungroup()
df1 <- left_join(df1, df_comorb, by = c("patid" = "patid1"))
df1 <- df1 %>%
  mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
  replace_na(list(hypertension_baseline = 0)) %>%
  select(-eventdate)
I'd like to avoid repeating this code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat it for each comorbidity, but I don't know how to approach writing one for this problem.
Any help would be appreciated.
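One way to avoid the repetition is to loop over the comorbidity names and build each binary column dynamically. A minimal sketch, assuming the df1/df2 built above with dplyr and tidyr loaded; the column name is built with paste0() and assigned with the !! and := operators from rlang (re-exported by dplyr):
library(dplyr)
library(tidyr)
for (comorb in unique(df2$comorbidity)) {
  col_name <- paste0(comorb, "_baseline")
  # earliest diagnosis date per patient for this comorbidity
  earliest <- df2 %>%
    filter(comorbidity == comorb) %>%
    group_by(patid1) %>%
    summarise(eventdate = min(eventdate), .groups = "drop")
  df1 <- df1 %>%
    left_join(earliest, by = c("patid" = "patid1")) %>%
    mutate(!!col_name := ifelse(eventdate < date_of_session |
                                  eventdate < date_of_referral, 1, 0)) %>%
    replace_na(setNames(list(0), col_name)) %>%
    select(-eventdate)
}
Using summarise() instead of filter(eventdate == min(eventdate)) also guarantees one row per patient, so the join cannot duplicate rows when a patient has two diagnoses on the same earliest date.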


Fast way to create a date-based lookup in R

I have a pretty large data set with users and their membership start and end dates. For each membership period there is one entry.
I have another dataset coming from the support system, which has records of user IDs along with the dates of each system usage. This dataset is even larger, as there is one record per usage.
I need to aggregate the second dataset and combine it with the first one, based on each user and membership period.
I tried a function with a for loop, but for an extremely large dataset (here we are talking about a few million rows) this will take ages.
Edit: A plain join or merge will not work, because there are several date ranges (between start and end dates) for each ID in the first frame. Each range has been assigned a number (the period of membership). The second data frame has dates and IDs, and the problem is finding the membership period for each ID and date by comparing it to the date ranges in the first frame.
Here is the code, along with mock datasets and what I want to achieve at the end:
library(data.table)
library(dplyr)
#
ids <- c(rep("id1", 5), rep("id2", 5), rep("id3", 5))
#
stdates <- c("2015-08-01", "2016-08-01", "2017-08-01", "2018-08-01", "2019-08-01",
             "2013-05-07", "2014-05-07", "2015-05-07", "2016-05-07", "2017-05-07",
             "2011-02-13", "2013-02-13", "2015-02-13", "2016-02-13", "2017-02-13")
#
endates <- c("2016-07-31", "2017-07-31", "2018-07-31", "2019-07-31", "2020-07-31",
             "2014-05-06", "2015-05-06", "2016-05-06", "2017-05-06", "2018-05-06",
             "2013-02-12", "2015-02-12", "2016-02-12", "2017-02-12", "2018-02-12")
#
# First dataset:
df <- data.table(id = ids,
                 stdate = stdates,
                 endate = endates)
#
df <- df %>%
  arrange(id, desc(endate))
#
# Add the membership period number for each user:
setDT(df)
df[, counter := rowid(id)]
#
# Second dataset:
ids2 <- sample(df$id, 1000, replace = TRUE)
dates2 <- sample(seq(Sys.Date() - 7*365, Sys.Date() - 365, 1), 1000)
#
df2 <- data.table(id = ids2,
                  dateticket = dates2)
#
# Function
counterFunc <- function(d2, d1) {
  d2$groupCounter <- NA
  for (i in 1:nrow(d2)) {
    crdate <- d2$dateticket[i]
    idtemp <- d2$id[i]
    dtemp <- d1 %>%
      filter(id == idtemp) %>%
      data.table()
    # flag the period (if any) whose date range contains the ticket date
    dtemp[, drcode := ifelse(crdate >= stdate & crdate <= endate, 1, 0)]
    if (length(unique(dtemp$drcode)) == 2) {
      d2$groupCounter[i] <- dtemp[drcode == 1]$counter
    } else {
      d2$groupCounter[i] <- 0
    }
    print(i)
  }
  return(d2)
}
#
# The result I want to get without a for loop:
df2gc <- counterFunc(df2, df)
#
The operation you want to do is called "joining", so depending on the direction and completeness of the join there are some options.
Here is a simple example:
df1 <- data.frame("ID" = c("1", "2", "3", "1", "2"), "First_Name" = c("A", "B", "C", "D", "E"))
df2 <- data.frame("ID" = c("1", "2", "3"), "Last_Name" = c("Ko", "Lo", "To"))
left_join(df1, df2, by = "ID")
The result looks like this:
  ID First_Name Last_Name
1  1          A        Ko
2  2          B        Lo
3  3          C        To
4  1          D        Ko
5  2          E        Lo
left_join from the dplyr package simply looks up the relevant values in the look-up table (df2) and adds them to the original table (df1, the left table) based on a "key" (by = "ID" in this case).
There are other operations that specify the terms of the joining more strictly, but left_join should be helpful in your case.
EDIT:
I understand your problem better now. Please check if this solves it:
library(tidyverse)
df %>%
  mutate(stdate = as.Date(stdate), endate = as.Date(endate)) %>%
  left_join(df2, by = "id") %>%
  mutate(check = case_when(dateticket >= stdate & dateticket <= endate ~ "TRUE",
                           TRUE ~ "FALSE")) %>%
  filter(check == "TRUE")
Edit:
If the join runs into the "cannot allocate vector of size" error, please refer to this:
Left_join error cannot allocate vector of size
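On the speed concern itself: data.table's non-equi joins handle exactly this "find the range containing each date" lookup without a row-by-row loop. A minimal sketch, assuming the df and df2 built above; the output columns may need adjusting to taste:
library(data.table)
setDT(df)[, `:=`(stdate = as.Date(stdate), endate = as.Date(endate))]
setDT(df2)
# for each ticket, pick the membership period whose [stdate, endate]
# range contains the ticket date; unmatched tickets get NA, then 0
df2gc <- df[df2,
            .(id, dateticket = i.dateticket, groupCounter = counter),
            on = .(id, stdate <= dateticket, endate >= dateticket)]
df2gc[is.na(groupCounter), groupCounter := 0L]
This replaces the whole counterFunc loop with a single vectorised join.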

How to calculate correlation by group

I am trying to run an iterative for loop to calculate correlations for levels of a factor variable. I have 16 rows of data for each of the 32 teams in my data set. I want to correlate Year with Points_Game for each team individually. I can do this one by one, but I want to get better at looping.
correlate <- data %>%
  select(Team, Year, Points_Game) %>%
  filter(Team == "ARI") %>%
  select(Year, Points_Game)
cor(correlate)
I made an object "teams" by:
teams <- levels(data$Team)
A little help in using [i] to iterate over all 32 teams to get each team's correlation of year and points would be greatly appreciated!
require(dplyr)
# dummy data
data <- data.frame(
  Team = sapply(1:32, function(x) paste0("T", x)),
  Year = rep(c(2000:2009), 32),
  Points_Game = rnorm(320, 100, 10)
)
# find correlation of Year and Points_Game for each team
# r - correlation coefficient
correlate <- data %>%
  group_by(Team) %>%
  summarise(r = cor(Year, Points_Game))
The data.table way:
library(data.table)
# dummy data (same as #Aleksandr's)
dat <- data.table(
  Team = sapply(1:32, function(x) paste0("T", x)),
  Year = rep(c(2000:2009), 32),
  Points_Game = rnorm(320, 100, 10)
)
# find correlation of Year and Points_Game for each Team
result <- dat[ , .(r = cor(Year, Points_Game)), by = Team]
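And since the question asks specifically about iterating with [i], a base-R loop sketch over the teams (built here with factor() in case Team is a character column); it produces the same per-team correlations as the grouped versions above:
teams <- levels(factor(data$Team))
r <- numeric(length(teams))
for (i in seq_along(teams)) {
  team_data <- data[data$Team == teams[i], ]  # rows for team i
  r[i] <- cor(team_data$Year, team_data$Points_Game)
}
result <- data.frame(Team = teams, r = r)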

Keep rows within a group until a condition is met in R

I am trying to filter the data within a group until a condition is met (in this case, until Status is "Completed") and drop the rest of the rows within the group. I've managed to come up with this ranking solution, but I've run into a few issues with it when applying the code to my real data: the function would sometimes not keep the last row (the one with the max rank). Is there a more elegant solution to this?
Code I've used:
require(dplyr)
time <- seq(as.Date('2017/01/01'), as.Date('2017/01/15'), by = "day")
set.seed(42); status <- sample(c("Completed", "On hold", "Active"), 15, replace = TRUE)
ID <- c(rep(1, 5), rep(2, 5), rep(3, 5))
DF <- data.frame(Time = time,
                 Status = status,
                 ID = ID)
DF <- DF %>% group_by(ID) %>% mutate(ID_Rank = row_number())
DF$ID_Rank[DF$Status == "Completed"] <- max(DF$ID_Rank) + 1
DF2 <- DF %>% group_by(ID) %>% filter(row_number() <= which.max(ID_Rank))
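A sketch of a more direct approach, assuming the goal is to keep everything up to and including the first "Completed" row in each group: a lagged cumulative sum avoids the ranking bookkeeping entirely.
library(dplyr)
DF2 <- DF %>%
  group_by(ID) %>%
  # lag() shifts the "Completed" flag down one row, so cumsum() stays 0
  # through the first "Completed" row and turns positive only after it
  filter(cumsum(lag(Status == "Completed", default = FALSE)) == 0) %>%
  ungroup()
If a group never reaches "Completed", this keeps all of its rows; add another condition if such groups should be dropped instead.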

How to restrict full_join() duplicates? - R

I am a novice R programmer. Below is the dataframe I am using.
I am currently running into a filtering problem with full_join() from the tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c('M', 'F'), 4),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10))
df
I am currently doing the following to perform the full_join():
df <- df %>% mutate(k = 1)
df <- df %>%
  full_join(df, by = "k")
I am restricting the results to obtain the combination of points for the same participant between the trials
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on the columns you are considering, use the duplicated() function. The first line below weeds out duplicates based on the first 5 columns; the second weeds them out based on columns 7 through 11:
df3[!duplicated(df3[,1:5]),]
df3[!duplicated(df3[,7:11]),]
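Another common trick for unordered pairs, as a sketch (df2_pairs is just an illustrative name): rather than deduplicating afterwards, keep only one orientation of each pair by ordering the trial columns in the filter.
# trial.x < trial.y keeps exactly one of the two mirrored rows per pair
df2_pairs <- filter(df, sex.x == sex.y,
                    participant.x == participant.y,
                    trial.x < trial.y)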

Why do I get different results using SE or NSE dplyr functions?

Hi, I get different results from dplyr functions when I use standard evaluation through the lazyeval package.
Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1 and id2 and subset the rows with the max(datetime) for each group.
library(dplyr)
# random datetime generation function by Dirk Eddelbuettel
# http://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/13") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "sec"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}
set.seed(42)
# Creating 230000 ids couples
ids <- data_frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"),
                  id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Repeating randomly the ids[1:2000, ] to create groups
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
datas <- mutate(ids, datetime = rand.datetime(25e4))
When I use the NSE way, I get 230000 rows:
df1 <- datas %>%
  group_by(id1, id2) %>%
  filter(datetime == max(datetime))
nrow(df1) #230000
But when I use the SE way, I get only 229977 rows:
ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"
df2 <- datas %>%
  group_by_(ids) %>%
  filter_(.dots = lazyeval::interp(~var == fun(var),
                                   var = as.name(filterVar),
                                   fun = as.name(filterFun)))
nrow(df2) #229977
My two pieces of code are equivalent, right?
Why do I get different results? Thanks.
You'll need to specify the .dots argument in group_by_ when giving a vector of column names.
df2 <- datas %>%
  group_by_(.dots = ids) %>%
  filter_(.dots = lazyeval::interp(~var == fun(var),
                                   var = as.name(filterVar),
                                   fun = as.name(filterFun)))
nrow(df2)
[1] 230000
It looks like group_by_ might take the first column name from the vector as the only grouping variable when you don't specify the .dots argument. You can check this by grouping on id1 only.
df1 <- datas %>%
  group_by(id1) %>%
  filter(datetime == max(datetime))
nrow(df1)
[1] 229977
(If you group just on id2 the number of rows is 229976).
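For readers on current dplyr: the group_by_/filter_ verbs were later superseded, and the same programmatic grouping can be written without lazyeval. A sketch, assuming the ids, filterVar and filterFun variables from the question:
df2 <- datas %>%
  group_by(across(all_of(ids))) %>%
  filter(.data[[filterVar]] == match.fun(filterFun)(.data[[filterVar]]))
nrow(df2)  # 230000, matching the corrected group_by_ call above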
