"new drug user" design R - r

I want to establish a cohort of new users of drugs (Ray 2003). My original dataset is huge (approx. 19 million rows), so a loop is proving inefficient. Here is a dummy dataset (done with fruits instead of drugs):
df2
names dates age sex fruit
1 tom 2010-02-01 60 m apple
2 mary 2010-05-01 55 f orange
3 tom 2010-03-01 60 m banana
4 john 2010-07-01 57 m kiwi
5 mary 2010-07-01 55 f apple
6 tom 2010-06-01 60 m apple
7 john 2010-09-01 57 m apple
8 mary 2010-07-01 55 f orange
9 john 2010-11-01 57 m banana
10 mary 2010-09-01 55 f apple
11 tom 2010-08-01 60 m kiwi
12 mary 2010-11-01 55 f apple
13 john 2010-12-01 57 m orange
14 john 2011-01-01 57 m apple
I have identified people who were prescribed an apple between 04-2010 and 10-2010:
temp2
names dates age sex fruit
6 tom 2010-06-01 60 m apple
5 mary 2010-07-01 55 f apple
7 john 2010-09-01 57 m apple
I would like to make a new column in the original DF called "index", which is the first date that a person was prescribed a drug in the defined date range. This is what I have tried to get the dates from temp2 into df2$index:
df2$index<-temp2$dates
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)
I'm not doing this right - as none of these work. This is the desired output.
df2
names dates age sex fruit index
1 tom 2010-02-01 60 m apple <NA>
2 mary 2010-05-01 55 f orange <NA>
3 tom 2010-03-01 60 m banana <NA>
4 john 2010-07-01 57 m kiwi <NA>
5 mary 2010-07-01 55 f apple 2010-07-01
6 tom 2010-06-01 60 m apple 2010-06-01
7 john 2010-09-01 57 m apple 2010-09-01
8 mary 2010-07-01 55 f orange <NA>
9 john 2010-11-01 57 m banana <NA>
10 mary 2010-09-01 55 f apple <NA>
11 tom 2010-08-01 60 m kiwi <NA>
12 mary 2010-11-01 55 f apple <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple <NA>
Once I have the desired output, I want to trace back from the index date to see whether any person had an apple in the previous 180 days. If they did not have an apple, I want to keep them. If they did have an apple (e.g., tom), I want to discard them. This is the code I have tried on the desired output:
df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me
I would appreciate any guidance at all on these questions, even a pointer to what I should read to learn how to do this. Perhaps my logic is flawed and my method won't work; please tell me if that's the case! Thank you in advance.
Here is my df:
names<-c("tom", "mary", "tom", "john", "mary",
"tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01",
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
"2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01",
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
"apple", "apple", "apple", "orange", "banana", "apple",
"kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
"f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2
Here is temp2:
data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates< "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ]
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL
SOLUTION
df2 <- df2[with(df2, order(names, dates)), ]
df2$first.date <- ave(df2$dates, df2$names, df2$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1]) ##DWin code for assigning index date for each fruit in the pre-period
df2$x<-df2$fruit=='apple' & df2$dates>df2$first.date-180 & df2$dates<df2$first.date ##TRUE marks a prior apple within 180 days before the index date (e.g. tom), i.e. not a new user
ids <- with(df2, unique(names[which(x)])) ##finding the ids which have at least one value of TRUE
new_users<-subset(df2, !names %in% ids) ##gets rid of ids that have at least one value of TRUE

First order by name and date:
df <- df[with(df, order(names, dates)), ]
Then just pick the first date within each name:
df$first.date <- ave(df$dates, df$names, FUN=function(d) d[1])
Now you will see "the power of the fully operational Death Star", er, the ave function. You are ready to pick out the first date within individual 'names' and 'fruits' within that date range:
> df$first.date <- ave(df$dates, df$names, df$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
names dates age sex fruit first.date
4 john 2010-07-01 57 m kiwi 2010-07-01
7 john 2010-09-01 57 m apple 2010-09-01
9 john 2010-11-01 57 m banana <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple 2010-09-01
2 mary 2010-05-01 55 f orange 2010-05-01
5 mary 2010-07-01 55 f apple 2010-07-01
8 mary 2010-07-01 55 f orange 2010-05-01
10 mary 2010-09-01 55 f apple 2010-07-01
12 mary 2010-11-01 55 f apple 2010-07-01
1 tom 2010-02-01 60 m apple 2010-06-01
3 tom 2010-03-01 60 m banana <NA>
6 tom 2010-06-01 60 m apple 2010-06-01
11 tom 2010-08-01 60 m kiwi 2010-08-01
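The output above fills first.date on every row of a (names, fruit) group, whereas the question's desired output has an index date only on the row of the first in-window apple. A minimal base R sketch of that variant follows; the window bounds and the helper names (in_win, d_num, first_num, is_index) are assumptions for illustration, not part of the answer above.

```r
# Rebuild the relevant columns of the example data from the question.
df2 <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Assumed window bounds (the answer above uses the same ones).
win_start <- as.Date("2010-04-01")
win_end   <- as.Date("2010-10-31")

# Rows that are apple prescriptions inside the window.
in_win <- df2$fruit == "apple" & df2$dates >= win_start & df2$dates <= win_end

# Per person, the earliest in-window apple date (as a number);
# rows outside the window are set to Inf so they never win the min().
d_num     <- ifelse(in_win, as.numeric(df2$dates), Inf)
first_num <- ave(d_num, df2$names, FUN = min)

# The index row is the in-window apple row that carries that earliest date.
is_index <- in_win & as.numeric(df2$dates) == first_num

df2$index <- as.Date(NA)
df2$index[is_index] <- df2$dates[is_index]
df2
```

If a person had two apple rows on the same earliest date, both rows would receive the index date; whether that is acceptable depends on the real data.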

Since you have 19 million rows, I think you should try a data.table solution. Here is my attempt. The result is slightly different from @DWin's result, since I filter the data between (begin, end) and then create a new index variable, which is the minimum date occurring in this range for each (names, fruit) pair.
library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates,as.Date("2010-04-01"),as.Date("2010-10-31")),
index := as.character(min(dates))
, by=c('names','fruit')]
## names dates age sex fruit index
## 1: john 2010-07-01 57 m kiwi 2010-07-01
## 2: john 2010-09-01 57 m apple 2010-09-01
## 3: john 2010-11-01 57 m banana NA
## 4: john 2010-12-01 57 m orange NA
## 5: john 2011-01-01 57 m apple NA
## 6: mary 2010-05-01 55 f orange 2010-05-01
## 7: mary 2010-07-01 55 f apple 2010-07-01
## 8: mary 2010-07-01 55 f orange 2010-05-01
## 9: mary 2010-09-01 55 f apple 2010-07-01
## 10: mary 2010-11-01 55 f apple NA
## 11: tom 2010-02-01 60 m apple NA
## 12: tom 2010-03-01 60 m banana NA
## 13: tom 2010-06-01 60 m apple 2010-06-01
## 14: tom 2010-08-01 60 m kiwi 2010-08-01
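The second half of the question (dropping anyone with an apple in the 180 days before their index date) could be sketched in data.table along the following lines. This is only a sketch under stated assumptions: the object names (idx, apples, prevalent, new_users), the washout variable, and the choice to treat the 180-day boundary as inclusive are all illustrative and not taken from the answer above.

```r
library(data.table)

DT <- data.table(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple")
)

win_start <- as.Date("2010-04-01")  # assumed window bounds
win_end   <- as.Date("2010-10-31")
washout   <- 180L                   # look-back period in days

# Index date: earliest in-window apple per person.
idx <- DT[fruit == "apple" & dates >= win_start & dates <= win_end,
          .(index = min(dates)), by = "names"]

# Attach each person's index date to all of their apple rows.
apples <- merge(DT[fruit == "apple"], idx, by = "names")

# Prevalent users: any apple in the washout period before the index date.
prevalent <- apples[dates < index & dates >= index - washout, unique(names)]

# New users: people with an index date and no apple in the washout period.
new_users <- idx[!names %in% prevalent]
new_users
```

On the example data this flags tom, whose apple on 2010-02-01 falls within 180 days before his index date of 2010-06-01, and keeps mary and john.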

Related

Join dataframes in dplyr by characters

So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
DF1 %>%
inner_join(DF2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But due to the semi-colons I'm not capturing those data points on the join, the end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Footlocker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Footlocker NJ
Using separate_rows from tidyr in combination with dplyr will get you there.
First table I called fruit, the other stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
With base R, we can use strsplit with merge
lst1 <- strsplit(DF2$ID, ";")
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)),
lengths(lst1)), 1:2], ID = unlist(lst1)))
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ
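The key step in both answers is expanding each semicolon-separated ID into one row per ID before joining. A small base R sketch of that expansion on its own is shown here; the intermediate names (ids, row_idx, long) are hypothetical, and the data frame is rebuilt from the question with ID stored as character.

```r
DF2 <- data.frame(
  Store = c("Walmart", "Sears", "Target", "Old Navy", "Popeye's", "Footlocker"),
  State = c("NY", "AL", "DC", "PA", "HA", "NJ"),
  ID    = c("1", "1;2", "3", "3", "5", "4;5"),
  stringsAsFactors = FALSE
)

# One character vector of IDs per original row.
ids <- strsplit(DF2$ID, ";", fixed = TRUE)

# Repeat each row index once per ID it contains: 1, 2, 2, 3, 4, 5, 6, 6.
row_idx <- rep(seq_len(nrow(DF2)), lengths(ids))

# One row per (store, single ID); IDs converted to integer for joining.
long <- data.frame(DF2[row_idx, c("Store", "State")],
                   ID = as.integer(unlist(ids)),
                   row.names = NULL)
long
```

Once the IDs are one per row and of the same type as in the other table, an ordinary merge or inner_join on ID picks up every store.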

R: calculate number of distinct categories in the specified time frame

here's some dummy data:
user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple
I'd like to calculate, for each user_id, the number of distinct categories in a specified time period (e.g. the past 7 or 14 days), including the current order.
The solution would look like this:
user_id date category distinct_7 distinct_14
27 2016-01-01 apple 1 1
27 2016-01-03 apple 1 1
27 2016-01-05 pear 2 2
27 2016-01-07 plum 3 3
27 2016-01-10 apple 3 3
27 2016-01-14 pear 3 3
27 2016-01-16 plum 3 3
11 2016-01-01 apple 1 1
11 2016-01-03 pear 2 2
11 2016-01-05 pear 2 2
11 2016-01-07 pear 2 2
11 2016-01-10 apple 2 2
11 2016-01-14 apple 2 2
11 2016-01-16 apple 1 2
I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!
I recommend using the runner package. You can apply any R function to running windows with the runner function. The code below obtains the desired output, which covers the past 7 days plus the current day and the past 14 days plus the current day (i.e. windows of 8 and 15 days):
df <- read.table(
text = " user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple", header = TRUE, colClasses = c("integer", "Date", "character"))
library(dplyr)
library(runner)
df %>%
group_by(user_id) %>%
mutate(distinct_7 = runner(category, k = 7 + 1, idx = date,
f = function(x) length(unique(x))),
distinct_14 = runner(category, k = 14 + 1, idx = date,
f = function(x) length(unique(x))))
More information can be found in the package and function documentation.
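For readers who cannot install an extra package, the same windows can be computed in base R. The following is a minimal sketch, not a replacement for the runner approach; the helper count_distinct is a hypothetical name, and it compares every row with every other row of the same user, so it is only suitable for small groups.

```r
df <- read.table(
  text = "user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple",
  header = TRUE, colClasses = c("integer", "Date", "character"))

# For each row of one user, count the distinct categories whose date
# lies in the closed interval [date - k, date].
count_distinct <- function(dates, cats, k) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] - k & dates <= dates[i]
    length(unique(cats[in_window]))
  }, integer(1))
}

df$distinct_7  <- NA_integer_
df$distinct_14 <- NA_integer_

# Apply the helper separately to the rows of each user.
for (rows in split(seq_len(nrow(df)), df$user_id)) {
  df$distinct_7[rows]  <- count_distinct(df$date[rows], df$category[rows], 7)
  df$distinct_14[rows] <- count_distinct(df$date[rows], df$category[rows], 14)
}
df
```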
Here are two data.table solutions, one with two nested lapply calls and the other using non-equi joins.
The first one is a rather clumsy data.table solution, but it reproduces the expected answer, and it would work for an arbitrary number of time frames. (The concise tidyverse solution that @alistaire suggested in his comment could be modified to do the same.)
It uses two nested lapply calls. The first one loops over the time frames, the second one over the dates. The temporary result is joined with the original data and then reshaped from long to wide format, so that we end up with a separate column for each of the time frames.
library(data.table)
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) rbindlist(
lapply(unique(dt$date),
function(ldate) {
dt[between(date, ldate - ldays, ldate),
.(distinct = sprintf("distinct_%02i", ldays),
date = ldate,
N = uniqueN(category)),
by = .(user_id)]
})
)
)
)
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
# date user_id category distinct_07 distinct_14
# 1: 2016-01-01 27 apple 1 1
# 2: 2016-01-03 27 apple 1 1
# 3: 2016-01-05 27 pear 2 2
# 4: 2016-01-07 27 plum 3 3
# 5: 2016-01-10 27 apple 3 3
# 6: 2016-01-14 27 pear 3 3
# 7: 2016-01-16 27 plum 3 3
# 8: 2016-01-01 11 apple 1 1
# 9: 2016-01-03 11 pear 2 2
#10: 2016-01-05 11 pear 2 2
#11: 2016-01-07 11 pear 2 2
#12: 2016-01-10 11 apple 2 2
#13: 2016-01-14 11 apple 2 2
#14: 2016-01-16 11 apple 1 2
Here is a variant following a suggestion by @Frank, which uses data.table's non-equi joins instead of the second lapply:
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) {
dt[.(user_id = user_id, dago = date - ldays, d = date),
on=.(user_id, date >= dago, date <= d),
.(distinct = sprintf("distinct_%02i", ldays),
N = uniqueN(category)),
by = .EACHI]
}
)
)[, date := NULL]
#
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
Data:
dt <- fread("user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple")
dt[, date := as.IDate(date)]
BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.
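The off-by-one is easy to check directly: a closed window from date - 7 to date contains 8 calendar days. A tiny sketch (the date and variable names are chosen only for illustration):

```r
d <- as.Date("2016-01-10")

# All calendar days in the closed windows [d - 7, d] and [d - 14, d].
window_7  <- seq(d - 7,  d, by = "day")
window_14 <- seq(d - 14, d, by = "day")

length(window_7)   # 8
length(window_14)  # 15
```

To cover exactly 7 days including the current one, the lower bound would be date - 6 instead.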
In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.
library(tidyverse)
df %>% group_by(user_id) %>%
mutate(distinct_7 = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))
## Source: local data frame [14 x 5]
## Groups: user_id [2]
##
## user_id date category distinct_7 distinct_14
## <int> <date> <fctr> <int> <int>
## 1 27 2016-01-01 apple 1 1
## 2 27 2016-01-03 apple 1 1
## 3 27 2016-01-05 pear 2 2
## 4 27 2016-01-07 plum 3 3
## 5 27 2016-01-10 apple 3 3
## 6 27 2016-01-14 pear 3 3
## 7 27 2016-01-16 plum 3 3
## 8 11 2016-01-01 apple 1 1
## 9 11 2016-01-03 pear 2 2
## 10 11 2016-01-05 pear 2 2
## 11 11 2016-01-07 pear 2 2
## 12 11 2016-01-10 apple 2 2
## 13 11 2016-01-14 apple 2 2
## 14 11 2016-01-16 apple 1 2

In R: Duplicate rows except for the first row based on condition

I have a data.table dt:
names <- c("john","mary","mary","mary","mary","mary","mary","tom","tom","tom","mary","john","john","john","tom","tom")
dates <- c(as.Date("2010-06-01"),as.Date("2010-06-01"),as.Date("2010-06-05"),as.Date("2010-06-09"),as.Date("2010-06-13"),as.Date("2010-06-17"),as.Date("2010-06-21"),as.Date("2010-07-09"),as.Date("2010-07-13"),as.Date("2010-07-17"),as.Date("2010-06-01"),as.Date("2010-08-01"),as.Date("2010-08-05"),as.Date("2010-08-09"),as.Date("2010-09-03"),as.Date("2010-09-04"))
shifts_missed <- c(2,11,11,11,11,11,11,6,6,6,1,5,5,5,0,2)
shift <- c("Day","Night","Night","Night","Night","Night","Night","Day","Day","Day","Day","Night","Night","Night","Night","Day")
df <- data.frame(names=names, dates=dates, shifts_missed=shifts_missed, shift=shift)
library(data.table)
dt <- as.data.table(df)
names dates shifts_missed shift
john 2010-06-01 2 Day
mary 2010-06-01 11 Night
mary 2010-06-05 11 Night
mary 2010-06-09 11 Night
mary 2010-06-13 11 Night
mary 2010-06-17 11 Night
mary 2010-06-21 11 Night
tom 2010-07-09 6 Day
tom 2010-07-13 6 Day
tom 2010-07-17 6 Day
mary 2010-06-01 1 Day
john 2010-08-01 5 Night
john 2010-08-05 5 Night
john 2010-08-09 5 Night
tom 2010-09-03 0 Night
tom 2010-09-04 2 Day
Ultimately, what I want is to get the following:
names dates shifts_missed shift count
john 2010-06-01 2 Day 1
mary 2010-06-01 11 Night 1
mary 2010-06-05 11 Night 1
mary 2010-06-09 11 Night 1
mary 2010-06-13 11 Night 1
mary 2010-06-17 11 Night 1
mary 2010-06-21 11 Night 1
tom 2010-07-09 6 Day 1
tom 2010-07-13 6 Day 1
tom 2010-07-17 6 Day 1
mary 2010-06-01 1 Day 1
john 2010-08-01 5 Night 1
john 2010-08-05 5 Night 1
john 2010-08-09 5 Night 1
tom 2010-09-03 0 Night 0
tom 2010-09-04 2 Day 1
john 2010-06-01 2 Night 1
mary 2010-06-05 11 Day 1
mary 2010-06-09 11 Day 1
mary 2010-06-13 11 Day 1
mary 2010-06-17 11 Day 1
mary 2010-06-21 11 Day 1
tom 2010-07-09 6 Night 1
tom 2010-07-13 6 Night 1
tom 2010-07-17 6 Night 1
john 2010-08-05 5 Day 1
john 2010-08-09 5 Day 1
tom 2010-09-04 2 Night 1
As you can see, the second half of the data is almost a duplicate of the first half. However, if shifts_missed = 0, it should not be duplicated, and if shifts_missed is odd, the first row should not be duplicated but the remaining rows should. It should then add a 1 in the count column for all except when shifts_missed = 0.
I've seen some answers that speak about !duplicate or unique, but these values in shifts_missed are not unique. I'm sure this isn't overly complicated and is probably a multi-step process, but I can't figure out how to isolate the first rows of the odd shifts_missed column.
dt[, is.in := if(shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N-1))
, by = .(names, shift)]
rbind(dt, dt[is.in & shifts_missed != 0])
Adding the extra column part should be obvious.
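For completeness, one way the remaining steps might look is sketched below. It is an assumption-laden sketch rather than part of the answer above: the names dups and res are hypothetical, and flipping shift on the duplicated rows is inferred from the desired output in the question, where each duplicated row carries the opposite shift.

```r
library(data.table)

dt <- data.table(
  names = c("john", "mary", "mary", "mary", "mary", "mary", "mary", "tom",
            "tom", "tom", "mary", "john", "john", "john", "tom", "tom"),
  dates = as.Date(c("2010-06-01", "2010-06-01", "2010-06-05", "2010-06-09",
                    "2010-06-13", "2010-06-17", "2010-06-21", "2010-07-09",
                    "2010-07-13", "2010-07-17", "2010-06-01", "2010-08-01",
                    "2010-08-05", "2010-08-09", "2010-09-03", "2010-09-04")),
  shifts_missed = c(2, 11, 11, 11, 11, 11, 11, 6, 6, 6, 1, 5, 5, 5, 0, 2),
  shift = c("Day", "Night", "Night", "Night", "Night", "Night", "Night", "Day",
            "Day", "Day", "Day", "Night", "Night", "Night", "Night", "Day")
)

# Flag rows to duplicate: every row of a group whose first shifts_missed
# is even, and all but the first row of a group whose first value is odd.
dt[, is.in := if (shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N - 1)),
   by = .(names, shift)]

# Rows to append; subsetting creates a copy, so dt itself is not modified.
dups <- dt[is.in & shifts_missed != 0]

# Assumption based on the desired output: duplicates get the opposite shift.
dups[, shift := ifelse(shift == "Day", "Night", "Day")]

res <- rbind(dt, dups)

# count is 1 everywhere except where shifts_missed is 0.
res[, count := as.integer(shifts_missed != 0)]
res[, is.in := NULL]
res
```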

remove individuals based on their range of values

I have a df with two variables, one with IDs and one called numbers. I would like to exclude individuals who do not start their sequence of numbers with the number 1.
I have managed to do this by creating a binary indicator and excluding if the person has this indicator. However, there must be a simpler more elegant way to do this?
Example data and the code I've used to achieve desired result are below.
Thank you.
sample df:
zz<-" names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 mary 4
10 mary 5
11 mary 6
12 mary 7
13 mary 8
14 mary 9
15 mary 10
16 mary 11
17 mary 12
18 pat 1
19 pat 2
20 pat 3
21 pat 4
22 pat 5
23 pat 6
24 pat 7
25 pat 8
26 pat 9
27 pat 10
28 sue 2
29 sue 3
30 sue 4
31 sue 5
32 sue 6
33 sue 7
34 sue 8
35 sue 9
36 tom 5
37 tom 6
38 tom 7
39 tom 8
40 tom 9
41 tom 10
42 tom 11
"
Data <- read.table(text=zz, header = TRUE)
Step 1 - add binary indicator
Data$all<-ifelse(Data$numbers==1, 1,0)
Data$allperson<-ave(Data$all, Data$names, FUN=cumsum)
Step 2 - get rid of people who do not have 1 as their start number
Data[!Data$allperson==0,]
If you want elegance, I must recommend the package dplyr:
library(dplyr)
Data %>%
group_by(names) %>%
filter(min(numbers) == 1)
It means just what it appears to mean: keep only records where a group (defined by names) has a minimum numbers value equal to 1.
names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 pat 1
10 pat 2
11 pat 3
You may also try:
zz1 <- Data[with(Data, names %in% unique(names)[!!table(names, numbers)[,1]]),]
head(zz1,4)
# names numbers
#1 john 1
#2 john 2
#3 john 3
#4 john 4
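A base R variant without an indicator column is also possible with ave. This is a minimal sketch, assuming the rows are already ordered within each person so that the first value is the start of the sequence; keep and starts_at_one are hypothetical names.

```r
# Rebuild the example data: john 1-8, mary 4-12, pat 1-10, sue 2-9, tom 5-11.
Data <- data.frame(
  names   = rep(c("john", "mary", "pat", "sue", "tom"),
                times = c(8, 9, 10, 8, 7)),
  numbers = c(1:8, 4:12, 1:10, 2:9, 5:11),
  stringsAsFactors = FALSE
)

# For every row, look up the first number of that person's sequence
# and keep the row only if that first number is 1.
keep <- ave(Data$numbers, Data$names, FUN = function(x) x[1]) == 1
starts_at_one <- Data[keep, ]

unique(starts_at_one$names)
```

If the rows were not sorted, replacing x[1] with min(x) would test the smallest value instead of the first one.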

give each id the same column value R

I want to give each unique id the same column value for first.date based on their first.date for fruit=='apple'.
This is what I have:
names dates fruit first.date
1 john 2010-07-01 kiwi <NA>
2 john 2010-09-01 apple 2010-09-01
3 john 2010-11-01 banana <NA>
4 john 2010-12-01 orange <NA>
5 john 2011-01-01 apple 2010-09-01
6 mary 2010-05-01 orange <NA>
7 mary 2010-07-01 apple 2010-07-01
8 mary 2010-07-01 orange <NA>
9 mary 2010-09-01 apple 2010-07-01
10 mary 2010-11-01 apple 2010-07-01
this is what I want:
names dates fruit first.date
1 john 2010-07-01 kiwi 2010-09-01
2 john 2010-09-01 apple 2010-09-01
3 john 2010-11-01 banana 2010-09-01
4 john 2010-12-01 orange 2010-09-01
5 john 2011-01-01 apple 2010-09-01
6 mary 2010-05-01 orange 2010-07-01
7 mary 2010-07-01 apple 2010-07-01
8 mary 2010-07-01 orange 2010-07-01
9 mary 2010-09-01 apple 2010-07-01
10 mary 2010-11-01 apple 2010-07-01
This is my disastrous attempt:
getdates$first.date[is.na]<-getdates[getdates$first.date & getdates$fruit=='apple',]
Thank you in advance
reproducible DF
names<-as.character(c("john", "john", "john", "john", "john", "mary", "mary","mary","mary","mary"))
dates<-as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01", "2010-05-01", "2010-07-01", "2010-07-01", "2010-09-01", "2010-11-01"))
fruit<-as.character(c("kiwi","apple","banana","orange","apple","orange","apple","orange", "apple", "apple"))
first.date<-as.Date(c(NA, "2010-09-01",NA,NA, "2010-09-01", NA, "2010-07-01", NA, "2010-07-01","2010-07-01"))
getdates<-data.frame(names,dates,fruit, first.date)
It's unclear what you want to do when there are duplicate entries for first.date and apple (for a given name); this will just take the first one:
library(data.table)
dt = data.table(getdates)
dt[, first.date := first.date[fruit == 'apple'][1], by = names]
dt
# names dates fruit first.date
# 1: john 2010-07-01 kiwi 2010-09-01
# 2: john 2010-09-01 apple 2010-09-01
# 3: john 2010-11-01 banana 2010-09-01
# 4: john 2010-12-01 orange 2010-09-01
# 5: john 2011-01-01 apple 2010-09-01
# 6: mary 2010-05-01 orange 2010-07-01
# 7: mary 2010-07-01 apple 2010-07-01
# 8: mary 2010-07-01 orange 2010-07-01
# 9: mary 2010-09-01 apple 2010-07-01
#10: mary 2010-11-01 apple 2010-07-01
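The same fill can be sketched in base R with match, which also returns the first matching row per name. The intermediate name apple_rows is an assumption for illustration; the data are rebuilt from the question.

```r
names <- c("john", "john", "john", "john", "john",
           "mary", "mary", "mary", "mary", "mary")
dates <- as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01",
                   "2011-01-01", "2010-05-01", "2010-07-01", "2010-07-01",
                   "2010-09-01", "2010-11-01"))
fruit <- c("kiwi", "apple", "banana", "orange", "apple",
           "orange", "apple", "orange", "apple", "apple")
first.date <- as.Date(c(NA, "2010-09-01", NA, NA, "2010-09-01",
                        NA, "2010-07-01", NA, "2010-07-01", "2010-07-01"))
getdates <- data.frame(names, dates, fruit, first.date,
                       stringsAsFactors = FALSE)

# Apple rows that already carry a first.date.
apple_rows <- getdates[getdates$fruit == "apple" &
                         !is.na(getdates$first.date), ]

# match() finds the first apple row for each name; indexing the Date
# column with it copies that date onto every row of the same person.
getdates$first.date <-
  apple_rows$first.date[match(getdates$names, apple_rows$names)]
getdates
```

A person with no apple row at all would end up with NA in first.date, since match returns NA for them.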
