Related
I'm having some trouble trying to do a count of days based on starting dates. I basically just want a count of days passed since the starting date by product.
I think it is best illustrated by example.
This is what I start with:
df1 <- data.frame(Dates = seq(as.Date("2021/1/1"), as.Date("2021/1/15"), "days"),
Product = rep(c(rep("Banana", 5), rep("Apple", 5), rep("Orange", 5)))
)
Dates Product
1 2021-01-01 Banana
2 2021-01-02 Banana
3 2021-01-03 Banana
4 2021-01-04 Banana
5 2021-01-05 Banana
6 2021-01-06 Apple
7 2021-01-07 Apple
8 2021-01-08 Apple
9 2021-01-09 Apple
10 2021-01-10 Apple
11 2021-01-11 Orange
12 2021-01-12 Orange
13 2021-01-13 Orange
14 2021-01-14 Orange
15 2021-01-15 Orange
I currently have several measurements for each product that I need to plot as number of days rather than dates and I cannot make the transformation.
And this is what I want:
desired_df <- data.frame(Dates = seq(as.Date("2021/1/1"), as.Date("2021/1/15"), "days"),
Product = rep(c(rep("Banana", 5), rep("Apple", 5), rep("Orange", 5))),
Days = rep(seq(0, 4), 3)
)
Dates Product Days
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
So far I've tried a few approaches, but none works.
df2 <- df1 %>%
mutate(Days = Dates - Dates[1])
df3 <- df1 %>%
group_by(Product) %>%
mutate(Days = Dates - Dates[1])
Dates Product Days
starter_dates <- df1 %>%
aggregate(by = list(df1$Product), FUN = first)
Group.1 Dates Product
1 Apple 2021-01-06 Apple
2 Banana 2021-01-01 Banana
3 Orange 2021-01-11 Orange
df4 <- df1 %>%
mutate(
Days = case_when(Product == starter_dates$Product ~ Dates - starter_dates$Dates)
)
But none produced what I want. How can I calculate the number of days from first appearance?
EDIT:
This is what I get from suggested answers:
> df1 %>% group_by(Product) %>% mutate(Days = as.numeric(Dates - Dates[1]))
# A tibble: 15 x 3
# Groups: Product [3]
Dates Product Days
<date> <chr> <dbl>
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 5
7 2021-01-07 Apple 6
8 2021-01-08 Apple 7
9 2021-01-09 Apple 8
10 2021-01-10 Apple 9
11 2021-01-11 Orange 10
12 2021-01-12 Orange 11
13 2021-01-13 Orange 12
14 2021-01-14 Orange 13
15 2021-01-15 Orange 14
Ensuring no conflicts from other packages, below now works.
df1 %>% group_by(Product) %>%
mutate(Days=lubridate::day(Dates)-first(lubridate::day(Dates)))
We can subtract the "Date", for every row, from the first "Date" value:
df1 %>% group_by(Product) %>%
mutate(Days=lubridate::day(Dates)-first(lubridate::day(Dates)))
# A tibble: 15 x 3
# Groups: Product [3]
Dates Product Days
<date> <chr> <int>
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
Since using tidyverse is not a requirement, here a base R solution:
data.frame( df1, Days=as.vector( sapply( unique(df1$Product),
function(x) df1$Dates[df1$Product==x] - df1$Dates[df1$Product==x][1] ) ) )
Dates Product Days
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
df1 %>%
inner_join(df2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But due to the semi-colons I'm not capturing those data points on the join, the end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Foot Locker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Popeye's HA
Using separate_rows from tidyr in combination with dplyr will get you there.
First table I called fruit, the other stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
With base R, we can use strsplit with merge
lst1 <- strsplit(DF2$ID, ";")
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)),
lengths(lst1)), 1:2], ID = unlist(lst1)))
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ
I have df1:
State date fips score score1
1 Alabama 2020-03-24 1 242 0
2 Alabama 2020-03-26 1 538 3
3 Alabama 2020-03-28 1 720 4
4 Alabama 2020-03-21 1 131 0
5 Alabama 2020-03-15 1 23 0
6 Alabama 2020-03-18 1 51 0
7 Texas 2020-03-14 2 80 0
7 Texas 2020-03-16 2 102 0
7 Texas 2020-03-20 2 702 1
8 Texas 2020-03-23 2 1005 1
I would like to see which date a State surpasses a score of 100. I would then like to select the row 7 days after that date? For example, Alabama passes 100 on March 21st, so I would like to keep the March 28th data.
State date fips score score1
3 Alabama 2020-03-28 1 720 4
8 Texas 2020-03-23 2 1005 1
Here is a solution tidyverse and lubridate.
library(tidyverse)
library(lubridate)
df %>%
#Convert date column to date format
mutate_at(vars(date), ymd) %>%
#Group by State
group_by(State) %>%
#Ignore scores under 100
filter(score > 100) %>%
#Stay only with the date of the first date with score over 100 + 7 days
filter(date == min(date) + days(7))
Using a by approach (assuming date + 7 is available).
res <- do.call(rbind, by(dat, dat$state, function(x) {
st <- x[x$cases > 100, ]
st[as.Date(st$date) == as.Date(st$date[1]) + 7, ]
}))
head(res)
# date state fips cases deaths
# Alabama 2020-03-27 Alabama 1 639 4
# Alaska 2020-04-04 Alaska 2 169 3
# Arizona 2020-03-28 Arizona 4 773 15
# Arkansas 2020-03-28 Arkansas 5 409 5
# California 2020-03-15 California 6 478 6
# Colorado 2020-03-21 Colorado 8 475 6
here's some dummy data:
user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple
I'd like to calculate for each user_id the number of distinct categories in the specified time period (e.g. in the past 7, 14 days), including the current order
The solution would look like this:
user_id date category distinct_7 distinct_14
27 2016-01-01 apple 1 1
27 2016-01-03 apple 1 1
27 2016-01-05 pear 2 2
27 2016-01-07 plum 3 3
27 2016-01-10 apple 3 3
27 2016-01-14 pear 3 3
27 2016-01-16 plum 3 3
11 2016-01-01 apple 1 1
11 2016-01-03 pear 2 2
11 2016-01-05 pear 2 2
11 2016-01-07 pear 2 2
11 2016-01-10 apple 2 2
11 2016-01-14 apple 2 2
11 2016-01-16 apple 1 2
I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!
I recommend using runner package. You can use any R function on running windows with runner function. Code below obtains desided output, which is past 7-days + current and past 14-days + current (current 8 and 15 days):
df <- read.table(
text = " user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple", header = TRUE, colClasses = c("integer", "Date", "character"))
library(dplyr)
library(runner)
df %>%
group_by(user_id) %>%
mutate(distinct_7 = runner(category, k = 7 + 1, idx = date,
f = function(x) length(unique(x))),
distinct_14 = runner(category, k = 14 + 1, idx = date,
f = function(x) length(unique(x))))
More informations in package and function documentation.
Here are two data.table solutions, one with two nested lapplyand the other using non-equi joins.
The first one is a rather clumsy data.table solution but it reproduces the expected answer. And it would work for an arbitrary number of time frames. (Although #alistaire's concise tidyverse solution he had suggested in his comment could be modified as well).
It uses two nested lapply. The first one loops over the time frames, the second one over the dates. The tempory result is joined with the original data and then reshaped from long to wide format so that we will end with a separate column for each of the time frames.
library(data.table)
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) rbindlist(
lapply(unique(dt$date),
function(ldate) {
dt[between(date, ldate - ldays, ldate),
.(distinct = sprintf("distinct_%02i", ldays),
date = ldate,
N = uniqueN(category)),
by = .(user_id)]
})
)
)
)
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
# date user_id category distinct_07 distinct_14
# 1: 2016-01-01 27 apple 1 1
# 2: 2016-01-03 27 apple 1 1
# 3: 2016-01-05 27 pear 2 2
# 4: 2016-01-07 27 plum 3 3
# 5: 2016-01-10 27 apple 3 3
# 6: 2016-01-14 27 pear 3 3
# 7: 2016-01-16 27 plum 3 3
# 8: 2016-01-01 11 apple 1 1
# 9: 2016-01-03 11 pear 2 2
#10: 2016-01-05 11 pear 2 2
#11: 2016-01-07 11 pear 2 2
#12: 2016-01-10 11 apple 2 2
#13: 2016-01-14 11 apple 2 2
#14: 2016-01-16 11 apple 1 2
Here is a variant following a suggestion by #Frank which uses data.table's non-equi joins instead of the second lapply:
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) {
dt[.(user_id = user_id, dago = date - ldays, d = date),
on=.(user_id, date >= dago, date <= d),
.(distinct = sprintf("distinct_%02i", ldays),
N = uniqueN(category)),
by = .EACHI]
}
)
)[, date := NULL]
#
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
Data:
dt <- fread("user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple")
dt[, date := as.IDate(date)]
BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.
In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.
library(tidyverse)
df %>% group_by(user_id) %>%
mutate(distinct_7 = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))
## Source: local data frame [14 x 5]
## Groups: user_id [2]
##
## user_id date category distinct_7 distinct_14
## <int> <date> <fctr> <int> <int>
## 1 27 2016-01-01 apple 1 1
## 2 27 2016-01-03 apple 1 1
## 3 27 2016-01-05 pear 2 2
## 4 27 2016-01-07 plum 3 3
## 5 27 2016-01-10 apple 3 3
## 6 27 2016-01-14 pear 3 3
## 7 27 2016-01-16 plum 3 3
## 8 11 2016-01-01 apple 1 1
## 9 11 2016-01-03 pear 2 2
## 10 11 2016-01-05 pear 2 2
## 11 11 2016-01-07 pear 2 2
## 12 11 2016-01-10 apple 2 2
## 13 11 2016-01-14 apple 2 2
## 14 11 2016-01-16 apple 1 2
I want to establish a cohort of new users of drugs (Ray 2003). My original dataset is huge approx 19 million rows, so a loop is proving inefficient. Here is a dummy dataset (done with fruits instead of drugs):
df2
names dates age sex fruit
1 tom 2010-02-01 60 m apple
2 mary 2010-05-01 55 f orange
3 tom 2010-03-01 60 m banana
4 john 2010-07-01 57 m kiwi
5 mary 2010-07-01 55 f apple
6 tom 2010-06-01 60 m apple
7 john 2010-09-01 57 m apple
8 mary 2010-07-01 55 f orange
9 john 2010-11-01 57 m banana
10 mary 2010-09-01 55 f apple
11 tom 2010-08-01 60 m kiwi
12 mary 2010-11-01 55 f apple
13 john 2010-12-01 57 m orange
14 john 2011-01-01 57 m apple
I have identified people who were prescribed an apple between 04-2010 and 10-2010:
temp2
names dates age sex fruit
6 tom 2010-06-01 60 m apple
5 mary 2010-07-01 55 f apple
7 john 2010-09-01 57 m apple
I would like to make a new column in the original DF called "index" which is the first date that a person was prescribed a drug in the the defined date range. This is what I have tried to get the dates from temp into df$index:
df2$index<-temp2$dates
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)
I'm not doing this right - as none of these work. This is the desired output.
df2
names dates age sex fruit index
1 tom 2010-02-01 60 m apple <NA>
2 mary 2010-05-01 55 f orange <NA>
3 tom 2010-03-01 60 m banana <NA>
4 john 2010-07-01 57 m kiwi <NA>
5 mary 2010-07-01 55 f apple 2010-07-01
6 tom 2010-06-01 60 m apple 2010-06-01
7 john 2010-09-01 57 m apple 2010-09-01
8 mary 2010-07-01 55 f orange <NA>
9 john 2010-11-01 57 m banana <NA>
10 mary 2010-09-01 55 f apple <NA>
11 tom 2010-08-01 60 m kiwi <NA>
12 mary 2010-11-01 55 f apple <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple <NA>
Once I have the desired output, I want to trace back from the index date to see if any person had an apple in the previous 180 days. if they did not have an apple - I want to keep them. If they did have an apple (e.g., tom) I want to discard him. This is the code i have tried on the desired output:
df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me
I would appreciate any guidance at all on these questions - even a direction to what I should read to help me learn how to do this. Perhaps my logic is flawed and my method won't work - please tell me if thats the case! Thank you in advance.
Here is my df:
names<-c("tom", "mary", "tom", "john", "mary",
"tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01",
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
"2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01",
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
"apple", "apple", "apple", "orange", "banana", "apple",
"kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
"f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2
Here is temp2:
data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates< "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ]
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL
SOLUTION
df2 <- df2[with(df2, order(names, dates)), ]
df2$first.date <- ave(df2$date, df2$name, df2$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1]) ##DWin code for assigning index date for each fruit in the pre-period
df2$x<-df2$fruit=='apple' & df2$dates>df2$first.date-180 & df2$dates<df2$first.date ##assigns TRUE to row that tom is not a new user
ids <- with(df2, unique(names[x == "TRUE"])) ##finding the id which has one value of true
new_users<-subset(df2, !names %in% ids) ##gets rid of id that has at least one value of true
First order by name and date:
df <- df[with(df, order(names, dates)), ]
Then just pick the first date within each name:
df$first.date <- ave(df$date, df$name, FUN="[", 1)
Now that you have will see "the power of the fully operational Death Star \w\w", er, the ave-function. You are ready to pick out the first date within individual 'names' and 'fruits' within that date-range:
> df$first.date <- ave(df$date, df$name, df$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
names dates age sex fruit first.date
4 john 2010-07-01 57 m kiwi 2010-07-01
7 john 2010-09-01 57 m apple 2010-09-01
9 john 2010-11-01 57 m banana <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple 2010-09-01
2 mary 2010-05-01 55 f orange 2010-05-01
5 mary 2010-07-01 55 f apple 2010-07-01
8 mary 2010-07-01 55 f orange 2010-05-01
10 mary 2010-09-01 55 f apple 2010-07-01
12 mary 2010-11-01 55 f apple 2010-07-01
1 tom 2010-02-01 60 m apple 2010-06-01
3 tom 2010-03-01 60 m banana <NA>
6 tom 2010-06-01 60 m apple 2010-06-01
11 tom 2010-08-01 60 m kiwi 2010-08-01
Since you have 19 million rows , I think you should try a data.table solution. Here my attempt. The result is slightly different from #Dwin result since I filter my data between (begin,end) and then I create a new index variable which is the min dates occurring in this chosen range for each (names,fruits)
library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates,as.Date("2010-04-01"),as.Date("2010-10-31")),
index := as.character(min(dates))
, by=c('names','fruit')]
## names dates age sex fruit index
## 1: john 2010-07-01 57 m kiwi 2010-07-01
## 2: john 2010-09-01 57 m apple 2010-09-01
## 3: john 2010-11-01 57 m banana NA
## 4: john 2010-12-01 57 m orange NA
## 5: john 2011-01-01 57 m apple NA
## 6: mary 2010-05-01 55 f orange 2010-05-01
## 7: mary 2010-07-01 55 f apple 2010-07-01
## 8: mary 2010-07-01 55 f orange 2010-05-01
## 9: mary 2010-09-01 55 f apple 2010-07-01
## 10: mary 2010-11-01 55 f apple NA
## 11: tom 2010-02-01 60 m apple NA
## 12: tom 2010-03-01 60 m banana NA
## 13: tom 2010-06-01 60 m apple 2010-06-01
## 14: tom 2010-08-01 60 m kiwi 2010-08-01