I want to give each unique id the same column value for first.date based on their first.date for fruit=='apple'.
This is what I have:
names dates fruit first.date
1 john 2010-07-01 kiwi <NA>
2 john 2010-09-01 apple 2010-09-01
3 john 2010-11-01 banana <NA>
4 john 2010-12-01 orange <NA>
5 john 2011-01-01 apple 2010-09-01
6 mary 2010-05-01 orange <NA>
7 mary 2010-07-01 apple 2010-07-01
8 mary 2010-07-01 orange <NA>
9 mary 2010-09-01 apple 2010-07-01
10 mary 2010-11-01 apple 2010-07-01
This is what I want:
names dates fruit first.date
1 john 2010-07-01 kiwi 2010-09-01
2 john 2010-09-01 apple 2010-09-01
3 john 2010-11-01 banana 2010-09-01
4 john 2010-12-01 orange 2010-09-01
5 john 2011-01-01 apple 2010-09-01
6 mary 2010-05-01 orange 2010-07-01
7 mary 2010-07-01 apple 2010-07-01
8 mary 2010-07-01 orange 2010-07-01
9 mary 2010-09-01 apple 2010-07-01
10 mary 2010-11-01 apple 2010-07-01
This is my disastrous attempt:
getdates$first.date[is.na]<-getdates[getdates$first.date & getdates$fruit=='apple',]
Thank you in advance
Reproducible data frame:
names<-as.character(c("john", "john", "john", "john", "john", "mary", "mary","mary","mary","mary"))
dates<-as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01", "2010-05-01", "2010-07-01", "2010-07-01", "2010-09-01", "2010-11-01"))
fruit<-as.character(c("kiwi","apple","banana","orange","apple","orange","apple","orange", "apple", "apple"))
first.date<-as.Date(c(NA, "2010-09-01",NA,NA, "2010-09-01", NA, "2010-07-01", NA, "2010-07-01","2010-07-01"))
getdates<-data.frame(names,dates,fruit, first.date)
It's unclear what you want to do when there are duplicate entries for first.date and apple (for a given name). This will just take the first one:
library(data.table)
dt = data.table(getdates)
dt[, first.date := first.date[fruit == 'apple'][1], by = names]
dt
# names dates fruit first.date
# 1: john 2010-07-01 kiwi 2010-09-01
# 2: john 2010-09-01 apple 2010-09-01
# 3: john 2010-11-01 banana 2010-09-01
# 4: john 2010-12-01 orange 2010-09-01
# 5: john 2011-01-01 apple 2010-09-01
# 6: mary 2010-05-01 orange 2010-07-01
# 7: mary 2010-07-01 apple 2010-07-01
# 8: mary 2010-07-01 orange 2010-07-01
# 9: mary 2010-09-01 apple 2010-07-01
#10: mary 2010-11-01 apple 2010-07-01
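For readers not using data.table, the same idea can be sketched in base R. This is only a sketch: it assumes the getdates data frame from the question, takes the first apple first.date per name just like the answer above, and uses a helper called first_non_na that is made up for illustration.

```r
# Rebuild the question's data so the sketch is self-contained.
getdates <- data.frame(
  names = rep(c("john", "mary"), each = 5),
  dates = as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01",
                    "2010-05-01", "2010-07-01", "2010-07-01", "2010-09-01", "2010-11-01")),
  fruit = c("kiwi", "apple", "banana", "orange", "apple",
            "orange", "apple", "orange", "apple", "apple"),
  first.date = as.Date(c(NA, "2010-09-01", NA, NA, "2010-09-01",
                         NA, "2010-07-01", NA, "2010-07-01", "2010-07-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value of a vector (NA if there is none).
first_non_na <- function(x) x[!is.na(x)][1]

# Keep first.date only on apple rows, as a plain number of days since 1970-01-01.
apple_date <- ifelse(getdates$fruit == "apple", as.numeric(getdates$first.date), NA)

# Spread the first apple date over every row that shares the same name.
getdates$first.date <- as.Date(ave(apple_date, getdates$names, FUN = first_non_na),
                               origin = "1970-01-01")
getdates
```

Working on the numeric representation and converting back with as.Date() keeps ave() away from any class handling of Date objects.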
I have a dataframe that contains several fields related to an identifier but some are disjointed:
id store manager fruit vegetable
1 Grocery1 Joe apple NA
1 Grocery1 Joe lemon NA
1 Grocery1 Joe NA zucchini
2 Grocery2 Amy orange NA
2 Grocery2 Amy NA asparagus
2 Grocery2 Amy NA spinach
3 Grocery3 Bill NA NA
I want the dataframe to look like:
id store manager fruit vegetable
1 Grocery1 Joe apple zucchini
1 Grocery1 Joe lemon zucchini
2 Grocery2 Amy orange asparagus
2 Grocery2 Amy orange spinach
3 Grocery3 Bill NA NA
Is there a way to easily do this?
You can use tidyr::fill to fill the NA, and only keep the non-duplicated rows using distinct.
library(dplyr)
library(tidyr)
df %>%
group_by(store, manager) %>%
fill(fruit, vegetable, .direction = "updown") %>%
distinct()
# A tibble: 5 × 5
# Groups: store, manager [3]
id store manager fruit vegetable
<int> <chr> <chr> <chr> <chr>
1 1 Grocery1 Joe apple zucchini
2 1 Grocery1 Joe lemon zucchini
3 2 Grocery2 Amy orange asparagus
4 2 Grocery2 Amy orange spinach
5 3 Grocery3 Bill NA NA
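To make it clearer what the "updown" direction does, here is a rough base R sketch of the same fill-then-deduplicate idea. The helpers fill_down and fill_updown are hypothetical names written for this illustration, not part of tidyr, and the grouping uses id alone, which is an assumption that happens to match store and manager in the sample data.

```r
# Hypothetical helper: carry the last non-NA value forward (leading NAs stay NA).
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values have been seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 is the NA used before any value
}

# Hypothetical helper: fill downwards first, then fill what is left upwards.
fill_updown <- function(x) {
  x <- fill_down(x)
  rev(fill_down(rev(x)))
}

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3),
  store = c("Grocery1", "Grocery1", "Grocery1", "Grocery2", "Grocery2", "Grocery2", "Grocery3"),
  manager = c("Joe", "Joe", "Joe", "Amy", "Amy", "Amy", "Bill"),
  fruit = c("apple", "lemon", NA, "orange", NA, NA, NA),
  vegetable = c(NA, NA, "zucchini", NA, "asparagus", "spinach", NA),
  stringsAsFactors = FALSE
)

# Fill each column within its id group, then drop the rows that became duplicates.
df$fruit <- ave(df$fruit, df$id, FUN = fill_updown)
df$vegetable <- ave(df$vegetable, df$id, FUN = fill_updown)
result <- unique(df)
result
```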
I'm having some trouble trying to do a count of days based on starting dates. I basically just want a count of days passed since the starting date by product.
I think it is best illustrated by example.
This is what I start with:
df1 <- data.frame(Dates = seq(as.Date("2021/1/1"), as.Date("2021/1/15"), "days"),
Product = rep(c(rep("Banana", 5), rep("Apple", 5), rep("Orange", 5)))
)
Dates Product
1 2021-01-01 Banana
2 2021-01-02 Banana
3 2021-01-03 Banana
4 2021-01-04 Banana
5 2021-01-05 Banana
6 2021-01-06 Apple
7 2021-01-07 Apple
8 2021-01-08 Apple
9 2021-01-09 Apple
10 2021-01-10 Apple
11 2021-01-11 Orange
12 2021-01-12 Orange
13 2021-01-13 Orange
14 2021-01-14 Orange
15 2021-01-15 Orange
I currently have several measurements for each product that I need to plot as number of days rather than dates and I cannot make the transformation.
And this is what I want:
desired_df <- data.frame(Dates = seq(as.Date("2021/1/1"), as.Date("2021/1/15"), "days"),
Product = rep(c(rep("Banana", 5), rep("Apple", 5), rep("Orange", 5))),
Days = rep(seq(0, 4), 3)
)
Dates Product Days
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
So far I've tried a few approaches, but none works.
df2 <- df1 %>%
mutate(Days = Dates - Dates[1])
df3 <- df1 %>%
group_by(Product) %>%
mutate(Days = Dates - Dates[1])
Dates Product Days
starter_dates <- df1 %>%
aggregate(by = list(df1$Product), FUN = first)
Group.1 Dates Product
1 Apple 2021-01-06 Apple
2 Banana 2021-01-01 Banana
3 Orange 2021-01-11 Orange
df4 <- df1 %>%
mutate(
Days = case_when(Product == starter_dates$Product ~ Dates - starter_dates$Dates)
)
But none produced what I want. How can I calculate the number of days from first appearance?
EDIT:
This is what I get from suggested answers:
> df1 %>% group_by(Product) %>% mutate(Days = as.numeric(Dates - Dates[1]))
# A tibble: 15 x 3
# Groups: Product [3]
Dates Product Days
<date> <chr> <dbl>
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 5
7 2021-01-07 Apple 6
8 2021-01-08 Apple 7
9 2021-01-09 Apple 8
10 2021-01-10 Apple 9
11 2021-01-11 Orange 10
12 2021-01-12 Orange 11
13 2021-01-13 Orange 12
14 2021-01-14 Orange 13
15 2021-01-15 Orange 14
After ensuring there are no conflicts from other packages, the following now works:
df1 %>% group_by(Product) %>%
mutate(Days=lubridate::day(Dates)-first(lubridate::day(Dates)))
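Within each product, we can subtract the first "Dates" value from the value in every row. Note that lubridate::day() returns the day of the month, so this version assumes each product's dates fall within a single calendar month, as they do in the example: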
df1 %>% group_by(Product) %>%
mutate(Days=lubridate::day(Dates)-first(lubridate::day(Dates)))
# A tibble: 15 x 3
# Groups: Product [3]
Dates Product Days
<date> <chr> <int>
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
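One caveat worth illustrating: a difference of day-of-month values is not the same as a difference of dates once a series crosses a month boundary. The small sketch below uses base R's format(x, "%d") as a stand-in for lubridate::day(), which is an assumption made only to keep the example free of package dependencies.

```r
# Five consecutive days that cross from January into February.
dates <- as.Date("2021-01-30") + 0:4

# Day-of-month difference (what a day()-based calculation does).
day_of_month <- as.integer(format(dates, "%d"))
by_day_of_month <- day_of_month - day_of_month[1]

# Difference between the dates themselves, in days.
by_date_diff <- as.integer(dates - dates[1])

by_day_of_month  # 0 1 -29 -28 -27
by_date_diff     # 0 1 2 3 4
```

Subtracting the dates directly, for example as.integer(Dates - first(Dates)) inside the grouped mutate(), avoids the problem.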
Since using tidyverse is not a requirement, here is a base R solution:
data.frame( df1, Days=as.vector( sapply( unique(df1$Product),
function(x) df1$Dates[df1$Product==x] - df1$Dates[df1$Product==x][1] ) ) )
Dates Product Days
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
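The sapply() version depends on the rows being sorted by product and on every product having the same number of rows. A sketch with ave() that does not need either is shown below. The sample data here is invented to be unsorted and unbalanced, and "first appearance" is taken to mean the earliest date per product, which is an assumption.

```r
# Invented data: rows out of order, and the products have different row counts.
df1 <- data.frame(
  Dates = as.Date(c("2021-01-03", "2021-01-01", "2021-01-06", "2021-01-02", "2021-01-08")),
  Product = c("Banana", "Banana", "Apple", "Banana", "Apple"),
  stringsAsFactors = FALSE
)

# Days since the earliest date of the same product, kept in the original row order.
df1$Days <- ave(as.numeric(df1$Dates), df1$Product, FUN = function(d) d - min(d))
df1
```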
I have a data.table dt:
names <- c("john","mary","mary","mary","mary","mary","mary","tom","tom","tom","mary","john","john","john","tom","tom")
dates <- c(as.Date("2010-06-01"),as.Date("2010-06-01"),as.Date("2010-06-05"),as.Date("2010-06-09"),as.Date("2010-06-13"),as.Date("2010-06-17"),as.Date("2010-06-21"),as.Date("2010-07-09"),as.Date("2010-07-13"),as.Date("2010-07-17"),as.Date("2010-06-01"),as.Date("2010-08-01"),as.Date("2010-08-05"),as.Date("2010-08-09"),as.Date("2010-09-03"),as.Date("2010-09-04"))
shifts_missed <- c(2,11,11,11,11,11,11,6,6,6,1,5,5,5,0,2)
shift <- c("Day","Night","Night","Night","Night","Night","Night","Day","Day","Day","Day","Night","Night","Night","Night","Day")
df <- data.frame(names=names, dates=dates, shifts_missed=shifts_missed, shift=shift)
library(data.table)
dt <- as.data.table(df)
names dates shifts_missed shift
john 2010-06-01 2 Day
mary 2010-06-01 11 Night
mary 2010-06-05 11 Night
mary 2010-06-09 11 Night
mary 2010-06-13 11 Night
mary 2010-06-17 11 Night
mary 2010-06-21 11 Night
tom 2010-07-09 6 Day
tom 2010-07-13 6 Day
tom 2010-07-17 6 Day
mary 2010-06-01 1 Day
john 2010-08-01 5 Night
john 2010-08-05 5 Night
john 2010-08-09 5 Night
tom 2010-09-03 0 Night
tom 2010-09-04 2 Day
Ultimately, what I want is to get the following:
names dates shifts_missed shift count
john 2010-06-01 2 Day 1
mary 2010-06-01 11 Night 1
mary 2010-06-05 11 Night 1
mary 2010-06-09 11 Night 1
mary 2010-06-13 11 Night 1
mary 2010-06-17 11 Night 1
mary 2010-06-21 11 Night 1
tom 2010-07-09 6 Day 1
tom 2010-07-13 6 Day 1
tom 2010-07-17 6 Day 1
mary 2010-06-01 1 Day 1
john 2010-08-01 5 Night 1
john 2010-08-05 5 Night 1
john 2010-08-09 5 Night 1
tom 2010-09-03 0 Night 0
tom 2010-09-04 2 Day 1
john 2010-06-01 2 Night 1
mary 2010-06-05 11 Day 1
mary 2010-06-09 11 Day 1
mary 2010-06-13 11 Day 1
mary 2010-06-17 11 Day 1
mary 2010-06-21 11 Day 1
tom 2010-07-09 6 Night 1
tom 2010-07-13 6 Night 1
tom 2010-07-17 6 Night 1
john 2010-08-05 5 Day 1
john 2010-08-09 5 Day 1
tom 2010-09-04 2 Night 1
As you can see, the second half of the data is almost a duplicate of the first half. However, if shifts_missed = 0, it should not be duplicated, and if shifts_missed is odd, the first row should not be duplicated but the remaining rows should. It should then add a 1 in the count column for all except when shifts_missed = 0.
I've seen some answers that mention !duplicated or unique, but the values in shifts_missed are not unique. I'm sure this isn't overly complicated and is probably a multi-step process, but I can't figure out how to isolate the first row of each group with an odd shifts_missed value.
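# Flag the rows to duplicate within each (names, shift) group: every row when the
# group's first shifts_missed is even, every row except the first when it is odd.
dt[, is.in := if (shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N - 1))
, by = .(names, shift)]
# Append the flagged rows, skipping any with shifts_missed == 0.
rbind(dt, dt[is.in & shifts_missed != 0])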
Adding the extra column part should be obvious.
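For completeness, here is one way the whole task could look in base R, including the count column. It is a sketch under two assumptions read off the desired output rather than stated in the answer: the duplicated rows get the opposite shift, and count is 1 wherever shifts_missed is not 0.

```r
df <- data.frame(
  names = c("john", "mary", "mary", "mary", "mary", "mary", "mary", "tom",
            "tom", "tom", "mary", "john", "john", "john", "tom", "tom"),
  dates = as.Date(c("2010-06-01", "2010-06-01", "2010-06-05", "2010-06-09",
                    "2010-06-13", "2010-06-17", "2010-06-21", "2010-07-09",
                    "2010-07-13", "2010-07-17", "2010-06-01", "2010-08-01",
                    "2010-08-05", "2010-08-09", "2010-09-03", "2010-09-04")),
  shifts_missed = c(2, 11, 11, 11, 11, 11, 11, 6, 6, 6, 1, 5, 5, 5, 0, 2),
  shift = c("Day", "Night", "Night", "Night", "Night", "Night", "Night", "Day",
            "Day", "Day", "Day", "Night", "Night", "Night", "Night", "Day"),
  stringsAsFactors = FALSE
)

# One group per (names, shift) combination that actually occurs.
grp <- interaction(df$names, df$shift, drop = TRUE)

# Position of each row inside its group, and the group's first shifts_missed value.
pos <- ave(seq_len(nrow(df)), grp, FUN = seq_along)
first_missed <- ave(df$shifts_missed, grp, FUN = function(x) x[1])

# Even groups duplicate every row; odd groups skip their first row.
is_in <- first_missed %% 2 == 0 | pos > 1

dup <- df[is_in & df$shifts_missed != 0, ]
dup$shift <- ifelse(dup$shift == "Day", "Night", "Day")  # assumption: flip the shift

out <- rbind(df, dup)
out$count <- as.integer(out$shifts_missed != 0)
out
```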
I would like to check that an individual does not have any gaps in their eligibility status. I define a gap as a date_of_claim that occurs more than 30 days after the last elig_end_date. Therefore, what I would like to do is check that each date_of_claim is no later than the elig_end_date + 30 days in the immediately preceding row. Ideally I would like an indicator, per person, that says 0 for no gap and 1 if there is a gap, marking where the gap occurs. Here is a sample df with the solution built in as 'gaps'.
names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1
I have found this post quite useful: "How can I compare a value in a column to the previous one using R?", but I feel that I can't use a loop, as my df has 4 million rows and I have already had a lot of difficulty trying to run a loop on it.
To this end, I think the code I need is something like this:
df$gaps<-ifelse(df$date_of_claim>=df$elig_end_date+30,1,0) ## this doesn't use the preceding row.
I've made a clumsy attempt using this:
df$gaps<-df$date_of_claim>=df$elig_end_date[-1,]
but I get an error saying I have an incorrect number of dimensions.
All help greatly appreciated! Thank you.
With four million observations I would use data.table:
DF <- read.table(text="names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1", header=TRUE)
library(data.table)
DT <- data.table(DF)
DT[, c("date_of_claim", "elig_end_date") := list(as.Date(date_of_claim), as.Date(elig_end_date))]
DT[, gaps2:= c(NA, date_of_claim[-1] > head(elig_end_date, -1)+30), by=names]
# names date_of_claim elig_end_date obs gaps gaps2
# 1: tom 2010-01-01 2010-07-01 1 NA NA
# 2: tom 2010-05-04 2010-07-01 1 0 FALSE
# 3: tom 2010-06-01 2014-01-01 2 0 FALSE
# 4: tom 2010-10-10 2014-01-01 2 0 FALSE
# 5: mary 2010-03-01 2014-06-14 1 NA NA
# 6: mary 2010-05-01 2014-06-14 1 0 FALSE
# 7: mary 2010-08-01 2014-06-14 1 0 FALSE
# 8: mary 2010-11-01 2014-06-14 1 0 FALSE
# 9: mary 2011-01-01 2014-06-14 1 0 FALSE
# 10: john 2010-03-27 2011-03-01 1 NA NA
# 11: john 2010-07-01 2011-03-01 1 0 FALSE
# 12: john 2010-11-01 2011-03-01 1 0 FALSE
# 13: john 2011-02-01 2011-03-01 1 0 FALSE
# 14: sue 2010-02-01 2010-04-30 1 NA NA
# 15: sue 2010-02-27 2010-04-30 1 0 FALSE
# 16: sue 2010-03-13 2010-05-31 2 0 FALSE
# 17: sue 2010-04-27 2010-06-30 3 0 FALSE
# 18: sue 2010-04-27 2010-06-30 3 0 FALSE
# 19: sue 2010-05-06 2010-08-31 4 0 FALSE
# 20: sue 2010-06-08 2010-09-30 5 0 FALSE
# 21: mike 2010-05-01 2010-07-30 1 NA NA
# 22: mike 2010-06-01 2010-07-30 1 0 FALSE
# 23: mike 2010-11-12 2011-07-30 2 1 TRUE
# names date_of_claim elig_end_date obs gaps gaps2
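The key step is comparing each claim with the previous row's elig_end_date inside the same person, without a loop. A base R sketch of that lag-within-group idea, returning 0/1 instead of TRUE/FALSE, might look like the following. It uses only the tom and mike rows, assumes rows are already ordered by date_of_claim within each person, and the helper name lag_within is made up for illustration.

```r
DF <- data.frame(
  names = c(rep("tom", 4), rep("mike", 3)),
  date_of_claim = as.Date(c("2010-01-01", "2010-05-04", "2010-06-01", "2010-10-10",
                            "2010-05-01", "2010-06-01", "2010-11-12")),
  elig_end_date = as.Date(c("2010-07-01", "2010-07-01", "2014-01-01", "2014-01-01",
                            "2010-07-30", "2010-07-30", "2011-07-30")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift a vector down by one, putting NA in the first slot.
lag_within <- function(x) c(NA, head(x, -1))

# Previous row's eligibility end date within each person (as days since 1970-01-01).
prev_end <- ave(as.numeric(DF$elig_end_date), DF$names, FUN = lag_within)

# 1 when the claim falls more than 30 days after the previous end date, else 0.
DF$gaps <- as.integer(as.numeric(DF$date_of_claim) > prev_end + 30)
DF
```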
I want to establish a cohort of new users of drugs (Ray 2003). My original dataset is huge (approximately 19 million rows), so a loop is proving inefficient. Here is a dummy dataset (done with fruits instead of drugs):
df2
names dates age sex fruit
1 tom 2010-02-01 60 m apple
2 mary 2010-05-01 55 f orange
3 tom 2010-03-01 60 m banana
4 john 2010-07-01 57 m kiwi
5 mary 2010-07-01 55 f apple
6 tom 2010-06-01 60 m apple
7 john 2010-09-01 57 m apple
8 mary 2010-07-01 55 f orange
9 john 2010-11-01 57 m banana
10 mary 2010-09-01 55 f apple
11 tom 2010-08-01 60 m kiwi
12 mary 2010-11-01 55 f apple
13 john 2010-12-01 57 m orange
14 john 2011-01-01 57 m apple
I have identified people who were prescribed an apple between 04-2010 and 10-2010:
temp2
names dates age sex fruit
6 tom 2010-06-01 60 m apple
5 mary 2010-07-01 55 f apple
7 john 2010-09-01 57 m apple
I would like to make a new column in the original DF called "index", which is the first date that a person was prescribed a drug in the defined date range. This is what I have tried to get the dates from temp2 into df2$index:
df2$index<-temp2$dates
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)
I'm not doing this right - as none of these work. This is the desired output.
df2
names dates age sex fruit index
1 tom 2010-02-01 60 m apple <NA>
2 mary 2010-05-01 55 f orange <NA>
3 tom 2010-03-01 60 m banana <NA>
4 john 2010-07-01 57 m kiwi <NA>
5 mary 2010-07-01 55 f apple 2010-07-01
6 tom 2010-06-01 60 m apple 2010-06-01
7 john 2010-09-01 57 m apple 2010-09-01
8 mary 2010-07-01 55 f orange <NA>
9 john 2010-11-01 57 m banana <NA>
10 mary 2010-09-01 55 f apple <NA>
11 tom 2010-08-01 60 m kiwi <NA>
12 mary 2010-11-01 55 f apple <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple <NA>
Once I have the desired output, I want to trace back from the index date to see if any person had an apple in the previous 180 days. If they did not have an apple, I want to keep them. If they did have an apple (e.g., tom), I want to discard them. This is the code I have tried on the desired output:
df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me
I would appreciate any guidance at all on these questions, even a pointer to what I should read to learn how to do this. Perhaps my logic is flawed and my method won't work; please tell me if that's the case! Thank you in advance.
Here is my df:
names<-c("tom", "mary", "tom", "john", "mary",
"tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01",
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
"2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01",
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
"apple", "apple", "apple", "orange", "banana", "apple",
"kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
"f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2
Here is temp2:
data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates< "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ]
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL
SOLUTION
df2 <- df2[with(df2, order(names, dates)), ]
## DWin's code: index date per name and fruit, i.e. the first date inside the window
df2$first.date <- ave(df2$dates, df2$names, df2$fruit,
FUN=function(dt) dt[dt <= "2010-10-31" & dt >= "2010-04-01"][1])
## TRUE for an apple row in the 180 days before the index date (so tom is not a new user)
df2$x <- df2$fruit == 'apple' & df2$dates > df2$first.date - 180 & df2$dates < df2$first.date
ids <- with(df2, unique(names[which(x)])) ## ids with at least one TRUE; which() drops NA
new_users <- subset(df2, !names %in% ids) ## drops every id that has at least one TRUE
First order by name and date:
df <- df[with(df, order(names, dates)), ]
Then just pick the first date within each name:
df$first.date <- ave(df$dates, df$names, FUN=function(d) d[1])
Now you will see "the power of the fully operational Death Star", er, the ave function. You are ready to pick out the first date within each combination of name and fruit that falls within that date range:
> df$first.date <- ave(df$date, df$name, df$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
names dates age sex fruit first.date
4 john 2010-07-01 57 m kiwi 2010-07-01
7 john 2010-09-01 57 m apple 2010-09-01
9 john 2010-11-01 57 m banana <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple 2010-09-01
2 mary 2010-05-01 55 f orange 2010-05-01
5 mary 2010-07-01 55 f apple 2010-07-01
8 mary 2010-07-01 55 f orange 2010-05-01
10 mary 2010-09-01 55 f apple 2010-07-01
12 mary 2010-11-01 55 f apple 2010-07-01
1 tom 2010-02-01 60 m apple 2010-06-01
3 tom 2010-03-01 60 m banana <NA>
6 tom 2010-06-01 60 m apple 2010-06-01
11 tom 2010-08-01 60 m kiwi 2010-08-01
Since you have 19 million rows, I think you should try a data.table solution. Here is my attempt. The result differs slightly from @DWin's because I first filter the data to the (begin, end) range and then create a new index variable, which is the minimum date occurring in that range for each (names, fruit) pair.
library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates,as.Date("2010-04-01"),as.Date("2010-10-31")),
index := as.character(min(dates))
, by=c('names','fruit')]
## names dates age sex fruit index
## 1: john 2010-07-01 57 m kiwi 2010-07-01
## 2: john 2010-09-01 57 m apple 2010-09-01
## 3: john 2010-11-01 57 m banana NA
## 4: john 2010-12-01 57 m orange NA
## 5: john 2011-01-01 57 m apple NA
## 6: mary 2010-05-01 55 f orange 2010-05-01
## 7: mary 2010-07-01 55 f apple 2010-07-01
## 8: mary 2010-07-01 55 f orange 2010-05-01
## 9: mary 2010-09-01 55 f apple 2010-07-01
## 10: mary 2010-11-01 55 f apple NA
## 11: tom 2010-02-01 60 m apple NA
## 12: tom 2010-03-01 60 m banana NA
## 13: tom 2010-06-01 60 m apple 2010-06-01
## 14: tom 2010-08-01 60 m kiwi 2010-08-01
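Neither answer above spells out the 180-day lookback on its own, so here is a base R sketch of that last step. It makes several assumptions: the index date is each person's first apple inside the window, the window ends on 2010-10-31 as in the code above, and a person is dropped if any apple falls in the 180 days before their index date.

```r
df2 <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

win_start <- as.Date("2010-04-01")
win_end <- as.Date("2010-10-31")  # assumption: inclusive end of the window

# Apple rows that fall inside the window, as numeric dates (NA everywhere else).
in_win <- df2$fruit == "apple" & df2$dates >= win_start & df2$dates <= win_end
d_in <- ifelse(in_win, as.numeric(df2$dates), NA)

# Index date: each person's earliest in-window apple, repeated on all their rows.
index_num <- ave(d_in, df2$names, FUN = function(x) {
  if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
})
df2$index <- as.Date(index_num, origin = "1970-01-01")

# An apple in the 180 days before the index date marks a prevalent (not new) user.
prior <- df2$fruit == "apple" & df2$dates < df2$index & df2$dates >= df2$index - 180
drop_ids <- unique(df2$names[which(prior)])

new_users <- df2[!df2$names %in% drop_ids, ]
new_users
```

On the dummy data this drops tom, whose apple on 2010-02-01 falls within 180 days before his index date of 2010-06-01, and keeps john and mary.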