I am trying to determine (estimate) the current status based on the latest known status.
Assume we have the following data frame (an abbreviated version of the real data):
examineData
ID Date Status_Value
A 2012-01-01 100
A 2012-01-10 200
A 2012-02-20 500
B 2012-01-01 1100
B 2012-01-10 1200
B 2012-02-20 1500
C 2012-01-01 2100
C 2012-01-10 2200
C 2012-02-20 2500
In the above, A, B and C are objects that each have a Status_Value. The values were examined on the given Date.
asked
ID Date
A 2012-01-09
A 2012-02-28
B 2012-02-19
C 2012-01-10
Now someone asks about the status of A, B and C (or only some of them) on specific dates.
As you can see, some values of asked$Date do not match any examineData$Date.
In that case, we take the latest record on or before that date from examineData, so the desired result is:
ID Date Status_Value
A 2012-01-09 100
A 2012-02-28 500
B 2012-02-19 1200
C 2012-01-10 2200
Could you give me some sample code? (Speed is important: examineData has 1,600,000 rows and asked has 110,000 rows.)
In addition, there are over 60,000 distinct IDs, and no ID has duplicate dates in examineData.
This seems to work:
examineData$Date <- as.Date(examineData$Date, format = "%Y-%m-%d")
asked$Date <- as.Date(asked$Date, format = "%Y-%m-%d")
#res <- unlist(lapply(split(examineData, examineData$ID),
# function(x) { merged <- c(x$Date, asked$Date[asked$ID == unique(x$ID)]) ;
# x$Status_Value[which(order(merged) %in% length(merged)) - 1] }))
I guess, though, a data.table solution might be more efficient than this.
EDIT: Modified solution, now allowing for duplicate IDs in asked:
#dates should, still, be turned into actual dates if they aren't
#function to (m)apply over asked
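fun <- function(id, date)
{
  # assumes the rows of examineData are sorted by Date within each ID
  subsetted_examineData <- examineData[examineData$ID == id, ]
  merged <- c(subsetted_examineData$Date, date)
  # rank of the asked date among all dates, minus 1, indexes the latest record on or before it
  res <- subsetted_examineData$Status_Value[which(order(merged) %in% length(merged)) - 1]
  return(res)
}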
res <- mapply(fun, asked$ID, asked$Date)
res
# A A B C
# 100 500 1200 2200
cbind(asked, Status_Value = unname(res))
# ID Date Status_Value
#1 A 2012-01-09 100
#2 A 2012-02-28 500
#3 B 2012-02-19 1200
#4 C 2012-01-10 2200
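As suspected above, data.table may scale better for data of the size mentioned in the question. The following is only a minimal sketch, assuming the data.table package is installed and that "latest record on or before the asked date" is the intended rule. It uses a rolling join (roll = TRUE), which carries the last observation forward within each ID. The sample data is retyped from the question, and `rolled` is just an illustrative name.

```r
library(data.table)

# Sample data retyped from the question
examineData <- data.table(
  ID = rep(c("A", "B", "C"), each = 3),
  Date = as.Date(rep(c("2012-01-01", "2012-01-10", "2012-02-20"), times = 3)),
  Status_Value = c(100, 200, 500, 1100, 1200, 1500, 2100, 2200, 2500)
)
asked <- data.table(
  ID = c("A", "A", "B", "C"),
  Date = as.Date(c("2012-01-09", "2012-02-28", "2012-02-19", "2012-01-10"))
)

# For each row of asked, match on ID and roll the last examined
# Status_Value forward to the asked Date
rolled <- examineData[asked, on = .(ID, Date), roll = TRUE]
rolled
```

An asked date that precedes every examined date for its ID should come back with NA in Status_Value, so such rows are easy to spot afterwards.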
ids <- unique(examineData$ID)
sel <- character(length(ids))
for (i in seq_along(ids)) {
  set <- subset(examineData, ID == ids[i])
  # note: assumes one asked date per ID; with several, R recycles the vectors in this subtraction
  dif <- asked[asked$ID == ids[i], "Date"] - set$Date
  dif[dif < 0] <- NA
  sel[i] <- row.names(set)[which.min(dif)]
}
examineData[sel, ]
To get this
ID Date Status_Value
1 A 2012-01-01 100
5 B 2012-01-10 1200
8 C 2012-01-10 2200
You can build in some "corrections" for missing values, but since you have not specified any, this is the clean way.
I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).
I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).
> df1 <- data.frame(ID=c(1,1,1,2,2,3,4),
enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'),
+ exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
ID enter.date exit.date
1 1 2015-05-07 2015-07-01
2 1 2015-07-10 2015-10-15
3 1 2017-08-25 2017-09-03
4 2 2016-09-01 2016-09-30
5 2 2018-01-05 2019-06-01
6 3 2016-05-01 2017-05-01
7 4 2017-04-08 2017-06-08
and the other has "eval" dates.
> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
ID eval.date
1 1 2015-10-30
2 2 2016-10-10
3 2 2019-09-10
4 3 2018-05-15
5 4 2015-01-19
I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.
In other words, I'm trying to get an output that looks like this from the two dataframes above.
> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'),
+ exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+ assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
ID enter.date exit.date assess.date time.diff
1 1 2015-07-10 2015-10-15 2015-10-30 -15 days
2 2 2016-09-01 2016-09-30 2016-10-10 -10 days
3 2 2018-01-05 2019-06-01 2019-09-10 -101 days
4 3 2016-05-01 2017-05-01 2018-05-15 -379 days
Once I perform the merge finding the averages is easy enough with
> aggregate(df3[,5], list(df3$ID), mean)
Group.1 x
1 1 -15.0
2 2 -55.5
3 3 -379.0
but I'm really at a loss as to how to perform the merge. I've tried to use leftjoin and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced at R and couldn't figure it out. I would really appreciate if someone could walk me through it - thanks!
A few other descriptive notes about the data: each ID may have some number of rows associated with it in each dataframe. df1 has enter dates which mark the beginning of a service delivery and exit dates that mark the end of a service delivery. All enters have one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be no evals.
Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:
df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;
library(sqldf)
df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
  df$next.date = as.Date('2100-12-31')
  if (nrow(df) > 1)
    df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
  df
}), df1$ID)
sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
from df1, df2
where df1.ID == df2.ID
and df2."eval.date" between df1."exit.date"
and df1."next.date"')
ID enter.date exit.date next.date ID..5 eval.date time.diff
1 1 2015-07-10 2015-10-15 2017-08-25 1 2015-10-30 -15
2 2 2016-09-01 2016-09-30 2018-01-05 2 2016-10-10 -10
3 2 2018-01-05 2019-06-01 2100-12-31 2 2019-09-10 -101
4 3 2016-05-01 2017-05-01 2100-12-31 3 2018-05-15 -379
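For readers who prefer to stay in base R, the same idea can be sketched without sqldf. This is only an illustration under the assumptions stated in the question (no eval falls between an enter and its exit): compute the next enter date per ID, pair every service period with every eval of the same ID via merge, and keep the pairs where the eval falls after the exit and before the next enter. The names `next_num`, `m`, `keep` and `avg` are made up for this example, and NA is used instead of a far-future sentinel date.

```r
df1 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 4),
  enter.date = as.Date(c("2015-05-07", "2015-07-10", "2017-08-25", "2016-09-01",
                         "2018-01-05", "2016-05-01", "2017-04-08")),
  exit.date = as.Date(c("2015-07-01", "2015-10-15", "2017-09-03", "2016-09-30",
                        "2019-06-01", "2017-05-01", "2017-06-08"))
)
df2 <- data.frame(
  ID = c(1, 2, 2, 3, 4),
  eval.date = as.Date(c("2015-10-30", "2016-10-10", "2019-09-10", "2018-05-15",
                        "2015-01-19"))
)

# Sort so that "next enter date" is well defined within each ID
df1 <- df1[order(df1$ID, df1$enter.date), ]

# Next enter date per ID; NA marks the last service period of an ID
next_num <- ave(as.numeric(df1$enter.date), df1$ID,
                FUN = function(x) c(x[-1], NA))
df1$next.date <- as.Date(next_num, origin = "1970-01-01")

# Pair every period with every eval of the same ID, then keep valid pairs
m <- merge(df1, df2, by = "ID")
keep <- m$eval.date > m$exit.date &
  (is.na(m$next.date) | m$eval.date < m$next.date)
df3 <- m[keep, ]
df3$time.diff <- as.numeric(df3$exit.date - df3$eval.date)

# Mean interval per ID
avg <- tapply(df3$time.diff, df3$ID, mean)
avg
```

Note that merge builds every period/eval combination per ID before filtering, so on very large data a database-style or data.table non-equi join may be the more economical choice.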
I want to create a Date sequence as follows:
firstyear <- seq(as.Date('2000-01-01'),by='8 day',length=46)
and then append the equivalent sequence for each following year, up to the year 2017.
The final sequence should contain 46*18 elements, like this:
2000-01-01
2000-01-09
...
2000-12-26
2001-01-01
...
2001-12-26
...
2017-12-26
How can I generate this Date sequence compactly?
Using sapply
a=c(2000:2017)
yourlist=as.Date(sapply(a,function(x) seq(as.Date(paste0(as.character(x),'-01-01')),by='8 day',length=46)),origin='1970-01-01')
You can create a function which will vary your date generation for you. Notice that I've transformed the output to a data.frame to preserve dates in "native" form.
yearSequence <- function(x) {
data.frame(variable = seq(as.Date(sprintf('%s-01-01', x)), by = '8 day', length = 46))
}
You can apply the function to the years you want.
out <- sapply(2000:2017, FUN = yearSequence, simplify = FALSE)
Combine result as a data.frame.
result <- do.call(rbind, out)
> head(result)
variable
1 2000-01-01
2 2000-01-09
3 2000-01-17
4 2000-01-25
5 2000-02-02
6 2000-02-10
> tail(result)
variable
823 2017-11-17
824 2017-11-25
825 2017-12-03
826 2017-12-11
827 2017-12-19
828 2017-12-27
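If a plain Date vector is wanted rather than a data frame, a compact variant is to build one sequence per year with lapply and concatenate the pieces with c(), which keeps the Date class. This is a sketch of that variant; `years` and `all_dates` are illustrative names.

```r
years <- 2000:2017

# One 46-element sequence per year, restarting on January 1 each time
all_dates <- do.call(c, lapply(years, function(y) {
  seq(as.Date(paste0(y, "-01-01")), by = "8 days", length.out = 46)
}))

length(all_dates)   # 46 * 18 elements
class(all_dates)    # still "Date"
all_dates[45:48]    # shows the restart at the year boundary
```

Because each year restarts on January 1, the 46th date is December 26 in leap years such as 2000 and December 27 in other years, which is why the tail() output above ends on 2017-12-27.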
I have a vector of dates:
dates <- seq(as.Date('2017-01-01'), as.Date('2017-12-31'), by = 'days')
I want to create a data frame where this vector is repeated for n rows. Can anyone tell me how I might be able to accomplish this? Any help is greatly appreciated.
Thanks for the suggestions so far. Unfortunately, I think my intention was unclear in my original question. I would like each of n rows in the data frame to contain the vector of dates so that the final data frame would look something like this:
1 2017-01-01 2017-01-02.....2017-12-31
2 2017-01-01 2017-01-02.....2017-12-31
3 2017-01-01 2017-01-02.....2017-12-31
.
.
.
n 2017-01-01 2017-01-02.....2017-12-31
You can use rep to repeat the vector and then coerce to a dataframe. For example, repeating 10 times
num_repeat <- 10
dates <- data.frame(rep(
seq(as.Date('2017-01-01'), as.Date('2017-12-31'), by = 'days'),
times = num_repeat))
As the question asker is hoping to fill n rows, wouldn't it make more sense to specify length.out rather than times?
set.seed(1)
dtf <- data.frame(A=letters[sample(1:27, 1000, TRUE)])
dtf$B <- rep(dates, length.out=nrow(dtf))
tail(dtf)
# A B
# 995 d 2017-09-22
# 996 u 2017-09-23
# 997 r 2017-09-24
# 998 h 2017-09-25
# 999 f 2017-09-26
# 1000 h 2017-09-27
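To make the difference between the two arguments concrete, here is a small sketch: times repeats the whole vector a fixed number of times, while length.out recycles it until a target length is reached and may stop partway through a repetition. The names `by_times` and `by_length` are only for illustration.

```r
dates <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "days")

# times: the full 365-day vector, three times over
by_times <- rep(dates, times = 3)

# length.out: recycle until exactly 1000 values exist,
# stopping in the middle of the third pass
by_length <- rep(dates, length.out = 1000)

length(by_times)
length(by_length)
by_length[1000]   # the 270th day of 2017, matching the tail() output above
```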
We use replicate to do this
n <- 5
out <- do.call(rbind, replicate(n, as.data.frame(as.list(dates)),
simplify = FALSE))
names(out) <- paste0('V', seq_along(out))
dim(out)
#[1] 5 365
out[1:3, 1:3]
# V1 V2 V3
#1 2017-01-01 2017-01-02 2017-01-03
#2 2017-01-01 2017-01-02 2017-01-03
#3 2017-01-01 2017-01-02 2017-01-03
out[1:3, 362:365]
# V362 V363 V364 V365
#1 2017-12-28 2017-12-29 2017-12-30 2017-12-31
#2 2017-12-28 2017-12-29 2017-12-30 2017-12-31
#3 2017-12-28 2017-12-29 2017-12-30 2017-12-31
I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are included in ns and vice versa.
For each row of data, I am trying to find whether there is an ns$date within the 14 days before data$date where data$id==ns$id, and to report the difference in days.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE according to whether ns$date[ns$id==data$id] is less than 14 days before the respective data$date, plus a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is where I got so far
# convert dates (ymd() comes from the lubridate package)
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: merge() automatically renames data$date to date.x and ns$date to date.y
# flag rows where ns$date falls 0 to 13 days before data$date
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>=0)
# create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
The question "Find if date is within 14 days if id matches between datasets in R" provides an idea, but its result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id, for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested, I added what the desired output should look like (same number of rows as data = 10, with no rows for ids that are in ns but not in data). I should have thought earlier that this might help.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
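For comparison, the rule itself can be sketched in base R with an explicit loop. This is only an illustration, not a replacement for the data.table approach on large data. To keep it short, `ns` below is an abbreviated subset of the question's ns (a few rows for ids 3, 8, 50, 87 and 282), and the window is taken to be 0 to 13 days before data$date, which is what the desired output implies.

```r
data <- data.frame(
  id = c(8, 8, 50, 87, 141, 161, 192, 216, 257, 282),
  date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18", "2009-04-28",
                   "2010-11-29", "2012-04-02", "2013-01-08", "2007-01-22",
                   "2009-06-03", "2009-12-02"))
)
# Abbreviated subset of the question's ns, for illustration only
ns <- data.frame(
  id = c(3, 8, 50, 50, 50, 87, 87, 87, 282, 282),
  date = as.Date(c("2010-11-04", "2011-02-25", "2009-06-10", "2010-08-17",
                   "2010-08-15", "2008-02-19", "2009-04-28", "2009-05-16",
                   "2007-11-20", "2009-11-28"))
)

data$received <- NA      # stays NA when the id never appears in ns
data$dif <- NA_real_     # stays NA unless a date falls inside the window

for (i in seq_len(nrow(data))) {
  cand <- ns$date[ns$id == data$id[i]]
  if (length(cand) == 0) next
  d <- as.numeric(data$date[i] - cand)   # days from ns$date to data$date
  d <- d[d >= 0 & d < 14]                # keep dates 0 to 13 days earlier
  data$received[i] <- length(d) > 0
  if (length(d) > 0) data$dif[i] <- min(d)
}
data
```

The three states of received (TRUE, FALSE, NA) are kept apart on purpose: FALSE means the id exists in ns but has no date in the window, while NA means the id is absent from ns altogether.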
So my data looks like this
dat<-data.frame(
subjid=c("a","a","a","b","b","c","c","d","e"),
type=c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
date=c("2013-02-07","2013-02-27","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
i.e.
subjid type date
1 a baseline 2013-02-07
2 a first 2013-02-27
3 a second 2013-04-30
4 b baseline 2013-03-03
5 b first 2013-05-23
6 c baseline 2013-01-02
7 c first 2013-07-23
8 d baseline 2013-03-29
9 e baseline 2013-06-03
and I'm trying to make a variable "elapsedtime" that denotes the time elapsed from the baseline date to first and second round interview dates (so that elapsedtime=0 for baselines). Note that it varies individually whether they have taken further interviews.
I tried to reshape the data so that I could subtract each dates but my brain isn't really functioning today--or is there another way?
Please help and thank you.
Screaming out for ave:
I'll throw an NA value in there just for good measure:
dat<-data.frame(
subjid=c("a","a","a","b","b","c","c","d","e"),
type=c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
date=c("2013-02-07","NA","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
And you should probably sort the data to be on the safe side:
dat$type <- ordered(dat$type,levels=c("baseline","first","second","third") )
dat <- dat[order(dat$subjid,dat$type),]
Turn your date into a proper Date object:
dat$date <- as.Date(dat$date)
Then calculate the differences:
dat$elapsed <- ave(as.numeric(dat$date),dat$subjid,FUN=function(x) x-x[1] )
# subjid type date elapsed
#1 a baseline 2013-02-07 0
#2 a first <NA> NA
#3 a second 2013-04-30 82
#4 b baseline 2013-03-03 0
#5 b first 2013-05-23 81
#6 c baseline 2013-01-02 0
#7 c first 2013-07-23 202
#8 d baseline 2013-03-29 0
#9 e baseline 2013-06-03 0
This makes no assumption that baseline is always at position 1 (it does assume the rows are grouped by subjid, since by() returns its results in group order):
dat$date <- as.Date(dat$date)
dat$elapsed <- unlist(by(dat, dat$subjid, FUN=function(x) {
  as.numeric(x$date - x[x$type=="baseline",]$date)
}))
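A variant that depends neither on the position of the baseline row nor on the row order is to look up each subject's baseline date with match() and subtract it. This is a minimal sketch using the question's original data and assuming exactly one baseline row per subject; `is_base` and `base_date` are illustrative names.

```r
dat <- data.frame(
  subjid = c("a", "a", "a", "b", "b", "c", "c", "d", "e"),
  type = c("baseline", "first", "second", "baseline", "first",
           "baseline", "first", "baseline", "baseline"),
  date = as.Date(c("2013-02-07", "2013-02-27", "2013-04-30", "2013-03-03",
                   "2013-05-23", "2013-01-02", "2013-07-23", "2013-03-29",
                   "2013-06-03")),
  stringsAsFactors = FALSE
)

# Baseline date of each row's subject, looked up by subjid
is_base <- dat$type == "baseline"
base_date <- dat$date[is_base][match(dat$subjid, dat$subjid[is_base])]

# Days elapsed since that subject's baseline (0 for the baseline rows)
dat$elapsed <- as.numeric(dat$date - base_date)
dat
```

A subject without any baseline row would get NA here, because match() finds no baseline date to subtract.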