Using ifelse on two data frames [closed] - r

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I'm trying to create a new column in my data frame using an ifelse condition like this:
Let's assume two data frames A and B, both having date and time columns.
If a date in A matches a date in B, and either the time in A equals the time in B for that matching date, or the time in B is lower than the time in A's next row, then TRUE, else FALSE.
I hope this is clear enough. So far I have tried something like this:
A %>% mutate(DFT = ifelse(Dayt == B$date & Tyme == B$time |
                            Tyme > Time[which(Dayt == B$date & Tyme == B$time) + 1],
                          B[which(which(Dayt == B$date & Tyme == B$time) + 1)], NA))
This code may not work but I hope it gives an idea of what I'm trying to achieve. Any help would be appreciated.

reproducible example
set.seed(1)
A = data.frame(
date=seq(as.Date("2017/1/1"), as.Date("2017/1/10"), "days"))
B = data.frame(date=seq(as.Date("2017/1/2"), as.Date("2017/1/9"), "days"))
A$time <- sample(1:3,length(A$date),TRUE)
B$time <- sample(1:3,length(B$date),TRUE)
A
date time
1: 2017-01-01 1
2: 2017-01-02 2
3: 2017-01-03 2
4: 2017-01-04 3
5: 2017-01-05 1
6: 2017-01-06 3
7: 2017-01-07 3
8: 2017-01-08 2
9: 2017-01-09 2
10: 2017-01-10 1
B
date time
1: 2017-01-02 1
2: 2017-01-03 1
3: 2017-01-04 3
4: 2017-01-05 2
5: 2017-01-06 3
6: 2017-01-07 2
7: 2017-01-08 3
8: 2017-01-09 3
solution
Here is a solution: first I merge by date, then I filter using the time condition.
library(data.table)
setDT(A)
setDT(B)
merge(A,B,by="date")[time.x==time.y | time.y==c(tail(time.x,-1),NA)]
date time.x time.y
1: 2017-01-04 3 3
2: 2017-01-06 3 3
3: 2017-01-07 3 2
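For comparison, the same merge-then-filter logic could also be written with dplyr, which the question started from. This is only a sketch, not part of the original answer:
library(dplyr)

# sketch: inner-join on date, then keep rows where the times match or
# B's time equals the next row's time in A (same rule as the data.table line)
inner_join(A, B, by = "date", suffix = c(".x", ".y")) %>%
  filter(time.x == time.y | time.y == lead(time.x))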

Related

Create date variable that is either 31 days after a certain date, or is the maximum date per ID

I have data that looks something like this:
ID Event Date
A 0 2015-01-01
A 0 2015-02-01
A 1 2015-03-30
B 0 2016-02-28
B 0 2016-03-30
B 0 2016-04-30
C 0 2015-01-01
I'd like to create a variable called "Date2" so that if someone's Event is 1, their new date is 31 days after the corresponding date in which their Event==1. However, if an individual never has Event==1 (as in individuals B and C), I would like their dates set as the last date observed. My desired output is as follows:
ID Event Date Date2
A 0 2015-01-01 2015-05-01
A 0 2015-02-01 2015-05-01
A 1 2015-03-31 2015-05-01
B 0 2016-02-28 2016-04-30
B 0 2016-03-31 2016-04-30
B 0 2016-04-30 2016-04-30
C 0 2015-01-01 2015-01-01
So far, I have tried:
setDT(data)
data[, Date2 := max(Date)]
data[data[Event == 1, .I[1], by=c("ID")]$V1, Date2:= as.Date(Date[which(Event == 1)], format="%Y-%m-%d") + 31]
While Date2 for whomever has Event==1 is correct, my Date2 for all others ends up being the maximum Date from the entire data set, so 2016-04-30, in this case.
Would appreciate any help.
Thank you!!
Assuming that there is only one row where Event==1L, you can use if in j as follows:
data[, Date2 := if (any(Event==1L)) Date[Event==1] + 31L else max(Date), by=.(ID)]
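For readers coming from dplyr, a grouped equivalent of the same idea might look like this (a sketch, not part of the original answer; it assumes Date is already stored as class Date and, as above, at most one Event==1 row per ID):
library(dplyr)

# sketch: per ID, use the Event==1 date + 31 days if it exists, else the max date
data %>%
  group_by(ID) %>%
  mutate(Date2 = if (any(Event == 1)) Date[Event == 1][1] + 31 else max(Date)) %>%
  ungroup()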
I think you just need to set the by parameter of your data.table call to ID and include the columns Event and Date alongside the computed Date2. So, doing this:
setDT(data)
data[,.(Event,Date,Date2=if(sum(Event)!=0) {Date[.N]+31} else {Date[.N]}),by=ID]
The result is
ID Event Date Date2
1: A 0 2015-01-01 2015-05-01
2: A 0 2015-02-01 2015-05-01
3: A 1 2015-03-31 2015-05-01
4: B 0 2016-02-28 2016-04-30
5: B 0 2016-03-30 2016-04-30
6: B 0 2016-04-30 2016-04-30
7: C 0 2015-01-01 2015-01-01
Let's create a function that rips out what we want for a specific ID, then we'll apply it piece-wise to the data frame, and shove it all back together at the end :)
First, so we're explicit, let's make sure the Date column is stored as type Date. If you're unsure, use class(df$Date) to check.
df$Date <- as.Date(df$Date)
Now for the fun stuff.
The function
date_adder <- function(df) {
  # Look for a 1 in the Event column
  event_match <- match(1, df$Event)
  # If we found a match
  if (!is.na(event_match)) {
    return(df$Date[event_match] + 31)
  } else { # If there was no match
    # Find the biggest date they have on record
    # I took the latest date as 'biggest'
    # If you want the last row of the data frame instead, use nrow(df)
    last_element <- which.max(df$Date)
    return(df$Date[last_element])
  }
}
This function expects a data frame for a specific ID, i.e. one that just has that ID's Events and Dates. If it finds an event, it adds 31 days to that date and returns it; otherwise it returns the latest date it can find (I've left a comment for the case where that is not your intention).
To make this function usable, just execute it like any other code.
List of IDs and Date2s
date_list <- plyr::ddply(df, "ID", date_adder)
This takes advantage of a function from the plyr package that applies a function to subsets of a data frame. Here I subset on ID and apply our date_adder function, so for each ID it does what I've described up top. It returns a data frame like this:
ID V1
1 A 2015-04-30
2 B 2016-04-30
3 C 2015-01-01
I assume this is right, as 2015-04-30 is 31 days later, not the first of May like you indicated.
Piece it together
df$Date2 <- date_list[match(df$ID, date_list$ID),2]
With all that done we just sew it together based on the matching IDs. And voila, you've got a solution :)
ID Event Date Date2
1 A 0 2015-01-01 2015-04-30
2 A 0 2015-02-01 2015-04-30
3 A 1 2015-03-30 2015-04-30
4 B 0 2016-02-28 2016-04-30
5 B 0 2016-03-30 2016-04-30
6 B 0 2016-04-30 2016-04-30
7 C 0 2015-01-01 2015-01-01

Convert daily to weekly data and deal with holidays

I have a data table containing daily data. From this data table I want to extract weekly data points obtained each Wednesday. If Wednesday is a holiday, i.e. not available in the data table, the next available data point should be taken.
Here is an MWE:
library(data.table)
df <- data.table(date=as.Date(c("2012-06-25","2012-06-26","2012-06-27","2012-06-28","2012-06-29","2012-07-02","2012-07-03","2012-07-05","2012-07-06","2012-07-09","2012-07-10","2012-07-11","2012-07-12","2012-07-13","2012-07-16","2012-07-17","2012-07-18","2012-07-19","2012-07-20")))
df[,weekday:=strftime(date,'%u')]
with output:
date weekday
1: 2012-06-25 1
2: 2012-06-26 2
3: 2012-06-27 3
4: 2012-06-28 4
5: 2012-06-29 5
6: 2012-07-02 1
7: 2012-07-03 2
8: 2012-07-05 4 #here the 4th of July was skipped
9: 2012-07-06 5
10: 2012-07-09 1
11: 2012-07-10 2
12: 2012-07-11 3
13: 2012-07-12 4
14: 2012-07-13 5
15: 2012-07-16 1
16: 2012-07-17 2
17: 2012-07-18 3
18: 2012-07-19 4
19: 2012-07-20 5
My desired result, in this case would be:
date weekday
2012-06-27 3
2012-07-05 4
2012-07-11 3
2012-07-18 3
Is there a more efficient way of obtaining this than going week-by-week via for loop and checking whether the Wednesday data point is included in the data or not? I feel that there must be a better way, so any advice would be highly appreciated!
Working solution (following Imo's suggestion):
df[,weekday:=wday(date)] #faster way to get weekdays, careful: numbers increased by 1 vs strftime
df[,numweek:=floor(as.numeric(date-date[1])/7+1)] #get continuous week numbers extending over end of years
df[df[,.I[which.min(abs(weekday-4.25))],by=.(numweek)]$V1] #gets result
Here is one method using a join on a data.table that finds, by week, the position (using .I) of the value closest to 3, preferring 4 over 2 via which.min(abs(as.integer(weekday)-3.25)).
df[df[, .I[which.min(abs(as.integer(weekday)-3.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 3
2: 2012-07-05 4
3: 2012-07-11 3
4: 2012-07-18 3
Note that if your real data spans years, then you need to use by=.(week(date), year(date)).
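For instance, a multi-year version of the line above might look like this (a sketch; the sample data spans a single year, so the output here would be unchanged):
# sketch: same which.min trick, but grouped by both week and year so that
# week numbers from different years are not mixed together
df[df[, .I[which.min(abs(as.integer(weekday) - 3.25))],
      by = .(week(date), year(date))]$V1]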
Note also that there is a data.table function wday that returns the integer day of the week directly. It is 1 greater than the integer-like character value returned by strftime, so an adjustment is required if you want to use it directly.
From your data.table with a single variable, you'd do
df[, weekday := wday(date)]
df[df[, .I[which.min(abs(weekday-4.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 4
2: 2012-07-05 5
3: 2012-07-11 4
4: 2012-07-18 4
Note that the dates match those above.

How to do a BETWEEN merge the data.table way?

I have two data.tables that are each 5-10GB in size. They look similar to the following.
library(data.table)
A <- data.table(
person = c(1,1,1,2,3,3,3,3,4,4),
datetime = c(
'2015-04-06 14:22:18',
'2015-04-07 02:55:32',
'2015-11-21 10:16:05',
'2015-10-03 13:37:29',
'2015-02-26 23:51:56',
'2015-05-16 18:21:44',
'2015-06-02 04:07:43',
'2015-11-28 15:22:36',
'2015-01-19 04:10:22',
'2015-01-24 02:18:11'
)
)
B <- data.table(
person = c(1,1,3,4,4,5),
datetime2 = c(
'2015-04-06 14:24:59',
'2015-11-28 15:22:36',
'2015-06-02 04:07:43',
'2015-01-19 06:10:22',
'2015-01-24 02:18:18',
'2015-04-06 14:22:18'
)
)
A$datetime <- as.POSIXct(A$datetime)
B$datetime2 <- as.POSIXct(B$datetime2)
The idea is to find rows in B where the datetime is within 0-10 minutes of a matching row in A (matching is done by person) and mark them in A. The question is how can I do it most efficiently using data.table?
One plan is to join the two data tables based on person only, then calculate the time difference and find rows where the time difference is between 0 and 600 seconds, and finally outer join the latter with A:
setkey(A,person)
AB <- A[B,.(datetime,
datetime2,
diff = difftime(datetime2, datetime, units = "secs"))
, by = .EACHI]
M <- AB[diff < 600 & diff > 0]
setkey(A, person, datetime)
setkey(M, person, datetime)
M[A,]
Which gives us the correct result:
person datetime datetime2 diff
1: 1 2015-04-06 14:22:18 2015-04-06 14:24:59 161 secs
2: 1 2015-04-07 02:55:32 <NA> NA secs
3: 1 2015-11-21 10:16:05 <NA> NA secs
4: 2 2015-10-03 13:37:29 <NA> NA secs
5: 3 2015-02-26 23:51:56 <NA> NA secs
6: 3 2015-05-16 18:21:44 <NA> NA secs
7: 3 2015-06-02 04:07:43 <NA> NA secs
8: 3 2015-11-28 15:22:36 <NA> NA secs
9: 4 2015-01-19 04:10:22 <NA> NA secs
10: 4 2015-01-24 02:18:11 2015-01-24 02:18:18 7 secs
However, I am not sure if this is the most efficient way. Specifically, I am using AB[diff < 600 & diff > 0] which I assume will run a vector search not a binary search, but I cannot think of how to do it using a binary search.
Also, I am not sure if converting to POSIXct is the most efficient way of calculating time differences.
Any ideas on how to improve efficiency are highly appreciated.
data.table's rolling join is perfect for this task:
B[, datetime := datetime2]
setkey(A,person,datetime)
setkey(B,person,datetime)
B[A,roll=-600]
person datetime2 datetime
1: 1 2015-04-06 14:24:59 1428319338
2: 1 NA 1428364532
3: 1 NA 1448090165
4: 2 NA 1443868649
5: 3 NA 1424983916
6: 3 NA 1431789704
7: 3 2015-06-02 04:07:43 1433207263
8: 3 NA 1448713356
9: 4 NA 1421629822
10: 4 2015-01-24 02:18:18 1422055091
The only difference from your expected output is that it treats the time difference as less than or equal to 10 minutes (<=). If that is a problem for you, you can just delete the equal matches, as sketched below.
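For example, one way to drop those boundary matches afterwards could be the following (a sketch, not from the original answer; res and diff are helper names introduced here):
# sketch: keep non-matches (NA) and matches strictly inside the 0-600 s window
res <- B[A, roll = -600]
res[, diff := as.numeric(datetime2) - as.numeric(datetime)]
res[is.na(diff) | (diff > 0 & diff < 600)]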

data.table outer join based on groups in R

I have a data with the following columns:
CaseID, Time, Value.
The Time column values are not at regular intervals of 1 minute. I am trying to add rows for the missing Time values, with NA in the rest of the columns except CaseID.
Case Value Time
1 100 07:52:00
1 110 07:53:00
1 120 07:55:00
2 10 08:35:00
2 11 08:36:00
2 12 08:38:00
Desired output:
Case Value Time
1 100 07:52:00
1 110 07:53:00
1 NA 07:54:00
1 120 07:55:00
2 10 08:35:00
2 11 08:36:00
2 NA 08:37:00
2 12 08:38:00
I tried dt[CJ(unique(CaseID),seq(min(Time),max(Time),"min"))] but it gives the following error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), :
Join results in 9827315 rows; more than 9620640 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
I cannot make it work... any help would be appreciated.
Like this??
dt[,Time:=as.POSIXct(Time,format="%H:%M:%S")]
result <- dt[,list(Time=seq(min(Time),max(Time),by="1 min")),by=Case]
setkey(result,Case,Time)
setkey(dt,Case,Time)
result <- dt[result][,Time:=format(Time,"%H:%M:%S")]
result
# Case Value Time
# 1: 1 100 07:52:00
# 2: 1 110 07:53:00
# 3: 1 NA 07:54:00
# 4: 1 120 07:55:00
# 5: 2 10 08:35:00
# 6: 2 11 08:36:00
# 7: 2 NA 08:37:00
# 8: 2 12 08:38:00
Another way:
dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
setkey(dt, Time)
dt[, .SD[J(seq(min(Time), max(Time), by='1 min'))], by=Case]
We group by Case and join on Time within each group using .SD (hence setting the key on Time). From here you can use format() as shown above.
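Putting that second approach together might look like this (a sketch reusing the steps already shown above; res2 is a helper name introduced here):
# sketch: group by Case, join the per-group minute sequence, then restore text
res2 <- dt[, .SD[J(seq(min(Time), max(Time), by = "1 min"))], by = Case]
res2[, Time := format(Time, "%H:%M:%S")]
res2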

Fastest way for filling-in missing dates for data.table

I am loading a data.table from CSV file that has date, orders, amount etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan would reuse 02-Jan's values and 06-Jan would reuse 05-Jan's values, etc.).
What is the best/optimal way to fill in such gaps of missing dates with default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays. That may work for weekdays (since there are just 7 of them), but I am not sure it would be the right way to handle dates as well, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
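The question also mentions filling the gaps with default values such as zero instead of carrying values forward. One way to do that could be the following (a sketch, not part of the original answer; setnafill requires data.table >= 1.12.4, and filled is a helper name introduced here):
# sketch: a plain (non-rolling) join gives NA rows for the missing dates,
# then fill the NAs in the numeric columns with 0
filled <- NADayWiseOrders[J(all_dates)]
setnafill(filled, fill = 0, cols = c("orders", "amount", "guests"))
filled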
Here is how you fill in the gaps within each subgroup:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)
