Subset data.table by evaluating multiple columns - r

How to return 1 row for each unique name by most recent (latest) Type?
A data.table with 6 rows:
library(data.table)
example <- data.table(c("Bob","May","Sue","Bob","Sue","Bob"),
                      c("A","A","A","A","B","B"),
                      as.Date(c("2010/01/01", "2010/01/01", "2010/01/01",
                                "2012/01/01", "2012/01/11", "2014/01/01")))
setnames(example, c("Name","Type","Date"))
setkey(example, Name, Date)
Should return 5 rows:
# 1: Bob A 2012-01-01
# 2: Bob B 2014-01-01
# 3: May A 2010-01-01
# 4: Sue A 2010-01-01
# 5: Sue B 2012-01-11

Since you've already sorted by Name and Date, you can use the unique function (which dispatches to unique.data.table) on the columns Name and Type, with fromLast = TRUE.
require(data.table) ## >= v1.9.3
unique(example, by=c("Name", "Type"), fromLast=TRUE)
# Name Type Date
# 1: Bob A 2012-01-01
# 2: Bob B 2014-01-01
# 3: May A 2010-01-01
# 4: Sue A 2010-01-01
# 5: Sue B 2012-01-11
This'll pick the last row for each (Name, Type) group. Hope this helps.
PS: as @mso points out, this needs v1.9.3, because the fromLast argument was implemented only in 1.9.3 (available from GitHub).

The following variations of @Arun's answer also work:
unique(example[rev(order(Name,Date))], by=c("Name", "Type"), fromLast=TRUE)[order(Name,Date)]
Name Type Date
1: Bob A 2012-01-01
2: Bob B 2014-01-01
3: May A 2010-01-01
4: Sue A 2010-01-01
5: Sue B 2012-01-11
unique(example[order(Name, Date, decreasing=TRUE)], by=c("Name","Type"))[order(Name, Date)]
Name Type Date
1: Bob A 2012-01-01
2: Bob B 2014-01-01
3: May A 2010-01-01
4: Sue A 2010-01-01
5: Sue B 2012-01-11
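For comparison, the same result can be obtained by grouping rather than deduplicating. A minimal sketch, assuming the table is still sorted by Date within groups (which the setkey above guarantees):
# take the last (latest-Date) row of each (Name, Type) group via .SD[.N]
example[, .SD[.N], by = .(Name, Type)][order(Name, Date)]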

Related

R data.table apply date by variable last

I have a data.table in R. I need to fill in dates that decrement from the last row within each group. In the example below, the date "2012-01-21" should be on the 10th row for id = "A" and then decrement by one day back to the 1st row; for id = "B", "2012-01-21" should be on the 5th row and then decrement by 1 until it reaches the first row. Basically, the decrement should start from the last row within each "id". How could I accomplish this with R data.table?
The code below does the opposite: the date starts at the 1st row and decrements. How would you start the date at the last row instead?
end<- as.Date("2012-01-21")
dt <- data.table(id = c(rep("A",10),rep("B",5)),sales=10+rnorm(15))
dtx <- dt[,date := seq(end,by = -1,length.out = .N),by=list(id)]
> dtx
id sales date
1: A 12.008514 2012-01-21
2: A 10.904740 2012-01-20
3: A 9.627039 2012-01-19
4: A 11.363810 2012-01-18
5: A 8.533913 2012-01-17
6: A 10.041074 2012-01-16
7: A 11.006845 2012-01-15
8: A 10.775066 2012-01-14
9: A 9.978509 2012-01-13
10: A 8.743829 2012-01-12
11: B 8.434640 2012-01-21
12: B 9.489433 2012-01-20
13: B 10.011354 2012-01-19
14: B 8.681002 2012-01-18
15: B 9.264915 2012-01-17
We could reverse the sequence generated above.
library(data.table)
dt[,date := rev(seq(end,by = -1,length.out = .N)),id]
dt
# id sales date
# 1: A 10.886312 2012-01-12
# 2: A 9.803543 2012-01-13
# 3: A 9.063694 2012-01-14
# 4: A 9.762628 2012-01-15
# 5: A 8.764109 2012-01-16
# 6: A 11.095826 2012-01-17
# 7: A 8.735148 2012-01-18
# 8: A 9.227285 2012-01-19
# 9: A 12.024336 2012-01-20
#10: A 9.976514 2012-01-21
#11: B 8.488753 2012-01-17
#12: B 9.141837 2012-01-18
#13: B 11.435365 2012-01-19
#14: B 10.817839 2012-01-20
#15: B 8.427098 2012-01-21
Similarly,
dt[,date := seq(end - .N + 1,by = 1,length.out = .N),id]
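As a quick sanity check, both formulations produce identical date columns; a minimal sketch using the dt and end objects defined above:
# compare the reversed backward sequence with the forward sequence, per group
identical(dt[, rev(seq(end, by = -1, length.out = .N)), by = id]$V1,
          dt[, seq(end - .N + 1, by = 1, length.out = .N), by = id]$V1)
# [1] TRUE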

Convert daily to weekly data and deal with holidays

I have a data table containing daily data. From this data table I want to extract weekly data points obtained each Wednesday. If Wednesday is a holiday, i.e. not available in the data table, the next available data point should be taken.
Here a MWE:
library(data.table)
df <- data.table(date=as.Date(c("2012-06-25","2012-06-26","2012-06-27","2012-06-28","2012-06-29","2012-07-02","2012-07-03","2012-07-05","2012-07-06","2012-07-09","2012-07-10","2012-07-11","2012-07-12","2012-07-13","2012-07-16","2012-07-17","2012-07-18","2012-07-19","2012-07-20")))
df[,weekday:=strftime(date,'%u')]
with output:
date weekday
1: 2012-06-25 1
2: 2012-06-26 2
3: 2012-06-27 3
4: 2012-06-28 4
5: 2012-06-29 5
6: 2012-07-02 1
7: 2012-07-03 2
8: 2012-07-05 4 #here the 4th of July was skipped
9: 2012-07-06 5
10: 2012-07-09 1
11: 2012-07-10 2
12: 2012-07-11 3
13: 2012-07-12 4
14: 2012-07-13 5
15: 2012-07-16 1
16: 2012-07-17 2
17: 2012-07-18 3
18: 2012-07-19 4
19: 2012-07-20 5
My desired result, in this case would be:
date weekday
2012-06-27 3
2012-07-05 4
2012-07-11 3
2012-07-18 3
Is there a more efficient way of obtaining this than going week-by-week via for loop and checking whether the Wednesday data point is included in the data or not? I feel that there must be a better way, so any advice would be highly appreciated!
Working solution (following Imo's suggestion):
df[,weekday:=wday(date)] #faster way to get weekdays, careful: numbers increased by 1 vs strftime
df[,numweek:=floor(as.numeric(date-date[1])/7+1)] #get continuous week numbers extending over end of years
df[df[,.I[which.min(abs(weekday-4.25))],by=.(numweek)]$V1] #gets result
Here is one method using a self-join that finds, for each week, the position (using .I) of the weekday closest to 3. The 3.25 in which.min(abs(as.integer(weekday)-3.25)) breaks ties toward the following day (Thursday, distance 0.75) rather than the preceding one (Tuesday, distance 1.25), matching the requirement to take the next available data point:
df[df[, .I[which.min(abs(as.integer(weekday)-3.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 3
2: 2012-07-05 4
3: 2012-07-11 3
4: 2012-07-18 3
Note that if your real data spans years, then you need to use by=.(week(date), year(date)).
Note also that there is a data.table function wday that returns the integer day of the week directly. It is 1 greater than the character value returned by strftime, so an adjustment is required if you want to use it directly.
From your data.table with a single variable, you'd do
df[, weekday := wday(date)]
df[df[, .I[which.min(abs(weekday-4.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 4
2: 2012-07-05 5
3: 2012-07-11 4
4: 2012-07-18 4
Note that the dates match those above.
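If your data did span multiple years, a sketch of the adjusted grouping (week numbers restart each year, so year must be part of the by):
# group by year and week so that week 1 of one year is not merged with week 1 of another
df[df[, .I[which.min(abs(weekday - 4.25))], by = .(year(date), week(date))]$V1]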

Using ifelse on two data frames [closed]

I'm trying to create a new column in my data frame using an ifelse condition like this:
Let's assume two data frames A and B, both having date and time columns.
If a date in A matches a date in B, and A's time equals B's time for that date, or B's time is lower than the time in the next row of A, then TRUE, else FALSE.
I hope this is clear enough. So far I have tried something like this:
A %>% mutate(DFT = ifelse(Dayt == B$date & Tyme == B$time |
                            Tyme > Time[which(Dayt == B$date & Tyme == B$time) + 1],
                          B[which(which(Dayt == B$date & Tyme == B$time) + 1)], NA))
This code may not work but I hope it gives an idea of what I'm trying to achieve. Any help would be appreciated.
reproducible example
set.seed(1)
A = data.frame(
date=seq(as.Date("2017/1/1"), as.Date("2017/1/10"), "days"))
B = data.frame(date=seq(as.Date("2017/1/2"), as.Date("2017/1/9"), "days"))
A$time <- sample(1:3,length(A$date),TRUE)
B$time <- sample(1:3,length(B$date),TRUE)
A
date time
1: 2017-01-01 1
2: 2017-01-02 2
3: 2017-01-03 2
4: 2017-01-04 3
5: 2017-01-05 1
6: 2017-01-06 3
7: 2017-01-07 3
8: 2017-01-08 2
9: 2017-01-09 2
10: 2017-01-10 1
B
date time
1: 2017-01-02 1
2: 2017-01-03 1
3: 2017-01-04 3
4: 2017-01-05 2
5: 2017-01-06 3
6: 2017-01-07 2
7: 2017-01-08 3
8: 2017-01-09 3
solution
Here is a solution: first merge by date, then filter using the time condition.
library(data.table)
setDT(A)
setDT(B)
merge(A,B,by="date")[time.x==time.y | time.y==c(tail(time.x,-1),NA)]
date time.x time.y
1: 2017-01-04 3 3
2: 2017-01-06 3 3
3: 2017-01-07 3 2
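The c(tail(time.x, -1), NA) term aligns each row with the next row's time.x; a sketch of an equivalent filter using data.table::shift(), which some find more readable:
# shift(time.x, type = "lead") is c(tail(time.x, -1), NA): the next row's time, NA on the last row
merge(A, B, by = "date")[time.x == time.y | time.y == shift(time.x, type = "lead")]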

How to do a BETWEEN merge the data.table way?

I have two data.tables that are each 5-10GB in size. They look similar to the following.
library(data.table)
A <- data.table(
person = c(1,1,1,2,3,3,3,3,4,4),
datetime = c(
'2015-04-06 14:22:18',
'2015-04-07 02:55:32',
'2015-11-21 10:16:05',
'2015-10-03 13:37:29',
'2015-02-26 23:51:56',
'2015-05-16 18:21:44',
'2015-06-02 04:07:43',
'2015-11-28 15:22:36',
'2015-01-19 04:10:22',
'2015-01-24 02:18:11'
)
)
B <- data.table(
person = c(1,1,3,4,4,5),
datetime2 = c(
'2015-04-06 14:24:59',
'2015-11-28 15:22:36',
'2015-06-02 04:07:43',
'2015-01-19 06:10:22',
'2015-01-24 02:18:18',
'2015-04-06 14:22:18'
)
)
A$datetime <- as.POSIXct(A$datetime)
B$datetime2 <- as.POSIXct(B$datetime2)
The idea is to find rows in B where the datetime is within 0-10 minutes of a matching row in A (matching is done by person) and mark them in A. The question is how can I do it most efficiently using data.table?
One plan is to join the two data tables based on person only, then calculate the time difference and find rows where the time difference is between 0 and 600 seconds, and finally outer join the latter with A:
setkey(A,person)
AB <- A[B,.(datetime,
datetime2,
diff = difftime(datetime2, datetime, units = "secs"))
, by = .EACHI]
M <- AB[diff < 600 & diff > 0]
setkey(A, person, datetime)
setkey(M, person, datetime)
M[A,]
Which gives us the correct result:
person datetime datetime2 diff
1: 1 2015-04-06 14:22:18 2015-04-06 14:24:59 161 secs
2: 1 2015-04-07 02:55:32 <NA> NA secs
3: 1 2015-11-21 10:16:05 <NA> NA secs
4: 2 2015-10-03 13:37:29 <NA> NA secs
5: 3 2015-02-26 23:51:56 <NA> NA secs
6: 3 2015-05-16 18:21:44 <NA> NA secs
7: 3 2015-06-02 04:07:43 <NA> NA secs
8: 3 2015-11-28 15:22:36 <NA> NA secs
9: 4 2015-01-19 04:10:22 <NA> NA secs
10: 4 2015-01-24 02:18:11 2015-01-24 02:18:18 7 secs
However, I am not sure this is the most efficient way. Specifically, AB[diff < 600 & diff > 0] will run a vector scan rather than a binary search, and I cannot think of how to express it as a binary search.
Also, I am not sure whether converting to POSIXct is the most efficient way of calculating time differences.
Any ideas on how to improve efficiency are highly appreciated.
data.table's rolling join is perfect for this task:
B[, datetime := datetime2]
setkey(A,person,datetime)
setkey(B,person,datetime)
B[A,roll=-600]
person datetime2 datetime
1: 1 2015-04-06 14:24:59 1428319338
2: 1 NA 1428364532
3: 1 NA 1448090165
4: 2 NA 1443868649
5: 3 NA 1424983916
6: 3 NA 1431789704
7: 3 2015-06-02 04:07:43 1433207263
8: 3 NA 1448713356
9: 4 NA 1421629822
10: 4 2015-01-24 02:18:18 1422055091
The only difference from your expected output is that it checks the time difference as less than or equal to 10 minutes (<=). If that is a problem for you, you can simply remove the equal matches afterwards.
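Since data.table v1.9.8 a non-equi join is another option; a minimal sketch (lo, hi and matched are hypothetical helper names, and the boundaries are inclusive, like the rolling join above):
# build a [datetime, datetime + 10 min] window per row of A
A[, `:=`(lo = datetime, hi = datetime + 600)]
A[, matched := FALSE]
# flag A rows whose window contains a datetime2 of the same person in B
A[B, on = .(person, lo <= datetime2, hi >= datetime2), matched := TRUE]
A[, c("lo", "hi") := NULL]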

Fastest way for filling-in missing dates for data.table

I am loading a data.table from CSV file that has date, orders, amount etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount etc.), or carry the last value forward (e.g., 03-Jan reuses the 02-Jan values and 06-Jan reuses the 05-Jan values, etc.).
What is the best/optimal way to fill-in such gaps of missing dates data with such default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays. That may work for weekdays (since there are only 7 of them), but I am not sure it is the right approach for dates, especially for multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
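If you want zeros instead of carrying values forward, drop the roll and replace the resulting NAs; a minimal sketch (filled is just an illustrative name, and 0 is coerced to each column's type):
# plain keyed join: missing dates come back as NA rows
filled <- NADayWiseOrders[J(all_dates)]
for (col in c("orders", "amount", "guests"))
  set(filled, which(is.na(filled[[col]])), col, 0)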
Here is how you fill in the gaps within each subgroup:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
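In current data.table versions (>= 1.9.6) the same rolling join can be written without keying, via on=; a minimal sketch:
# join columns given inline; roll applies to the last column in on
dt[indx, on = .(group, date), roll = TRUE]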
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts) # na.locf() comes from zoo, which xts loads
na.locf(dt)
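A data.table-native alternative for the LOCF step, assuming data.table >= 1.12.4 (nafill() handles numeric columns only):
# carry the last observation forward per numeric column, by reference
cols <- c("orders", "amount", "guests")
dt[, (cols) := lapply(.SD, nafill, type = "locf"), .SDcols = cols]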
