R data.table: assign a date sequence by group, decrementing from the last row

I have a data.table in R. I need to assign dates that decrement from the last row within each by group. In the example below, the date "2012-01-21" should be on the 10th row when id = "A", decrementing by one day back to the 1st row; for id = "B" the date should be "2012-01-21" on the 5th row, again decrementing by 1 until it reaches the first row. Basically, the decrement should start from the last row of each "id". How can I accomplish this with R data.table?
The code below does the opposite: the date starts at the 1st row and decrements. How would you start the date at the last row of each group instead?
library(data.table)
end <- as.Date("2012-01-21")
dt <- data.table(id = c(rep("A", 10), rep("B", 5)), sales = 10 + rnorm(15))
dtx <- dt[, date := seq(end, by = -1, length.out = .N), by = list(id)]
> dtx
id sales date
1: A 12.008514 2012-01-21
2: A 10.904740 2012-01-20
3: A 9.627039 2012-01-19
4: A 11.363810 2012-01-18
5: A 8.533913 2012-01-17
6: A 10.041074 2012-01-16
7: A 11.006845 2012-01-15
8: A 10.775066 2012-01-14
9: A 9.978509 2012-01-13
10: A 8.743829 2012-01-12
11: B 8.434640 2012-01-21
12: B 9.489433 2012-01-20
13: B 10.011354 2012-01-19
14: B 8.681002 2012-01-18
15: B 9.264915 2012-01-17

We could reverse the sequence generated above.
library(data.table)
dt[,date := rev(seq(end,by = -1,length.out = .N)),id]
dt
# id sales date
# 1: A 10.886312 2012-01-12
# 2: A 9.803543 2012-01-13
# 3: A 9.063694 2012-01-14
# 4: A 9.762628 2012-01-15
# 5: A 8.764109 2012-01-16
# 6: A 11.095826 2012-01-17
# 7: A 8.735148 2012-01-18
# 8: A 9.227285 2012-01-19
# 9: A 12.024336 2012-01-20
#10: A 9.976514 2012-01-21
#11: B 8.488753 2012-01-17
#12: B 9.141837 2012-01-18
#13: B 11.435365 2012-01-19
#14: B 10.817839 2012-01-20
#15: B 8.427098 2012-01-21
Similarly, we could compute each group's start date and build the sequence forward from there:
dt[, date := seq(end - .N + 1, by = 1, length.out = .N), id]
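An equivalent arithmetic form (a sketch using the same dt and end as above) makes the anchoring explicit: row i of a group of size .N gets end - (.N - i), so the last row lands exactly on end.
dt[, date := end - (.N - seq_len(.N)), by = id]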

Related

R: time series monthly max adjusted by group

I have a df like this (its head):
date Value
1: 2016-12-31 169361280
2: 2017-01-01 169383153
3: 2017-01-02 169494585
4: 2017-01-03 167106852
5: 2017-01-04 166750164
6: 2017-01-05 164086438
I would like to calculate a ratio, and for that I need the max of every period. The max normally falls on the last day of the month, but sometimes it can be a few days before or after (the 28th, 29th, 30th, 31st, 1st, or 2nd).
To calculate it properly, I would like to assign to my reference date (the last day of the month) the max value of this group of days, to be sure the ratio reflects what it is supposed to.
This could be a reproducible example:
library(data.table)
library(zoo)  # for as.yearmon
Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start, End, by = '1 day'))
dt <- cbind(window, rep(rnorm(nrow(window))))
colnames(dt) <- c("date", "value")
# Create a sequence of month-end dates
DateSeq <- function(st, en, freq) {
  st <- as.Date(as.yearmon(st))
  en <- as.Date(as.yearmon(en))
  as.Date(as.yearmon(seq(st, en, by = paste(as.character(12 / freq), "months"))),
          frac = 1)
}
# df to be filled with the group max
Value.Max.Month <- data.frame(DateSeq(Start, End, 12))
colnames(Value.Max.Month) <- c("date")
date
1 2016-12-31
2 2017-01-31
3 2017-02-28
4 2017-03-31
5 2017-04-30
6 2017-05-31
7 2017-06-30
8 2017-07-31
9 2017-08-31
10 2017-09-30
11 2017-10-31
12 2017-11-30
13 2017-12-31
14 2018-01-31
15 2018-02-28
16 2018-03-31
You could use data.table:
library(data.table)
library(lubridate)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start, End, by = '1 day'))
dt <- cbind(window, rep(rnorm(nrow(window))))
colnames(dt) <- c("date", "value")
dt <- data.table(dt)
dt[, period := as.Date(as.yearmon(date)) %m+% months(1) - 1][
     , maximum := max(value), by = period][
     , unique(maximum), by = period]
In the first expression we create a new column called period. Then we group by this new column and take the maximum of value. In the last expression we output only the unique rows.
Notice that to get the last day of each period, we move to the first day of the month via as.yearmon, add one month using lubridate, and then subtract 1 day.
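For example, a mid-February date maps to the end of February:
as.Date(as.yearmon(as.Date("2017-02-15"))) %m+% months(1) - 1
# [1] "2017-02-28"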
The output is:
period V1
1: 2016-12-31 -0.7832116
2: 2017-01-31 2.1988660
3: 2017-02-28 1.6644812
4: 2017-03-31 1.2464980
5: 2017-04-30 2.8268820
6: 2017-05-31 1.7963104
7: 2017-06-30 1.3612476
8: 2017-07-31 1.7325457
9: 2017-08-31 2.7503439
10: 2017-09-30 2.4369036
11: 2017-10-31 2.4544802
12: 2017-11-30 3.1477730
13: 2017-12-31 2.8461506
14: 2018-01-31 1.8862944
15: 2018-02-28 1.8946470
16: 2018-03-31 0.7864341
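The intermediate maximum column and the unique() pass aren't strictly necessary; the same table can be produced in one grouped aggregation (a sketch, assuming the same dt as above):
dt[, .(maximum = max(value)), by = .(period = as.Date(as.yearmon(date)) %m+% months(1) - 1)]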

How to calculate difference in time between variable rows in R?

I am looking to calculate differences in time for different groups based on beginning and end work times. How can I tell R to calculate difftime between two rows based on their labels within a group? Below is a sample data set:
library(data.table)
latemail <- function(N, st = "2012/01/01", et = "2012/02/01") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "secs"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}
#create our data frame
set.seed(42)
dt <- latemail(20)
work <- setDT(as.data.frame(dt))
work[, worker := stringi::stri_rand_strings(2, 5)]  # two worker IDs, recycled over the 20 rows
work[, dt := as.POSIXct(as.character(dt), tz = "GMT")]
work[, status := NA]
#order
setorder(work, worker, dt)
#add work times
work$status[1] = "start"
work$status[5] = "end"
work$status[6] = "start"
work$status[10] = "end"
work$status[11] = "start"
work$status[15] = "end"
work$status[16] = "start"
work$status[20] = "end"
The table now looks like this:
dt worker status
1: 2012-01-04 23:11:31 VOuRp start
2: 2012-01-09 15:53:16 VOuRp NA
3: 2012-01-15 02:56:45 VOuRp NA
4: 2012-01-16 21:12:26 VOuRp NA
5: 2012-01-20 16:27:31 VOuRp end
6: 2012-01-22 15:34:05 VOuRp start
7: 2012-01-23 15:01:18 VOuRp NA
8: 2012-01-29 03:36:56 VOuRp NA
9: 2012-01-29 20:11:02 VOuRp NA
10: 2012-01-31 02:48:01 VOuRp end
11: 2012-01-04 10:24:38 u8zw5 start
12: 2012-01-08 17:02:20 u8zw5 NA
13: 2012-01-14 23:33:35 u8zw5 NA
14: 2012-01-15 12:23:52 u8zw5 NA
15: 2012-01-18 03:53:15 u8zw5 end
16: 2012-01-21 03:48:08 u8zw5 start
17: 2012-01-23 02:01:10 u8zw5 NA
18: 2012-01-26 12:51:10 u8zw5 NA
19: 2012-01-29 18:23:46 u8zw5 NA
20: 2012-01-29 22:22:14 u8zw5 end
Answer I'm looking for:
Ultimately I would like to get the values below (labeled worker 1 and worker 2 just because I wasn't sure how to do the equivalent of set.seed() for stringi). The following code gives me the first shift for worker 1, but I'd like each shift for each worker:
difftime(as.POSIXct("2012-01-20 16:27:31"), as.POSIXct("2012-01-04 23:11:31"), units = "hours")
Work time (time difference in hours)
worker 1   377.2667 hours
worker 2   . . . .
In this example I have an even number of rows per worker, but assuming the number of rows varies between workers, what would that look like? I'm assuming some sort of difftime formula? I would prefer a data.table solution as I am working with large data.
Here is a solution using data.table:
work[status %in% c("start", "end"),
     time.diff := ifelse(status == "start",
                         difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"),
                         NA),
     by = worker][status == "start", sum(time.diff), worker]
we get:
worker V1
1: VOuRp 580.4989
2: u8zw5 540.0453
where V1 holds the sum of hours over all start-to-end intervals for each worker.
Let's explain it step by step for better understanding.
STEP 1. Select all rows with start or end status:
work.se <- work[status %in% c("start", "end")]
dt worker status
1: 2012-01-04 23:11:31 VOuRp start
2: 2012-01-20 16:27:31 VOuRp end
3: 2012-01-22 15:34:05 VOuRp start
4: 2012-01-31 02:48:01 VOuRp end
5: 2012-01-04 10:24:38 u8zw5 start
6: 2012-01-18 03:53:15 u8zw5 end
7: 2012-01-21 03:48:08 u8zw5 start
8: 2012-01-29 22:22:14 u8zw5 end
STEP 2: Create a function for calculating the time differences between the current row and the next one. This function will be invoked inside the data.table object. We use the shift function from the same package:
getDiff <- function(x) {
  difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
}
getDiff computes the time difference between the next record (within the group) and the current one. It yields NA for the last row of each group because there is no next value; we then exclude the NA values from the calculation.
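As a quick illustration of shift with type = "lead": each value is replaced by the next one, and the last becomes the fill value (NA by default):
library(data.table)
shift(c(10, 20, 30), type = "lead")
# [1] 20 30 NA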
STEP 3: Invoke it within the data.table syntax:
work.result <- work.se[, time.diff := ifelse(status == "start",
                                             getDiff(dt), NA),
                       by = worker]
we get this:
dt worker status time.diff
1: 2012-01-04 23:11:31 VOuRp start 377.2667
2: 2012-01-20 16:27:31 VOuRp end NA
3: 2012-01-22 15:34:05 VOuRp start 203.2322
4: 2012-01-31 02:48:01 VOuRp end NA
5: 2012-01-04 10:24:38 u8zw5 start 329.4769
6: 2012-01-18 03:53:15 u8zw5 end NA
7: 2012-01-21 03:48:08 u8zw5 start 210.5683
8: 2012-01-29 22:22:14 u8zw5 end NA
STEP 4: Sum the non-NA values for time.diff column for each worker:
> work.result[status == "start", sum(time.diff), worker]
worker V1
1: VOuRp 580.4989
2: u8zw5 540.0453
data.table operations can be chained by appending [] to an expression, so the last two steps can be consolidated into a single statement:
work.se[, time.diff := ifelse(status == "start", getDiff(dt), NA),
        by = worker][status == "start", sum(time.diff), worker]
FINAL: Putting it all together into one chained statement:
work[status %in% c("start", "end"),
     time.diff := ifelse(status == "start",
                         difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"),
                         NA),
     by = worker][status == "start", sum(time.diff), worker]
Check this link for data.table basic syntax.
I hope this helps; please let us know if it is what you wanted.
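As a variation (a sketch over the same work table): ifelse() silently coerces the difftime to numeric, which happens to be fine here, but the coercion can be avoided by pairing each start row with the following row explicitly:
se <- work[status %in% c("start", "end")]
se[, nxt := shift(dt, type = "lead"), by = worker]  # for a "start" row, nxt is its matching "end"
se[status == "start", .(total.hours = as.numeric(sum(difftime(nxt, dt, units = "hours")))), by = worker]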

Identifying a cluster of low transit speeds in GPS tracking data

I'm working with a GPS tracking dataset, and I've been playing around with filtering the dataset based on speed and time of day. The species I am working with becomes inactive around dusk, during which it rests on the ocean's surface, but then resumes activity once night has fallen. For each animal in the dataset, I would like to remove all data points after it initially becomes inactive around dusk (21:30). But because each animal becomes inactive at different times, I cannot simply filter out all the data points occurring after 21:30.
My data looks like this...
AnimalID Latitude Longitude Speed Date
99B 50.86190 -129.0875 5.6 2015-05-14 21:26:00
99B 50.86170 -129.0875 0.6 2015-05-14 21:32:00
99B 50.86150 -129.0810 0.5 2015-05-14 21:33:00
99B 50.86140 -129.0800 0.3 2015-05-14 21:40:00
99C.......
Essentially, I want to find a cluster of GPS positions (say, a minimum of 5), occurring after 21:30:00, that all have speeds of <0.8. I then want to delete all points after this point (including the identified cluster).
Does anyone know a way of identifying clusters of points in R? Or is this type of filtering way too complex?
Using data.table, you can take a rolling forward/backward max over the current entry plus the next four or the previous four, by animal ID. Then filter out any rows that don't meet the criteria. For example:
library(data.table)
set.seed(40)
DT <- data.table(Speed = runif(1:1000), AnimalID = rep(c("A","B"), each = 500))
DT[, FSpeed := Reduce(pmax, shift(Speed, 0:4, type = "lead", fill = 1)), by = .(AnimalID)]  # current + 4 forward
DT[, BSpeed := Reduce(pmax, shift(Speed, 0:4, type = "lag",  fill = 1)), by = .(AnimalID)]  # current + 4 backwards
DT[FSpeed < 0.5 | BSpeed < 0.5]  # min speed
Speed AnimalID FSpeed BSpeed
1: 0.220509197 A 0.4926640 0.8897597
2: 0.225883211 A 0.4926640 0.8897597
3: 0.264809801 A 0.4926640 0.6648507
4: 0.184270587 A 0.4926640 0.6589303
5: 0.492664002 A 0.4926640 0.4926640
6: 0.472144689 A 0.4721447 0.4926640
7: 0.254635219 A 0.7409803 0.4926640
8: 0.281538568 A 0.7409803 0.4926640
9: 0.304875597 A 0.7409803 0.4926640
10: 0.059605991 A 0.7409803 0.4721447
11: 0.132069268 A 0.2569604 0.9224052
12: 0.256960449 A 0.2569604 0.9224052
13: 0.005059727 A 0.8543111 0.2569604
14: 0.191478376 A 0.8543111 0.2569604
15: 0.170969244 A 0.4398143 0.7927442
16: 0.059577719 A 0.4398143 0.7927442
17: 0.439814267 A 0.4398143 0.7927442
18: 0.307714603 A 0.9912536 0.4398143
19: 0.075750773 A 0.9912536 0.4398143
20: 0.100589403 A 0.9912536 0.4398143
21: 0.032957748 A 0.4068012 0.7019594
22: 0.080091554 A 0.4068012 0.7019594
23: 0.406801193 A 0.9761119 0.4068012
24: 0.057445020 A 0.9761119 0.4068012
25: 0.308382143 A 0.4516870 0.9435490
26: 0.451686996 A 0.4516870 0.9248595
27: 0.221964923 A 0.4356419 0.9248595
28: 0.435641917 A 0.5363373 0.4516870
29: 0.237658906 A 0.5363373 0.4516870
30: 0.324597512 A 0.9710011 0.4356419
31: 0.357198893 B 0.4869905 0.9226573
32: 0.486990475 B 0.4869905 0.9226573
33: 0.115922994 B 0.4051843 0.9226573
34: 0.010581766 B 0.9338841 0.4869905
35: 0.003976893 B 0.9338841 0.4869905
36: 0.405184342 B 0.9338841 0.4051843
37: 0.412468699 B 0.4942280 0.9113595
38: 0.402063509 B 0.4942280 0.9113595
39: 0.494228013 B 0.8254665 0.4942280
40: 0.123264949 B 0.8254665 0.4942280
41: 0.251132449 B 0.4960371 0.9475821
42: 0.496037128 B 0.8845043 0.4960371
43: 0.250853014 B 0.3561290 0.9858652
44: 0.356129033 B 0.3603769 0.8429552
45: 0.225943145 B 0.7028077 0.3561290
46: 0.360376907 B 0.7159759 0.3603769
47: 0.169606203 B 0.3438164 0.9745535
48: 0.343816363 B 0.4396962 0.9745535
49: 0.067265545 B 0.9641856 0.3438164
50: 0.439696243 B 0.9641856 0.4396962
51: 0.024403516 B 0.3730828 0.9902976
52: 0.373082846 B 0.4713596 0.9902976
53: 0.290466668 B 0.9689225 0.3730828
54: 0.471359568 B 0.9689225 0.4713596
55: 0.402111615 B 0.4902595 0.8045104
56: 0.490259530 B 0.8801029 0.4902595
57: 0.477884140 B 0.4904800 0.6696598
58: 0.490480001 B 0.8396014 0.4904800
Speed AnimalID FSpeed BSpeed
This shows all the rows where the rolling max over the anchor row plus the following four (or the previous four) entries falls below our speed threshold (in this case 0.5).
In your code, just run DT <- as.data.table(myDF) where myDF is the name of the data.frame you are using.
For this analysis, we assume that GPS measurements are taken at constant intervals. The fill = 1 padding also means windows that run past either end of a group can never qualify, effectively throwing out the first 4 and last 4 observations in each direction; set fill = to your maximum speed or higher.
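Applied to the question's data, the same rolling-max idea can flag the first row that starts a run of five slow fixes after 21:30 and drop that row and everything after it. This is only a sketch: the data.frame name gps is assumed, and the column names (AnimalID, Speed, Date) are taken from the question.
library(data.table)
setDT(gps)  # gps: the poster's data.frame (name assumed)
# TRUE where this row and the next four all have Speed < 0.8; fill = Inf so
# windows that run past the end of a track can never qualify
gps[, slow5 := Reduce(pmax, shift(Speed, 0:4, type = "lead", fill = Inf)) < 0.8, by = AnimalID]
gps[, late := format(Date, "%H:%M:%S") >= "21:30:00"]  # fixes at or after 21:30
# keep only the rows before the first qualifying cluster in each track
result <- gps[, .SD[cumsum(slow5 & late) == 0], by = AnimalID]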

Bi-weekly binning with data table

Looking to add a bi-weekly date column to a data table. I have a working solution but it seems messy. Also, I have the feeling rolling joins should do the trick, but I'm not sure how. Are there any better solutions to creating a grouping for bi-weekly dates?
# Mock data table
dt <- data.table(value = runif(20), date = seq(as.Date("2015-01-01"), as.Date("2015-01-20"), by = "days"))
# Bi-weekly dates starting with most recent date and working backwards
bidates <- data.table(bi = seq(dt[, max(date)], dt[, min(date)], by = -14))
# Expand out bi-weekly dates to match up with every date in that range
bidates <- bidates[, seq(bi - 13, bi, by = "days"), by = bi]
# Key and merge
setkey(dt, date)
setkey(bidates, V1)
dt[bidates, bi := i.bi]
Here's how you can use rolling joins:
bis = dt[, .(date = seq(max(date), min(date), by = -14))][, bi := date]
setkey(bis, date)
setkey(dt, date)
bis[dt, roll = -Inf]
# date bi value
# 1: 2015-01-01 2015-01-06 0.2433854
# 2: 2015-01-02 2015-01-06 0.5454916
# 3: 2015-01-03 2015-01-06 0.3334531
# 4: 2015-01-04 2015-01-06 0.9134877
# 5: 2015-01-05 2015-01-06 0.4557901
# 6: 2015-01-06 2015-01-06 0.3459536
# 7: 2015-01-07 2015-01-20 0.8024527
# 8: 2015-01-08 2015-01-20 0.1833166
# 9: 2015-01-09 2015-01-20 0.1024560
#10: 2015-01-10 2015-01-20 0.4052751
#11: 2015-01-11 2015-01-20 0.9564279
#12: 2015-01-12 2015-01-20 0.6413953
#13: 2015-01-13 2015-01-20 0.7614291
#14: 2015-01-14 2015-01-20 0.2176500
#15: 2015-01-15 2015-01-20 0.3352939
#16: 2015-01-16 2015-01-20 0.4847095
#17: 2015-01-17 2015-01-20 0.8450636
#18: 2015-01-18 2015-01-20 0.8513685
#19: 2015-01-19 2015-01-20 0.2012410
#20: 2015-01-20 2015-01-20 0.3847956
Starting from version 1.9.5+, you don't need to set the keys and can do:
bis[dt, roll = -Inf, on = 'date']
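Once each date carries its bi-weekly anchor, summarising by bucket is an ordinary grouped call, e.g. to total value per period:
res <- bis[dt, roll = -Inf, on = 'date']
res[, .(total = sum(value)), by = bi]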

Subset data.table by evaluating multiple columns

How do I return one row for each unique Name and Type combination, keeping the most recent (latest) Date?
DataTable with 6 rows:
example <- data.table(c("Bob", "May", "Sue", "Bob", "Sue", "Bob"),
                      c("A", "A", "A", "A", "B", "B"),
                      as.Date(c("2010/01/01", "2010/01/01", "2010/01/01",
                                "2012/01/01", "2012/01/11", "2014/01/01")))
setnames(example, c("Name", "Type", "Date"))
setkey(example, Name, Date)
Should return 5 rows:
# 1: Bob A 2012-01-01
# 2: Bob B 2014-01-01
# 3: May A 2010-01-01
# 4: Sue A 2010-01-01
# 5: Sue B 2012-01-11
Since you've already sorted by Name and Date, you can use the unique function (which dispatches to unique.data.table) on the columns Name and Type, with fromLast = TRUE.
require(data.table) ## >= v1.9.3
unique(example, by=c("Name", "Type"), fromLast=TRUE)
# Name Type Date
# 1: Bob A 2012-01-01
# 2: Bob B 2014-01-01
# 3: May A 2010-01-01
# 4: Sue A 2010-01-01
# 5: Sue B 2012-01-11
This'll pick the last row for each Name,Type group. Hope this helps.
PS: As #mso points out, this needs 1.9.3 because the fromLast argument was implemented only in 1.9.3 (available from github).
The following variations of @Arun's answer also work:
unique(example[rev(order(Name, Date))], by = c("Name", "Type"), fromLast = TRUE)[order(Name, Date)]
Name Type Date
1: Bob A 2012-01-01
2: Bob B 2014-01-01
3: May A 2010-01-01
4: Sue A 2010-01-01
5: Sue B 2012-01-11
unique(example[order(Name, Date, decreasing = TRUE)], by = c("Name", "Type"))[order(Name, Date)]
Name Type Date
1: Bob A 2012-01-01
2: Bob B 2014-01-01
3: May A 2010-01-01
4: Sue A 2010-01-01
5: Sue B 2012-01-11
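Another common idiom for "last row per group" is .SD[.N]; since example is keyed by Name and Date, rows are already in date order within each group:
example[, .SD[.N], by = .(Name, Type)]
# Name Type Date
# 1: Bob A 2012-01-01
# 2: Bob B 2014-01-01
# 3: May A 2010-01-01
# 4: Sue A 2010-01-01
# 5: Sue B 2012-01-11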
