Identifying a cluster of low transit speeds in GPS tracking data (R)

I'm working with a GPS tracking dataset, and I've been playing around with filtering the dataset based on speed and time of day. The species I am working with becomes inactive around dusk, during which it rests on the ocean's surface, but then resumes activity once night has fallen. For each animal in the dataset, I would like to remove all data points after it initially becomes inactive around dusk (21:30). But because each animal becomes inactive at different times, I cannot simply filter out all the data points occurring after 21:30.
My data looks like this...
AnimalID Latitude Longitude Speed Date
99B 50.86190 -129.0875 5.6 2015-05-14 21:26:00
99B 50.86170 -129.0875 0.6 2015-05-14 21:32:00
99B 50.86150 -129.0810 0.5 2015-05-14 21:33:00
99B 50.86140 -129.0800 0.3 2015-05-14 21:40:00
99C.......
Essentially, I want to find a cluster of GPS positions (say, a minimum of 5), occurring after 21:30:00, that all have speeds of <0.8. I then want to delete the identified cluster and all points after it.
Does anyone know a way of identifying clusters of points in R? Or is this type of filtering WAY too complex?

Using data.table, you can use a rolling forward/backwards max to find the max of the next five or previous five entries by animal ID. Then, filter out any that don't meet the criteria. For example:
library(data.table)
set.seed(40)
DT <- data.table(Speed = runif(1000), AnimalID = rep(c("A", "B"), each = 500))
DT[, FSpeed := Reduce(pmax, shift(Speed, 0:4, type = "lead", fill = 1)), by = AnimalID]  # current point + 4 forward
DT[, BSpeed := Reduce(pmax, shift(Speed, 0:4, type = "lag",  fill = 1)), by = AnimalID]  # current point + 4 backward
DT[FSpeed < 0.5 | BSpeed < 0.5]  # rows inside a 5-point window whose max speed is below the threshold (0.5)
Speed AnimalID FSpeed BSpeed
1: 0.220509197 A 0.4926640 0.8897597
2: 0.225883211 A 0.4926640 0.8897597
3: 0.264809801 A 0.4926640 0.6648507
4: 0.184270587 A 0.4926640 0.6589303
5: 0.492664002 A 0.4926640 0.4926640
6: 0.472144689 A 0.4721447 0.4926640
7: 0.254635219 A 0.7409803 0.4926640
8: 0.281538568 A 0.7409803 0.4926640
9: 0.304875597 A 0.7409803 0.4926640
10: 0.059605991 A 0.7409803 0.4721447
11: 0.132069268 A 0.2569604 0.9224052
12: 0.256960449 A 0.2569604 0.9224052
13: 0.005059727 A 0.8543111 0.2569604
14: 0.191478376 A 0.8543111 0.2569604
15: 0.170969244 A 0.4398143 0.7927442
16: 0.059577719 A 0.4398143 0.7927442
17: 0.439814267 A 0.4398143 0.7927442
18: 0.307714603 A 0.9912536 0.4398143
19: 0.075750773 A 0.9912536 0.4398143
20: 0.100589403 A 0.9912536 0.4398143
21: 0.032957748 A 0.4068012 0.7019594
22: 0.080091554 A 0.4068012 0.7019594
23: 0.406801193 A 0.9761119 0.4068012
24: 0.057445020 A 0.9761119 0.4068012
25: 0.308382143 A 0.4516870 0.9435490
26: 0.451686996 A 0.4516870 0.9248595
27: 0.221964923 A 0.4356419 0.9248595
28: 0.435641917 A 0.5363373 0.4516870
29: 0.237658906 A 0.5363373 0.4516870
30: 0.324597512 A 0.9710011 0.4356419
31: 0.357198893 B 0.4869905 0.9226573
32: 0.486990475 B 0.4869905 0.9226573
33: 0.115922994 B 0.4051843 0.9226573
34: 0.010581766 B 0.9338841 0.4869905
35: 0.003976893 B 0.9338841 0.4869905
36: 0.405184342 B 0.9338841 0.4051843
37: 0.412468699 B 0.4942280 0.9113595
38: 0.402063509 B 0.4942280 0.9113595
39: 0.494228013 B 0.8254665 0.4942280
40: 0.123264949 B 0.8254665 0.4942280
41: 0.251132449 B 0.4960371 0.9475821
42: 0.496037128 B 0.8845043 0.4960371
43: 0.250853014 B 0.3561290 0.9858652
44: 0.356129033 B 0.3603769 0.8429552
45: 0.225943145 B 0.7028077 0.3561290
46: 0.360376907 B 0.7159759 0.3603769
47: 0.169606203 B 0.3438164 0.9745535
48: 0.343816363 B 0.4396962 0.9745535
49: 0.067265545 B 0.9641856 0.3438164
50: 0.439696243 B 0.9641856 0.4396962
51: 0.024403516 B 0.3730828 0.9902976
52: 0.373082846 B 0.4713596 0.9902976
53: 0.290466668 B 0.9689225 0.3730828
54: 0.471359568 B 0.9689225 0.4713596
55: 0.402111615 B 0.4902595 0.8045104
56: 0.490259530 B 0.8801029 0.4902595
57: 0.477884140 B 0.4904800 0.6696598
58: 0.490480001 B 0.8396014 0.4904800
Speed AnimalID FSpeed BSpeed
This shows every row that belongs to a cluster: either the anchor row plus the following four, or the anchor row plus the previous four, all have speeds below our threshold (in this case 0.5).
In your code, just run DT <- as.data.table(myDF) where myDF is the name of the data.frame you are using.
This analysis assumes that GPS measurements are taken at constant intervals. Setting fill = 1 also effectively discards the first 4 and last 4 observations of each group, since the padded values can never fall below the threshold; when adapting this, set fill = to your maximum speed.
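A hedged sketch of wiring this into the original task (my extrapolation, not part of the answer above): per animal, find the first 5-point cluster of speeds below 0.8 starting after 21:30, then drop that cluster and everything after it. Column names (AnimalID, Speed, Date) follow the question's data; the name myDF and the assumptions that Date is POSIXct and that no track crosses midnight are mine.
library(data.table)

DT <- as.data.table(myDF)          # myDF: your data.frame (assumed name)
setorder(DT, AnimalID, Date)

# Rolling max over the current point plus the next four; fill = Inf means
# rows with fewer than 5 points ahead can never start a cluster.
DT[, FSpeed := Reduce(pmax, shift(Speed, 0:4, type = "lead", fill = Inf)),
   by = AnimalID]

# Keep only rows before the first cluster start after 21:30 in each track.
DT[, keep := {
  idx <- which(FSpeed < 0.8 & format(Date, "%H:%M:%S") >= "21:30:00")
  if (length(idx)) seq_len(.N) < idx[1L] else rep(TRUE, .N)
}, by = AnimalID]

result <- DT[(keep)][, keep := NULL]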

Related

R data.table apply date by variable last

I have a data.table in R and need to decrement a date starting from the last row within each group. In the example below, the date "2012-01-21" should be in the 10th row for id = "A", decrementing by one day back to the 1st row; for id = "B", "2012-01-21" should be in the 5th row, again decrementing by one day back to the first row. Basically, the decrement should start from the last row within each "id". How can I accomplish this with data.table?
The code below does the opposite: the date starts at the 1st row and decrements from there. How would you place the latest date in the last row of each group and decrement backwards?
library(data.table)
end <- as.Date("2012-01-21")
dt <- data.table(id = c(rep("A", 10), rep("B", 5)), sales = 10 + rnorm(15))
dtx <- dt[, date := seq(end, by = -1, length.out = .N), by = list(id)]
> dtx
id sales date
1: A 12.008514 2012-01-21
2: A 10.904740 2012-01-20
3: A 9.627039 2012-01-19
4: A 11.363810 2012-01-18
5: A 8.533913 2012-01-17
6: A 10.041074 2012-01-16
7: A 11.006845 2012-01-15
8: A 10.775066 2012-01-14
9: A 9.978509 2012-01-13
10: A 8.743829 2012-01-12
11: B 8.434640 2012-01-21
12: B 9.489433 2012-01-20
13: B 10.011354 2012-01-19
14: B 8.681002 2012-01-18
15: B 9.264915 2012-01-17
We could reverse the sequence generated above.
library(data.table)
dt[, date := rev(seq(end, by = -1, length.out = .N)), id]
dt
# id sales date
# 1: A 10.886312 2012-01-12
# 2: A 9.803543 2012-01-13
# 3: A 9.063694 2012-01-14
# 4: A 9.762628 2012-01-15
# 5: A 8.764109 2012-01-16
# 6: A 11.095826 2012-01-17
# 7: A 8.735148 2012-01-18
# 8: A 9.227285 2012-01-19
# 9: A 12.024336 2012-01-20
#10: A 9.976514 2012-01-21
#11: B 8.488753 2012-01-17
#12: B 9.141837 2012-01-18
#13: B 11.435365 2012-01-19
#14: B 10.817839 2012-01-20
#15: B 8.427098 2012-01-21
Similarly,
dt[, date := seq(end - .N + 1, by = 1, length.out = .N), id]
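Plain date arithmetic is another equivalent (a sketch under the same setup, with end and dt as defined in the question):
# The last row of each group gets 'end'; earlier rows count backwards from it.
dt[, date := end - (.N - seq_len(.N)), by = id]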

How do I prevent {data.table} foverlaps from feeding NAs into its any(...) call when executing on large data.tables?

First of all, a similar problem:
Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop
The story
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event. In total we consider three events: AC, CO and MT.
The data
Edit 1:
Here are two example datasets that allow the execution of the code below.
The code runs just fine for these sets. Once I have data that generates the error I'll make a second edit. Note that event.GN in the example dataset below is a data.table instead of a list.
library(data.table)
library(lubridate)  # for ymd_hms()

emissions.GN <- data.table(date.time = seq(ymd_hms("2016-01-01 00:00:00"), by = "min", length.out = 1000000))
event.GN <- data.table(dat = seq(ymd_hms("2016-01-01 00:00:00"), by = "15 mins", length.out = 26383))
Edit 2:
I created a csv file containing the data event.GN that generates the error. The file has 26383 rows of one variable dat but only about 14000 are necessary to generate the error.
Edit 3:
Up until the dat "2017-03-26 00:25:20" the function works fine. Right after adding the next record, with dat "2017-03-26 01:33:46", the error occurs. I noticed that there is more than 60 minutes between those two points, which means one or several emission records between them won't have corresponding events. This in turn will generate NAs that somehow get caught up in the any() call of the foverlaps function. Am I looking in the right direction?
The fluor emissions are stored in a large datatable (~1 million rows) called emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.
example of emissions.GN:
date.time fluor hall period dt
1: 2016-01-01 00:17:04 0.3044254 GN [2016-01-01,2016-02-21] -16.07373
2: 2016-01-01 00:17:04 0.4368381 GN [2016-01-01,2016-02-21] -16.07373
3: 2016-01-01 00:18:04 0.5655382 GN [2016-01-01,2016-02-21] -16.07395
4: 2016-01-01 00:19:04 0.6542259 GN [2016-01-01,2016-02-21] -16.07417
5: 2016-01-01 00:21:04 0.6579384 GN [2016-01-01,2016-02-21] -16.07462
The data of the three events is stored in three smaller datatables (~20 thousand records) contained in a list called events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.
example of AC events (CO and MT are analogous):
events.GN[["AC"]]
dat hall numevt txtevt
1: 2016-01-01 00:04:54 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
2: 2016-01-01 00:09:21 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
3: 2016-01-01 00:38:53 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
4: 2016-01-01 02:30:33 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
5: 2016-01-01 02:34:11 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
The function
I have written a function that applies foverlaps on a given (large) x datatable and a given (small) y datatable. The function returns a datatable with two columns. The first column yid contains the indices of emissions.GN observations that overlap at least once with an event. The second column N contains the overlap count (i.e. the number of times an overlap occurs for that particular index). The index of emissions that have zero overlaps are omitted from the result.
# A function to compute the number of times an emission record falls between
# the defined starting point and end point of an event.
library(magrittr)  # for %>%

find_index_and_count <- function(hall, event, lower.margin = 10, upper.margin = 30){
  # Give each record of the large emission dataset 'hall' a zero-width
  # interval, i.e. each record is a single time point, not an interval.
  hall$start <- hall$date.time
  hall$stop  <- hall$date.time
  # Give the small event data.tables intervals using the defined margins
  # of 10 and 30 minutes respectively.
  event$start <- event$dat - minutes(lower.margin)
  event$stop  <- event$dat + minutes(upper.margin)
  # Set the key of both datasets to be start and stop.
  setkey(hall, start, stop)
  setkey(event, start, stop)
  # Return the index of each emission record together with the number of event
  # intervals it falls in. The call to na.omit removes NAs introduced by
  # x records that don't fall within any y interval.
  foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
}
The function executes successfully for the events AC and CO
The function gives the desired result as described above when called on the events AC and CO:
find_index_and_count(emissions.GN,events.GN[["AC"]])
yid N
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
---
find_index_and_count(emissions.GN,events.GN[["CO"]])
yid N
1: 3 1
2: 4 1
3: 5 1
4: 6 1
5: 7 1
---
The function returns an error when called on the MT event
The following function call results in the error below:
find_index_and_count(emissions.GN,events.GN[["MT"]])
Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", : missing value where TRUE/FALSE needed
5.foverlaps(event, hall, nomatch = NA, which = TRUE)
4.eval(lhs, parent, parent)
3.eval(lhs, parent, parent)
2.foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
1.find_index_and_count(emissions.GN, events.GN[["MT"]])
I assume the function returns an NA whenever a record in x (emissions.GN) has no overlap with any of the events in y (events.GN[["AC"]] etc.).
I don't understand why the function fails on the event MT when it works just fine for AC and CO. The data have exactly the same structure; only the values and the number of records differ slightly.
What I have tried so far
Firstly, in the similar problem linked above, someone pointed out the following idea:
This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50
Hence, I modified the call to foverlaps to return 0 instead of NA whenever no overlap between x and y is found, like this:
foverlaps(event, hall, nomatch = 0, which = TRUE)[, .N, by = yid] %>% na.omit
This did not change anything (the function works for AC and CO but fails for MT).
Secondly, I made absolutely sure that none of my data.tables contained NAs.
More information
If required I can provide the SQL code that generates the emissions.GN data and all the events.GN data. Note that because all the events.GN data have the same origin, there should be no differences (other than the values) between the data of the events AC, CO and MT.
If anything else is required, please do feel free to ask!
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.
Just addressing this objective (since I don't know foverlaps well)...
event.GN[, n :=
  emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)],
               on = .(date.time >= d_dn, date.time <= d_up),
               .N,
               by = .EACHI]$N
]
dat n
1: 2016-01-01 00:00:00 31
2: 2016-01-01 00:15:00 41
3: 2016-01-01 00:30:00 41
4: 2016-01-01 00:45:00 41
5: 2016-01-01 01:00:00 41
---
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41
To check/verify one of these counts...
> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
dat n
1: 2016-01-02 00:30:00 41
>
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
date.time
1: 2016-01-02 00:20:00
2: 2016-01-02 00:21:00
3: 2016-01-02 00:22:00
4: 2016-01-02 00:23:00
5: 2016-01-02 00:24:00
6: 2016-01-02 00:25:00
7: 2016-01-02 00:26:00
8: 2016-01-02 00:27:00
9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
date.time
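An aside on the original error (my inference from the error message, not from the posts above): foverlaps() validates its interval columns with any(stop - start < 0L), so a single NA in either column makes any() return NA, which is exactly "missing value where TRUE/FALSE needed". A pre-check along these lines could confirm whether the MT events are the culprit:
library(data.table)
library(lubridate)

# Hypothetical diagnostic: rebuild the intervals the same way the function
# does and report rows with NA or inverted start/stop pairs.
mt <- copy(events.GN[["MT"]])
mt[, `:=`(start = dat - minutes(10), stop = dat + minutes(30))]
mt[is.na(start) | is.na(stop) | stop < start]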

Take unique values of a column and add each in unique column in same row as `by` in data.table

Apologies in advance...I couldn't articulate a better title.
Here is the problem:
I am working with a data.table and have grouped rows using 'by', which yields one row per unique value of the grouping column. For each unique 'by' value (in this example, 'lat_lon'), I want to take the unique values in another column (ID) and spread them across separate columns in the same row as that 'by' value.
Here is an example:
lat_lon ID
1: 42.04166667_-80.4375 26D25
2: 42.04166667_-80.4375 26D26
3: 42.04166667_-80.3125 26D34
4: 42.04166667_-80.3125 26D35
5: 42.04166667_-80.3125 26D36
6: 42.125_-80.1875 26D41
7: 42.125_-80.1875 27C46
8: 42.125_-80.1875 27D42
9: 42.04166667_-80.1875 26D43
10: 42.04166667_-80.1875 26D45
11: 42.04166667_-80.1875 27D44
12: 42.04166667_-80.1875 27D46
13: 42.29166667_-79.8125 27B76
14: 42.20833333_-80.0625 27C53
15: 42.20833333_-80.0625 27C54
16: 42.125_-80.0625 27C55
17: 42.125_-80.0625 27C56
18: 42.125_-80.0625 27D51
19: 42.125_-80.0625 27D52
What I really want is this:
lat_lon ID.1 ID.2 ID.3 ID.4 ID.5 ID.6 ID.7 ID.8 ID.9 ID.10
42.04166667_-80.4375 26D25 26D26 NA NA NA NA NA NA NA NA
42.04166667_-80.3125 26D34 26D35 26D36 NA NA NA NA NA NA NA
...
42.125_-80.0625 27C55 27C56 27D51 27D52 NA NA NA NA NA NA
Thank you for your patience and helpful comments.
For a data.table solution, adding an index column (rn) first and then pivoting with dcast.data.table helps:
dcast.data.table(dat[, rn := paste0("ID.", seq_len(.N)), by = .(lat_lon)],
                 lat_lon ~ rn, value.var = "ID")
# lat_lon ID.1 ID.2 ID.3 ID.4
# 1: 42.04166667_-80.1875 26D43 26D45 27D44 27D46
# 2: 42.04166667_-80.3125 26D34 26D35 26D36 NA
# 3: 42.04166667_-80.4375 26D25 26D26 NA NA
# 4: 42.125_-80.0625 27C55 27C56 27D51 27D52
# 5: 42.125_-80.1875 26D41 27C46 27D42 NA
# 6: 42.20833333_-80.0625 27C53 27C54 NA NA
# 7: 42.29166667_-79.8125 27B76 NA NA NA
data:
dat <- fread("lat_lon ID
42.04166667_-80.4375 26D25
42.04166667_-80.4375 26D26
42.04166667_-80.3125 26D34
42.04166667_-80.3125 26D35
42.04166667_-80.3125 26D36
42.125_-80.1875 26D41
42.125_-80.1875 27C46
42.125_-80.1875 27D42
42.04166667_-80.1875 26D43
42.04166667_-80.1875 26D45
42.04166667_-80.1875 27D44
42.04166667_-80.1875 27D46
42.29166667_-79.8125 27B76
42.20833333_-80.0625 27C53
42.20833333_-80.0625 27C54
42.125_-80.0625 27C55
42.125_-80.0625 27C56
42.125_-80.0625 27D51
42.125_-80.0625 27D52")
This is a departure from data.table (not that it can't be done there, I'm sure, but I'm less familiar with it) into the tidyverse:
require(tidyr)
require(dplyr)

wide_data <- dat %>%
  group_by(lat_lon) %>%
  mutate(IDno = paste0("ID.", row_number())) %>%
  spread(IDno, ID)
This assumes that there are no duplicated lines with an ID repeated for a lat_lon. If that isn't the case, you could add distinct() to the chain before the grouping, as sketched below.
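A minimal sketch of that suggestion (assuming dat as built above; note that in current tidyr, pivot_wider() supersedes spread()):
library(dplyr)
library(tidyr)

wide_data <- dat %>%
  distinct(lat_lon, ID) %>%                      # drop duplicated pairs first
  group_by(lat_lon) %>%
  mutate(IDno = paste0("ID.", row_number())) %>%
  spread(IDno, ID)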

rbindlist and nested data.table, different behavior with/without using get

I am loading some JSON data using jsonlite which is resulting in some nested data similar (in structure) to the toy data.table dt constructed below. I want to be able to use rbindlist to bind the nested data.tables together.
Setup:
> dt <- data.table(a=c("abc", "def", "ghi"), b=runif(3))
> dt[, c:=list(list(data.table(d=runif(4), e=runif(4))))]
> dt
a b c
1: abc 0.2623218 <data.table>
2: def 0.7092507 <data.table>
3: ghi 0.2795103 <data.table>
Using the NSE built into data.table, I can do:
> rbindlist(dt[, c])
d e
1: 0.8420476 0.26878325
2: 0.1704087 0.59654706
3: 0.6023655 0.42590380
4: 0.9528841 0.06121386
5: 0.8420476 0.26878325
6: 0.1704087 0.59654706
7: 0.6023655 0.42590380
8: 0.9528841 0.06121386
9: 0.8420476 0.26878325
10: 0.1704087 0.59654706
11: 0.6023655 0.42590380
12: 0.9528841 0.06121386
which is exactly what I expect/want. Furthermore, the original dt remains unmodified:
> dt
a b c
1: abc 0.2623218 <data.table>
2: def 0.7092507 <data.table>
3: ghi 0.2795103 <data.table>
However, when manipulating the data.table within a function I generally want to use get with string column names:
> rbindlist(dt[, get("c")])
V1 V2
1: 0.8420476 0.26878325
2: 0.1704087 0.59654706
3: 0.6023655 0.42590380
4: 0.9528841 0.06121386
5: 0.8420476 0.26878325
6: 0.1704087 0.59654706
7: 0.6023655 0.42590380
8: 0.9528841 0.06121386
9: 0.8420476 0.26878325
10: 0.1704087 0.59654706
11: 0.6023655 0.42590380
12: 0.9528841 0.06121386
Now the column names have been lost and replaced by the default "V1" and "V2" values. Is there a way to retain the names?
In the development version (v1.9.5) the problem is worse than simply lost names, though. After executing the statement rbindlist(dt[, get("c")]), the entire data.table becomes corrupt:
> dt
Error in FUN(X[[3L]], ...) :
Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.
To be clear, the lost names issue happens in both v1.9.4 (installed from CRAN) and v1.9.5 (installed from github), but the corrupt data.table issue seems to affect v1.9.5 only (as of today - July 8, 2015).
If I stick with the NSE version of things, everything runs smoothly. My issue is that sticking with the NSE version would involve writing multiple NSE functions calling each other, which gets messy pretty fast.
Are there any (non-NSE-based) known work-arounds? Also, is this a known issue?
This must have been fixed in the 5 years since this question was asked. I am now getting the expected results.
> library(data.table)
data.table 1.13.3 IN DEVELOPMENT built 2020-11-17 18:11:47 UTC; jan using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
> dt <- data.table(a=c("abc", "def", "ghi"), b=runif(3))
> dt[, c:=list(list(data.table(d=runif(4), e=runif(4))))]
> dt
a b c
1: abc 0.2416624 <data.table[4x2]>
2: def 0.0222938 <data.table[4x2]>
3: ghi 0.3510681 <data.table[4x2]>
> rbindlist(dt[, c])
d e
1: 0.5485731 0.32366420
2: 0.5457945 0.45173251
3: 0.6796699 0.03783026
4: 0.4442776 0.03121024
5: 0.5485731 0.32366420
6: 0.5457945 0.45173251
7: 0.6796699 0.03783026
8: 0.4442776 0.03121024
9: 0.5485731 0.32366420
10: 0.5457945 0.45173251
11: 0.6796699 0.03783026
12: 0.4442776 0.03121024
> rbindlist(dt[, get("c")])
d e
1: 0.5485731 0.32366420
2: 0.5457945 0.45173251
3: 0.6796699 0.03783026
4: 0.4442776 0.03121024
5: 0.5485731 0.32366420
6: 0.5457945 0.45173251
7: 0.6796699 0.03783026
8: 0.4442776 0.03121024
9: 0.5485731 0.32366420
10: 0.5457945 0.45173251
11: 0.6796699 0.03783026
12: 0.4442776 0.03121024
> dt
a b c
1: abc 0.2416624 <data.table[4x2]>
2: def 0.0222938 <data.table[4x2]>
3: ghi 0.3510681 <data.table[4x2]>
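For older versions where get() dropped the names, one non-NSE workaround (my sketch, not from the original posts) is to extract the list column with [[, which bypasses [.data.table entirely:
# dt[["c"]] returns the list column itself, so rbindlist() sees the original
# nested data.tables with their column names intact.
rbindlist(dt[["c"]])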

Subset data.table by evaluating multiple columns

How do I return one row for each unique Name and Type combination, keeping the most recent (latest) Date?
A data.table with 6 rows:
example <- data.table(c("Bob", "May", "Sue", "Bob", "Sue", "Bob"),
                      c("A", "A", "A", "A", "B", "B"),
                      as.Date(c("2010/01/01", "2010/01/01", "2010/01/01",
                                "2012/01/01", "2012/01/11", "2014/01/01")))
setnames(example, c("Name", "Type", "Date"))
setkey(example, Name, Date)
Should return 5 rows:
# 1: Bob A 2012-01-01
# 2: Bob B 2014-01-01
# 3: May A 2010-01-01
# 4: Sue A 2010-01-01
# 5: Sue B 2012-01-11
Since you've already sorted by Name and Date, you can use the unique function (which dispatches to unique.data.table) on the columns Name and Type, with fromLast = TRUE.
require(data.table) ## >= v1.9.3
unique(example, by=c("Name", "Type"), fromLast=TRUE)
# Name Type Date
# 1: Bob A 2012-01-01
# 2: Bob B 2014-01-01
# 3: May A 2010-01-01
# 4: Sue A 2010-01-01
# 5: Sue B 2012-01-11
This'll pick the last row for each Name, Type group. Hope this helps.
PS: As @mso points out, this needs v1.9.3, because the fromLast argument was implemented only in 1.9.3 (available from github).
The following variations of @Arun's answer also work:
unique(example[rev(order(Name, Date))], by = c("Name", "Type"), fromLast = TRUE)[order(Name, Date)]
Name Type Date
1: Bob A 2012-01-01
2: Bob B 2014-01-01
3: May A 2010-01-01
4: Sue A 2010-01-01
5: Sue B 2012-01-11
unique(example[order(Name, Date, decreasing = TRUE)], by = c("Name", "Type"))[order(Name, Date)]
Name Type Date
1: Bob A 2012-01-01
2: Bob B 2014-01-01
3: May A 2010-01-01
4: Sue A 2010-01-01
5: Sue B 2012-01-11
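An equivalent that doesn't rely on the key order or on fromLast (a sketch, assuming example as defined in the question):
# Pick the row with the latest Date within each Name/Type group, then sort.
example[, .SD[which.max(Date)], by = .(Name, Type)][order(Name, Date)]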
