I am looking to calculate time differences for different groups based on start-of-work and end-of-work times. How can I tell R to compute difftime between two rows based on their status labels within a group? Below is a sample data set:
library(data.table)
latemail <- function(N, st = "2012/01/01", et = "2012/02/01") {
  # draw N random timestamps between st and et, sorted ascending
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "secs"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
  rt
}
#create our data frame
set.seed(42)
dt = latemail(20)
work = setDT(as.data.frame(dt))
work[,worker:= stringi::stri_rand_strings(2, 5)]
work[,dt:= as.POSIXct(as.character(work$dt), tz = "GMT")]
work[, status := NA_character_]
#order
setorder(work, worker, dt)
#add work times
work$status[1] = "start"
work$status[5] = "end"
work$status[6] = "start"
work$status[10] = "end"
work$status[11] = "start"
work$status[15] = "end"
work$status[16] = "start"
work$status[20] = "end"
The table now looks like this:
dt worker status
1: 2012-01-04 23:11:31 VOuRp start
2: 2012-01-09 15:53:16 VOuRp NA
3: 2012-01-15 02:56:45 VOuRp NA
4: 2012-01-16 21:12:26 VOuRp NA
5: 2012-01-20 16:27:31 VOuRp end
6: 2012-01-22 15:34:05 VOuRp start
7: 2012-01-23 15:01:18 VOuRp NA
8: 2012-01-29 03:36:56 VOuRp NA
9: 2012-01-29 20:11:02 VOuRp NA
10: 2012-01-31 02:48:01 VOuRp end
11: 2012-01-04 10:24:38 u8zw5 start
12: 2012-01-08 17:02:20 u8zw5 NA
13: 2012-01-14 23:33:35 u8zw5 NA
14: 2012-01-15 12:23:52 u8zw5 NA
15: 2012-01-18 03:53:15 u8zw5 end
16: 2012-01-21 03:48:08 u8zw5 start
17: 2012-01-23 02:01:10 u8zw5 NA
18: 2012-01-26 12:51:10 u8zw5 NA
19: 2012-01-29 18:23:46 u8zw5 NA
20: 2012-01-29 22:22:14 u8zw5 end
Answer I'm looking for:
Ultimately I would like to get the values below (labeled worker 1 and worker 2 only because I wasn't sure how to do the equivalent of set.seed() for stringi). The following code gives me the first shift for worker 1, but I'd like each shift for each worker:
difftime(as.POSIXct("2012-01-20 16:27:31"), as.POSIXct("2012-01-04 23:11:31"), units = "hours")
          work time difference in hours
worker 1  377.2667 hours
worker 2  . . .
In this example I have an even number of rows per worker, but assuming I have a variable number of rows between different workers, what would that look like? I'm assuming some sort of difftime formula? I would prefer a data.table solution as I am working with large data.
Here is a solution using data.table:
work[status %in% c("start", "end"),
time.diff := ifelse(status == "start",
difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"), NA),
by = worker][status == "start", sum(time.diff), worker]
we get:
worker V1
1: VOuRp 580.4989
2: u8zw5 540.0453
where V1 holds the sum of hours over all start-to-end intervals for each worker.
Let's go through it step by step.
STEP 1: Select all rows with start or end status:
work.se <- work[status %in% c("start", "end")]
dt worker status
1: 2012-01-04 23:11:31 VOuRp start
2: 2012-01-20 16:27:31 VOuRp end
3: 2012-01-22 15:34:05 VOuRp start
4: 2012-01-31 02:48:01 VOuRp end
5: 2012-01-04 10:24:38 u8zw5 start
6: 2012-01-18 03:53:15 u8zw5 end
7: 2012-01-21 03:48:08 u8zw5 start
8: 2012-01-29 22:22:14 u8zw5 end
STEP 2: Create a function for calculating the time differences between the current row and the next one. This function will be invoked inside the data.table object. We use the shift function from the same package:
getDiff <- function(x) {
difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
}
getDiff computes the time difference between the next record (within the group) and the current one. It assigns NA to the last row because there is no next value; we then exclude the NA values from the calculation.
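On its own, getDiff behaves like this (a toy vector, not the question's data):
x <- as.POSIXct(c("2012-01-01 00:00:00", "2012-01-01 06:00:00", "2012-01-01 07:30:00"), tz = "GMT")
getDiff(x)
# time differences of 6 hours, 1.5 hours and NA (the last element has no next value)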
STEP 3: Invoke it within the data.table syntax:
work.result <- work.se[, time.diff := ifelse(status == "start",
getDiff(dt), NA), by = worker]
we get this:
dt worker status time.diff
1: 2012-01-04 23:11:31 VOuRp start 377.2667
2: 2012-01-20 16:27:31 VOuRp end NA
3: 2012-01-22 15:34:05 VOuRp start 203.2322
4: 2012-01-31 02:48:01 VOuRp end NA
5: 2012-01-04 10:24:38 u8zw5 start 329.4769
6: 2012-01-18 03:53:15 u8zw5 end NA
7: 2012-01-21 03:48:08 u8zw5 start 210.5683
8: 2012-01-29 22:22:14 u8zw5 end NA
STEP 4: Sum the non-NA values of the time.diff column for each worker:
> work.result[status == "start", sum(time.diff), worker]
worker V1
1: VOuRp 580.4989
2: u8zw5 540.0453
>
data.table calls can be chained by appending [], so the last part can be consolidated into a single statement:
work.se[, time.diff := ifelse(status == "start",
getDiff(dt), NA), by = worker][status == "start", sum(time.diff), worker]
FINAL: Putting it all together into a single statement:
work[status %in% c("start", "end"),
time.diff := ifelse(status == "start",
difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"), NA),
by = worker][status == "start", sum(time.diff), worker]
Check this link for data.table basic syntax.
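Since the question also asks for each shift separately (not only the totals), here is a hedged variation of the same idea that keeps one row per start/end pair; it assumes every "start" is followed by its matching "end" within each worker, and the shifts / shift_hours names are purely illustrative:
shifts <- work[status %in% c("start", "end"),
               .(shift_start = dt[status == "start"],
                 shift_end   = dt[status == "end"]),
               by = worker]
shifts[, shift_hours := as.numeric(difftime(shift_end, shift_start, units = "hours"))]
shifts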
I hope this helps; please let us know if it is what you wanted.
Related
I have a data.table in R. I need the date to count down from a fixed end date within each group. So in the example below, the date "2012-01-21" should be on the 10th row when id = "A" and then decrement until the 1st row; for id = "B" the date should be "2012-01-21" on the 5th row and then decrement by 1 until it reaches the first row. Basically the decrement should start from the last row within each "id". How can I accomplish this with R data.table?
The code below does the opposite: the date starts at the 1st row and decrements from there. How would you make the date start at the last row and decrement going up?
end<- as.Date("2012-01-21")
dt <- data.table(id = c(rep("A",10),rep("B",5)),sales=10+rnorm(15))
dtx <- dt[,date := seq(end,by = -1,length.out = .N),by=list(id)]
> dtx
id sales date
1: A 12.008514 2012-01-21
2: A 10.904740 2012-01-20
3: A 9.627039 2012-01-19
4: A 11.363810 2012-01-18
5: A 8.533913 2012-01-17
6: A 10.041074 2012-01-16
7: A 11.006845 2012-01-15
8: A 10.775066 2012-01-14
9: A 9.978509 2012-01-13
10: A 8.743829 2012-01-12
11: B 8.434640 2012-01-21
12: B 9.489433 2012-01-20
13: B 10.011354 2012-01-19
14: B 8.681002 2012-01-18
15: B 9.264915 2012-01-17
We could reverse the sequence generated above.
library(data.table)
dt[,date := rev(seq(end,by = -1,length.out = .N)),id]
dt
# id sales date
# 1: A 10.886312 2012-01-12
# 2: A 9.803543 2012-01-13
# 3: A 9.063694 2012-01-14
# 4: A 9.762628 2012-01-15
# 5: A 8.764109 2012-01-16
# 6: A 11.095826 2012-01-17
# 7: A 8.735148 2012-01-18
# 8: A 9.227285 2012-01-19
# 9: A 12.024336 2012-01-20
#10: A 9.976514 2012-01-21
#11: B 8.488753 2012-01-17
#12: B 9.141837 2012-01-18
#13: B 11.435365 2012-01-19
#14: B 10.817839 2012-01-20
#15: B 8.427098 2012-01-21
Similarly,
dt[,date := seq(end - .N + 1,by = 1,length.out = .N),id]
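As a quick sanity check (reusing end and dt from above), both formulations generate identical date sequences within each group:
dt[, identical(rev(seq(end, by = -1, length.out = .N)),
               seq(end - .N + 1, by = 1, length.out = .N)), by = id]
# should return TRUE for every id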
First of all, a similar problem:
Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop
The story
I'm trying to count how many times fluor emissions (measured every minute) overlap with a given event. An emission is said to overlap with a given event when the emission time falls between 10 minutes before and 30 minutes after the time of the event. In total we consider three events: AC, CO and MT.
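As a minimal illustration of that rule (with made-up timestamps), an emission overlaps an event at time t when it falls inside the window [t - 10 min, t + 30 min]:
library(lubridate)
event_time    <- ymd_hms("2016-01-01 00:04:54")
emission_time <- ymd_hms("2016-01-01 00:00:00")
# TRUE when the emission falls within [event - 10 min, event + 30 min]
emission_time >= event_time - minutes(10) & emission_time <= event_time + minutes(30)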
The data
Edit 1:
Here are two example datasets that allow the execution of the code below.
The code runs just fine for these sets. Once I have data that generates the error I'll make a second edit. Note that event.GN in the example dataset below is a data.table instead of a list.
library(data.table)
library(lubridate)
emissions.GN <- data.table(date.time = seq(ymd_hms("2016-01-01 00:00:00"), by = "min", length.out = 1000000))
event.GN <- data.table(dat = seq(ymd_hms("2016-01-01 00:00:00"), by = "15 mins", length.out = 26383))
Edit 2:
I created a csv file containing the data event.GN that generates the error. The file has 26383 rows of one variable dat but only about 14000 are necessary to generate the error.
Edit 3:
Up until the dat "2017-03-26 00:25:20" the function works fine. Right after adding the next record with dat "2017-03-26 01:33:46" the error occurs. I noticed that between those points there is more than 60 minutes. This means that between those two event times one or several emission records won't have corresponding events. This in turn will generate NA's that somehow get caught up in the any() call of the foverlaps function. Am I looking in the right direction?
The fluor emissions are stored in a large datatable (~1 million rows) called emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.
example of emissions.GN:
date.time fluor hall period dt
1: 2016-01-01 00:17:04 0.3044254 GN [2016-01-01,2016-02-21] -16.07373
2: 2016-01-01 00:17:04 0.4368381 GN [2016-01-01,2016-02-21] -16.07373
3: 2016-01-01 00:18:04 0.5655382 GN [2016-01-01,2016-02-21] -16.07395
4: 2016-01-01 00:19:04 0.6542259 GN [2016-01-01,2016-02-21] -16.07417
5: 2016-01-01 00:21:04 0.6579384 GN [2016-01-01,2016-02-21] -16.07462
The data of the three events is stored in three smaller datatables (~20 thousand records) contained in a list called events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.
example of AC events (CO and MT are analogous):
events.GN[["AC"]]
dat hall numevt txtevt
1: 2016-01-01 00:04:54 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
2: 2016-01-01 00:09:21 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
3: 2016-01-01 00:38:53 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
4: 2016-01-01 02:30:33 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
5: 2016-01-01 02:34:11 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
The function
I have written a function that applies foverlaps on a given (large) x datatable and a given (small) y datatable. The function returns a datatable with two columns. The first column yid contains the indices of emissions.GN observations that overlap at least once with an event. The second column N contains the overlap count (i.e. the number of times an overlap occurs for that particular index). The index of emissions that have zero overlaps are omitted from the result.
# A function to compute the number of times an emission record falls between
# the defined starting point and end point of an event.
# Requires data.table (foverlaps), lubridate (minutes) and magrittr/dplyr (%>%).
find_index_and_count <- function(hall, event, lower.margin = 10, upper.margin = 30) {
  # Give the large emission dataset hall zero-width intervals, i.e. each record
  # is a single time point, not an interval.
  hall$start <- hall$date.time
  hall$stop  <- hall$date.time
  # Give the small event datatable start/stop columns equal to the defined
  # margins of 10 and 30 minutes respectively.
  event$start <- event$dat - minutes(lower.margin)
  event$stop  <- event$dat + minutes(upper.margin)
  # Set the key of both datasets to start and stop.
  setkey(hall, start, stop)
  setkey(event, start, stop)
  # Return the index of each emission record together with how many event
  # intervals it falls into. The call to na.omit removes NA's introduced by
  # records that don't fall within any interval.
  foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
}
The function executes successfully for the events AC and CO
The function gives the desired result as described above when called on the events AC and CO:
find_index_and_count(emissions.GN,events.GN[["AC"]])
yid N
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
---
find_index_and_count(emissions.GN,events.GN[["CO"]])
yid N
1: 3 1
2: 4 1
3: 5 1
4: 6 1
5: 7 1
---
The function returns an error when called on the MT event
The following function call results in the error below:
find_index_and_count(emissions.GN,events.GN[["MT"]])
Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", : missing value where TRUE/FALSE needed
5.foverlaps(event, hall, nomatch = NA, which = TRUE)
4.eval(lhs, parent, parent)
3.eval(lhs, parent, parent)
2.foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
1.find_index_and_count(emissions.GN, events.GN[["MT"]])
I assume the function returns an NA whenever a record in x (emissions.GN) has no overlap with any of the events in y (events.GN[["AC"]] etc.).
I don't understand why the function fails on the event MT when it works just fine for AC and CO. The data are structured exactly the same, apart from the values and a slightly different number of records.
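That assumption can be reproduced on a toy example (made-up intervals, not my actual data): with which = TRUE and nomatch = NA, an x row that overlaps nothing in y gets NA in yid.
library(data.table)
x <- data.table(start = c(1L, 10L), stop = c(2L, 11L))  # second interval matches nothing in y
y <- data.table(start = 1L, stop = 5L)
setkey(x, start, stop)
setkey(y, start, stop)
foverlaps(x, y, which = TRUE, nomatch = NA)  # second row of the result has yid = NA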
What I have tried so far
Firstly, in the similar problem linked above, someone pointed out the following idea:
This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50
Hence, I modified the call to foverlaps to return 0 instead of NA whenever no overlap between x and y is found, like this:
foverlaps(event,hall,nomatch = 0, which = TRUE)[, .N, by=yid] %>% na.omit
This did not change anything (the function works for AC and CO but fails for MT).
Secondly, I made absolutely sure that none of my datatables contained NA's.
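A further check, mirroring how the function builds its intervals, is to look for NAs in the derived start/stop columns of the MT events specifically, since those are the columns foverlaps validates:
library(lubridate)
ev <- copy(events.GN[["MT"]])
ev[, `:=`(start = dat - minutes(10), stop = dat + minutes(30))]
ev[is.na(start) | is.na(stop)]  # any rows here would trip foverlaps' sanity check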
More information
If required I can provide the SQL code that generates the emissions.GN data and all the events.GN data. Note that because all the events.GN data has the same origin, there should be no differences (other than the values) between the data of the events AC, CO and MT.
If anything else is required, please do feel free to ask!
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.
Just addressing this objective (since I don't know foverlaps well.)...
event.GN[, n :=
  emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)],
               on = .(date.time >= d_dn, date.time <= d_up),
               .N,
               by = .EACHI]$N
]
dat n
1: 2016-01-01 00:00:00 31
2: 2016-01-01 00:15:00 41
3: 2016-01-01 00:30:00 41
4: 2016-01-01 00:45:00 41
5: 2016-01-01 01:00:00 41
---
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41
To check/verify one of these counts...
> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
dat n
1: 2016-01-02 00:30:00 41
>
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
date.time
1: 2016-01-02 00:20:00
2: 2016-01-02 00:21:00
3: 2016-01-02 00:22:00
4: 2016-01-02 00:23:00
5: 2016-01-02 00:24:00
6: 2016-01-02 00:25:00
7: 2016-01-02 00:26:00
8: 2016-01-02 00:27:00
9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
date.time
I have a large panel of firms' stock returns and the value-weighted S&P 500 return. I want to apply a rolling window regression, where I regress the firm's returns of the previous twelve months on the value-weighted S&P 500 return, and extract the standard deviation of the residuals.
My code looks as follows
stdev <- matrix(NA, nrow = nrow(ReturnMatrix), ncol = 1)
pb <- winProgressBar(title = "", label = "", min = 1, max = nrow(ReturnMatrix) - 11)
for (i in 1:(nrow(ReturnMatrix) - 11)) {
  VWRet <- ReturnMatrix$VWReturn[i:(i + 11)]
  Ret   <- ReturnMatrix$Return[i:(i + 11)]
  if (sum(is.na(Ret)) >= 6) {
    stdev[i + 11] <- NA
  } else {
    Model <- glm(Ret ~ VWRet - 1)
    stdev[i + 11] <- sigma(Model)
  }
  setWinProgressBar(pb, value = i,
                    title = paste0(round(100 * (i / (nrow(ReturnMatrix) - 11)), 2), " % Done"))
}
SD <- cbind.data.frame(ReturnMatrix,stdev)
The dataframe ReturnMatrix is very large: it contains 3,239,065 rows. The variables in the dataframe are PERMNO, a firm identifier; YearMonth, the date in YYYYMM format; Return, the firm's return for that month; and VWReturn, the value-weighted S&P 500 return.
Right now, it takes about 1 hour to run this for loop.
My question is: is there any way to speed this process up? I have tried using rollapply on zoo(ReturnMatrix), but this only slows it down even more.
Any help would be greatly appreciated.
Here's how to do that with data.table, which should be the fastest way to do what you want. You first need to build a sigma function and then use rollapplyr (from the zoo package) with .SD.
set.seed(1)
library(data.table)
library(zoo)
dt <- data.table(PERMNO = rep(LETTERS[1:3], each = 13),
                 YearMonth = seq.Date(from = Sys.Date(), by = "month", length.out = 13),
                 Return = runif(39), VWReturn = runif(39))
# create sigma function
stdev <- function(x) sd(lm(x[, 1] ~ x[, 2])$residuals)
# create new column with rollapply
dt[, roll_sd := rollapplyr(.SD, 12, stdev, by.column = FALSE, fill = NA),
   by = .(PERMNO), .SDcols = c("Return", "VWReturn")]
PERMNO YearMonth Return VWReturn roll_sd
1: A 2017-11-19 0.26550866 0.41127443 NA
2: A 2017-12-19 0.37212390 0.82094629 NA
3: A 2018-01-19 0.57285336 0.64706019 NA
4: A 2018-02-19 0.90820779 0.78293276 NA
5: A 2018-03-19 0.20168193 0.55303631 NA
6: A 2018-04-19 0.89838968 0.52971958 NA
7: A 2018-05-19 0.94467527 0.78935623 NA
8: A 2018-06-19 0.66079779 0.02333120 NA
9: A 2018-07-19 0.62911404 0.47723007 NA
10: A 2018-08-19 0.06178627 0.73231374 NA
11: A 2018-09-19 0.20597457 0.69273156 NA
12: A 2018-10-19 0.17655675 0.47761962 0.3181427
13: A 2018-11-19 0.68702285 0.86120948 0.3141638
14: B 2017-11-19 0.38410372 0.43809711 NA
....
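To carry this over to the question's ReturnMatrix, a hedged sketch (not part of the answer above) could fold in the rule of returning NA when 6 or more of the 12 returns are missing; PERMNO, Return and VWReturn are the column names from the question, and stdev_na is just an illustrative name:
library(data.table)
library(zoo)
# residual standard deviation of the no-intercept regression, or NA when 6 or
# more of the 12 returns in the window are missing
stdev_na <- function(x) {
  if (sum(is.na(x[, 1])) >= 6) return(NA_real_)
  sd(lm(x[, 1] ~ x[, 2] - 1)$residuals)
}
setDT(ReturnMatrix)
ReturnMatrix[, roll_sd := rollapplyr(.SD, 12, stdev_na, by.column = FALSE, fill = NA),
             by = PERMNO, .SDcols = c("Return", "VWReturn")]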
I have a data.table dt, which looks like:
> dt[1:20, c("p_date", "p_time")]
p_date p_time
1: 20170422 0916
2: 20170421 1011
3: 20170112 1528
4: 20170318 1111
5: 20170322 0957
6: 20170321 1115
7: 20170304 1532
8: 20170322 1417
9: 20170401 1242
10: 20170321 1812
11: 20170401 1821
12: 20170401 1509
13: 20170320 1655
14: 20170401 1518
15: 20170320 1444
16: 20170401 1712
17: 20170317 1021
18: 20170322 1816
19: 20170305 1056
20: 20170319 1428
I want to find out which dates are missing from the values of column p_date of table dt.
Here the date is in the format yyyymmdd; I want to find the missing dates between the minimum and the maximum date values present in the column.
The output should be a data.table with one column containing the missing date values.
How can I do this with data.table in R?
You could define a vector of dates between your minimum and your maximum date like this:
dateRangeVec <- range(as.Date(as.character(dt$p_date), format = "%Y%m%d"))
allDatesVec <- format(seq(from = dateRangeVec[1], to = dateRangeVec[2], by = "days"),
                      "%Y%m%d")
You can then filter all the dates that are not in your data table using setdiff:
outDt <- data.table(p_date = setdiff(allDatesVec, dt$p_date))
We can use a join on 'p_date' (after converting the column to Date class) by creating another dataset with the full range of 'p_date'
dt[, p_date := lubridate::ymd(p_date)]
dt1 <- data.table(p_date = seq(min(dt$p_date), max(dt$p_date), by = '1 day'))
dt[dt1, on = 'p_date'][is.na(p_time), p_date]
Or another option is to use anti_join from dplyr
library(dplyr)
anti_join(dt1, dt, by = 'p_date')
I'm working with a GPS tracking dataset, and I've been playing around with filtering the dataset based on speed and time of day. The species I am working with becomes inactive around dusk, during which it rests on the ocean's surface, but then resumes activity once night has fallen. For each animal in the dataset, I would like to remove all data points after it initially becomes inactive around dusk (21:30). But because each animal becomes inactive at different times, I cannot simply filter out all the data points occurring after 21:30.
My data looks like this...
AnimalID Latitude Longitude Speed Date
99B 50.86190 -129.0875 5.6 2015-05-14 21:26:00
99B 50.86170 -129.0875 0.6 2015-05-14 21:32:00
99B 50.86150 -129.0810 0.5 2015-05-14 21:33:00
99B 50.86140 -129.0800 0.3 2015-05-14 21:40:00
99C.......
Essentially, I want to find a cluster of GPS positions (say, a minimum of 5), occurring after 21:30:00, that all have speeds of <0.8. I then want to delete all points after this point (including the identified cluster).
Does anyone know a way of identifying clusters of points in R? Or is this type of filtering WAY too complex?
Using data.table, you can use a rolling forward/backwards max to find the max of the next five or previous five entries by animal ID. Then, filter out any that don't meet the criteria. For example:
library(data.table)
set.seed(40)
DT <- data.table(Speed = runif(1:1000), AnimalID = rep(c("A","B"), each = 500))
DT[ , FSpeed := Reduce(pmax,shift(Speed,0:4, type = "lead", fill = 1)), by = .(AnimalID)] #0 + 4 forward
DT[ , BSpeed := Reduce(pmax,shift(Speed,0:4, type = "lag", fill = 1)), by = .(AnimalID)] #0 + 4 backwards
DT[FSpeed < 0.5 | BSpeed < 0.5] #min speed
Speed AnimalID FSpeed BSpeed
1: 0.220509197 A 0.4926640 0.8897597
2: 0.225883211 A 0.4926640 0.8897597
3: 0.264809801 A 0.4926640 0.6648507
4: 0.184270587 A 0.4926640 0.6589303
5: 0.492664002 A 0.4926640 0.4926640
6: 0.472144689 A 0.4721447 0.4926640
7: 0.254635219 A 0.7409803 0.4926640
8: 0.281538568 A 0.7409803 0.4926640
9: 0.304875597 A 0.7409803 0.4926640
10: 0.059605991 A 0.7409803 0.4721447
11: 0.132069268 A 0.2569604 0.9224052
12: 0.256960449 A 0.2569604 0.9224052
13: 0.005059727 A 0.8543111 0.2569604
14: 0.191478376 A 0.8543111 0.2569604
15: 0.170969244 A 0.4398143 0.7927442
16: 0.059577719 A 0.4398143 0.7927442
17: 0.439814267 A 0.4398143 0.7927442
18: 0.307714603 A 0.9912536 0.4398143
19: 0.075750773 A 0.9912536 0.4398143
20: 0.100589403 A 0.9912536 0.4398143
21: 0.032957748 A 0.4068012 0.7019594
22: 0.080091554 A 0.4068012 0.7019594
23: 0.406801193 A 0.9761119 0.4068012
24: 0.057445020 A 0.9761119 0.4068012
25: 0.308382143 A 0.4516870 0.9435490
26: 0.451686996 A 0.4516870 0.9248595
27: 0.221964923 A 0.4356419 0.9248595
28: 0.435641917 A 0.5363373 0.4516870
29: 0.237658906 A 0.5363373 0.4516870
30: 0.324597512 A 0.9710011 0.4356419
31: 0.357198893 B 0.4869905 0.9226573
32: 0.486990475 B 0.4869905 0.9226573
33: 0.115922994 B 0.4051843 0.9226573
34: 0.010581766 B 0.9338841 0.4869905
35: 0.003976893 B 0.9338841 0.4869905
36: 0.405184342 B 0.9338841 0.4051843
37: 0.412468699 B 0.4942280 0.9113595
38: 0.402063509 B 0.4942280 0.9113595
39: 0.494228013 B 0.8254665 0.4942280
40: 0.123264949 B 0.8254665 0.4942280
41: 0.251132449 B 0.4960371 0.9475821
42: 0.496037128 B 0.8845043 0.4960371
43: 0.250853014 B 0.3561290 0.9858652
44: 0.356129033 B 0.3603769 0.8429552
45: 0.225943145 B 0.7028077 0.3561290
46: 0.360376907 B 0.7159759 0.3603769
47: 0.169606203 B 0.3438164 0.9745535
48: 0.343816363 B 0.4396962 0.9745535
49: 0.067265545 B 0.9641856 0.3438164
50: 0.439696243 B 0.9641856 0.4396962
51: 0.024403516 B 0.3730828 0.9902976
52: 0.373082846 B 0.4713596 0.9902976
53: 0.290466668 B 0.9689225 0.3730828
54: 0.471359568 B 0.9689225 0.4713596
55: 0.402111615 B 0.4902595 0.8045104
56: 0.490259530 B 0.8801029 0.4902595
57: 0.477884140 B 0.4904800 0.6696598
58: 0.490480001 B 0.8396014 0.4904800
Speed AnimalID FSpeed BSpeed
This shows all the clusters where either the anchor row plus the following four, or the anchor row plus the previous four, all have speeds below our minimum speed (in this case 0.5).
In your code, just run DT <- as.data.table(myDF) where myDF is the name of the data.frame you are using.
For this analysis, we assume that GPS measurements are taken at constant intervals. I am also effectively throwing out the first 4 and last 4 observations of each animal by setting fill = 1; you should set fill = to your maximum speed.
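To then act on the second part of the question (deleting everything from the first qualifying post-dusk cluster onward), a hedged sketch could look like the following; it assumes the real data also has the Date (POSIXct) and Speed columns shown in the question, uses the 21:30 and speed < 0.8 thresholds from the question, and the after_dusk, slow5 and keep names are purely illustrative:
# flag points at or after 21:30 local time
DT[, after_dusk := format(Date, "%H:%M:%S") >= "21:30:00"]
# TRUE where this row starts a run of 5 consecutive points all slower than 0.8
DT[, slow5 := Reduce(pmax, shift(Speed, 0:4, type = "lead", fill = Inf)) < 0.8, by = AnimalID]
# keep only rows before the first post-dusk cluster within each animal
DT[, keep := {
  idx <- which(after_dusk & slow5)
  cutoff <- if (length(idx)) idx[1] else .N + 1L
  seq_len(.N) < cutoff
}, by = AnimalID]
result <- DT[keep == TRUE]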