I have data that looks like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 718184 1/2/2014 1/8/2014
1 885236 1/15/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589963 3/18/2015 3/22/2015
2 589999 2/27/2015 5/9/2015
2 594226 4/11/2015 4/17/2015
2 689959 5/10/2015 6/10/2015
3 656696 5/1/2016 5/5/2016
3 669625 5/6/2016 5/22/2016
4 777777 2/21/2015 3/4/2015
4 778952 2/1/2015 2/28/2015
4 778965 3/1/2015 3/22/2015
I am working on two different problems with this. The first one was answered in a previous post about how to roll dates up (Date roll-up in R) and the second now is that I have intervals that are within intervals and I am trying to get rid of them. So the final outcome should look like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589999 2/27/2015 5/9/2015
3 656696 5/1/2016 5/22/2016
4 778952 2/1/2015 3/22/2015
Now I know I will have to create the extended intervals via the date rollup first, but then how do I get rid of these sub-intervals (a term I am making up for intervals within intervals)? I am also looking for a solution that is efficient since I actually have 75,000 records to go through (i.e. I am trying to avoid iterative solutions).
Using non-equi joins from the current development version of data.table, v1.9.7,
require(data.table) # v1.9.7+
dt[dt, .(CLM_ID = CLM_ID[.N==1L]), on=.(ID, Date1<=Date1, Date2>=Date2), by=.EACHI]
# ID Date1 Date2 CLM_ID
# 1: 1 2014-01-01 2014-01-17 718182
# 2: 1 2014-03-20 2014-03-21 885362
# 3: 2 2015-02-27 2015-05-09 589999
# 4: 2 2015-05-10 2015-06-10 689959
# 5: 3 2016-05-01 2016-05-05 656696
# 6: 3 2016-05-06 2016-05-22 669625
# 7: 4 2015-02-21 2015-03-04 777777
# 8: 4 2015-02-01 2015-02-28 778952
# 9: 4 2015-03-01 2015-03-22 778965
For each row of the dt inside the square brackets, this looks up which rows of the outer dt match, based on the condition supplied to the on argument.
A row's CLM_ID is returned if and only if its only match is a self-match (a self-match always exists because the conditions include equality). This is done by CLM_ID[.N == 1L], where .N holds the number of matching rows for each group.
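To make the join condition concrete, here is a minimal base R sketch of the same containment test on made-up toy data. The toy data frame and the names n_cover and kept are illustrative assumptions, not part of the answer above. For each row it counts how many rows of the same ID cover it; a count of 1 plays the role of .N == 1L.

```r
# Toy data: three intervals for one ID; the second lies inside the first.
toy <- data.frame(
  ID     = c(1L, 1L, 1L),
  CLM_ID = c(11L, 12L, 13L),
  Date1  = as.Date(c("2014-01-01", "2014-01-02", "2014-03-20")),
  Date2  = as.Date(c("2014-01-17", "2014-01-08", "2014-03-21"))
)

# For each row i, count the rows of the same ID whose interval covers row i
# (Date1 <= Date1[i] and Date2 >= Date2[i]). Every row covers itself.
n_cover <- vapply(seq_len(nrow(toy)), function(i) {
  sum(toy$ID == toy$ID[i] &
        toy$Date1 <= toy$Date1[i] &
        toy$Date2 >= toy$Date2[i])
}, integer(1))

# A count of 1 means the only covering interval is the row itself.
kept <- toy[n_cover == 1L, ]
kept$CLM_ID
```

This version is quadratic in the number of rows, so it is only meant to show the logic, not to compete with the join.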
"I am also looking for a solution that is efficient ... (i.e. I am trying to avoid iterative solutions)."
"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in." - Isaac Asimov
Below is a very fast iterative solution in base R. It returns the correct results for very large data frames virtually instantly. (It also "rolls up" the data, so there is no need to run two separate algorithms.)
MakeDFSubInt <- function(df, includeCost = FALSE, printTime = FALSE) {
## Sorting the data frame to allow for fast
## creation of the "Contained" logical vector below
tempDF <- df[order(df$ID, df$Date1, df$Date2), ]
UniIDs <- unique(tempDF$ID)
Len <- length(UniIDs)
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID)==1)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
## Converting dates to integers so that comparison
## will be faster. Internally dates are stored as
## integers, so this isn't a problem
dte1 <- as.integer(tempDF$Date1)
dte2 <- as.integer(tempDF$Date2)
## Building logical vector in order to quickly create sub-intervals
Contained <- rep(FALSE, dfLen)
BegTime <- Sys.time() ## Included to measure time of for loop execution
for (j in 1:Len) {
Compare <- ifelse(dte2[s[j]] >= (dte1[s[j]+1L]+1L), max(dte2[s[j]], dte2[s[j]+1L]), dte2[s[j]+1L])
for (x in (s[j]+1L):e[j]) {
if (!Contained[x-1L]) {
Contained[x] <- dte2[x-1L] >= (dte1[x]-1L)
} else {
Contained[x] <- Compare >= (dte1[x]-1L)
}
## could use ifelse, but this construct is faster
if (Contained[x]) {
Compare <- max(Compare, dte2[x])
} else {
Compare <- dte2[x]
}
}
}
EndTime <- Sys.time()
TotTime <- EndTime - BegTime
if (printTime) {print(paste(c("for loop execution time was: ", format(TotTime)), collapse = ""))}
## identify sub-intervals
nGrps <- which(!Contained)
## Create New fields for our new DF
ID <- tempDF$ID[nGrps]
CLM_ID <- tempDF$CLM_ID[nGrps]
Date1 <- tempDF$Date1[nGrps]
nGrps <- c(nGrps, dfLen+1L)
## as.Date is converting numbers to dates.
## N.B. This only works if origin is supplied
Date2 <- as.Date(vapply(1L:(length(nGrps) - 1L), function(x) {
max(dte2[nGrps[x]:(nGrps[x+1L]-1L)])}, 1L), origin = "1970-01-01")
## in a related question the OP had, "Cost" was
## included to show how the algorithm would handle
## generic summary information
if (includeCost) {
myCost <- tempDF$Cost
Cost <- vapply(1L:(length(nGrps) - 1L), function(x) sum(myCost[nGrps[x]:(nGrps[x+1L]-1L)]), 100.01)
NewDf <- data.frame(ID,CLM_ID,Date1,Date2,Cost)
} else {
NewDf <- data.frame(ID,CLM_ID,Date1,Date2)
}
NewDf
}
For the example given in the question, we have:
ID <- c(rep(1,4),rep(2,4),rep(3,2),rep(4,3))
CLM_ID <- c(718182, 718184, 885236, 885362, 589963, 589999, 594226, 689959, 656696, 669625, 777777, 778952, 778965)
Date1 <- c("1/1/2014","1/2/2014","1/15/2014","3/20/2014","3/18/2015","2/27/2015","4/11/2015","5/10/2015","5/1/2016","5/6/2016","2/21/2015","2/1/2015","3/1/2015")
Date2 <- c("1/17/2014","1/8/2014","1/17/2014","3/21/2014","3/22/2015","5/9/2015","4/17/2015","6/10/2015","5/5/2016","5/22/2016","3/4/2015","2/28/2015","3/22/2015")
myDF <- data.frame(ID, CLM_ID, Date1, Date2)
myDF$Date1 <- as.Date(myDF$Date1, format = "%m/%d/%Y")
myDF$Date2 <- as.Date(myDF$Date2, format = "%m/%d/%Y")
MakeDFSubInt(myDF)
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17
2 1 885362 2014-03-20 2014-03-21
3 2 589999 2015-02-27 2015-06-10
4 3 656696 2016-05-01 2016-05-22
5 4 778952 2015-02-01 2015-03-22
From a similar question the OP posted, we can add a Cost field, to show how we would proceed with calculations for this setup.
set.seed(7777)
myDF$Cost <- round(rnorm(13, 450, sd = 100),2)
MakeDFSubInt(myDF, includeCost = TRUE)
ID CLM_ID Date1 Date2 Cost
1 1 718182 2014-01-01 2014-01-17 1164.66
2 1 885362 2014-03-20 2014-03-21 568.16
3 2 589999 2015-02-27 2015-06-10 2019.16
4 3 656696 2016-05-01 2016-05-22 990.14
5 4 778952 2015-02-01 2015-03-22 1578.68
This algorithm scales very well. For data frames of the size the OP is working with, the result comes back almost instantaneously, and for very large data frames it comes back in just seconds.
First we build a function that will generate a random data frame with n rows.
MakeRandomDF <- function(n) {
set.seed(109)
CLM_Size <- ifelse(n < 10^6, 10^6, 10^(ceiling(log10(n))))
numYears <- trunc((6/425000)*n + 5)
StrtYear <- ifelse(numYears > 16, 2000, 2016 - numYears)
numYears <- ifelse(numYears > 16, 16, numYears)
IDs <- sort(sample(trunc(n/100), n, replace = TRUE))
CLM_IDs <- sample(CLM_Size, n)
StrtDate <- as.Date(paste(c(as.character(StrtYear),"-01-01"), collapse = ""))
myPossibleDates <- StrtDate+(0:(numYears*365)) ## "numYears" years of data
Date1 <- sample(myPossibleDates, n, replace = TRUE)
Date2 <- Date1 + sample(1:100, n, replace = TRUE)
Cost <- round(rnorm(n, 850, 100), 2)
tempDF <- data.frame(IDs,CLM_IDs,Date1,Date2,Cost)
tempDF$Date1 <- as.Date(tempDF$Date1, format = "%m/%d/%Y")
tempDF$Date2 <- as.Date(tempDF$Date2, format = "%m/%d/%Y")
tempDF
}
For moderate size DFs (i.e. 75,000 rows)
TestDF <- MakeRandomDF(75000)
system.time(test1 <- MakeDFSubInt(TestDF, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.06500006 secs"
user system elapsed
0.14 0.00 0.14
nrow(test1)
[1] 7618
head(test1)
ID CLM_ID Date1 Date2 Cost
1 1 116944 2010-01-29 2010-01-30 799.90 ## The range of dates for
2 1 515993 2010-02-15 2011-10-12 20836.83 ## each row are disjoint
3 1 408037 2011-12-13 2013-07-21 28149.26 ## as requested by the OP
4 1 20591 2013-07-25 2014-03-11 10449.51
5 1 338609 2014-04-24 2014-07-31 4219.48
6 1 628983 2014-08-03 2014-09-11 2170.93
For very large DFs (i.e. > 500,000 rows)
TestDF2 <- MakeRandomDF(500000)
system.time(test2 <- MakeDFSubInt(TestDF2, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.3679998 secs"
user system elapsed
1.19 0.03 1.21
nrow(test2)
[1] 154839
head(test2)
ID CLM_ID Date1 Date2 Cost
1 1 71251 2004-04-19 2004-06-29 2715.69 ## The range of dates for
2 1 601676 2004-07-05 2004-09-23 2675.04 ## each row are disjoint
3 1 794409 2004-12-28 2005-04-05 1760.63 ## as requested by the OP
4 1 424671 2005-06-03 2005-08-20 1973.67
5 1 390353 2005-09-16 2005-11-06 785.81
6 1 496611 2005-11-21 2005-11-24 904.09
TestDF3 <- MakeRandomDF(1000000)
system.time(test3 <- MakeDFSubInt(TestDF3, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.6930001 secs"
user system elapsed
2.68 0.08 2.79 ## 1 million rows in under 3 seconds!!!
nrow(test3)
[1] 413668
Explanation
The main part of the algorithm is generating the Contained logical vector that is used to determine the sub-intervals of continuous dates. Generation of this vector relies on the fact that the data frame is sorted, first by ID, second by Date1, and finally by Date2. We begin by locating the starting and ending rows of each group of IDs. For example, with the example provided by the OP we have:
myDF
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17 ## <- 1 s[1]
2 1 718184 2014-01-02 2014-01-08
3 1 885236 2014-01-15 2014-01-17
4 1 885362 2014-03-20 2014-03-21 ## <- 4 e[1]
5 2 589963 2015-03-18 2015-03-22 ## <- 5 s[2]
6 2 589999 2015-02-27 2015-05-09
7 2 594226 2015-04-11 2015-04-17
8 2 689959 2015-05-10 2015-06-10 ## <- 8 e[2]
9 3 656696 2016-05-01 2016-05-05 ## <- 9 s[3]
10 3 669625 2016-05-06 2016-05-22 ## <- 10 e[3]
11 4 777777 2015-02-21 2015-03-04 ## <- 11 s[4]
12 4 778952 2015-02-01 2015-02-28
13 4 778965 2015-03-01 2015-03-22 ## <- 13 e[4]
Below is the code that generates s and e.
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID)==1)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
s
1 5 9 11
e
4 8 10 13
Now, we loop over each group and begin populating the logical vector Contained. If the date range for a particular row overlaps (or is a continuance of) the date range above it, we set that particular index of Contained to TRUE. The first row in each group is always FALSE, since there is nothing above to compare it to. As we go, we keep track of the largest end date seen so far in the current run, hence the Compare variable. Note that it isn't necessarily true that Date2[n] < Date2[n+1L]; this is why we use Compare <- max(Compare, dte2[x]) for a succession of TRUEs. The result for our example is given below.
ID CLM_ID Date1 Date2 Contained
1 1 718182 2014-01-01 2014-01-17 FALSE
2 1 718184 2014-01-02 2014-01-08 TRUE ## These two rows are contained
3 1 885236 2014-01-15 2014-01-17 TRUE ## in the date range 1/1 - 1/17
4 1 885362 2014-03-20 2014-03-21 FALSE ## This row isn't
6 2 589999 2015-02-27 2015-05-09 FALSE
5 2 589963 2015-03-18 2015-03-22 TRUE
7 2 594226 2015-04-11 2015-04-17 TRUE
8 2 689959 2015-05-10 2015-06-10 TRUE ## N.B. 5/10 is a continuance of 5/09
9 3 656696 2016-05-01 2016-05-05 FALSE
10 3 669625 2016-05-06 2016-05-22 TRUE
12 4 778952 2015-02-01 2015-02-28 FALSE
11 4 777777 2015-02-21 2015-03-04 TRUE
13 4 778965 2015-03-01 2015-03-22 TRUE
Now we can easily identify the "starting" rows by identifying all rows with a corresponding FALSE. After this, finding summary information is a breeze by simply calculating whatever you are interested in (e.g. max(Date2), sum(Cost)) over each succession of TRUEs and Voila!!
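The Contained logic can also be sketched without an explicit loop. The following is only an illustration, under the assumption that Date2 >= Date1 in every row; the names toy, run_max, prev_max, grp and rolled_end are made up here, and this is not the function benchmarked above.

```r
# Toy data: the four rows for ID 1 from the question.
toy <- data.frame(
  ID    = c(1, 1, 1, 1),
  Date1 = as.Date(c("2014-01-01", "2014-01-02", "2014-01-15", "2014-03-20")),
  Date2 = as.Date(c("2014-01-17", "2014-01-08", "2014-01-17", "2014-03-21"))
)
toy <- toy[order(toy$ID, toy$Date1, toy$Date2), ]
d1 <- as.integer(toy$Date1)
d2 <- as.integer(toy$Date2)

# Running maximum of Date2 within each ID (the role of Compare).
run_max <- ave(d2, toy$ID, FUN = cummax)

# Shift it down by one row within each ID; the first row of an ID gets NA.
prev_max <- ave(run_max, toy$ID, FUN = function(v) c(NA, head(v, -1)))

# A row is contained if it starts no later than one day after the
# largest end date seen so far for the same ID.
Contained <- !is.na(prev_max) & d1 <= prev_max + 1L

# Each FALSE starts a new group; take the latest end date per group.
grp <- cumsum(!Contained)
rolled_end <- as.Date(as.vector(tapply(d2, grp, max)), origin = "1970-01-01")
```

Because the rows are sorted by Date1, a row that is not contained must end after every earlier row of its ID, so the running maximum never needs an explicit reset the way Compare does in the loop.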
Here is a not-so-pretty solution comparing each row with the dates of all other rows. I corrected the one year 3015 to 2015. The results are different from what you are expecting, though. Either I misunderstood your question, or you misread the data.
Data:
dta <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
CLM_ID = c(718182L, 718184L, 885236L, 885362L, 589963L, 589999L, 594226L, 689959L, 656696L, 669625L, 777777L, 778952L, 778965L),
Date1 = structure(c(1L, 3L, 2L, 9L, 8L, 6L, 10L, 12L, 11L, 13L, 5L, 4L, 7L), .Label = c("1/1/2014", "1/15/2014", "1/2/2014", "2/1/2015", "2/21/2015", "2/27/2015", "3/1/2015", "3/18/2015", "3/20/2014", "4/11/2015", "5/1/2016", "5/10/2015", "5/6/2016"), class = "factor"),
Date2 = structure(c(1L, 2L, 1L, 4L, 5L, 10L, 7L, 11L, 9L, 8L, 6L, 3L, 5L), .Label = c("1/17/2014", "1/8/2014", "2/28/2015", "3/21/2014", "3/22/2015", "3/4/2015", "4/17/2015", "5/22/2016", "5/5/2016", "5/9/2015", "6/10/2015"), class = "factor")),
.Names = c("ID", "CLM_ID", "Date1", "Date2"), class = "data.frame",
row.names = c(NA, -13L))
Code:
dta$Date1 <- as.Date(dta$Date1, format = "%m/%d/%Y")
dta$Date2 <- as.Date(dta$Date2, format = "%m/%d/%Y")
# Boolean vector to memorize results
keep <- logical(length = nrow(dta))
for(i in 1:nrow(dta)) {
match <- dta[dta$Date1 <= dta$Date1[i] & dta$Date2 >= dta$Date2[i], ]
if(nrow(match) == 1) keep[i] <- TRUE
}
# Result
dta[keep, ]
I have this function to generate monthly ranges; it should handle years where February has 28 or 29 days:
starts ends
1 2017-01-01 2017-01-31
2 2017-02-01 2017-02-28
3 2017-03-01 2017-03-31
It works with:
make_date_ranges(as.Date("2017-01-01"), Sys.Date())
But gives error with:
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Why?
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Error in data.frame(starts, ends) :
arguments imply differing number of rows: 38, 36
add_months <- function(date, n){
seq(date, by = paste (n, "months"), length = 2)[2]
}
make_date_ranges <- function(start, end){
starts <- seq(from = start,
to = Sys.Date()-1 ,
by = "1 month")
ends <- c((seq(from = add_months(start, 1),
to = end,
by = "1 month" ))-1,
(Sys.Date()-1))
data.frame(starts,ends)
}
## usage
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
1) First, define start of month, som, and end of month, eom functions which take a Date class object, date string in standard Date format or yearmon object and produce a Date class object giving the start or end of its year/months.
Using those, create a monthly Date series s using the start of each month from the month/year of from to that of to. Use pmax to ensure that the series does not extend before from and pmin so that it does not extend past to.
The input arguments can be strings in standard Date format, Date class objects or yearmon class objects. In the yearmon case it assumes the user wanted the full month for every month. (The if statement can be omitted if you don't need to support yearmon inputs.)
library(zoo)
som <- function(x) as.Date(as.yearmon(x))
eom <- function(x) as.Date(as.yearmon(x), frac = 1)
date_ranges2 <- function(from, to) {
if (inherits(to, "yearmon")) to <- eom(to)
s <- seq(som(from), eom(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges2("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges2(as.yearmon("2000-01"), as.yearmon("2000-06"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
2) This alternative takes the same approach but defines start of month (som) and end of month (eom) functions without using yearmon so that only base R is needed. It takes character strings in standard Date format or Date class inputs and gives the same output as (1).
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
date_ranges3 <- function(from, to) {
s <- seq(som(from), as.Date(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges3("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges3(som("2000-01-10"), eom("2000-06-20"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
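As a quick sanity check on the base R helpers, the sketch below applies eom() to a leap-year February, a non-leap-year February and the last day of December. The check itself is an illustration added here; the helper definitions are repeated from above so that it runs on its own.

```r
som <- function(x) as.Date(cut(as.Date(x), "month"))  # start of month
eom <- function(x) som(som(x) + 32) - 1               # end of month

# The first of a month plus 32 days always lands in the following month,
# so stepping back one day from that month's start gives the month end.
ends <- eom(c("2016-02-10", "2017-02-10", "2017-12-31"))
format(ends)
## [1] "2016-02-29" "2017-02-28" "2017-12-31"
```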
You don't need to use seq twice -- you can subtract 1 day from the firsts of each month to get the ends, and generate one too many starts, then shift & subset:
make_date_ranges = function(start, end) {
# format(end, "%Y-%m-01") essentially truncates end to
# the first day of end's month; 32 days later is guaranteed to be
# in the subsequent month
starts = seq(from = start, to = as.Date(format(end, '%Y-%m-01')) + 32, by = 'month')
data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x = make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
rbind(head(x), tail(x))
# starts ends
# 1 2017-01-01 2017-01-31
# 2 2017-02-01 2017-02-28
# 3 2017-03-01 2017-03-31
# 4 2017-04-01 2017-04-30
# 5 2017-05-01 2017-05-31
# 6 2017-06-01 2017-06-30
# 31 2019-07-01 2019-07-31
# 32 2019-08-01 2019-08-31
# 33 2019-09-01 2019-09-30
# 34 2019-10-01 2019-10-31
# 35 2019-11-01 2019-11-30
# 36 2019-12-01 2019-12-31
I have a data set containing vehicle movements, where vehicle time and position are logged with GPS. The issue is that there are duplicate values (spoofing) for several of the vehicles, and I am not able to identify the true vehicle, beyond the guess that the first time a vehicle appears it is the true one. My intent is to create a for loop that calculates the predicted movement from one position to the next, and if the next position is outside the predicted range the row will be deleted. This will also remove outliers, where one position is extremely far off for some reason.
Dataset
Vehicle ID Time Lat Long Max Speed (kts)
1 01.01.2013 12:00:00 9.535 18.536 20
1 01.01.2013 12:10:00 9.539 18.539 20
1 01.01.2013 12:20:00 65.535 35.545 20
1 01.01.2013 12:30:00 65.835 35.545 20
1 01.01.2013 12:40:00 9.541 18.542 20
1 01.01.2013 12:50:00 66.135 35.536 20
1 01.01.2013 13:00:00 9.543 18.545 20
2 05.01.2013 17:00:00 13.535 15.536 30
The idea is to run a loop that takes the position from row n if the Vehicle ID in row n equals the Vehicle ID in row n + 1, and calculates the maximum distance possibly traveled by taking the time between row n and row n + 1 and multiplying it by the maximum speed. Then calculate a maximum and minimum latitude and longitude for where the vehicle can theoretically be at row n + 1; if the position is outside these limits, the row will be deleted and the loop will run the same test on the next row.
Something like this:
if vehicle ID[n] = vehicle ID[n+1],
then (create latmax and latmin) latitude[n] +- ( time[n+1] - time[n] ) * maximum speed &
then (create lonmax and lonmin) longitude[n] +- ( time[n+1] - time[n] ) * maximum speed
then if lat[n+1] > latmax | lat[n+1] < latmin | lon[n+1] > lonmax | lon[n+1] < lonmin (delete row n+1); if not, do the same at the next line
This should result in deleting rows 3, 4 and 6 in my sample. For row 8 there is a new vehicle, so a large deviation in position is allowed.
This method does not take into account the fact that the Earth is spherical and that the distance covered by one degree of longitude decreases the closer we get to the north or south pole. The best solution would account for this mathematically in the formula, or use distm or similar to calculate the true distance. Including a cosine term in the formula is probably the easiest method. However, the deviation between the outliers and the true position is normally so large that the curvature of the Earth does not matter in practice for this data set.
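A rough sketch of the idea in R, assuming a haversine great-circle distance and a greedy rule that compares each fix with the last accepted fix of the same vehicle. The function names haversine_nm and flag_reachable, the column names, and the Earth radius of 3440.065 nautical miles are assumptions introduced here for illustration.

```r
# Great-circle distance in nautical miles between points given in degrees.
haversine_nm <- function(lat1, lon1, lat2, lon2) {
  r_nm   <- 3440.065            # assumed mean Earth radius in nautical miles
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r_nm * asin(pmin(1, sqrt(a)))
}

# Keep a fix only if it is reachable from the last kept fix of the same
# vehicle at max speed. The first fix of each vehicle is trusted.
flag_reachable <- function(id, time, lat, lon, max_kts) {
  keep <- logical(length(id))
  last <- NA_integer_           # index of the last accepted fix
  for (i in seq_along(id)) {
    if (is.na(last) || id[i] != id[last]) {
      keep[i] <- TRUE
    } else {
      hours <- as.numeric(difftime(time[i], time[last], units = "hours"))
      dist  <- haversine_nm(lat[last], lon[last], lat[i], lon[i])
      keep[i] <- dist <= max_kts[i] * hours
    }
    if (keep[i]) last <- i
  }
  keep
}

# The sample rows from the question.
gps <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 1, 2),
  time = as.POSIXct(c("01.01.2013 12:00:00", "01.01.2013 12:10:00",
                      "01.01.2013 12:20:00", "01.01.2013 12:30:00",
                      "01.01.2013 12:40:00", "01.01.2013 12:50:00",
                      "01.01.2013 13:00:00", "05.01.2013 17:00:00"),
                    format = "%d.%m.%Y %H:%M:%S", tz = "UTC"),
  lat  = c(9.535, 9.539, 65.535, 65.835, 9.541, 66.135, 9.543, 13.535),
  lon  = c(18.536, 18.539, 35.545, 35.545, 18.542, 35.536, 18.545, 15.536),
  kts  = c(20, 20, 20, 20, 20, 20, 20, 30)
)
keep <- flag_reachable(gps$id, gps$time, gps$lat, gps$lon, gps$kts)
cleaned <- gps[keep, ]
```

Comparing against the last accepted fix, not simply the previous row, is what lets row 5 survive after rows 3 and 4 are rejected.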
OK, you have two problems here. First, a research problem: you need to define an appropriate distance, as a function of the coordinates, for comparing the spoofing pairs. Second, and foremost, a coding problem: you need to identify the spoofing pairs and the row of their previous known non-spoofing coordinates. I won't go deep into the research problem, but an internet search on how to calculate distances from coordinates should help. A solution to the coding problem is proposed below, assuming you want to keep the candidate with the minimal distance to the latest known non-spoofing position.
First, you can recreate my example data by running this (the output of dput(df)):
df <- structure(list(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("01.01.2013", "05.01.2013"
), class = "factor"), structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 3L), .Label = c("12:00:00", "12:10:00", "12:20:00", "12:30:00",
"12:40:00", "12:50:00", "13:00:00", "17:00:00"), class = "factor"),
c(9.535, 9.635, 65.535, 65.835, 9.935, 66.135, 10.235, 13.535,
40.535), c(18.536, 18.636, 35.536, 35.536, 18.936, 35.536,
19.236, 15.536, 40.545), c(20L, 20L, 20L, 20L, 20L, 20L,
20L, 30L, 20L)), .Names = c("Vehicle ID", "date", "Time",
"Lat", "Long", "Max Speed (kts)"), class = "data.frame", row.names = c(NA,
-9L))
My method uses a series of apply functions. Explicit loops might do the job in fewer steps, but I tend to avoid them; I would be interested if someone knows a more elegant way of doing this.
library(lubridate) # provides dmy_hms() and as_datetime()
spoofingtestdb <- df[,1:3]
df$spoofing <- duplicated(spoofingtestdb)|duplicated(spoofingtestdb, fromLast = T)
df$datetime <- dmy_hms(paste0(df$date,"-", df$Time))
df$candidatespreviousposition <- apply(df, 1, function(x) which(df$`Vehicle ID`== x["Vehicle ID"] & !df$spoofing & (as_datetime(df$datetime) < as_datetime(x["datetime"])) ) )
df$latestpreviousposition <- NA
for(i in 1: nrow(df)){
if(length(df$candidatespreviousposition[[i]]) > 0) df$latestpreviousposition[[i]] <- df$candidatespreviousposition[[i]][which.max(df$datetime[df$candidatespreviousposition[[i]]])]
}
df$spoofingkey <- paste0(df$`Vehicle ID`, df$datetime)
df$spoofingid <- ifelse(df$spoofing, apply(df, 1, function(x) which(df$spoofingkey==x["spoofingkey"])), NA)
df$lat1 <- apply(df, 1, function(x) df$Lat[x[["spoofingid"]][1]][which(!is.na(df$Lat[x[["spoofingid"]][1]]))] )
df$long1 <- apply(df, 1, function(x) df$Long[x[["spoofingid"]][1]][which(!is.na(df$Long[x[["spoofingid"]][1]]))] )
df$latinit <- apply(df, 1, function(x) df$Lat[x["latestpreviousposition"]])
df$latinit <- ifelse(df$spoofing, df$Lat[df$latestpreviousposition], NA)
df$longinit <- ifelse(df$spoofing, df$Long[df$latestpreviousposition], NA)
getdistance <- function(latinit, longinit, lat, long) {
distance1 <- abs(lat-latinit)+abs(long-longinit)
}
df$distance <- ifelse(df$spoofing, getdistance(df$latinit, df$longinit, df$Lat, df$Long), NA )
df$spoofingnumber <- apply(df, 1, function(x) paste0(x["spoofingid"], collapse=""))
#apply(df, 1, function(x) which(df$spoofingnumber==x["spoofingnumber"]))
df$ismindistance <- apply(df, 1, function(x) x["distance"] == min(df$distance[which(df$spoofingnumber==x["spoofingnumber"])]))
df$tokeep <- ifelse(is.na(df$ismindistance)|df$ismindistance, T, F)
result <- df[df$tokeep,]
result
Here I am just using a basic distance calculation function. The result is below; as you can see, row 3 has been deleted in my example, and only the candidate at minimum distance from the previous known position was kept.
Vehicle ID date Time Lat Long Max Speed (kts) spoofing datetime candidatespreviousposition
1 1 01.01.2013 12:00:00 9.535 18.536 20 FALSE 2013-01-01 12:00:00
2 1 01.01.2013 12:10:00 9.635 18.636 20 FALSE 2013-01-01 12:10:00 1
4 1 01.01.2013 12:30:00 65.835 35.536 20 FALSE 2013-01-01 12:30:00 1, 2
5 1 01.01.2013 12:40:00 9.935 18.936 20 FALSE 2013-01-01 12:40:00 1, 2, 4
6 1 01.01.2013 12:50:00 66.135 35.536 20 FALSE 2013-01-01 12:50:00 1, 2, 4, 5
7 1 01.01.2013 13:00:00 10.235 19.236 20 FALSE 2013-01-01 13:00:00 1, 2, 4, 5, 6
8 2 05.01.2013 17:00:00 13.535 15.536 30 FALSE 2013-01-05 17:00:00
9 1 01.01.2013 12:20:00 40.535 40.545 20 TRUE 2013-01-01 12:20:00 1, 2
latestpreviousposition spoofingkey spoofingid lat1 long1 latinit longinit distance spoofingnumber ismindistance tokeep
1 NA 12013-01-01 12:00:00 NA NA NA NA NA NA TRUE
2 1 12013-01-01 12:10:00 NA NA NA NA NA NA TRUE
4 2 12013-01-01 12:30:00 NA NA NA NA NA NA TRUE
5 4 12013-01-01 12:40:00 NA NA NA NA NA NA TRUE
6 5 12013-01-01 12:50:00 NA NA NA NA NA NA TRUE
7 6 12013-01-01 13:00:00 NA NA NA NA NA NA TRUE
8 NA 22013-01-05 17:00:00 NA NA NA NA NA NA TRUE
9 2 12013-01-01 12:20:00 3, 9 65.535 35.536 9.635 18.636 52.809 c(3, 9) TRUE TRUE
After you choose which distance function is appropriate for you, you can just replace the getdistance() function above.
I have a data frame of users and access times. Access times can be duplicated.
I am trying to create a list of users grouped and named by a given time interval, e.g. year.
timestamp user
1 2013-03-06 01:00:00 1
2 2014-07-06 21:00:00 1
3 2014-07-31 23:00:00 2
4 2014-08-09 17:00:00 2
5 2014-08-14 20:00:00 2
6 2014-08-14 22:00:00 3
7 2014-08-16 15:00:00 3
8 2014-08-19 02:00:00 1
9 2014-12-28 18:00:00 1
10 2015-01-17 17:00:00 1
11 2015-01-22 22:00:00 2
12 2015-01-22 22:00:00 3
13 2015-03-23 15:00:00 4
14 2015-04-05 18:00:00 1
15 2015-04-06 01:00:00 2
My code example already creates a list of users grouped by year.
My problem is that I need to modify the table in this approach, which becomes a problem with my tables of a million entries.
test <- structure(list(timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
"2014-07-31 23:00:00", "2014-08-09 17:00:00", "2014-08-14 20:00:00",
"2014-08-14 22:00:00", "2014-08-16 15:00:00", "2014-08-19 02:00:00",
"2014-12-28 18:00:00", "2015-01-17 17:00:00", "2015-01-22 22:00:00",
"2015-01-22 22:00:00", "2015-03-23 15:00:00", "2015-04-05 18:00:00",
"2015-04-06 01:00:00"), user = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
1L, 1L, 1L, 2L, 3L, 4L, 1L, 2L)), .Names = c("timestamp", "user"
), class = "data.frame", row.names = c(NA, -15L))
require(lubridate)
#Creating "POSIXct" object from string timestamp
timestamp <- lapply(test$timestamp,
function(x)parse_date_time(x, "y-m-d H:M:S"))
test$timestamp <- do.call(c,timestamp)
print(class(test$timestamp))
#Adding column for year
test <- cbind(test,sapply(timestamp, function(x)year(x)))
colnames(test)[3]<- "year"
#Creating list of year time intervals for users
intervals <- names(table(test$year))
users <- lapply(intervals, function(x)test[test$year %in% x,"user"])
names(users) <- intervals
without timestamps
Treat the timestamp as a character string. This only works if the first 4 characters of every timestamp represent the year.
library(dplyr)
test %>%
group_by( user, substr(timestamp,1,4 ) ) %>%
summarise( )
# user `substr(timestamp, 1, 4)`
# <int> <chr>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
dplyr + lubridate
will extract the year from the timestamp
library( dplyr )
library( lubridate )
test %>%
mutate( timestamp = as.POSIXct( timestamp, format = "%Y-%m-%d %H:%M:%S" ) ) %>%
group_by( user, lubridate::year( timestamp ) ) %>%
summarise( )
# # Groups: user [?]
# user `year(timestamp)`
# <int> <dbl>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
table
a frequency table is also quickly made
table( test$user, substr( test$timestamp, 1, 4 ) )
# 2013 2014 2015
# 1 1 3 2
# 2 0 3 2
# 3 0 2 1
# 4 0 0 1
there are many more alternatives... pick one
edit
if speed is an issue, try data.table
library(data.table)
dcast(
  setDT( test )[, timestamp := as.POSIXct( timestamp, format = "%Y-%m-%d %H:%M:%S" )][, .N, by = .( user, year = data.table::year(timestamp) )],
  user ~ year,
  value.var = "N")
# user 2013 2014 2015
# 1: 1 1 3 2
# 2: 2 NA 3 2
# 3: 3 NA 2 1
# 4: 4 NA NA 1
Another option using the lightning fast data.table package:
library(data.table)
setDT(test) # make `test` a data.table 'by reference' (no copy is made at all)
test[, j=.(users=list(unique(user))),
by=.(year=substr(test$timestamp,1,4))]
year users
1: 2013 1
2: 2014 1,2,3
3: 2015 1,2,3,4
Again assuming your test$timestamp column is a character vector - otherwise substitute lubridate::year() as needed.
Update:
Simple change to show grouping instead by month (just as it was mentioned in a comment):
test[, j=.(users=list(unique(user))),
by=.(ym=substr(test$timestamp,1,7))]
ym users
1: 2013-03 1
2: 2014-07 1,2
3: 2014-08 2,3,1
4: 2014-12 1
5: 2015-01 1,2,3
6: 2015-03 4
7: 2015-04 1,2
Or group by day, to help demonstrate how to subset with chaining:
test[, j=.(users=list(unique(user))),
by=.(ymd=substr(test$timestamp,1,10))][ymd>='2014-08-01' & ymd<= '2014-08-21']
ymd users
1: 2014-08-09 2
2: 2014-08-14 2,3
3: 2014-08-16 3
4: 2014-08-19 1
Note for filtering/subsetting, if you are only interested in a subset of dates for a "one off" calculation (and not otherwise saving the whole aggregated set to be stored for other purposes) it will likely be more efficient to do the subset in i of DT[i, j, by] for the "one off" calculation.
You could also use base (stats) function aggregate() as follows:
aggregate( x = test$user,
by = list(year=substr(test$timestamp,1,4)),
FUN = unique )
Result:
year x
1 2013 1
2 2014 1, 2, 3
3 2015 1, 2, 3, 4
The above works on the assumption that your timestamp column is initially just a character vector, exactly as in your example data. In that case you can extract the year directly with substr(test$timestamp,1,4), avoiding the need to convert to dates first.
However, if you have the timestamp column already as a date, simply substitute the lubridate::year() function you demonstrated in your attempted solution.
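If the goal is the named list itself, split() may be enough. A minimal sketch on made-up toy data (toy and users_by_year are illustrative names), again assuming the timestamp is a character vector whose first four characters are the year:

```r
toy <- data.frame(
  timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
                "2014-07-31 23:00:00", "2014-08-19 02:00:00",
                "2015-01-22 22:00:00"),
  user = c(1L, 1L, 2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# split() returns a list named by the grouping value; unique() then
# drops users that appear more than once within a year.
users_by_year <- lapply(split(toy$user, substr(toy$timestamp, 1, 4)), unique)
names(users_by_year)
```

Changing the substr() bounds to 1 and 7 would group by year and month in the same way.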
I'm trying to calculate how long one person stays in a homeless shelter using R. The homeless shelter has two different types of check-ins, one for overnight stays and another for long-term stays. I would like to shape the data to get an EntryDate and ExitDate for every stay that is not interrupted by a break of at least one day.
Here are what the data currently look like:
PersonalID EntryDate ExitDate
1 2016-12-01 2016-12-02
1 2016-12-03 2016-12-04
1 2016-12-16 2016-12-17
1 2016-12-17 2016-12-18
1 2016-12-18 2016-12-19
2 2016-10-01 2016-10-20
2 2016-10-21 2016-10-22
3 2016-09-01 2016-09-02
3 2016-09-20 2016-09-21
Ultimately, I'm trying to get the above data to represent continuous ranges so that I can calculate the total length of stay by participant.
For example, the above data would become:
PersonalID EntryDate ExitDate
1 2016-12-01 2016-12-04
1 2016-12-16 2016-12-19
2 2016-10-01 2016-10-22
3 2016-09-01 2016-09-02
3 2016-09-20 2016-09-21
Here is an ugly solution. It is probably possible to do something cleaner, but it works. This solution should also be debugged with real data (I have added one line to your example to cover more situations).
d <- read.table(text = '
PersonalID EntryDate ExitDate
1 2016-12-01 2016-12-02
1 2016-12-03 2016-12-04
1 2016-12-16 2016-12-17
1 2016-12-17 2016-12-18
1 2016-12-18 2016-12-19
2 2016-10-01 2016-10-20
2 2016-10-21 2016-10-22
3 2016-09-01 2016-09-02
3 2016-09-20 2016-09-21
4 2016-09-20 2016-09-21
', header = TRUE)
#' transform to Date format
d$EntryDate <- as.Date(as.character(d$EntryDate))
d$ExitDate <- as.Date(as.character(d$ExitDate))
summary(d)
#' Reorder to be sure that the ExitDate / Entry date are in chronological order
d <- d[order(d$PersonalID, d$EntryDate),]
#' Add a column that will store the number of days between one exit and the next entry
d$nbdays <- 9999
# Split to have a list with dataframe for each ID
d <- split(d, d$PersonalID)
d
for(i in 1:length(d)) {
# Compute number of days between one exit and the next entry (only if there are
# more than one entry)
if(nrow(d[[i]])>1) {
d[[i]][-1,"nbdays"] <- d[[i]][2:nrow(d[[i]]),"EntryDate"] -
d[[i]][1:(nrow(d[[i]])-1),"ExitDate"]
}
x <- d[[i]] # store a copy of the data to lighten the syntax
# Entry dates that come more than 1 day after the previous exit (including the first one)
entr <- x[x$nbdays>1,"EntryDate"]
# Exit dates just before cases where nbdays are > 1 and includes the last exit date.
# We use unique to avoid picking 2 times the last exit
whichexist <- unique(c(c(which(x$nbdays > 1)-1)[-1],nrow(x)))
exit <- x[whichexist,"ExitDate"]
d[[i]] <- data.frame(
PersonalID = x[1,1],
EntryDate = entr,
ExitDate = exit
)
}
# paste the elements of this list into one data.frame
do.call(rbind, d)
Here is a solution using dplyr.
library(dplyr)
d = structure(list(PersonalID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L), EntryDate = structure(c(17136, 17138, 17151, 17152, 17153,
17075, 17095, 17045, 17064), class = "Date"), ExitDate = structure(c(17137,
17139, 17152, 17153, 17154, 17094, 17096, 17046, 17065), class = "Date")), class = "data.frame", .Names = c("PersonalID",
"EntryDate", "ExitDate"), row.names = c(NA, -9L))
First create a temporary dataframe to hold all the dates between entry and exit date:
d2 = d %>%
rowwise() %>%
do(data.frame(PersonalID = .$PersonalID, Present = seq(.$EntryDate, .$ExitDate, by = 'day'))) %>%
unique %>% ## remove double dates when exit and re-entry occur on the same day
ungroup()
Then look for all the consecutive dates, with some inspiration from https://stackoverflow.com/a/14868742/827766
d2 %>%
group_by(PersonalID) %>%
mutate(delta = c(1, diff(as.Date(Present)))) %>%
group_by(PersonalID, stay = cumsum(delta!=1)) %>%
summarize(EntryDate = min(Present), ExitDate = max(Present)) %>%
subset(select = -c(stay))
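For comparison, the same roll-up can be sketched in base R without expanding every stay into individual days. This is only an illustration: it assumes that stays for one person do not nest inside each other (as in the sample data), and the names prev_exit, new_stay and stays are made up here.

```r
d <- data.frame(
  PersonalID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
  EntryDate  = as.Date(c("2016-12-01", "2016-12-03", "2016-12-16",
                         "2016-12-17", "2016-12-18", "2016-10-01",
                         "2016-10-21", "2016-09-01", "2016-09-20")),
  ExitDate   = as.Date(c("2016-12-02", "2016-12-04", "2016-12-17",
                         "2016-12-18", "2016-12-19", "2016-10-20",
                         "2016-10-22", "2016-09-02", "2016-09-21"))
)
d <- d[order(d$PersonalID, d$EntryDate), ]

# Exit date of the previous row (NA for the very first row).
prev_exit  <- c(as.Date(NA), head(d$ExitDate, -1))
new_person <- c(TRUE, diff(d$PersonalID) != 0)
gap_days   <- as.numeric(d$EntryDate - prev_exit)

# A new stay starts with a new person or after a break of more than one day.
new_stay <- new_person | gap_days > 1
stay     <- cumsum(new_stay)

stays <- data.frame(
  PersonalID = d$PersonalID[new_stay],
  EntryDate  = d$EntryDate[new_stay],
  ExitDate   = as.Date(as.vector(tapply(as.integer(d$ExitDate), stay, max)),
                       origin = "1970-01-01")
)
stays
```

The total length of stay per person then follows from summing ExitDate - EntryDate over each person's rows in stays.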
I have Date (column B) and Total (column A) variables - how can I create a new variable in R that sums the previous seven days' worth of Totals?
In Excel, I have the following formula:
=SUMIFS($A:$A,$B:$B, ">="&$B20-7,$B:$B,"<"&$B20)
and I just don't know how to convert this to work in R. Suggestions?
This will do it too, advanced, but short - essentially a one-liner.
# Initialize some data
date <- seq(as.Date("2001-01-01"),as.Date("2001-01-31"),"days")
tot <- trunc(rnorm(31,100,20))
df <- data.frame(date,tot)
# Now compute week sum by summing a subsetted df for each date
df$wktot <- sapply(df$date,function(x)sum(df[difftime(df$date,x,,"days") %in% 0:-6,]$tot))
Changed the variable names to match the posed problem.
It also handles the data in any order and multiple entries per day.
Edited to add comments and make it fit in a window.
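If the window should match the Excel formula exactly, that is the seven days before each date and excluding the date itself, the condition can be written out directly. A minimal sketch on made-up data (the column name prev7 is an assumption for illustration):

```r
df <- data.frame(
  date = as.Date("2001-01-01") + 0:9,
  tot  = as.numeric(1:10)
)

# For each row, sum the totals whose date d satisfies
# date[i] - 7 <= d < date[i], mirroring the two SUMIFS criteria.
df$prev7 <- vapply(seq_len(nrow(df)), function(i) {
  in_window <- df$date >= df$date[i] - 7 & df$date < df$date[i]
  sum(df$tot[in_window])
}, numeric(1))
df$prev7
## [1]  0  1  3  6 10 15 21 28 35 42
```

Like the one-liner above, this works for data in any order and with several rows per day, at the cost of one pass over the data per row.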
If there is one total per day, this function may help:
rollSums <- function(totals, roll) {
res <- c()
for(i in 1:(length(totals)-roll)) {
res <- c(res, sum(totals[0:(roll-1)+i]))
}
res
}
df1
Total Date
1 3 2015-01-01
2 8 2015-01-01
3 4 2015-01-02
4 7 2015-01-03
5 6 2015-01-04
6 1 2015-01-04
7 10 2015-01-05
8 9 2015-01-06
9 2 2015-01-07
10 5 2015-01-08
rollSums(df1$Total, 3)
[1] 15 19 17 14 17 20 21
rollSums(df1$Total, 4)
[1] 22 25 18 24 26 22
It will take two arguments, the vector with the totals and how many days you'd like in each sum.
Data
dput(df1)
structure(list(Total = c(3L, 8L, 4L, 7L, 6L, 1L, 10L, 9L, 2L,
5L), Date = structure(c(16436, 16436, 16437, 16438, 16439, 16439,
16440, 16441, 16442, 16443), class = "Date")), .Names = c("Total",
"Date"), row.names = c(NA, -10L), class = "data.frame")
Update
In case you run into a situation with multiple values on the same day, here's a solution. Surprisingly, @MikeWise has a one-liner that can do all of this. See the other answer.
grouped.roll <- function(DF, Values, Group, roll) {
totals <- eval(substitute(with(DF, tapply(Values, Group, sum))))
newsums <- rollSums(totals, roll)
data.frame(Group=names(totals), Sums=c(rep(NA, roll), newsums))
}
It uses the rollSums that I used earlier. It will spit out NAs until the desired day grouping begins. That may be the only advantage over the other answer. But they could easily edit that in, I'm sure. Just providing more options for reference.
grouped.roll(df1, Total, Date, 3)
Group Sums
1 2015-01-01 NA
2 2015-01-02 NA
3 2015-01-03 NA
4 2015-01-04 22
5 2015-01-05 18
6 2015-01-06 24
7 2015-01-07 26
8 2015-01-08 21