How to delete consecutive occurrences of an event in an R dataframe?

I have an R dataframe containing date info about generic events:
id;start_date;end_date.
Sometimes the same event may occur the same day (1) or at a distance of one day (2), for example:
(1)
1001;2016-05-07;2016-05-11
1001;2016-05-11;2016-05-14
(2)
1001;2016-05-07;2016-05-11
1001;2016-05-12;2016-05-14
In the first case the event "1001" ends and restarts the same day, while in the second case that event ends on 2016-05-11 and starts again the day after. I'd like to delete the second occurrence of the event in both cases.
If the second occurrence is at a distance of two or more days, it's ok to preserve the second occurrence. How can I do this in R?
Thank you in advance.

Partial solution, with my guess of what the data look like:
library(data.table)
dat <- data.table(id = c(1001, 1001, 1001, 1001),
                  start_date = as.Date(c("2016-05-07", "2016-05-11", "2016-05-07", "2016-05-12")),
                  end_date = as.Date(c("2016-05-11", "2016-05-14", "2016-05-11", "2016-05-14")))
# shift end_date down by one row so each row sits next to the previous row's end_date
dat2 <- data.table(id = c(dat$id, NA),
                   start_date = c(dat$start_date, NA),
                   end_date = c(as.Date(NA), dat$end_date))
dat2[, dif := end_date - start_date]
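Here dif is the previous row's end_date minus the current row's start_date, so it is 0 when an event restarts on the day it ended and -1 when it restarts the day after. Those are the rows to remove, I guess; rows with dif <= -2 restart after a gap of two or more days and should be kept.
I've used the data.table package, but you can just do dat2$dif <- dat2$end_date - dat2$start_date.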
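A fuller version of this idea, as a minimal base R sketch. It assumes the rows should be compared within each id after sorting by start_date, and that each row is compared with the row immediately before it in the data (not with the last row that was kept). The names events, prev_end, gap and kept, and the extra sample rows, are made up for the illustration.

```r
# Hypothetical sample data: two ids, with same-day, next-day and longer gaps
events <- data.frame(
  id = c(1001, 1001, 1001, 1002, 1002),
  start_date = as.Date(c("2016-05-07", "2016-05-11", "2016-05-20",
                         "2016-05-07", "2016-05-12")),
  end_date = as.Date(c("2016-05-11", "2016-05-14", "2016-05-22",
                       "2016-05-11", "2016-05-14"))
)

# Sort so that "previous row" means the previous occurrence of the same id
events <- events[order(events$id, events$start_date), ]

# End date of the previous row within each id (NA for the first row of an id)
prev_end <- ave(as.numeric(events$end_date), events$id,
                FUN = function(x) c(NA, head(x, -1)))

# Days between the previous end and the current start
gap <- as.numeric(events$start_date) - prev_end

# Keep first occurrences and restarts after a gap of two or more days
kept <- events[is.na(gap) | gap >= 2, ]
kept
```

With this sample, the rows starting on 2016-05-11 (gap 0) and 2016-05-12 (gap 1) are dropped and the one starting on 2016-05-20 is kept.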

Related

If statements to fulfill conditions relating to dates

I'm trying to manipulate the data with these dates.
Condition: if the start time and end time overlap with the window from 21:00 on the previous date to 10:00 on the next day, then the correct date is the previous date.
For example, Obs 3: the start time 12/5/2020 9:27 and end time 12/5/2020 12:13 overlap with 12/4/2020 21:00 to 12/5/2020 10:00, so the correct date should be 12/04/2020.
With this logic, the next step is to combine hrs_sleep for rows that end up with the same date, for example Obs 6 and 7:
Condition: for each ID, if there are same dates, hrs_sleep needs to be combined.
original:
library(dplyr)
start_time <- c("2020-12-02 23:01:00","2020-12-04 04:17:00","2020-12-05 09:27:00","2020-12-06 06:38:00","2020-12-06 11:42:00","2020-12-08 01:22:00","2020-12-08 05:33:00", "2020-12-08 16:59:00","2020-12-09 03:15:00")
end_time <- c("2020-12-03 01:03:00","2020-12-04 09:42:00","2020-12-05 12:13:00","2020-12-06 09:16:00","2020-12-06 14:50:00","2020-12-08 04:24:00","2020-12-08 06:37:00","2020-12-08 18:05:00","2020-12-09 06:21:00")
hrs_sleep <- c(2.03,5.42,2.77,2.63,3.13,3.03,1.07,1.10,3.10)
a <- data.frame(start_time, end_time, hrs_sleep) %>%
  mutate(ID = 104)
# parse the character timestamps into new POSIXct columns
a$start <- as.POSIXct(a$start_time)
a$end <- as.POSIXct(a$end_time)
Desired:
start_date <- c("2020-12-02","2020-12-03","2020-12-04","2020-12-05","2020-12-06","2020-12-07","2020-12-08")
hrs_sleep_new <- c(2.03,5.42,2.77,2.63,3.13,4.10,4.20)
b <- data.frame(start_date, hrs_sleep_new) %>%
  mutate(ID = 104)
b$start_date <- as.POSIXct(b$start_date)
I hope I made the logic clear. Essentially, I just need some sort of if statements for the two conditions written in bold. And I have a lot more observations with different IDs so these conditions need to be for each ID.
I appreciate all the help there is! Thanks!!!
Sorry, but I'm not sure I understand the conditions. However, try this code:
library(dplyr)
library(lubridate) # for hour() and date()
a <- a %>% mutate(start_date = as.Date(ifelse(hour(start) < 10, date(start)-1,date(start)), origin = "1970-01-01")) # Create the dates: a start before 10:00 counts for the previous date
a %>% group_by(ID, start_date) %>% summarise(hrs_sleep_new = sum(hrs_sleep)) # Sum the sleep hours
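The same rule can be written without extra packages. This is only a sketch: it assumes a sleep period belongs to the previous date whenever it starts before 10:00, and it parses the times as UTC so the result does not depend on the local time zone. The names sleep, start_hour, sleep_date and totals are made up here.

```r
# Sample data from the question (UTC is an assumption)
sleep <- data.frame(
  ID = 104,
  start = as.POSIXct(c("2020-12-02 23:01:00", "2020-12-04 04:17:00",
                       "2020-12-05 09:27:00", "2020-12-06 06:38:00",
                       "2020-12-06 11:42:00", "2020-12-08 01:22:00",
                       "2020-12-08 05:33:00", "2020-12-08 16:59:00",
                       "2020-12-09 03:15:00"), tz = "UTC"),
  hrs_sleep = c(2.03, 5.42, 2.77, 2.63, 3.13, 3.03, 1.07, 1.10, 3.10)
)

# Hour of day of each start time
start_hour <- as.POSIXlt(sleep$start, tz = "UTC")$hour

# Starts before 10:00 are assigned to the previous calendar date
sleep$sleep_date <- as.Date(sleep$start, tz = "UTC") - ifelse(start_hour < 10, 1, 0)

# Sum the hours for each ID and assigned date
totals <- aggregate(hrs_sleep ~ ID + sleep_date, data = sleep, FUN = sum)
totals
```

On the sample data this gives one row per date from 2020-12-02 to 2020-12-08, matching the desired hrs_sleep_new values in the question.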

How to split given time periods into intervals of 30 days in R

I have data with Order Id, Start Date & End Date. I have to split both the Start and End dates into intervals of 30 days, and derive two new variables “split start date” and “split end date”.
Example: The below example illustrates how split dates are created when the Start Date is “01/05/2017” and the End Date is “06/07/2017”
Suppose an order has start and end dates as below
see the image for example
What is the code for this problem in R ?
Here is a solution which should generalize to multiple order IDs. I have created sample data with two order IDs. The basic idea is to calculate the number of intervals between start_date and end_date. Then we repeat the row for each order ID by the number of intervals, and also create a sequence to determine which interval we are in. This is the purpose of the functions f and g and the use of Map.
The rest is just vector manipulation where we define split_start_date and split_end_date. The last statement ensures that split_end_date does not exceed end_date.
df <- data.frame(
  order_id = c(1, 2),
  start_date = as.Date(c("2017-05-01", "2017-08-01")),
  end_date = as.Date(c("2017-07-06", "2017-09-15"))
)
df$diff_days <- as.integer(df$end_date - df$start_date)
# number of 30-day intervals; the + 1 counts both the start day and the end day
df$num_int <- ceiling((df$diff_days + 1) / 30)
# f repeats a row index once per interval
f <- function(rowindex) {
  rep(rowindex, each = df[rowindex, "num_int"])
}
# g numbers the intervals of a row from 1 to num_int
g <- function(rowindex) {
  seq_len(df[rowindex, "num_int"])
}
rowindex_rep <- unlist(Map(f, seq_len(nrow(df))))
df2 <- df[rowindex_rep, ]
df2$seq <- unlist(Map(g, seq_len(nrow(df))))
df3 <- df2
df3$split_start_date <- df3$start_date + (df3$seq - 1) * 30
df3$split_end_date <- df3$split_start_date + 29
# the last interval of each order must not run past end_date
last_int <- df3$seq == df3$num_int
df3$split_end_date[last_int] <- df3$end_date[last_int]
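For comparison, here is a shorter sketch of the same splitting that lets seq() generate the interval starts. It assumes each interval covers 30 calendar days counting both its first and last day, as in the code above. The helper split_order and the names orders and result are hypothetical.

```r
orders <- data.frame(
  order_id = c(1, 2),
  start_date = as.Date(c("2017-05-01", "2017-08-01")),
  end_date = as.Date(c("2017-07-06", "2017-09-15"))
)

# Split one order (row i of orders) into consecutive 30-day pieces
split_order <- function(i) {
  # interval starts: start_date, start_date + 30, ... up to end_date
  split_start <- seq(orders$start_date[i], orders$end_date[i], by = 30)
  # each piece ends 29 days later, capped at the order's end_date
  split_end <- pmin(split_start + 29, orders$end_date[i])
  data.frame(order_id = orders$order_id[i],
             split_start_date = split_start,
             split_end_date = split_end)
}

# One data frame per order, stacked into a single result
result <- do.call(rbind, lapply(seq_len(nrow(orders)), split_order))
result
```

For the first order this gives 2017-05-01 to 2017-05-30, 2017-05-31 to 2017-06-29, and 2017-06-30 to 2017-07-06.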

R how to avoid a loop. Counting weekends between two dates in a row for each row in a dataframe

I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum((weekdays(seq(Date1[1,5], Date2[1,7], "days")) %in% c("Saturday", "Sunday")) * 1)
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
The map2* functions from the purrr package are a good way to go. They take two vector inputs (e.g. two date columns) and apply a function to each pair of elements in turn. They're pretty fast too.
Here's an example. Note that the _int suffix requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
  sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday", "Sunday"))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#>        Date1      Date2 end_days
#> 1 2015-07-17 2015-07-25        3
#> 2 2015-07-28 2015-08-14        4
#> 3 2015-08-15 2015-08-20        2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with() to assign the columns in the dataframe directly.
Essentially, use a reference date and calculate the number of full weeks since it (by floor or ceiling) for the start and for the end. Then take the difference between the two. Note that the subtraction does not count a Saturday or Sunday that falls on the start date itself.
# weekdays(as.Date(0, origin = "1970-01-01")) -> "Thursday"
library(dplyr)
startDate <- as.Date(0, origin = "1970-01-01") # reference date; this is a Thursday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with() to define the columns directly instead of %>%
# day 0 is a Thursday, so a remainder of 2 is a Saturday and a remainder of 3 is a Sunday
df <- df %>%
  mutate(originDate = startDate) %>%
  mutate(startDayDiff = as.numeric(start - originDate), endDayDiff = as.numeric(end - originDate)) %>%
  mutate(startWeekDiff = floor(startDayDiff / 7), endWeekDiff = floor(endDayDiff / 7)) %>%
  mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 2, 1, 0),
         NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 3, 1, 0),
         NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2, 1, 0),
         NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 3, 1, 0)
  ) %>%
  mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are stored as the number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays and Sundays from that date up to and including d:
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days <- g(Date1, Date2))
#        Date1      Date2 end_days
# 1 2015-07-17 2015-07-25        3
# 2 2015-07-28 2015-08-14        4
# 3 2015-08-15 2015-08-20        2
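As a sanity check, the arithmetic count can be compared with a brute-force count over randomly chosen date ranges. This is a sketch rather than part of the answers above: the function names and the random test ranges are made up, and the brute-force version uses as.POSIXlt()$wday (0 = Sunday, 6 = Saturday) so it does not depend on the locale's weekday names.

```r
# Weekend days from 1970-01-01 (a Thursday) up to and including d
count_weekend_days <- function(d) {
  d <- as.numeric(d)
  r <- d %% 7
  2 * (d %/% 7) + (r >= 2) + (r >= 3)
}

# Weekend days in the inclusive range d1..d2 (vectorized)
weekend_between <- function(d1, d2) {
  count_weekend_days(d2) - count_weekend_days(d1 - 1)
}

# Brute force for a single pair of dates
brute_force <- function(d1, d2) {
  days <- seq(d1, d2, by = "day")
  sum(as.POSIXlt(days)$wday %in% c(0, 6))
}

# Random inclusive ranges of 0 to 60 days (hypothetical test data)
set.seed(1)
starts <- as.Date("2015-01-01") + sample(0:365, 200, replace = TRUE)
ends <- starts + sample(0:60, 200, replace = TRUE)

fast <- weekend_between(starts, ends)
slow <- vapply(seq_along(starts),
               function(i) brute_force(starts[i], ends[i]),
               integer(1))
all(fast == slow)
```

If the two approaches agree, the last line evaluates to TRUE.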

Number of overlapping intervals over time

Let's say I have a set of, partly overlapping, intervals
library(lubridate)
# 2001 and 2002 are not leap years, so 29 February is replaced by 28 February for those end dates
date1 <- as.POSIXct("2000-03-08 01:59:59")
date2 <- as.POSIXct("2001-02-28 12:00:00")
date3 <- as.POSIXct("1999-03-08 01:59:59")
date4 <- as.POSIXct("2002-02-28 12:00:00")
date5 <- as.POSIXct("2000-03-08 01:59:59")
date6 <- as.POSIXct("2004-02-29 12:00:00")
# interval() is the current name for what older lubridate versions called new_interval()
int1 <- interval(date1, date2)
int2 <- interval(date3, date4)
int3 <- interval(date5, date6)
Does anyone have an idea how one could construct a time series plot that provides, for every point in time, the number of overlapping intervals at that point?
So, for instance, to take the above example: For a given date in January 2000, the function I'm looking for would return the value "1" (the date is only within int2) while for a date in January 2001, it would return "3" (since that date is within int1, int2 and int3). Etc.
Any ideas?
Here's one way, using the foverlaps() function from the data.table package.
It needs data.table 1.9.5 or later, because a bug that affects overlap joins on numeric types was fixed there (1.9.5 was the development version when this was written).
require(data.table) ## 1.9.5+
intervals = data.table(start = c(date1, date3, date5),
end = c(date2, date4, date6))
# assuming your query is:
query = as.POSIXct(c("2000-01-01 00:00:00", "2001-01-01 00:00:00"))
We'll construct the query data.table with both start and end intervals as well:
querydt = data.table(start=query, end=query) # identical start,end
Then we can use foverlaps() as follows:
setkeyv(intervals, c("start", "end"))
ans = foverlaps(querydt, intervals, which=TRUE, nomatch=0L, type="within")
#    xid yid
# 1:   1   1
# 2:   2   1
# 3:   2   2
# 4:   2   3
We first set the key, which sorts the data.table intervals by the columns provided, in increasing order, and marks those columns as the key columns on which we want to perform the overlap join.
Then we use foverlaps() to find which rows in querydt fall within (type="within") the rows of intervals. In this case, querydt consists of just points, as the start and end points are identical. This returns all matching indices for those rows in querydt that fall within intervals (nomatch=0L removes all rows with no matches, and which=TRUE returns indices instead of the merged result).
Now all we have to do is to aggregate by xid and count the number of observations to get the count:
ans[, .N, by=xid]
#    xid N
# 1:   1 1
# 2:   2 3
Check ?foverlaps for more info.
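For the time series the question asks about, the same count can be computed on a grid of time points. The following base R sketch makes a few assumptions: the times are parsed as UTC, the two end dates written as 29 February in non-leap years are moved to 28 February, interval endpoints count as inside, and the grid is monthly. The names count_active, grid and n_active are made up for the illustration.

```r
# Interval endpoints from the question (UTC and the 28 February dates are assumptions)
starts <- as.POSIXct(c("2000-03-08 01:59:59", "1999-03-08 01:59:59",
                       "2000-03-08 01:59:59"), tz = "UTC")
ends <- as.POSIXct(c("2001-02-28 12:00:00", "2002-02-28 12:00:00",
                     "2004-02-29 12:00:00"), tz = "UTC")

# Number of intervals that contain a single time point t
count_active <- function(t) sum(starts <= t & t <= ends)

# The two query dates used in the answer above
query <- as.POSIXct(c("2000-01-01", "2001-01-01"), tz = "UTC")
counts <- vapply(seq_along(query), function(i) count_active(query[i]), integer(1))
counts

# Monthly grid spanning all intervals, and the count at each grid point
grid <- seq(min(starts), max(ends), by = "month")
n_active <- vapply(seq_along(grid), function(i) count_active(grid[i]), integer(1))
```

Something like plot(grid, n_active, type = "s") would then draw the counts as a step plot. This loops over the grid, so for many intervals or a fine grid the foverlaps() approach is likely to scale better.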

Using data in one data.frame to generate values for a new column in another data.frame in R

I have two dataframes, one which contains a timestamp and air_temperature
air_temp time_stamp
85.1     1396335600
85.4     1396335860
And another, which contains startTime, endTime, location coordinates, and a canonical name.
startTime  endTime    location.lat location.lon name
1396334278 1396374621 37.77638     -122.4176    Work
1396375256 1396376369 37.78391     -122.4054    Work
For each row in the first data frame, I want to identify which time range in the second data frame it lies in, i.e if the timestamp 1396335600, is between the startTime 1396334278, and endTime 1396374621, add the location and name value to the row in the first data.frame.
The start and end time in the second data frame don't overlap, and are linearly increasing. However they are not perfectly continuous, so if the timestamp falls between two time bands, I need to mark the location as NA. If it does fit between the start and end times, I want to add the location.lat, location.lon, and name columns to the first data frame.
Appreciate your help.
Try this. Not tested.
newdata <- data2[data1$time_stamp>=data2$startTime & data1$time_stamp<=data2$endTime ,3:5]
data1 <- cbind(data1[data1$time_stamp>=data2$startTime & data1$time_stamp<=data2$endTime,],newdata)
This won't return any values if time_stamp isn't between startTime and endTime, so in theory your returned dataset could be shorter than the original. Just in case, I treated data1 with the same TRUE/FALSE vector as data2 so they will be the same length. Note that the comparison is element-wise, so row i of data1 is only checked against row i of data2; it does not search every time range for each timestamp.
Interesting problem... Turned out to be more complicated than I originally thought!!
Step1: Set up the data!
DF1 <- read.table(text="air_temp time_stamp
85.1 1396335600
85.4 1396335860",header=TRUE)
DF2 <- read.table(text="startTime endTime location.lat location.lon name
1396334278 1396374621 37.77638 -122.4176 Work
1396375256 1396376369 37.78391 -122.4054 Work",header=TRUE)
Step2: For each time_stamp in DF1 compute appropriate index in DF2:
index <- sapply(DF1$time_stamp,
  function(i) {
    dec <- which(i >= DF2$startTime & i <= DF2$endTime)
    ifelse(length(dec) == 0, NA, dec)
  }
)
index
Step3: Merge the two data frames:
DF1 <- cbind(DF1,DF2[index,3:5])
row.names(DF1) <- 1:nrow(DF1)
DF1
Hope this helps!!
rowidx <- sapply(dfrm1$time_stamp, function(x) which(dfrm2$startTime <= x & dfrm2$endTime >= x)[1]) # [1] gives NA when nothing matches
cbind(dfrm1, dfrm2[rowidx, c("location.lat","location.lon","name")])
Mine's not tested either and looks substantially similar to CCurtis's, so give him the check if it works.
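Because the time ranges do not overlap and are sorted by startTime, the lookup can also be done without a loop using findInterval(). This is a minimal sketch under those two assumptions. The names readings, visits, idx and result are made up, and the third reading is a hypothetical extra row that falls in the gap between the two ranges.

```r
readings <- data.frame(
  air_temp = c(85.1, 85.4, 86.0),
  time_stamp = c(1396335600, 1396335860, 1396375000) # last one lies in a gap
)
visits <- data.frame(
  startTime = c(1396334278, 1396375256),
  endTime = c(1396374621, 1396376369),
  location.lat = c(37.77638, 37.78391),
  location.lon = c(-122.4176, -122.4054),
  name = c("Work", "Work")
)

# Index of the last range whose startTime is <= time_stamp (0 if there is none)
idx <- findInterval(readings$time_stamp, visits$startTime)
idx[idx == 0] <- NA

# Keep the match only if the timestamp is also before that range's endTime
inside <- !is.na(idx) & readings$time_stamp <= visits$endTime[idx]
idx[!inside] <- NA

# Rows with an NA index get NA for location and name
result <- cbind(readings, visits[idx, c("location.lat", "location.lon", "name")])
rownames(result) <- NULL
result
```

The first two readings pick up the first "Work" range, and the third gets NA in all three added columns.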
