Flag strange observations (rows) within lubridate::interval class object - r

Referring to my previous question here:
Flag rows with interval overlap in r
I have a data frame with some location information (1 = location A, 4 = location B):
df <- data.frame(stringsAsFactors=FALSE,
date = c("2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
"2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
"2018-09-02"),
ID = c("18101276-aa", "18101276-aa", "18102843-aa", "18102843-aa", "18102843-ab",
"18102843-aa", "18104148-aa", "18104148-ab", "18104148-ab"),
location = c(1L, 1L, 1L, 4L, 4L, 1L, 1L, 1L, 4L),
Start = c(111300L, 143400L, 030000L, 034900L, 064400L, 070500L, 060400L,
075100L, 081600L),
End = c(111459L, 143759L, 033059L, 035359L, 064759L, 070559L, 060459L,
81559L, 83559L),
start_hour_minute = c(1113L, 1434L, 0300L, 0349L, 0644L, 0705L, 0604L, 0751L, 0816L),
end_hour_minute = c(1114L, 1437L, 0330L, 0353L, 0647L, 0705L, 0604L, 0815L, 0835L))
Here, some observations (rows 8 and 9) show an individual jumping between two locations within a minute, which is not possible. How can I flag these strange location shifts within my intervals?
I am using lubridate::interval() as recommended to make an interval class object:
library(dplyr); library(lubridate)
data_out <- df %>%
# Get the hour, minute, and second values as standalone numerics.
mutate(
date = ymd(date),
Start_Hour = floor(Start / 10000),
Start_Minute = floor((Start - Start_Hour*10000) / 100),
Start_Second = (Start - Start_Hour*10000) - Start_Minute*100,
End_Hour = floor(End / 10000),
End_Minute = floor((End - End_Hour*10000) / 100),
End_Second = (End - End_Hour*10000) - End_Minute*100,
# Use the hour, minute, second values to create a start-end timestamp.
Start_TS = ymd_hms(date + hours(Start_Hour) + minutes(Start_Minute) + seconds(Start_Second)),
End_TS = ymd_hms(date + hours(End_Hour) + minutes(End_Minute) + seconds(End_Second)),
# Create an interval object.
Watch_Interval = interval(start = Start_TS, end = End_TS))

Here's a similar approach.
First, I pad the two "...minute" variables so that they are unambiguous (e.g. 0349L in the sample data is read in as the integer 349; this step pads it to the text "0349"). Then I combine those with the date to get start and end times using lubridate::ymd_hm. (I presume no intervals span midnight; if one did, you would see a negative duration between its start and end. You could add a step to catch this and move the end_time to the next day.)
Then I sort by ID and start time, and group by ID. This limits the subsequent steps so they only calculate time_elapsed and suspicious within records for a single individual at a time. In this case a record is flagged as suspicious if the location has changed from the prior record, but less than 10 minutes have passed.
library(lubridate); library(dplyr); library(stringr)
df2 <- df %>%
# Pad variables containing "minute" with leading zeros
# (a lambda replaces funs(), which is deprecated in current dplyr)
mutate_at(vars(contains("minute")), ~ str_pad(., width = 4, pad = "0")) %>%
# convert to time stamps
mutate(start_time = ymd_hm(paste(date, start_hour_minute)),
end_time = ymd_hm(paste(date, end_hour_minute))) %>%
# Sort, then look at each individual separately
arrange(ID, start_time) %>%
group_by(ID) %>%
# Did location change while too little time passed?
mutate(time_elapsed = (start_time - lag(end_time)) / dminutes(1),
suspicious = (location != lag(location) & time_elapsed < 10)) %>%
ungroup()
> df2 %>% select(date, ID, location, start_time:suspicious)
# A tibble: 9 x 7
date ID location start_time end_time time_elapsed suspicious
<chr> <chr> <int> <dttm> <dttm> <dbl> <lgl>
1 2018-09-02 181012… 1 2018-09-02 11:13:00 2018-09-02 11:14:00 NA NA
2 2018-09-02 181012… 1 2018-09-02 14:34:00 2018-09-02 14:37:00 200 FALSE
3 2018-09-02 181028… 1 2018-09-02 03:00:00 2018-09-02 03:30:00 NA NA
4 2018-09-02 181028… 4 2018-09-02 03:49:00 2018-09-02 03:53:00 19 FALSE
5 2018-09-02 181028… 1 2018-09-02 07:05:00 2018-09-02 07:05:00 192 FALSE
6 2018-09-02 181028… 4 2018-09-02 06:44:00 2018-09-02 06:47:00 NA NA
7 2018-09-02 181041… 1 2018-09-02 06:04:00 2018-09-02 06:04:00 NA NA
8 2018-09-02 181041… 1 2018-09-02 07:51:00 2018-09-02 08:15:00 NA NA
9 2018-09-02 181041… 4 2018-09-02 08:16:00 2018-09-02 08:35:00 1 TRUE
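The midnight caveat mentioned above could be handled along these lines. This is only a minimal sketch in base R with made-up timestamps (the vectors start_time and end_time here are hypothetical, not the columns computed above), and it assumes that no interval lasts longer than 24 hours.

```r
# Hypothetical records: the second one starts at 23:50 and ends at 00:10,
# but both timestamps were built from the same calendar date
start_time <- as.POSIXct(c("2018-09-02 08:16:00", "2018-09-02 23:50:00"), tz = "UTC")
end_time   <- as.POSIXct(c("2018-09-02 08:35:00", "2018-09-02 00:10:00"), tz = "UTC")

# An end that lies before its start is taken to mean the interval crossed midnight
crosses_midnight <- end_time < start_time

# Move those ends to the next day (24 hours, expressed in seconds)
end_time[crosses_midnight] <- end_time[crosses_midnight] + 24 * 60 * 60

duration_mins <- as.numeric(difftime(end_time, start_time, units = "mins"))
```

After the correction every duration is non-negative, so the later time_elapsed comparison is not distorted by a wrapped end time.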

I don't know if I got it right, but the code below flags a change in location combined with a time difference of 1 minute or less. It will flag row 9 in your example data. If you want to tag both rows 8 and 9, you can make a new column containing the next location (using dplyr::lead(location)) and adjust the condition inside FLAG.
data_out <- df %>%
# Get the hour, minute, and second values as standalone numerics.
mutate(
date = ymd(date),
Start_Hour = floor(Start / 10000),
Start_Minute = floor((Start - Start_Hour*10000) / 100),
Start_Second = (Start - Start_Hour*10000) - Start_Minute*100,
End_Hour = floor(End / 10000),
End_Minute = floor((End - End_Hour*10000) / 100),
End_Second = (End - End_Hour*10000) - End_Minute*100,
# Use the hour, minute, second values to create a start-end timestamp.
Start_TS = ymd_hms(date + hours(Start_Hour) + minutes(Start_Minute) + seconds(Start_Second)),
End_TS = ymd_hms(date + hours(End_Hour) + minutes(End_Minute) + seconds(End_Second)),
Previous_End = lag(End_TS),
Previous_Loc = lag(location),
# Gap to the previous row in minutes; explicit units avoid difftime's automatic unit choice
Timediff = as.numeric(difftime(Start_TS, Previous_End, units = "mins")),
FLAG = ifelse(location != Previous_Loc & Timediff <= 1, 1, 0)
)
EDIT
The snippet below does not flag cases where the ID changes from one row to the next:
data_out <- df %>%
# Get the hour, minute, and second values as standalone numerics.
mutate(
date = ymd(date),
Start_Hour = floor(Start / 10000),
Start_Minute = floor((Start - Start_Hour*10000) / 100),
Start_Second = (Start - Start_Hour*10000) - Start_Minute*100,
End_Hour = floor(End / 10000),
End_Minute = floor((End - End_Hour*10000) / 100),
End_Second = (End - End_Hour*10000) - End_Minute*100,
# Use the hour, minute, second values to create a start-end timestamp.
Start_TS = ymd_hms(date + hours(Start_Hour) + minutes(Start_Minute) + seconds(Start_Second)),
End_TS = ymd_hms(date + hours(End_Hour) + minutes(End_Minute) + seconds(End_Second)),
Previous_ID = lag(ID),
Previous_End = lag(End_TS),
Previous_Loc = lag(location),
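# Gap to the previous row in minutes; explicit units avoid difftime's automatic unit choice
Timediff = as.numeric(difftime(Start_TS, Previous_End, units = "mins")),
# Flag only when the same ID changes location within one minute
FLAG = ifelse(
location != Previous_Loc & ID == Previous_ID & Timediff <= 1, 1, 0)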
)
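The suggestion above about tagging both rows of a suspicious pair (rows 8 and 9 in the example) might look roughly like this. It is a sketch in base R on a hypothetical three-row data frame obs for a single individual, already sorted by start time; the column names and the one-minute threshold are assumptions taken from the discussion, not a tested replacement for the pipeline above.

```r
# Hypothetical, already sorted records for one individual
obs <- data.frame(
  location = c(1L, 1L, 4L),
  start = as.POSIXct(c("2018-09-02 06:04:00", "2018-09-02 07:51:00",
                       "2018-09-02 08:16:00"), tz = "UTC"),
  end   = as.POSIXct(c("2018-09-02 06:04:59", "2018-09-02 08:15:59",
                       "2018-09-02 08:35:59"), tz = "UTC")
)
n <- nrow(obs)

# Gap in minutes between each row's start and the previous row's end
gap_prev <- c(NA, as.numeric(difftime(obs$start[-1], obs$end[-n], units = "mins")))

# Did the location change relative to the previous row?
moved_prev <- c(NA, obs$location[-1] != obs$location[-n])

# Second row of a suspicious pair: new location reached within one minute
flag_arrival <- !is.na(gap_prev) & moved_prev & gap_prev <= 1

# First row of the pair: shift the arrival flag up by one row
flag_departure <- c(flag_arrival[-1], FALSE)

obs$FLAG <- as.integer(flag_arrival | flag_departure)
```

With several individuals, the same steps would have to be applied per ID (for example inside group_by(ID)) so that a pair never straddles two people.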

Related

How can you graph multiple overlapping 18 month periods with daily data?

I am doing an exploratory data analysis for data that is collected at the daily level over many years. The relevant time period is about 18 - 20 months from the same date each year. What I would like to do is visually inspect these 18 month periods one on top of the other. I can do this as below by adding data for each geom_point() call. I would like to avoid calling it once for each period.
Minimal example:
library(tidyverse)
library(lubridate)  # mdy(), month() and day(); not attached by older tidyverse versions
minex <- data.frame(dts = seq((mdy('01/01/2010')), mdy('11/10/2013'), by = 'days'))
minex$day <- as.numeric(minex$dts - min(minex$dts))
minex$MMDD <- paste0(month(minex$dts), "-", day(minex$dts))
minex$v1 <- 20 + minex$day^0.4 -cos(2*pi*minex$day/365) + rnorm(nrow(minex), 0, 0.3)
ggplot(filter(minex, dts %in% seq((mdy('11/10/2013') - (365 + 180)), mdy('11/10/2013'), by =
'days')), aes(day, v1)) +
geom_point() +
geom_point(data = filter(minex, dts %in% seq((mdy('11/10/2012') - (365 + 180)),
mdy('11/10/2012'), by = 'days')), aes(day+365, v1), color = 'red')
Since you have overlapping spans of time, I think we can lapply over your end dates, mutate the data a little, then use normal ggplot2 aesthetics to color them.
spans <- bind_rows(lapply(mdy("11/10/2010", "11/10/2011", "11/10/2012", "11/10/2013"), function(end) {
filter(minex, between(dts, end - (365 + 180), end)) %>%
mutate(day = day - min(day), end = end)
}))
ggplot(spans, aes(day, v1)) +
geom_point(aes(color = factor(end)))
You can see the range of each with a quick summary:
spans %>%
group_by(end) %>%
summarize(startdate = min(dts), enddate = max(dts))
# # A tibble: 4 x 3
# end startdate enddate
# <date> <date> <date>
# 1 2010-11-10 2010-01-01 2010-11-10
# 2 2011-11-10 2010-05-14 2011-11-10
# 3 2012-11-10 2011-05-15 2012-11-10
# 4 2013-11-10 2012-05-14 2013-11-10
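The alignment step can also be checked without plotting. The sketch below uses base R only, with a hypothetical date vector dts and two assumed end dates; it measures day from the nominal start of each span (end minus 545 days) rather than from the first observed day, which is a small deviation from the answer above.

```r
# Hypothetical daily series and period end dates
dts <- seq(as.Date("2010-01-01"), as.Date("2013-11-10"), by = "day")
ends <- as.Date(c("2012-11-10", "2013-11-10"))
span_days <- 365 + 180

# Build one block of rows per end date, then stack the blocks
spans <- do.call(rbind, lapply(seq_along(ends), function(i) {
  end <- ends[i]
  in_span <- dts >= end - span_days & dts <= end
  data.frame(
    end = end,
    dts = dts[in_span],
    # Days elapsed since the nominal start of this span
    day = as.numeric(dts[in_span] - (end - span_days))
  )
}))
```

Every span then shares the same day axis, so a single geom_point() call with the end date mapped to colour is enough.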

Comparing intervals across multiple time series in R

I have two concurrent time series A and B, both containing events defined by start and end times - here is a sample:
A.df <- structure(list(A.eventid = 1:53,
A.start = structure(c(1563219814.52, 1563219852.37, 1563220313.16, 1563220472.66, 1563220704.35, 1563220879.51, 1563221108.24, 1563221158.33, 1563221387.43, 1563221400.7, 1563221602.34, 1563221828.33, 1563222165.52, 1563222314.2, 1563222557.28, 1563222669.44, 1563222905.52, 1563223091.62, 1563223237.19, 1563223273.64, 1563223580.14, 1563223908.66, 1563224093.27, 1563224497.41, 1563224554.64, 1563224705.57, 1563225011.55, 1563225192.59, 1563225305.14, 1563225414.38, 1563225432.21, 1563225898.61, 1563226034.51, 1563226110.18, 1563226206.49, 1563226528.13, 1563226570.18, 1563226788.53, 1563227026.21, 1563227502.2, 1563227709.3, 1563227832.51, 1563228127.44, 1563228188.4, 1563228293.59, 1563228558.39, 1563228680.32, 1563228819.44, 1563229208.51, 1563229282.14, 1563229528.52, 1563229959.21, 1563230268.65), class = c("POSIXct", "POSIXt")),
A.end = structure(c(1563219846.43, 1563220304.39, 1563220470.68, 1563220702.37, 1563220877.5, 1563221102.18, 1563221151.47, 1563221379.63, 1563221389.22, 1563221600.32, 1563221819.27, 1563222157.29, 1563222312.23, 1563222555.25, 1563222667.42, 1563222894.56, 1563223079.44, 1563223230.39, 1563223273.24, 1563223578.14, 1563223900.48, 1563224089.24, 1563224493.45, 1563224550.37, 1563224699.47, 1563225005.13, 1563225188.17, 1563225293.21, 1563225412.17, 1563225417.46, 1563225894.44, 1563226025.2, 1563226108.13, 1563226204.37, 1563226517.59, 1563226562.41, 1563226780.59, 1563227022.28, 1563227493.57, 1563227705.52, 1563227830.38, 1563228125.49, 1563228184.21, 1563228286.39, 1563228546.47, 1563228677.67, 1563228816.5, 1563229198.68, 1563229273.54, 1563229526.53, 1563229952.57, 1563230257.16, 1563230742.25), class = c("POSIXct", "POSIXt"))),
row.names = 1:53, class = "data.frame")
B.df <- structure(list(B.eventid = 1:52,
B.start = structure(c(1563221811.888, 1563222153.835, 1563222156.013, 1563222220.14, 1563222289.692, 1563222305.607, 1563222611.565, 1563222631.139, 1563222636.867, 1563222763.565, 1563222774.301, 1563222848.507, 1563222849.957, 1563222853.513, 1563223225.656, 1563223302.539, 1563223326.153, 1563223328.934, 1563223590.144, 1563223592.904, 1563224035.038, 1563224692.704, 1563226451.642, 1563226454.731, 1563226819.701, 1563226824.685, 1563227278.677, 1563227770.247, 1563227773.907, 1563227800.529, 1563227804.663, 1563227809.749, 1563227813.237, 1563227819.043, 1563227829.781, 1563227973.727, 1563229396.472, 1563229454.515, 1563229473.079, 1563229488.669, 1563229521.413, 1563229542.954, 1563229553.595, 1563229565.988, 1563229569.095, 1563229618.857, 1563229791.585, 1563229936.355, 1563230339.141, 1563230734.677, 1563231667.173, 1563231978.567), class = c("POSIXct", "POSIXt")),
B.end = structure(c(1563221815.058, 1563222154.295, 1563222158.633, 1563222222.07, 1563222289.872, 1563222308.617, 1563222614.265, 1563222633.509, 1563222640.367, 1563222769.045, 1563222774.801, 1563222848.677, 1563222850.237, 1563222856.103, 1563223226.166, 1563223305.339, 1563223328.763, 1563223333.234, 1563223591.454, 1563223593.084, 1563224043.618, 1563224695.234, 1563226454.622, 1563226456.771, 1563226822.551, 1563226827.225, 1563227282.067, 1563227771.787, 1563227774.477, 1563227802.199, 1563227806.653, 1563227811.569, 1563227817.897, 1563227823.643, 1563227830.351, 1563227978.177, 1563229401.282, 1563229457.905, 1563229478.359, 1563229492.439, 1563229527.723, 1563229545.694, 1563229558.975, 1563229568.658, 1563229571.255, 1563229621.117, 1563229792.055, 1563229952.055, 1563230344.351, 1563230739.647, 1563231672.983, 1563231979.987), class = c("POSIXct", "POSIXt"))),
row.names = 1:52, class = "data.frame")
Events in series A are longer, while events in B are shorter.
I've drawn a schematic to help explain:
For each A event during which ≥ 4 B events occur, I'd like to compare (also shown on the schematic):
X = the mean interval between B events occurring during the A event
with
Y = the interval between the last B event occurring during the A event, and the first B event occurring after the A event
My issues are with the calculation of X and Y.
To calculate X, I tried using foverlaps to group B events by the A events in which they occur. But, this excludes B events occurring within gaps between A events.
Also, my attempts to calculate the mean intervals between grouped B events using mutate and lag failed, as I couldn't restrict lag to working only within the groups (i.e. it calculated intervals between groups as well).
Finally, I'm not sure how to efficiently identify the start/end of the Y interval to calculate its duration.
I was thinking my R/coding was improving, but this has me floundering a bit - any help would be very much appreciated!
Assuming your B events are in chronological order, do not overlap each other, and each fall within at most one A event...
Explanations and intermediate output are given as comments in the code below.
I could not verify the output, since you provided no expected output in your question. The results look plausible to me at first glance.
library(data.table)
setDT(A.df); setDT(B.df)
#get time to next B
B.df[, time.to.next.B := shift(B.start, type = "lead") - B.end ][]
#get A-event that the B-events falls into
B.df[ A.df,
A.eventid := i.A.eventid,
on = .(B.start >= A.start, B.end <= A.end )][]
# B.eventid B.start B.end time.to.next.B A.eventid
# 1: 1 2019-07-15 22:16:51 2019-07-15 22:16:55 338.777 secs 11
# 2: 2 2019-07-15 22:22:33 2019-07-15 22:22:34 1.718 secs 12
# 3: 3 2019-07-15 22:22:36 2019-07-15 22:22:38 61.507 secs NA
# 4: 4 2019-07-15 22:23:40 2019-07-15 22:23:42 67.622 secs 13
# 5: 5 2019-07-15 22:24:49 2019-07-15 22:24:49 15.735 secs 13
# 6: 6 2019-07-15 22:25:05 2019-07-15 22:25:08 302.948 secs 13
# ...
# summarise by A.eventid: number of B events and the B.eventid of the last B event
# keep only A.eventids with 4 or more B events
ans <- B.df[ !is.na( A.eventid),
.( B.events = .N,
last.B.eventid = max( B.eventid ),
next.B.eventid = max( B.eventid ) + 1,
mean.B.interval.within.A = mean( time.to.next.B[ B.eventid != max( B.eventid ) ] ) ),
by = .(A.eventid) ][ B.events >= 4, ]
# A.eventid B.events last.B.eventid next.B.eventid mean.B.interval.within.A
# 1: 16 5 14 15 20.879500 secs
# 2: 41 8 35 36 6.097714 secs
# 3: 50 4 40 41 26.239000 secs
# 4: 51 7 48 49 62.953500 secs
# now find the needed intervals using update joins
ans[ B.df, start_time := i.B.end, on = .(last.B.eventid = B.eventid)]
ans[ B.df, end_time := i.B.start, on = .(next.B.eventid = B.eventid)]
# A.eventid B.events last.B.eventid next.B.eventid mean.B.interval.within.A start_time end_time
# 1: 16 5 14 15 20.879500 secs 2019-07-15 22:34:16 2019-07-15 22:40:25
# 2: 41 8 35 36 6.097714 secs 2019-07-15 23:57:10 2019-07-15 23:59:33
# 3: 50 4 40 41 26.239000 secs 2019-07-16 00:24:52 2019-07-16 00:25:21
# 4: 51 7 48 49 62.953500 secs 2019-07-16 00:32:32 2019-07-16 00:38:59
X <- ans$mean.B.interval.within.A
# Time differences in secs
# [1] 20.879500 6.097714 26.239000 62.953500
Y <- ans$end_time - ans$start_time
# Time differences in secs
# [1] 369.553 143.376 28.974 387.086
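To make the definitions of X and Y concrete, here is a toy calculation in base R. The data frames A and B and their numeric times (in seconds) are invented for illustration, and the sketch relies on the same assumptions as the answer above: B events are sorted, do not overlap, and the ones inside the A event are consecutive.

```r
# Hypothetical toy data: one A event and five B events, times in seconds
A <- data.frame(start = 0, end = 100)
B <- data.frame(start = c(10, 30, 50, 80, 130),
                end   = c(12, 33, 55, 81, 135))

# Indices of B events lying fully inside the A event
inside <- which(B$start >= A$start & B$end <= A$end)

# Gap after each B event: next start minus current end
gaps <- B$start[-1] - B$end[-nrow(B)]

last_in <- max(inside)

# X: mean gap between B events inside A (the gap after the last one is excluded)
X <- mean(gaps[inside[inside != last_in]])

# Y: gap from the last B event inside A to the next B event
Y <- gaps[last_in]
```

Here X is the mean of 18, 17 and 25 seconds and Y is the 49 second gap that follows the fourth B event. If the last inside event were also the final B event overall, Y would be NA.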
I tried to come up with a possible solution, minus the average calculation, which should be straightforward. First I renamed the columns, which makes it easier to join the data sets:
library(dplyr); library(stringr)
# Strip the literal "A." / "B." prefix (a lambda replaces the deprecated funs())
A.df = A.df %>%
rename_all(~ str_replace(., "^A\\.", "")) %>%
mutate(type="A")
B.df = B.df %>%
rename_all(~ str_replace(., "^B\\.", "")) %>%
mutate(type="B")
Then the overall data, sorted by time, is:
data = bind_rows(A.df, B.df) %>%
arrange(start)
Now I add a column showing the time stamp of the last start of an A event. Forward filling this value will show for each event the time of the last A event.
data = data %>%
mutate(last.A.start=ifelse(type=='A', start, NA)) %>%
tidyr::fill(last.A.start)
Finally, the A events can be removed. As long as last.A.start is the same, the B events belong to the same A event. Based on this information, x and y can be calculated.
data = data %>%
filter(type == "B") %>%
mutate(
duration=end-start, # Not needed.
delta=start - lag(end),
sameA=(last.A.start == lag(last.A.start)),
x=ifelse(sameA, delta, NA),
y=ifelse(sameA, NA, delta)
)
Does this help?
Bests, M
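The question mentions that lag could not be restricted to work within groups. One way to do that in base R is ave(), sketched below on a hypothetical data frame events with an assumed group column; in dplyr the equivalent would be calling lag() inside mutate() after group_by().

```r
# Hypothetical events: times in seconds, two groups
events <- data.frame(group = c(1, 1, 1, 2, 2),
                     start = c(10, 30, 50, 200, 260),
                     end   = c(12, 33, 55, 210, 262))

# Previous end within each group; the first row of every group gets NA
events$prev_end <- ave(events$end, events$group,
                       FUN = function(x) c(NA, head(x, -1)))

# Gap to the previous event of the same group only
events$gap <- events$start - events$prev_end
```

Because the shift happens per group, the first event of group 2 gets NA rather than a gap measured from the last event of group 1.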

Generate records in an R data frame between two dates

I have a data frame that consists of customers' scheduled subscription payments as follows:
library(lubridate)  # for ymd()
CusID <- c(1,2,3)
FromDate <- c(ymd("2019-01-01"), ymd("2019-01-04"), ymd("2019-02-02"))
ToDate <-c(ymd("2019-01-16"), ymd("2019-01-15"), ymd("2019-04-03"))
Amount <- c(5,10,12)
Frequency <- c("Weekly", "Fortnightly", "Monthly")
Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
For each row (customer), I wish to loop from the FromDate to the ToDate and output one row of data for each scheduled payment that falls between those dates, resulting in the following data frame:
CusID <- c(1,1,1,2,3,3,3)
PaymentDate <- c(ymd("2019-01-01"), ymd("2019-01-08"), ymd("2019-01-15"),
ymd("2019-01-04"),ymd("2019-02-02"),ymd("2019-03-02"),ymd("2019-04-02"))
Amount <- c(5,5,5,10,12,12,12)
Output <- data.frame(CusID, PaymentDate, Amount)
What is an efficient way to achieve this using R (and preferably using dplyr / tidyverse functions)?
In SAS my approach would be to use a DO / WHILE LOOP and OUTPUT statement to write a new line for each scheduled payment. e.g.
data Output;
set Input;
PaymentDate = FromDate;
do while (PaymentDate < ToDate);
Payment = Amount;
PaymentDate = PaymentDate + (7 / 14 / 30 ~ logic based on Frequency);
output;
end;
run;
(The key here in SAS is the output statement - it explicitly writes a new record each time it is invoked, thus can be used in a loop to write multiple output lines per input line).
Is there an equivalent method available in R, or is a different approach recommended?
Another option using tidyverse
Input %>%
mutate(Frequency = case_when(Frequency == "Weekly" ~ 7L,
Frequency == "Fortnightly" ~ 14L,
Frequency == "Monthly" ~ 30L,
TRUE ~ 0L)) %>%
group_by(CusID) %>%
group_modify(~ {PaymentDate <- seq.Date(from = .x$FromDate, to = .x$ToDate, by = .x$Frequency)
crossing(.x[,1], PaymentDate)})
# A tibble: 7 x 3
# Groups: CusID [3]
CusID PaymentDate Amount
<dbl> <date> <dbl>
1 1 2019-01-01 5
2 1 2019-01-08 5
3 1 2019-01-15 5
4 2 2019-01-04 10
5 3 2019-02-02 12
6 3 2019-03-04 12
7 3 2019-04-03 12
The payment dates differ a little from your expected output because the monthly frequency is approximated as a fixed 30-day step, which does not follow the actual number of days in each month.
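The difference between a fixed 30-day step and a calendar-month step can be seen directly with base R's seq() on customer 3's dates. This is just a small illustration; the variable names are made up.

```r
from <- as.Date("2019-02-02")
to   <- as.Date("2019-04-03")

# Fixed step of 30 days: drifts away from the 2nd of the month
by_30_days <- seq(from, to, by = 30)

# Calendar step: stays on the same day of the month
by_month <- seq(from, to, by = "1 month")
```

The first sequence gives 2019-02-02, 2019-03-04 and 2019-04-03, while the second gives the 2nd of February, March and April.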
UPDATE:
Here is a solution that reproduces your expected output exactly:
Input %>%
mutate(PaymentDate = FromDate,
RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
Frequency == "Fortnightly" ~ '2 weeks',
Frequency == "Monthly" ~ '1 month')) %>%
group_by(CusID, Amount) %>%
expand(PaymentDate = seq.Date(FromDate,ToDate, by = RFrequency))
# A tibble: 7 x 3
# Groups: CusID, Amount [3]
CusID Amount PaymentDate
<dbl> <dbl> <date>
1 1 5 2019-01-01
2 1 5 2019-01-08
3 1 5 2019-01-15
4 2 10 2019-01-04
5 3 12 2019-02-02
6 3 12 2019-03-02
7 3 12 2019-04-02
I tweaked your Input data.frame so that the Frequency values are strings, not factors.
You could create a helper table freq_mapping to convert from your Frequency to the frequency format R likes. This would avoid the 30 day issue that one of the other answers pointed out.
freq_mapping <- data.frame(Frequency=c('Weekly', 'Fortnightly', 'Monthly'),
RFrequency = c('1 week', '2 weeks', '1 month'),
stringsAsFactors = FALSE)
Then merge Input with this:
Input <- Input %>%
inner_join(freq_mapping, by = 'Frequency')
Now you can create the payment dates:
Input$PaymentDate <- Input$FromDate
Input %>%
group_by(CusID) %>%
complete(PaymentDate = seq.Date(FromDate,ToDate, by = RFrequency)) %>%
fill(PaymentDate,Amount) %>%
select(CusID, PaymentDate, Amount)
Not such an easy problem for me. The solution is not beautiful, but it should do the job. Note that the monthly payment is a problem, since a month is not always 30 days; otherwise it should work. A nicer solution surely exists.
library(data.table)
Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
Input=data.table(Input)
Input[Frequency=="Weekly",freq:=7][Frequency=="Fortnightly",freq:=14][Frequency=="Monthly",freq:=30]
Input[,Ratio:=(ToDate-FromDate)/freq]
# Maximum number of payment rows needed for any single customer
NREP=as.integer(max(ceiling(Input$Ratio)))
Input[,Rep:=1][,PaymentDate:=FromDate]
for(i in 1:NREP){
Inputtemp=copy(Input)
Inputtemp[,FromDate:=FromDate+freq]
Input=rbind(Input,Inputtemp)
}
#Remove invalid rows
Input=unique(Input)
Input=Input[!(FromDate>ToDate),]
setorder(Input,CusID)
Input=Input[,c("CusID","FromDate","Amount")]
setnames(Input,"FromDate","PaymentDate")
Input==data.table(Output)
A mashup of Humpelstielzchen and user2474226's answers, to bring all logic into a single dplyr step.
Output <- Input %>%
mutate(PaymentDate = FromDate,
RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
Frequency == "Fortnightly" ~ '2 weeks',
Frequency == "Monthly" ~ '1 month')) %>%
group_by(CusID) %>%
complete(PaymentDate = seq.Date(FromDate,ToDate, by = RFrequency)) %>%
fill(PaymentDate,Amount) %>%
select(CusID, PaymentDate, Amount)
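For comparison with the SAS loop-and-output idea in the question, a base R version can build one small data frame per customer and stack them. This is a sketch only: the lookup vector step and the intermediate list rows are names introduced here, and the input is re-created with dates via as.Date() so the block runs on its own.

```r
Input <- data.frame(
  CusID = c(1, 2, 3),
  Amount = c(5, 10, 12),
  Frequency = c("Weekly", "Fortnightly", "Monthly"),
  FromDate = as.Date(c("2019-01-01", "2019-01-04", "2019-02-02")),
  ToDate = as.Date(c("2019-01-16", "2019-01-15", "2019-04-03")),
  stringsAsFactors = FALSE
)

# Map each frequency label to a step that seq() understands for dates
step <- c(Weekly = "1 week", Fortnightly = "2 weeks", Monthly = "1 month")

# One data frame of payment rows per customer (the "output" step of the SAS loop)
rows <- lapply(seq_len(nrow(Input)), function(i) {
  dates <- seq(Input$FromDate[i], Input$ToDate[i], by = step[[Input$Frequency[i]]])
  data.frame(CusID = Input$CusID[i], PaymentDate = dates, Amount = Input$Amount[i])
})

Output <- do.call(rbind, rows)
```

The result has the seven rows listed in the question, with monthly payments falling on the 2nd of each month.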

How to get the time difference with minutes and seconds?

I have two columns of time information in minutes and seconds in a data.frame, without additional date information. I want to calculate the difference between these two columns (end_time - start_time) and store it in a new column, either in minutes and seconds as in the original variables (diff_time1) or in seconds (diff_time2). How can I calculate this in R?
For example:
start_time end_time diff_time1 diff_time2
12'10" 16'23" 4'13" 253
1'05" 76'20" 75'15" 4515
96'10" 120'22" 24'12" 1452
Assuming that your times are stored as strings, in which case the quote denoting seconds must be escaped:
times <- data.frame(start_time = c("12'10\"", "1'05\"", "96'10\""),
end_time = c("16'23\"", "76'20\"", "120'22\"")
)
Then you can use lubridate::ms to convert to minutes + seconds and do the calculations. You'll need to do some additional text conversions if you want the results for diff_time1 as strings:
library(lubridate)
library(dplyr)
times %>%
mutate(diff_time1 = ms(end_time) - ms(start_time)) %>%
mutate(diff_time2 = as.numeric(diff_time1)) %>%
mutate(diff_time1 = gsub("M ", "'", diff_time1)) %>%
mutate(diff_time1 = gsub("S", "\"", diff_time1))
start_time end_time diff_time1 diff_time2
1 12'10" 16'23" 4'13" 253
2 1'05" 76'20" 75'15" 4515
3 96'10" 120'22" 24'12" 1452
You can separate the minutes and seconds and store them as difftime objects, which can be added and subtracted:
library(tidyverse)
df <- structure(list(start_time = c("12'10\"", "1'05\"", "96'10\""),
end_time = c("16'23\"", "76'20\"", "120'22\"")), class = "data.frame", row.names = c(NA,
-3L), .Names = c("start_time", "end_time"))
df %>%
separate(start_time, c('start_min', 'start_sec'), convert = TRUE, extra = 'drop') %>%
separate(end_time, c('end_min', 'end_sec'), convert = TRUE, extra = 'drop') %>%
mutate(start = as.difftime(start_min, units = 'mins') + as.difftime(start_sec, units = 'secs'),
end = as.difftime(end_min, units = 'mins') + as.difftime(end_sec, units = 'secs'),
diff_time = end - start)
#> start_min start_sec end_min end_sec start end diff_time
#> 1 12 10 16 23 730 secs 983 secs 253 secs
#> 2 1 5 76 20 65 secs 4580 secs 4515 secs
#> 3 96 10 120 22 5770 secs 7222 secs 1452 secs
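If adding packages is not an option, the same conversion can be sketched in base R with a regular expression. The helper names to_seconds and to_min_sec are made up for this example, and the pattern assumes every value has exactly the form minutes'seconds".

```r
# Parse strings like 12'10" into total seconds
to_seconds <- function(x) {
  parts <- regmatches(x, regexec("^([0-9]+)'([0-9]+)\"$", x))
  # parts[[i]] holds the full match followed by the two captured groups
  vapply(parts, function(p) as.numeric(p[2]) * 60 + as.numeric(p[3]), numeric(1))
}

# Format a number of seconds back into the minutes'seconds" notation
to_min_sec <- function(secs) {
  sprintf("%d'%02d\"", as.integer(secs %/% 60), as.integer(secs %% 60))
}

start_time <- c("12'10\"", "1'05\"", "96'10\"")
end_time   <- c("16'23\"", "76'20\"", "120'22\"")

diff_time2 <- to_seconds(end_time) - to_seconds(start_time)  # seconds
diff_time1 <- to_min_sec(diff_time2)                         # minutes'seconds"
```

Working in plain seconds first keeps the arithmetic simple; the formatted string is only produced at the end.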

R Increment Dates by Periods across data frame columns

I'm trying to do some tests around measurement periods in time. I'd like to increment the size of the measurement bins (i.e. 1 month vs 2 months, etc.).
I have a data frame with a date seq(), which works fine. My problem is with incrementing the date by a month, week, etc.
df1 <- data.frame(id = 1:20, date1 = seq(as.Date('2012-01-01'),by = 'month', len = 20))
df1$date2 <- df1$date1 + 30
This is obviously wrong if I want the 1st of each month or week. Is there a function or package for this type of issue?
EDIT:
This :
seq( x, by = "month", length.out = 1)
seems to work for individual cells, but won't work for a column as it returns a numeric:
df1$date2 <- sapply(df1$date1, function(x) seq( x, by = "month", length.out = 1))
> head(df1)
id date1 date2
1 1 2012-01-01 15340
2 2 2012-02-01 15371
3 3 2012-03-01 15400
4 4 2012-04-01 15431
5 5 2012-05-01 15461
6 6 2012-06-01 15492
It sounds like you're looking for cut. Note that cut returns a factor, so wrap it in as.Date to get dates back:
df1$date2 <- as.Date(cut(df1$date1 + as.difftime(31, units='days'), breaks='months'))
df1$date3 <- as.Date(cut(df1$date2 + as.difftime(1, units='weeks'), breaks='weeks'))
There might be more elegant solutions but this should work -
df1$date2 <- as.Date(
paste(
ifelse(
strftime(df1$date1,'%m') == 12,
as.integer(strftime(df1$date1,'%Y')) + 1,
as.integer(strftime(df1$date1,'%Y'))
),
ifelse(
strftime(df1$date1,'%m') == 12,
1,
as.integer(strftime(df1$date1,'%m')) + 1
),
1,
sep = "-"
),
"%Y-%m-%d"
)
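The seq() attempt from the question can also be made to work. Two things seem to go wrong there: length.out = 1 returns only the starting date itself, and sapply() drops the Date class. A possible fix, sketched in base R with a hypothetical vector date1, asks for two dates, keeps the second, and combines the results with c() so the class survives.

```r
date1 <- seq(as.Date("2012-01-01"), by = "month", length.out = 4)

# For each date, seq() yields the date itself and the date one month later;
# keep the second element and combine with c() to preserve the Date class
next_month <- do.call(c, lapply(seq_along(date1), function(i) {
  seq(date1[i], by = "month", length.out = 2)[2]
}))

# A one-week increment needs no calendar logic at all
next_week <- date1 + 7
```

Changing by to "2 months" or "1 week" gives other bin sizes with the same pattern.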

Resources