Converting time in integer to MIN:SEC in R

I am using this code to find a difference between two times:
station_data.avg$duration[i] = if_else(
  station_data.avg$swath[i] != 0,
  round(difftime(station_data.avg$end[i], station_data.avg$start[i], units = "mins"), 3),
  0
)
But the output is 3.116667, and I want the output in MIN:SEC format, e.g. 3:18. I tried
station_data.avg$duration[i] = as.character(times(station_data.avg$duration[i] / (24 * 60)))
hoping that would work, but it did not.

You can use the chron package to convert a fractional minute (e.g., x.25, meaning 25% of a minute) into seconds out of 60 (x:15, since 0.25 × 60 = 15). An example is below, but if you edit your question to make it reproducible, I can provide more specific help.
Data
a <- Sys.time()
b <- Sys.time() + 60 * 3 + 15 # add 3 min 15 seconds
Code
difftime(b, a, units = "min")
# Time difference of 3.250006 mins
chron::times(as.numeric(difftime(b, a, units = "days")))
# [1] 00:03:15
Note the change to units = "days" in this context.
You could further parse this out by wrapping this in lubridate::hms:
lubridate::hms(
  chron::times(as.numeric(difftime(b, a, units = "days")))
)
# [1] "3M 15S"
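If all you need is a MIN:SEC string, a base-R alternative (a sketch, not part of the original answer) is to compute the total seconds and format them with sprintf(); the timestamps below are hypothetical stand-ins for a and b:

```r
# Base-R alternative: total seconds, formatted as MIN:SEC
a <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
b <- a + 60 * 3 + 15  # 3 min 15 sec later
secs <- as.numeric(difftime(b, a, units = "secs"))
out <- sprintf("%d:%02d", as.integer(secs %/% 60), as.integer(round(secs %% 60)))
out
# [1] "3:15"
```

This avoids the extra package at the cost of handling the zero-padding yourself.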

Related

Calculate average of a subset of a vector only when subset values meet a condition in R?

I have a daily curve x and I am trying to approximate the average peak and offpeak values of x:
https://ibb.co/Fq1Byzk
I have defined a delta threshold such that when delta is below the threshold value, x will be in the offpeak or peak period. I want to get the average peak value where the average is only of values within x where the delta < threshold. Right now it is averaging out the outliers as well.
delta <- matrix(0, nrow = 24, ncol = 1)
for (i in 2:24) {
  # i-th element is the i-th hour per day
  delta[i] <- x[i, 2] - x[i - 1, 2]
}
# Find hour at which max and min daily values occur
max_threshold = 0.15*max(delta)
min_threshold = 0.15*min(delta)
c <- abs(delta) < max_threshold
t1 <- which(delta>max_threshold)[1]-1 # t1: time index at end of off-peak
t2 <- which.max(delta) + 1 # t2 is time of initial peak
t3 <- which.min(delta)-2 # t3 is time of end peak
t4 <- which.min(delta) # t4 time index of evening off-peak
am <- mean(x[1:t1,2]) # average morning off-peak value
peak <- mean(x[t2:t3,2]) #average peak value
pm <- mean(x[t4:24,2]) # average evening off-peak value
> dput(x)
structure(list(time = structure(c(1451952000, 1451955600, 1451959200,
1451962800, 1451966400, 1451970000, 1451973600, 1451977200, 1451980800,
1451984400, 1451988000, 1451991600, 1451995200, 1451998800, 1452002400,
1452006000, 1452009600, 1452013200, 1452016800, 1452020400, 1452024000,
1452027600, 1452031200, 1452034800, 1452038400, 1452042000, 1452045600,
1452049200, 1452052800, 1452056400, 1452060000, 1452063600, 1452067200,
1452070800, 1452074400, 1452078000, 1452081600, 1452085200, 1452088800,
1452092400, 1452096000, 1452099600, 1452103200, 1452106800, 1452110400,
1452114000, 1452117600, 1452121200), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Crow_education_Omer = c(0.019186330898848,
0.0192706664192825, 0.0182164724138513, 0.018174304653634, 0.019355001939717,
0.0197345117816722, 0.023951287803397, 0.0323848398468467, 0.0343245568168401,
0.0378244809148717, 0.0393003525224754, 0.0403545465279066, 0.0405232175687756,
0.0393425202826927, 0.0398907011655169, 0.0377401453944372, 0.0344932278577091,
0.0317101556833707, 0.0304872906370705, 0.0297282709531601, 0.0287584124681633,
0.0252584883701317, 0.0196080085010205, 0.0197345117816722, 0.0194815052203687,
0.0196080085010205, 0.0184273112149375, 0.0184694789751548, 0.0191441631386307,
0.019692344021455, 0.025469327171218, 0.0352522475416196, 0.0376136421137855,
0.0403967142881239, 0.0435592963044175, 0.0433484575033313, 0.0430532831818105,
0.042968947661376, 0.043306289743114, 0.044655658070066, 0.0424207667785518,
0.0416195793344241, 0.0382883262772615, 0.03769797763422, 0.0330173562501054,
0.0281680638251219, 0.0234452746807901, 0.0225597517162278)), row.names = 97:144, class = "data.frame")
Also, how would I be able to ggplot both the new simplified curve and the original curve x on the same graph? I can't seem to melt() or rbind() the new curve (with its reduced number of data points) with x, since my time column is POSIXct.
Thanks.
This is just a partial solution, since it breaks down for the second day. I named the data.frame df instead of x.
library(ggplot2)
library(dplyr)
library(lubridate)
df_obj <- df %>%
  group_by(day = day(time)) %>% # group by days
  filter(day == 5) %>% # filter for day 5
  mutate(
    delta_rev = Crow_education_Omer - lag(
      Crow_education_Omer,
      default = first(Crow_education_Omer)
    ), # backward delta: point n minus point n-1
    delta_for = lead(
      Crow_education_Omer,
      default = last(Crow_education_Omer)
    ) - Crow_education_Omer, # forward delta: point n+1 minus point n
    max_tresh = 0.15 * max(delta_rev)
  ) %>%
  group_by(
    grp = 1 - (abs(delta_rev) < 0.15 * max(delta_rev) | abs(delta_for) < 0.15 * max(delta_for)),
    grp2 = cumsum(grp != lag(grp, default = 0))
  ) %>%
  mutate(
    average = mean(Crow_education_Omer) *
      (1 - grp) *
      (abs(first(Crow_education_Omer) - last(Crow_education_Omer)) < max_tresh)
  )
First we modify your existing data.frame to build up the averages. Based on this calculation, we use ggplot2 for plotting:
df_obj %>%
  ggplot(aes(x = time, y = Crow_education_Omer)) +
  geom_point() +
  geom_line(aes(color = "sample")) +
  geom_line(data = df_obj[df_obj$average != 0, ], aes(x = time, y = average, color = "average")) +
  xlab("Time") +
  ylab("Value")
This returns a plot with the averages overlaid on the original curve (plot not shown here).
But for day 6 this doesn't work as expected: changing to filter(day == 6) and plotting again does not give the expected result, and changing the threshold value to 0.33 * max(delta) changes the grouping again (plots not shown).
So, perhaps you can build on this code to create a correct, working solution. Good luck!
A few explanations:
We build up delta_rev and delta_for. delta_rev equals your delta: for a given row/data point i we calculate df[i, 2] - df[i - 1, 2].
delta_for flips this: we calculate df[i + 1, 2] - df[i, 2] for a given i. The idea is that using both delta_rev and delta_for lets us look at the preceding and succeeding points. This gives more information about the neighbours of a given point and helps determine whether the point belongs to a group (am, peak, pm).
The group_by call tries to build up the groups based on the threshold: grp checks whether a data point is < 0.15 * max(delta), and grp2 creates a unique grouping number.
There are a few issues:
Based on this algorithm, there can be more than three groups.
The group_by finds another group between 15:00 and 20:00; we filter it out (that's the (abs(first(Crow_education_Omer) - last(Crow_education_Omer)) < max_tresh) part). I'm not sure if this is a good solution.
As stated above, this doesn't return a reasonable plot for day 6. Perhaps the df_obj[df_obj$average != 0, ] subset in geom_line causes this.
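To make the backward/forward deltas concrete, here is a small illustration of how dplyr's lag() and lead() behave, using a hypothetical toy vector rather than the question's data:

```r
library(dplyr)

x <- c(1, 3, 6, 10)
delta_rev <- x - lag(x, default = first(x))   # backward difference: x[i] - x[i-1]
delta_for <- lead(x, default = last(x)) - x   # forward difference:  x[i+1] - x[i]
delta_rev
# [1] 0 2 3 4
delta_for
# [1] 2 3 4 0
```

The default arguments pad the first (respectively last) element so the result has the same length as the input, which is why the edge deltas come out as 0.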

R for data science exercise 5.5.2 question 1

I am solving an exercise in "R for Data Science", under "Useful creation functions" in the data transformation (dplyr) chapter. The question goes as follows, using the nycflights13 dataset:
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they are not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
And I saw this answer online:
transmute(flights,
  dep_time_since_midnight = (dep_time %% 100) + ((dep_time %/% 100) * 60),
  sched_dep_time_since_midnight = (sched_dep_time %% 100) + ((sched_dep_time %/% 100) * 60)
)
The problem is that I don't understand the conversion; this is more of a mathematical problem than a coding one. Please help.
%% is read as "mod", and it gives you the remainder (e.g. 7 %% 3 = 1)
%/% is integer division (e.g. 7 %/% 3 = 2)
Working with dep_time:
hour = dep_time %/% 100
minute = dep_time %% 100
So the above expression can be read as minute + hour * 60.
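A concrete worked example (the departure time 517, i.e. a 5:17 clock time, is a hypothetical value for illustration):

```r
dep_time <- 517              # clock time 5:17, stored as the integer 517
hour <- dep_time %/% 100     # integer division drops the last two digits: 5
minute <- dep_time %% 100    # the remainder keeps the last two digits: 17
mins_since_midnight <- hour * 60 + minute
mins_since_midnight
# [1] 317
```

In other words, the encoding HHMM is just hour * 100 + minute, and %/% 100 and %% 100 invert that encoding.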

Print dates without scientific notation in rpart classification tree

When I create an rpart tree that uses a date cutoff at a node, the print methods I use - both rpart.plot and fancyRpartPlot - print the dates in scientific notation, which makes it hard to interpret the result. Here's the fancyRpartPlot:
Is there a way to print this tree with more interpretable date values? This tree plot is meaningless as all those dates look the same.
Here's my code for creating the tree and plotting two ways:
library(rpart) ; library(rpart.plot) ; library(rattle)
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree)
fancyRpartPlot(my_tree)
Using this data:
# define a random date/time selection function
generate_days <- function(N, st = "2012/01/01", et = "2012/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit = "sec"))
  ev <- runif(N, 0, dt)
  rt <- st + ev
  rt
}
set.seed(1)
dat <- data.frame(
  a = runif(1:100),
  b = rpois(100, 5),
  c = sample(c("hi", "med", "lo"), 100, TRUE),
  d = generate_days(100)
)
From a practical standpoint, perhaps you'd like to just use days from the start of the data:
dat$d <- dat$d-as.POSIXct(as.Date("2012/01/01"))
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree,branch=1,extra=101,type=1,nn=TRUE)
This reduces the number to something manageable and meaningful (though not as meaningful as a specific date, perhaps). You may even want to round it to the nearest day or week. (I can't install GTK+ on my computer, so I can't use fancyRpartPlot.)
One possible way might be to use the digits option in print to examine the tree, and as.POSIXlt to convert the cutoffs back to dates:
> print(my_tree,digits=100)
n= 100
node), split, n, deviance, yval
* denotes terminal node
1) root 100 7.0885590 0.5178471
2) d>=1346478795.049611568450927734375 33 1.7406368 0.4136051
4) b>=4.5 23 1.0294497 0.3654257 *
5) b< 4.5 10 0.5350040 0.5244177 *
3) d< 1346478795.049611568450927734375 67 4.8127122 0.5691901
6) d< 1340921905.3460228443145751953125 55 4.1140164 0.5368048
12) c=hi 28 1.8580913 0.4779574
24) d< 1335890083.3241622447967529296875 18 0.7796261 0.3806526 *
25) d>=1335890083.3241622447967529296875 10 0.6012662 0.6531062 *
13) c=lo,med 27 2.0584052 0.5978317
26) d>=1337494347.697483539581298828125 8 0.4785274 0.3843749 *
27) d< 1337494347.697483539581298828125 19 1.0618892 0.6877082 *
7) d>=1340921905.3460228443145751953125 12 0.3766236 0.7176229 *
## Get date on first node
> as.POSIXlt(1346478795.049611568450927734375,origin="1970-01-01")
[1] "2012-08-31 22:53:15 PDT"
Also check the digits option available in rpart.plot and fancyRpartPlot:
rpart.plot(my_tree,digits=10)
fancyRpartPlot(my_tree, digits=10)
I don't know how important the specific chronological date is in your classification, but an alternative method would be to break down your dates by their characteristics. In other words, create indicator (0/1) bins based on the year (2012, 2013, 2014, ...), the day of the week (Mon, Tue, Wed, Thu, Fri, ...), or even the day of the month (1, 2, 3, ..., 31). This adds many more categories to classify by, but it eliminates the issue of working with a fully formatted date.
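A minimal sketch of that binning idea, using only base R on a toy POSIXct column (the data here is hypothetical and mirrors the example above; the commented rpart call shows where the derived columns would replace d):

```r
set.seed(1)
# toy timestamps spread across 2012
d <- as.POSIXct("2012-01-01", tz = "UTC") + runif(100, 0, 365 * 86400)

wday <- factor(weekdays(d))           # day of week, e.g. "Monday"
mon  <- factor(format(d, "%m"))       # month of year, "01".."12"
mday <- as.integer(format(d, "%d"))   # day of month, 1..31

# these columns can then be used in place of the raw timestamp, e.g.:
#   my_tree <- rpart(a ~ b + c + wday + mon + mday, data = dat)
```

The resulting splits read as labels like "wday = Monday" rather than unreadable numeric cutoffs.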

R adding to a difftime vector forgets about the units

When I extend a vector of difftimes by another difftime object, then it seems that the unit of the added item is ignored and overridden without conversion:
> t = Sys.time()
> d = difftime(c((t+1), (t+61)), t)
> d
Time differences in secs
[1] 1 61
> difftime(t+61, t)
Time difference of 1.016667 mins
> d[3] = difftime(t+61, t)
> d
Time differences in secs
[1] 1.000000 61.000000 1.016667
> as.numeric(d)
[1] 1.000000 61.000000 1.016667
This is in R 3.1.0. Is there a reasonable explanation for this behavior? I just wanted to store some time differences this way for later use and didn't expect this at all. I didn't find it documented anywhere.
Okay, for now I'm just helping myself with always specifying the unit:
> d[3] = difftime(t+61, t, unit="secs")
> d
Time differences in secs
[1] 1 61 61
From help("difftime")
If units = "auto", a suitable set of units is chosen, the largest possible (excluding "weeks") in which all the absolute differences are greater than one.
units = "auto" is the default. So for a difference of 1 and 61 seconds, if you were to choose minutes,
difftime(c((t+1), (t+61)), t, units = "min")
# Time differences in mins
# [1] 0.01666667 1.01666667
One of those is less than one, so by default since you did not specify the units R chose them for you according to the guidelines above. Additionally, the units are saved with the object
d <- difftime(c((t+1), (t+61)), t)
units(d)
# [1] "secs"
But you can change the units with units<-
units(d) <- "mins"
d[3] <- difftime(t+61, t)
d
# Time differences in mins
# [1] 0.01666667 1.01666667 1.01666667
units(d) <- "secs"
d
# Time differences in secs
# [1] 1 61 61

How to do statistics with dates and times

I have a series of times, like the following:
2013-12-27 00:31:15
2013-12-29 17:01:17
2013-12-31 01:52:41
....
My goal is to find which time of day matters most, e.g. most times fall in the period 17:00–19:00.
To do that, I think I should plot every single time as a point along the x-axis, with the x-axis in minutes.
I don't know how to do this exactly with R and ggplot2.
Am I on the right track? I mean, is there a better way to reach my goal?
library(chron)
library(ggplot2)
# create some test data - hrs
set.seed(123)
Lines <- "2013-12-27 00:31:15
2013-12-29 17:01:17
2013-12-31 01:52:41
"
tt0 <- times(read.table(text = Lines)[[2]]) %% 1
rng <- range(tt0)
hrs <- 24 * as.vector(sort(diff(rng) * runif(100)^2 + rng[1]))
# create density, find maximum of it and plot
d <- density(hrs)
max.hrs <- d$x[which.max(d$y)]
ggplot(data.frame(hrs)) +
geom_density(aes(hrs)) +
geom_vline(xintercept = max.hrs)
giving:
> max.hrs # in hours - nearly 2 am
[1] 1.989523
> times(max.hrs / 24) # convert to times
[1] 01:59:22
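If you'd rather avoid chron, the hour of day can also be extracted with base R's format() (a sketch, using only the three timestamps from the question; with more data you would feed hrs into the same density/histogram plot):

```r
Lines <- c("2013-12-27 00:31:15", "2013-12-29 17:01:17", "2013-12-31 01:52:41")
tt <- as.POSIXct(Lines, tz = "UTC")

# decimal hour of day, discarding the date part
hrs <- as.numeric(format(tt, "%H")) +
  as.numeric(format(tt, "%M")) / 60 +
  as.numeric(format(tt, "%S")) / 3600
round(hrs, 2)
# [1]  0.52 17.02  1.88

# with more observations, a histogram of hrs shows the busy hours:
# ggplot(data.frame(hrs), aes(hrs)) + geom_histogram(binwidth = 1)
```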