Count occurrences from table - count

This is my table ... I need to count the instances of the last column per date.
So basically I need:
date Count
2015-02-02 8
2015-02-03 10
2015-02-02 01:30:00 PM 1
2015-02-02 02:30:00 PM 1
2015-02-02 03:30:00 PM 1
2015-02-02 05:30:00 PM 1
2015-02-02 06:30:00 PM 1
2015-02-02 08:30:00 AM 1
2015-02-02 09:30:00 AM 1
2015-02-02 11:30:00 AM 1
2015-02-03 01:30:00 PM 2
2015-02-03 02:30:00 PM 2
2015-02-03 03:30:00 PM 2
2015-02-03 04:30:00 PM 2
2015-02-03 05:30:00 PM 2
2015-02-03 06:30:00 PM 2
2015-02-03 08:30:00 AM 2
2015-02-03 09:30:00 AM 2
2015-02-03 10:30:00 AM 2
2015-02-03 11:30:00 AM 2
2015-02-04 01:30:00 PM 3
2015-02-04 02:30:00 PM 3
2015-02-04 03:30:00 PM 3
2015-02-04 05:30:00 PM 3
2015-02-04 06:30:00 PM 3
2015-02-04 08:30:00 AM 3
2015-02-04 09:30:00 AM 3
2015-02-04 10:30:00 AM 3
2015-02-04 11:30:00 AM 3
2015-02-05 01:30:00 PM 4
2015-02-05 02:30:00 PM 4
2015-02-05 03:30:00 PM 4
2015-02-05 04:30:00 PM 4
2015-02-05 05:30:00 PM 4
2015-02-05 06:30:00 PM 4
2015-02-05 08:30:00 AM 4
2015-02-05 09:30:00 AM 4
2015-02-05 10:30:00 AM 4
2015-02-05 11:30:00 AM 4
2015-02-06 01:30:00 PM 5
2015-02-06 02:30:00 PM 5
2015-02-06 08:30:00 AM 5
2015-02-06 09:30:00 AM 5
2015-02-06 10:30:00 AM 5
2015-02-06 11:30:00 AM 5

select DATE(datecolumn) as thedate, count(lastcol) from sometable group by thedate
Similar question: https://stackoverflow.com/a/366610/636077
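For comparison, if the same table were loaded into R as a data frame, a dplyr sketch of the equivalent per-date count might look like this (sometable and datecolumn are just the placeholder names carried over from the query above, and count() tallies rows per date):
library(dplyr)
sometable %>%
  mutate(thedate = as.Date(datecolumn)) %>%
  count(thedate, name = "Count")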

Related

How do I convert 12-hour to 24-hour time without info on AM or PM?

I received data from a stream-flow data logger, but the time is recorded in 12-hour format without any AM or PM information. I can infer whether a time is AM or PM by looking at the order of the times, but I need to convert them to 24-hour time.
I have other logger data that uses 24-hour time, so I need to make sure they match. I used as.POSIXct() to format all the other data, but I am having issues with this particular set.
I am using R for this analysis.
Here is what the data look like:
Date_Time PT.Level
2008-11-21 11:40:00 0.7502
2008-11-21 11:45:00 0.7502
2008-11-21 11:50:00 0.7480
2008-11-21 11:55:00 0.7458
2008-11-22 12:00:00 0.7458
2008-11-22 12:05:00 0.7436
2008-11-22 12:05:42 NA
2008-11-22 12:10:00 0.7436
2008-11-22 12:15:00 0.7414
# [...] [...]
2008-11-22 11:45:00 0.7304
2008-11-22 11:50:00 0.7304
2008-11-22 11:55:00 0.7304
2008-11-22 12:00:00 0.7282
2008-11-22 12:00:43 NA
2008-11-22 12:05:00 0.7282
2008-11-22 12:10:00 0.7282
2008-11-22 12:15:00 0.7282
Any suggestions?
This uses ave with cumsum. If there is no AM/PM switch within a day, we need case handling via table. For duplicated 12-o'clock hours we set the positions where diff == 0 to FALSE.
I don't know how complete your data are, but this should work if there are no other dupes and 00:00 and 12:00 are always available each day.
## TRUE where the 12-hour clock reads "12", grouped by calendar day
v2 <- ave(as.numeric(substr(v1, 12, 13)) %% 12 == 0, as.Date(v1), FUN = function(x) {
  if (length(table(x)) == 1) 2       # no switch within this day: treat it all as PM
  else {
    x[c(1, diff(x)) == 0] <- FALSE   # ignore repeated 12-o'clock hours
    cumsum(x)                        # 1 from midnight (AM), 2 from noon (PM)
  }
})
v2 <- c("AM", "PM")[v2]
Result
cbind.data.frame(v, v1, v2)
# v v1 v2
# 1 2020-05-22 22:00:00 2020-05-22 10:00:00 PM
# 2 2020-05-22 23:00:00 2020-05-22 11:00:00 PM
# 3 2020-05-23 00:00:00 2020-05-23 12:00:00 AM
# 4 2020-05-23 00:01:00 2020-05-23 12:01:00 AM ## duplicated 12 stays AM
# 5 2020-05-23 00:59:00 2020-05-23 12:59:00 AM ## duplicated 12 stays AM
# 6 2020-05-23 01:00:00 2020-05-23 01:00:00 AM
# 7 2020-05-23 02:00:00 2020-05-23 02:00:00 AM
# 8 2020-05-23 03:00:00 2020-05-23 03:00:00 AM
# 9 2020-05-23 04:00:00 2020-05-23 04:00:00 AM
# 10 2020-05-23 05:00:00 2020-05-23 05:00:00 AM
# 11 2020-05-23 06:00:00 2020-05-23 06:00:00 AM
# 12 2020-05-23 07:00:00 2020-05-23 07:00:00 AM
# 13 2020-05-23 08:00:00 2020-05-23 08:00:00 AM
# 14 2020-05-23 09:00:00 2020-05-23 09:00:00 AM
# 15 2020-05-23 10:00:00 2020-05-23 10:00:00 AM
# 16 2020-05-23 11:00:00 2020-05-23 11:00:00 AM
# 17 2020-05-23 12:00:00 2020-05-23 12:00:00 PM
# 18 2020-05-23 13:00:00 2020-05-23 01:00:00 PM
# 19 2020-05-23 14:00:00 2020-05-23 02:00:00 PM
# 20 2020-05-23 15:00:00 2020-05-23 03:00:00 PM
# 21 2020-05-23 16:00:00 2020-05-23 04:00:00 PM
# 22 2020-05-23 17:00:00 2020-05-23 05:00:00 PM
# 23 2020-05-23 18:00:00 2020-05-23 06:00:00 PM
# 24 2020-05-23 19:00:00 2020-05-23 07:00:00 PM
# 25 2020-05-23 20:00:00 2020-05-23 08:00:00 PM
# 26 2020-05-23 21:00:00 2020-05-23 09:00:00 PM
# 27 2020-05-23 22:00:00 2020-05-23 10:00:00 PM
# 28 2020-05-23 23:00:00 2020-05-23 11:00:00 PM
# 29 2020-05-24 00:00:00 2020-05-24 12:00:00 AM
# 30 2020-05-24 01:00:00 2020-05-24 01:00:00 AM
# 31 2020-05-24 02:00:00 2020-05-24 02:00:00 AM
# 32 2020-05-24 03:00:00 2020-05-24 03:00:00 AM
# 33 2020-05-24 04:00:00 2020-05-24 04:00:00 AM
# 34 2020-05-24 05:00:00 2020-05-24 05:00:00 AM
# 35 2020-05-24 06:00:00 2020-05-24 06:00:00 AM
# 36 2020-05-24 07:00:00 2020-05-24 07:00:00 AM
# 37 2020-05-24 08:00:00 2020-05-24 08:00:00 AM
# 38 2020-05-24 09:00:00 2020-05-24 09:00:00 AM
# 39 2020-05-24 10:00:00 2020-05-24 10:00:00 AM
# 40 2020-05-24 11:00:00 2020-05-24 11:00:00 AM
# 41 2020-05-24 12:00:00 2020-05-24 12:00:00 PM
# 42 2020-05-24 13:00:00 2020-05-24 01:00:00 PM
# 43 2020-05-24 14:00:00 2020-05-24 02:00:00 PM
# 44 2020-05-24 15:00:00 2020-05-24 03:00:00 PM
# 45 2020-05-24 16:00:00 2020-05-24 04:00:00 PM
# 46 2020-05-24 17:00:00 2020-05-24 05:00:00 PM
# 47 2020-05-24 18:00:00 2020-05-24 06:00:00 PM
# 48 2020-05-24 19:00:00 2020-05-24 07:00:00 PM
# 49 2020-05-24 20:00:00 2020-05-24 08:00:00 PM
# 50 2020-05-24 21:00:00 2020-05-24 09:00:00 PM
I think this can easily be scaled up to minutes and seconds; I don't want to spoil your fun. :)
Data:
v <- as.POSIXct(sapply(1:48, function(x) 1590174000 + x*60*60),
                origin = "1970-01-01")
v <- c(v[1:3], v[3]+60, v[3]+60*59, v[4:length(v)]) ## duplicate some 12 o'clocks
v1 <- format(v, "%Y-%m-%d %I:%M:%S")
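Once the AM/PM labels exist, the 24-hour POSIXct times the question asks for can be rebuilt by pasting the label back onto the 12-hour string and parsing with %I ... %p (a small follow-up sketch; %p parsing depends on the locale):
v3 <- as.POSIXct(paste(v1, v2), format = "%Y-%m-%d %I:%M:%S %p")
all(v == v3)  # TRUE when the inferred labels are correct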

Fill Missing Interval Values in R

I have a data set with 4 variables, 2 of which are date variables. I would like to check whether the intervals for rows with TYPE == "OT" or TYPE == "NON-OT" fall within the interval of the preceding row with TYPE == "ICU".
Data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1), TYPE = c("NON-OT", "NON-OT", "OT", "ICU", "OT",
"NON-OT", "OT", "NON-OT", "ICU", "OT", "OT", "ICU", "OT", "OT",
"NON-OT", "OT", "NON-OT"), DATE1 = structure(c(1427214540, 1427216280,
1427279700, 1427370420, 1427543700, 1427564520, 1427800800, 1427849280,
1427850240, 1427927400, 1428155400, 1428166380, 1428514500, 1428927000,
1429167600, 1429264500, 1429388160), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), DATE2 = structure(c(1427216280, 1427370420,
1427279700, 1427564520, 1427543700, 1427849280, 1427800800, 1427850240,
1428166380, 1427927400, 1428155400, 1429388160, 1428514500, 1428927000,
1429167600, 1429264500, 1430362020), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), .Names = c("id", "TYPE", "DATE1", "DATE2"
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-17L))
# id TYPE DATE1 DATE2
# 1 1 NON-OT 2015-03-24 16:29:00 2015-03-24 16:58:00
# 2 1 NON-OT 2015-03-24 16:58:00 2015-03-26 11:47:00
# 3 1 OT 2015-03-25 10:35:00 2015-03-25 10:35:00
# 4 1 ICU 2015-03-26 11:47:00 2015-03-28 17:42:00
# 5 1 OT 2015-03-28 11:55:00 2015-03-28 11:55:00
# 6 1 NON-OT 2015-03-28 17:42:00 2015-04-01 00:48:00
# 7 1 OT 2015-03-31 11:20:00 2015-03-31 11:20:00
# 8 1 NON-OT 2015-04-01 00:48:00 2015-04-01 01:04:00
# 9 1 ICU 2015-04-01 01:04:00 2015-04-04 16:53:00
# 10 1 OT 2015-04-01 22:30:00 2015-04-01 22:30:00
# 11 1 OT 2015-04-04 13:50:00 2015-04-04 13:50:00
# 12 1 ICU 2015-04-04 16:53:00 2015-04-18 20:16:00
# 13 1 OT 2015-04-08 17:35:00 2015-04-08 17:35:00
# 14 1 OT 2015-04-13 12:10:00 2015-04-13 12:10:00
# 15 1 NON-OT 2015-04-16 07:00:00 2015-04-16 07:00:00
# 16 1 OT 2015-04-17 09:55:00 2015-04-17 09:55:00
# 17 1 NON-OT 2015-04-18 20:16:00 2015-04-30 02:47:00
This is what I have done:
1. Obtain a new variable, INT, that gives the interval between DATE1 and DATE2 for every row.
2. Obtain another variable, INT_ICU, that gives the interval for rows with TYPE == "ICU" only, and fill it down. (This is where the problem arises: the fill function in tidyr could not fill in the missing interval values.)
3. Obtain a logical variable, WITHIN_ICU, which is TRUE if the row's interval is within the ICU interval and FALSE otherwise.
Code:
library(tidyverse)
library(lubridate)
df %>%
  mutate(INT = interval(DATE1, DATE2),
         INT_ICU = if_else(TYPE == "ICU", interval(DATE1, DATE2), NA_real_)) %>%
  fill(INT_ICU) %>%
  mutate(WITHIN_ICU = INT %within% INT_ICU)
Output:
As you can see, there are a lot of missing values in the INT_ICU variable even though I have applied the fill function.
# id TYPE DATE1 DATE2 INT INT_ICU WITHIN_ICU
# <dbl> <chr> <dttm> <dttm> <S4: Interval> <S4: Interval> <lgl>
# 1 1 NON-OT 2015-03-24 16:29:00 2015-03-24 16:58:00 2015-03-24 16:29:00 UTC--2015-03-24 16:58:00 UTC NA--NA NA
# 2 1 NON-OT 2015-03-24 16:58:00 2015-03-26 11:47:00 2015-03-24 16:58:00 UTC--2015-03-26 11:47:00 UTC NA--NA NA
# 3 1 OT 2015-03-25 10:35:00 2015-03-25 10:35:00 2015-03-25 10:35:00 UTC--2015-03-25 10:35:00 UTC NA--NA NA
# 4 1 ICU 2015-03-26 11:47:00 2015-03-28 17:42:00 2015-03-26 11:47:00 UTC--2015-03-28 17:42:00 UTC 2015-03-26 11:47:00 UTC--2015-03-28 17:42:00 UTC TRUE
# 5 1 OT 2015-03-28 11:55:00 2015-03-28 11:55:00 2015-03-28 11:55:00 UTC--2015-03-28 11:55:00 UTC NA--NA NA
# 6 1 NON-OT 2015-03-28 17:42:00 2015-04-01 00:48:00 2015-03-28 17:42:00 UTC--2015-04-01 00:48:00 UTC NA--NA NA
# 7 1 OT 2015-03-31 11:20:00 2015-03-31 11:20:00 2015-03-31 11:20:00 UTC--2015-03-31 11:20:00 UTC NA--NA NA
# 8 1 NON-OT 2015-04-01 00:48:00 2015-04-01 01:04:00 2015-04-01 00:48:00 UTC--2015-04-01 01:04:00 UTC NA--NA NA
# 9 1 ICU 2015-04-01 01:04:00 2015-04-04 16:53:00 2015-04-01 01:04:00 UTC--2015-04-04 16:53:00 UTC 2015-04-01 01:04:00 UTC--2015-04-04 16:53:00 UTC TRUE
# 10 1 OT 2015-04-01 22:30:00 2015-04-01 22:30:00 2015-04-01 22:30:00 UTC--2015-04-01 22:30:00 UTC NA--NA NA
# 11 1 OT 2015-04-04 13:50:00 2015-04-04 13:50:00 2015-04-04 13:50:00 UTC--2015-04-04 13:50:00 UTC NA--NA NA
# 12 1 ICU 2015-04-04 16:53:00 2015-04-18 20:16:00 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC TRUE
# 13 1 OT 2015-04-08 17:35:00 2015-04-08 17:35:00 2015-04-08 17:35:00 UTC--2015-04-08 17:35:00 UTC NA--NA NA
# 14 1 OT 2015-04-13 12:10:00 2015-04-13 12:10:00 2015-04-13 12:10:00 UTC--2015-04-13 12:10:00 UTC NA--NA NA
# 15 1 NON-OT 2015-04-16 07:00:00 2015-04-16 07:00:00 2015-04-16 07:00:00 UTC--2015-04-16 07:00:00 UTC NA--NA NA
# 16 1 OT 2015-04-17 09:55:00 2015-04-17 09:55:00 2015-04-17 09:55:00 UTC--2015-04-17 09:55:00 UTC NA--NA NA
# 17 1 NON-OT 2015-04-18 20:16:00 2015-04-30 02:47:00 2015-04-18 20:16:00 UTC--2015-04-30 02:47:00 UTC NA--NA NA
Desired Output:
# id TYPE DATE1 DATE2 WITHIN_ICU
# <dbl> <chr> <dttm> <dttm> <lgl>
# 1 1 NON-OT 2015-03-24 16:29:00 2015-03-24 16:58:00 NA
# 2 1 NON-OT 2015-03-24 16:58:00 2015-03-26 11:47:00 NA
# 3 1 OT 2015-03-25 10:35:00 2015-03-25 10:35:00 NA
# 4 1 ICU 2015-03-26 11:47:00 2015-03-28 17:42:00 TRUE
# 5 1 OT 2015-03-28 11:55:00 2015-03-28 11:55:00 TRUE
# 6 1 NON-OT 2015-03-28 17:42:00 2015-04-01 00:48:00 FALSE
# 7 1 OT 2015-03-31 11:20:00 2015-03-31 11:20:00 FALSE
# 8 1 NON-OT 2015-04-01 00:48:00 2015-04-01 01:04:00 FALSE
# 9 1 ICU 2015-04-01 01:04:00 2015-04-04 16:53:00 TRUE
# 10 1 OT 2015-04-01 22:30:00 2015-04-01 22:30:00 TRUE
# 11 1 OT 2015-04-04 13:50:00 2015-04-04 13:50:00 TRUE
# 12 1 ICU 2015-04-04 16:53:00 2015-04-18 20:16:00 TRUE
# 13 1 OT 2015-04-08 17:35:00 2015-04-08 17:35:00 TRUE
# 14 1 OT 2015-04-13 12:10:00 2015-04-13 12:10:00 TRUE
# 15 1 NON-OT 2015-04-16 07:00:00 2015-04-16 07:00:00 TRUE
# 16 1 OT 2015-04-17 09:55:00 2015-04-17 09:55:00 TRUE
# 17 1 NON-OT 2015-04-18 20:16:00 2015-04-30 02:47:00 FALSE
This should work. tidyr's fill() cannot fill lubridate's S4 Interval column, so use a small helper that carries the last non-missing interval forward (it checks whether the start slot of the S4 Interval object is NA):
# use our own function to fill rather than tidyr's fill()
f2 <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x@start[i])) x[i] <- x[i - 1]  # carry the previous interval forward
  }
  x
}
df %>%
  mutate(INT = interval(DATE1, DATE2),
         INT_ICU = if_else(TYPE == "ICU", interval(DATE1, DATE2), NA_real_)) %>%
  mutate(INT_ICU = f2(INT_ICU)) %>%   # instead of fill()
  mutate(WITHIN_ICU = INT %within% INT_ICU)
The output:
# A tibble: 17 x 6
id TYPE DATE1 DATE2 INT_ICU WITHIN_ICU
<dbl> <chr> <dttm> <dttm> <S4: Interval> <lgl>
1 1. NON-OT 2015-03-24 16:29:00 2015-03-24 16:58:00 NA--NA NA
2 1. NON-OT 2015-03-24 16:58:00 2015-03-26 11:47:00 NA--NA NA
3 1. OT 2015-03-25 10:35:00 2015-03-25 10:35:00 NA--NA NA
4 1. ICU 2015-03-26 11:47:00 2015-03-28 17:42:00 2015-03-26 11:47:00 UTC--2015-03-28 17:42:00 UTC TRUE
5 1. OT 2015-03-28 11:55:00 2015-03-28 11:55:00 2015-03-26 11:47:00 UTC--2015-03-28 17:42:00 UTC TRUE
6 1. NON-OT 2015-03-28 17:42:00 2015-04-01 00:48:00 2015-03-26 11:47:00 UTC--2015-03-28 17:42:00 UTC FALSE
7 1. OT 2015-03-31 11:20:00 2015-03-31 11:20:00 2015-03-26 11:47:00 UTC--2015-03-28 17:42:00 UTC FALSE
8 1. NON-OT 2015-04-01 00:48:00 2015-04-01 01:04:00 2015-03-26 11:47:00 UTC--2015-03-28 17:42:00 UTC FALSE
9 1. ICU 2015-04-01 01:04:00 2015-04-04 16:53:00 2015-04-01 01:04:00 UTC--2015-04-04 16:53:00 UTC TRUE
10 1. OT 2015-04-01 22:30:00 2015-04-01 22:30:00 2015-04-01 01:04:00 UTC--2015-04-04 16:53:00 UTC TRUE
11 1. OT 2015-04-04 13:50:00 2015-04-04 13:50:00 2015-04-01 01:04:00 UTC--2015-04-04 16:53:00 UTC TRUE
12 1. ICU 2015-04-04 16:53:00 2015-04-18 20:16:00 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC TRUE
13 1. OT 2015-04-08 17:35:00 2015-04-08 17:35:00 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC TRUE
14 1. OT 2015-04-13 12:10:00 2015-04-13 12:10:00 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC TRUE
15 1. NON-OT 2015-04-16 07:00:00 2015-04-16 07:00:00 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC TRUE
16 1. OT 2015-04-17 09:55:00 2015-04-17 09:55:00 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC TRUE
17 1. NON-OT 2015-04-18 20:16:00 2015-04-30 02:47:00 2015-04-04 16:53:00 UTC--2015-04-18 20:16:00 UTC FALSE
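An alternative sketch that sidesteps the S4 column entirely: tidyr::fill() handles plain POSIXct columns fine, so the ICU start and end times can be carried forward and the interval rebuilt afterwards (ICU_START and ICU_END are names made up for this sketch):
df %>%
  mutate(ICU_START = replace(DATE1, TYPE != "ICU", NA),   # keep dates only on ICU rows
         ICU_END   = replace(DATE2, TYPE != "ICU", NA)) %>%
  fill(ICU_START, ICU_END) %>%                            # carry the last ICU stay forward
  mutate(WITHIN_ICU = interval(DATE1, DATE2) %within% interval(ICU_START, ICU_END))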

How to add a time (not minutes/seconds etc., just a time, e.g. 17:30:00) for one particular day to a list of POSIXct dates in R?

I have data consisting of date and time in POSIXct format, as shown below:
> DayandTime
date_time
1 2015-10-01 11:13:25
2 2015-10-01 12:38:09
3 2015-10-01 17:12:00
4 2015-10-02 11:44:05
5 2015-10-05 13:45:07
6 2015-10-05 14:53:31
7 2015-10-06 11:54:54
8 2015-10-06 16:29:22
9 2015-10-06 18:30:46
10 2015-10-07 08:41:35
11 2015-10-07 09:02:12
12 2015-10-07 10:50:25
13 2015-10-07 11:29:25
14 2015-10-07 11:49:22
15 2015-10-07 13:35:27
16 2015-10-07 15:01:17
17 2015-10-08 11:02:12
18 2015-10-08 12:03:50
19 2015-10-08 14:24:16
20 2015-10-08 14:28:31
For every day, I need to add 17:30 as the close time (end_time) for that particular day. E.g., for 01 October 2015, the end time is 2015-10-01 17:30:00.
The sample output is shown below:
date_time end_time
1 2015-10-01 11:13:25 2015-10-01 17:30:00
2 2015-10-01 12:38:09 2015-10-01 17:30:00
3 2015-10-01 17:12:00 2015-10-01 17:30:00
4 2015-10-02 11:44:05 2015-10-02 17:30:00
5 2015-10-05 13:45:07 2015-10-05 17:30:00
6 2015-10-05 14:53:31 2015-10-05 17:30:00
7 2015-10-06 11:54:54 2015-10-06 17:30:00
8 2015-10-06 16:29:22 2015-10-06 17:30:00
9 2015-10-06 18:30:46 2015-10-06 17:30:00
10 2015-10-07 08:41:35 2015-10-07 17:30:00
11 2015-10-07 09:02:12 2015-10-07 17:30:00
12 2015-10-07 10:50:25 2015-10-07 17:30:00
13 2015-10-07 11:29:25 2015-10-07 17:30:00
14 2015-10-07 11:49:22 2015-10-07 17:30:00
15 2015-10-07 13:35:27 2015-10-07 17:30:00
16 2015-10-07 15:01:17 2015-10-07 17:30:00
17 2015-10-08 11:02:12 2015-10-08 17:30:00
18 2015-10-08 12:03:50 2015-10-08 17:30:00
19 2015-10-08 14:24:16 2015-10-08 17:30:00
20 2015-10-08 14:28:31 2015-10-08 17:30:00
Adding hours, minutes, or seconds to each element sounds simple, but the results are not what I expected.
structure(list(date_time = structure(c(1443678205, 1443683289,
1443699720, 1443766445, 1444032907, 1444037011, 1444112694, 1444129162,
1444136446, 1444187495, 1444188732, 1444195225, 1444197565, 1444198762,
1444205127, 1444210277, 1444282332, 1444286030, 1444294456, 1444294711
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = "date_time", row.names = c(NA,
-20L), class = "data.frame")
How about this: convert the datetime to a Date and then paste on the required time:
DayandTime$end_time <- as.POSIXct(paste(as.Date(as.POSIXct(DayandTime$date_time)), '17:30:00'))
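A hedged alternative that avoids the character round-trip (and the fact that as.Date() on a POSIXct uses UTC by default, which can shift the calendar day for non-UTC data): truncate each timestamp to midnight in its own time zone and add 17.5 hours' worth of seconds.
# truncate to local midnight, then add 17 h 30 min (= 17.5 * 3600 seconds)
DayandTime$end_time <- as.POSIXct(trunc(DayandTime$date_time, units = "days")) + 17.5 * 3600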

cut by interval and aggregate over one month in R

I have the data below - all bike trips that started from a particular station over the month of October 2013. I'd like to count the number of trips that occurred within each ten-minute time interval. There should be a total of 144 rows, each with the sum of all of the trips that occurred within that interval over the entire month. How would one cut the data.frame and then aggregate by interval (so that trips occurring between 00:00:01 and 00:10:00 are counted in the second row, between 00:10:01 and 00:20:00 in the third row, and so on...)?
head(one.station)
tripduration starttime stoptime start.station.id start.station.name
59 803 2013-10-01 00:11:49 2013-10-01 00:25:12 521 8 Ave & W 31 St
208 445 2013-10-01 00:40:05 2013-10-01 00:47:30 521 8 Ave & W 31 St
359 643 2013-10-01 01:25:57 2013-10-01 01:36:40 521 8 Ave & W 31 St
635 388 2013-10-01 05:30:30 2013-10-01 05:36:58 521 8 Ave & W 31 St
661 314 2013-10-01 05:38:00 2013-10-01 05:43:14 521 8 Ave & W 31 St
768 477 2013-10-01 05:54:49 2013-10-01 06:02:46 521 8 Ave & W 31 St
start.station.latitude start.station.longitude end.station.id end.station.name
59 40.75045 -73.99481 2003 1 Ave & E 18 St
208 40.75045 -73.99481 505 6 Ave & W 33 St
359 40.75045 -73.99481 508 W 46 St & 11 Ave
635 40.75045 -73.99481 459 W 20 St & 11 Ave
661 40.75045 -73.99481 462 W 22 St & 10 Ave
768 40.75045 -73.99481 457 Broadway & W 58 St
end.station.latitude end.station.longitude bikeid usertype birth.year gender
59 40.73416 -73.98024 15139 Subscriber 1985 1
208 40.74901 -73.98848 20538 Subscriber 1990 2
359 40.76341 -73.99667 19935 Customer \\N 0
635 40.74674 -74.00776 14781 Subscriber 1955 1
661 40.74692 -74.00452 17976 Subscriber 1982 1
768 40.76695 -73.98169 19022 Subscriber 1973 1
The output should look like this:
output
interval total_trips
1 00:00:00 0
2 00:10:00 1
3 00:20:00 2
4 00:30:00 3
5 00:40:00 4
Here it is using only start time:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(6) * 1000)
start_times <- as.POSIXlt(
  c("2013-10-01 00:11:49",
    "2013-10-01 00:40:05",
    "2013-10-01 01:25:57",
    "2013-10-01 05:30:30",
    "2013-10-01 05:38:00",
    "2013-10-01 05:54:49")
)
# floor each start time to its 10-minute bucket
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
  group_by(time_bucket) %>%
  summarize(trip_count = n())
summarized <- as.data.frame(summarized)
# full sequence of buckets, so empty ones can be filled with zero
out_buckets <- data.frame(out_buckets = seq(as.POSIXlt("2013-10-01 00:00:00"),
                                            as.POSIXct("2013-10-01 06:00:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
out
out_buckets trip_count
1 2013-10-01 00:00:00 0
2 2013-10-01 00:10:00 1
3 2013-10-01 00:20:00 0
4 2013-10-01 00:30:00 0
5 2013-10-01 00:40:00 1
6 2013-10-01 00:50:00 0
7 2013-10-01 01:00:00 0
8 2013-10-01 01:10:00 0
9 2013-10-01 01:20:00 1
10 2013-10-01 01:30:00 0
11 2013-10-01 01:40:00 0
12 2013-10-01 01:50:00 0
13 2013-10-01 02:00:00 0
14 2013-10-01 02:10:00 0
15 2013-10-01 02:20:00 0
16 2013-10-01 02:30:00 0
17 2013-10-01 02:40:00 0
18 2013-10-01 02:50:00 0
19 2013-10-01 03:00:00 0
20 2013-10-01 03:10:00 0
21 2013-10-01 03:20:00 0
22 2013-10-01 03:30:00 0
23 2013-10-01 03:40:00 0
24 2013-10-01 03:50:00 0
25 2013-10-01 04:00:00 0
26 2013-10-01 04:10:00 0
27 2013-10-01 04:20:00 0
28 2013-10-01 04:30:00 0
29 2013-10-01 04:40:00 0
30 2013-10-01 04:50:00 0
31 2013-10-01 05:00:00 0
32 2013-10-01 05:10:00 0
33 2013-10-01 05:20:00 0
34 2013-10-01 05:30:00 2
35 2013-10-01 05:40:00 0
36 2013-10-01 05:50:00 1
37 2013-10-01 06:00:00 0
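As a side note on the bucketing step above: the modulo arithmetic that builds time_bucket can be written as a single call with lubridate's floor_date, assuming a lubridate version recent enough to accept multi-unit periods:
# round each start time down to the start of its 10-minute bucket
time_bucket <- floor_date(start_times, "10 minutes")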
The lubridate library can provide another solution; it has a nice function for interval-overlap logic. The code below uses lapply to loop through the intervals in the data and then buckets them accordingly.
library(lubridate)
start_times <- as.POSIXlt(
  c("2013-10-01 00:11:49",
    "2013-10-01 00:40:05",
    "2013-10-01 01:25:57",
    "2013-10-01 05:30:30",
    "2013-10-01 05:38:00",
    "2013-10-01 05:54:49")
)
stop_times <- as.POSIXlt(
  c("2013-10-01 00:25:12",
    "2013-10-01 00:47:30",
    "2013-10-01 01:36:40",
    "2013-10-01 05:36:58",
    "2013-10-01 05:43:14",
    "2013-10-01 06:02:46")
)
start_bucket <- seq(as.POSIXct("2013-10-01 00:00:00"), as.POSIXct("2013-10-01 06:00:00"), by = 600)
end_bucket <- start_bucket + 600
bucket_interval <- interval(start_bucket, end_bucket)
data_interval <- interval(start_times, stop_times)
# flag, for each trip, which buckets it overlaps, then sum across trips
int_list <- lapply(data_interval, function(x) ifelse(int_overlaps(x, bucket_interval), 1, 0))
rides_per_bucket <- rowSums(do.call(cbind, int_list))
out_df <- data.frame(bucket_interval, rides_per_bucket)
out_df
bucket_interval rides_per_bucket
1 2013-10-01 00:00:00 PDT--2013-10-01 00:10:00 PDT 0
2 2013-10-01 00:10:00 PDT--2013-10-01 00:20:00 PDT 1
3 2013-10-01 00:20:00 PDT--2013-10-01 00:30:00 PDT 1
4 2013-10-01 00:30:00 PDT--2013-10-01 00:40:00 PDT 0
5 2013-10-01 00:40:00 PDT--2013-10-01 00:50:00 PDT 1
6 2013-10-01 00:50:00 PDT--2013-10-01 01:00:00 PDT 0
7 2013-10-01 01:00:00 PDT--2013-10-01 01:10:00 PDT 0
8 2013-10-01 01:10:00 PDT--2013-10-01 01:20:00 PDT 0
9 2013-10-01 01:20:00 PDT--2013-10-01 01:30:00 PDT 1
10 2013-10-01 01:30:00 PDT--2013-10-01 01:40:00 PDT 1
11 2013-10-01 01:40:00 PDT--2013-10-01 01:50:00 PDT 0
12 2013-10-01 01:50:00 PDT--2013-10-01 02:00:00 PDT 0
13 2013-10-01 02:00:00 PDT--2013-10-01 02:10:00 PDT 0
14 2013-10-01 02:10:00 PDT--2013-10-01 02:20:00 PDT 0
15 2013-10-01 02:20:00 PDT--2013-10-01 02:30:00 PDT 0
16 2013-10-01 02:30:00 PDT--2013-10-01 02:40:00 PDT 0
17 2013-10-01 02:40:00 PDT--2013-10-01 02:50:00 PDT 0
18 2013-10-01 02:50:00 PDT--2013-10-01 03:00:00 PDT 0
19 2013-10-01 03:00:00 PDT--2013-10-01 03:10:00 PDT 0
20 2013-10-01 03:10:00 PDT--2013-10-01 03:20:00 PDT 0
21 2013-10-01 03:20:00 PDT--2013-10-01 03:30:00 PDT 0
22 2013-10-01 03:30:00 PDT--2013-10-01 03:40:00 PDT 0
23 2013-10-01 03:40:00 PDT--2013-10-01 03:50:00 PDT 0
24 2013-10-01 03:50:00 PDT--2013-10-01 04:00:00 PDT 0
25 2013-10-01 04:00:00 PDT--2013-10-01 04:10:00 PDT 0
26 2013-10-01 04:10:00 PDT--2013-10-01 04:20:00 PDT 0
27 2013-10-01 04:20:00 PDT--2013-10-01 04:30:00 PDT 0
28 2013-10-01 04:30:00 PDT--2013-10-01 04:40:00 PDT 0
29 2013-10-01 04:40:00 PDT--2013-10-01 04:50:00 PDT 0
30 2013-10-01 04:50:00 PDT--2013-10-01 05:00:00 PDT 0
31 2013-10-01 05:00:00 PDT--2013-10-01 05:10:00 PDT 0
32 2013-10-01 05:10:00 PDT--2013-10-01 05:20:00 PDT 0
33 2013-10-01 05:20:00 PDT--2013-10-01 05:30:00 PDT 0
34 2013-10-01 05:30:00 PDT--2013-10-01 05:40:00 PDT 2
35 2013-10-01 05:40:00 PDT--2013-10-01 05:50:00 PDT 1
36 2013-10-01 05:50:00 PDT--2013-10-01 06:00:00 PDT 1
37 2013-10-01 06:00:00 PDT--2013-10-01 06:10:00 PDT 1
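Both answers bucket by the full timestamp; since the question ultimately wants 144 time-of-day buckets summed over the whole month, here is a minimal base-R sketch of that last step, assuming one.station$starttime is (or has been converted to) POSIXct:
# label each trip with the start of its 10-minute slot within the day
hh <- as.integer(format(one.station$starttime, "%H"))
mm <- as.integer(format(one.station$starttime, "%M"))
slot <- sprintf("%02d:%02d:00", hh, (mm %/% 10) * 10)
# all 144 possible slots, so empty ones appear with a zero count
all_slots <- sprintf("%02d:%02d:00", rep(0:23, each = 6), seq(0, 50, 10))
output <- data.frame(interval = all_slots,
                     total_trips = as.vector(table(factor(slot, levels = all_slots))))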

How to sort a dataframe by column and get the index?

I want to sort a dataframe in R by a column and add the ranking to a new column.
Specifically, I want to rank the price column in the data.frame below (ascending) for every day. Then, I want to add a column indicating the rank of every hour of the day.
library(dplyr)
prices <- data.frame(time = c("2014-07-01 00:00:00 CEST","2014-07-01 01:00:00 CEST","2014-07-01 02:00:00 CEST","2014-07-01 03:00:00 CEST",
"2014-07-01 04:00:00 CEST","2014-07-01 05:00:00 CEST","2014-07-01 06:00:00 CEST","2014-07-01 07:00:00 CEST",
"2014-07-01 08:00:00 CEST","2014-07-01 09:00:00 CEST","2014-07-01 10:00:00 CEST","2014-07-01 11:00:00 CEST",
"2014-07-01 12:00:00 CEST","2014-07-01 13:00:00 CEST","2014-07-01 14:00:00 CEST","2014-07-01 15:00:00 CEST",
"2014-07-01 16:00:00 CEST","2014-07-01 17:00:00 CEST","2014-07-01 18:00:00 CEST","2014-07-01 19:00:00 CEST",
"2014-07-01 20:00:00 CEST","2014-07-01 21:00:00 CEST","2014-07-01 22:00:00 CEST","2014-07-01 23:00:00 CEST",
"2014-07-02 00:00:00 CEST","2014-07-02 01:00:00 CEST","2014-07-02 02:00:00 CEST","2014-07-02 03:00:00 CEST",
"2014-07-02 04:00:00 CEST","2014-07-02 05:00:00 CEST","2014-07-02 06:00:00 CEST","2014-07-02 07:00:00 CEST",
"2014-07-02 08:00:00 CEST","2014-07-02 09:00:00 CEST","2014-07-02 10:00:00 CEST","2014-07-02 11:00:00 CEST",
"2014-07-02 12:00:00 CEST","2014-07-02 13:00:00 CEST","2014-07-02 14:00:00 CEST","2014-07-02 15:00:00 CEST",
"2014-07-02 16:00:00 CEST","2014-07-02 17:00:00 CEST","2014-07-02 18:00:00 CEST","2014-07-02 19:00:00 CEST",
"2014-07-02 20:00:00 CEST","2014-07-02 21:00:00 CEST","2014-07-02 22:00:00 CEST","2014-07-02 23:00:00 CEST"),
price = c(31.75,30.54,30.10,29.32,25.97,26.90,33.59,41.06,40.99,42.44,40.00,39.94,35.69,36.00,36.00,35.17,34.94,35.18,39.00,
41.92,40.09,38.87,39.38,36.00,30.26,29.29,29.37,25.15,25.81,27.97,31.63,39.91,39.99,39.61,39.13,40.43,38.41,36.96,
36.00,34.95,33.82,36.08,38.59,39.91,39.02,36.90,38.88,32.59))
I am using arrange from dplyr for the sorting, as follows:
prices_sorted <- arrange(prices, format(time, format = "%Y-%m-%d"), price)
Is there a 'clean' way to arrive at the following?
prices_ranked
time price ranking
1 2014-07-01 00:00:00 CEST 31.75 5
2 2014-07-01 01:00:00 CEST 30.54 6
3 2014-07-01 02:00:00 CEST 30.10 4
4 2014-07-01 03:00:00 CEST 29.32 3
5 2014-07-01 04:00:00 CEST 25.97 2
6 2014-07-01 05:00:00 CEST 26.90 1
7 2014-07-01 06:00:00 CEST 33.59 7
8 2014-07-01 07:00:00 CEST 41.06 17
9 2014-07-01 08:00:00 CEST 40.99 16
10 2014-07-01 09:00:00 CEST 42.44 18
11 2014-07-01 10:00:00 CEST 40.00 13
12 2014-07-01 11:00:00 CEST 39.94 14
13 2014-07-01 12:00:00 CEST 35.69 15
14 2014-07-01 13:00:00 CEST 36.00 24
15 2014-07-01 14:00:00 CEST 36.00 22
16 2014-07-01 15:00:00 CEST 35.17 19
17 2014-07-01 16:00:00 CEST 34.94 23
18 2014-07-01 17:00:00 CEST 35.18 12
19 2014-07-01 18:00:00 CEST 39.00 11
20 2014-07-01 19:00:00 CEST 41.92 21
21 2014-07-01 20:00:00 CEST 40.09 9
22 2014-07-01 21:00:00 CEST 38.87 8
23 2014-07-01 22:00:00 CEST 39.38 20
24 2014-07-01 23:00:00 CEST 36.00 10
25 2014-07-02 00:00:00 CEST 30.26 4
26 2014-07-02 01:00:00 CEST 29.29 5
27 2014-07-02 02:00:00 CEST 29.37 6
28 2014-07-02 03:00:00 CEST 25.15 2
29 2014-07-02 04:00:00 CEST 25.81 3
30 2014-07-02 05:00:00 CEST 27.97 1
31 2014-07-02 06:00:00 CEST 31.63 7
32 2014-07-02 07:00:00 CEST 39.91 24
33 2014-07-02 08:00:00 CEST 39.99 17
34 2014-07-02 09:00:00 CEST 39.61 16
35 2014-07-02 10:00:00 CEST 39.13 15
36 2014-07-02 11:00:00 CEST 40.43 18
37 2014-07-02 12:00:00 CEST 38.41 22
38 2014-07-02 13:00:00 CEST 36.96 14
39 2014-07-02 14:00:00 CEST 36.00 13
40 2014-07-02 15:00:00 CEST 34.95 19
41 2014-07-02 16:00:00 CEST 33.82 23
42 2014-07-02 17:00:00 CEST 36.08 21
43 2014-07-02 18:00:00 CEST 38.59 11
44 2014-07-02 19:00:00 CEST 39.91 10
45 2014-07-02 20:00:00 CEST 39.02 8
46 2014-07-02 21:00:00 CEST 36.90 20
47 2014-07-02 22:00:00 CEST 38.88 9
48 2014-07-02 23:00:00 CEST 32.59 12
I was a little unclear on what order you wanted things in, but is this what you were looking for? Updated to rank by date (I added some additional data so you can see that).
library(data.table)
prices <- data.table(time = c("2014-07-01 00:00:00 CEST", "2014-07-01 01:00:00 CEST", "2014-07-01 02:00:00 CEST",
                              "2014-07-01 03:00:00 CEST", "2014-07-01 04:00:00 CEST",
                              "2015-07-01 00:00:00 CEST", "2015-07-01 01:00:00 CEST", "2015-07-01 02:00:00 CEST",
                              "2015-07-01 03:00:00 CEST", "2015-07-01 04:00:00 CEST"),
                     price = c(31.75, 30.54, 30.10, 29.32, 25.97, 31.75, 30.12, 31.10, 39.32, 25.97))
prices <- prices[, "date" := as.Date(time)]
prices.sorted <- prices[order(time), ranking := rank(price, ties.method = 'first'), by = date]
Here is my solution; it uses the base R function sort:
prices %>% mutate(ranking = row_number(sort(price, decreasing = T)))
time price ranking
1 2014-07-01 00:00:00 CEST 31.75 5
2 2014-07-01 01:00:00 CEST 30.54 4
3 2014-07-01 02:00:00 CEST 30.10 3
4 2014-07-01 03:00:00 CEST 29.32 2
5 2014-07-01 04:00:00 CEST 25.97 1
Maybe this:
prices %>% arrange(price) %>% mutate(ranking=min_rank(price)) %>% arrange(time)
# time price ranking
#1 2014-07-01 00:00:00 CEST 31.75 5
#2 2014-07-01 01:00:00 CEST 30.54 4
#3 2014-07-01 02:00:00 CEST 30.10 3
#4 2014-07-01 03:00:00 CEST 29.32 2
#5 2014-07-01 04:00:00 CEST 25.97 1
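For the per-day ranking the question ultimately asks for, a hedged dplyr sketch that groups by calendar day (extracted with as.Date(), which ignores the trailing time and "CEST" label) could look like this:
library(dplyr)
prices %>%
  group_by(day = as.Date(time)) %>%                       # one group per calendar day
  mutate(ranking = rank(price, ties.method = "first")) %>% # ascending rank within the day
  ungroup() %>%
  select(-day)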
