Related
I ran into a problem in R trying to calculate employee's actual working time where I subtract the absence duration from the working time.
For one scheduled day there can be several durations of absence.
Example data frame of one scheduled day:
row_num StartDate EndDate Absence_StartDate Absence_EndDate
1 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:45:00 2019-11-13 16:15:00
2 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:15:00 2019-11-13 14:00:00
3 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:30:00 2019-11-13 16:30:00
4 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:00:00 2019-11-13 15:00:00
You can reproduce the data frame from this line:
df <- data.frame(StartDate = rep("2019-11-13 14:30:00", 4),
EndDate = rep("2019-11-13 18:30:00", 4),
Absence_StartDate = c("2019-11-13 15:45:00", "2019-11-13 08:15:00", "2019-11-13 15:30:00", "2019-11-13 08:00:00"),
Absence_EndDate = c("2019-11-13 16:15:00", "2019-11-13 14:00:00", "2019-11-13 16:30:00", "2019-11-13 15:00:00"))
The main problem I face, is that some of those absence's are overlaping each other, for example row_num 1 and 3. 1st row has no use to me as 3rd row overlaps the first one (-120 minutes from working time). 2nd row doesn't affect the working time, as it's out of Employee's schedule, 4th row overlaps 2nd row and it affects the working time (-15 minutes).
for this example, the working time is 240 minutes and 150 minutes of absence, so the actual working time is 90 minutes.
How could one write a code to calculate the actual working time, given that the StartDate and EndDate can vary (but the same day) and there can be several absences (some may won't even affect the working time at all).
Tried using intervals from lubridate library and using for loop but couldn't achieve the results.
Thank you!
====== UPDATE =======
The code below by "#AnilGoyal" works fine for the most of the time. But, there's some kind of problem which I tried to solve but couldn't do it.
Here's example which doesn't work:
dput(df2)
structure(list(empid = c(1, 1, 1, 1, 1, 1, 1, 1), Date = structure(c(18213,
18213, 18213, 18213, 18213, 18213, 18213, 18213), class = "Date"),
presence_start = structure(c(1573648200, 1573648200, 1573648200,
1573648200, 1573624800, 1573624800, 1573624800, 1573624800
), tzone = "", class = c("POSIXct", "POSIXt")), presence_end = structure(c(1573655400,
1573655400, 1573655400, 1573655400, 1573646400, 1573646400,
1573646400, 1573646400), tzone = "", class = c("POSIXct",
"POSIXt")), emprsn = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), absence_start = structure(list(
sec = c(0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 15L, 30L,
0L, 15L, 15L, 30L, 0L), hour = c(15L, 8L, 14L, 8L, 15L,
8L, 14L, 8L), mday = c(13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L), mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L), year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L), wday = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("EET", "EET", "EET",
"EET", "EET", "EET", "EET", "EET"), gmtoff = c(7200L,
7200L, 7200L, 7200L, 7200L, 7200L, 7200L, 7200L)), tzone = c("",
"EET", "EEST"), class = c("POSIXlt", "POSIXt")), absence_end = structure(list(
sec = c(0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 0L, 30L,
0L, 15L, 0L, 30L, 0L), hour = c(16L, 14L, 16L, 14L, 16L,
14L, 16L, 14L), mday = c(13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L), mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L), year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L), wday = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("EET", "EET", "EET",
"EET", "EET", "EET", "EET", "EET"), gmtoff = c(7200L,
7200L, 7200L, 7200L, 7200L, 7200L, 7200L, 7200L)), tzone = c("",
"EET", "EEST"), class = c("POSIXlt", "POSIXt"))), row.names = c(NA,
-8L), class = "data.frame")
> df2
empid Date presence_start presence_end emprsn absence_start absence_end
1 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00
2 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00
3 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00
4 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00
5 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00
6 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00
7 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00
8 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00
The code below, with some modification to account for several employee's and as later noticed, several Working Time's for one Employee in one day (for example:
from 2019-11-13 8:00 to 2019-11-13 14:00 and
from 2019-11-13 14:30 to 2019-11-13 16:30)
Because of this, each working time will have duplicated absences, but this shouldn't be the problem as far as I know.
with this example I tried to add presence_start to group_by, but total_absence comes as a 0 for both times, I've noticed, that both working times have only one instances of d2=1, maybe it's the problem? The total_absence for this example should be 6 hours and 2 hours.
Since the scenario has been revised, I am proposing alternate tidyverse strategy. Hope this works.
sample data modified a bit so that previous scenario also included.
> dput(df)
structure(list(empid = c("1", "1", "1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2"), presence_start = structure(list(sec = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L), hour = c(18L, 18L,
18L, 18L, 11L, 11L, 11L, 11L, 14L, 14L, 14L, 14L, 14L), mday = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L),
mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L), year = c(119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L,
316L, 316L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST", "IST"),
gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt",
"POSIXt")), presence_end = structure(list(sec = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L), hour = c(20L, 20L, 20L, 20L,
17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 18L), mday = c(13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), mon = c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L),
year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L, 316L, 316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L
), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), class = c("POSIXlt", "POSIXt"
)), absrsn = c("1", "2", "3", "4", "1", "2", "3", "4", "5", "6",
"7", "8", "9"), absence_start = structure(list(sec = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 15L, 30L, 0L,
15L, 15L, 30L, 0L, 45L, 15L, 30L, 0L, 15L), hour = c(15L, 8L,
14L, 8L, 15L, 8L, 14L, 8L, 15L, 8L, 15L, 8L, 18L), mday = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L),
mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L), year = c(119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L,
316L, 316L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST", "IST"),
gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt",
"POSIXt")), absence_end = structure(list(sec = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 0L, 30L, 0L, 15L, 0L,
30L, 0L, 15L, 0L, 30L, 0L, 0L), hour = c(16L, 14L, 16L, 14L,
16L, 14L, 16L, 14L, 16L, 14L, 16L, 15L, 19L), mday = c(13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), mon = c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L),
year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L, 316L, 316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L
), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), class = c("POSIXlt", "POSIXt"
)), date = structure(c(18213, 18213, 18213, 18213, 18213, 18213,
18213, 18213, 18213, 18213, 18213, 18213, 18213), class = "Date")), row.names = c(NA,
-13L), spec = structure(list(cols = list(empid = structure(list(), class = c("collector_double",
"collector")), presence_start = structure(list(), class = c("collector_character",
"collector")), presence_end = structure(list(), class = c("collector_character",
"collector")), absrsn = structure(list(), class = c("collector_double",
"collector")), absence_start = structure(list(), class = c("collector_character",
"collector")), absence_end = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
# A tibble: 13 x 7
empid presence_start presence_end absrsn absence_start absence_end date
<chr> <dttm> <dttm> <chr> <dttm> <dttm> <date>
1 1 2019-11-13 18:00:00 2019-11-13 20:00:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00 2019-11-13
2 1 2019-11-13 18:00:00 2019-11-13 20:00:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00 2019-11-13
3 1 2019-11-13 18:00:00 2019-11-13 20:00:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00 2019-11-13
4 1 2019-11-13 18:00:00 2019-11-13 20:00:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00 2019-11-13
5 1 2019-11-13 11:30:00 2019-11-13 17:30:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00 2019-11-13
6 1 2019-11-13 11:30:00 2019-11-13 17:30:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00 2019-11-13
7 1 2019-11-13 11:30:00 2019-11-13 17:30:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00 2019-11-13
8 1 2019-11-13 11:30:00 2019-11-13 17:30:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00 2019-11-13
9 2 2019-11-13 14:30:00 2019-11-13 18:30:00 5 2019-11-13 15:45:00 2019-11-13 16:15:00 2019-11-13
10 2 2019-11-13 14:30:00 2019-11-13 18:30:00 6 2019-11-13 08:15:00 2019-11-13 14:00:00 2019-11-13
11 2 2019-11-13 14:30:00 2019-11-13 18:30:00 7 2019-11-13 15:30:00 2019-11-13 16:30:00 2019-11-13
12 2 2019-11-13 14:30:00 2019-11-13 18:30:00 8 2019-11-13 08:00:00 2019-11-13 15:00:00 2019-11-13
13 2 2019-11-13 14:30:00 2019-11-13 18:30:00 9 2019-11-13 18:15:00 2019-11-13 19:00:00 2019-11-13
Now follow this approach (after loading both libraries needed
library(tidyverse)
library(lubridate)
#calculating both presence and absence times for each employee for each day
df %>% group_by(empid, date, presence_start) %>%
mutate(absence_start = if_else(absence_start < presence_start, presence_start, absence_start),
absence_end = if_else(absence_end > presence_end, presence_end, absence_end),
absence_end = if_else(absence_end < absence_start, absence_start, absence_end)) %>%
arrange(empid, date, presence_start, absence_start) %>%
mutate(rowid = row_number()) %>%
mutate(absence_start = if_else(rowid >1 & absence_start < lag(absence_end), lag(absence_end), absence_start),
absence_end = if_else(absence_end < absence_start, absence_start, absence_end)) %>%
mutate(presence_end = if_else(rowid >1 & presence_end == lag(presence_end), presence_start, presence_end)) %>%
ungroup() %>%
mutate(presence = difftime(presence_end, presence_start, units = "mins"),
absence = difftime(absence_end, absence_start, units = "mins")) %>%
group_by(empid, date) %>% summarise(sum(presence), sum(absence))
Check the result
# A tibble: 2 x 4
# Groups: empid [2]
empid date `sum(presence)` `sum(absence)`
<chr> <date> <drtn> <drtn>
1 1 2019-11-13 480 mins 270 mins
2 2 2019-11-13 240 mins 105 mins
Manual check- For employee 1 two presence times (row 1-4 & 5-8) (120+360=480 minutes) For emp 2 (rows 9-13) (240 minutes)
absence times for employee 1 (rows 1-4 - 0 mins because all are duplicate; rows 6 & 8 are overlapped - net absence (11:30 to 14:00 - 150 minutes); rows 5 & 7 again overlap (net absence 1430 to 1630 - 120 minutes) - total for emp-1 is 270 minutes. For emp -2 -- rows 8 & 10 overlap (net absence 1430 to 1500 - 30 minutes); rows 9 & 11 overlap again (net 1530 to 1630 - 60 minutes), row 13 exceeding presence (net 1815 to 1830 - 15 minutes) total -105 minutes.
All tally with code result
Good Luck
To this problem, I propose a solution for calculation of total absence time, as follows-
Obviously the data frame will consist of employees therefore I have duplicated the df given, for two employees on a newly created empno field.
sample data used- (Please note that column names have also been tweaked a bit)
> dput(df)
structure(list(empno = c(1, 1, 1, 1, 2, 2, 2, 2), rownum = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), presence_start = c("2019-11-13 14:30:00",
"2019-11-13 14:30:00", "2019-11-13 14:30:00", "2019-11-13 14:30:00",
"2019-11-13 14:30:00", "2019-11-13 14:30:00", "2019-11-13 14:30:00",
"2019-11-13 14:30:00"), presence_end = c("2019-11-13 18:30:00",
"2019-11-13 18:30:00", "2019-11-13 18:30:00", "2019-11-13 18:30:00",
"2019-11-13 18:30:00", "2019-11-13 18:30:00", "2019-11-13 18:30:00",
"2019-11-13 18:30:00"), absence_start = c("2019-11-13 15:45:00",
"2019-11-13 08:15:00", "2019-11-13 15:30:00", "2019-11-13 08:00:00",
"2019-11-13 15:45:00", "2019-11-13 08:15:00", "2019-11-13 15:30:00",
"2019-11-13 08:00:00"), absence_end = c("2019-11-13 16:15:00",
"2019-11-13 14:00:00", "2019-11-13 16:30:00", "2019-11-13 15:00:00",
"2019-11-13 16:15:00", "2019-11-13 14:00:00", "2019-11-13 16:30:00",
"2019-11-13 15:00:00")), row.names = c(NA, -8L), class = "data.frame")
> df
empno rownum presence_start presence_end absence_start absence_end
1 1 1 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:45:00 2019-11-13 16:15:00
2 1 2 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:15:00 2019-11-13 14:00:00
3 1 3 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:30:00 2019-11-13 16:30:00
4 1 4 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:00:00 2019-11-13 15:00:00
5 2 1 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:45:00 2019-11-13 16:15:00
6 2 2 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:15:00 2019-11-13 14:00:00
7 2 3 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:30:00 2019-11-13 16:30:00
8 2 4 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:00:00 2019-11-13 15:00:00
Calculation for absence time needs creation of a few dummy fields as explained below-
library(tidyverse)
library(lubridate)
df2 <- df %>% pivot_longer(cols = -c(rownum, empno), names_to = c(".value", "event"), names_sep = "_") %>%
mutate(presence = ymd_hms(presence),
absence = ymd_hms(absence),
event = factor(event, levels = c("start", "end"), ordered =T),
absence_m = case_when(absence < presence & event == 'start' ~ presence,
absence > presence & event == 'end' ~ presence,
TRUE ~ absence),
dummy = ifelse(event == 'start', 1, -1)) %>%
group_by(empno, rownum) %>%
mutate(absence_m = case_when(event == 'end' & lag(absence_m) > absence_m ~ lag(absence_m),
TRUE ~ absence_m)) %>%
group_by(empno) %>% arrange(empno, absence_m) %>%
mutate(dummy = cumsum(dummy),
d2 = case_when(event == 'end' & lag(absence_m) == absence_m ~ 0,
event == 'start' & lead(absence_m) == absence_m ~0,
event == "start" & dummy == 1 ~ 1,
event == "end" & dummy == 0 ~ 1,
TRUE ~ 0)) %>%
filter(d2 == 1) %>%
mutate(absence_time = case_when(event == 'end' ~ (absence_m - lag(absence_m)),
TRUE ~ 0)) %>%
mutate(absence_time = replace_na(absence_time, 0)) %>%
summarise(total_absence = sum(absence_time))
> df2
# A tibble: 2 x 2
empno total_absence
<dbl> <drtn>
1 1 90 mins
2 2 90 mins
I am leaving calculation of total presence to you, which should not be difficult. Still if you have any doubts feel free to ask.
I regularly use the aggregate function to find means and sums of POSIXlt data per hour and per day. I am trying to use the same function on a new dataset to get hourly averages, but when I apply it, it changes the timestamp.
The data is a data.frame (called "moT") like this:
TS T
1 2016-06-26 10:10:34 19.662
2 2016-06-26 10:40:34 21.091
3 2016-06-26 11:10:34 23.388
4 2016-06-26 11:40:34 24.448
5 2016-06-26 12:10:34 25.513
6 2016-06-26 12:40:34 26.390
7 2016-06-26 01:10:34 27.468
8 2016-06-26 01:40:34 27.567
9 2016-06-26 02:10:34 26.977
10 2016-06-26 02:40:34 25.222
11 2016-06-26 03:10:34 23.100
12 2016-06-26 03:40:34 24.158
13 2016-06-26 04:10:34 21.951
14 2016-06-26 04:40:34 21.473
15 2016-06-26 05:10:34 19.948
16 2016-06-26 05:40:34 19.472
17 2016-06-26 06:10:34 18.806
18 2016-06-26 06:40:34 16.808
19 2016-06-26 07:10:34 15.282
20 2016-06-26 07:40:34 14.517
or as per suggested format:
structure(list(TS = structure(list(sec = c(34, 34, 34, 34, 34,
34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34),
min = c(10L, 40L, 10L, 40L, 10L, 40L, 10L, 40L, 10L, 40L,
10L, 40L, 10L, 40L, 10L, 40L, 10L, 40L, 10L, 40L), hour = c(10L,
10L, 11L, 11L, 12L, 12L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L,
5L, 5L, 6L, 6L, 7L, 7L), mday = c(26L, 26L, 26L, 26L, 26L,
26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L,
26L, 26L, 26L), mon = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), year = c(116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L), wday = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), yday = c(177L, 177L, 177L, 177L, 177L, 177L,
177L, 177L, 177L, 177L, 177L, 177L, 177L, 177L, 177L, 177L,
177L, 177L, 177L, 177L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
zone = c("GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5",
"GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5",
"GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5", "GMT+5"
), gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_)), .Names = c("sec", "min", "hour", "mday", "mon",
"year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt",
"POSIXt"), tzone = "Etc/GMT+5"), T = c(19.662, 21.091, 23.388,
24.448, 25.513, 26.39, 27.468, 27.567, 26.977, 25.222, 23.1,
24.158, 21.951, 21.473, 19.948, 19.472, 18.806, 16.808, 15.282,
14.517)), .Names = c("TS", "T"), row.names = c(NA, 20L), class = "data.frame")
I apply this code to "moT":
dat <- aggregate(moT["T"], format(moT["TS"], "%Y-%m-%d %H"), mean)
I expect this output (for the first five rows):
TS meanT
1 "2016-06-26 10" 20.3765
2 "2016-06-26 11" 23.918
3 "2016-06-26 12" 25.9515
4 "2016-06-26 13" 27.5175
5 "2016-06-26 14" 26.0995
that is what has happened when I have used the same function on other datasets.
but instead it is this:
TS meanT
1 "2016-01-07 00" 14.5650
2 "2016-01-07 01" 14.0380
3 "2016-01-07 02" 13.6540
4 "2016-01-07 03" 13.6540
5 "2016-01-07 04" 13.7500
Why is the date and time changing???
I have tried using POSIXct instead of POSIXlt, have tried reformatting the datetime objects in my csv files, have tried removing the time zone from the POSIXlt object.
I've seen this post
How to calculate average of a variable by hour in R
that would give me the result I want, but requires separating date and time into two columns. I'm happy to do that but I'd like to know why this is happening so I can avoid it in future and know which method to use for which data.
Thanks very much.
I have 9x3 dataframe DATA, where one column has timestamps (DateTime), one prices (Close), and one binary values (FOMCBinary).
I want to add column SignalBinary recording a 1 IF Close < an X value (1126 in this example) AND FOMCBinary > 0 in any of the two rows below, but only if SignalBinary = 0 in the row below (i.e. do not want consecutive 1s).
In the example here I need to record a 1 under SignalBinary only at 14:15:00. My coding attempt is instead recording a 1 at 14:15:00 and at 14:30:00. Should be fairly simple, don't understand why my code is not producing the desired result. How could I get this fixed?
DATA <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0,0, 0, 0, 0, 0), min = c(30L, 15L, 0L, 45L, 30L, 15L, 0L, 45L,30L), hour = c(15L, 15L, 15L, 14L, 14L, 14L, 14L, 13L, 13L),mday = c(27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L), mon = c(0L,0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(116L, 116L, 116L,116L, 116L, 116L, 116L, 116L, 116L), wday = c(3L, 3L, 3L,3L, 3L, 3L, 3L, 3L, 3L), yday = c(26L, 26L, 26L, 26L, 26L,26L, 26L, 26L, 26L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), zone = c("EST", "EST", "EST", "EST", "EST", "EST","EST", "EST", "EST"), gmtoff = c(NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour","mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Close = c(1127.2, 1127.5,1126.9, 1128.3, 1125.4, 1122.7, 1122.8, 1117.3, 1116), FOMCBinary = c(0,0, 0, 0, 0, 0, 1, 0, 0)), .Names = c("DateTime", "Close", "FOMCBinary"), row.names = 2131:2139, class = "data.frame")
Xvalue = 1126
#For comparing lagged or forward rows
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r]) }
DATA$SignalBinary <- ifelse(
DATA$Close < Xvalue & (
rowShift(DATA$FOMCBinary, +1) > 0 |
(rowShift(DATA$FOMCBinary, +2) > 0 & rowShift(DATA$FOMCBinary, +1) == 0))
, 1, 0)
##Note rowShift(DATA$FOMCBinary, +1) is equivalent to DATA$FOMCBinary[seq(nrow(DATA))+1]##
Output for DATA after calculations:
DateTime Close FOMCBinary SignalBinary
2131 2016-01-27 15:30:00 1127.2 0 0
2132 2016-01-27 15:15:00 1127.5 0 0
2133 2016-01-27 15:00:00 1126.9 0 0
2134 2016-01-27 14:45:00 1128.3 0 0
2135 2016-01-27 14:30:00 1125.4 0 1 => UNWANTED 1
2136 2016-01-27 14:15:00 1122.7 0 1
2137 2016-01-27 14:00:00 1122.8 1 0
2138 2016-01-27 13:45:00 1117.3 0 NA
2139 2016-01-27 13:30:00 1116.0 0 NA
Thank you very much.
had a closer look. you wanted to remove the first consecutive 1 in SignalBinary but you didnt sweep through SignalBinary. Here is a rough code
DATA <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0,0, 0, 0, 0, 0), min = c(30L, 15L, 0L, 45L, 30L, 15L, 0L, 45L,30L), hour = c(15L, 15L, 15L, 14L, 14L, 14L, 14L, 13L, 13L),mday = c(27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L), mon = c(0L,0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(116L, 116L, 116L,116L, 116L, 116L, 116L, 116L, 116L), wday = c(3L, 3L, 3L,3L, 3L, 3L, 3L, 3L, 3L), yday = c(26L, 26L, 26L, 26L, 26L,26L, 26L, 26L, 26L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), zone = c("EST", "EST", "EST", "EST", "EST", "EST","EST", "EST", "EST"), gmtoff = c(NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour","mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Close = c(1127.2, 1127.5,1126.9, 1128.3, 1125.4, 1122.7, 1122.8, 1117.3, 1116), FOMCBinary = c(0,0, 0, 0, 0, 0, 1, 0, 0)), .Names = c("DateTime", "Close", "FOMCBinary"), row.names = 2131:2139, class = "data.frame")
Xvalue = 1126
#For comparing lagged or forward rows
rowShift <- function(x, shiftLen = 1) {
r <- (1 + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r]) }
DATA$SignalBinary <- as.numeric(DATA$Close < Xvalue & rowShift(DATA$FOMCBinary, +1) > 0)
DATA$SignalBinary <- c(sapply(1:(nrow(DATA)-1), function(n) {
if (is.na(DATA$SignalBinary[n+1])) return(NA)
if (DATA$SignalBinary[n+1]) return(0)
DATA$SignalBinary[n]
}), tail(DATA$SignalBinary,1))
DATA
anotherDATA <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(30L, 15L, 0L, 45L, 30L, 15L, 0L, 45L,30L),
hour = c(15L, 15L, 15L, 14L, 14L, 14L, 14L, 13L, 13L),mday = c(27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L),
mon = c(0L,0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(116L, 116L, 116L,116L, 116L, 116L, 116L, 116L, 116L),
wday = c(3L, 3L, 3L,3L, 3L, 3L, 3L, 3L, 3L), yday = c(26L, 26L, 26L, 26L, 26L,26L, 26L, 26L, 26L),
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), zone = c("EST", "EST", "EST", "EST", "EST", "EST","EST", "EST", "EST"),
gmtoff = c(NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_)),
.Names = c("sec", "min", "hour","mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")),
Close = c(1127.2, 1127.5,1126.9, 1128.3, 1125.4, 1122.7, 1122.8, 1117.3, 1116),
FOMCBinary = c(0,0, 0, 0, 0, 1, 1, 0, 0)), .Names = c("DateTime", "Close", "FOMCBinary"), row.names = 2131:2139, class = "data.frame")
Xvalue = 1126
#For comparing lagged or forward rows
rowShift <- function(x, shiftLen = 1) {
r <- (1 + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r]) }
anotherDATA$SignalBinary <- as.numeric(anotherDATA$Close < Xvalue & rowShift(anotherDATA$FOMCBinary, +1) > 0)
anotherDATA$SignalBinary <- c(sapply(1:(nrow(anotherDATA)-1), function(n) {
if (is.na(anotherDATA$SignalBinary[n+1])) return(NA)
if (anotherDATA$SignalBinary[n+1]) return(0)
anotherDATA$SignalBinary[n]
}), tail(anotherDATA$SignalBinary,1))
anotherDATA
I've to plot these data:
day temperature
02/01/2012 13:30:00 10
10/01/2012 20:30:00 8
15/01/2012 13:30:00 12
25/01/2012 20:30:00 6
02/02/2012 13:30:00 5
10/02/2012 20:30:00 3
15/02/2012 13:30:00 6
25/02/2012 20:30:00 -1
02/03/2012 13:30:00 4
10/03/2012 20:30:00 -2
15/03/2012 13:30:00 7
25/03/2012 20:30:00 1
in the x-axis I want to label only the month and the day (e.g. Jan 02 ). How can I do this using the command plot() and axis()?
First, you will need to put your date text into a dtae class (e.g. as.POSIXct):
df <- structure(list(day = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), min = c(30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L), hour = c(13L, 20L, 13L, 20L, 13L, 20L,
13L, 20L, 13L, 20L, 13L, 20L), mday = c(2L, 10L, 15L, 25L, 2L,
10L, 15L, 25L, 2L, 10L, 15L, 25L), mon = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), year = c(112L, 112L, 112L, 112L,
112L, 112L, 112L, 112L, 112L, 112L, 112L, 112L), wday = c(1L,
2L, 0L, 3L, 4L, 5L, 3L, 6L, 5L, 6L, 4L, 0L), yday = c(1L, 9L,
14L, 24L, 32L, 40L, 45L, 55L, 61L, 69L, 74L, 84L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt")), temperature = c(10L, 8L,
12L, 6L, 5L, 3L, 6L, -1L, 4L, -2L, 7L, 1L)), .Names = c("day",
"temperature"), row.names = c(NA, -12L), class = "data.frame")
df
df$day <- as.POSIXct(df$day, format="%d/%m/%Y %H:%M:%S")
Your dates should now plot correctly. Don't apply the x-axis, by using the argument xaxt="n". Afterwards, you can create a sequence of dates where you would like your axis labeled, and apply this with axis.POSIXct:
plot(df$day, df$temperature, t="l", ylab="Temperature", xlab="Date", xaxt="n")
SEQ <- seq(min(df$day), max(df$day), by="months")
axis.POSIXct(SEQ, at=SEQ, side=1, format="%b %Y")
Similarly, to get a daily axis, simply modify the SEQ and axis.POSIXct code accordingly. For example, you may try:
plot(df$day, df$temperature, t="l", ylab="Temperature", xlab="Date", xaxt="n")
SEQ <- seq(min(df$day), max(df$day), by="days")
axis.POSIXct(SEQ, at=SEQ, side=1, format="%b %d")
x
structure(list(Date = structure(c(15358, 15359, 15362, 15363,
15364, 15365), class = "Date"), EndTime1 = structure(list(sec = c(0,
0, 0, 0, 0, 0), min = c(45L, 25L, 7L, 19L, 5L, 23L), hour = c(5L,
8L, 3L, 4L, 4L, 3L), mday = c(18L, 18L, 18L, 18L, 18L, 18L),
mon = c(0L, 0L, 0L, 0L, 0L, 0L), year = c(113L, 113L, 113L,
113L, 113L, 113L), wday = c(5L, 5L, 5L, 5L, 5L, 5L), yday = c(17L,
17L, 17L, 17L, 17L, 17L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L
)), .Names = c("sec", "min", "hour", "mday", "mon", "year",
"wday", "yday", "isdst"), class = c("POSIXlt", "POSIXt")), EndTime2 = structure(list(
sec = c(0, 0, 0, 0, 0, 0), min = c(45L, 41L, 11L, 27L, 19L,
34L), hour = c(7L, 15L, 5L, 7L, 8L, 5L), mday = c(18L, 18L,
18L, 18L, 18L, 18L), mon = c(0L, 0L, 0L, 0L, 0L, 0L), year = c(113L,
113L, 113L, 113L, 113L, 113L), wday = c(5L, 5L, 5L, 5L, 5L,
5L), yday = c(17L, 17L, 17L, 17L, 17L, 17L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L)), .Names = c("sec", "min", "hour", "mday",
"mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt",
"POSIXt")), EndTime3 = structure(list(sec = c(0, 0, 0, 0, 0,
0), min = c(7L, 59L, 30L, 48L, 46L, 58L), hour = c(8L, 15L, 5L,
7L, 8L, 5L), mday = c(18L, 18L, 18L, 18L, 18L, 18L), mon = c(0L,
0L, 0L, 0L, 0L, 0L), year = c(113L, 113L, 113L, 113L, 113L, 113L
), wday = c(5L, 5L, 5L, 5L, 5L, 5L), yday = c(17L, 17L, 17L,
17L, 17L, 17L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"))), .Names = c("Date", "EndTime1",
"EndTime2", "EndTime3"), row.names = c(NA, 6L), class = "data.frame")
\n
y_limits = as.POSIXct(c(strptime("00:00", "%H:%M"), strptime("23:29", "%H:%M")))
y_breaks = seq(from=strptime("00:00", "%H:%M"),
to=strptime("23:29", "%H:%M"), by="2 hours")
y_labels = format(y_breaks, "%H:%M")
s<-as.POSIXlt("09:00", format="%H:%M")
ggplot(x, aes(Date, EndTime1, group=1, colour="Team1")) + geom_line() + scale_y_datetime(limits=y_limits, breaks=y_breaks, labels=y_labels) + geom_line(aes(Date, EndTime2, colour="Team2")) + geom_line(aes(Date, EndTime3, colour="Team3")) + geom_hline(yintercept=s, colour="red")
I like to draw a geom_line at s but not working. Getting error as:
Error : Invalid intercept type: should be a numeric vector, a function, or a name of a function
Error in if (nrow(layer_data) == 0) return() : argument is of length zero
Any ideas How I can draw a vertical line at a given s?
s<-as.POSIXct(c("09:00"), format="%H:%M")
s<-as.numeric(s)